[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529710#comment-17529710
 ] 

ASF subversion and git services commented on LUCENE-10493:
--

Commit c28f575b6db1ece837e1cba3fa5526e30135eb5a in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c28f575b6db ]

LUCENE-10493: move n-best logic to analysis-common (#846)



> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> We now have common dictionary interfaces for kuromoji and nori 
> ([LUCENE-10393]). A natural question would be: is it possible to unify the 
> Japanese/Korean tokenizers? 
> The core methods of the two tokenizers are `parse()` and `backtrace()`, which 
> calculate the minimum cost path by Viterbi search. I'd set the goal of this 
> issue as factoring them out into a separate class (in analysis-common) that 
> is shared between JapaneseTokenizer and KoreanTokenizer. 
> The algorithm to solve the minimum cost path itself is of course 
> language-agnostic, so I think it should be theoretically possible; the most 
> difficult part here might be the N-best path calculation, which is supported 
> only by JapaneseTokenizer and not by KoreanTokenizer.
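
For readers unfamiliar with the forward/backtrace split mentioned above, here is a minimal sketch of a minimum cost path search over a toy word lattice. It is not Lucene code: the class `ViterbiSketch`, the method `segment`, the surface-form-to-cost map, and the single constant connection cost are all assumptions made for illustration. The real tokenizers look up connection costs per pair of adjacent words, so they have to track more state per position than this sketch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Toy illustration only; names and cost model are hypothetical, not Lucene's. */
final class ViterbiSketch {

  /** Returns the minimum cost segmentation of text using the given word costs. */
  static List<String> segment(String text, Map<String, Integer> wordCosts, int connectionCost) {
    int n = text.length();
    int[] bestCost = new int[n + 1]; // cheapest known cost to reach each position
    int[] backPos = new int[n + 1];  // start position of the best word ending here
    Arrays.fill(bestCost, Integer.MAX_VALUE);
    bestCost[0] = 0;

    // Forward pass: extend every reachable position with each dictionary word.
    for (int start = 0; start < n; start++) {
      if (bestCost[start] == Integer.MAX_VALUE) {
        continue; // no path reaches this position
      }
      for (int end = start + 1; end <= n; end++) {
        Integer wordCost = wordCosts.get(text.substring(start, end));
        if (wordCost == null) {
          continue;
        }
        // Simplification: one constant cost for connecting to a previous word.
        int cost = bestCost[start] + wordCost + (start == 0 ? 0 : connectionCost);
        if (cost < bestCost[end]) {
          bestCost[end] = cost;
          backPos[end] = start;
        }
      }
    }
    if (bestCost[n] == Integer.MAX_VALUE) {
      throw new IllegalArgumentException("no dictionary path covers the input");
    }

    // Backtrace: follow the back pointers from the end of the text to the start.
    List<String> tokens = new ArrayList<>();
    for (int pos = n; pos > 0; pos = backPos[pos]) {
      tokens.add(text.substring(backPos[pos], pos));
    }
    Collections.reverse(tokens);
    return tokens;
  }

  public static void main(String[] args) {
    Map<String, Integer> costs =
        Map.of("a", 5, "b", 5, "ab", 3, "c", 1, "d", 1, "cd", 4, "abcd", 20);
    System.out.println(segment("abcd", costs, 1)); // prints [ab, c, d]
  }
}
```

Nothing in the two passes depends on the language of the input; only the dictionary and the costs do, which is the property the issue relies on.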



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-25 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527428#comment-17527428
 ] 

ASF subversion and git services commented on LUCENE-10493:
--

Commit c89f8a7ea1e7dfa64ab6d85c22dcbb977f8e09d0 in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c89f8a7ea1e ]

LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and 
nori (#805)



> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
> Time Spent: 3h
> Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-08 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519870#comment-17519870
 ] 

Tomoko Uchida commented on LUCENE-10493:


I opened the main PR: https://github.com/apache/lucene/pull/805.
Please see its description for the design specification and the current 
limitations (these are deferred to future work).
It's still a draft but is already self-contained and works correctly (for me). 
Feedback is welcome.

> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519463#comment-17519463
 ] 

ASF subversion and git services commented on LUCENE-10493:
--

Commit 13630d361e285ee0ef73ad0a4432e81d63db03ce in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=13630d361e2 ]

LUCENE-10493: Unify token Type enum in kuromoji and nori (#801)



> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518852#comment-17518852
 ] 

ASF subversion and git services commented on LUCENE-10493:
--

Commit 9aa8ec9d06a2b271559ec0a93e1405239bbb6af2 in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9aa8ec9d06a ]

LUCENE-10493: Unify TokenInfoFST in kuromoji and nori (#795)



> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518851#comment-17518851
 ] 

ASF subversion and git services commented on LUCENE-10493:
--

Commit 4d2b08554a1908d4ec90ed2cb91bab4f4b29b2d3 in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4d2b08554a1 ]

LUCENE-10493: add 'backWordPos' array to JapaneseTokenizer.Position (#793)



> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-06 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518271#comment-17518271
 ] 

Tomoko Uchida commented on LUCENE-10493:


I'm trying to factor out the core algorithm from the Japanese/Korean tokenizers 
with the above modifications. It is still a very rough patch, but it seems 
to work. 
I plan to merge #793 and #795 after waiting a day or two and then prepare the 
main PR. The next step can't be small, since it has to show the full picture 
(creating a base `Viterbi` class in analysis-common, moving the common logic 
to it, and rewriting the Japanese/Korean tokenizers on top of it), but I will 
try to sort out the interfaces for review.

> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-06 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517988#comment-17517988
 ] 

Tomoko Uchida commented on LUCENE-10493:


I'm starting this with small steps. I'll try to keep the commits 
self-contained, and also as small as possible for safety.
https://github.com/apache/lucene/pull/793

Let me know if there is any feedback, thanks! 

> Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?
> -
>
> Key: LUCENE-10493
> URL: https://issues.apache.org/jira/browse/LUCENE-10493
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h



[jira] [Commented] (LUCENE-10493) Can we unify the viterbi search logic in the tokenizers of kuromoji and nori?

2022-04-01 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515909#comment-17515909
 ] 

Tomoko Uchida commented on LUCENE-10493:


I looked through the `parse()` method of JapaneseTokenizer and KoreanTokenizer 
(I'm inclined to rename the method to `forward()`, which seems better aligned 
with the terminology of the algorithm). I may be too optimistic, but the two 
implementations have not diverged as much as I first thought, except for the 
"unknown" word handling, which is inevitably language-specific.

The N-best path part is also language-agnostic, so I expect that we can 
safely factor it out from JapaneseTokenizer and have a common utility for 
the n-best minimum-cost path calculation. Maybe we will have a 
utility class for the Viterbi search, and possibly a base tokenizer. I'll try 
to make a draft patch.
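
To illustrate the split described above (a shared search with a language-specific "unknown" word step), here is a rough sketch using an abstract hook method. Everything in it is hypothetical: `ToyViterbi`, `unknownWordLength`, the two subclasses, and the fixed unknown-word penalty are invented for this example and do not reflect the interfaces in the actual patch. The dictionary is reduced to a surface-form-to-cost map and connection costs are omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical base class: shared forward/backtrace plus a language-specific hook. */
abstract class ToyViterbi {
  private static final int UNKNOWN_COST = 10; // arbitrary illustrative penalty
  private final Map<String, Integer> wordCosts;

  ToyViterbi(Map<String, Integer> wordCosts) {
    this.wordCosts = wordCosts;
  }

  /** Language-specific part: length (at least 1) of the unknown word starting at pos. */
  protected abstract int unknownWordLength(String text, int pos);

  final List<String> tokenize(String text) {
    int n = text.length();
    int[] bestCost = new int[n + 1];
    int[] backPos = new int[n + 1];
    Arrays.fill(bestCost, Integer.MAX_VALUE);
    bestCost[0] = 0;
    // Shared forward pass.
    for (int start = 0; start < n; start++) {
      if (bestCost[start] == Integer.MAX_VALUE) {
        continue; // unreachable position
      }
      boolean found = false;
      for (int end = start + 1; end <= n; end++) {
        Integer cost = wordCosts.get(text.substring(start, end));
        if (cost != null) {
          relax(bestCost, backPos, start, end, cost);
          found = true;
        }
      }
      if (!found) {
        // No dictionary word starts here: ask the subclass how to group characters.
        int len = Math.max(1, unknownWordLength(text, start));
        relax(bestCost, backPos, start, Math.min(n, start + len), UNKNOWN_COST);
      }
    }
    // Shared backtrace.
    List<String> tokens = new ArrayList<>();
    for (int pos = n; pos > 0; pos = backPos[pos]) {
      tokens.add(text.substring(backPos[pos], pos));
    }
    Collections.reverse(tokens);
    return tokens;
  }

  private static void relax(int[] bestCost, int[] backPos, int start, int end, int cost) {
    int total = bestCost[start] + cost;
    if (total < bestCost[end]) {
      bestCost[end] = total;
      backPos[end] = start;
    }
  }
}

/** Hypothetical strategy: every unknown character becomes its own token. */
final class PerCharUnknown extends ToyViterbi {
  PerCharUnknown(Map<String, Integer> wordCosts) {
    super(wordCosts);
  }

  @Override
  protected int unknownWordLength(String text, int pos) {
    return 1;
  }
}

/** Hypothetical strategy: group a run of characters with the same Unicode category. */
final class SameTypeRunUnknown extends ToyViterbi {
  SameTypeRunUnknown(Map<String, Integer> wordCosts) {
    super(wordCosts);
  }

  @Override
  protected int unknownWordLength(String text, int pos) {
    int type = Character.getType(text.charAt(pos));
    int end = pos + 1;
    while (end < text.length() && Character.getType(text.charAt(end)) == type) {
      end++;
    }
    return end - pos;
  }
}
```

With a dictionary containing only "ab" and "cd", `new PerCharUnknown(dict).tokenize("ab12cd")` yields the tokens ab, 1, 2, cd, while `new SameTypeRunUnknown(dict).tokenize("ab12cd")` yields ab, 12, cd. The forward pass and backtrace are identical in both cases; only the hook differs.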

 
