[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694827#comment-16694827 ]
Christophe Bismuth commented on LUCENE-8548:
--------------------------------------------

I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some mentoring to keep going. Here is what I've done so far:
* Implement a failing Cyrillic test (see the previous comment and the sketch below)
* Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes
* Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class attribute (which follows UAX #29: Unicode Text Segmentation)
* Try to make the Ant {{nori}} module depend on the {{icu}} module so that parts of the {{ICUTokenizer}} logic could be reused (but I failed to tweak the Ant scripts)
* Enable verbose output (see the output below)
* Enable Graphviz output (see the attached picture)
* Step through the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method with a debugger
* Add a breakpoint in the {{DictionaryToken}} constructor to understand how and when tokens are built (I also played with the {{outputUnknownUnigrams}} parameter)

I would need some code or documentation pointers when you have time.

!testCyrillicWord.dot.png!

Tokenizer verbose output is below.

{noformat}
PARSE

  extend @ pos=0 char=м hex=43c
  1 arcs in
  UNKNOWN word len=1 1 wordIDs
    fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) leftID=1793 leftPOS=SL)
      **
    + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0

  backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933
    add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
  freeBefore pos=1

TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]: incToken: return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)

PARSE

  extend @ pos=1 char=o hex=6f
  1 arcs in
  UNKNOWN word len=6 1 wordIDs
    fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 spacePenalty=0) leftID=1793 leftPOS=SL)
      **
    + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0
  no arcs in; skip pos=2
  no arcs in; skip pos=3
  no arcs in; skip pos=4
  no arcs in; skip pos=5
  no arcs in; skip pos=6
  end: 1 nodes

  backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235
    add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
    add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
    add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
    add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
    add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
    add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
  freeBefore pos=7
{noformat}
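For reference, here is a minimal sketch of the failing test mentioned in the first bullet. It assumes the method is added to {{TestKoreanAnalyzer}} (which extends {{BaseTokenStreamTestCase}}); the single-token assertion is the interesting part, and the exact expected string is a guess, since {{KoreanAnalyzer}} lower-cases its output.

{code:java}
// Sketch of the failing test: a mixed Cyrillic/Latin word should ideally
// survive as one token, but the tokenizer currently splits it where the
// script changes (into "м" + "oscow", or unigrams with outputUnknownUnigrams).
public void testCyrillicWord() throws IOException {
  Analyzer analyzer = new KoreanAnalyzer();
  // Desired behaviour: a single (lower-cased) token for the whole word.
  assertAnalyzesTo(analyzer, "Мoscow", new String[] { "мoscow" });
  analyzer.close();
}
{code}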
> Reevaluate scripts boundary break in Nori's tokenizer
> -----------------------------------------------------
>
>                 Key: LUCENE-8548
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8548
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: testCyrillicWord.dot.png
>
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite line up with Unicode character blocks), which leads to weird results for non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don’t (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many more chances for false positives when the tokens are split up like this. In particular, individual numbers and combining diacritics are indexed separately (e.g., in the Cyrillic example above), which can lead to a performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before Nori processes the text. This doesn't solve the Greek, Hebrew, or English cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. Combining diacritics should not trigger token splits. Non-CJK text should be tokenized on spaces and punctuation, not by character type shifts. Apostrophe-like characters should not trigger token splits (though I could see someone disagreeing on this one).
> {noformat}
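To make the "related blocks" and combining-diacritics parts of the suggested fix concrete, here is a small standalone sketch of a script-based boundary check. It uses only {{java.lang.Character}} and is not how Nori decides today; it also deliberately leaves out the "split only on spaces and punctuation" point, which would need more than a pairwise check.

{code:java}
// A standalone illustration (plain JDK, not Nori code) of the kind of
// decision the suggested fix asks for: only treat a position as a token
// boundary when the neighbouring characters really belong to different
// scripts, and never because of combining marks or COMMON characters.
public class ScriptBoundarySketch {

  /** Returns true only when a split between the two code points looks justified. */
  static boolean isScriptBoundary(int before, int after) {
    // Combining diacritics (e.g. U+0300..U+036F) attach to the preceding
    // base character and should never start a new token on their own.
    if (Character.getType(after) == Character.NON_SPACING_MARK) {
      return false;
    }
    Character.UnicodeScript s1 = Character.UnicodeScript.of(before);
    Character.UnicodeScript s2 = Character.UnicodeScript.of(after);
    // COMMON/INHERITED covers apostrophes, digits, and most combining marks.
    if (s1 == Character.UnicodeScript.COMMON || s1 == Character.UnicodeScript.INHERITED
        || s2 == Character.UnicodeScript.COMMON || s2 == Character.UnicodeScript.INHERITED) {
      return false;
    }
    return s1 != s2;
  }

  public static void main(String[] args) {
    // "don't": the apostrophe is COMMON, so no split is reported.
    System.out.println(isScriptBoundary('n', '\''));  // false
    // "εἰμί": ε (Greek block) and ἰ (Greek Extended block) are both script
    // GREEK, so a script-based check keeps them together where a block-based one splits.
    System.out.println(isScriptBoundary('ε', 'ἰ'));   // false
    // "Мoscow": Cyrillic М followed by Latin o is a genuine script change.
    System.out.println(isScriptBoundary('М', 'o'));   // true
  }
}
{code}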