[ https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694827#comment-16694827 ]
Christophe Bismuth commented on LUCENE-8548:
--------------------------------------------

I'm hacking around in the {{KoreanTokenizer}} class, but I'll need some mentoring to keep going. Here is what I've done so far:
* Implement a failing Cyrillic test (see the previous comment and the sketch below)
* Locate the {{KoreanAnalyzer}} and {{KoreanTokenizer}} classes
* Locate the {{ICUTokenizer}} and its {{CompositeBreakIterator}} class attribute (which follows UAX #29: Unicode Text Segmentation)
* Try to make the Ant {{nori}} module depend on the {{icu}} module so that parts of the {{ICUTokenizer}} logic could be reused (but I failed to tweak the Ant scripts)
* Enable verbose output (see the output below)
* Enable Graphviz output (see the attached picture)
* Step through the {{org.apache.lucene.analysis.ko.KoreanTokenizer#parse}} method with a debugger
* Add a breakpoint in the {{DictionaryToken}} constructor to understand how and when tokens are built (I also played with the {{outputUnknownUnigrams}} parameter)

I would need some code or documentation pointers when you have time.

!testCyrillicWord.dot.png!

Tokenizer verbose output is below.

{noformat}
PARSE

  extend @ pos=0 char=м hex=43c
  1 arcs in
  UNKNOWN word len=1 1 wordIDs
    fromIDX=0: cost=138 (prevCost=0 wordCost=795 bgCost=138 spacePenalty=0) leftID=1793 leftPOS=SL)
      **
    + cost=933 wordID=36 leftID=1793 leastIDX=0 toPos=1 toPos.idx=0

  backtrace: endPos=1 pos=1; 1 characters; last=0 cost=933
    add token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
  freeBefore pos=1

TEST-TestKoreanAnalyzer.testCyrillicWord-seed#[9AA9487A32EFEB]: incToken: return token=DictionaryToken("м" pos=0 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)

PARSE

  extend @ pos=1 char=o hex=6f
  1 arcs in
  UNKNOWN word len=6 1 wordIDs
    fromIDX=0: cost=-1030 (prevCost=0 wordCost=795 bgCost=-1030 spacePenalty=0) leftID=1793 leftPOS=SL)
      **
    + cost=-235 wordID=30 leftID=1793 leastIDX=0 toPos=7 toPos.idx=0
  no arcs in; skip pos=2
  no arcs in; skip pos=3
  no arcs in; skip pos=4
  no arcs in; skip pos=5
  no arcs in; skip pos=6
  end: 1 nodes

  backtrace: endPos=7 pos=7; 6 characters; last=1 cost=-235
    add token=DictionaryToken("w" pos=6 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
    add token=DictionaryToken("o" pos=5 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
    add token=DictionaryToken("c" pos=4 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
    add token=DictionaryToken("s" pos=3 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
    add token=DictionaryToken("s" pos=2 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
    add token=DictionaryToken("o" pos=1 length=1 posLen=1 type=UNKNOWN wordId=0 leftID=1798)
  freeBefore pos=7
{noformat}
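For reference, here is a minimal sketch of the failing test mentioned in the first bullet. It assumes the method is added to {{TestKoreanAnalyzer}} (which extends {{BaseTokenStreamTestCase}}); the single-token assertion is the interesting part, and the exact expected string is a guess, since {{KoreanAnalyzer}} lower-cases its output.

{code:java}
// Sketch of the failing test: a mixed Cyrillic/Latin word should ideally
// survive as one token, but the tokenizer currently splits it where the
// script changes (into "м" + "oscow", or unigrams with outputUnknownUnigrams).
public void testCyrillicWord() throws IOException {
  Analyzer analyzer = new KoreanAnalyzer();
  // Desired behaviour: a single (lower-cased) token for the whole word.
  assertAnalyzesTo(analyzer, "Мoscow", new String[] { "мoscow" });
  analyzer.close();
}
{code}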
> Reevaluate scripts boundary break in Nori's tokenizer
> -----------------------------------------------------
>
>                 Key: LUCENE-8548
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8548
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: testCyrillicWord.dot.png
>
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite line up with Unicode character blocks), which leads to weird results for non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don’t (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many more chances for false positives when the tokens are split up like this. In particular, individual numbers and combining diacritics are indexed separately (e.g., in the Cyrillic example above), which can lead to a performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before Nori processes the text. This doesn't solve the Greek, Hebrew, or English cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. Combining diacritics should not trigger token splits. Non-CJK text should be tokenized on spaces and punctuation, not by character type shifts. Apostrophe-like characters should not trigger token splits (though I could see someone disagreeing on this one).
> {noformat}
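To make the "related blocks" and combining-diacritics parts of the suggested fix concrete, here is a small standalone sketch of a script-based boundary check. It uses only {{java.lang.Character}} and is not how Nori decides today; it also deliberately leaves out the "split only on spaces and punctuation" point, which would need more than a pairwise check.

{code:java}
// A standalone illustration (plain JDK, not Nori code) of the kind of
// decision the suggested fix asks for: only treat a position as a token
// boundary when the neighbouring characters really belong to different
// scripts, and never because of combining marks or COMMON characters.
public class ScriptBoundarySketch {

  /** Returns true only when a split between the two code points looks justified. */
  static boolean isScriptBoundary(int before, int after) {
    // Combining diacritics (e.g. U+0300..U+036F) attach to the preceding
    // base character and should never start a new token on their own.
    if (Character.getType(after) == Character.NON_SPACING_MARK) {
      return false;
    }
    Character.UnicodeScript s1 = Character.UnicodeScript.of(before);
    Character.UnicodeScript s2 = Character.UnicodeScript.of(after);
    // COMMON/INHERITED covers apostrophes, digits, and most combining marks.
    if (s1 == Character.UnicodeScript.COMMON || s1 == Character.UnicodeScript.INHERITED
        || s2 == Character.UnicodeScript.COMMON || s2 == Character.UnicodeScript.INHERITED) {
      return false;
    }
    return s1 != s2;
  }

  public static void main(String[] args) {
    // "don't": the apostrophe is COMMON, so no split is reported.
    System.out.println(isScriptBoundary('n', '\''));  // false
    // "εἰμί": ε (Greek block) and ἰ (Greek Extended block) are both script
    // GREEK, so a script-based check keeps them together where a block-based one splits.
    System.out.println(isScriptBoundary('ε', 'ἰ'));   // false
    // "Мoscow": Cyrillic М followed by Latin o is a genuine script change.
    System.out.println(isScriptBoundary('М', 'o'));   // true
  }
}
{code}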