[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated LUCENE-9112:
----------------------------------
    Attachment:     (was: en-sent.bin)

> SegmentingTokenizerBase splits terms that occupy 1024th positions in text
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch, LUCENE-9112-unittest.patch
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious 
> punctuation, such as triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' (dot included) 
> becomes the token
> # much further down the text, a seemingly unrelated token is suddenly 
> split up; in my example (see the attached unit test) the name 'Baron' is 
> split into 'Baro' and 'n'. This is the real problem.
> The problem never seems to occur with small texts in unit tests, but it 
> certainly does in real-world examples. Depending on the number of 
> 'spurious' dots, a completely different term can get split, or the same 
> term at a different location.
> I am not sure whether this is actually a problem in the Lucene code, but 
> it is a problem, and I have a Lucene unit test proving it.
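> A minimal, self-contained sketch of the suspected mechanism (my 
> assumption, not Lucene's actual code): SegmentingTokenizerBase buffers 
> input in 1024-char windows, and if a refill boundary is not moved back 
> to a safe sentence break (e.g. because stray dots confused sentence 
> detection), a word straddling position 1024 comes out as two tokens:
>
>   import java.util.ArrayList;
>   import java.util.List;
>
>   public class WindowSplitDemo {
>     static final int BUFFERMAX = 1024; // window size used by SegmentingTokenizerBase
>
>     // Tokenize each fixed-size window independently; nothing carries a
>     // word straddling the window boundary over to the next refill.
>     static List<String> tokenize(String text) {
>       List<String> tokens = new ArrayList<>();
>       for (int start = 0; start < text.length(); start += BUFFERMAX) {
>         int end = Math.min(start + BUFFERMAX, text.length());
>         for (String t : text.substring(start, end).split("\\s+")) {
>           if (!t.isEmpty()) tokens.add(t);
>         }
>       }
>       return tokens;
>     }
>
>     public static void main(String[] args) {
>       // Pad the text so that the name 'Baron' straddles position 1024.
>       StringBuilder sb = new StringBuilder();
>       while (sb.length() < BUFFERMAX - 5) sb.append("word ");
>       sb.setLength(BUFFERMAX - 5);
>       sb.append(" Baron rest");
>       List<String> tokens = tokenize(sb.toString());
>       // Prints [word, Baro, n, rest]: 'Baron' is split at the boundary.
>       System.out.println(tokens.subList(tokens.size() - 4, tokens.size()));
>     }
>   }
>
> The real tokenizer tries to snap windows to safe breaks, so the split 
> should only surface when sentence boundaries are detected wrongly, which 
> would explain why the trailing dots move the split to a different term.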



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
