[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005798#comment-17005798 ]
Steven Rowe commented on LUCENE-9112:
-------------------------------------

Your unit test depends on a test model created with very little training data (< 100 sentences; see {{opennlp/src/tools/test-model-data/tokenizer.txt}}), so it's not at all surprising that you see weird behavior. I would not consider this indicative of a bug in Lucene's OpenNLP support.

I think you should open an OPENNLP issue for this problem, but it's likely that the most you'll get from them is a pointer to the training data they used to create the model they publish.

The most likely outcome is that you will have to create a training set that performs better against the data you see, and then create a model from that. If you can do that in a way that is shareable with other OpenNLP users, I'm sure they would be interested in your contribution.

> OpenNLP tokenizer is fooled by text containing spurious punctuation
> -------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious
> punctuation such as having triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' becomes the
> token
> # much further down the text, a seemingly unrelated token is then suddenly
> split up, in my example (see attached unit test) the name 'Baron' is split
> into 'Baro' and 'n', this is the real problem
> The problems never seem to occur when using small texts in unit tests but they
> certainly do in real world examples. Depending on how many 'spurious' dots
> there are, a completely different term can become split, or the same term in
> just a different location.
> I am not too sure if this is actually a problem in the Lucene code, but it is
> a problem and I have a Lucene unit test proving the problem.
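
One way to start on the larger training set suggested in the comment above is to generate training lines that cover the problem cases, such as a sentence followed by trailing dots. OpenNLP's tokenizer training format has one sentence per line, with tokens separated either by whitespace or by a {{<SPLIT>}} tag where they touch in the raw text. The sketch below is only an illustration: the class name {{TokenizerTrainingLine}}, the {{format}} helper, and the choice to treat "..." as a single token are all assumptions, not part of Lucene or OpenNLP.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Lucene or OpenNLP): turns raw text plus the
// desired tokens into one line of OpenNLP tokenizer training data.
public class TokenizerTrainingLine {

    /**
     * Tokens that touch in the raw text are joined with <SPLIT>;
     * tokens separated by whitespace are joined with a single space.
     * The tokens must appear in the text in the given order.
     */
    public static String format(String text, List<String> tokens) {
        StringBuilder line = new StringBuilder();
        int pos = 0; // offset in the raw text just past the previous token
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            int start = text.indexOf(token, pos);
            if (start < 0) {
                throw new IllegalArgumentException("Token not found in order: " + token);
            }
            if (i > 0) {
                String gap = text.substring(pos, start);
                // Anything between two tokens must be whitespace only.
                if (!gap.trim().isEmpty()) {
                    throw new IllegalArgumentException("Untokenized text: " + gap);
                }
                line.append(gap.isEmpty() ? "<SPLIT>" : " ");
            }
            line.append(token);
            pos = start + token.length();
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Assumption: the trailing dots should form one token of their own.
        System.out.println(format("A sentence...", Arrays.asList("A", "sentence", "...")));
    }
}
```

Lines produced this way would be collected into a training file, one sentence per line, and a model trained from that file. How many such examples are needed to stop the spurious splits is something only experimentation against real data can tell.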