[jira] [Commented] (LUCENE-9112) OpenNLP tokenizer is fooled by text containing spurious punctuation

2020-01-06 Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008871#comment-17008871 ] Markus Jelsma commented on LUCENE-9112: --- SegmentingTokenizerBase works fine on texts smaller than ...
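A quick way to check this size dependence is to run the same tokenizer over one input that fits the internal chunk buffer and one that does not, and compare the emitted terms and offsets. The sketch below only shows the generic TokenStream consumption loop; how the OpenNLP tokenizer itself is built (model files, factory arguments) is assumed to be set up elsewhere, and the 1024-character figure is an assumption based on SegmentingTokenizerBase's default buffer size.

{code}
// Sketch: dump terms and offsets from an already-configured Tokenizer
// (e.g. an OpenNLP tokenizer built from a sentence and a token model).
// Comparing the output for text shorter vs. longer than the assumed
// 1024-char chunk buffer shows whether chunking changes the result.
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class DumpTokens {
  public static void dump(Tokenizer tokenizer, String text) throws Exception {
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term + " [" + offset.startOffset() + "-" + offset.endOffset() + "]");
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}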

[jira] [Commented] (LUCENE-9112) OpenNLP tokenizer is fooled by text containing spurious punctuation

2019-12-31 Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006105#comment-17006105 ] Markus Jelsma commented on LUCENE-9112: --- There it is: {code} usableLength = findSafeEnd(); ...
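The quoted line sits in SegmentingTokenizerBase's buffer-refill logic: when more input remains after the buffer has been filled, only the prefix up to the last "safe" break character is handed to the sentence BreakIterator, and everything after it is carried over to the next refill. A paraphrased sketch of that cut-point logic, from memory rather than a verbatim copy of the Lucene source:

{code}
// Paraphrased sketch (not verbatim) of how SegmentingTokenizerBase decides
// where a buffered chunk may be cut before it is handed to the sentence
// BreakIterator. Only characters for which isSafeEnd() returns true count
// as unambiguous breaks; the tail after the last one is carried over to the
// next refill of the buffer.
class ChunkCutSketch {
  static final int BUFFERMAX = 1024;  // assumed default chunk size
  char[] buffer = new char[BUFFERMAX];
  int length;                         // number of valid chars currently buffered

  // default "safe" breaks: hard line/paragraph separators only
  boolean isSafeEnd(char ch) {
    switch (ch) {
      case 0x000D: case 0x000A: case 0x0085: case 0x2028: case 0x2029:
        return true;
      default:
        return false;
    }
  }

  // position just after the last unambiguous break, or -1 if there is none
  int findSafeEnd() {
    for (int i = length - 1; i >= 0; i--) {
      if (isSafeEnd(buffer[i])) {
        return i + 1;
      }
    }
    return -1;
  }
}
{code}

Because the default safe characters are hard line breaks only, a long text with no newlines has no safe cut point at all, and the chunking then falls back to cutting at the raw buffer boundary.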

[jira] [Commented] (LUCENE-9112) OpenNLP tokenizer is fooled by text containing spurious punctuation

2019-12-31 Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006089#comment-17006089 ] Markus Jelsma commented on LUCENE-9112: --- I now believe it is a problem in the Lucene code, namely ...
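Whether the fault lies with the OpenNLP model or with the Lucene-side chunking can be separated with the plain OpenNLP API: the sentence model only ever sees what the chunking passes it, so a chunk cut in the wrong place yields different sentence spans than the full text would. A small sketch, with a made-up model file name and sample text:

{code}
// Sketch: run the same sentence model on a full text and on a fragment that
// was cut at a stray period, and compare the detected sentences. The model
// file name and the sample text are illustrative only.
import java.io.FileInputStream;
import java.util.Arrays;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class FragmentVsFullText {
  public static void main(String[] args) throws Exception {
    SentenceModel model;
    try (FileInputStream in = new FileInputStream("en-sent.bin")) {
      model = new SentenceModel(in);
    }
    SentenceDetectorME detector = new SentenceDetectorME(model);

    String fullText = "The train leaves at approx. 10 in the morning. Be on time.";
    String fragment = "The train leaves at approx."; // what a chunk cut at the stray period would pass along

    System.out.println(Arrays.toString(detector.sentDetect(fullText)));
    System.out.println(Arrays.toString(detector.sentDetect(fragment)));
  }
}
{code}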

[jira] [Commented] (LUCENE-9112) OpenNLP tokenizer is fooled by text containing spurious punctuation

2019-12-31 Markus Jelsma (Jira)
[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006076#comment-17006076 ] Markus Jelsma commented on LUCENE-9112: --- Hello [~sarowe], I first spotted the issue with a Dutch ...

[jira] [Commented] (LUCENE-9112) OpenNLP tokenizer is fooled by text containing spurious punctuation

2019-12-30 Steven Rowe (Jira)
[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005798#comment-17005798 ] Steven Rowe commented on LUCENE-9112: --- Your unit test depends on a test model created with very ...
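A sentence model trained on only a handful of lines can make boundary decisions that a production model never would, which is what makes assertions against such a model fragile. Below is a sketch of how a tiny test model is typically trained with the OpenNLP API; the file names and parameters are illustrative, not the ones used by the Lucene test fixtures.

{code}
// Sketch: train a sentence model from a tiny one-sentence-per-line corpus.
// File names are illustrative; the feature cutoff is disabled because the
// corpus is far too small for the default cutoff to leave any features.
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTinySentenceModel {
  public static void main(String[] args) throws Exception {
    ObjectStream<String> lines = new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File("sentences.txt")), StandardCharsets.UTF_8);
    ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

    TrainingParameters params = TrainingParameters.defaultParams();
    params.put(TrainingParameters.CUTOFF_PARAM, "0"); // tiny corpus: keep all features

    SentenceModel model =
        SentenceDetectorME.train("en", samples, new SentenceDetectorFactory(), params);
    samples.close();

    try (FileOutputStream out = new FileOutputStream("test-sent.bin")) {
      model.serialize(out);
    }
  }
}
{code}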