Sentence boundary bug

Lucas Kot-Zaniewski Tue, 06 Sep 2022 16:44:54 -0700

Hi All!

I think I've found a bug in sentence boundary detection. It is explained in
detail here: https://github.com/apache/lucene/issues/11735


It affects the KeywordRepeatFilter + OpenNLPLemmatizer configuration, which
is apparently common enough to be mentioned directly in the Solr
documentation/examples:
https://solr.apache.org/guide/7_3/language-analysis.html#opennlp-lemmatizer-filter

The bug should be fairly easy to verify with this test:
https://github.com/kotman12/lucene/blob/8ecd42ec88685f47d42a88dd2536e879028af023/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L298

I'd greatly appreciate it if someone could take a look. I'm also
proposing a fix here: https://github.com/apache/lucene/pull/11734 but
naturally I am open to other thoughts on how to approach this.

Thanks,
Luke
