GitHub user conniey commented on the issue:
https://github.com/apache/lucenenet/pull/191
1. Sentence breaking not working when first word of sentence is lower case.
* According to the [sentence boundary
rules](http://www.unicode.org/reports/tr29/#Sentence_Boundary_Rules) that ICU
follows, it is returning the correct sentence breaks. (This is covered in the
section "Do not break after full stop in certain contexts. [See note below.]".)
2. The response to 1 also applies here, where it is breaking prematurely on
newlines.
3. Word breaking is happening on hyphenated words instead of treating them
as a single word, for example, "high-performance" should be considered a single
word, not 2 words.
* According to the Unicode [word break
rules](http://www.unicode.org/reports/tr29/#Word_Boundary_Rules), we are
returning the expected behaviour. The hyphens shown in the text are breaking
hyphens; if we had used a soft hyphen instead, it would not have broken the word.
4. "The ThaiWordBreaker class was added to work-around another
BreakIterator difference from Java - namely that in Java Thai characters were
broken into separate "words" if adjacent to non-Thai characters."
* Unfortunately, this is due to the word breaking rules in ICU: it sees
these as part of the same word because they are all characters.
One way to fix the points above is to use a RuleBasedBreakIterator and
modify the default rules used to create the break iterator. Would that work for
Lucene.NET? I would have to add a native method to icu-dotnet that calls
[ubrk_openRules](http://icu-project.org/apiref/icu4c/ubrk_8h.html#a11826cb21213916c2d91579b673d8949)
so that you can create a BreakIterator from custom rules. The default rules are
here:
* [Sentence
rules](http://source.icu-project.org/repos/icu/tags/release-54-1/icu4c/source/data/brkitr/sent.txt)
* [Word
rules](http://source.icu-project.org/repos/icu/tags/release-54-1/icu4c/source/data/brkitr/word.txt)
* [Blog post on creating custom
rules](http://sujitpal.blogspot.com/2008/05/tokenizing-text-with-icu4js.html)
5. I updated ThaiTokenizer with your code snippet and tested it against
TestNumeralBreakages.
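To make the rule-customisation idea from points 3 and 4 more concrete, here is
a minimal sketch in Python. It is not ICU's rule syntax and does not use
icu-dotnet; the two regular expressions (`DEFAULT_WORD`, `HYPHENATED_WORD`) and
the `words` helper are hypothetical stand-ins that only mimic the behaviour
described above: a hyphen-minus breaks a word, a soft hyphen (U+00AD) does not,
and a modified rule can keep hyphenated words together.

```python
import re

# Assumption: illustrative stand-ins only, not the contents of ICU's word.txt.
# The soft hyphen (U+00AD) stays inside a word; the hyphen-minus (U+002D) does not.
DEFAULT_WORD = re.compile(r"[\w\u00AD]+")

# Customised rule: additionally allow a hyphen-minus between two word parts.
HYPHENATED_WORD = re.compile(r"[\w\u00AD]+(?:-[\w\u00AD]+)*")


def words(text, rule=DEFAULT_WORD):
    """Return the word tokens that `rule` finds in `text`."""
    return rule.findall(text)


# Default rule: the hyphen-minus splits the word in two.
print(words("a high-performance engine"))
# Default rule: a soft hyphen does not split the word.
print(words("a high\u00ADperformance engine"))
# Customised rule: the hyphenated word is kept as one token.
print(words("a high-performance engine", HYPHENATED_WORD))
```

With real ICU rules the same change would be made in the rule source passed to
the rule-based break iterator rather than in a regular expression.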
RE: BreakIterator Dependencies
* I agree that it should be an abstract class and have more functionality
(i.e., moving backwards and forwards), similar to its Java counterpart. I'll see
about writing a PR and submitting it to
[sillsdev/icu-dotnet](https://github.com/sillsdev/icu-dotnet) to see if they
will accept this feature.
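As a rough picture of what such an abstract class could look like, here is a
small Python sketch. The method names (`first`, `last`, `next`, `previous`,
`current`) and the `DONE` sentinel follow Java's `java.text.BreakIterator`;
everything else, including the `WhitespaceBreakIterator` subclass and the
`forward`/`backward` helpers, is hypothetical and is not the icu-dotnet API.

```python
from abc import ABC, abstractmethod

DONE = -1  # sentinel returned when there is no further boundary, as in Java


class BreakIterator(ABC):
    """Hypothetical abstract base: subclasses only supply the boundaries."""

    def __init__(self, text):
        self._boundaries = self.compute_boundaries(text)
        self._index = 0

    @abstractmethod
    def compute_boundaries(self, text):
        """Return sorted boundary offsets, including 0 and len(text)."""

    def first(self):
        self._index = 0
        return self._boundaries[0]

    def last(self):
        self._index = len(self._boundaries) - 1
        return self._boundaries[-1]

    def current(self):
        return self._boundaries[self._index]

    def next(self):
        if self._index + 1 >= len(self._boundaries):
            return DONE
        self._index += 1
        return self._boundaries[self._index]

    def previous(self):
        if self._index == 0:
            return DONE
        self._index -= 1
        return self._boundaries[self._index]


class WhitespaceBreakIterator(BreakIterator):
    """Toy subclass: a boundary wherever text switches between space and non-space."""

    def compute_boundaries(self, text):
        bounds = [0]
        for i in range(1, len(text)):
            if text[i].isspace() != text[i - 1].isspace():
                bounds.append(i)
        if text:
            bounds.append(len(text))
        return bounds


def forward(it):
    """Collect every boundary from the start of the text to the end."""
    out, pos = [], it.first()
    while pos != DONE:
        out.append(pos)
        pos = it.next()
    return out


def backward(it):
    """Collect every boundary from the end of the text to the start."""
    out, pos = [], it.last()
    while pos != DONE:
        out.append(pos)
        pos = it.previous()
    return out


print(forward(WhitespaceBreakIterator("Hello world")))   # [0, 5, 6, 11]
print(backward(WhitespaceBreakIterator("Hello world")))  # [11, 6, 5, 0]
```

Keeping the navigation in the base class means a concrete iterator (word,
sentence, or rule-based) would only need to decide where the boundaries are.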