[GitHub] lucenenet issue #191: Migrating Lucene.Net to .NET Core

conniey Mon, 12 Dec 2016 23:25:04 -0800

Github user conniey commented on the issue:

    https://github.com/apache/lucenenet/pull/191
  
    1. Sentence breaking not working when first word of sentence is lower case.
        * According to the Unicode [sentence boundary 
rules](http://www.unicode.org/reports/tr29/#Sentence_Boundary_Rules) that ICU 
follows, it is returning the correct sentence breaks. (This is defined in the 
section "Do not break after full stop in certain contexts. [See note below.]").
    2. The response to 1 also applies to the case where it breaks prematurely 
on new-lines.
    3. Word breaking is happening on hyphenated words instead of treating them 
as a single word, for example, "high-performance" should be considered a single 
word, not 2 words.
        * According to the Unicode [word break 
rules](http://www.unicode.org/reports/tr29/#Word_Boundary_Rules), ICU is 
returning the expected behaviour. The hyphens shown are breaking hyphens 
(U+002D); if we had used a soft hyphen (U+00AD) instead, it would not have 
broken the word.
    4. "The ThaiWordBreaker class was added to work-around another 
BreakIterator difference from Java - namely that in Java Thai characters were 
broken into separate "words" if adjacent to non-Thai characters."
        * Unfortunately, this is due to the word breaking rules in ICU, which 
treat these as part of the same word because they are all characters.
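
To check points like 3 on the Java side, a minimal sketch along these lines can list the segments a word break iterator reports. It uses the JDK's `java.text.BreakIterator`, not ICU or icu-dotnet, so its results may differ from what ICU returns. The class name `WordBreakDemo` and the helper `segments` are hypothetical names made up for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakDemo {
    // Splits text at every boundary reported by the JDK word BreakIterator.
    // Hypothetical helper for illustration; not part of Lucene.NET or icu-dotnet.
    static List<String> segments(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String hard = "high-performance";       // hyphen-minus, U+002D
        String soft = "high\u00ADperformance";  // soft hyphen, U+00AD
        // Print the segment counts so the two inputs can be compared.
        System.out.println(segments(hard).size() + " segment(s) with hyphen-minus");
        System.out.println(segments(soft).size() + " segment(s) with soft hyphen");
    }
}
```

Running the same two inputs through the ICU-based iterator would show where the behaviours diverge.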
    
    One way to fix the points above is to use a RuleBasedBreakIterator and 
modify the default rules for creating a break iterator.  Would that work for 
Lucene.NET? I would have to add a native method to icu-dotnet that calls 
[ubrk_openRules](http://icu-project.org/apiref/icu4c/ubrk_8h.html#a11826cb21213916c2d91579b673d8949)
 to let you create such a BreakIterator.  The default rules are here:
    
    * [Sentence 
rules](http://source.icu-project.org/repos/icu/tags/release-54-1/icu4c/source/data/brkitr/sent.txt)
    * [Word 
rules](http://source.icu-project.org/repos/icu/tags/release-54-1/icu4c/source/data/brkitr/word.txt)
    * [Blog post on creating custom 
rules](http://sujitpal.blogspot.com/2008/05/tokenizing-text-with-icu4js.html)
    
    5. I updated ThaiTokenizer with your code snippet and tested it against 
TestNumeralBreakages.
    
    RE: BreakIterator Dependencies
    
    * I agree that it should be an abstract class and have more functionality 
(i.e., moving backwards and forwards), similar to its Java counterpart.  I'll 
see about writing a PR and submitting it to 
[sillsdev/icu-dotnet](https://github.com/sillsdev/icu-dotnet) to see if they 
will accept this feature.



