[jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text

Robert Muir (JIRA) Fri, 25 Sep 2009 12:05:48 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robert Muir updated LUCENE-1488:
--------------------------------

    Attachment: LUCENE-1488.patch

This patch completes Lao support (fully implementing 
http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf).

It also fixes a TokenStream bug (not a back-compat issue!) in the bigram 
filter.

I think all language/Unicode features are done. We can get better language 
support from ICU automatically in the future, but I think all languages are 
handled in a reasonable way for now. 

IMHO all that is left is:
* fix docs, improve tests, Java API, RBBI grammars, any bugs, TODOs
* decide if we want to merge this with the collation contrib (I think it might 
be a good idea)
* test various versions of ICU to find out which ones it works with

It works and the tests pass, but some tests are slow (10+ seconds, even after 
I made them faster).
The problem is that these slow tests have found bugs and will help test 
version compatibility, so I like them.


> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>
>                 Key: LUCENE-1488
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.patch, 
> LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt
>
>
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of the Unicode boundary properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until I looked at the JFlex rules and saw that the codepoint range 
> for most of the Thai block was added to the alphanum specification. Defining 
> the exact codepoint ranges like this for every language could help with the 
> problem, but you'd basically be reimplementing the boundary properties 
> already stated in the Unicode standard. 
> In general it looks like this kind of behavior is bad in Lucene even for 
> Latin; for instance, the analyzer will break words around accent marks in 
> decomposed form. While most Latin letter + accent combinations have composed 
> forms in Unicode, some do not. (This is also an issue for 
> ASCIIFoldingFilter, I suppose.) 
> I've got a partially tested StandardAnalyzer that uses ICU's rule-based 
> BreakIterator instead of JFlex. Using this method you can define word 
> boundaries according to the Unicode boundary properties. After getting it 
> into good shape I'd be happy to contribute it for contrib, but I wonder if 
> there's a better solution so that out of the box Lucene will be more 
> friendly to non-ASCII text. Unfortunately it seems JFlex does not support 
> the use of properties such as [\p{Word_Break = Extend}], so this is 
> probably the major barrier.
> Thanks,
> Robert
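
The decomposed-accent problem described in the quoted issue can be sketched 
with JDK classes only. This is an illustration, not code from the patch: 
the class name DecomposedWordDemo and its helper methods are hypothetical, 
naiveWords() is a stand-in for a tokenizer rule that only knows letters and 
digits, and java.text.BreakIterator is used here in place of ICU's 
rule-based BreakIterator, so its boundaries may differ from what the ICU 
rules produce.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class DecomposedWordDemo {

    /** Returns the canonically decomposed (NFD) form: base letter + combining marks. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /**
     * Hypothetical naive tokenizer: a token is a run of letters or digits.
     * Combining marks (category Mn) are neither, so they end the current token.
     */
    static List<String> naiveWords(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    /**
     * Tokenizes with the JDK word BreakIterator (an assumption: a stand-in for
     * the ICU iterator), keeping only segments that start with a letter or digit.
     */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String composed = "r\u00e9sum\u00e9";      // precomposed e-acute
        String decomposed = decompose(composed);   // e followed by U+0301
        System.out.println(naiveWords(composed));   // one token
        System.out.println(naiveWords(decomposed)); // split at each combining mark
        System.out.println(words(decomposed));      // boundary-rule based segmentation
    }
}
```

The naive tokenizer returns a single token for the composed string but 
splits the decomposed one into "re" and "sume", which is the kind of 
breakage the issue describes. A boundary-aware iterator is meant to treat 
combining marks as part of the preceding letter instead.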

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


