[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

uday kumar maddigatla (JIRA) Tue, 28 Apr 2009 04:31:57 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703598#action_12703598
 ]


uday kumar maddigatla commented on LUCENE-1488:
-----------------------------------------------

Hi,

I am facing the same problem. My document contains English as well as 
Danish elements.

I tried to use this analyzer, but when I did I got the following error:

Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.lucene.analysis.icu.ICUAnalyzer.tokenStream(ICUAnalyzer.java:74)
        at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:48)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:117)
        at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
        at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:765)
        at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:743)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1918)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1895)
        at com.IndexFiles.indexDocs(IndexFiles.java:87)
        at com.IndexFiles.indexDocs(IndexFiles.java:80)
        at com.IndexFiles.main(IndexFiles.java:57)
Caused by: java.lang.IllegalArgumentException: Error 66063 at line 2 column 17
        at com.ibm.icu.text.RBBIRuleScanner.error(RBBIRuleScanner.java:505)
        at com.ibm.icu.text.RBBIRuleScanner.scanSet(RBBIRuleScanner.java:1047)
        at com.ibm.icu.text.RBBIRuleScanner.doParseActions(RBBIRuleScanner.java:484)
        at com.ibm.icu.text.RBBIRuleScanner.parse(RBBIRuleScanner.java:912)
        at com.ibm.icu.text.RBBIRuleBuilder.compileRules(RBBIRuleBuilder.java:298)
        at com.ibm.icu.text.RuleBasedBreakIterator.compileRules(RuleBasedBreakIterator.java:316)
        at com.ibm.icu.text.RuleBasedBreakIterator.<init>(RuleBasedBreakIterator.java:71)
        at org.apache.lucene.analysis.icu.ICUBreakIterator.<init>(ICUBreakIterator.java:53)
        at org.apache.lucene.analysis.icu.ICUBreakIterator.<init>(ICUBreakIterator.java:45)
        at org.apache.lucene.analysis.icu.ICUTokenizer.<clinit>(ICUTokenizer.java:58)
        ... 12 more

Please help me with this.
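
A note on reading the trace: the IllegalArgumentException is raised while the static initializer of ICUTokenizer (the <clinit> frame) is running, which is why it surfaces as an ExceptionInInitializerError. The sketch below reproduces only that wrapping behaviour with plain JDK classes. RuleHolder and its compile() method are hypothetical stand-ins, not part of Lucene or ICU, and the message string is simply copied from the trace.

```java
public class StaticInitFailureDemo {

    // Hypothetical stand-in for a class that compiles rules in a static initializer.
    static class RuleHolder {
        static final Object RULES = compile();

        private static Object compile() {
            // Simulates a rule parser rejecting its input during class initialization.
            throw new IllegalArgumentException("Error 66063 at line 2 column 17");
        }

        // Calling any static method forces the class to initialize.
        static void touch() { }
    }

    /** Triggers class initialization once and returns the wrapped cause. */
    static Throwable rootCauseOfInitFailure() {
        try {
            RuleHolder.touch();
            return null; // not reached: initialization always fails here
        } catch (ExceptionInInitializerError e) {
            // The JVM wraps the exception thrown by the static initializer.
            return e.getCause();
        }
    }

    public static void main(String[] args) {
        Throwable cause = rootCauseOfInitFailure();
        System.out.println(cause.getClass().getName() + ": " + cause.getMessage());
    }
}
```

In a case like the one above, the "Caused by" section is therefore the part to investigate. Later uses of a class whose initialization failed raise NoClassDefFoundError instead, so only the first failure carries the original cause directly.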

> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>
>                 Key: LUCENE-1488
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: ICUAnalyzer.patch
>
>
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts. This is because it is unaware of the Unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until I looked at the JFlex rules and saw that the codepoint range 
> for most of the Thai block was added to the alphanum specification. Defining 
> the exact codepoint ranges like this for every language could help with the 
> problem, but you'd basically be reimplementing the bounds properties already 
> stated in the Unicode standard.
> In general it looks like this kind of behavior is bad in Lucene even for 
> Latin; for instance, the analyzer will break words around accent marks in 
> decomposed form. While most Latin letter + accent combinations have composed 
> forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, 
> I suppose.)
> I've got a partially tested StandardAnalyzer that uses the ICU rule-based 
> BreakIterator instead of JFlex. Using this method you can define word 
> boundaries according to the Unicode bounds properties. After getting it into 
> good shape I'd be happy to contribute it to contrib, but I wonder if there's 
> a better solution so that out of the box Lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems JFlex does not support the use of 
> properties such as [\p{Word_Break = Extend}], so this is probably the major 
> barrier.
> Thanks,
> Robert
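
To make the point about decomposed forms concrete, here is a minimal sketch using only the JDK's java.text.Normalizer (not Lucene or ICU). It shows that a composed accented letter expands into a base letter plus a combining mark under NFD, and that a combination with no precomposed code point stays as two code units even after NFC. The class name and sample strings are chosen for illustration only.

```java
import java.text.Normalizer;

public class DecomposedFormDemo {

    /** Canonical composition (NFC). */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Canonical decomposition (NFD). */
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // "café" written with the precomposed U+00E9.
        String composed = "caf\u00e9";
        // NFD splits U+00E9 into 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        String decomposed = nfd(composed);
        System.out.println(composed.length() + " -> " + decomposed.length());

        // 'x' plus U+0301 has no precomposed form, so NFC leaves both characters.
        String noComposedForm = "x\u0301";
        System.out.println(nfc(noComposedForm).length());
    }
}
```

A tokenizer that treats the combining mark U+0301 as a non-letter would split the decomposed string in the middle of the word. Normalizing to NFC before analysis helps for the first case but cannot help for the second, which is why boundary rules need to account for combining marks.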

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.