[ 
https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771363#action_12771363
 ] 

Robert Muir commented on LUCENE-2014:
-------------------------------------

Mike, it's up to you.

I was just analyzing some not-ridiculously-large Chinese texts from Gutenberg, 
when I hit the issue.

The problem is that smartcn emits punctuation as individual tokens and then 
filters them out with StopFilter (its stopword list is entirely punctuation).
As a result it relies on StopFilter far more heavily than other analyzers do.
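
To make the expected behavior concrete, here is a rough sketch in plain Java 
of how a stop filter is supposed to fold the increments of removed tokens into 
the next surviving token. This is not Lucene's StopFilter code: the class 
StopFilterSketch, the Tok type, and the filter method are hypothetical names 
made up for illustration. It only assumes each token carries an increment 
relative to the previous token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustration only: a hypothetical stand-in, not Lucene's StopFilter.
public class StopFilterSketch {

    /** A token plus its position increment relative to the previous token. */
    static final class Tok {
        final String term;
        final int posIncr;

        Tok(String term, int posIncr) {
            this.term = term;
            this.posIncr = posIncr;
        }
    }

    /** Drops stopwords and adds each dropped token's increment to the next kept token. */
    static List<Tok> filter(List<Tok> in, Set<String> stopwords) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0; // positions consumed by stopwords since the last kept token
        for (Tok t : in) {
            if (stopwords.contains(t.term)) {
                skipped += t.posIncr;
            } else {
                out.add(new Tok(t.term, t.posIncr + skipped));
                skipped = 0; // start counting again after every kept token
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "Title:San" after stemming, with ':' still present as its own token.
        List<Tok> tokens = Arrays.asList(
                new Tok("titl", 1), new Tok(":", 1), new Tok("san", 1));
        for (Tok t : filter(tokens, Collections.singleton(":"))) {
            System.out.println(t.term + " posIncr=" + t.posIncr);
        }
    }
}
```

For the "Title:San" input from the test case in the issue, this gives an 
increment of 1 for "titl" and 2 for "san", which is what the test expects.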


> position increment bug: smartcn
> -------------------------------
>
>                 Key: LUCENE-2014
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2014
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.0
>
>         Attachments: LUCENE-2014.patch, LUCENE-2014.patch, 
> LUCENE-2014_branch.patch
>
>
> If I use LUCENE_VERSION >= 2.9 with SmartChineseAnalyzer, it will crash 
> IndexWriter with any reasonable amount of Chinese text.
> It's especially annoying because it happens in 2.9.1 RC as well.
> This is because the position increments for tokens after stopwords are bogus.
> Here's an example (from the test case), where the position increment should 
> be 2, but is instead 91975314!
> {code}
>   public void testChineseStopWords2() throws Exception {
>     Analyzer ca = new SmartChineseAnalyzer(Version.LUCENE_CURRENT); /* will load stopwords */
>     String sentence = "Title:San"; // : is a stopword
>     String result[] = { "titl", "san"};
>     int startOffsets[] = { 0, 6 };
>     int endOffsets[] = { 5, 9 };
>     int posIncr[] = { 1, 2 };
>     assertAnalyzesTo(ca, sentence, result, startOffsets, endOffsets, posIncr);
>   }
> {code}
> junit.framework.AssertionFailedError: posIncrement 1 expected:<2> but was:<91975314>
>       at junit.framework.Assert.fail(Assert.java:47)
>       at junit.framework.Assert.failNotEquals(Assert.java:280)
>       at junit.framework.Assert.assertEquals(Assert.java:64)
>       at junit.framework.Assert.assertEquals(Assert.java:198)
>       at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:83)
>       ...
