[jira] Commented: (LUCENE-2014) position increment bug: smartcn

Uwe Schindler (JIRA) Thu, 29 Oct 2009 02:09:25 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771344#action_12771344
 ]


Uwe Schindler commented on LUCENE-2014:
---------------------------------------

bq. i worry about this clearAttributes solution though, perhaps WordTokenFilter 
should use captureState/restoreState api, like the ThaiWordFilter does (very 
similar analyzer).
bq. If i use capture/restoreState this should not be a problem right?

I think the filter is fine as it is at the moment. The only problem is the 
missing clearAttributes() call when you produce more than one token out of one 
big one (the sentence). There is no need for captureState, because the tokens 
are new ones. If somebody adds custom attributes, they would be cleared, but 
wouldn't that be correct?
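
To illustrate why the missing clear matters, here is a minimal, self-contained 
sketch. It does not use Lucene's classes: Attrs is a hypothetical stand-in for 
the shared, reused attribute state, and increments() is a hypothetical helper 
that plays both the word splitter and a stop filter. It assumes the stop 
filter adds the skipped positions to whatever increment is currently stored, 
so a producer that never resets the increment lets stale values compound.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class ClearAttributesSketch {

    /** Toy stand-in for shared, reused attribute state (not Lucene's AttributeSource). */
    static final class Attrs {
        String term = "";
        int posIncr = 1;

        /** Resets every attribute to its default, as a clear-before-emit step would. */
        void clear() {
            term = "";
            posIncr = 1;
        }
    }

    /**
     * Emits one token per word from a single "sentence" and returns the position
     * increments seen after stop-word removal. The producer only sets the term;
     * it relies on clear() (when enabled) to reset the increment.
     */
    static List<Integer> increments(String[] words, Set<String> stopWords, boolean clearPerToken) {
        Attrs attrs = new Attrs();
        List<Integer> out = new ArrayList<>();
        int skipped = 0;
        for (String word : words) {
            if (clearPerToken) {
                attrs.clear();            // the step that was missing
            }
            attrs.term = word;            // producer never touches posIncr
            if (stopWords.contains(word)) {
                skipped += attrs.posIncr; // assumed stop-filter behaviour: remember the gap
                continue;
            }
            attrs.posIncr += skipped;     // assumed stop-filter behaviour: add gap to current value
            skipped = 0;
            out.add(attrs.posIncr);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"titl", ":", "san", ":", "foo"};
        Set<String> stop = Collections.singleton(":");
        System.out.println("with clear:    " + increments(words, stop, true));
        System.out.println("without clear: " + increments(words, stop, false));
    }
}
```

In this toy model the increments stay at the expected values when the state is 
cleared before each new token, and drift upward when it is not. The real 
analyzer chain is more involved, so take this only as a picture of the 
mechanism, not as a reproduction of the reported value.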

> position increment bug: smartcn
> -------------------------------
>
>                 Key: LUCENE-2014
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2014
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>         Attachments: LUCENE-2014.patch, LUCENE-2014.patch
>
>
> If I use LUCENE_VERSION >= 2.9 with the smart Chinese analyzer, it will crash 
> IndexWriter with any reasonable amount of Chinese text.
> It's especially annoying because it happens in 2.9.1 RC as well.
> This is because the position increments for tokens after stopwords are bogus.
> Here's an example (from the test case), where the position increment should 
> be 2, but is instead 91975314!
> {code}
>   public void testChineseStopWords2() throws Exception {
>     Analyzer ca = new SmartChineseAnalyzer(Version.LUCENE_CURRENT); // will load stopwords
>     String sentence = "Title:San"; // : is a stopword
>     String result[] = { "titl", "san"};
>     int startOffsets[] = { 0, 6 };
>     int endOffsets[] = { 5, 9 };
>     int posIncr[] = { 1, 2 };
>     assertAnalyzesTo(ca, sentence, result, startOffsets, endOffsets, posIncr);
>   }
> {code}
> junit.framework.AssertionFailedError: posIncrement 1 expected:<2> but was:<91975314>
>       at junit.framework.Assert.fail(Assert.java:47)
>       at junit.framework.Assert.failNotEquals(Assert.java:280)
>       at junit.framework.Assert.assertEquals(Assert.java:64)
>       at junit.framework.Assert.assertEquals(Assert.java:198)
>       at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:83)
>       ...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


