[ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771344#action_12771344 ]
Uwe Schindler edited comment on LUCENE-2014 at 10/29/09 9:11 AM:
-----------------------------------------------------------------

bq. i worry about this clearAttributes solution though, perhaps WordTokenFilter should use the captureState/restoreState api, like the ThaiWordFilter does (very similar analyzer).
bq. If i use capture/restoreState this should not be a problem right?

I think the filter is fine as it is at the moment. The only problem is the missing clearAttributes when you produce more than one token out of one big one (the sentence). There is no need for captureState, because the tokens are new ones. If somebody adds custom attributes, they would be cleared, but wouldn't that be correct?

bq. I guess the only advantage would be that it would preserve any custom attributes or payloads that someone might add after the SentenceTokenizer, but before the WordTokenFilter, propagating them down to the individual words.

Does it make sense to insert a filter between the two? The transition from sentence tokens to word tokens creates totally different tokens, so how could a payload or other custom attribute work correctly here? Normally such payload filters should be inserted after the WordFilter. The problem with capture/restoreState is the additional copy cost for nothing (the *long* sentence token is copied again and again and always reset to the next word).
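The attribute-reuse issue discussed above can be illustrated with a small self-contained sketch (plain Java, no Lucene dependency; the class, field, and method names below are invented stand-ins, not Lucene's real API). Lucene token streams share one set of attribute instances across calls, so a filter that emits several word tokens from a single sentence token must reset that shared state, via clearAttributes(), before populating each new token; otherwise a stale value, such as a leftover position increment, leaks into every emitted token:

```java
import java.util.ArrayList;
import java.util.List;

public class ClearAttributesSketch {
    // Stand-in for Lucene's shared attribute source.
    static class Attrs {
        String term = "";
        int posIncr = 1;
        void clear() { term = ""; posIncr = 1; } // analogous to clearAttributes()
    }

    // Emit one word token per input word, like a word filter splitting a sentence.
    static List<String> tokenize(String sentence, boolean clearPerToken) {
        Attrs attrs = new Attrs();
        attrs.posIncr = 91975314; // stale state left over from an earlier token
        List<String> out = new ArrayList<>();
        for (String word : sentence.split(" ")) {
            if (clearPerToken) attrs.clear(); // the missing call in the bug
            attrs.term = word;
            out.add(attrs.term + "/" + attrs.posIncr);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("titl san", false)); // stale posIncr leaks into both tokens
        System.out.println(tokenize("titl san", true));  // cleared each time: posIncr is 1
    }
}
```

The same reasoning shows why capture/restoreState would be wasteful here: it would copy the whole (long) sentence term buffer for every emitted word, only to immediately overwrite it.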
> position increment bug: smartcn
> -------------------------------
>
>                 Key: LUCENE-2014
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2014
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.0
>
>         Attachments: LUCENE-2014.patch, LUCENE-2014.patch
>
> If i use LUCENE_VERSION >= 2.9 with the smart chinese analyzer, it will crash the indexwriter with any reasonable amount of chinese text. It is especially annoying because it happens in the 2.9.1 RC as well.
> This is because the position increments for tokens after stopwords are bogus. Here's an example (from the test case), where the position increment should be 2, but is instead 91975314!
> {code}
> public void testChineseStopWords2() throws Exception {
>   Analyzer ca = new SmartChineseAnalyzer(Version.LUCENE_CURRENT); /* will load stopwords */
>   String sentence = "Title:San"; // : is a stopword
>   String result[] = { "titl", "san" };
>   int startOffsets[] = { 0, 6 };
>   int endOffsets[] = { 5, 9 };
>   int posIncr[] = { 1, 2 };
>   assertAnalyzesTo(ca, sentence, result, startOffsets, endOffsets, posIncr);
> }
> {code}
> junit.framework.AssertionFailedError: posIncrement 1 expected:<2> but was:<91975314>
>   at junit.framework.Assert.fail(Assert.java:47)
>   at junit.framework.Assert.failNotEquals(Assert.java:280)
>   at junit.framework.Assert.assertEquals(Assert.java:64)
>   at junit.framework.Assert.assertEquals(Assert.java:198)
>   at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:83)
> ...