[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set instead of CharArraySet

Robert Muir (JIRA) Tue, 23 Feb 2010 11:42:51 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837410#action_12837410
 ]


Robert Muir commented on LUCENE-2279:
-------------------------------------

reusableTokenStream() is called again for each document. if you don't implement 
it, the default is to defer to tokenStream(), which must create new instances 
of StopFilter, LowerCaseFilter, whatever else you have going on in your 
analyzer.

instead, if you implement reusableTokenStream(), you can keep a reference to 
these things, and just reset() your tokenfilters, and pass the reader to your 
tokenizer's reset(Reader) method.

of course, for this to work, you must implement reset() correctly in any custom 
filters you have: if they keep some state such as accumulated offsets or 
something, then these should be reset back to what they are just as if you 
created a new one.

For an example, see StandardAnalyzer

> eliminate pathological performance on StopFilter when using a Set<String> 
> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>
> passing a Set<Srtring> to a StopFilter instead of a CharArraySet results in a 
> very slow filter.
> this is because for each document, Analyzer.tokenStream() is called, which 
> ends up calling the StopFilter (if used). And if a regular Set<String> is 
> used in the StopFilter all the elements of the set are copied to a 
> CharArraySet, as we can see in it's ctor:
> public StopFilter(boolean enablePositionIncrements, TokenStream input, Set 
> stopWords, boolean ignoreCase)
>   {
>     super(input);
>     if (stopWords instanceof CharArraySet) {
>       this.stopWords = (CharArraySet)stopWords;
>     } else {
>       this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>       this.stopWords.addAll(stopWords);
>     }
>     this.enablePositionIncrements = enablePositionIncrements;
>     init();
>   }
> i feel we should make the StopFilter signature specific, as in specifying 
> CharArraySet vs Set, and there should be a JavaDoc warning on using the other 
> variants of the StopFilter as they all result in a copy for each invocation 
> of Analyzer.tokenStream().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set instead of CharArraySet

Reply via email to