[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

Simon Willnauer (JIRA) Tue, 23 Feb 2010 13:57:52 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837465#action_12837465
 ]


Simon Willnauer commented on LUCENE-2279:
-----------------------------------------

I don't consider this an issue at all. Every analyzer that creates StopFilter 
instances uses a CharArraySet internally, and if you write your own you should do 
so too. The JavaDoc of StopFilter clearly describes what happens if you pass 
a plain Set instead of a CharArraySet.
You should really consider using reusableTokenStream AND a CharArraySet instance. 
Have a look at how all the analyzers in the current trunk handle 
stopwords. Once 3.1 is out you will also be able to subclass 
ReusableAnalyzerBase, which enables reusableTokenStream on the fly in 99% of 
the cases.
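
To make the cost concrete, here is a minimal, self-contained sketch. It does 
not use Lucene at all: FastStopSet and SketchStopFilter are hypothetical 
stand-ins for CharArraySet and StopFilter, and they only mirror the instanceof 
check in the constructor quoted in the issue description. It counts how many 
set copies happen when one filter is created per document.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Illustration only: none of these classes are Lucene classes.
class StopSetDemo {

    /** Hypothetical stand-in for CharArraySet: a specialized set type. */
    static final class FastStopSet extends HashSet<String> {
        static int copies = 0; // how many times a source collection was copied

        FastStopSet(Collection<String> source) {
            super(source);     // copies every element of the source
            copies++;
        }
    }

    /** Hypothetical stand-in for StopFilter, mirroring its instanceof check. */
    static final class SketchStopFilter {
        private final FastStopSet stopWords;

        SketchStopFilter(Set<String> stopWords) {
            if (stopWords instanceof FastStopSet) {
                this.stopWords = (FastStopSet) stopWords;    // reused, no copy
            } else {
                this.stopWords = new FastStopSet(stopWords); // copied per filter
            }
        }

        boolean isStopWord(String term) {
            return stopWords.contains(term);
        }
    }

    /** Creates one filter per "document" and returns the number of copies made. */
    static int copiesFor(Set<String> stopWords, int documents) {
        int before = FastStopSet.copies;
        for (int i = 0; i < documents; i++) {
            new SketchStopFilter(stopWords);
        }
        return FastStopSet.copies - before;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "an", "the"));
        FastStopSet shared = new FastStopSet(plain); // built once, up front

        System.out.println(copiesFor(plain, 1000));  // prints 1000
        System.out.println(copiesFor(shared, 1000)); // prints 0
    }
}
```

With the specialized set built once and passed to every filter, the per-document 
copy disappears, which is what the existing analyzers already do.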

I'm inclined to close this issue, though. Robert?



> eliminate pathological performance on StopFilter when using a Set<String> 
> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>
> Passing a Set<String> to a StopFilter instead of a CharArraySet results in a 
> very slow filter.
> This is because Analyzer.tokenStream() is called for each document, which 
> ends up constructing a new StopFilter (if used). If a regular Set<String> is 
> used in the StopFilter, all the elements of the set are copied to a 
> CharArraySet, as we can see in its ctor:
> public StopFilter(boolean enablePositionIncrements, TokenStream input,
>                   Set stopWords, boolean ignoreCase) {
>   super(input);
>   if (stopWords instanceof CharArraySet) {
>     // already a CharArraySet: used as-is, no copy
>     this.stopWords = (CharArraySet) stopWords;
>   } else {
>     // any other Set: copied into a new CharArraySet on every construction
>     this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>     this.stopWords.addAll(stopWords);
>   }
>   this.enablePositionIncrements = enablePositionIncrements;
>   init();
> }
> I feel we should make the StopFilter signature specific, i.e. have it take a 
> CharArraySet rather than a Set, and there should be a JavaDoc warning on the 
> other variants of the StopFilter constructor, as they all result in a copy 
> for each invocation of Analyzer.tokenStream().

