eliminate pathological performance on StopFilter when using a Set<String> 
instead of CharArraySet
-------------------------------------------------------------------------------------------------

                 Key: LUCENE-2279
                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: thushara wijeratna


Passing a Set<String> to a StopFilter instead of a CharArraySet results in a 
very slow filter.
This is because Analyzer.tokenStream() is called for each document, which ends 
up constructing the StopFilter (if one is used). If a regular Set<String> is 
passed to the StopFilter, all the elements of the set are copied to a new 
CharArraySet, as we can see in its ctor:

public StopFilter(boolean enablePositionIncrements, TokenStream input,
                  Set stopWords, boolean ignoreCase)
  {
    super(input);
    if (stopWords instanceof CharArraySet) {
      // already a CharArraySet: reused as-is, no copy
      this.stopWords = (CharArraySet)stopWords;
    } else {
      // any other Set: every element is copied, each time a filter is built
      this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
      this.stopWords.addAll(stopWords);
    }
    this.enablePositionIncrements = enablePositionIncrements;
    init();
  }

I feel we should make the StopFilter signature specific, i.e. have it take a 
CharArraySet rather than a Set, and there should be a JavaDoc warning on the 
other variants of the StopFilter constructor, as they all result in a copy on 
each invocation of Analyzer.tokenStream().
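
To illustrate the cost, here is a minimal sketch that uses only the JDK. 
FastStopSet and SketchFilter are hypothetical stand-ins for CharArraySet and 
StopFilter, not Lucene classes; the sketch only mirrors the instanceof branch 
of the ctor above and counts how many copies are made when one filter is 
built per document.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

final class StopSetReuseSketch {

    /** Hypothetical stand-in for CharArraySet: the set type the filter wants. */
    static final class FastStopSet extends HashSet<String> {
        FastStopSet(Collection<String> source) {
            super(source); // copies every element of the source once
        }
    }

    /** Hypothetical stand-in for StopFilter; mirrors only the ctor's branch. */
    static final class SketchFilter {
        static int copies = 0; // number of times a set had to be copied
        final FastStopSet stopWords;

        SketchFilter(Set<String> stopWords) {
            if (stopWords instanceof FastStopSet) {
                // Already the right type: shared, no copy.
                this.stopWords = (FastStopSet) stopWords;
            } else {
                // Any other Set: copied again for every filter instance.
                this.stopWords = new FastStopSet(stopWords);
                copies++;
            }
        }
    }

    /** Builds one filter per "document" and returns how many copies that caused. */
    static int copiesFor(Set<String> stopWords, int documents) {
        SketchFilter.copies = 0;
        for (int i = 0; i < documents; i++) {
            new SketchFilter(stopWords);
        }
        return SketchFilter.copies;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "an", "the"));
        FastStopSet shared = new FastStopSet(plain); // converted once, up front

        System.out.println("plain Set copies:  " + copiesFor(plain, 1000));
        System.out.println("shared set copies: " + copiesFor(shared, 1000));
    }
}
```

With the plain Set, every filter construction pays for a full copy of the 
stop words; with the set converted once and shared, none does. The same idea 
would apply to an analyzer: build the CharArraySet once and pass that same 
instance to each StopFilter.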


