eliminate pathological performance on StopFilter when using a Set<String>
instead of CharArraySet
-------------------------------------------------------------------------------------------------
Key: LUCENE-2279
URL: https://issues.apache.org/jira/browse/LUCENE-2279
Project: Lucene - Java
Issue Type: Improvement
Reporter: thushara wijeratna
Passing a Set<String> to a StopFilter instead of a CharArraySet results in a
very slow filter.
This is because Analyzer.tokenStream() is called for each document, which ends
up constructing the StopFilter (if used). If a regular Set<String> is passed to
the StopFilter, all the elements of the set are copied to a new CharArraySet,
as we can see in its constructor:
    public StopFilter(boolean enablePositionIncrements, TokenStream input,
                      Set stopWords, boolean ignoreCase)
    {
        super(input);
        if (stopWords instanceof CharArraySet) {
            this.stopWords = (CharArraySet)stopWords;
        } else {
            this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
            this.stopWords.addAll(stopWords);
        }
        this.enablePositionIncrements = enablePositionIncrements;
        init();
    }
I feel we should make the StopFilter signature specific, i.e. have it take a
CharArraySet rather than a Set, and there should be a Javadoc warning on the
other StopFilter constructor variants, as they all result in a copy for each
invocation of Analyzer.tokenStream().
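
To make the cost concrete, here is a minimal, self-contained sketch that does not use Lucene at all. FastSet and Filter are hypothetical stand-ins for CharArraySet and StopFilter; they only mirror the instanceof check in the constructor quoted above and count how many copies get built. The class name, method names, and document count are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class StopSetReuseSketch {

    /** Hypothetical stand-in for CharArraySet; counts every copy that is built. */
    static final class FastSet extends HashSet<String> {
        private static final long serialVersionUID = 1L;
        static int copiesBuilt = 0;

        FastSet(Set<String> source) {
            super(source);   // copies every element of the source set
            copiesBuilt++;
        }
    }

    /** Hypothetical stand-in for StopFilter, mirroring the instanceof check. */
    static final class Filter {
        final FastSet stopWords;

        Filter(Set<String> stopWords) {
            if (stopWords instanceof FastSet) {
                this.stopWords = (FastSet) stopWords;     // reused as-is, no copy
            } else {
                this.stopWords = new FastSet(stopWords);  // full copy per filter
            }
        }
    }

    /** Builds one Filter per "document" and returns how many copies that caused. */
    static int copiesFor(Set<String> stopWords, int documents) {
        FastSet.copiesBuilt = 0;
        for (int i = 0; i < documents; i++) {
            new Filter(stopWords);
        }
        return FastSet.copiesBuilt;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "the", "of"));

        // A plain Set is copied once per filter, i.e. once per document.
        System.out.println("plain Set copies: " + copiesFor(plain, 1000));

        // Converting once up front and reusing the result avoids the copies.
        FastSet prebuilt = new FastSet(plain);
        System.out.println("prebuilt copies:  " + copiesFor(prebuilt, 1000));
    }
}
```

In this model, the workaround available today is the same as the second call: convert the stop words to the fast set type once, keep it in a field, and pass that same instance to every filter.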