eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet
-------------------------------------------------------------------------------------------------

                 Key: LUCENE-2279
                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: thushara wijeratna

Passing a Set<String> to a StopFilter instead of a CharArraySet results in a very slow filter.

This is because Analyzer.tokenStream() is called for each document, which ends up constructing the StopFilter (if one is used). If a regular Set<String> is passed to the StopFilter, all the elements of the set are copied into a new CharArraySet, as we can see in its constructor:

  public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
  {
    super(input);
    if (stopWords instanceof CharArraySet) {
      this.stopWords = (CharArraySet)stopWords;
    } else {
      this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
      this.stopWords.addAll(stopWords);
    }
    this.enablePositionIncrements = enablePositionIncrements;
    init();
  }

I feel we should make the StopFilter signature specific, i.e. have it take a CharArraySet rather than a Set, and there should be a JavaDoc warning on the other variants of the StopFilter constructor, as they all result in a copy for each invocation of Analyzer.tokenStream().
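
To make the cost concrete, here is a minimal sketch of the pattern described above. It uses only the JDK, not the Lucene API: SketchCharSet and SketchStopFilter are hypothetical stand-ins for CharArraySet and StopFilter, and the copy counter exists purely for illustration. The sketch assumes one filter is constructed per document, as happens when Analyzer.tokenStream() is called for each document.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Counts how many full set copies happen when one filter is built per document. */
final class StopSetDemo {
    static int copiesFor(Set<String> stopWords, int documents) {
        int before = SketchCharSet.copies;
        for (int i = 0; i < documents; i++) {
            // Stands in for one Analyzer.tokenStream() call per document.
            new SketchStopFilter(stopWords);
        }
        return SketchCharSet.copies - before;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "the", "of"));
        System.out.println(copiesFor(plain, 1000));    // 1000: one copy per filter

        Set<String> prebuilt = new SketchCharSet(plain); // copy once, up front
        System.out.println(copiesFor(prebuilt, 1000)); // 0: the set is reused
    }
}

/** Hypothetical stand-in for CharArraySet; records every full copy it makes. */
final class SketchCharSet extends HashSet<String> {
    static int copies = 0; // illustration only, not part of any real API

    SketchCharSet(Collection<String> source) {
        super(source); // O(n) copy of every element
        copies++;
    }
}

/** Hypothetical stand-in for StopFilter, mirroring the instanceof check in its constructor. */
final class SketchStopFilter {
    private final SketchCharSet stopWords;

    SketchStopFilter(Set<String> stopWords) {
        if (stopWords instanceof SketchCharSet) {
            this.stopWords = (SketchCharSet) stopWords;    // reuse, no copy
        } else {
            this.stopWords = new SketchCharSet(stopWords); // copy on every construction
        }
    }

    boolean isStopWord(String term) {
        return stopWords.contains(term);
    }
}
```

With a plain Set the work grows with (number of documents) x (size of the stop set), while a set that is converted once and then passed to every filter costs a single copy. In Lucene itself the equivalent workaround would be to build the CharArraySet once, as the constructor above does internally, and hand that same instance to each StopFilter.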