[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

Michael McCandless (JIRA) Wed, 24 Feb 2010 02:34:52 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837736#action_12837736 ]


Michael McCandless commented on LUCENE-2279:
--------------------------------------------

Should we deprecate (and eventually remove) Analyzer.tokenStream?

Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

Or... maybe now is an opportune time to create a separate standalone
analyzers package (a subproject under the Lucene TLP)?  We've broached
this idea in the past, and I think it's compelling. I think
Lucene/Solr/Nutch need to eventually get to this point (where they
share analyzers from a single source), so maybe now is the time.

It'd be a single place where we would pull in all of Lucene's
core/contrib analyzers, plus Solr's analyzers, plus the new analyzers
Robert keeps making ;) Robert's efforts to upgrade Solr's analyzers to 3.0
(currently a big patch waiting on SOLR-1657), plus his various other
pending analyzer bug fixes, could be done in this new analyzers
package.  And we could immediately fix "problems" we have with the
current analyzers API (like this reusable/tokenStream ambiguity).


> eliminate pathological performance on StopFilter when using a Set<String> 
> instead of CharArraySet
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2279
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: thushara wijeratna
>            Priority: Minor
>
> Passing a Set<String> to a StopFilter instead of a CharArraySet results in a 
> very slow filter.
> This is because Analyzer.tokenStream() is called for each document, which 
> ends up constructing a new StopFilter (if one is used). If a regular 
> Set<String> is passed to the StopFilter, all the elements of the set are 
> copied to a CharArraySet, as we can see in its ctor:
> public StopFilter(boolean enablePositionIncrements, TokenStream input,
>                   Set stopWords, boolean ignoreCase) {
>   super(input);
>   if (stopWords instanceof CharArraySet) {
>     this.stopWords = (CharArraySet)stopWords;
>   } else {
>     this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
>     this.stopWords.addAll(stopWords);
>   }
>   this.enablePositionIncrements = enablePositionIncrements;
>   init();
> }
> I feel we should make the StopFilter signature specific, i.e. take a 
> CharArraySet rather than a Set, and there should be a JavaDoc warning on the 
> other variants of the StopFilter ctor, as they all result in a copy for each 
> invocation of Analyzer.tokenStream().
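
To make the cost described in the issue concrete, here is a minimal,
self-contained sketch of the same instanceof-or-copy pattern. It does not
use Lucene: FastStopSet, DemoStopFilter, StopSetCopyDemo and copiesFor are
hypothetical stand-ins invented for illustration (FastStopSet plays the role
of CharArraySet, DemoStopFilter mirrors only the branch in the ctor quoted
above), and the loop stands in for one filter construction per document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopSetCopyDemo {

    /** Hypothetical stand-in for CharArraySet: the set type the filter accepts as-is. */
    static final class FastStopSet extends HashSet<String> {
        FastStopSet(Set<String> words) {
            super(words);
        }
    }

    /** Hypothetical stand-in for StopFilter; mirrors only the instanceof branch of the quoted ctor. */
    static final class DemoStopFilter {
        static int copies = 0; // number of times a set had to be copied

        final FastStopSet stopWords;

        DemoStopFilter(Set<String> stopWords) {
            if (stopWords instanceof FastStopSet) {
                // Already the specialized type: share it, no copy.
                this.stopWords = (FastStopSet) stopWords;
            } else {
                // Any other Set is copied element by element on every construction.
                this.stopWords = new FastStopSet(stopWords);
                copies++;
            }
        }
    }

    /** Builds one filter per simulated document and returns how many copies were made. */
    static int copiesFor(Set<String> stopWords, int documents) {
        DemoStopFilter.copies = 0;
        for (int i = 0; i < documents; i++) {
            new DemoStopFilter(stopWords);
        }
        return DemoStopFilter.copies;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "an", "the"));
        Set<String> converted = new FastStopSet(plain); // convert once, up front

        System.out.println(copiesFor(plain, 1000));     // prints 1000
        System.out.println(copiesFor(converted, 1000)); // prints 0
    }
}
```

Going by the ctor quoted above, the analogous workaround in real code would
be to build the CharArraySet once and pass that same instance to every
StopFilter, so the instanceof branch is taken and nothing is copied per
document.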

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
