[ https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976341#action_12976341 ]
Jason Rutherglen commented on LUCENE-2841:
------------------------------------------
bq. you still aren't ever removing any stopwords, but using this solely to
speed up phrase queries by forming bigrams around the common terms.
Isn't ShingleFilter for that case?
> CommonGramsFilter improvements
> ------------------------------
>
> Key: LUCENE-2841
> URL: https://issues.apache.org/jira/browse/LUCENE-2841
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 3.1, 4.0
> Reporter: Steven Rowe
> Priority: Minor
> Fix For: 3.1, 4.0
>
>
> Currently CommonGramsFilter expects users to remove the common words around
> which output token ngrams are formed, by appending a StopFilter to the
> analysis pipeline. This is inefficient in two ways: captureState() is called
> on (trailing) stopwords, and then the whole stream has to be re-examined by
> the following StopFilter.
> The current ctor should be deprecated, and another ctor added with a boolean
> option controlling whether the common words should be output as unigrams.
> If common words *are* configured to be output as unigrams, captureState()
> will still need to be called, as it is now.
> If the common words are *not* configured to be output as unigrams, rather
> than calling captureState() for the trailing token in each output token
> ngram, the term text, position and offset can be maintained in the same way
> as they are now for the leading token: using a System.arraycopy()'d term
> buffer and a few ints for positionIncrement and offsets. The user then no
> longer would need to append a StopFilter to the analysis chain.
> An example illustrating both possibilities should also be added.
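
A rough sketch of the two behaviours described above may help. It uses plain
Java collections rather than the Lucene TokenStream API. The class name
CommonGramsSketch, the method commonGrams() and the flag keepCommonUnigrams are
hypothetical names made up for this illustration and are not part of Lucene.
It also assumes "_" as the bigram separator and ignores position increments
and offsets entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: NOT the Lucene CommonGramsFilter, just the token logic.
public class CommonGramsSketch {

    // Emits each token, plus a bigram for every adjacent pair in which at
    // least one token is a common word. When keepCommonUnigrams is false,
    // common words are emitted only as part of bigrams, which is the output
    // a trailing StopFilter would otherwise be needed to produce.
    static List<String> commonGrams(List<String> tokens,
                                    Set<String> common,
                                    boolean keepCommonUnigrams) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            String cur = tokens.get(i);
            boolean curIsCommon = common.contains(cur);
            if (!curIsCommon || keepCommonUnigrams) {
                out.add(cur); // unigram
            }
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (curIsCommon || common.contains(next)) {
                    out.add(cur + "_" + next); // bigram around a common word
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
        Set<String> common = new HashSet<String>(Arrays.asList("the"));

        // Common words kept as unigrams (the current behaviour, before any StopFilter):
        System.out.println(commonGrams(tokens, common, true));
        // prints [the, the_quick, quick, brown, fox]

        // Common words not emitted as unigrams (the proposed option):
        System.out.println(commonGrams(tokens, common, false));
        // prints [the_quick, quick, brown, fox]
    }
}
```

With the flag set to false, the output matches what the filter followed by a
StopFilter would yield for this input, without a second pass over the stream.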
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.