[ https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976344#action_12976344 ]
Steven Rowe commented on LUCENE-2841:
-------------------------------------

{quote}
bq. you still aren't ever removing any stopwords, but using this solely to speed up phrase queries by forming bigrams around the common terms.

Isn't ShingleFilter for that case?
{quote}

On the index side: ShingleFilter generates token ngrams for all input tokens, not just those around and including common words, so although it *could* be used to speed up phrase queries, it would be at the expense of a much larger term dictionary.

On the query side: ShingleFilter could be a useful replacement for CommonGramsQueryFilter if you don't have access to the list of words used by CommonGramsFilter on the index side.

> CommonGramsFilter improvements
> ------------------------------
>
> Key: LUCENE-2841
> URL: https://issues.apache.org/jira/browse/LUCENE-2841
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 3.1, 4.0
> Reporter: Steven Rowe
> Priority: Minor
> Fix For: 3.1, 4.0
>
>
> Currently CommonGramsFilter expects users to remove the common words around
> which output token ngrams are formed, by appending a StopFilter to the
> analysis pipeline. This is inefficient in two ways: captureState() is called
> on (trailing) stopwords, and then the whole stream has to be re-examined by
> the following StopFilter.
> The current ctor should be deprecated, and another ctor added with a boolean
> option controlling whether the common words should be output as unigrams.
> If common words *are* configured to be output as unigrams, captureState()
> will still need to be called, as it is now.
> If the common words are *not* configured to be output as unigrams, rather
> than calling captureState() for the trailing token in each output token
> ngram, the term text, position and offset can be maintained in the same way
> as they are now for the leading token: using a System.arraycopy()'d term
> buffer and a few ints for positionIncrement and offsets.
> The user then no longer would need to append a StopFilter to the analysis chain.
> An example illustrating both possibilities should also be added.
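
To make the index-side point from the comment concrete, and to show the two behaviours the proposed boolean option would select between, here is a minimal plain-Java sketch. It does not use Lucene's TokenStream API and is not the actual filter code. The class name CommonGramsSketch, the methods shingles() and commonGrams(), the parameter outputCommonUnigrams, and the "_" separator used for both outputs are all assumptions made for illustration. Position increments and offsets are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonGramsSketch {

    // ShingleFilter-like behaviour: every unigram plus a bigram for
    // every adjacent pair of tokens, common word or not.
    public static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1));
            }
        }
        return out;
    }

    // CommonGrams-like behaviour: a bigram is formed only when at least
    // one token of the pair is a common word. The hypothetical flag
    // outputCommonUnigrams stands in for the boolean ctor option proposed
    // in the issue: when false, common words are not emitted on their
    // own, so no trailing StopFilter pass would be needed.
    public static List<String> commonGrams(List<String> tokens,
                                           Set<String> common,
                                           boolean outputCommonUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String cur = tokens.get(i);
            boolean curCommon = common.contains(cur);
            if (!curCommon || outputCommonUnigrams) {
                out.add(cur);
            }
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (curCommon || common.contains(next)) {
                    out.add(cur + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
        Set<String> common = new HashSet<>(Arrays.asList("the"));

        System.out.println(shingles(tokens));
        System.out.println(commonGrams(tokens, common, true));
        System.out.println(commonGrams(tokens, common, false));
    }
}
```

On this four-token input the shingle-style output has 7 terms, against 5 for common grams with unigrams kept and 4 with common-word unigrams dropped. The gap widens with longer text, which is the term dictionary cost mentioned in the comment.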