[ https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035978#comment-14035978 ]
Itamar Syn-Hershko commented on LUCENE-2841:
--------------------------------------------

Can anyone review and comment?

> CommonGramsFilter improvements
> ------------------------------
>
>                 Key: LUCENE-2841
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2841
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.1, 4.0-ALPHA
>            Reporter: Steve Rowe
>            Priority: Minor
>             Fix For: 4.9, 5.0
>
>         Attachments: commit-6402a55.patch
>
>
> Currently CommonGramsFilter expects users to remove the common words around
> which output token ngrams are formed, by appending a StopFilter to the
> analysis pipeline. This is inefficient in two ways: captureState() is called
> on (trailing) stopwords, and then the whole stream has to be re-examined by
> the following StopFilter.
> The current ctor should be deprecated, and another ctor added with a boolean
> option controlling whether the common words should be output as unigrams.
> If common words *are* configured to be output as unigrams, captureState()
> will still need to be called, as it is now.
> If the common words are *not* configured to be output as unigrams, rather
> than calling captureState() for the trailing token in each output token
> ngram, the term text, position and offset can be maintained in the same way
> as they are now for the leading token: using a System.arraycopy()'d term
> buffer and a few ints for positionIncrement and offsets. The user then would
> no longer need to append a StopFilter to the analysis chain.
> An example illustrating both possibilities should also be added.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
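
For readers who want to see the two behaviours side by side, here is a minimal sketch of the token output the issue describes. It is a simplified model on plain strings, not the Lucene API and not the attached patch: the class name CommonGramsSketch, the method commonGrams and the flag keepCommonUnigrams are all hypothetical, and positions, offsets and captureState() are ignored entirely. It assumes bigrams are joined with "_" and are formed whenever either neighbour is a common word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the proposed option; not Lucene code.
public class CommonGramsSketch {

    // Emits a bigram for every adjacent pair in which at least one token is a
    // common word. Common words themselves are emitted as unigrams only when
    // keepCommonUnigrams is true (assumed name for the proposed boolean option).
    public static List<String> commonGrams(List<String> tokens,
                                           Set<String> commonWords,
                                           boolean keepCommonUnigrams) {
        List<String> out = new ArrayList<>();
        String prev = null;
        for (String tok : tokens) {
            boolean common = commonWords.contains(tok);
            if (prev != null && (common || commonWords.contains(prev))) {
                out.add(prev + "_" + tok); // bigram around a common word
            }
            if (!common || keepCommonUnigrams) {
                out.add(tok); // unigram; dropped for common words when the flag is off
            }
            prev = tok;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
        Set<String> common = new HashSet<>(Arrays.asList("the"));

        // Common words kept as unigrams: [the, the_quick, quick, brown, fox]
        System.out.println(commonGrams(tokens, common, true));

        // Common words not emitted as unigrams: [the_quick, quick, brown, fox]
        System.out.println(commonGrams(tokens, common, false));
    }
}
```

With the flag off, the output no longer contains the bare common word, which is the result a trailing StopFilter is currently needed for.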