CommonGramsFilter improvements
------------------------------
Key: LUCENE-2841
URL: https://issues.apache.org/jira/browse/LUCENE-2841
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Priority: Minor
Fix For: 3.1, 4.0
Currently CommonGramsFilter expects users to remove the common words around
which output token ngrams are formed, by appending a StopFilter to the analysis
pipeline. This is inefficient in two ways: captureState() is called on
(trailing) stopwords, and then the whole stream has to be re-examined by the
following StopFilter.
The current ctor should be deprecated, and another ctor added with a boolean
option controlling whether the common words should be output as unigrams.
If common words *are* configured to be output as unigrams, captureState() will
still need to be called, as it is now.
If the common words are *not* configured to be output as unigrams, rather than
calling captureState() for the trailing token in each output token ngram, the
term text, position and offset can be maintained in the same way as they are
now for the leading token: using a System.arraycopy()'d term buffer and a few
ints for positionIncrement and offsets. The user would then no longer need to
append a StopFilter to the analysis chain.
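
The buffer-and-ints bookkeeping described above could look roughly like the
following. This is only a sketch in plain Java, not the filter's actual code:
the class SavedToken and its methods save() and gramWith() are hypothetical
names invented for illustration, and no Lucene classes are used.

```java
/**
 * Hypothetical helper (not part of Lucene): keeps one token's term text and
 * a few ints in plain fields, as a cheaper alternative to a captured state.
 */
final class SavedToken {
    private char[] buffer = new char[16]; // reused across tokens
    private int length;
    private int startOffset;
    private int endOffset;
    private int positionIncrement;

    /** Copies the term text and the int attributes of the given token. */
    void save(char[] term, int termLength, int start, int end, int posInc) {
        if (buffer.length < termLength) {
            buffer = new char[termLength]; // grow only when the term does not fit
        }
        System.arraycopy(term, 0, buffer, 0, termLength);
        length = termLength;
        startOffset = start;
        endOffset = end;
        positionIncrement = posInc;
    }

    /** Returns the gram text "saved + separator + next". */
    String gramWith(char[] next, int nextLength, char separator) {
        char[] gram = new char[length + 1 + nextLength];
        System.arraycopy(buffer, 0, gram, 0, length);
        gram[length] = separator;
        System.arraycopy(next, 0, gram, length + 1, nextLength);
        return new String(gram);
    }

    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }
    int positionIncrement() { return positionIncrement; }
}
```

Because the buffer is reused and only grows when a longer term arrives, saving a
token this way allocates nothing in the common case.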
An example illustrating both possibilities should also be added.
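
As a rough illustration of the two possibilities, the sketch below simulates the
token output on plain strings. It does not use the Lucene API: the class
CommonGramsSketch, the method commonGrams() and the keepCommonUnigrams flag are
hypothetical names, and the "_" separator and the unigram-before-gram ordering
are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical simulation on plain strings; not the Lucene filter itself. */
final class CommonGramsSketch {

    /**
     * Emits each token, plus a two-token gram wherever either neighbour is a
     * common word. Common-word unigrams are emitted only if the flag is set.
     */
    static List<String> commonGrams(List<String> tokens,
                                    Set<String> commonWords,
                                    boolean keepCommonUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            boolean currentIsCommon = commonWords.contains(current);
            if (keepCommonUnigrams || !currentIsCommon) {
                out.add(current);
            }
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (currentIsCommon || commonWords.contains(next)) {
                    out.add(current + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "rain", "in", "spain");
        Set<String> common = Set.of("the", "in");
        // [the, the_rain, rain, rain_in, in, in_spain, spain]
        System.out.println(commonGrams(tokens, common, true));
        // [the_rain, rain, rain_in, in_spain, spain]
        System.out.println(commonGrams(tokens, common, false));
    }
}
```

With the flag set to false the output matches what a trailing stop-word pass
would leave behind, which is the case where the extra StopFilter becomes
unnecessary.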