CommonGramsFilter improvements
------------------------------

                 Key: LUCENE-2841
                 URL: https://issues.apache.org/jira/browse/LUCENE-2841
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
    Affects Versions: 3.1, 4.0
            Reporter: Steven Rowe
            Priority: Minor
             Fix For: 3.1, 4.0


Currently CommonGramsFilter expects users to remove the common words around 
which output token ngrams are formed, by appending a StopFilter to the analysis 
pipeline.  This is inefficient in two ways: captureState() is called on 
(trailing) stopwords, and then the whole stream has to be re-examined by the 
following StopFilter.

The current ctor should be deprecated, and another ctor added with a boolean 
option controlling whether the common words should be output as unigrams.

If common words *are* configured to be output as unigrams, captureState() will 
still need to be called, as it is now.

If the common words are *not* configured to be output as unigrams, rather than 
calling captureState() for the trailing token in each output token ngram, the 
term text, position and offset can be maintained in the same way as they are 
now for the leading token: using a System.arraycopy()'d term buffer and a few 
ints for positionIncrement and offsets.  The user would then no longer need to 
append a StopFilter to the analysis chain.
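
A minimal sketch of that bookkeeping, in plain Java with no Lucene dependency.
The class name SavedTokenSketch and all of its members are hypothetical, chosen
only for illustration; the actual filter may organize this state differently.
The point is only that a reusable char[] plus a few ints can stand in for a
captured attribute state:

```java
/** Hypothetical holder for one token's term text, position and offsets. */
final class SavedTokenSketch {
    private char[] buffer = new char[4]; // reused across tokens, grown on demand
    private int length;
    private int positionIncrement;
    private int startOffset;
    private int endOffset;

    /** Copies the first termLength chars of termBuffer and records the ints. */
    void save(char[] termBuffer, int termLength, int posInc, int start, int end) {
        if (buffer.length < termLength) {
            // Grow geometrically so repeated saves rarely reallocate.
            buffer = new char[Math.max(termLength, buffer.length * 2)];
        }
        System.arraycopy(termBuffer, 0, buffer, 0, termLength);
        length = termLength;
        positionIncrement = posInc;
        startOffset = start;
        endOffset = end;
    }

    String term() { return new String(buffer, 0, length); }
    int positionIncrement() { return positionIncrement; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }
}
```

Because the buffer is reused, saving a token allocates nothing once the buffer
has grown to the longest term seen so far.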

An example illustrating both possibilities should also be added.
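
As a rough illustration of the two possibilities, here is a plain-Java
simulation of the intended output. It does not use the Lucene API: the class
CommonGramsSketch, the method commonGrams() and the keepCommonUnigrams flag are
all hypothetical names, the "_" separator is assumed, and positions and offsets
are left out entirely. Only the emitted term text is modeled:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical simulation of common-grams output; not Lucene code. */
final class CommonGramsSketch {

    /**
     * Emits each token, followed by a bigram with the next token whenever
     * either member of the pair is a common word. Common words themselves
     * are emitted as unigrams only if keepCommonUnigrams is true.
     */
    static List<String> commonGrams(List<String> tokens,
                                    Set<String> commonWords,
                                    boolean keepCommonUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            boolean currentIsCommon = commonWords.contains(current);
            if (!currentIsCommon || keepCommonUnigrams) {
                out.add(current);
            }
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (currentIsCommon || commonWords.contains(next)) {
                    out.add(current + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "quick", "brown", "is", "a", "fox");
        Set<String> common = Set.of("the", "is", "a");
        // Common words kept as unigrams:
        // [the, the_quick, quick, brown, brown_is, is, is_a, a, a_fox, fox]
        System.out.println(commonGrams(tokens, common, true));
        // Common words dropped as unigrams:
        // [the_quick, quick, brown, brown_is, is_a, a_fox, fox]
        System.out.println(commonGrams(tokens, common, false));
    }
}
```

The second output is what one would expect today from appending a StopFilter
after CommonGramsFilter, but produced in a single pass.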


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


