[jira] [Commented] (LUCENE-2841) CommonGramsFilter improvements

Itamar Syn-Hershko (JIRA) Wed, 18 Jun 2014 10:19:04 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035978#comment-14035978 ]


Itamar Syn-Hershko commented on LUCENE-2841:
--------------------------------------------

Can anyone review and comment?

> CommonGramsFilter improvements
> ------------------------------
>
>                 Key: LUCENE-2841
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2841
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.1, 4.0-ALPHA
>            Reporter: Steve Rowe
>            Priority: Minor
>             Fix For: 4.9, 5.0
>
>         Attachments: commit-6402a55.patch
>
>
> Currently CommonGramsFilter expects users to remove the common words around 
> which output token ngrams are formed, by appending a StopFilter to the 
> analysis pipeline.  This is inefficient in two ways: captureState() is called 
> on (trailing) stopwords, and then the whole stream has to be re-examined by 
> the following StopFilter.
> The current ctor should be deprecated, and another ctor added with a boolean 
> option controlling whether the common words should be output as unigrams.
> If common words *are* configured to be output as unigrams, captureState() 
> will still need to be called, as it is now.
> If the common words are *not* configured to be output as unigrams, rather 
> than calling captureState() for the trailing token in each output token 
> ngram, the term text, position and offset can be maintained in the same way 
> as they are now for the leading token: using a System.arraycopy()'d term 
> buffer and a few ints for positionIncrement and offsets.  The user would 
> then no longer need to append a StopFilter to the analysis chain.
> An example illustrating both possibilities should also be added.
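
To make the two output modes described above concrete, here is a minimal plain-Java sketch. It is only a simulation over a list of strings and does not use Lucene's TokenStream API. The class name CommonGramsSketch, the method commonGrams and the flag keepCommonUnigrams are hypothetical names chosen for illustration, and the "_" separator is an assumption. Position increments, offsets and token types are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: this is NOT Lucene's CommonGramsFilter.
public class CommonGramsSketch {

    // Emits each token, plus a bigram for every adjacent pair in which at
    // least one token is a common word. When keepCommonUnigrams is false,
    // common words are not emitted on their own (the effect the issue wants
    // without a trailing StopFilter).
    static List<String> commonGrams(List<String> tokens, Set<String> commonWords,
                                    boolean keepCommonUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            boolean currentIsCommon = commonWords.contains(current);

            // Unigram: always kept for ordinary words, optional for common words.
            if (!currentIsCommon || keepCommonUnigrams) {
                out.add(current);
            }

            // Bigram: formed when either side of the pair is a common word.
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (currentIsCommon || commonWords.contains(next)) {
                    out.add(current + "_" + next); // "_" separator is an assumption
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "rain", "in", "spain");
        Set<String> common = new HashSet<>(Arrays.asList("the", "in"));

        // Common words kept as unigrams:
        // [the, the_rain, rain, rain_in, in, in_spain, spain]
        System.out.println(commonGrams(tokens, common, true));

        // Common words dropped as unigrams:
        // [the_rain, rain, rain_in, in_spain, spain]
        System.out.println(commonGrams(tokens, common, false));
    }
}
```

In this sketch the second mode yields the same terms that appending a stop-word removal step to the first mode would, but in a single pass over the tokens.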



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
