[jira] Commented: (LUCENE-2841) CommonGramsFilter improvements

Robert Muir (JIRA) Fri, 31 Dec 2010 11:37:11 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976338#action_12976338
 ]


Robert Muir commented on LUCENE-2841:
-------------------------------------

+1, this would be a great improvement.

There are two basic use cases (that I see):
# You never remove any stopwords, and use this solely to speed 
up phrase queries by forming bigrams around the common terms.
# You use commongrams+stopfilter as a "stopfilter replacement", which 
gives a more reasonable index size and the relevance benefits of stopword 
removal, while a user can still refine the query with double quotes and have 
the stopwords taken into consideration, and fast.

The latter case currently requires you to also use a StopFilter, which means 
we call captureState needlessly and very often (by definition, on common 
terms!). It also means you specify your stopword list twice, hash against 
two CharArraySets, etc. So it would be nice to add the boolean and 
accelerate case #2.
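
To make the two cases concrete, here is a rough plain-Java sketch of the token output with and without the proposed boolean. It uses no Lucene classes: CommonGramsSketch, analyze and outputCommonUnigrams are hypothetical names made up for illustration, not the actual or proposed API, and position increments, offsets and captureState are ignored entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a list-based stand-in for the token stream, not Lucene code.
final class CommonGramsSketch {

    /**
     * Emits each token as a unigram and, whenever either member of an adjacent
     * pair is a common word, a bigram joined with '_'.
     * If outputCommonUnigrams is false, unigrams for common words are dropped,
     * which mimics appending a StopFilter with the same word list (case #2).
     */
    static List<String> analyze(List<String> tokens, Set<String> common,
                                boolean outputCommonUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String cur = tokens.get(i);
            boolean curCommon = common.contains(cur);
            if (!curCommon || outputCommonUnigrams) {
                out.add(cur); // unigram
            }
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (curCommon || common.contains(next)) {
                    out.add(cur + "_" + next); // bigram around a common word
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "of"));
        List<String> tokens = Arrays.asList("lord", "of", "the", "rings");
        // Case #1: [lord, lord_of, of, of_the, the, the_rings, rings]
        System.out.println(analyze(tokens, common, true));
        // Case #2: [lord, lord_of, of_the, the_rings, rings]
        System.out.println(analyze(tokens, common, false));
    }
}
```

In case #2 the common words survive only inside bigrams, so a quoted phrase query can still match on them while the index carries no standalone stopword postings.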


> CommonGramsFilter improvements
> ------------------------------
>
>                 Key: LUCENE-2841
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2841
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>
> Currently CommonGramsFilter expects users to remove the common words around 
> which output token ngrams are formed, by appending a StopFilter to the 
> analysis pipeline.  This is inefficient in two ways: captureState() is called 
> on (trailing) stopwords, and then the whole stream has to be re-examined by 
> the following StopFilter.
> The current ctor should be deprecated, and another ctor added with a boolean 
> option controlling whether the common words should be output as unigrams.
> If common words *are* configured to be output as unigrams, captureState() 
> will still need to be called, as it is now.
> If the common words are *not* configured to be output as unigrams, rather 
> than calling captureState() for the trailing token in each output token 
> ngram, the term text, position and offset can be maintained in the same way 
> as they are now for the leading token: using a System.arraycopy()'d term 
> buffer and a few ints for positionIncrement and offsets.  The user then no 
> longer would need to append a StopFilter to the analysis chain.
> An example illustrating both possibilities should also be added.

