[jira] Issue Comment Edited: (LUCENE-1224) NGramTokenFilter creates bad TokenStream

Otis Gospodnetic (JIRA) Thu, 15 May 2008 09:08:17 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597174#action_12597174 ]


otis edited comment on LUCENE-1224 at 5/15/08 9:07 AM:
-------------------------------------------------------------------

Hiroaki:
I agree with Grant about unit tests.  I looked at the unit tests and thought 
the same thing as Grant - why is Hiroaki adding indexing/searching into the 
mix?  Your change is about modifying the positions of n-grams, and you don't 
need to index or search for that.  The test will be a lot simpler if you just 
test for positions, like Grant suggested.

Also, once you change the unit test this way, it will be a lot easier to play 
with positions and figure out what the "right" way to handle positions is.
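
To illustrate what a positions-only check could look like, here is a minimal 
sketch. It deliberately uses no Lucene classes: the class and method names are 
hypothetical, and the terms and increments are simply the ones the issue 
description lists as the expected output for "abc". It assumes the usual 
convention that the position counter starts before the first token, so a first 
token with increment 1 lands at position 0.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: checks token positions
// without building an index or running a search.
final class PositionCheckSketch {

    /** Turns per-token position increments into absolute positions. */
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1; // assumed starting point: before the first token
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Expected stream from the issue description: ab abc(positionIncrement=0) bc
        String[] terms = {"ab", "abc", "bc"};
        int[] increments = {1, 0, 1};
        int[] positions = absolutePositions(increments);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " @ " + positions[i]);
        }
        System.out.println(Arrays.toString(positions)); // prints [0, 0, 1]
    }
}
```

In a real test the terms and increments would come from the filter's output 
rather than from hard-coded arrays.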

Finally, it might turn out that people have different needs or different 
expectations for n-gram positions.  Thus, when making changes, perhaps you can 
think of a mechanism that allows the caller to instruct the n-gram tokenizer 
which token positioning approach to take (e.g. the "incremental" one, or the 
one based on the position of the originating token, or...)
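
One way such a mechanism might look is sketched below. This is only an 
illustration under stated assumptions, not a proposal for the actual API: the 
enum, its constant names, and the Gram class are all hypothetical, the sketch 
works on a single plain string rather than a TokenStream, and it assumes the 
source token itself has a position increment of 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: the caller picks a positioning mode.
final class NGramPositionSketch {

    enum PositionMode {
        INCREMENTAL,       // every n-gram advances the position by one
        START_OFFSET,      // n-grams starting at the same character share a position
        ORIGINATING_TOKEN  // every n-gram stays at the source token's position
    }

    static final class Gram {
        final String text;
        final int positionIncrement;

        Gram(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "/+" + positionIncrement;
        }
    }

    static List<Gram> ngrams(String token, int minGram, int maxGram, PositionMode mode) {
        List<Gram> grams = new ArrayList<>();
        for (int start = 0; start + minGram <= token.length(); start++) {
            for (int n = minGram; n <= maxGram && start + n <= token.length(); n++) {
                int increment;
                switch (mode) {
                    case INCREMENTAL:
                        increment = 1;
                        break;
                    case START_OFFSET:
                        // Only the shortest gram at each start advances the position.
                        increment = (n == minGram) ? 1 : 0;
                        break;
                    default: // ORIGINATING_TOKEN
                        // Only the very first gram advances (assumed increment of 1).
                        increment = grams.isEmpty() ? 1 : 0;
                        break;
                }
                grams.add(new Gram(token.substring(start, start + n), increment));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        for (PositionMode mode : PositionMode.values()) {
            System.out.println(mode + ": " + ngrams("abc", 2, 4, mode));
        }
    }
}
```

With "abc" and min=2, max=4, the START_OFFSET mode yields ab/+1, abc/+0, bc/+1, 
which matches the stream the issue description expects.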


> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, 
> NGramTokenFilter.patch
>
>
> With the current trunk NGramTokenFilter(min=2,max=4), I index the string 
> "abcdef", but I can't query it with "abc". If I query with "ab", I get a 
> hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. 
> Queries depend on the Token order in the TokenStream: how stemming or 
> phrases are analyzed is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc, 
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc, 
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)
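
The two token orders described above can be reproduced with two different loop 
nestings, as in the following sketch. Whether the trunk filter is actually 
implemented with a length-first loop is an assumption made here for 
illustration; the class and method names are hypothetical and no Lucene 
classes are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: two ways to order the same n-grams.
final class NGramOrderSketch {

    /** Emits all grams of the shortest length first, then the next length, and so on. */
    static List<String> lengthMajor(String s, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= s.length(); start++) {
                out.add(s.substring(start, start + n));
            }
        }
        return out;
    }

    /** Emits all grams that begin at one character before moving to the next character. */
    static List<String> startMajor(String s, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < s.length(); start++) {
            for (int n = minGram; n <= maxGram && start + n <= s.length(); n++) {
                out.add(s.substring(start, start + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lengthMajor("abc", 2, 4)); // prints [ab, bc, abc]
        System.out.println(startMajor("abc", 2, 4));  // prints [ab, abc, bc]
    }
}
```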

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
