[jira] Issue Comment Edited: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller then minimum gram size.

Karl Wettin (JIRA) Tue, 02 Jun 2009 14:53:33 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715712#action_12715712 ]


Karl Wettin edited comment on LUCENE-1491 at 6/2/09 2:51 PM:
-------------------------------------------------------------

Although you have a valid point, I'd like to argue this a bit.

My arguments will probably be considered silly by some. Perhaps I'm the only 
one who uses ngrams for something completely different from what everybody 
else does, but here we go: adding the feature suggested by this patch is, in 
my opinion, a fix for the symptoms of a bad use of character ngrams.

BOL, EOL, whitespace and punctuation are all valid parts of character ngrams 
that can increase precision/recall quite a bit. EdgeNGrams could sort of be 
considered such data too. So what I'm saying here is that I consider your 
example a bad use of character ngrams: the whole sentence should have been 
grammed up. In the case of 4-grams the output would then be "to b", "o be", 
" be ", "be o", and so on. Perhaps even "$to ", "to b", "o be", and so on.
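
To make the whole-sentence idea concrete, here is a minimal sketch in plain 
Java. It is not Lucene code and not the filter proposed above: the class name, 
the method name and the input sentence are made up for illustration, and 
treating "$" as a marker added at both ends of the text is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class CharNGramSketch {

    /**
     * Slides a window of n characters over the whole text, so whitespace
     * and punctuation end up inside the grams. When markBoundaries is true,
     * "$" is added at both ends as an assumed BOL/EOL marker.
     */
    static List<String> grams(String text, int n, boolean markBoundaries) {
        String s = markBoundaries ? "$" + text + "$" : text;
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed example sentence; starts with "to b", "o be", " be ", "be o"
        System.out.println(grams("to be or not to be", 4, false));
        // With markers; starts with "$to ", "to b", "o be"
        System.out.println(grams("to be or not to be", 4, true));
    }
}
```

Note that a text shorter than n simply yields no grams here, which is one 
reason the short-token case does not come up when the whole sentence is 
grammed.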

Supporting what I suggest will of course mean quite a bit more work: a whole 
new filter that also normalizes the input text, for example by collapsing 
double spaces. That will probably not be implemented anytime soon. But adding 
the features in the patch to the filter effectively means that this use is 
endorsed by the community, and I'm not sure that's a good idea. I therefore 
think it would be better to have some sort of secondary filter that does the 
exact same thing as the patch.

Perhaps I should leave this issue alone and do some more work on LUCENE-1306.

> EdgeNGramTokenFilter stops on tokens smaller then minimum gram size.
> --------------------------------------------------------------------
>
>                 Key: LUCENE-1491
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1491
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.4, 2.4.1, 2.9, 3.0
>            Reporter: Todd Feak
>            Assignee: Otis Gospodnetic
>             Fix For: 2.9
>
>         Attachments: LUCENE-1491.patch
>
>
> If a token is encountered in the stream that is shorter in length than the 
> min gram size, the filter will stop processing the token stream.
> Working up a unit test now, but may be a few days before I can provide it. 
> Wanted to get it in the system.
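
The reported behavior can be pictured with a small plain-Java sketch. This is 
not the filter's actual source and not the attached patch: the class and 
method names are hypothetical, and skipping the short token is only one 
possible alternative, shown for contrast.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's EdgeNGramTokenFilter.
class EdgeGramStopSketch {

    /** Front-edge grams of one token, lengths min..max, capped at the token length. */
    static List<String> edgeGrams(String token, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int len = min; len <= Math.min(max, token.length()); len++) {
            out.add(token.substring(0, len));
        }
        return out;
    }

    /** Behavior as reported: the first token shorter than min ends the whole stream. */
    static List<String> stopOnShort(List<String> tokens, int min, int max) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() < min) {
                break; // everything after the short token is lost
            }
            out.addAll(edgeGrams(t, min, max));
        }
        return out;
    }

    /** Assumed alternative: drop only the short token and keep processing. */
    static List<String> skipShort(List<String> tokens, int min, int max) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() < min) {
                continue;
            }
            out.addAll(edgeGrams(t, min, max));
        }
        return out;
    }
}
```

With the tokens "lucene", "is", "fast" and gram sizes 3 to 4, the first 
variant produces grams for "lucene" only, while the second also produces 
grams for "fast".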
