[
https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715712#action_12715712
]
Karl Wettin edited comment on LUCENE-1491 at 6/2/09 2:51 PM:
-------------------------------------------------------------
Although you have a valid point, I'd like to argue this a bit.
My arguments will probably be considered silly by some. Perhaps it's just me who
uses ngrams for something completely different from what everybody else does, but
here we go: adding the feature as suggested by this patch is, in my opinion,
a fix for the symptoms of bad use of character ngrams.
BOL, EOL, whitespace and punctuation are all valid parts of character ngrams
that can increase precision/recall quite a bit. EdgeNGrams could sort of be
considered such data too. So what I'm saying here is that I consider your
example a bad use of character ngrams, and that the whole sentence should have
been grammed up. So in the case of 4-grams the output would end up as: "to b",
"o be", " be ", "be o", and so on. Perhaps even "$to ", "to b", "o be", and so
on.
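To make the whole-sentence gramming above concrete, here is a minimal sketch
in plain Java. It does not use Lucene's analysis API. The class and method
names are hypothetical, and using "$" as the marker at both the beginning and
the end of the line is an assumption; the comment only shows "$" at the start.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not part of Lucene or of the attached patch.
public class CharNGramSketch {

    // Slide a window of n characters over the whole text, so that
    // whitespace and punctuation end up inside the grams.
    public static List<String> grams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    // Same as grams(), but with a boundary marker added at both ends.
    // Reusing one marker for BOL and EOL is an assumption of this sketch.
    public static List<String> gramsWithBoundaries(String text, int n, char marker) {
        return grams(marker + text + marker, n);
    }

    public static void main(String[] args) {
        // First grams: "to b", "o be", " be ", "be o"
        System.out.println(grams("to be or not", 4));
        // "$to ", "to b", "o be", " be$"
        System.out.println(gramsWithBoundaries("to be", 4, '$'));
    }
}
```

Input shorter than the gram size simply yields no grams in this sketch.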
Supporting what I suggest will of course mean quite a bit more work: a whole
new filter that also does input text normalization such as removing double
spaces and whatnot. That will probably not be implemented anytime soon. But
adding the features in the patch to the filter actually means that this use is
endorsed by the community, and I'm not sure that's a good idea. I thus think it
would be better to have some sort of secondary filter that does the exact same
thing as the patch.
Perhaps I should leave this issue alone and do some more work on LUCENE-1306.
> EdgeNGramTokenFilter stops on tokens smaller then minimum gram size.
> --------------------------------------------------------------------
>
> Key: LUCENE-1491
> URL: https://issues.apache.org/jira/browse/LUCENE-1491
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 2.4, 2.4.1, 2.9, 3.0
> Reporter: Todd Feak
> Assignee: Otis Gospodnetic
> Fix For: 2.9
>
> Attachments: LUCENE-1491.patch
>
>
> If a token is encountered in the stream that is shorter in length than the
> min gram size, the filter will stop processing the token stream.
> Working up a unit test now, but may be a few days before I can provide it.
> Wanted to get it in the system.