Note also that I am only proposing to make this an option; I agree this is a valid, good case. Just for completeness: one could also use payloads and some newfangled query to do the sub-position thing.

-Grant

On May 15, 2008, at 12:54 PM, Doug Cutting wrote:

The conventional use of n-grams when searching is to treat them not as a set but as a sequence. Thus, for "foola" you could index the sequence ["_f", "fo", "oo", "ol", "la", "a_"], and then search for the phrase ["oo", "ol"] to find all occurrences of "ool". This is useful in languages that use logograms without spaces, like Japanese and Chinese, and in other cases (e.g., Nutch uses word n-grams to optimize searches for phrases containing very common terms).
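A minimal sketch of the sequence idea above, in plain Python rather than Lucene. The helper names `char_ngrams` and `find_phrase` are hypothetical and only illustrate the technique; the "_" boundary marker follows the example in the message.

```python
def char_ngrams(word, n=2, pad="_"):
    """Return the character n-grams of `word`, in order, with boundary padding."""
    # Pad with n-1 boundary markers on each side (assumption: "_" never occurs in the text).
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def find_phrase(tokens, phrase):
    """Return the start offsets where `phrase` occurs as a contiguous run in `tokens`."""
    m = len(phrase)
    return [i for i in range(len(tokens) - m + 1) if tokens[i:i + m] == phrase]


grams = char_ngrams("foola")
# grams == ["_f", "fo", "oo", "ol", "la", "a_"]
hits = find_phrase(grams, ["oo", "ol"])
# hits == [2]: the grams "oo" and "ol" are adjacent, so "ool" occurs in "foola"
```

Treating the grams as an unordered set would also match a word that merely contains "oo" and "ol" somewhere, which is why the order matters here.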

Do you have a use-case for the alternative, where n-grams are treated as a set, rather than a sequence?

Doug

Grant Ingersoll wrote:
See https://issues.apache.org/jira/browse/LUCENE-1224
Do people have an opinion on what positions n-grams should be output at? For instance, given 1-grams on "abc fgh", these are currently output as: a, b, c, f, g, h, all with a position increment of 1. That seems somewhat reasonable, but it has tradeoffs, namely that you can't query for something like "a f" without some amount of slop, which I think is a reasonable thing to do (but I don't have an actual use case for it at the moment). An alternative might be to output a, b, c all at the same position, then increment for f and put g and h at the same position as f. I am _wondering_ whether it makes more sense to add an option to the NGram token streams so that we could choose between outputting the n-grams within a "token" at the same position or at successive positions (the latter to stay back-compatible). It isn't clear to me which is correct, or whether there is even a notion of correctness here, insomuch as both are correct if that is the functionality you want in your application. As DM Smith noted, if Lucene supported the notion of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for the example above, but that capability doesn't exist in Lucene right now, AFAIK.
-Grant
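The two position schemes discussed above can be compared with a small sketch. This is plain Python, not the Lucene API: `unigram_positions` and `phrase_matches` are hypothetical helpers, and slop is simplified here to "how many extra positions the second term may lag behind the first".

```python
def unigram_positions(text, same_position=False):
    """Return (gram, position) pairs for the 1-grams of whitespace-separated tokens."""
    out = []
    pos = 0
    for token in text.split():
        if same_position:
            # All grams of one token share a single position.
            pos += 1
            out.extend((ch, pos) for ch in token)
        else:
            # Current behaviour: every gram advances the position by 1.
            for ch in token:
                pos += 1
                out.append((ch, pos))
    return out


def phrase_matches(postings, first, second, slop=0):
    """True if `second` occurs between 1 and 1 + slop positions after `first`."""
    firsts = [p for g, p in postings if g == first]
    seconds = [p for g, p in postings if g == second]
    return any(0 < s - f <= 1 + slop for f in firsts for s in seconds)


successive = unigram_positions("abc fgh")
stacked = unigram_positions("abc fgh", same_position=True)

print(phrase_matches(successive, "a", "f"))          # False: f is 3 positions after a
print(phrase_matches(successive, "a", "f", slop=2))  # True once slop is allowed
print(phrase_matches(stacked, "a", "f"))             # True: a is at 1, f is at 2
```

With successive positions the query "a f" needs slop; with grams stacked per token it matches exactly, but the order of grams within a token is lost.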
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]