Re: NGrams and positions

Grant Ingersoll Fri, 16 May 2008 05:27:05 -0700

Note, also, I am proposing to have the option; I agree this is a valid, good case. One could use payloads and some newfangled query to do the sub-position thing, just for completeness here.


-Grant


On May 15, 2008, at 12:54 PM, Doug Cutting wrote:

The conventional use of ngrams when searching is not to treat them as a set but as a sequence. Thus, for "foola" you could index the sequence ["_f", "fo", "oo", "ol", "la", "a_"], and then search for the phrase ["oo", "ol"] to find all occurrences of "ool". This is useful in languages that use logograms without spaces, like Japanese and Chinese, and in other cases (e.g., Nutch uses word-ngrams to optimize searches for phrases containing very common terms).
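To make the sequence-versus-set distinction concrete, here is a minimal sketch in plain Python. It is not Lucene code: the helper names (char_ngrams, build_index, phrase_search, set_search) and the sample words other than "foola" are assumptions made up for illustration, and the index is just a dictionary of positions.

```python
from collections import defaultdict


def char_ngrams(word, n=2, pad="_"):
    """Return the character n-grams of `word` in order, padded at both ends."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_index(docs):
    """Map each n-gram to {doc_id: [positions]}, one position per gram."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, word in docs.items():
        for pos, gram in enumerate(char_ngrams(word)):
            index[gram][doc_id].append(pos)
    return index


def phrase_search(index, grams):
    """Return doc ids where `grams` occur at consecutive positions (sequence reading)."""
    hits = set()
    for doc_id, starts in index.get(grams[0], {}).items():
        for start in starts:
            if all(start + k in index.get(g, {}).get(doc_id, [])
                   for k, g in enumerate(grams)):
                hits.add(doc_id)
                break
    return hits


def set_search(index, grams):
    """Return doc ids containing every gram, ignoring order (set reading)."""
    doc_sets = [set(index.get(g, {})) for g in grams]
    return set.intersection(*doc_sets) if doc_sets else set()


# Hypothetical sample documents; only "foola" comes from the message above.
DOCS = {1: "foola", 2: "loofa", 3: "folio", 4: "olfoo"}
INDEX = build_index(DOCS)

print(char_ngrams("foola"))                       # ['_f', 'fo', 'oo', 'ol', 'la', 'a_']
print(sorted(phrase_search(INDEX, ["oo", "ol"])))  # [1]
print(sorted(set_search(INDEX, ["oo", "ol"])))     # [1, 4]
```

In this sketch the phrase search finds only "foola", which really contains "ool", while the set search also returns "olfoo", which contains both bigrams but not in sequence.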
Do you have a use-case for the alternative, where n-grams are treated as a set, rather than a sequence?
Doug

Grant Ingersoll wrote:
See https://issues.apache.org/jira/browse/LUCENE-1224
Do people have an opinion on what positions ngrams should be output at? For instance, given 1-grams on "abc fgh", these are currently output as: a, b, c, f, g, h, all with a position increment of 1.
That seems somewhat reasonable, but it has tradeoffs, namely you can't query for something like: "a f" without some amount of slop, which I think is a reasonable thing to do (but don't have an actual use case for at the moment.) An alternative way might be to output a, b, c all at the same position, then increment for f and then put g and h at the same position.
I am _wondering_ whether it makes more sense to add an option to the NGram token streams such that we could have the choice of either outputting the n-grams within a "token" at the same position or at successive positions (to be back-compatible.) It isn't clear to me which is correct, or if there is even a notion of correctness here, insomuch as they are both correct if that is the functionality you want in your application.
As DM Smith noted, if Lucene supported the notion of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for the example above, but that capability doesn't exist in Lucene right now, AFAIK.
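The two position schemes for 1-grams on "abc fgh" can be sketched in plain Python as follows. This is an illustration only, not the NGram token stream code from LUCENE-1224: the function names and the same_position flag are hypothetical, and positions are counted from 0 as a choice made for this sketch.

```python
def unigram_tokens(text, same_position=False):
    """Return (term, position_increment) pairs for the 1-grams of each word.

    same_position=False: every gram advances the position by 1 (current behavior
    described in the message). same_position=True: only the first gram of each
    word advances the position; the rest use an increment of 0.
    """
    tokens = []
    for word in text.split():
        for i, ch in enumerate(word):
            inc = 0 if (same_position and i > 0) else 1
            tokens.append((ch, inc))
    return tokens


def absolute_positions(tokens):
    """Turn position increments into absolute positions, starting at 0."""
    pos = -1
    out = []
    for term, inc in tokens:
        pos += inc
        out.append((term, pos))
    return out


def exact_phrase_match(positioned, first, second):
    """True if `second` occurs exactly one position after `first` (no slop)."""
    firsts = {p for t, p in positioned if t == first}
    seconds = {p for t, p in positioned if t == second}
    return any(p + 1 in seconds for p in firsts)


successive = absolute_positions(unigram_tokens("abc fgh"))
stacked = absolute_positions(unigram_tokens("abc fgh", same_position=True))

print(successive)  # [('a', 0), ('b', 1), ('c', 2), ('f', 3), ('g', 4), ('h', 5)]
print(stacked)     # [('a', 0), ('b', 0), ('c', 0), ('f', 1), ('g', 1), ('h', 1)]
print(exact_phrase_match(successive, "a", "f"))  # False
print(exact_phrase_match(stacked, "a", "f"))     # True
```

Under these assumptions, "a f" matches without slop only when the grams of a word share a position. The flip side in this sketch is that "a b" stops matching as an exact phrase in that mode, since a and b then sit at the same position.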
-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]