Note also that I am only proposing to add this as an option; I agree
yours is a valid, good case. Just for completeness, one could also use
payloads and some newfangled query to do the sub-position thing.
-Grant
On May 15, 2008, at 12:54 PM, Doug Cutting wrote:
The conventional use of ngrams when searching is not to treat them
as a set but a sequence. Thus, for "foola" you could index the
sequence ["_f", "fo", "oo", "ol", "la", "a_"], and then search for
the phrase ["oo", "ol"] to find all occurrences of "ool". This is
useful in languages that use logograms without spaces, like Japanese
and Chinese, and in other cases (e.g., Nutch uses word-ngrams to
optimize searches for phrases containing very common terms).
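
To make the sequence idea concrete, here is a minimal sketch in plain Python. It is not Lucene's API: the toy positional index and the names char_ngrams, build_index and phrase_search are assumptions made up for illustration. It indexes padded bigrams by position and treats a substring search as a phrase of consecutive bigrams.

```python
from collections import defaultdict


def char_ngrams(word, n=2, pad="_"):
    """Return the ordered character n-grams of `word`, padded at both ends."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_index(docs, n=2):
    """Toy positional index: n-gram -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, word in docs.items():
        for pos, gram in enumerate(char_ngrams(word, n)):
            index[gram][doc_id].append(pos)
    return index


def phrase_search(index, substring, n=2):
    """Return ids of docs where the n-grams of `substring` occur consecutively."""
    grams = [substring[i:i + n] for i in range(len(substring) - n + 1)]
    if not grams:
        return set()
    hits = set()
    for doc_id, starts in index.get(grams[0], {}).items():
        for start in starts:
            # Every following gram must sit exactly k positions after the first.
            if all(start + k in index.get(g, {}).get(doc_id, [])
                   for k, g in enumerate(grams)):
                hits.add(doc_id)
                break
    return hits


docs = {1: "foola", 2: "look"}
index = build_index(docs)
print(char_ngrams("foola"))
print(phrase_search(index, "ool"))
```

Both documents contain the bigram "oo", but only "foola" has "ol" at the very next position, which is why position order matters here.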
Do you have a use-case for the alternative, where n-grams are
treated as a set, rather than a sequence?
Doug
Grant Ingersoll wrote:
See https://issues.apache.org/jira/browse/LUCENE-1224
Do people have an opinion on what positions ngrams should be output
at? For instance, given 1-grams on "abc fgh", these are currently
output as: a, b, c, f, g, h, all with a position increment of 1.
That seems somewhat reasonable, but it has tradeoffs, namely you
can't query for something like: "a f" without some amount of slop,
which I think is a reasonable thing to do (but don't have an actual
use case for at the moment.) An alternative way might be to output
a, b, c all at the same position, then increment for f and then put
g and h at the same position.
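
The two schemes can be compared with a small sketch in plain Python. This is not the NGram token stream code; unigram_positions and matches_in_order are hypothetical helpers, and the slop here is a simplified "extra positions allowed between the two terms", not Lucene's edit-distance slop.

```python
def unigram_positions(text, stacked=False):
    """Return (gram, position) pairs for the 1-grams of each word in `text`.

    stacked=False: every gram advances the position by 1 (successive positions).
    stacked=True:  all grams of one word share a position (increment 0 inside
                   a word, 1 between words).
    """
    out = []
    pos = -1
    for word in text.split():
        if stacked:
            pos += 1
        for ch in word:
            if not stacked:
                pos += 1
            out.append((ch, pos))
    return out


def matches_in_order(tokens, first, second, slop=0):
    """True if `second` follows `first` with at most `slop` positions in between."""
    firsts = [p for g, p in tokens if g == first]
    seconds = [p for g, p in tokens if g == second]
    return any(0 <= q - p - 1 <= slop for p in firsts for q in seconds)


successive = unigram_positions("abc fgh")
stacked = unigram_positions("abc fgh", stacked=True)

print(successive)
print(stacked)
print(matches_in_order(successive, "a", "f"))          # needs slop
print(matches_in_order(successive, "a", "f", slop=2))
print(matches_in_order(stacked, "a", "f"))
```

Note the tradeoff in the other direction: with stacked positions, "a b" no longer matches as an exact phrase, because a and b sit at the same position.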
I am _wondering_ whether it makes more sense to add an option to
the NGram token streams such that we could have the choice of
either outputting the n-grams within a "token" at the same position
or at successive positions (to be back-compatible.) It isn't clear
to me which is correct, or if there is even a notion of correctness
here, insofar as they are both correct if that is the
functionality you want in your application. As DM Smith noted, if
Lucene supported the notion of "sub" positions, one could output
1.a, 1.b, 1.c, 2.a, 2.b and 2.c for the example above, but that
capability doesn't exist in Lucene right now, AFAIK.
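
Purely as an illustration of what "sub" positions could buy, here is a sketch in plain Python. Everything in it is hypothetical (the (word position, sub-position) tuples and the helpers subpositions, adjacent_words and adjacent_chars); it does not correspond to anything in Lucene.

```python
def subpositions(text):
    """Return (gram, (word_pos, sub_pos)) for each character, both 1-based."""
    return [(ch, (w, s))
            for w, word in enumerate(text.split(), start=1)
            for s, ch in enumerate(word, start=1)]


def adjacent_words(tokens, first, second):
    """True if `second` occurs in the word right after one containing `first`."""
    return any(w2 == w1 + 1
               for g1, (w1, _) in tokens if g1 == first
               for g2, (w2, _) in tokens if g2 == second)


def adjacent_chars(tokens, first, second):
    """True if `second` immediately follows `first` inside the same word."""
    return any(w2 == w1 and s2 == s1 + 1
               for g1, (w1, s1) in tokens if g1 == first
               for g2, (w2, s2) in tokens if g2 == second)


tokens = subpositions("abc fgh")
print(tokens)
print(adjacent_words(tokens, "a", "f"))
print(adjacent_chars(tokens, "a", "b"))
```

With both coordinates available, the word-level query ("a f") and the within-word sequence query ("a b") can each be answered exactly, without choosing one position scheme at index time.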
-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]