"Erick Erickson" <[EMAIL PROTECTED]> wrote on 09/10/2006 13:09:21:
> ... The kicker is that what we are indexing is
> OCR data, some of which is pretty trashy. So you wind up with "interesting"
> words in your index, things like rtyHrS. So the whole question of allowing
> very specific queries on detailed wildcards (combined with spans) is under
> discussion. It's not at all clear to me that there's any value to the end
> users in the capability of, say, two character prefixes. And, it's an easy
> rule that "prefix queries must specify at least 3 non-wildcard
> characters"....

Erick, I may be off course here, but, fwiw, have you considered n-gram
indexing/search for a degree of fuzziness to compensate for OCR errors?
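
To make the idea concrete, here is a rough sketch in Python (not Lucene code). The
function names char_ngrams and ngram_overlap are made up for illustration, and the
overlap measure is just one simple assumption about how "fuzziness" could be
quantified. It shows that a word with a single OCR error still shares most of its
bigrams with the correct word, while real garbage shares none:

```python
def char_ngrams(word, n=2):
    """Return the overlapping character n-grams of a word (lowercased)."""
    word = word.lower()
    if len(word) < n:
        # Words shorter than n are kept whole so they are not lost.
        return [word] if word else []
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_overlap(query_word, indexed_word, n=2):
    """Fraction of the query word's distinct n-grams found in the indexed word."""
    q = set(char_ngrams(query_word, n))
    d = set(char_ngrams(indexed_word, n))
    return len(q & d) / len(q) if q else 0.0


print(char_ngrams("lucene"))              # ['lu', 'uc', 'ce', 'en', 'ne']
print(ngram_overlap("lucene", "1ucene"))  # 0.8 (one OCR error, 4 of 5 bigrams match)
print(ngram_overlap("lucene", "rtyHrS"))  # 0.0 (OCR garbage, nothing matches)
```

An exact term match would miss "1ucene" entirely; with bigrams only one of the five
tokens is lost.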

For a four-word query you would probably get ~20 tokens (bigrams?), no
matter what the index size is. You would then probably want to score documents
higher by LA (lexical affinity: query terms appearing close to each other in
the document). I am not sure to what degree a span query (made of n-gram
terms) would serve that, because (1) all terms in the span need to be there
(well, I think :-); and (2) you would want to increase the doc score for
close-by terms only when the matching query n-grams are themselves close by.
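
A rough sketch of the kind of scoring I have in mind, again in Python rather than
Lucene. Everything here is an assumption for illustration: the names
bigrams_with_positions and affinity_score, the window of 3 positions, and the 0.5
bonus are all hypothetical, and this is not how Lucene scores anything. Each query
bigram found in the document counts once (missing bigrams cost nothing, unlike a
span that needs all its terms), and a bonus is added only when two bigrams that are
adjacent in the query also appear close together, in order, in the document:

```python
def bigrams_with_positions(text, n=2):
    """Return (gram, position) pairs; positions count grams across the whole text."""
    out = []
    pos = 0
    for word in text.lower().split():
        # Words shorter than n yield a single token (the word itself).
        for i in range(max(len(word) - n + 1, 1)):
            out.append((word[i:i + n], pos))
            pos += 1
    return out


def affinity_score(query, document, window=3, bonus=0.5):
    """OR-style n-gram match plus a proximity bonus for adjacent query n-grams."""
    q = bigrams_with_positions(query)
    positions = {}
    for gram, pos in bigrams_with_positions(document):
        positions.setdefault(gram, []).append(pos)

    # Base score: one point per distinct query gram present in the document.
    score = float(sum(1 for gram in {g for g, _ in q} if gram in positions))

    # Bonus only for grams that are neighbours in the query AND close in the doc.
    for (g1, _), (g2, _) in zip(q, q[1:]):
        if any(0 < p2 - p1 <= window
               for p1 in positions.get(g1, [])
               for p2 in positions.get(g2, [])):
            score += bonus
    return score


# A four-word query yields roughly 20 bigram tokens:
print(len(bigrams_with_positions("search engine index errors")))  # 19

close = affinity_score("ocr data", "trashy ocr data here")
far = affinity_score("ocr data", "ocr is trashy and some other words precede data")
damaged = affinity_score("ocr data", "0cr data")
print(close > far, damaged > 0)  # True True
```

The document with "ocr" and "data" side by side outranks the one where they are far
apart, and the OCR-damaged "0cr data" still matches on four of the five bigrams.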

So there might not be a ready-to-use solution in Lucene for this, but
perhaps this is a more robust direction to try than the wildcard approach.
I mean, if users want to type a wildcard query, it is their right to do
so, but as application logic it does not seem the best choice.


