Well, we defined this problem away for one of our products, but it's back
for a different product. Siiiiigggghhhh......

I'm valiantly trying to get our product manager (hereinafter PM) to define
this problem away, perhaps allowing me to deal with this by clever indexing
and/or some variant on prefix query. But in case that doesn't fly, I'm
wondering what wisdom exists.

Fortunately, the PM agrees that it's silly to think about span queries
involving OR or NOT for this app. So I'm left with something like Jo*n AND
sm*th AND jon?es WITHIN 6.
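
For concreteness, here is a minimal sketch of the wildcard semantics assumed for those terms: * matches any run of characters and ? matches exactly one. The class and method names (WildcardTerm, toPattern, matches) are hypothetical, this is plain java.util.regex rather than anything from Lucene, and case-insensitive matching is an assumption.

```java
import java.util.regex.Pattern;

final class WildcardTerm {
    /** Translates a wildcard term (* = any run of characters, ? = exactly one) into a regex. */
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote literals so characters like '.' or '+' are not read as regex syntax.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Assumption: terms are compared without regard to case.
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** True if the whole term matches the wildcard expression. */
    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }
}
```

Under this reading, jo*n accepts both "john" and "johnson", while jon?es needs exactly one character between "jon" and "es", so it accepts "jonnes" but not "jones".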

The only approach that's occurred to me is to create a filter for the
terms, giving me a subset of my docs that have any terms satisfying the
above. For each doc in the filter, get creative with TermPositionVector for
determining whether the document matches. It seems that this would involve
creating a list of all positions in each doc in my filter that match jo*n,
another for sm*th, and another for jon?es and seeing if the distance
(however I define that) between any triple of terms (one from each list) is
less than 6.
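
The position comparison itself need not enumerate every triple. Here is a minimal sketch, assuming the positions matching each pattern in one document have already been collected into a sorted int array (from TermPositionVector or otherwise), and assuming "distance" means the gap between the lowest and highest chosen position. ProximityCheck and withinWindow are hypothetical names, not Lucene API.

```java
import java.util.List;

final class ProximityCheck {
    /**
     * Returns true if one position can be picked from each list so that
     * (max - min) of the picked positions is less than window.
     * Each array must be sorted ascending; one array per wildcard pattern.
     */
    static boolean withinWindow(List<int[]> positionLists, int window) {
        int k = positionLists.size();
        if (k == 0) {
            return false;
        }
        for (int[] list : positionLists) {
            if (list.length == 0) {
                return false; // a pattern with no hits in this doc can never match
            }
        }
        int[] idx = new int[k]; // current cursor into each list
        while (true) {
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            int minList = -1;
            for (int i = 0; i < k; i++) {
                int p = positionLists.get(i)[idx[i]];
                if (p < min) {
                    min = p;
                    minList = i;
                }
                if (p > max) {
                    max = p;
                }
            }
            if (max - min < window) {
                return true;
            }
            // Only advancing the cursor at the smallest position can shrink the span.
            idx[minList]++;
            if (idx[minList] == positionLists.get(minList).length) {
                return false; // that list is exhausted, no closer grouping exists
            }
        }
    }
}
```

Each pass through the loop advances exactly one cursor, so the work per document is bounded by the total number of matching positions times the number of patterns, rather than by the product of the list lengths.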

My gut feel is that this explodes time-wise based upon the number of terms
that match. In this particular application, we are indexing 20K books. Based
on indexing 4K of them, this amounts to about a 4G index (although I
actually expect this to be somewhat larger since I haven't indexed all the
fields, just the text so far). I can't imagine that comparing the expanded
terms for, say, 10,000 docs will be fast. I'm putting together an experiment
to test this though.

But someone could save me a lot of work by telling me that this is solved
already. This is your chance <G>......

The expanding queries (e.g. PrefixQuery, RegexQuery, WildcardQuery) all blow
up with TooManyClauses, and I've tried upping the MaxClauses field but that
takes forever and *then* blows up. Even with -Xmx set as high as I can.

I know, I know. If I solve this, feel free to submit it to the contribution
section.....

Thanks
Erick

P.S. Apologies if this is a re-post. But every time I try to submit a new
request from home, I get an error like this....
************
Technical details of permanent failure:
PERM_FAILURE: SMTP Error (state 12): 550 SpamAssassin score
5.1(DNS_FROM_RFC_ABUSE,HTML_00_10,HTML_MESSAGE,RCVD_IN_BL_SPAMCOP_NET)
exceeds threshold 5.0.
***********

Which appears to be related to the fact that I have a Direcway satellite
connection at home. Anybody who's figured out how to cure this, please feel
free to e-mail me. I don't quite know whether this is even getting to the
user list server or is getting returned from the Direcway processing....
