Re: wildcard and span queries

Mark Miller Fri, 06 Oct 2006 13:46:13 -0700

Paul's parser is beyond my feeble comprehension...but I would start bylooking at SrndTruncQuery. It looks to me like this enumerates eachpossible match just like a SpanRegexQuery does...I am too lazy to figureout what the visitor pattern is doing so I don't know if they then getadded to a boolean query, but I don't know what else would happen. Ifthis is the case, I am wondering if it is any more efficient than theSpanRegex implementation...which could be changed to a SpanWildcardimplementation. How exactly is this better at avoiding a toomanyclausesexception or ram fillup. Is it just the fact that the (lets say) threewildcard terms are anded so this should dramatically reduce the matches?I don't want to sound any stupider, so I will stop there--hopefully Paulwill expound on this.


- Mark


Erick Erickson wrote:

Paul:

Splendid! Now if I just understood a single thing about the SrndQueryfamily

<G>.

I followed your link, and took a look at the text file. That shouldgive me

enough to get started.

But if you wanted to e-mail me any sample code or long explanations ofwhat

this all does, I would forever be your lackey <G>....

I should also fairly easily be able to run a few of these against the

partial index I already have to get some sense of now it'll all workout inmy problem space. I suspect that the actual number of distinct termswon'tgrow too much after the first 4,000 books, so it'll probably be prettysafeto get this running in the "worst case", find out if/where things blowup,

and put in some safeguards. Or perhaps discover that it's completely and
entirely perfect <G>.

Thanks again
Erick

On 10/6/06, Paul Elschot <[EMAIL PROTECTED]> wrote:


On Friday 06 October 2006 14:37, Erick Erickson wrote:
...
> Fortunately, the PM agrees that it's silly to think about span queries
> involving OR or NOT for this app. So I'm left with something like Jo*n
AND
> sm*th AND jon?es WITHIN 6.

OR works much the same as term expansion for wildcards.

> The only approach that's occurred to me is to create a filter onfor the> terms, giving me a subset of my docs that have any terms satisfyingthe> above. For each doc in the filter, get creative withTermPositionVector

for
> determining whether the document matches. It seems that this would
involve
> creating a list of all positions in each doc in my filter that match
jo*n,
> another for sm*th, and another for jon?es and seeing if the distance

> (however I define that) between any triple of terms (one from eachlist)

is
> less than 6.

> My gut feel is that this explodes time-wise based upon the number of
terms
> that match. In this particular application, we are indexing 20K books.
Based
> on indexing 4K of them, this amounts to about a 4G index (although I
> acutally expect this to be somewhat larger since I haven't indexed all
the
> fields, just the text so far). I can't imagine that comparing the
expanded
> terms for, say, 10,000 docs will be fast. I'm putting together an
experiment
> to test this though.
>
> But someone could save me a lot of work by telling me that this is
solved
> already. This is your chance <G>......

It's solved :) here:
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/

The surround query language uses only the spans package for
WITHIN like queries, no filters.
You may not want to use the parser, but all the rest could be handy.

> The expanding queries (e.g. PrefixQuery, RegexQuery, WildcardQuery)all

blow
> up with TooManyClauses, and I've tried upping the MaxClauses field but
that

> takes forever and *then* blows up. Even with -Xmx set as high as Ican.


The surround language has its own limitation on the maximum number
of terms expanded for wildcards, and it works nicely even for rather
high numbers of terms (thousands) for WITHIN like queries,
given enough RAM.

It shouldn't be too difficult to add NOT queries within WITHIN,
there already is a SpanNotQuery in Lucene to map onto.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: wildcard and span queries

Reply via email to