Re: wildcard and span queries

Paul Elschot Fri, 06 Oct 2006 13:31:42 -0700

Erick,

On Friday 06 October 2006 22:01, Erick Erickson wrote:
> Paul:
> 
> Splendid! Now if I just understood a single thing about the SrndQuery family
> <G>.
> 
> I followed your link, and took a look at the text file. That should give me
> enough to get started.
> 
> But if you wanted to e-mail me any sample code or long explanations of what
> this all does, I would forever be your lackey <G>....


Correcting any bug in the surround source code would do...

> I should also fairly easily be able to run a few of these against the
> partial index I already have to get some sense of now it'll all work out in
> my problem space. I suspect that the actual number of distinct terms won't
> grow too much after the first 4,000 books, so it'll probably be pretty safe
> to get this running in the "worst case", find out if/where things blow up,
> and put in some safeguards. Or perhaps discover that it's completely and
> entirely perfect <G>.

Have a look at the surround test code, and then try and use the surround 
parser as a replacement for the standard query parser.
That should be doable, and it will at least allow you to get some
(nested) proximity queries running against your own data.
The next step is to replace the parser with your own, assuming you 
have one already.

I never got round to writing javadocs for it, and I realize that is
an obstacle. The problem is that writing good javadocs takes me
about as much time as writing the code in the first place.

There is a safeguard in the surround code, it's built into the 
BasicQueryFactory class. It is used at the bottom
of the surround code to generate no more than a maximum
of Lucene TermQuery's and SpanTermQuery's.
The top of the surround code is its parser.

Regards,
Paul Elschot

 
> Thanks again
> Erick
> 
> On 10/6/06, Paul Elschot <[EMAIL PROTECTED]> wrote:
> >
> > On Friday 06 October 2006 14:37, Erick Erickson wrote:
> > ...
> > > Fortunately, the PM agrees that it's silly to think about span queries
> > > involving OR or NOT for this app. So I'm left with something like Jo*n
> > AND
> > > sm*th AND jon?es WITHIN 6.
> >
> > OR works much the same as term expansion for wildcards.
> >
> > > The only approach that's occurred to me is to create a filter on for the
> > > terms, giving me a subset of my docs that have any terms satisfying the
> > > above. For each doc in the filter, get creative with TermPositionVector
> > for
> > > determining whether the document matches. It seems that this would
> > involve
> > > creating a list of all positions in each doc in my filter that match
> > jo*n,
> > > another for sm*th, and another for jon?es and seeing if the distance
> > > (however I define that) between any triple of terms (one from each list)
> > is
> > > less than 6.
> >
> > > My gut feel is that this explodes time-wise based upon the number of
> > terms
> > > that match. In this particular application, we are indexing 20K books.
> > Based
> > > on indexing 4K of them, this amounts to about a 4G index (although I
> > > acutally expect this to be somewhat larger since I haven't indexed all
> > the
> > > fields, just the text so far). I can't imagine that comparing the
> > expanded
> > > terms for, say, 10,000 docs will be fast. I'm putting together an
> > experiment
> > > to test this though.
> > >
> > > But someone could save me a lot of work by telling me that this is
> > solved
> > > already. This is your chance <G>......
> >
> > It's solved :) here:
> > http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/
> >
> > The surround query language uses only the spans package for
> > WITHIN like queries, no filters.
> > You may not want to use the parser, but all the rest could be handy.
> >
> > > The expanding queries (e.g. PrefixQuery, RegexQuery, WildcardQuery) all
> > blow
> > > up with TooManyClauses, and I've tried upping the MaxClauses field but
> > that
> > > takes forever and *then* blows up. Even with -Xmx set as high as I can.
> >
> > The surround language has its own limitation on the maximum number
> > of terms expanded for wildcards, and it works nicely even for rather
> > high numbers of terms (thousands) for WITHIN like queries,
> > given enough RAM.
> >
> > It shouldn't be too difficult to add NOT queries within WITHIN,
> > there already is a SpanNotQuery in Lucene to map onto.
> >
> > Regards,
> > Paul Elschot
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: wildcard and span queries

Reply via email to