Re: Why Lucene has to rewrite queries prior to actual searching?

Paul Elschot Tue, 08 Apr 2008 07:57:10 -0700

On Tuesday 08 April 2008 15:18:34, Itamar Syn-Hershko wrote:
> Paul,
>
> I don't see how this answers the question.


Towards the end, the page describes when a Scorer is called and
roughly what it does.

> I was asking why Lucene
> has to access the index with exact terms, rather than supporting
> RegEx or simpler wildcards internally. If Lucene were able to look
> for "w?rd" or "wor*" and treat the wildcards as wildcards, this would
> greatly improve search speed and eliminate the need for
> query rewriting.

When it is known in advance that patterns such as "w?rd" and "wor*"
will be used in queries a lot, one can write a tokenizer that indexes
them as terms, so that they can be searched directly.
The problem is knowing that in advance, that is, at indexing time.
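
As a rough illustration of the tokenizer idea, here is a minimal sketch
that uses only the JDK and no Lucene classes. The class name
PrefixIndexSketch and its methods are hypothetical. It assumes that only
trailing-wildcard patterns like "wor*" are needed and that the indexed
text itself never contains '*'. Every leading prefix of a word is
indexed as an extra term, so the wildcard query becomes one exact lookup:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not Lucene code: a tiny in-memory inverted index that
// also indexes every leading prefix of each word as a term of its own.
class PrefixIndexSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index-time step: emit the word itself plus "w*", "wo*", "wor*", ...
    static List<String> tokens(String word) {
        List<String> out = new ArrayList<>();
        out.add(word);
        for (int i = 1; i <= word.length(); i++) {
            out.add(word.substring(0, i) + "*");
        }
        return out;
    }

    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            for (String term : tokens(word)) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Search-time step: "wor*" is an ordinary term, so no rewriting is needed.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        PrefixIndexSketch index = new PrefixIndexSketch();
        index.add(1, "word play");
        index.add(2, "world peace");
        index.add(3, "ward");
        System.out.println(index.search("wor*")); // prints [1, 2]
    }
}
```

The price is a larger index: each word contributes one extra term per
character. A pattern like "w?rd" would need a different tokenization
scheme, which this sketch does not attempt.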

> Since some people may want to index characters like those used in
> wildcards, these could be escaped (or those people could keep using
> the standard search classes available today). I'm not entirely
> sure which part of Lucene does the actual access to the terms and
> position vectors, but if it could be subclassed or cloned, and then
> modified to honor wildcards or even RegEx, that would bring Lucene to
> new heights.

There are regular expression queries in the regex contrib module;
however, these work by rewriting to terms that are actually indexed.
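
To make "rewriting to indexed terms" concrete, here is a minimal sketch
that uses only the JDK. It is not the Lucene API: the term dictionary is
modelled as a sorted set of strings, and the class RewriteSketch and its
methods are hypothetical names. A prefix can seek into the sorted terms
and stop at the first non-matching one (the "Foo*" until "Fop" idea from
earlier in the thread), while a general regular expression has to be
tested against every term:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Lucene API: "rewriting" here means expanding
// a pattern into the list of indexed terms that match it.
class RewriteSketch {
    // Prefix expansion: seek to the prefix, then walk the sorted terms until
    // one no longer starts with it.
    static List<String> expandPrefix(TreeSet<String> terms, String prefix) {
        List<String> out = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can match
            }
            out.add(term);
        }
        return out;
    }

    // Regex expansion: in general there is no seek point, so test every term.
    static List<String> expandRegex(TreeSet<String> terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "word", "ward", "world", "foo", "foobar", "fool", "fop"));
        System.out.println(expandPrefix(terms, "foo")); // prints [foo, foobar, fool]
        System.out.println(expandRegex(terms, "w.rd")); // prints [ward, word]
    }
}
```

In this model, the expanded list is what would become the clauses of the
OR query, so its length is what runs into a clause limit such as the
MaxClauseCount mentioned in the thread.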

> Unless, again, there is a specific reason why this can't 
> be done.

There is no specific reason why it cannot be done; one only needs
to provide the corresponding tokenizer to be used at indexing time.

Kind regards,
Paul Elschot


>
> Itamar.
>
> -----Original Message-----
> From: Paul Elschot [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 08, 2008 1:56 AM
> To: java-user@lucene.apache.org
> Subject: Re: Why Lucene has to rewrite queries prior to actual
> searching?
>
> Itamar,
>
> Have a look here:
> http://lucene.apache.org/java/2_3_1/scoring.html
>
> Regards,
> Paul Elschot
>
> On Tuesday 08 April 2008 00:34:48, Itamar Syn-Hershko wrote:
> > Paul and John,
> >
> > Thanks for your quick reply.
> >
> > The problem with query rewriting is the aforementioned
> > MaxClauseException. Instead of inflating the query and passing a
> > deterministic list of terms to the actual search routine, Lucene
> > could access the vectors in the index using some sort of
> > filter. So, for example, if it knows how to access "Foobar" by its
> > name in the index, why can't it take "Foo*" and just get all the
> > vectors until "Fop" is met (for example)? Why does it have to get
> > a deterministic list of terms?
> >
> > I will take a look at the Scorer - can you describe in short what
> > exactly it does and where and when it is being called?
> >
> > I don't get John's comment though - Query::rewrite is called
> > prior to the actual searching (through QueryParser), so how can it
> > use "information gathered from IndexReader at search time"?
> >
> > Itamar.
> >
> > -----Original Message-----
> > From: Paul Elschot [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, April 08, 2008 12:57 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Why Lucene has to rewrite queries prior to actual
> > searching?
> >
> > Itamar,
> >
> > Query rewrite replaces wildcards with terms available from the
> > index. Usually that involves replacing a wildcard with a
> > BooleanQuery that is effectively an OR over the available terms,
> > using a flat coordination factor, i.e. it does not matter how many
> > of the available terms actually match a document, as long as at
> > least one matches.
> >
> > For the required query parts (AND like), Scorer.skipTo() is used,
> > and that could well be the filter mechanism you are referring to;
> > have a look at the javadocs of Scorer, and, if necessary, at the
> > actual code of ConjunctionScorer.
> >
> > Regards,
> > Paul Elschot
> >
> > On Monday 07 April 2008 23:13:09, Itamar Syn-Hershko wrote:
> > > Hi all,
> > >
> > > Can one of the experts here explain why Lucene has to get a
> > > "rewritten" query for the Searcher - so that Phrase or Wildcard
> > > queries have to rewrite themselves into a "primitive" query, which
> > > is then passed to Lucene to look for? I'm probably not very
> > > familiar with the internals of Lucene, but I'd imagine that if
> > > you can inflate a query using wildcards via xxxxQuery
> > > subclassing, you could as easily (?) have some sort of Filter
> > > mechanism during the search, so that Lucene retrieves the
> > > Position vectors for all the terms that pass that filter, instead
> > > of retrieving only the position data for deterministic terms
> > > (with no wildcards etc.). If that were possible somehow, it
> > > could greatly increase the searchability of Lucene indices by
> > > allowing RegEx and similar (without rewriting and getting the
> > > dreaded MaxClauseCount error).
> > >
> > > Would love to hear some insights on this one.
> > >
> > > Itamar.



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
