Hi all,

I am also cross-posting my reply to the developer list because I think some of
my arguments belong there.

I was thinking about somehow extending the PhraseQuery analyzer in
order to better handle wildcard expansion.

Sanyi's idea to "optimize" the expansion of the terms to include just the ones
meaningful for the subset of documents found by the other parts of the
query is intriguing, but probably very difficult to implement.

My idea is probably easier to implement; even if the final result
is not 100% exact, it could be good enough. The idea is
to let the developer handle the boolean query limit in one of the
following ways:
- leave the current implementation, raising an exception;
- handle the exception and limit the boolean query to the first 1024
(or whatever the limit is) terms;
- select, among the possible terms, only the 1024 (or whatever
the limit is) most meaningful ones, leaving out all the others.

I had this idea while watching how some terms were expanded against our
index. Many of them were clearly misspelled words, filenames, or other
kinds of irrelevant info that was not easy to remove before indexing.

This solution changes the returned results in a subtle way (even if only
in cases where the current implementation throws an exception), so the
developer should be very careful to report to her users that the query
could have left out some documents.

How "meaningful" a term is, in this context, could be taken as
proportional to the number of documents containing that term in the
whole index, as a first approximation.
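
To make the three options above a bit more concrete, here is a rough
sketch in plain Java. It does not touch the Lucene classes at all:
WildcardExpansionLimiter, Policy and limit() are names I made up for
illustration, and the map of terms to document counts is a hypothetical
input that the caller would have to fill (in Lucene I would expect
IndexReader.docFreq(Term) to provide the counts, but I have not wired
that in).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: illustrates the three policies only.
final class WildcardExpansionLimiter {

    /** What to do when a wildcard expands to more terms than the clause limit. */
    enum Policy { FAIL, KEEP_FIRST, KEEP_MOST_FREQUENT }

    /**
     * @param docFreqByTerm expanded terms in enumeration order (use a
     *                      LinkedHashMap or TreeMap), each mapped to the
     *                      number of documents containing it
     * @param maxClauses    the clause limit, e.g. 1024
     * @return the terms that would become clauses of the boolean query
     */
    static List<String> limit(Map<String, Integer> docFreqByTerm,
                              int maxClauses, Policy policy) {
        List<String> terms = new ArrayList<>(docFreqByTerm.keySet());
        if (terms.size() <= maxClauses) {
            return terms; // under the limit: nothing is dropped
        }
        switch (policy) {
            case FAIL:
                // Stand-in for the current behaviour (an exception).
                throw new IllegalStateException(
                        "too many clauses: " + terms.size());
            case KEEP_FIRST:
                return new ArrayList<>(terms.subList(0, maxClauses));
            case KEEP_MOST_FREQUENT:
                // Highest document frequency first; List.sort is stable,
                // so ties keep their enumeration order.
                terms.sort((a, b) -> Integer.compare(
                        docFreqByTerm.get(b), docFreqByTerm.get(a)));
                return new ArrayList<>(terms.subList(0, maxClauses));
            default:
                throw new AssertionError(policy);
        }
    }
}
```

With KEEP_MOST_FREQUENT, a junk term that appears in a single document
(a stray filename, say) is the first to be dropped, which is the effect
I am after. The caller still knows that something was cut whenever the
map holds more entries than the returned list, so it can warn the user.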

Does this idea sound interesting to any of you?

Regards,

Giulio Cesare Solaroli



On Thu, 11 Nov 2004 11:57:32 -0800 (PST), Sanyi <[EMAIL PROTECTED]> wrote:
> Yes, I understand all of this, but I don't want to set it to MaxInt, since it
> can easily lead to (even accidental) DoS attacks.
> 
> What I'm saying is that there is no reason for the optimizer to expand wild*
> to more than 1024 variations when I search for "somerareword AND wild*", since
> somerareword is only present in, let's say, 100 documents, so wild* should only
> expand to words beginning with "wild" in those 100 documents, then it should
> work fine with the default 1024 clause limit.
> 
> But it doesn't, so I can choose between unusable queries or accidental DoS
> attacks.
> 
> 
> 
> --- Will Allen <[EMAIL PROTECTED]> wrote:
> 
> > Any wildcard search will automatically expand your query to the number of
> > terms it finds in the index that suit the wildcard.
> >
> > For example:
> >
> > wild*, would become wild OR wilderness OR wildman etc for each of the terms
> > that exist in your index.
> >
> > It is because of this, that you quickly reach the 1024 limit of clauses.  I
> > automatically set it to max int with the following line:
> >
> > BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
> >
> >
> > -----Original Message-----
> > From: Sanyi [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, November 11, 2004 6:46 AM
> > To: [EMAIL PROTECTED]
> > Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
> >
> >
> > Hi!
> >
> > First of all, I've read about BooleanQuery$TooManyClauses, so I know that
> > it has a 1024 Clauses limit by default which is good enough for me, but I
> > still think it works strangely.
> >
> > Example:
> > I have an index with about 20 million documents.
> > Let's say that there are about 3000 variants in the entire document set of
> > this word mask: cab*
> > Let's say that about 500 documents contain the word: spectrum
> > Now, when I search for "cab* AND spectrum", I don't expect it to throw an
> > exception.
> > It should first restrict the search to the 500 documents containing the
> > word "spectrum", then it should collect the variants of "cab*" within these
> > documents, which turns out to be two or three variants of "cab*" (cable,
> > cables, maybe some more), and the search should return, let's say, 10
> > documents.
> >
> > Similar example: When I search for "cab* AND nonexistingword" it still
> > throws a TooManyClauses exception instead of saying "No results", since
> > there is no "nonexistingword" in my document set, so it doesn't even have
> > to start collecting the variations of "cab*".
> >
> > Is there any patch for this issue?
> > Thank you for your time!
> >
> > Sanyi
> > (I'm using: lucene 1.4.2)
> >
> > p.s.: Sorry for re-sending this message, I was first sending it as an
> > accidental reply to a wrong thread..
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>

