Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-13 Thread Sanyi
> - leave the current implementation, raising an exception;
> - handle the exception and limit the boolean query to the first 1024
>   (or whatever the limit is) terms;
> - select, among the possible terms, only the first 1024 (or whatever
>   the limit is) most meaningful ones, leaving out all the others.

I like this idea, and I would refine it like this:
I'd also create a default rule, so that people who are happy with the
default behavior don't have to handle exceptions:

Keep and search for only the 1024 longest fragments. This throws away
a, an, at, and, add, etc., but automatically keeps 1024 variations like
alpha, alfa, advanced, automatical, etc.
So it automatically lowers the search overhead and still searches fine
without throwing exceptions.
(For people who prefer the widest search range and do not care about the
huge overhead, we could leave a boolean switch for keeping the shortest
fragments instead of the longest.)
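
A minimal sketch of the rule described above, in plain Java with no Lucene dependency. TermTrimmer and its parameters are hypothetical names made up for illustration; the input list stands in for the terms a wildcard expanded to.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not part of Lucene: trims an expanded term list to a clause limit. */
final class TermTrimmer {

    /**
     * Returns at most {@code limit} terms, preferring the longest ones
     * (or the shortest ones when {@code keepLongest} is false).
     */
    static List<String> trim(List<String> terms, int limit, boolean keepLongest) {
        List<String> sorted = new ArrayList<>(terms);
        Comparator<String> byLength = Comparator.comparingInt(String::length);
        // The sort is stable, so terms of equal length keep their input order.
        sorted.sort(keepLongest ? byLength.reversed() : byLength);
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }
}
```

With the terms a, an, alpha, advanced, at and a limit of 2, the default keeps advanced and alpha; with the switch flipped it keeps a and an.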



__ 
Do you Yahoo!? 
Check out the new Yahoo! Front Page. 
www.yahoo.com 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-13 Thread Paul Elschot
On Saturday 13 November 2004 09:16, Sanyi wrote:
> > - leave the current implementation, raising an exception;
> > - handle the exception and limit the boolean query to the first 1024
> >   (or whatever the limit is) terms;
> > - select, among the possible terms, only the first 1024 (or whatever
> >   the limit is) most meaningful ones, leaving out all the others.
>
> I like this idea, and I would refine it like this:
> I'd also create a default rule, so that people who are happy with the
> default behavior don't have to handle exceptions:
>
> Keep and search for only the 1024 longest fragments. This throws away
> a, an, at, and, add, etc., but automatically keeps 1024 variations like
> alpha, alfa, advanced, automatical, etc.

Wouldn't it be counterintuitive to use only the longest matches
for truncations?
To get only longer matches, one can also use queries with
multiple ? characters, each of which matches exactly one character.

I think it would be better to encourage users to use longer,
and maybe also more, prefixes. This gives more precise results
and is more efficient to execute.

Regards,
Paul Elschot





Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Paul Elschot
On Friday 12 November 2004 07:57, Sanyi wrote:
> > That's the point: there is no query optimizer in Lucene.
>
> Sorry, I'm not very familiar with Lucene's internal classes; I'm just
> giving you the viewpoint of a user. My users aren't technicians, so
> answers like yours won't make them happy.
> They will only see that I randomly don't allow them to search (with the
> 1024 limit). They won't understand why I am displaying "Please restrict
> your search a bit more.." when they've just searched for
> "dodge AND vip*" and only a few documents match these criteria.
>
> So, is setting the max. clause limit to MaxInt the only way to let them
> search happily?!

The problem is that there is a lot of freedom in choosing a query, but there
is a limited amount of resources available to search each query.

It is normally possible to reduce the number of such complaints a lot
by imposing a minimum prefix length and, e.g., doubling or tripling the
maximum number of clauses.

This reduces the freedom of the users because their queries
must be (a bit) more specific. The actual tradeoff depends on the user
requirements and the time and memory available on the server,
so the users get what they pay for.

Imposing a minimum prefix length can be done by overriding the method
in QueryParser that provides a prefix query.
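
A minimal sketch of the minimum-prefix-length check itself, in plain Java. PrefixPolicy and its method names are hypothetical. In a real application the check would presumably be called from a QueryParser subclass, inside the overridden method that builds prefix queries (getPrefixQuery in the versions I know of; that is an assumption to verify against the Lucene version in use).

```java
/** Hypothetical policy class, not part of Lucene: rejects prefixes that are too short. */
final class PrefixPolicy {

    /** True if the prefix (the text before the trailing '*') is long enough. */
    static boolean isAllowed(String prefix, int minLength) {
        return prefix != null && prefix.length() >= minLength;
    }

    /** Returns the prefix unchanged, or throws if it is shorter than minLength. */
    static String requireAllowed(String prefix, int minLength) {
        if (!isAllowed(prefix, minLength)) {
            throw new IllegalArgumentException(
                "Prefix '" + prefix + "' is shorter than " + minLength + " characters");
        }
        return prefix;
    }
}
```

The exception message can then be turned into a specific hint for the user ("please type at least N characters before the *") instead of a generic failure.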

Regards,
Paul Elschot






Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Sanyi
> It is normally possible to reduce the number of such complaints a lot
> by imposing a minimum prefix length

I've already limited it to a minimum of 5 characters (abcde*).
Even so, I can easily find (on the first try) cases where a search runs
for minutes, while other 5-character partial words are searched in a
second.
So, this is not a solution at all.

> and e.g. doubling or tripling the max. nr. of clauses.

This is the only useful thing I could do. The other way I've found is
similar: removing the limit on the number of clauses, but limiting the
memory given to Java.
It'll then throw an exception if things get too hard for the searcher.

Anyway, this avoids DoS attacks, but results in a very poor user
interface and search ability.
For example, "rareword AND commonfragment*" would still refuse to work.
I won't be able to explain it to my users, since they don't need my
technical reasons. They'll only notice that "dodge AND vip*" fails to
search instead of returning 1000 documents.

If I remove every limit and don't care about possible DoS attacks, it is
still poor.
It'll search for "dodge AND vip*" for two minutes, just because vip* is
too common in the entire document set.
It doesn't matter that dodge is pretty rare and we're AND-ing it with vip*.







Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Giulio Cesare Solaroli
Hi all,

I am cross-posting my reply to the developer list as well, because I think
some of my arguments belong there.

I was thinking about somehow extending the PhraseQuery analyzer in
order to handle wildcard expansion better.

Sanyi's idea of optimizing the expansion so that it includes only the
terms that are meaningful for the subset of documents found by the other
parts of the query is intriguing, but probably very difficult to implement.

My idea would probably be easier to implement; even if the final result
is not 100% exact, it could be good enough. The idea is to let the
developer handle the boolean query limit in one of the following ways:
- leave the current implementation, raising an exception;
- handle the exception and limit the boolean query to the first 1024
  (or whatever the limit is) terms;
- select, among the possible terms, only the first 1024 (or whatever
  the limit is) most meaningful ones, leaving out all the others.

I had this idea while watching how some terms were expanded against our
index. Many of them were clearly misspelled words, filenames, or other
kinds of irrelevant information that was not easy to remove before
indexing.

This solution changes the returned results in a subtle way (though only
in cases where the current implementation throws an exception), so the
developer should be very careful to tell her users that the query
could have left out some documents.

As a first approximation, how meaningful a term is in this context could
be taken to be proportional to the number of documents in the whole index
that contain it.
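
The third option could be sketched roughly as follows, in plain Java with no Lucene dependency. MeaningfulTerms is a hypothetical name, and the map of document frequencies is assumed to have been collected elsewhere (for example while enumerating the terms that match the wildcard).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: keeps the terms found in the most documents. */
final class MeaningfulTerms {

    /** Returns at most {@code limit} terms, highest document frequency first. */
    static List<String> topByDocFreq(Map<String, Integer> docFreq, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreq.entrySet());
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            // Break ties alphabetically so the result is deterministic.
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (result.size() == limit) {
                break;
            }
            result.add(entry.getKey());
        }
        return result;
    }
}
```

Terms that occur in only one or two documents (typos, filenames) fall off the end of the list first, which is the effect described above.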

Does this idea sound interesting to any of you?

Regards,

Giulio Cesare Solaroli



On Thu, 11 Nov 2004 11:57:32 -0800 (PST), Sanyi [EMAIL PROTECTED] wrote:
> Yes, I understand all of this, but I don't want to set it to MaxInt,
> since it can easily lead to (even accidental) DoS attacks.
>
> What I'm saying is that there is no reason for the optimizer to expand
> wild* to more than 1024 variations when I search for
> "somerareword AND wild*". Since somerareword is only present in, let's
> say, 100 documents, wild* should only expand to words beginning with
> wild in those 100 documents, and then it would work fine with the
> default 1024 clause limit.
>
> But it doesn't, so I can choose between unusable queries or accidental
> DoS attacks.
 
 
 
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Luke Francl
On Thu, 2004-11-11 at 14:48, Daniel Naber wrote:
> On Thursday 11 November 2004 20:57, Sanyi wrote:
>
> > What I'm saying is that there is no reason for the optimizer to expand
> > wild* to more than 1024 variations
>
> That's the point: there is no query optimizer in Lucene.

Would it be possible to write one? I would be very interested in this
feature. 

I poked around in the index and search packages today to see if it could
be done. I think it would take a big change in the Query.rewrite and
related code in the IndexReaders to make the results of the required and
prohibited parts of the query available. 

Again, I don't know if that's even possible. But it would be a great
feature.

Luke





Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Daniel Naber
On Friday 12 November 2004 21:28, Luke Francl wrote:

> > That's the point: there is no query optimizer in Lucene.
>
> Would it be possible to write one? I would be very interested in this
> feature.

There are two different issues. First, reorder the query so that the
terms with the fewest matches appear first, because search stops as soon
as the first term with 0 matches occurs. There is probably a
not-so-difficult implementation for that, but I guess its overhead would
cost more time than it saves.
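
A rough sketch of that reordering idea, in plain Java with no Lucene dependency. ClauseOrdering is a hypothetical name, the clauses are assumed to be required (AND-ed) terms, and the document frequencies are assumed to have been looked up already.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: orders required terms rarest first. */
final class ClauseOrdering {

    /**
     * Returns the required terms ordered by ascending document frequency,
     * or an empty list if any term matches no document at all, in which
     * case the whole conjunction cannot match and nothing needs to be run.
     */
    static List<String> plan(List<String> requiredTerms, Map<String, Integer> docFreq) {
        List<String> ordered = new ArrayList<>(requiredTerms);
        for (String term : ordered) {
            if (docFreq.getOrDefault(term, 0) == 0) {
                return Collections.emptyList();
            }
        }
        ordered.sort((a, b) -> Integer.compare(docFreq.get(a), docFreq.get(b)));
        return ordered;
    }
}
```

The frequency lookups are the overhead mentioned above: they are paid on every query, whether or not the reordering ends up saving anything.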

The other thing is that prefix queries get expanded first, and only then
does the search happen. The TooManyClauses exception is thrown while
expanding the query, not during the search. I'm not sure, but I think
that's difficult to change, at least in a clean way.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-12 Thread Luke Francl
On Fri, 2004-11-12 at 14:52, Daniel Naber wrote:

> There are two different issues. First, reorder the query so that the
> terms with the fewest matches appear first, because search stops as soon
> as the first term with 0 matches occurs. There is probably a
> not-so-difficult implementation for that, but I guess its overhead would
> cost more time than it saves.

It could be done only with searches that have expansions (RangeQuery,
WildcardQuery, etc) to prevent unnecessary work...

> The other thing is that prefix queries get expanded first, and only then
> does the search happen. The TooManyClauses exception is thrown while
> expanding the query, not during the search. I'm not sure, but I think
> that's difficult to change, at least in a clean way.

Expansion (and thus TooManyClauses) happens during Query.weight(), which
is right before the search. Maybe after it's rewritten, the query could be
tested for instanceof BooleanQuery, and then checked to see whether
BooleanQuery.getClauses().length > BooleanQuery.maxClauseCount?






Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Sanyi
Hi!

First of all, I've read about BooleanQuery$TooManyClauses, so I know that
it has a limit of 1024 clauses by default, which is good enough for me,
but I still think it behaves strangely.

Example:
I have an index with about 20 million documents.
Let's say that there are about 3000 variants of the word mask cab* in the
entire document set.
Let's say that about 500 documents contain the word spectrum.
Now, when I search for "cab* AND spectrum", I don't expect it to throw an
exception.
It should first restrict the search to the 500 documents containing the
word spectrum, then collect the variants of cab* within these documents,
which turns out to be two or three variants (cable, cables, maybe some
more), and the search should return, let's say, 10 documents.

Similar example: when I search for "cab* AND nonexistingword", it still
throws a TooManyClauses exception instead of saying "No results". Since
there is no nonexistingword in my document set, it doesn't even have to
start collecting the variations of cab*.
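
The behavior asked for above can be illustrated on a toy in-memory index, in plain Java with no Lucene dependency. RestrictedExpansion is a hypothetical name, and the postings map (term to set of document numbers, sorted in natural String order) is an assumption made for the example, not how Lucene stores its index.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical toy model, not Lucene code: expands a prefix only within matching documents. */
final class RestrictedExpansion {

    /**
     * Returns the terms starting with {@code prefix} that occur in at least
     * one document containing {@code requiredTerm}. The postings map must
     * use natural String ordering.
     */
    static SortedSet<String> expandWithin(SortedMap<String, Set<Integer>> postings,
                                          String prefix, String requiredTerm) {
        Set<Integer> candidates =
            postings.getOrDefault(requiredTerm, Collections.<Integer>emptySet());
        SortedSet<String> variants = new TreeSet<>();
        if (candidates.isEmpty()) {
            return variants; // required term is absent: no expansion needed at all
        }
        // Terms sharing the prefix are contiguous in the sorted map.
        for (Map.Entry<String, Set<Integer>> entry : postings.tailMap(prefix).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break;
            }
            if (!Collections.disjoint(entry.getValue(), candidates)) {
                variants.add(entry.getKey());
            }
        }
        return variants;
    }
}
```

Note that even this sketch still visits every term matching the prefix and compares its postings against the candidate documents, so the resulting clause list is short but producing it is not free.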

Is there any patch for this issue?
Thank you for your time!

Sanyi
(I'm using: lucene 1.4.2)

p.s.: Sorry for re-sending this message; I first sent it as an
accidental reply to the wrong thread.






RE: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Will Allen
Any wildcard search automatically expands your query into all the terms
in the index that match the wildcard.

For example:

wild* would become "wild OR wilderness OR wildman" etc., with one clause
for each matching term that exists in your index.

Because of this, you quickly reach the limit of 1024 clauses. I set it
to max int with the following line:

BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
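
To illustrate the expansion described above, here is a toy model in plain Java with no Lucene dependency. PrefixExpander and its nested exception are hypothetical stand-ins (the exception mimics the role of BooleanQuery's TooManyClauses but is not that class), and the sorted set of strings stands in for the index's term dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Hypothetical toy model, not Lucene code: expands a prefix into one clause per term. */
final class PrefixExpander {

    /** Stand-in for the exception Lucene throws when the clause limit is exceeded. */
    static final class TooManyClausesException extends RuntimeException {
        TooManyClausesException(String message) {
            super(message);
        }
    }

    /**
     * Collects every index term starting with {@code prefix}; throws once
     * more than {@code maxClauses} terms match. The set must use natural
     * String ordering.
     */
    static List<String> expand(SortedSet<String> indexTerms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // Terms sharing the prefix are contiguous in the sorted set.
        for (String term : indexTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (clauses.size() >= maxClauses) {
                throw new TooManyClausesException(
                    prefix + "* matches more than " + maxClauses + " terms");
            }
            clauses.add(term);
        }
        return clauses;
    }
}
```

The expansion looks only at the term dictionary, never at the other clauses of the query, which is why the limit can be hit even when the final result would contain just a handful of documents.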





RE: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Sanyi
Yes, I understand all of this, but I don't want to set it to MaxInt,
since it can easily lead to (even accidental) DoS attacks.

What I'm saying is that there is no reason for the optimizer to expand
wild* to more than 1024 variations when I search for
"somerareword AND wild*". Since somerareword is only present in, let's
say, 100 documents, wild* should only expand to words beginning with
wild in those 100 documents, and then it would work fine with the
default 1024 clause limit.

But it doesn't, so I can choose between unusable queries or accidental
DoS attacks.




Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Daniel Naber
On Thursday 11 November 2004 20:57, Sanyi wrote:

> What I'm saying is that there is no reason for the optimizer to expand
> wild* to more than 1024 variations

That's the point: there is no query optimizer in Lucene.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-11 Thread Sanyi
> That's the point: there is no query optimizer in Lucene.

Sorry, I'm not very familiar with Lucene's internal classes; I'm just
giving you the viewpoint of a user. My users aren't technicians, so
answers like yours won't make them happy.
They will only see that I randomly don't allow them to search (with the
1024 limit). They won't understand why I am displaying "Please restrict
your search a bit more.." when they've just searched for
"dodge AND vip*" and only a few documents match these criteria.

So, is setting the max. clause limit to MaxInt the only way to let them
search happily?!






