Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Saturday 13 November 2004 09:16, Sanyi wrote:
> > - leave the current implementation, raising an exception;
> > - handle the exception and limit the boolean query to the first 1024
> >   (or what ever the limit is) terms;
> > - select, between the possible terms, only the first 1024 (or what
> >   ever the limit is) more meaningful ones, leaving out all the others.
>
> I like this idea and I would finalize to myself like this:
> I'd also create a default rule for that to avoid handling exceptions for
> people who're happy with the default behavior:
>
> Keep and search for only the longest 1024 fragments, so it'll throw
> a,an,at,and,add,etc.., but it'll automatically keep 1024 variations like
> alpha,alfa,advanced,automatical,etc..

Wouldn't it be counterintuitive to use only the longest matches for truncations? To get only longer matches, one can also use queries with multiple ? characters, each matching exactly one character.

I think it would be better to encourage the users to use longer, and maybe also more, prefixes. This gives more precise results and is more efficient to execute.

Regards,
Paul Elschot

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
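To illustrate the point about ? characters, here is a minimal sketch in plain Java. It does not use Lucene; the class WildcardSketch and its methods are hypothetical names made up for this example. It assumes '?' means "exactly one character" and '*' means "any run of characters", as described above, so a pattern like a???* can only match terms of four or more characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Lucene: matches terms against a simple wildcard pattern. */
class WildcardSketch {

    /** Translates '?' to "exactly one character" and '*' to "any run of characters". */
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');
            } else if (c == '*') {
                sb.append(".*");
            } else {
                // Quote everything else so it is matched literally.
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** Returns the terms that match the whole wildcard pattern, in input order. */
    static List<String> matching(String wildcard, List<String> terms) {
        Pattern p = toRegex(wildcard);
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (p.matcher(t).matches()) {
                out.add(t);
            }
        }
        return out;
    }
}
```

With the terms a, an, and, alfa, alpha, advanced, the pattern a* matches all six, while a???* keeps only alfa, alpha and advanced. The short terms are excluded by the query itself rather than by a hidden rule.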
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
> - leave the current implementation, raising an exception;
> - handle the exception and limit the boolean query to the first 1024
>   (or what ever the limit is) terms;
> - select, between the possible terms, only the first 1024 (or what
>   ever the limit is) more meaningful ones, leaving out all the others.

I like this idea, and I would finalize it for myself like this: I'd also create a default rule, so that people who are happy with the default behavior don't have to handle exceptions.

Keep and search for only the longest 1024 fragments. That drops a, an, at, and, add, etc., but automatically keeps 1024 variations like alpha, alfa, advanced, automatical, etc.

So it would automatically lower the search overhead and still search fine without throwing exceptions.

(For people who prefer the widest search range and do not care about the huge overhead, we could provide a boolean switch for keeping the shortest fragments instead of the longest.)
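The default rule proposed above could look roughly like the following sketch. It is plain Java, not Lucene code; FragmentLimiter, limit and the keepLongest switch are hypothetical names chosen for the example, and the alphabetical tie-break is an assumption the message does not specify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not part of Lucene. */
class FragmentLimiter {

    /**
     * Keeps at most maxClauses expanded terms. With keepLongest == true the longest
     * terms win; otherwise the shortest ones do. Ties are broken alphabetically
     * (an assumption made only to keep the result deterministic).
     */
    static List<String> limit(List<String> expanded, int maxClauses, boolean keepLongest) {
        Comparator<String> byLength = Comparator.comparingInt(String::length);
        if (keepLongest) {
            byLength = byLength.reversed();
        }
        List<String> sorted = new ArrayList<>(expanded);
        sorted.sort(byLength.thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(sorted.subList(0, Math.min(maxClauses, sorted.size())));
    }
}
```

Calling limit(terms, 1024, true) corresponds to the proposed default, and limit(terms, 1024, false) to the "widest search range" switch.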
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Fri, 2004-11-12 at 14:52, Daniel Naber wrote:
> There are two different issues: first, reorder the query so that those
> terms with less matches appear first, because as soon as the first term
> with 0 matches occurs, search stops. There will probably be a
> not-so-difficult implementation for that, but this will have more overhead
> than it saves time I guess.

It could be done only for searches that have expansions (RangeQuery, WildcardQuery, etc.) to prevent unnecessary work...

> The other thing is that prefix queries get expanded first, then the search
> happens. And that TooManyClauses exception happens when expanding the
> query, not during search. I'm not sure, but I think that's difficult to
> change, at least in a clean way.

Expansion (and thus TooManyClauses) happens during Query.weight(), which is right before the search. Maybe after it's rewritten, the query could be tested for instanceof BooleanQuery, and then one could check whether BooleanQuery.getClauses().length > BooleanQuery.maxClauseCount?
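The reordering idea quoted above can be sketched without Lucene as follows. RequiredTermPlanner and plan are hypothetical names, and the document frequencies are assumed to be available in a simple map; how to get them cheaply from a real index is exactly the overhead question raised in the thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical planner, not part of Lucene. */
class RequiredTermPlanner {

    /**
     * Orders the required (AND-ed) terms so the rarest comes first.
     * Returns an empty list when any term has no matching documents,
     * because the conjunction cannot match anything in that case.
     */
    static List<String> plan(List<String> requiredTerms, Map<String, Integer> docFreq) {
        List<String> ordered = new ArrayList<>(requiredTerms);
        // Terms missing from the map are treated as having zero matches.
        ordered.sort(Comparator.comparingInt((String t) -> docFreq.getOrDefault(t, 0)));
        if (!ordered.isEmpty() && docFreq.getOrDefault(ordered.get(0), 0) == 0) {
            return new ArrayList<>();
        }
        return ordered;
    }
}
```

An empty plan would let the caller answer "no results" before any wildcard part of the query is expanded.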
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Friday 12 November 2004 21:28, Luke Francl wrote:
> > That's the point: there is no query optimizer in Lucene.
>
> Would it be possible to write one? I would be very interested in this
> feature.

There are two different issues. First, reorder the query so that the terms with fewer matches appear first, because as soon as the first term with 0 matches occurs, the search stops. There is probably a not-so-difficult implementation for that, but I guess it would add more overhead than the time it saves.

The other thing is that prefix queries get expanded first, and then the search happens. The TooManyClauses exception happens when expanding the query, not during search. I'm not sure, but I think that's difficult to change, at least in a clean way.

Regards
 Daniel

--
http://www.danielnaber.de
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Thu, 2004-11-11 at 14:48, Daniel Naber wrote:
> On Thursday 11 November 2004 20:57, Sanyi wrote:
>
> > What I'm saying is that there is no reason for the optimizer to expand
> > wild* to more than 1024 variations
>
> That's the point: there is no query optimizer in Lucene.

Would it be possible to write one? I would be very interested in this feature.

I poked around in the index and search packages today to see if it could be done. I think it would take a big change in Query.rewrite and the related code in the IndexReaders to make the results of the required and prohibited parts of the query available.

Again, I don't know if that's even possible. But it would be a great feature.

Luke
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
Hi all,

I am cross-posting my reply to the developer list as well, because I think some of my arguments belong there.

I was thinking about somehow extending the PhraseQuery analyzer in order to better handle wildcard expansion. Sanyi's idea to "optimize" the expansion of the terms, so that it includes just the ones meaningful for the subset of documents found by the other part of the query, is intriguing but probably very difficult to implement.

My idea will probably be easier to implement; even if the final result is not 100% exact, it could probably be good enough.

The idea is to let the developer handle the boolean query limit in one of the following ways:

- leave the current implementation, raising an exception;
- handle the exception and limit the boolean query to the first 1024 (or whatever the limit is) terms;
- select, among the possible terms, only the 1024 (or whatever the limit is) most meaningful ones, leaving out all the others.

I had this idea while watching how some terms were expanded against our index. Many of them were clearly wrong words, filenames, or other kinds of irrelevant info that was not easy to remove before indexing.

This solution changes the returned results in a subtle way (though only in cases where the current implementation throws an exception), so the developer should be very careful to report to her users that the query could have left out some documents.

How "meaningful" a term is, in this context, could be proportional to the number of documents in the whole index that contain the term, as a first approximation.

Does this idea sound interesting to any of you?

Regards,

Giulio Cesare Solaroli

On Thu, 11 Nov 2004 11:57:32 -0800 (PST), Sanyi <[EMAIL PROTECTED]> wrote:
> Yes, I understand all of this, but I don't want to set it to MaxInt, since
> it can easily lead to (even accidental) DoS attacks.
>
> What I'm saying is that there is no reason for the optimizer to expand
> wild* to more than 1024 variations when I search for "somerareword AND
> wild*", since somerareword is only present in let's say 100 documents, so
> wild* should only expand to words beginning with "wild" in those 100
> documents, then it should work fine with the default 1024 clause limit.
>
> But it doesn't, so I can choose between unuseable queries or accidental
> DoS attacks.
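The third option in the list above (keep only the most meaningful terms, ranked by document frequency as a first approximation) might be sketched like this. It is plain Java rather than Lucene code; MeaningfulTerms and topByDocFreq are hypothetical names, the frequencies are assumed to be given in a map, and the alphabetical tie-break is an assumption added for determinism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene. */
class MeaningfulTerms {

    /**
     * Returns at most maxClauses terms, most frequent first.
     * docFreqByTerm maps each expanded term to the number of documents containing it.
     */
    static List<String> topByDocFreq(Map<String, Integer> docFreqByTerm, int maxClauses) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreqByTerm.entrySet());
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue()); // descending frequency
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (kept.size() == maxClauses) {
                break; // everything after this point is silently left out
            }
            kept.add(e.getKey());
        }
        return kept;
    }
}
```

Terms that occur in a single document, such as stray filenames or misspellings, sort to the end and are the first to be cut. As noted above, the caller would still need to tell users when terms were left out.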
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
> It is normally possible to reduce the numbers of such complaints a lot
> by imposing a minimum prefix length

I've already limited it to a minimum of 5 characters (abcde*). I can still easily find (on the first try) situations where it searches for minutes, while other 5-character partial words search in a second. So this is not a solution at all.

> and eg. doubling or tripling the max. nr. of clauses.

This is the only useful thing I could do, and the other way I've found is similar: unlimiting the number of clauses, but limiting the memory given to Java. It'll then throw an exception if things get too hard for the searcher.

Anyway, this avoids DoS attacks, but results in a very poor user interface and search ability. For example, "rareword AND commonfragment*" would still refuse to work. I won't be able to explain it to my users, since they don't need my technical reasons. They'll only notice that "dodge AND vip*" fails to search instead of returning 1000 documents.

If I unlimit everything and don't care about possible DoS attacks, it is still poor. It'll search for "dodge AND vip*" for two minutes, just because "vip*" is too common in the entire document set. It doesn't matter that "dodge" is pretty rare and we're AND-ing it with "vip*".
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Friday 12 November 2004 07:57, Sanyi wrote:
> > That's the point: there is no query optimizer in Lucene.
>
> Sorry, I'm not very much into Lucene's internal Classes, I'm just telling
> you the viewpoint of a user. You know my users aren't technicians, so
> answers like yours won't make them happy.
> They will only see that I randomly don't allow them to search (with the
> 1024 limit). They won't understand why am I displaying "Please restrict
> your search a bit more.." when they've just searched for "dodge AND vip*"
> and there are only a few documents matching this criteria.
>
> So, is the only way to make them able to search happily by setting the
> max. clause limit to MaxInt?!

The problem is that there is a lot of freedom in choosing a query, but only a limited amount of resources available to execute each query.

It is normally possible to reduce the number of such complaints a lot by imposing a minimum prefix length and, e.g., doubling or tripling the maximum number of clauses. This reduces the freedom of the users, because their queries must be (a bit) more specific. The actual tradeoff depends on the user requirements and on the time and memory available on the server, so the users get what they pay for.

Imposing a minimum prefix length can be done by overriding the method in QueryParser that provides a prefix query.

Regards,
Paul Elschot
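The minimum-prefix-length check itself is simple; a rough sketch in plain Java follows. PrefixPolicy and rejectedTokens are hypothetical names, and the whitespace tokenizing is a simplification of what a real query parser does. In a Lucene application the same length test would presumably sit inside the QueryParser override mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical policy check, not part of Lucene. */
class PrefixPolicy {

    /** True when the token is not a prefix query, or its literal prefix is long enough. */
    static boolean isAllowed(String token, int minPrefixLength) {
        if (!token.endsWith("*")) {
            return true; // plain term or operator, nothing to check
        }
        int prefixLength = token.length() - 1; // characters before the trailing '*'
        return prefixLength >= minPrefixLength;
    }

    /** Returns the whitespace-separated tokens of the query that violate the minimum length. */
    static List<String> rejectedTokens(String query, int minPrefixLength) {
        List<String> rejected = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (!isAllowed(token, minPrefixLength)) {
                rejected.add(token);
            }
        }
        return rejected;
    }
}
```

With a minimum of 5, the query "dodge AND vip*" is rejected because of vip*, while "dodge AND abcde*" passes. Reporting the offending token lets the application tell the user which part of the query to make more specific.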
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
> That's the point: there is no query optimizer in Lucene.

Sorry, I'm not very much into Lucene's internal classes; I'm just telling you the viewpoint of a user. You know my users aren't technicians, so answers like yours won't make them happy.

They will only see that I randomly don't allow them to search (with the 1024 limit). They won't understand why I am displaying "Please restrict your search a bit more.." when they've just searched for "dodge AND vip*" and there are only a few documents matching these criteria.

So, is the only way to let them search happily to set the max. clause limit to MaxInt?!
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Thursday 11 November 2004 20:57, Sanyi wrote:
> What I'm saying is that there is no reason for the optimizer to expand
> wild* to more than 1024 variations

That's the point: there is no query optimizer in Lucene.

Regards
 Daniel

--
http://www.danielnaber.de
RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
Yes, I understand all of this, but I don't want to set it to MaxInt, since that can easily lead to (even accidental) DoS attacks.

What I'm saying is that there is no reason for the optimizer to expand wild* to more than 1024 variations when I search for "somerareword AND wild*". Since somerareword is only present in, let's say, 100 documents, wild* should only expand to words beginning with "wild" in those 100 documents, and then it should work fine with the default 1024-clause limit.

But it doesn't, so I can choose between unusable queries and accidental DoS attacks.

--- Will Allen <[EMAIL PROTECTED]> wrote:
> Any wildcard search will automatically expand your query to the number of
> terms it find in the index that suit the wildcard.
>
> For example:
>
> wild*, would become wild OR wilderness OR wildman etc for each of the
> terms that exist in your index.
>
> It is because of this, that you quickly reach the 1024 limit of clauses.
> I automatically set it to max int with the following line:
>
> BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
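The behavior asked for above (expand wild* only to terms that occur in the documents matching the other clause) can be sketched over a toy inverted index. This is not how Lucene works internally; RestrictedExpansion, expandWithin and the TreeMap-based postings structure are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch over a toy inverted index, not part of Lucene. */
class RestrictedExpansion {

    /**
     * Expands prefix to the terms that occur in at least one candidate document.
     * postings maps each term to the ids of the documents that contain it.
     */
    static List<String> expandWithin(String prefix, Set<Integer> candidateDocs,
                                     TreeMap<String, Set<Integer>> postings) {
        List<String> kept = new ArrayList<>();
        // tailMap starts at the first term >= prefix; terms sharing the prefix are contiguous.
        for (Map.Entry<String, Set<Integer>> e : postings.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            if (!Collections.disjoint(e.getValue(), candidateDocs)) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

Note that this sketch still visits every term that matches the prefix and compares its document list with the candidates. It reduces the number of clauses, but not the enumeration work, which fits the concern in the thread that such an optimization may cost more than it saves.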
RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
Any wildcard search will automatically expand your query to all the terms it finds in the index that match the wildcard.

For example:

wild* would become wild OR wilderness OR wildman etc., for each of the terms that exist in your index.

It is because of this that you quickly reach the 1024-clause limit. I automatically set it to max int with the following line:

    BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );

-----Original Message-----
From: Sanyi [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 11, 2004 6:46 AM
To: [EMAIL PROTECTED]
Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses

Hi!

First of all, I've read about BooleanQuery$TooManyClauses, so I know that it has a 1024-clause limit by default, which is good enough for me, but I still think it works strangely.

Example:
I have an index with about 20 million documents.
Let's say that there are about 3000 variants of this word mask in the entire document set: cab*
Let's say that about 500 documents contain the word: spectrum
Now, when I search for "cab* AND spectrum", I don't expect it to throw an exception.
It should first restrict the search to the 500 documents containing the word "spectrum", then it should collect the variants of "cab*" within these documents, which turns out to be two or three variants of "cab*" (cable, cables, maybe some more), and the search should return, let's say, 10 documents.

Similar example: when I search for "cab* AND nonexistingword", it still throws a TooManyClauses exception instead of saying "No results", even though there is no "nonexistingword" in my document set, so it doesn't even have to start collecting the variations of "cab*".

Is there any patch for this issue?
Thank you for your time!

Sanyi
(I'm using: lucene 1.4.2)

p.s.: Sorry for re-sending this message; I first sent it as an accidental reply to a wrong thread.
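The expansion described in the reply above can be imitated with a sorted set of terms, which may help show why the limit is hit regardless of the other clauses in the query. This is a simplified stand-in, not Lucene's implementation: PrefixExpansion, expand and TooManyClausesSketch are hypothetical names, and the term dictionary is just a TreeSet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of prefix expansion with a clause cap, not part of Lucene. */
class PrefixExpansion {

    /** Stand-in for the exception discussed in the thread. */
    static class TooManyClausesSketch extends RuntimeException {
        TooManyClausesSketch(String message) {
            super(message);
        }
    }

    /** Collects every dictionary term starting with prefix, failing once the cap is exceeded. */
    static List<String> expand(String prefix, TreeSet<String> termDictionary, int maxClauseCount) {
        List<String> clauses = new ArrayList<>();
        for (String term : termDictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            if (clauses.size() >= maxClauseCount) {
                throw new TooManyClausesSketch(
                        "more than " + maxClauseCount + " terms match " + prefix + "*");
            }
            clauses.add(term);
        }
        return clauses;
    }
}
```

Because the expansion looks only at the term dictionary, it fails in the same way whether the other side of the AND is spectrum or nonexistingword, which is the behavior reported in the original message.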