Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Thanks, I will look at the sorting code. Sorting results by date is next on my list. For now I only have a small number of documents, but the collection I am working on is set to grow to over 8 million documents. Another collection we have is 40 million documents or so. From what you say, it seems to me that sorting will not scale once I get to a larger number of documents. I am considering using an SQL back end to implement sorting: bring back the unique IDs from Lucene and then sort in SQL.

Claude

On May 18, 2004, at 11:23 PM, Morus Walter wrote:
> I think it would be worth taking a look at the sorting code. The idea of the sorting code is to keep an array holding the date of each document in memory and to access this array for sorting. Sorting isn't the only thing this array can be used for, though; a range check is another. So you might extend the sorting code with a range selection. There is no code for this in Lucene, and you have to create your own searcher, but it gives you a fast way to search and sort by date.
>
> I did this independently of the new sorting code (I just started a little too early) and it works quite well. The only drawback of this approach (and of the new sorting code) is that it requires an array of field values that must be rebuilt each time the index changes. That shouldn't be a problem for 60,000 documents.
>
> Morus

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Claude Devarenne writes:
> Is there a simpler, easier way to do this?

I think it would be worth taking a look at the sorting code. The idea of the sorting code is to keep an array holding the date of each document in memory and to access this array for sorting. Sorting isn't the only thing this array can be used for, though; a range check is another. So you might extend the sorting code with a range selection. There is no code for this in Lucene, and you have to create your own searcher, but it gives you a fast way to search and sort by date.

I did this independently of the new sorting code (I just started a little too early) and it works quite well. The only drawback of this approach (and of the new sorting code) is that it requires an array of field values that must be rebuilt each time the index changes. That shouldn't be a problem for 60,000 documents.

Morus
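Editor's note: the per-document date array described above could be sketched roughly as follows. This is plain Java with no Lucene dependency, not Morus's actual searcher; the class name DateArrayRange, its methods, and the choice to store each date as a YYYYMMDD integer are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical helper: one date per document id, held in memory. */
final class DateArrayRange {
    // dateByDoc[docId] holds that document's date as a YYYYMMDD integer (assumed layout)
    private final int[] dateByDoc;

    DateArrayRange(int[] dateByDoc) {
        this.dateByDoc = dateByDoc.clone();
    }

    /** Marks every document whose date lies in [from, to], both ends inclusive. */
    BitSet docsInRange(int from, int to) {
        BitSet bits = new BitSet(dateByDoc.length);
        for (int doc = 0; doc < dateByDoc.length; doc++) {
            if (dateByDoc[doc] >= from && dateByDoc[doc] <= to) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Orders the given document ids by date, oldest first, using the same array. */
    int[] sortByDate(int[] docIds) {
        return Arrays.stream(docIds)
                .boxed()
                .sorted((a, b) -> Integer.compare(dateByDoc[a], dateByDoc[b]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        DateArrayRange dates =
                new DateArrayRange(new int[] {19790101, 20040518, 19851231, 19991231});
        // Range check: which documents fall between 1979-01-01 and 1999-12-31?
        System.out.println(dates.docsInRange(19790101, 19991231));
        // Sorting: order a list of hits by date using the same array.
        System.out.println(Arrays.toString(dates.sortByDate(new int[] {3, 2, 0})));
    }
}
```

The range check is a single pass over the array, so its cost depends on the number of documents rather than on the number of distinct dates in the range. As the message notes, the array has to be rebuilt whenever the index changes.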
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
On May 18, 2004, at 3:56 PM, Claude Devarenne wrote:
> Thanks, I'll try that. It would be nice too if I could extend Field (it is a final class) and create a numerical field. Is that not desirable?

It isn't that much more effort to use something like the NumberUtils class listed here: http://wiki.apache.org/jakarta-lucene/SearchNumericalFields

I'm not sure of the pros and cons of making Field extensible, but it really is of marginal benefit: ultimately a Field needs a String, and converting a number to a String in your own code isn't involved. I suppose we could put something like NumberUtils (maybe called NumberField, to be like DateField) in the core to have a built-in solution. We probably ought to go a step further and provide Date -> YYYYMMDD conversion as an additional part of DateField.

Erik
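Editor's note: a minimal sketch of the kind of conversion discussed above, assuming the goal is a string whose lexicographic order matches the numeric or chronological order. This is not the NumberUtils class from the wiki page (which is not reproduced in this thread); the class name SortableStrings, the fixed width of 10 digits, and the restriction to non-negative values are assumptions.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical helper that turns numbers and dates into sortable strings. */
final class SortableStrings {
    // Ten digits are enough for any non-negative int (assumed width for this sketch).
    private static final int WIDTH = 10;

    /** Zero-pads a non-negative number so that string order equals numeric order. */
    static String pad(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", n);
    }

    /** Formats a date as YYYYMMDD, e.g. 2004-05-18 becomes "20040518". */
    static String toYyyyMmDd(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        // Unpadded strings sort "10" before "7"; padded strings keep numeric order.
        System.out.println("7".compareTo("10") > 0);
        System.out.println(pad(7).compareTo(pad(10)) < 0);
        System.out.println(toYyyyMmDd(LocalDate.of(2004, 5, 18)));
    }
}
```

The resulting strings can be stored in an ordinary keyword field, which is why extending Field itself brings little benefit.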
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
> Is there a simpler, easier way to do this?

Yes. I have started implementing a "QuickRangeQuery" class that doesn't have the BooleanQuery limitation but scores every matching document as 1.0. I will see if I can get it finished in the next 24 hours and post back to this thread.

=Matt

PS: I'm not sure about the "QuickRangeQuery" class name... maybe "NormalizedRangeQuery", "RangeQuery2"... *shrug*
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Thanks, I'll try that first and then Ype's suggestion if necessary. I have been shying away from filters, so now I have no excuse ;-)

Claude

On May 18, 2004, at 1:35 PM, Andy Goodell wrote:
> In our application we had a similar problem with non-date ranges, until we realized that we weren't so much searching for the values in the range as restricting the search to that range. We then used an extension of the org.apache.lucene.search.Filter class, and our implementation got much simpler and faster.
>
> - andy g
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
In our application we had a similar problem with non-date ranges, until we realized that we weren't so much searching for the values in the range as restricting the search to that range. We then used an extension of the org.apache.lucene.search.Filter class, and our implementation got much simpler and faster.

- andy g

On Tue, 18 May 2004 10:38:01 -0700, Claude Devarenne <[EMAIL PROTECTED]> wrote:
> Is there a simpler, easier way to do this?
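Editor's note: the idea of restricting a search to a range, rather than expanding the range into one boolean clause per term, can be modelled without Lucene as below. This is only a sketch of the principle, not Andy's Filter subclass and not the Lucene API: the class RangeBitFilter and its TreeMap standing in for the term dictionary are assumptions for illustration.

```java
import java.util.BitSet;
import java.util.TreeMap;

/** Hypothetical model of a range filter over a sorted term dictionary. */
final class RangeBitFilter {
    // term -> ids of the documents containing it, kept in sorted term order
    private final TreeMap<String, int[]> postings = new TreeMap<>();

    void add(String term, int... docIds) {
        postings.put(term, docIds.clone());
    }

    /** Sets one bit per document that has any term in [lower, upper], inclusive. */
    BitSet bits(String lower, String upper, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        for (int[] docs : postings.subMap(lower, true, upper, true).values()) {
            for (int doc : docs) {
                result.set(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RangeBitFilter dateField = new RangeBitFilter();
        dateField.add("19790101", 0);
        dateField.add("19851231", 2);
        dateField.add("20040518", 1, 3);

        // Pretend the main query (without any date clause) matched documents 1 and 2.
        BitSet hits = new BitSet(4);
        hits.set(1);
        hits.set(2);

        // Restrict the hits to the date range by intersecting the two bit sets.
        hits.and(dateField.bits("19790101", "19991231", 4));
        System.out.println(hits);
    }
}
```

However many terms fall inside the range, the result is a single bit set, so no clause limit applies. The trade-off is that the range no longer contributes to the score.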
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Thanks, I'll try that. It would be nice too if I could extend Field (it is a final class) and create a numerical field. Is that not desirable?

Claude

On May 18, 2004, at 12:06 PM, Ype Kingma wrote:
> I wouldn't know of a simpler and easier way, but there is another way to reduce the number of clauses involved in long date ranges. This can be done by indexing not only YYYYMMDD but also YYYYMM and YYYY, and adapting the query range mechanism to use the shorter term whenever possible. (YYY and YYYYMMD might also be useful.)
>
> Kind regards,
> Ype
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
On Tuesday 18 May 2004 19:38, Claude Devarenne wrote:
> Is there a simpler, easier way to do this?

I wouldn't know of a simpler and easier way, but there is another way to reduce the number of clauses involved in long date ranges. This can be done by indexing not only YYYYMMDD but also YYYYMM and YYYY, and adapting the query range mechanism to use the shorter term whenever possible. (YYY and YYYYMMD might also be useful.)

Kind regards,
Ype
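Editor's note: one way to "use the shorter term whenever possible" is to walk the range and emit a YYYY term for each fully covered year, a YYYYMM term for each fully covered month, and YYYYMMDD terms for the remaining days. The sketch below assumes every document was indexed with all three forms; the class name DateRangeTerms and its method are hypothetical, and the optional YYY and YYYYMMD prefixes are left out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that covers a date range with as few terms as it can. */
final class DateRangeTerms {
    /** Returns the terms to OR together for the inclusive range [from, to]. */
    static List<String> terms(LocalDate from, LocalDate to) {
        List<String> out = new ArrayList<>();
        LocalDate cur = from;
        while (!cur.isAfter(to)) {
            LocalDate yearEnd = LocalDate.of(cur.getYear(), 12, 31);
            LocalDate monthEnd = cur.withDayOfMonth(cur.lengthOfMonth());
            if (cur.getDayOfYear() == 1 && !yearEnd.isAfter(to)) {
                // The whole year lies inside the range: one YYYY term covers it.
                out.add(String.format(Locale.ROOT, "%04d", cur.getYear()));
                cur = yearEnd.plusDays(1);
            } else if (cur.getDayOfMonth() == 1 && !monthEnd.isAfter(to)) {
                // The whole month lies inside the range: one YYYYMM term covers it.
                out.add(String.format(Locale.ROOT, "%04d%02d",
                        cur.getYear(), cur.getMonthValue()));
                cur = monthEnd.plusDays(1);
            } else {
                // Partial month at either end of the range: fall back to single days.
                out.add(cur.format(DateTimeFormatter.BASIC_ISO_DATE));
                cur = cur.plusDays(1);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The range from the original question needs one term per year.
        System.out.println(terms(LocalDate.of(1979, 1, 1), LocalDate.of(1999, 12, 31)).size());
        // A range with ragged ends mixes day terms and month terms.
        System.out.println(terms(LocalDate.of(2004, 1, 30), LocalDate.of(2004, 3, 2)));
    }
}
```

For [19790101 TO 19991231] this yields 21 year terms instead of one clause per distinct day in the index. The cost is a larger index, since each document carries three date terms.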
How to handle range queries over large ranges and avoid Too Many Boolean clauses
Hi,

I have over 60,000 documents in my index, which is slightly over 1 GB in size. The documents range from the late seventies up to now. I have indexed dates as a keyword field using a string, because the dates are in YYYYMMDD format. When I do range queries, things are OK as long as I don't exceed the built-in number of boolean clauses, so that's a range of 3 years, e.g. 1979 to 1981. The users are not only doing complex queries but also want to query over long ranges, e.g. [19790101 TO 19991231].

Given these requirements, I am thinking of doing a query without the date range, bringing the unique IDs back from the hits, and then doing a date query in the SQL database I have that contains the same data. Another alternative is to do the query without the date range in Lucene and then sort the results within the range. I still have to learn how to use the new sorting code, and I confess I have not had time to look at it yet.

Is there a simpler, easier way to do this?

Claude