Re: Large scale sorting

2007-04-11 Thread jian chen
I agree. This falls into the area where a technical limit is reached. Time to modify the spec. I have thought about this issue over the past couple of days, and there is really NO silver bullet. If the field is a multi-value field and the distinct field values are not too many, you might reduce memory usage by st

Re: Large scale sorting

2007-04-11 Thread Chris Hostetter
: I'm wondering then if the Sorting infrastructure could be refactored : to allow with some sort of policy/strategy where one can choose a : point where one is not willing to use memory for sorting, but willing ... : To accomplish this would require a substantial change to the : FieldSor

Re: Large scale sorting

2007-04-09 Thread Paul Smith
A memory-saving optimization would be to not load the corresponding String[] in the string index (as discussed previously), but there is currently no way to tell the FieldCache that the strings are unneeded. The String values are only needed for merging results in a MultiSearcher. Yep, which hap

Re: Large scale sorting

2007-04-09 Thread Yonik Seeley
On 4/9/07, jian chen <[EMAIL PROTECTED]> wrote: But, on a higher level, my idea is really just to create an array of integers for each sort field. The array length is NumOfDocs in the index. Each integer corresponds to a displayable string value. For example, if you have a field of different colo
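
The integer-array idea quoted above could be sketched roughly as follows. This is only an illustration of the concept, not Lucene code: the class name OrdinalSortField and its members are hypothetical, and it assumes every document has exactly one non-null value for the field.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Lucene.
final class OrdinalSortField {
    final String[] lookup; // ordinal -> displayable value, in sorted order
    final int[] ords;      // docId -> ordinal; length == number of docs

    // Assumes valuePerDoc[docId] is the single, non-null value of the field.
    OrdinalSortField(String[] valuePerDoc) {
        // Collect the distinct values in sorted order.
        TreeSet<String> distinct = new TreeSet<>(Arrays.asList(valuePerDoc));
        lookup = distinct.toArray(new String[0]);
        ords = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            // The position in the sorted lookup table is the ordinal.
            ords[doc] = Arrays.binarySearch(lookup, valuePerDoc[doc]);
        }
    }

    // Sorting only ever compares integers, never strings.
    int compare(int docA, int docB) {
        return Integer.compare(ords[docA], ords[docB]);
    }

    // The string is needed only when a value has to be displayed.
    String display(int doc) {
        return lookup[ords[doc]];
    }
}
```

With a color field, four documents holding "red", "blue", "green", "blue" would need one int per document plus a three-entry lookup table, rather than one String per document.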

Re: Large scale sorting

2007-04-09 Thread jian chen
Hi, Paul, I think whether to warm up or not needs some benchmarking for the specific application. For the implementation of the sort fields, when I talk about norms in Lucene, I am thinking we could borrow the same implementation as the norms to do it. But, on a higher level, my idea is really just to c

Re: Large scale sorting

2007-04-09 Thread Paul Smith
In our application, we have to sync up the index pretty frequently, the warm-up of the index is killing it. Yep, it speeds up the first sort, but at the cost of making all the others slower (maybe significantly so). That's obviously not ideal but could make use of sorts in larger index

Re: Large scale sorting

2007-04-09 Thread jian chen
Hi, Paul, Thanks for your reply. Regarding your previous email about the need for a disk-based sorting solution, I kind of agree with your points. One incentive for your approach is that we would no longer need to warm up the index when the index is huge. In our application, we have to sync up th

Re: Large scale sorting

2007-04-09 Thread Doug Cutting
Paul Smith wrote: I don't disagree with the premise that it involves substantial I/O and would increase the time taken to sort, and why this approach shouldn't be the default mechanism, but it's not too difficult to build a disk I/O subsystem that can allocate many spindles to service this and

Re: Large scale sorting

2007-04-09 Thread Paul Smith
Now, if we could use integers to represent the sort field values, which is typically the case for most applications, maybe we can afford to have the sort field values stored on disk and do a disk lookup for each document matched? The lookup of the sort field value will be as simple as
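
The snippet above is cut off before it describes the lookup itself. One possible layout, assumed here purely for illustration, is a flat file holding one 4-byte integer per document, so the value for a document sits at byte offset 4 * docId. The class DiskSortValues and its methods are hypothetical and are not part of Lucene.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical sketch: fixed-width sort values on disk, one int per docId.
final class DiskSortValues implements AutoCloseable {
    private final RandomAccessFile file;

    DiskSortValues(File path) {
        try {
            file = new RandomAccessFile(path, "r");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes valuePerDoc[docId] as consecutive 4-byte big-endian ints.
    static void write(File path, int[] valuePerDoc) {
        try (RandomAccessFile out = new RandomAccessFile(path, "rw")) {
            out.setLength(0); // discard any previous contents
            for (int v : valuePerDoc) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One seek and one 4-byte read per matching document.
    int valueFor(int docId) {
        try {
            file.seek(4L * docId);
            return file.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Nothing is held in memory per document with this layout, but every matching document costs a random read, which is the I/O trade-off debated elsewhere in this thread.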

Re: Large scale sorting

2007-04-09 Thread Paul Smith
On 10/04/2007, at 4:18 AM, Doug Cutting wrote: Paul Smith wrote: Disadvantages to this approach: * It's a lot more I/O intensive I think this would be prohibitive. Queries matching more than a few hundred documents will take several seconds to sort, since random disk accesses are requir

Re: Large scale sorting

2007-04-09 Thread jian chen
Hi, Doug, I have been thinking about this as well lately and have some thoughts similar to Paul's approach. Lucene has the norm data for each document field. Conceptually it is a byte array with one byte for each document field. At query time, I think the norm array is loaded into memory the fir

Re: Large scale sorting

2007-04-09 Thread Doug Cutting
Paul Smith wrote: Disadvantages to this approach: * It's a lot more I/O intensive I think this would be prohibitive. Queries matching more than a few hundred documents will take several seconds to sort, since random disk accesses are required per matching document. Such an approach is only

Large scale sorting

2007-04-06 Thread Paul Smith
A discussion on the user list brought my mind to the longer-term scalability issues of Lucene. Lucene is inherently memory efficient, except for sorting, where the inverted nature of the index works against the need to have a value for each object to sort against. I'm h