See my similar request from last March: http://www.nabble.com/FieldSortedHitQueue-enhancement-to9733550.html#a9733550
Peter On Dec 11, 2007 11:54 AM, Nadav Har'El <[EMAIL PROTECTED]> wrote: > On Mon, Dec 10, 2007, Shai Erera wrote about "Performance Improvement for > Search using PriorityQueue": > > Hi > > > > Lucene's PQ implements two methods: put (assumes the PQ has room for the > > object) and insert (checks whether the object can be inserted etc.). The > > implementation of insert() requires the application that uses it to > allocate > > a new object every time it calls insert. Specifically, it cannot reuse > the > > objects that were removed from the PQ. > > I've read this entire thread, and would like to add my comments about > three > independent issues, which I think that can and perhaps should be > considered > separately: > > 1. When Shai wanted to add the insertWithOverflow() method to > PriorityQueue > he couldn't just subclass PriorityQueue in his application, but rather > was forced to modify PriorityQueue itself. Why? just because one field > of that classi - "heap" - was defined "private" instead of "protected". > > Is there a special reason for that? If not, can we make the trivial > change > to make PriorityQueue's fields protected, to allow Shai and others (see > the > next point) to add functionality on top of PriorityQueue? > > 2. PriorityQueue, in addition to being used in about a dozen places inside > Lucene, is also a public class that advanced users often find useful > when > implementing features like new collectors, new queries, and so on. > Unfortunately, my experience exactly matches Shai's: In the two > occasions > where I used a PriorityQueue, I found that I needed such a > insertWithOverflow() method. If this feature is so useful (I can't > believe > that Shai and me are the only ones who found it useful), I think it > would > be nice to add it to Lucene's PriorityQueue, even if it isn't (yet) used > inside Lucene. > > Just to make it sound more interesting, let me give you an example where > I needed (and implemented) an "insertWithOverflow()": I was implementing > a > faceted search capability over Lucene. It calculated a count for each > facet value, and then I used a PriorityQueue to find the 10 best values. > The problem is that I also needed an "other" aggregator, which was > supposed > to aggregate (in various ways) all the facets except the 10 best ones. > For > that, I needed to know which facets dropped off the priorityqueue. > > 3. Finally, Shai asked for this new PriorityQueue.insertWithOverflow() > to be used in TopDocCollector. I have to admit I don't know how much > of a benefit this will be in the "typical" case. But I do know that > there's no such thing as a "typical" case... > I can easily think of a quite typical "worst case" though: Consider a > collection indexed in order of document age (a pretty typical scenario > for a long-running index), and then you do a sorting query > (TopFieldDocCollector), asking it to bring the 10 newest documents. > In that case, each and every document will have a new DocScore created - > which is the worst-case that Shai feared. > It would be nice if Shai or someone else could provide a measurement in > that case. > > P.S. When looking now at PriorityQueue's code, I found two tiny > performance improvements that could be easily made to it - I wonder if > there's any reason not to do them: > > 1. Insert can use heap[1] directly instead of calling top(). After all, > this is done in an if() that already ensures that size>0. > > 2. Regardless, top() could return heap[1] always, without any if(). After > all, the heap array is initialized to all nulls, so when size==0, > heap[1] > is null anyway. > > > > > PriorityQueue change (added insertWithOverflow method) > > > ----------------------------------------------------------------------------------- > > /** > > * insertWithOverflow() is similar to insert(), except its return > value: > > it > > * returns the object (if any) that was dropped off the heap because > it > > was > > * full. This can be the given parameter (in case it is smaller than > the > > * full heap's minimum, and couldn't be added) or another object > that > > was > > * previously the smallest value in the heap and now has been > replaced > > by a > > * larger one. > > */ > > public Object insertWithOverflow(Object element) { > > if (size < maxSize) { > > put(element); > > return null; > > } else if (size > 0 && !lessThan(element, top())) { > > Object ret = heap[1]; > > heap[1] = element; > > adjustTop(); > > return ret; > > } else { > > return element; > > } > > } > > [Very similar to insert(), only it returns the object that was kicked > out of > > the Queue, or null] > > -- > Nadav Har'El | Tuesday, Dec 11 2007, 3 Tevet > 5768 > IBM Haifa Research Lab > |----------------------------------------- > |A professor is one who talks in > someone > http://nadav.harel.org.il |else's sleep. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >