I came across an interesting quirk when using Lucene's PriorityQueue.
It's not a bug per se, but I thought it would be worth logging here in case
anyone else runs into it.

I was using a PriorityQueue to support a GUI that pages through the top terms
in an index, and I noticed that terms were often duplicated from one page to
the next.
I can imagine other PriorityQueue usage scenarios where this could be a problem,
e.g. the same document showing up on more than one page of search results.


In my scenario the PriorityQueue was sized to (pageNumber x pageSize), so
given a pageSize of ten results the PriorityQueue sizes used to serve pages 1, 2
and 3 would be 10, 20 and 30 respectively.
Having accumulated the PriorityQueue, only the last 10 results are shown, i.e. 1
to 10, 11 to 20 and 21 to 30.
To avoid excessive object allocation, an "if currentValue > lowestValue" check
was used, where lowestValue is maintained after each insertion.
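
To make the setup concrete, here is a minimal sketch of the approach, written
against the PriorityQueue API in recent Lucene versions (subclass overriding
lessThan, then insertWithOverflow). The TermDf and TopTermsQueue classes are
names I've invented for illustration, not anything shipped with Lucene:

import java.util.List;
import org.apache.lucene.util.PriorityQueue;

// Hypothetical holder for a term and its document frequency.
class TermDf {
    final String term;
    final int df;
    TermDf(String term, int df) { this.term = term; this.df = df; }
}

// Keeps the N highest-df terms; the smallest kept entry sits at the
// head of the queue, so it is the first candidate for eviction.
class TopTermsQueue extends PriorityQueue<TermDf> {
    TopTermsQueue(int maxSize) { super(maxSize); }
    @Override
    protected boolean lessThan(TermDf a, TermDf b) { return a.df < b.df; }
}

class TopTermsPager {
    // Collects the top (pageNumber * pageSize) terms, guarding inserts
    // with the "if currentValue > lowestValue" check described above.
    static TopTermsQueue collect(List<TermDf> allTerms, int pageNumber, int pageSize) {
        int maxSize = pageNumber * pageSize;
        TopTermsQueue pq = new TopTermsQueue(maxSize);
        int lowestValue = Integer.MIN_VALUE;
        for (TermDf candidate : allTerms) {
            if (pq.size() < maxSize || candidate.df > lowestValue) {
                pq.insertWithOverflow(candidate);
                if (pq.size() == maxSize) {
                    lowestValue = pq.top().df; // smallest df currently kept
                }
            }
        }
        return pq; // caller pops everything and shows only the last pageSize entries
    }
}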

So far, so familiar. This practice is used all over.

Why, then, was this leading to results moving position, and why don't we see
this problem elsewhere?
I'll come to why it happens in a minute, but I suspect the reason we don't
often see it elsewhere is that it only occurs when score values are duplicated.
The score used in query results is typically different for each document, so it
is not usually a problem, but if you are unlucky enough to have a query that
produces identical scores I would anticipate you may see this.
Unfortunately, in my dataset the values are often duplicated, e.g. several "top"
terms that all have a document frequency of 8.

The reason this happens is that the PriorityQueue size is different for each
page request. For the first page request the queue size is small, so the
"quality" of results held in the queue is high, and because we use the "if
currentValue > lowestValue" check, not all items are passed to the insert
method to be ranked. This means a duplicate-scored term towards the end of the
list of terms may not make it into the selected set, because by then the queue
is fully populated with higher-quality items. When the page number increases,
so does the required PriorityQueue size. With this larger queue, by the time we
reach that same duplicate-scored term it now passes the "if currentValue >
lowestValue" check, an insert is performed, and the associated ranking logic
runs.
What this means is that the exact ranking order of the same set of entries will
vary depending on the PriorityQueue size used to analyse them.
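
To see the size dependence in action, here is a small driver reusing the
hypothetical classes above, with a made-up term stream containing several df=8
ties. With the guard in place, the two queue sizes can disagree about which
tied terms are kept and where they rank, which is exactly what shuffles them
across page boundaries:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QueueSizeDemo {
    public static void main(String[] args) {
        // Ten terms, six of which tie on df=8.
        List<TermDf> terms = Arrays.asList(
            new TermDf("alpha", 9), new TermDf("bravo", 8),
            new TermDf("charlie", 8), new TermDf("delta", 8),
            new TermDf("echo", 7), new TermDf("foxtrot", 8),
            new TermDf("golf", 6), new TermDf("hotel", 8),
            new TermDf("india", 5), new TermDf("juliet", 8));

        // Page 1 uses a queue of 3, page 2 a queue of 6. Which of the
        // tied df=8 terms survive, and their pop order, can differ
        // between the two sizes.
        for (int page = 1; page <= 2; page++) {
            TopTermsQueue pq = TopTermsPager.collect(terms, page, 3);
            List<String> kept = new ArrayList<>();
            while (pq.size() > 0) kept.add(pq.pop().term); // ascending df
            System.out.println("page " + page + " queue: " + kept);
        }
    }
}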

I changed the "if currentValue > lowestValue" check to ">=", which I hoped
would force a consistent ranking, independent of queue size. This removed some
of the duplication but did not completely eliminate it, so I can only guess
there is still some peculiarity in the ranking logic given changing queue sizes
and the sequence of inserts/ranks. My current thinking is to make the
PriorityQueue larger than is strictly required for a page, to allow enough
"space" for the queue to rank things more consistently when there are
duplicates.
This is probably OK because I suspect I can live with the odd discrepancy that
might squeak through. Others may have more critical requirements for
eliminating duplicates and will need a more robust strategy (e.g. removing the
"lowestValue" check and inserting absolutely everything).


Cheers,
Mark







