RE: Sorting

Rob Staveley (Tom) Wed, 02 Aug 2006 02:38:49 -0700

> Scorers are by contract expected to score docs in docId order

This was my missing link. Now it makes sense to me to use a buffered
RandomAccessFile and not bother with the presort.

Many thanks, Chris, that was very well explained. 

I'll have a crack at a lean-memory SortComparatorSource implementation,
which uses a buffered RandomAccessFile, as described.

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]]
Sent: 02 August 2006 04:32
To: [email protected]
Subject: RE: Sorting

: I'm with you now. So you do seeks in your comparator. For a large index you
: might as well use java.io.RandomAccessFile for the "array", because there
: would be little value in buffering when the comparator is liable to jump all

yep .. that's what i was getting at ... but i'm not so sure that buffering
won't be useful.  If i'm not mistaken, all Scorers are by contract
expected to score docs in docId order, so when your hits are being collected
for sorting, you should always be moving forward in the file
-- but you may skip ahead a lot when the result set isn't a high percentage
of the total number of docs.
(i may be wrong about all Scorers going in docId order ... if you explicitly
use the 1.4 BooleanScorer you may not get that behavior, but i think
everything else works that way ... perhaps someone else can verify
that)
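
To illustrate the forward-only access pattern, here is a minimal sketch in
plain Java (no Lucene classes). It assumes a hypothetical flat file holding
one 4-byte big-endian int per docId at offset docId * 4. The class name
BufferedValueFile, the window size, and the file layout are all illustrative
assumptions, not part of Lucene. If docIds really do arrive in ascending
order, nearby docs are served from the buffer and each skip ahead costs one
seek plus one refill.

```java
import java.io.Closeable;
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

/** Hypothetical helper: one 4-byte int per docId, read through a sliding buffer. */
final class BufferedValueFile implements Closeable {
    private static final int WIDTH = 4;      // bytes per value (assumed layout)
    private final RandomAccessFile file;
    private final byte[] window;             // buffered slice of the file
    private long windowStart = -1;           // file offset of window[0]; -1 means empty
    private int windowLength = 0;            // number of valid bytes in window
    private int refills = 0;                 // how often we had to seek and read

    BufferedValueFile(String path, int valuesPerWindow) throws IOException {
        if (valuesPerWindow <= 0) {
            throw new IllegalArgumentException("valuesPerWindow must be positive");
        }
        this.file = new RandomAccessFile(path, "r");
        this.window = new byte[valuesPerWindow * WIDTH];
    }

    /** Returns the stored value for docId, refilling the buffer only on a miss. */
    int valueFor(int docId) throws IOException {
        long offset = (long) docId * WIDTH;
        boolean hit = windowStart >= 0
                && offset >= windowStart
                && offset + WIDTH <= windowStart + windowLength;
        if (!hit) {
            fill(offset);
        }
        // ByteBuffer defaults to big-endian, matching DataOutputStream.writeInt.
        return ByteBuffer.wrap(window, (int) (offset - windowStart), WIDTH).getInt();
    }

    /** Seeks to offset and reads as much of the window as the file provides. */
    private void fill(long offset) throws IOException {
        if (offset < 0 || offset + WIDTH > file.length()) {
            throw new EOFException("no value stored at offset " + offset);
        }
        file.seek(offset);
        int n = 0;
        while (n < window.length) {
            int read = file.read(window, n, window.length - n);
            if (read < 0) {
                break;                       // reached end of file
            }
            n += read;
        }
        windowStart = offset;
        windowLength = n;
        refills++;
    }

    int refills() {
        return refills;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

With a window of 8 values, asking for docIds 3, 5, 40, 41 and 90 in that order
would refill three times (at 3, 40 and 90); 5 and 41 come from the buffer.
Whether buffering pays off depends on how dense the result set is compared to
the window size.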

: around the file. This sounds very expensive, though. If you don't open a
: Searcher too frequently, it makes sense (in my muddled mind) to pre-sort to
: reduce the number of seeks. That was the half-baked idea of the third file,
: which essentially orders document IDs.

presort on what exactly, the field you want to sort on?  -- That's
essentially what the TermEnum is.  I'm not sure how having that helps you ...
let's assume you've got some data structure (let's not worry about the
file/ram or TermEnum distinction just yet) containing every document in your
index of 100,000,000 products sorted on the price field, and you've done a
search for "apple" and there are 1,000,000 docIds for matching products
ready to be collected by your new custom Scoring code ... how does the full
list of all docIds sorted by price help you as you are given docIds and have
to decide if that doc is better or worse than the docs you've already
collected?
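
To make the collection step concrete: the collector is handed docIds one at a
time and needs each doc's own sort value to compare it with what it already
holds. Below is a minimal sketch in plain Java, not Lucene's actual hit queue.
The class name TopNByValue and the docId-to-value lookup function are
hypothetical; the lookup stands in for whatever per-docId source (array, file,
cache) supplies the price.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntUnaryOperator;

/** Hypothetical collector: keeps the n docIds with the lowest sort value. */
final class TopNByValue {
    // Orders {docId, value} entries by value, then by docId.
    private static final Comparator<int[]> BEST_FIRST =
            Comparator.<int[]>comparingInt(e -> e[1]).thenComparingInt(e -> e[0]);

    private final int n;
    private final IntUnaryOperator valueOf;    // docId -> sort value, e.g. price in cents
    private final PriorityQueue<int[]> heap;   // worst kept entry sits on top

    TopNByValue(int n, IntUnaryOperator valueOf) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
        this.valueOf = valueOf;
        this.heap = new PriorityQueue<>(BEST_FIRST.reversed());
    }

    /** Called once per matching docId, in whatever order the search yields them. */
    void collect(int docId) {
        int value = valueOf.applyAsInt(docId);  // per-doc lookup, needed for every hit
        if (heap.size() < n) {
            heap.add(new int[] {docId, value});
        } else if (value < heap.peek()[1]) {
            heap.poll();                        // evict the current worst entry
            heap.add(new int[] {docId, value});
        }
    }

    /** Returns the kept docIds, lowest value first (ties by lower docId). */
    List<Integer> bestFirst() {
        List<int[]> entries = new ArrayList<>(heap);
        entries.sort(BEST_FIRST);
        List<Integer> docIds = new ArrayList<>();
        for (int[] entry : entries) {
            docIds.add(entry[0]);
        }
        return docIds;
    }
}
```

Every decision here depends on looking up a value by docId; a list of all
docIds in price order does not answer that lookup directly, which is the point
of the question above.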

: > Bear in mind, there have been some improvements recently to the ability to
: > grab individual stored fields per document....
:
: I can't see anything like that in 2.0. Is that something in the Lucene HEAD
: build?

I guess so ... search the java-dev archives for "lazy field loading" or
"Fieldable" .. that should find some of the discussion about it and the jira
issue with the changes.

-Hoss
