Chris Hostetter wrote:
: What this issue doesn't discuss is what to do with partial results obtained
: when a timeout occurred. As the original poster points out, document lists are
: traversed in the order they were added and not the order of their importance,
: which introduces a bias to partial results in that they reflect results from a
: random sample (which is likely not the most relevant, i.e. there could have
: been more relevant results later in the traversal order).
:
: The answer to this issue is org.apache.nutch.indexer.IndexSorter, which

Skimming this, it doesn't seem like a refactored version that was less
Nutch-specific could make a handy contrib ... but it also seems like there
may be a simpler approach for the (I assume) common case of preferring docs
that were indexed later....

If we eliminate the requirement for *strict* preference of recent
documents and make that a looser desire, then couldn't we do a
pretty good job if we just changed segment merging to reverse the
order of the segments before each merge? It wouldn't be very useful to
start doing this on an index that's already a decent size, but if this was
happening on every merge right from the very beginning, then the most
recent documents would percolate to the front of the index, right?

The only downside I can think of would be that docids would frequently
(and not very predictably) change even if there were no deletions ... but
you'd pay that same penalty with something like Nutch's IndexSorter.

I'm not much of an expert on segment merging ... but with the exception of
docid order I can't think of many reasons why there couldn't be a merger
that reversed the order of the segments.
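
To make the reversed-merge idea concrete, here is a toy sketch in Python. It
is not Lucene code: build_index, merge_reversed, flush_size and merge_factor
are hypothetical names, and the merge policy is a simplified assumption
(equal-level segments are merged whenever merge_factor of them pile up,
followed by one final full merge).

```python
from typing import List

Segment = List[str]  # doc identifiers, in docid order within one segment


def merge_reversed(segments: List[Segment]) -> Segment:
    """Concatenate segments newest-first instead of oldest-first."""
    merged: Segment = []
    for seg in reversed(segments):
        merged.extend(seg)
    return merged


def build_index(docs: List[str], flush_size: int, merge_factor: int) -> Segment:
    """Toy model of indexing with reversed merges (an assumption, not Lucene)."""
    # levels[i] holds segments that have been through i merges, oldest first.
    levels: List[List[Segment]] = [[]]
    for start in range(0, len(docs), flush_size):
        # Flush a new segment; docs inside it keep their insertion order.
        levels[0].append(docs[start:start + flush_size])
        level = 0
        while len(levels[level]) == merge_factor:
            merged = merge_reversed(levels[level])
            levels[level] = []
            if level + 1 == len(levels):
                levels.append([])
            levels[level + 1].append(merged)
            level += 1
    # Final full merge: higher levels hold older docs, so list them first.
    remaining = [seg for lvl in reversed(levels) for seg in lvl]
    return merge_reversed(remaining)


docs = [f"d{i}" for i in range(8)]
print(build_index(docs, flush_size=1, merge_factor=2))
print(build_index(docs, flush_size=2, merge_factor=2))
```

With one document per flushed segment the final order is exactly
newest-first. With larger flushes, documents inside each flushed segment keep
their insertion order, so the preference for recent documents is only loose,
which matches the relaxed requirement described above.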

I think this would be too messy - currently we can be sure of the simple
rule that documents added to the index get incrementally higher docids,
i.e. the higher the docid, the more recent the document. I think it
would be much simpler to write a FilterIndexReader that simply reverses
the order of docids.
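
As a rough illustration of what such a reversing view would have to do, here
is a small Python sketch. It does not use the Lucene API: ReversedDocIdView
and its methods are hypothetical names, and the index is modelled as a plain
list. The mapping it assumes is new_id = max_doc - 1 - old_id.

```python
from typing import List


class ReversedDocIdView:
    """Hypothetical read-only view that exposes docids in reverse order."""

    def __init__(self, docs: List[str]) -> None:
        self._docs = docs  # underlying docs, oldest first

    def max_doc(self) -> int:
        return len(self._docs)

    def _flip(self, doc_id: int) -> int:
        # The same formula maps old -> new and new -> old.
        return self.max_doc() - 1 - doc_id

    def document(self, doc_id: int) -> str:
        # doc_id is in the reversed numbering; translate back before lookup.
        return self._docs[self._flip(doc_id)]

    def remap_postings(self, postings: List[int]) -> List[int]:
        # Postings arrive in ascending old-docid order. Walking them
        # backwards keeps the remapped list ascending in the new numbering.
        return [self._flip(d) for d in reversed(postings)]


view = ReversedDocIdView(["oldest", "older", "newer", "newest"])
print(view.document(0))
print(view.remap_postings([0, 2, 3]))
```

The underlying index keeps the simple "higher docid means more recent" rule;
only the view presented to the searcher is flipped. Note that posting lists
have to be walked backwards, not just renumbered, to stay sorted.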

The point of Nutch's IndexSorter is that it allows you to reorder
docids in an arbitrary manner, using a user-supplied mapping between old
and new docids, which can be based on values retrieved from the current
index or from any other source. So I think this is the main value
of this class: it is applicable to various scenarios.
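
A minimal Python sketch of the old-to-new docid mapping idea follows. It is
not the IndexSorter code: build_docid_map and reorder are hypothetical names,
and the per-document score is an assumed stand-in for whatever value the
mapping is derived from.

```python
from typing import List


def build_docid_map(scores: List[float]) -> List[int]:
    """Return old_to_new, where old_to_new[old_id] is the new docid.

    Documents with higher scores get lower new docids. Python's sort is
    stable, so ties keep their original relative order.
    """
    order = sorted(range(len(scores)), key=lambda old_id: -scores[old_id])
    old_to_new = [0] * len(scores)
    for new_id, old_id in enumerate(order):
        old_to_new[old_id] = new_id
    return old_to_new


def reorder(docs: List[str], old_to_new: List[int]) -> List[str]:
    """Apply the mapping, producing the docs in new-docid order."""
    out = [""] * len(docs)
    for old_id, doc in enumerate(docs):
        out[old_to_new[old_id]] = doc
    return out


scores = [0.2, 0.9, 0.5]  # assumed per-document importance values
mapping = build_docid_map(scores)
print(mapping)
print(reorder(["a", "b", "c"], mapping))
```

Reversing the docid order is just one special case of such a mapping, which
is why the general form covers more scenarios.
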
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com