On Monday 24 July 2006 08:17, Martin Braun wrote:
> I think I didn't explain my problem well enough.
>
> The harder problem for me is how to get the proposals for the
> refinement?  I have a date-range of 16xx to now, for about 4 bn. docs.
> So the number of found documents could be quite large. But the
> distribution of the dates could be very different from one query to
> another. I hope there is a better way than to collect all dates with
> HitIterator, and do statistics on the data?
>
> Is there something that could be done while indexing?
> What would be a high-performance heuristic?
>
> The same problem with other categories like the author: how to find good
> proposals for a given result set?

That's a lot trickier and there might be others on the list who can give a 
better answer. I think what you need to do is to extend HitCollector:

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/HitCollector.html

Your implementation will keep a count of the dates and categories that the 
results fall into. Then you can use this information to build up your 
refinements. The problem with this approach is that HitCollector#collect only 
provides you with the document number, not the document itself. You still have 
to load each document to find out its date and category, which will be slow 
for large result sets.
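
As a rough sketch of the counting idea (not tested against a real index): the 
class below has the same collect(int doc, float score) shape as HitCollector 
but deliberately does not import Lucene, so it compiles on its own. In 
practice you would extend HitCollector and pass it to the searcher. The name 
FacetCountingCollector and the facetOf lookup are my own placeholders, and the 
lookup is exactly the slow part if it has to load documents.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntFunction;

// Sketch only: in a real Lucene setup this class would extend
// org.apache.lucene.search.HitCollector. It is kept free of Lucene
// imports here so that it compiles without the Lucene jar.
class FacetCountingCollector {
    // Hypothetical lookup from document number to a facet value,
    // for example a year or an author name. Returns null if unknown.
    private final IntFunction<String> facetOf;
    private final Map<String, Integer> counts = new TreeMap<>();

    FacetCountingCollector(IntFunction<String> facetOf) {
        this.facetOf = facetOf;
    }

    // Same shape as HitCollector#collect(int doc, float score):
    // called once per matching document, with the document number only.
    public void collect(int doc, float score) {
        String value = facetOf.apply(doc);
        if (value != null) {
            counts.merge(value, 1, Integer::sum);
        }
    }

    // Facet value -> number of hits carrying that value.
    public Map<String, Integer> counts() {
        return counts;
    }
}
```

Once the search has finished, counts() holds one entry per facet value, and 
the values with the highest counts are the obvious refinement proposals.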

You might want to keep this metadata in another store so you can quickly look 
up the date and category based on the document number. This avoids having to 
load all the documents from Lucene, which is an expensive operation. The 
downside with this approach is that document numbers can change when you 
optimise your index, so you'll have to rebuild your metadata store each time 
you optimise.
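
One simple form such a store could take, assuming the years fit in memory, is 
a plain array indexed by document number. This is only an illustration: 
YearStore, rebuild and centuryCounts are names I made up, and how you fill the 
array from your index is left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: a side store mapping document numbers to a year.
class YearStore {
    private int[] yearByDoc = new int[0];

    // Call again after every optimise, because document numbers
    // may no longer point at the same documents.
    void rebuild(int[] yearsInDocOrder) {
        yearByDoc = yearsInDocOrder.clone();
    }

    // Constant-time lookup, no document loading involved.
    int yearOf(int doc) {
        return yearByDoc[doc];
    }

    // Tally hits per century; the key is the first year of the
    // century, e.g. 1600 for any year from 1600 to 1699.
    Map<Integer, Integer> centuryCounts(int[] hitDocs) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int doc : hitDocs) {
            int century = (yearOf(doc) / 100) * 100;
            counts.merge(century, 1, Integer::sum);
        }
        return counts;
    }
}
```

The per-century buckets are just one choice. For a query whose hits cluster in 
a few decades you would probably want finer buckets, which the same array 
supports without touching the index.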



Miles
