On Sep 15, 2005, at 5:00 AM, JMA wrote:
I know I can get all the fields in an index: reader.getFieldNames()
and also all the terms:  reader.terms()

However, I need to be able to get all the terms and fields given a search filter. For example, say I have an index that has crawled 5000 pdf files
(books) and I have the following fields:

content, author (not tokenized), and publish_date

I can easily find all the *distinct* authors in the index using
'reader.terms()'. But say I want to list all the *distinct* authors that have published books in 2002? I can do a simple search to get all the books filtered by publish_date:2002. But then I have to do my own scan of the
results and pull out the author, removing duplicates.

Is there an easier way to do this?

I'm currently building a faceted navigation system (think Google for nineteenth-century literature, except with browsing navigation by author, date range, genre, and probably some others as it evolves). This is very much like the CNET implementation that Chris detailed here: http://www.lucenebook.com/blog/announcements/2005/08/31/cnet.html

My index is pretty static after it is built, so I cache a lot. The first thing I do is walk all the unique terms (using reader.terms()) for the faceted fields, and for each term I create a BitSet with a bit set for every document that contains that term. I let the user build up constraints while navigating with any number of these facets, and simply AND the BitSets together to find the matching documents. I also allow full-text search within those constraints, and use QueryFilter.bits() in that case. The BitSets also let me display how many documents, given the current constraints, fall into each of the "buckets".
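
Here is a rough sketch of the bucket-counting step, not my actual code. It uses only java.util.BitSet, and assumes the per-term BitSets have already been built from the index; the class name FacetBuckets, the docs() helper, and the sample genre data are all made up for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

final class FacetBuckets {
    // For each term of one facet, count the documents that also satisfy
    // the current constraints. Neither input is modified.
    static Map<String, Integer> counts(Map<String, BitSet> facetTerms, BitSet constraints) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : facetTerms.entrySet()) {
            BitSet hits = (BitSet) e.getValue().clone(); // and() mutates, so work on a copy
            hits.and(constraints);
            result.put(e.getKey(), hits.cardinality());
        }
        return result;
    }

    // Hypothetical helper: a BitSet with the given document numbers set.
    static BitSet docs(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) {
            b.set(id);
        }
        return b;
    }

    public static void main(String[] args) {
        // Made-up cached BitSets for a "genre" facet.
        Map<String, BitSet> genre = new LinkedHashMap<>();
        genre.put("poetry", docs(0, 1, 4));
        genre.put("novel", docs(2, 3, 5));

        // Made-up result of ANDing the constraints the user has chosen so far.
        BitSet constraints = docs(0, 1, 2);

        System.out.println(counts(genre, constraints)); // prints {poetry=2, novel=1}
    }
}
```

The clone before and() matters because the cached BitSets are reused for every request; ANDing in place would corrupt the cache.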

So, more to your question: using the scheme I just described, you could build up a BitSet for each of the authors, then a BitSet for 2002 (this could come from a simple QueryFilter wrapping new TermQuery(new Term("publish_date", "2002")), for example). AND the 2002 BitSet with each of the author BitSets; any result with a cardinality > 0 means that author has documents from 2002.
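
As a minimal sketch of that last step, assuming the author BitSets and the 2002 BitSet are already in hand: the class name AuthorsByYear, the helper, and the sample data are hypothetical. It uses BitSet.intersects() instead of an explicit AND followed by a cardinality check; the two give the same yes/no answer, and intersects() avoids copying.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AuthorsByYear {
    // Return every author whose BitSet shares at least one document with
    // the filter BitSet, in the iteration order of the map.
    static List<String> authorsWithin(Map<String, BitSet> authorBits, BitSet filterBits) {
        List<String> authors = new ArrayList<>();
        for (Map.Entry<String, BitSet> e : authorBits.entrySet()) {
            // Same as: AND a copy with filterBits and test cardinality() > 0.
            if (e.getValue().intersects(filterBits)) {
                authors.add(e.getKey());
            }
        }
        return authors;
    }

    // Hypothetical helper: a BitSet with the given document numbers set.
    static BitSet docs(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) {
            b.set(id);
        }
        return b;
    }

    public static void main(String[] args) {
        // Made-up data: one BitSet per distinct author term, sorted by name.
        Map<String, BitSet> authorBits = new TreeMap<>();
        authorBits.put("Jones", docs(2, 3));
        authorBits.put("Smith", docs(0, 1));
        authorBits.put("Young", docs(4));

        // Made-up BitSet for publish_date:2002.
        BitSet year2002 = docs(1, 4, 5);

        System.out.println(authorsWithin(authorBits, year2002)); // prints [Smith, Young]
    }
}
```

Each author appears at most once because there is exactly one BitSet per distinct author term, so no duplicate removal is needed.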

Make sense?

    Erik

