On Sep 15, 2005, at 5:00 AM, JMA wrote:
I know I can get all the fields in an index: reader.getFieldNames()
and also all the terms: reader.terms()
However, I need to be able to get all the terms and fields given a
search
filter. For example, say I have an index that has crawled 5000 pdf
files
(books) and I have the following fields:
content, author (not tokenized), and publish_date
I can easily find all the *distinct* authors in the index using
'reader.terms()'. But say I want to list all the *distinct*
authors that
have published books in 2002? I can do a simple search to get all
the books
filtered by publish_date:2002. But then I have to do my own scan
of the
results and pull out the author, removing duplicates.
Is there an easier way to do this?
I'm currently building a faceted navigation system (think Google for
Nineteenth century literature, except with browsing navigation by
author, date range, genre, and probably some others as it evolves.
This is very much like the CNET implementation that Chris detailed
here: http://www.lucenebook.com/blog/announcements/2005/08/31/cnet.html
My index is pretty static after it is built, so I cache a lot. The
first thing I do is walk all the unique terms (using reader.terms())
for the faceted fields, and for each one I create a BitSet that has
set bits corresponding to each document that has that term. I allow
the user to build up constraints while navigating with any number of
these facets, and simply AND the BitSets together to find the
matching documents. I also allow for full-text search to occur
within those constraints, and leverage QueryFilter.bits() in that
case. The BitSet's allow me to display how many documents, based on
the constraints, are in each of the "buckets".
So more to your question - using the scheme I just described, you
could build up a BitSet for each of the authors. Then a BitSet for
2002 (this could be a simple QueryFilter with a TermQuery
("publish_date", "2002") for example). AND the BitSet of 2002 to all
of the author BitSets, and any BitSet with a cardinality > 0 has
documents for that author.
Make sense?
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]