Re: Facet performance with heterogeneous 'facets'?

Michael Imbeault Mon, 18 Sep 2006 20:31:35 -0700

Yonik Seeley wrote:

I noticed this too, and have been thinking about ways to fix it.
The root of the problem is that Lucene, like all full-text search
engines, uses inverted indices.  It's fast and easy to get all
documents for a particular term, but getting all terms for a document
is either not possible, or not fast (assuming many documents
match a query).
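
To illustrate the asymmetry described above, here is a minimal sketch of an inverted index. It is a toy model, not Lucene's actual data structure; the corpus and the names build_inverted_index and terms_for_doc are made up for the example.

```python
from collections import defaultdict

# Hypothetical toy corpus: doc id -> terms in the "author" field.
docs = {
    0: ["smith"],
    1: ["jones"],
    2: ["smith"],
}

def build_inverted_index(docs):
    """Map each term to the ascending list of doc ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            index[term].append(doc_id)
    return dict(index)

def terms_for_doc(index, doc_id):
    """Recover one document's terms by scanning every posting list."""
    return sorted(term for term, postings in index.items() if doc_id in postings)

index = build_inverted_index(docs)
print(index["smith"])           # term -> docs: a single dictionary lookup
print(terms_for_doc(index, 1))  # doc -> terms: has to visit every term
```

The first lookup costs the same no matter how many terms exist, while the second grows with the number of distinct terms in the field.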

Yeah, that's what I've been thinking; the index isn't built to handle such searches, sadly :( It would be very nice to be able to rapidly search by most frequent author, journal, etc.

For cases like "author", if there is only one value per document, then
a possible fix is to use the field cache.  If there can be multiple
occurrences, there doesn't seem to be a good way that preserves exact
counts, except maybe if the number of documents matching a query is
low.
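
A rough sketch of the field cache idea for the one-value-per-document case: keep an array indexed by document id that holds each document's author, then count values over the matching documents only. This is an illustration under assumptions, not the Lucene FieldCache API; author_by_doc and facet_counts are hypothetical names.

```python
from collections import Counter

# Hypothetical stand-in for a field cache: position i holds the single
# author value of document i.
author_by_doc = ["smith", "jones", "smith", "lee", "jones", "smith"]

def facet_counts(matching_doc_ids, field_values):
    """Count field values over the documents that matched the query."""
    return Counter(field_values[doc_id] for doc_id in matching_doc_ids)

# Pretend the query matched documents 0, 2 and 3.
counts = facet_counts([0, 2, 3], author_by_doc)
print(counts.most_common())
```

In this sketch the work is proportional to the number of matching documents rather than to the number of distinct authors, so a query with a single hit does a single array lookup.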

I have one value per document (I have fields for authors, last_author and first_author, and I'm doing faceted search on first and last authors fields). How would I use the field cache to fix my problem? Also, would it be better to store a unique number (for each possible author) in an int field along with the string, and do the faceted searching on the int field? Would this be faster / require less memory? I guess that yes, and I'll test that when I have the time.

Just a little follow-up: I did a little more testing, and the query
takes 20 seconds no matter what, whether there's one document in the
result set or I run a query that returns all 130000 documents.


Yes, currently the same strategy is always used.
  intersection_count(docs_matching_query, docs_matching_author1)
  intersection_count(docs_matching_query, docs_matching_author2)
  intersection_count(docs_matching_query, docs_matching_author3)
  etc...

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filterCache, the effective cache hit
rate will be 0%.
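
The strategy above, and the reason the hit rate collapses, can be sketched with a toy LRU cache. This assumes least-recently-used eviction and authors visited in the same order on every request; the LRUCache class, facet_by_intersection, and the data are hypothetical and are not Solr code.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache that also counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = load(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

def facet_by_intersection(query_docs, postings, cache):
    """One intersection count per author, always in the same order."""
    counts = {}
    for author in postings:
        author_docs = cache.get(author, lambda a: frozenset(postings[a]))
        counts[author] = len(query_docs & author_docs)
    return counts

# Hypothetical data: 5 authors, but room for only 3 cached docsets.
postings = {f"author{i}": [i, i + 5] for i in range(5)}
cache = LRUCache(capacity=3)
for _ in range(4):  # four identical facet requests
    counts = facet_by_intersection({0, 1, 6}, postings, cache)
print(counts)
print(cache.hits, cache.misses)
```

In this model each request loops over every author, so its cost depends on the number of authors and not on the size of the result set. By the time the loop comes back to an author, that author's docset has already been evicted, so every lookup is a miss unless the cache can hold all of the docsets.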

-Yonik

So more memory would fix the problem? Also, I was under the impression that it was only searching / sorting for authors that it knows are in the result set... In the case of only one document (1 result), it seems strange that it takes the same time as for 130 000 results. It should just check the results, see that there's only one author, and return that? And in the case of 2 documents, just sort 2 authors (or 1 if they're the same)? I understand your answer (it does intersections), but I wonder why it's intersecting from the whole document set at first, and not docs_matching_query like you said.


Thanks for the support,

Michael
