Yonik Seeley wrote:
I noticed this too, and have been thinking about ways to fix it.
The root of the problem is that Lucene, like all full-text search
engines, uses inverted indices.  It's fast and easy to get all
documents for a particular term, but getting all terms for a document
is either not possible, or not fast (assuming many documents
match a query).
Yeah that's what I've been thinking; the index isn't built to handle such searches, sadly :( It would be very nice to be able to rapidly search by most frequent author, journal, etc.
For cases like "author", if there is only one value per document, then
a possible fix is to use the field cache.  If there can be multiple
occurrences, there doesn't seem to be a good way that preserves exact
counts, except maybe if the number of documents matching a query is
low.
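
To make the field cache idea concrete, here is a minimal sketch, assuming the cache can be modelled as a plain array indexed by internal document id that holds the single author value for each document. The names (field_cache, facet_counts_from_field_cache) and the data are made up for illustration and are not Lucene or Solr API.

```python
from collections import Counter

# Hypothetical field cache: position = internal document id, value = the one
# author stored for that document. Illustrative data only.
field_cache = ["smith", "jones", "smith", "lee", "jones", "smith"]


def facet_counts_from_field_cache(matching_docs, cache):
    """Count field values by visiting only the documents that matched."""
    counts = Counter()
    for doc_id in matching_docs:
        counts[cache[doc_id]] += 1
    return counts


docs_matching_query = [0, 2, 4]
counts = facet_counts_from_field_cache(docs_matching_query, field_cache)
print(counts.most_common())  # [('smith', 2), ('jones', 1)]
```

In this sketch the work is proportional to the number of matching documents rather than the number of distinct authors, so a one-document result needs a single lookup.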

I have one value per document (I have fields for authors, last_author and first_author, and I'm doing faceted search on the first_author and last_author fields). How would I use the field cache to fix my problem? Also, would it be better to store a unique number (one for each possible author) in an int field along with the string, and do the faceted searching on the int field? Would this be faster or require less memory? My guess is yes, and I'll test that when I have the time.

Just a little follow-up: I did a little more testing, and the query
takes 20 seconds no matter what, whether there's one document in the
result set or the query returns all 130000 documents.

Yes, currently the same strategy is always used.
  intersection_count(docs_matching_query, docs_matching_author1)
  intersection_count(docs_matching_query, docs_matching_author2)
  intersection_count(docs_matching_query, docs_matching_author3)
  etc...
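
The strategy listed above can be sketched roughly as follows. This is a toy model using Python sets, not Solr's actual implementation; author_postings and facet_counts_by_intersection are hypothetical names, and the postings are made-up data.

```python
# Hypothetical inverted index for the author field: term -> set of doc ids.
author_postings = {
    "smith": {0, 2, 5},
    "jones": {1, 4},
    "lee": {3},
}


def facet_counts_by_intersection(docs_matching_query, postings):
    """Do one intersection per distinct term, whatever the result size."""
    counts = {}
    for term, docs_matching_term in postings.items():
        n = len(docs_matching_query & docs_matching_term)
        if n:
            counts[term] = n
    return counts


one_doc = facet_counts_by_intersection({3}, author_postings)
all_docs = facet_counts_by_intersection({0, 1, 2, 3, 4, 5}, author_postings)
print(one_doc)   # {'lee': 1}
print(all_docs)  # {'smith': 3, 'jones': 2, 'lee': 1}
```

The loop runs once per distinct author in both calls, which would be consistent with the timing staying about the same for 1 result and for 130000 results.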

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filterCache, the effective cache hit
rate will be 0%.
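
A small simulation shows why the hit rate collapses to zero rather than just dropping. It assumes the cache evicts the least recently used entry and that each facet request walks the authors in the same order; both are assumptions for this sketch, and LRUCache and run_facet_requests are hypothetical names, not Solr classes.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that records hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = loader(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value


def run_facet_requests(num_authors, cache_size, num_requests):
    """Each request asks for the docset of every author, in the same order."""
    cache = LRUCache(cache_size)
    for _ in range(num_requests):
        for author_id in range(num_authors):
            # Stand-in for loading the docset of one author.
            cache.get_or_load(author_id, lambda a: {a})
    return cache.hits, cache.misses


print(run_facet_requests(1000, 512, 3))   # (0, 3000)
print(run_facet_requests(1000, 1000, 3))  # (2000, 1000)
```

With 1000 authors and room for 512 entries, each docset is evicted before it is requested again, so every lookup misses. Once the cache can hold every author, only the first request pays the loading cost.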

-Yonik
So more memory would fix the problem? Also, I was under the impression that it was only searching and sorting the authors that it knows are in the result set. In the case of only one document (1 result), it seems strange that it takes the same time as for 130000 results. Shouldn't it just check the results, see that there's only one author, and return that? And in the case of 2 documents, just sort 2 authors (or 1 if they're the same)? I understand your answer (it does intersections), but I wonder why it's intersecting from the whole document set at first, and not from docs_matching_query like you said.

Thanks for the support,

Michael
