On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:
Yonik Seeley wrote:
> For cases like "author", if there is only one value per document, then
> a possible fix is to use the field cache.  If there can be multiple
> occurrences, there doesn't seem to be a good way that preserves exact
> counts, except maybe if the number of documents matching a query is
> low.
>
I have one value per document (I have fields for authors, last_author
and first_author, and I'm doing faceted search on first and last authors
fields). How would I use the field cache to fix my problem?

Unless you want to dive into Solr development, you don't :-)
It would require extensive changes to the faceting code, and doing
things a different way in some cases.

The FieldCache is the fastest way to "uninvert" single-valued
fields... it's currently only used for sorting, where one needs to
quickly look up the field value given a document id.
The downside is high memory use, and that it's not a general
solution... it can't handle fields with multiple tokens (tokenized
fields or multi-valued fields).

So the strategy would be to step through the documents, get the value
for the field from the FieldCache, increment a counter for that value,
then find the top counters when we are done.
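
In rough Java, that strategy would look something like the sketch below.
This is just an illustration against the 2006-era Lucene API, not code
that exists in Solr; aside from FieldCache.DEFAULT.getStrings(), the
class and method names are made up for the example.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

public class FieldCacheFacetSketch {
  /** matchingDocs = internal Lucene doc ids of the docs that matched the query. */
  public static Map<String, Integer> countValues(IndexReader reader,
                                                 int[] matchingDocs,
                                                 String field) throws IOException {
    // One value per document; only valid for single-valued, untokenized fields.
    String[] values = FieldCache.DEFAULT.getStrings(reader, field);

    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int docId : matchingDocs) {
      String val = values[docId];
      if (val == null) continue;               // doc has no value for this field
      Integer c = counts.get(val);
      counts.put(val, c == null ? 1 : c + 1);
    }
    // A real implementation would keep only the top-N entries here,
    // e.g. with a bounded priority queue, instead of returning everything.
    return counts;
  }
}

The memory cost is the String[] of one entry per document in the index,
which is the FieldCache downside mentioned above.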

Also, would
it be better to store a unique number (for each possible author) in an
int field along with the string, and do the faceted searching on the int
field?

It won't really help.  It wouldn't be faster, and it would require
only slightly less memory.

>> Just a little follow-up - I did a little more testing, and the query
>> takes 20 seconds no matter what - if there's one document in the result
>> set, or if I do a query that returns all 130,000 documents.
>
> Yes, currently the same strategy is always used.
>   intersection_count(docs_matching_query, docs_matching_author1)
>   intersection_count(docs_matching_query, docs_matching_author2)
>   intersection_count(docs_matching_query, docs_matching_author3)
>   etc...
>
> Normally, the docsets will be cached, but since the number of authors
> is greater than the size of the filterCache, the effective cache hit
> rate will be 0%.
>
> -Yonik
So more memory would fix the problem?

Yes, if your collection size isn't that large...  it's not a practical
solution for many cases though.
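
For reference, the filterCache is configured in solrconfig.xml; bumping
its size would look something like the snippet below (the numbers are
placeholders, not a recommendation -- you'd want size to be at least the
number of distinct authors):

  <!-- solrconfig.xml: the filterCache holds the cached DocSets used for
       these intersections.  Placeholder sizes, adjust for your index. -->
  <filterCache
    class="solr.LRUCache"
    size="300000"
    initialSize="30000"
    autowarmCount="0"/>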

Also, I was under the impression
that it was only searching/sorting the authors that it knows are in
the result set...

That's the problem... it's not necessarily easy to know *what* authors
are in the result set.  If we could quickly determine that, we could
just count them and not do any intersections or anything at all.

In the case of only one document (1 result), it seems
strange that it takes the same time as for 130,000 results. It should
just check the results, see that there's only one author, and return
that? And in the case of 2 documents, just sort 2 authors (or 1 if
they're the same)? I understand your answer (it does intersections), but
I wonder why it's intersecting from the whole document set at first, and
not docs_matching_query like you said.

It is just intersecting docs_matching_query.  The problem is that it's
intersecting that set with all possible author sets since it doesn't
know ahead of time what authors are in the docs that match the query.
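
In rough Java, the current approach is something like the sketch below
(a simplified illustration, not the actual Solr faceting code; DocSet,
SolrIndexSearcher, and intersectionSize() are real classes/methods of
that era, the rest is made up for the example):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.SolrIndexSearcher;

public class IntersectionFacetSketch {
  // Intersect the query's DocSet with the DocSet of every term in the field,
  // regardless of how many documents actually matched the query.
  public static Map<String, Integer> countByIntersection(SolrIndexSearcher searcher,
                                                         DocSet docsMatchingQuery,
                                                         String field) throws IOException {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    IndexReader reader = searcher.getReader();
    TermEnum terms = reader.terms(new Term(field, ""));
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(field)) break;  // past the last term in this field
        // getDocSet() goes through the filterCache; with too few cache slots,
        // every one of these lookups is a miss and gets recomputed.
        DocSet docsMatchingAuthor = searcher.getDocSet(new TermQuery(t));
        counts.put(t.text(), docsMatchingQuery.intersectionSize(docsMatchingAuthor));
      } while (terms.next());
    } finally {
      terms.close();
    }
    return counts;
  }
}

The cost is driven by the number of distinct authors in the index, which
is why the time is the same whether the query matches 1 doc or 130,000.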

There could be optimizations for when docs_matching_query.size() is
small, so that we start from the terms in the matching documents rather
than the terms in the index.  That requires term vectors to be stored
(medium speed), or requires that the field be stored and re-analyzed
(very slow).
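
The term-vector variant could look something like this (again purely
illustrative -- this code doesn't exist in Solr, and it assumes the
field was indexed with term vectors enabled):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorFacetSketch {
  // Walk only the matching docs and read each doc's terms from its stored
  // term vector, instead of walking every term in the index.
  public static Map<String, Integer> countFromTermVectors(IndexReader reader,
                                                          int[] matchingDocs,
                                                          String field) throws IOException {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int docId : matchingDocs) {
      TermFreqVector tfv = reader.getTermFreqVector(docId, field);
      if (tfv == null) continue;                // no term vector stored for this doc
      for (String term : tfv.getTerms()) {      // each distinct term in this doc's field
        Integer c = counts.get(term);
        counts.put(term, c == null ? 1 : c + 1);
      }
    }
    return counts;
  }
}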

Special cases like these haven't been optimized simply because no one
has gotten to it yet... (as you note, faceting is a new feature).


-Yonik
