On Thu, 2012-06-07 at 10:01 +0200, Andrew Laird wrote:
> For our needs we don't really need to know that a particular facet has
> exactly 14,203,527 matches - just knowing that there are "more than a
> million" is enough.  If I could somehow limit the hit counts to a
> million (say) [...]

It should be feasible to stop the collector after 1M documents have been
processed - if nothing else, then simply by ignoring subsequent IDs.
However, the IDs received would be in index order, which normally means
old-to-new. If the nature of the corpus, and thereby the facet values,
changes over time, that change would not be reflected in the facets with
many hits, as the collector would never reach the newer documents.

> it seems like that could decrease the work required to
> compute the values (just stop counting after the limit is reached) and
> potentially improve faceted search time - especially when we have 20-30
> fields to facet on.  Has anyone else tried to do something like this?

The current Solr facet implementation treats every facet structure
individually. That works fine in a lot of cases, but it also means that
the list of IDs for matching documents is iterated once for every facet
field: in your sample case, 14M+ hits * 25 fields = 350M+ hits
processed.
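
To make that cost concrete, a toy illustration (plain Java, not Solr
internals - ordinal() is a hypothetical stand-in for the doc-to-term
lookup):

public class PerFieldFacetCost {
    static final int HITS = 14000000;  // ~14M matching documents
    static final int FIELDS = 25;
    static final int TERMS = 100;      // terms per field, arbitrary

    // Hypothetical: the ordinal of the term `doc` holds in `field`.
    static int ordinal(int field, int doc) {
        return (doc * 31 + field) % TERMS;
    }

    public static void main(String[] args) {
        long processed = 0;
        for (int field = 0; field < FIELDS; field++) {
            int[] counts = new int[TERMS];
            // The full hit list is traversed again for every field.
            for (int doc = 0; doc < HITS; doc++) {
                counts[ordinal(field, doc)]++;
                processed++;
            }
        }
        System.out.println(processed); // 350,000,000
    }
}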

I have been experimenting with an alternative approach (SOLR-2412) that
packs the terms from all the facet fields into a single structure under
the hood, which means only 14M+ hits are processed in the same case.
Unfortunately it is not yet mature and only works for text fields.
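
A loose sketch of the idea (my own simplification for illustration -
docOrdinals() and the packed layout are made up, not the actual
SOLR-2412 code):

public class PackedFacetCost {
    static final int HITS = 14000000;
    static final int FIELDS = 25;
    static final int TERMS = 100;

    // Hypothetical: the packed ordinals (one per facet field, all
    // sharing a single counter space) for the terms a document holds.
    static int[] docOrdinals(int doc) {
        int[] ords = new int[FIELDS];
        for (int field = 0; field < FIELDS; field++) {
            ords[field] = field * TERMS + (doc * 31 + field) % TERMS;
        }
        return ords;
    }

    public static void main(String[] args) {
        int[] counts = new int[FIELDS * TERMS]; // one shared structure
        // The hit list is decoded just once; each document updates the
        // counters for all of its fields in that single pass.
        for (int doc = 0; doc < HITS; doc++) {
            for (int ord : docOrdinals(doc)) {
                counts[ord]++;
            }
        }
        System.out.println("hits processed: " + HITS);
    }
}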

- Toke Eskildsen, State and University Library, Denmark
