Re: Highest frequency terms for a subset of documents

Yonik Seeley Thu, 21 Apr 2011 06:40:43 -0700

On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort <o...@tra.cx> wrote:
> Another strange behavior is that the Qtime seems pretty stable, no matter
> how many objects match my query. 200K and 20K both take about 17s.
> I would have guessed that since the time is spent going over all the terms
> of all the subset documents, more documents would mean more time.


facet.method=enum steps over all terms in the index for that field...
that takes time regardless of how many documents are in the base set.

There are also short-circuit methods that avoid looking at the docs
for a term if its docfreq is low enough that it couldn't possibly
make it into the priority queue.  Because of this, it can actually be
faster to facet on a larger base set (try *:* as the base query).
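
To make the short-circuit concrete, here is a rough sketch in Python.
It is a toy model, not Solr's actual code: the function name, the
dict-of-sets "index" and the returned intersection counter are all
made up for illustration. It only assumes what is described above:
every term is visited, and a term is skipped when its docfreq is too
low to displace the smallest count already in the priority queue.

```python
import heapq

def top_terms_enum(term_postings, base_set, limit):
    """Toy model of enum-style faceting (hypothetical, not Solr code).

    term_postings: dict mapping term -> set of doc ids (the "index")
    base_set:      set of doc ids matching the base query
    limit:         number of top terms to keep
    Returns (top terms as (count, term) pairs, number of intersections done).
    """
    heap = []            # min-heap of (count, term), at most `limit` entries
    intersections = 0
    # Every term in the field is visited, whatever the base set size.
    for term in sorted(term_postings):
        docs = term_postings[term]
        # Short-circuit: the count inside the base set can never exceed
        # the term's docfreq, so skip terms that cannot beat the current
        # smallest entry once the queue is full.
        if len(heap) == limit and len(docs) <= heap[0][0]:
            continue
        intersections += 1
        count = len(docs & base_set)
        if count == 0:
            continue
        if len(heap) < limit:
            heapq.heappush(heap, (count, term))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, term))
    return sorted(heap, reverse=True), intersections
```

With a large base set the queue fills with high counts early, so more
low-docfreq terms get skipped without an intersection. With a small
base set the counts in the queue stay low and fewer terms can be
skipped, which is one way a larger base set can end up cheaper.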

Actually, it might be interesting to see the query time if you set
facet.mincount equal to the number of docs in the base set - that will
test pretty much just the time to enumerate over the terms without
doing any set intersections at all.  Be careful not to set mincount
greater than the number of docs in the base set though - Solr will
short-circuit that too and skip enumeration altogether.
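
If it helps, the timing test could be set up along these lines. This
is only a sketch: the base URL and the field name "text" are
placeholders for whatever your setup uses, and the helper function is
hypothetical. The request parameters themselves (facet, facet.field,
facet.method, facet.mincount, rows) are the standard Solr ones.

```python
from urllib.parse import urlencode

def facet_timing_url(base_url, query, field, num_found):
    """Build a request URL for the enumeration-only timing test.

    num_found should be the number of docs matching `query` (numFound
    from a previous request). Do not pass a larger value, or the
    enumeration is skipped entirely.
    """
    params = {
        "q": query,
        "rows": 0,                    # only the facet counts matter here
        "facet": "true",
        "facet.field": field,
        "facet.method": "enum",
        "facet.mincount": num_found,  # equal to the base set size
    }
    return base_url + "?" + urlencode(params)

# Placeholder host and field name; 200000 is the 200K case quoted above.
print(facet_timing_url("http://localhost:8983/solr/select",
                       "*:*", "text", 200000))
```

Comparing the QTime of that request against your normal facet request
should show roughly how much of the 17s is plain term enumeration.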

The work on the bulkpostings branch should definitely speed up your
case even more - but I have no idea when it will "land" on trunk.


-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco
