Re: Highest frequency terms for a subset of documents

Ofer Fort Thu, 21 Apr 2011 07:41:48 -0700

I see, thanks.
So if I wanted to implement something that fits my needs, would
going through the subset of documents and counting all the terms in each
one be faster, and easier to implement?
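
A minimal sketch of that per-document counting idea, assuming the terms of
each document in the subset are already available as lists of strings
(getting them out of the index, e.g. from term vectors or by re-analyzing
stored fields, is not shown). The function name and the sample data are
hypothetical. It counts each term once per document, so the numbers are
comparable to facet counts; the cost grows with the number of documents
in the subset rather than with the number of terms in the field.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_terms(subset_docs: Iterable[List[str]], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n terms that occur in the most documents of the subset."""
    counts: Counter = Counter()
    for terms in subset_docs:
        # set() makes each document contribute at most 1 per term,
        # i.e. a document frequency rather than a raw term frequency.
        counts.update(set(terms))
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical subset of three already-tokenized documents.
    docs = [["solr", "facet", "solr"], ["solr", "enum"], ["facet", "solr"]]
    print(top_terms(docs, n=2))
```

Dropping the set() call would count raw term occurrences instead, if that
is the frequency actually wanted.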


On Thu, Apr 21, 2011 at 5:36 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> On Thu, Apr 21, 2011 at 9:44 AM, Ofer Fort <o...@tra.cx> wrote:
> > Not sure I fully understand.
> > If "facet.method=enum steps over all terms in the index for that field",
> > then what does setting q=field:subset do? If I set q=*:*, then how
> > do I get the frequency only on my subset?
>
> It's an implementation detail.  Faceting *does* give you counts only
> for documents that match q=field:subset.  How it does that is a
> different matter (i.e. for facet.method=enum, it must step over all
> terms in the field), so the cost is closer to O(nterms in field)
> than O(ndocs in base set).
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>
>
> > Ofer
> >
> > On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley <yo...@lucidimagination.com>
> > wrote:
> >>
> >> On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort <o...@tra.cx> wrote:
> >> > Another strange behavior is that the QTime seems pretty stable, no
> >> > matter how many objects match my query. 200K and 20K both take
> >> > about 17s.
> >> > I would have guessed that, since the time is spent going over all
> >> > the terms of all the subset documents, more documents would mean
> >> > more time.
> >>
> >> facet.method=enum steps over all terms in the index for that field...
> >> that takes time regardless of how many documents are in the base set.
> >>
> >> There are also short-circuit methods that avoid looking at the docs
> >> for a term if its docfreq is low enough that it couldn't possibly
> >> make it into the priority queue.  Because of this, it can actually be
> >> faster to facet on a larger base set (try *:* as the base query).
> >>
> >> Actually, it might be interesting to see the query time if you set
> >> facet.mincount equal to the number of docs in the base set - that will
> >> test pretty much just the time to enumerate over the terms without
> >> doing any set intersections at all.  Be careful not to set mincount
> >> greater than the number of docs in the base set though - Solr will
> >> short-circuit that too and skip enumeration altogether.
> >>
> >> The work on the bulkpostings branch should definitely speed up your
> >> case even more - but I have no idea when it will "land" on trunk.
> >>
> >>
> >> -Yonik
> >> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> >> 25-26, San Francisco
> >
> >
>
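
For reference, the facet.mincount experiment suggested in the quoted
message could be set up roughly as below. This is only a sketch: the
endpoint URL, the field name "text" and the base-set size of 20000 are
assumptions for illustration, not values confirmed in the thread. The
request parameters themselves (facet, facet.field, facet.method,
facet.mincount, facet.limit, rows) are standard Solr faceting parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; host, port and path are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_facet_url(base_query: str, field: str, mincount: int, limit: int = 10) -> str:
    """Build a facet request URL that uses facet.method=enum."""
    params = {
        "q": base_query,             # base set, e.g. "field:subset" or "*:*"
        "rows": 0,                   # only facet counts are needed, no documents
        "facet": "true",
        "facet.field": field,
        "facet.method": "enum",      # step over every term in the field
        "facet.mincount": mincount,  # set equal to the base-set size for the test
        "facet.limit": limit,
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed: the base query matches about 20000 documents.
    print(build_facet_url("field:subset", "text", mincount=20000))
```

Comparing QTime for this request against the same request with the
default mincount should show how much of the 17s is term enumeration and
how much is set intersection.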
