Toke Eskildsen wrote
> adfel70 <

> adfel70@

> > wrote:
>> I am trying to understand why faceting on a field with lots of unique
>> values
>> has a great impact on query performance.
> 
> Faceting in Solr is performed in different ways. String faceting different
> from Numerics faceting, DocValued fields different from non-DocValued, fc
> different from enum. Let's look at String faceting with facet.method=fc
> and DocValues.
> 
> Strings (aka Terms) are represented in the faceting code with an ordinal,
> which is really just a number. The first term has number 0, the next
> number 1 and so forth. When doing a faceting call with the above premise,
> what happens is
> 
> 1) An counter of int[unique_values] is allocated.
> This is fairly fast, but still with a noticeable impact when the number of
> unique value creeps into the millions. On our machine it takes several
> hundred milliseconds for 100M values. Also relevant is the overall strain
> it puts on the garbage collector.
> 
> 2) For each hit in the result set, the corresponding ordinals are resolved
> and counter[ordinal]++ is triggered.
> This scales with the result set. Small sets are very fast, quite
> independent of the size of the counter-structure. Large result sets are
> (naturally) equally slow.
> 
> 3) The counter-structure is iterated and top-X are determined.
> This scales with the size of the counter-structure, (nearly) independent
> of the result set size.
> 
> 4) The Terms for the top-X ordinals are resolved from the index.
> This scales with X.
> 
> 
> Some of these parts has some non-intuitive penalties: Even very tiny
> result sets has aa constant overhead from allocation and iteration. Asking
> for top-1M hits means that the underlying priority queue will probably no
> longer fit in the CPU cache and will slow things down. Resolving Terms
> from ordinals relies of fast IO and a large number of unique Terms might
> mean that the disk cache is not large enough.
> 
> 
> Blatant plug: I have spend a fair amount of time trying to make some of
> this faster http://tokee.github.io/lucene-solr/
> 
> - Toke Eskildsen

Hi Toke, Thank you for the detailed explanation, thats exactly what I was
looking for, except this algorithm fit single index only. could you please
elaborate what adjustments are needed for distributed index?
The naive solution would count all terms on each shard, but the initial
shard (the one that executed the request) must have ALL results for correct
aggregation (its easy to find example which shows that top K results from
every shard is not good enough). 
Is that correct? I tried to verify this behaviour, but I didnt see that the
process who got the request from the user used more memory than the other
shards.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-facets-implementation-question-tp4227604p4229741.html
Sent from the Solr - User mailing list archive at Nabble.com.

Reply via email to