: I'd like to wrap my head around how faceting in SolrCloud works. Does
: Solr ask each shard for their maximum value and then use that to
: determine what else should be asked for from other shards, or does it
: ask for all values and do the aggregation on the requesting server?

For things like facet.range and facet.query, it's really simple: each 
shard is given the facet params, each shard computes an identical list of 
constraints (they are deterministic) and returns them to the coordinator, 
who sums the counts.
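
As a rough illustration of that summing step (a sketch only, not Solr's actual code; the function name and the sample facet.query counts are made up for the example):

```python
from collections import Counter

def merge_deterministic_facets(shard_responses):
    """Sum per-constraint counts from shards that all report the same constraints."""
    totals = Counter()
    for response in shard_responses:
        # Counter.update adds the counts instead of overwriting them
        totals.update(response)
    return dict(totals)

# Hypothetical facet.query counts returned by three shards
shard_responses = [
    {"price:[0 TO 100]": 5, "price:[100 TO *]": 2},
    {"price:[0 TO 100]": 1, "price:[100 TO *]": 0},
    {"price:[0 TO 100]": 3, "price:[100 TO *]": 4},
]
merged = merge_deterministic_facets(shard_responses)
```

Because every shard computes the same list of constraints, a single pass is enough and no follow-up request is needed.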

For facet.field things are a little more interesting...

The overall approach taken is designed to ensure that the constraint 
counts you get back are always correct (w/o requiring each shard to 
return the full list of terms it knows about) even if the document 
distribution is really skewed and the constraints returned aren't the 
constraints with the highest count across the entire index.  

how it works (from memory) is basically..

1) each shard is asked for its top constraints using an artificially 
inflated "facet.limit" (I think it's something like 20 + 1.5 times the 
original limit)

2) the coordinator node sums up the counts for any constraint returned by 
multiple nodes, and then picks the top (facet.limit) constraints based on 
the counts it knows about.

3) the coordinator node then asks each shard to compute its exact count 
for the selected constraints (since some of those constraints may not have 
been in the original lists returned by some shards), and it then computes 
the final sum. This ensures that the constraint count will match the 
numFound if you filter on that constraint (but I believe this second query 
is optimized to only ask a shard about a constraint if it didn't already 
get the count in the first request)
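
The three steps above can be sketched roughly like this. It is a toy simulation, not Solr's implementation: each "shard" is just a dict of term counts, the names (shard_top_terms, distributed_facet, extra, ratio) are invented for the example, and the default over-request of 20 + 1.5 times the limit is only the from-memory figure mentioned in step 1.

```python
from collections import Counter

def shard_top_terms(shard_index, n):
    """Simulate a shard returning its n highest-count terms."""
    return dict(Counter(shard_index).most_common(n))

def distributed_facet(shards, limit, extra=20, ratio=1.5):
    # Step 1: ask every shard for more terms than the caller requested.
    # The extra/ratio defaults are a from-memory guess, not confirmed values.
    inflated = int(extra + ratio * limit)
    first = {name: shard_top_terms(index, inflated)
             for name, index in shards.items()}

    # Step 2: sum the counts seen so far and pick the top `limit` candidates.
    totals = Counter()
    for counts in first.values():
        totals.update(counts)
    candidates = [term for term, _ in totals.most_common(limit)]

    # Step 3: refinement. Only ask a shard about a candidate it did not
    # already report in the first pass.
    for name, index in shards.items():
        for term in candidates:
            if term not in first[name]:
                totals[term] += index.get(term, 0)

    # Final list: exact counts for the chosen candidates, highest first.
    return sorted(((t, totals[t]) for t in candidates),
                  key=lambda pair: (-pair[1], pair[0]))

# Hypothetical two-shard index with a skewed distribution
shards = {
    "s1": {"a": 10, "b": 1, "c": 5},
    "s2": {"b": 9, "a": 1, "c": 5},
}
# No over-request at all, to force the refinement step to do some work
result = distributed_facet(shards, limit=1, extra=0, ratio=1.0)
```

With over-requesting switched off, s2 never mentions "a" in the first pass, so the refinement step is what brings its count up from 10 to the exact 11.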

So imagine you have 3 shards, and querying them individually with 
facet.field=cat&facet.limit=3 you get...

shardA: cars(8), books(7), computers(6)
shardB: toys(8), books(7), garden(5)
shardC: garden(4), books(3), computers(3)

If you made a SolrCloud query (or an explicit distributed query of those 
three shards), the first request the coordinator would send to each shard 
would specify a higher facet.limit, and might get back something like...

shardA: cars(8), books(7), computers(6), cleaning(4), ...
shardB: toys(8), books(7), garden(5), cleaning(4), ...
shardC: garden(4), books(3), computers(3), plants(3), ...

...in which case "cleaning" pops up as a contender for being in the top 
constraints.  The coordinator sums up the counts for the constraints it 
knows about, and might decide that these are the top 3...

        books(17), computers(9), cleaning(8)

...at which point it sends a second query to shardB asking for an explicit 
count of the constraint "computers" and to shardC for a count of the 
constraint "cleaning", so that the final results can be exact.  When the 
responses come back, those counts are added in to the totals the 
coordinator already knows about.  So in our example, if the results of 
the second query are...

shardB: computers(0)
shardC: cleaning(2)

...then the final top 3 constraints for the cat field might be...

        books(17), cleaning(10), computers(9)
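
The refinement arithmetic in this example can be traced in a few lines. This is only a sketch: the first-pass dicts hold just the terms shown above (the "..." tails are left out), the selected list is taken as given from the example, and the variable names are invented.

```python
# First-pass responses from the example, without the elided tails
first_pass = {
    "shardA": {"cars": 8, "books": 7, "computers": 6, "cleaning": 4},
    "shardB": {"toys": 8, "books": 7, "garden": 5, "cleaning": 4},
    "shardC": {"garden": 4, "books": 3, "computers": 3, "plants": 3},
}
# The coordinator's pick in the example, taken as given
selected = ["books", "computers", "cleaning"]

# Work out which shard still has to be asked about which constraint
refine = {}
for shard, counts in first_pass.items():
    missing = [term for term in selected if term not in counts]
    if missing:
        refine[shard] = missing

# Second-pass answers from the example
second_pass = {"shardB": {"computers": 0}, "shardC": {"cleaning": 2}}

# Start from the first-pass sums, then add the refinement counts
totals = {term: sum(counts.get(term, 0) for counts in first_pass.values())
          for term in selected}
for counts in second_pass.values():
    for term, n in counts.items():
        totals[term] += n

final = sorted(totals.items(), key=lambda pair: -pair[1])
```

Here refine comes out as shardB being asked about "computers" and shardC about "cleaning", and final is books(17), cleaning(10), computers(9), matching the walk-through.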

(Note that if we had just done a really naive sum of the original counts, 
"cleaning" wouldn't have even made the list)


-Hoss
