[ https://issues.apache.org/jira/browse/SOLR-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007497#comment-13007497 ]

Toke Eskildsen commented on SOLR-2403:
--------------------------------------

Dividing mincount by the shard count is fairly risky. Consider, for example, 
the shards
{code}
Shard 1: A(9) B(6) C(10) D(8)
Shard 2: A(4) B(5) C(4) D(3)
{code}
where a request for the top-3 terms with an overall mincount=10, divided into 
mincount=5 for each shard, would give the merged result
{code}
B(11) C(10)
{code}
whereas the correct counts would be
{code}
A(13) B(11) C(14) D(11)
{code}
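
To make the failure mode concrete, here is a minimal, self-contained 
simulation of the naive merge (plain Java, not Solr code; all class and method 
names are invented for illustration). Each shard applies the divided mincount 
and the limit locally under lexical order, and the coordinator sums whatever 
comes back before applying the full mincount:
{code}
import java.util.*;
import java.util.stream.*;

public class NaiveLexMerge {

    // A shard returns its lexically first `limit` terms with count >= perShardMin.
    static Map<String, Integer> shardResponse(Map<String, Integer> counts,
                                              int perShardMin, int limit) {
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= perShardMin)
                .sorted(Map.Entry.comparingByKey())
                .limit(limit)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                          Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> shards = List.of(
                Map.of("A", 9, "B", 6, "C", 10, "D", 8),
                Map.of("A", 4, "B", 5, "C", 4, "D", 3));
        int mincount = 10, limit = 3;

        // Coordinator: divided per-shard mincount, local lexical limit, then sum.
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> shard : shards) {
            shardResponse(shard, mincount / shards.size(), limit)
                    .forEach((term, count) -> merged.merge(term, count, Integer::sum));
        }
        merged.values().removeIf(count -> count < mincount); // final mincount filter
        System.out.println(merged); // {B=11, C=10} -- A(13) and D(11) are lost
    }
}
{code}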

The problem with using mincount=1 for each shard call is of course that the 
single-shard result sets need to be humongous in order to ensure that the 
correct values are returned when the field contains many values with a low 
count and few values with a high count. With shards like
{code}
Shard 1: A(1) B(1) C(1) D(1) E(1) F(9) G(1) H(1)
Shard 2: A(1) B(1) C(1) D(1) E(1) F(1) G(1) H(10)
{code}
a request with mincount=10 requires all terms to be returned from both shards 
in order to get the result
{code}
F(10) H(11)
{code}
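
(For what it is worth, running the merge sketch above on these two shards with 
the relaxed per-shard mincount=1 and limit=3 fares no better: both shards 
return A(1) B(1) C(1), so the merged and filtered result is empty. The 
per-shard limit would have to cover all eight terms before F and H surfaced at 
all.)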

As you, Yonik, point out, a variant of the problem exists when sorting on 
count. However, for count it is mitigated by the fact that the results from 
the individual shards are sorted by the selecting key (count itself). This 
means that the chance of missing or miscounting terms is low and can be 
lowered further by relatively little over-requesting.

With lexical sorting, the selecting key (again, count) is independent of the 
sorting key. Over-requesting helps, but only linearly in the fraction of each 
shard's full result set that is requested. Furthermore, the need for 
over-requesting grows with the number of shards, as the overlapping hills can 
be individually smaller while still summing up to mincount.

I do not have any real solution for the problem. One minor improvement would 
be a collector that kept collecting terms with mincount=y until limit=n was 
reached or the number of collected terms with mincount=x equaled m, where x is 
the original mincount and y depends on the number of shards. This would at 
least stop the collection process once the result set was guaranteed to 
contain enough values above the given threshold. It would work well with 
spikes but poorly with hills just below mincount x, and it would still not 
guarantee correctness of the summed counts, only correctness of the terms.
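
A rough sketch of such a collector (hypothetical names, no relation to Solr's 
actual collector API): terms arrive in lexical order, everything with 
count >= y is kept, and collection stops as soon as either n terms have been 
gathered or m of them already meet the original mincount x on this shard 
alone.
{code}
import java.util.*;

public class ThresholdLexCollector {
    private final int x; // original (global) mincount
    private final int y; // relaxed per-shard mincount, dependent on shard count
    private final int n; // cap on collected terms (the over-requested limit)
    private final int m; // stop once this many terms are locally >= x

    private final List<Map.Entry<String, Integer>> collected = new ArrayList<>();
    private int locallyQualified = 0;

    ThresholdLexCollector(int x, int y, int n, int m) {
        this.x = x; this.y = y; this.n = n; this.m = m;
    }

    /** Feed terms in lexical order; returns false once collection should stop. */
    boolean collect(String term, int count) {
        if (count >= y) {
            collected.add(Map.entry(term, count));
            if (count >= x) {
                locallyQualified++;
            }
        }
        return collected.size() < n && locallyQualified < m;
    }

    List<Map.Entry<String, Integer>> result() {
        return collected;
    }
}
{code}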

> Problem with facet.sort=lex, shards, and facet.mincount
> -------------------------------------------------------
>
>                 Key: SOLR-2403
>                 URL: https://issues.apache.org/jira/browse/SOLR-2403
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 4.0
>         Environment: RHEL5, Ubuntu 10.04
>            Reporter: Peter Cline
>
> I tested this on a recent trunk snapshot (2/25); I haven't verified with 3.1 
> or 1.4.1, but can do so and update the issue if necessary.
> Solr is not returning the proper number of facet values when sorting 
> alphabetically, using distributed search, and using a facet.mincount that 
> excludes some of the values in the first facet.limit values.
> This is easiest explained by example. Sorting alphabetically, the first 20 
> values for my "subject_facet" field have few documents: 19 facet values have 
> only 1 document associated, and 1 has 2 documents. There are plenty after 
> that with more than 2.
> {code}
> http://localhost:8082/solr/select?q=*:*&facet=true&facet.field=subject_facet&facet.limit=20&facet.sort=lex&facet.mincount=2
> {code}
> comes back with the expected 20 facet values with >= 2 documents associated.
> If I add a shards parameter that points back to itself, the result is 
> different.
> {code}
> http://localhost:8082/solr/select?q=*:*&facet=true&facet.field=subject_facet&facet.limit=20&facet.sort=lex&facet.mincount=2&shards=localhost:8082/solr
> {code}
> comes back with only 1 facet value: the single value in the first 20 that had 
> more than 1 document.  
> It appears to me that mincount is ignored when doing the original query to 
> the shards, then applied afterwards.
> Let me know if you need any more info.  
> Thanks,
> Peter
