[ 
https://issues.apache.org/jira/browse/SOLR-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281188#comment-16281188
 ] 

Hoss Man commented on SOLR-11733:
---------------------------------


Steps to reproduce...

*Build Collection & Index Some Data*

{noformat}
# start up a small solr cluster
$ bin/solr -e cloud -noprompt
...

# NOTE: we're ignoring the getting started collection that was created
# we'll make our own using the implicit router with one shard per node

$ curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=test&router.name=implicit&numShards=2&shards=shardX,shardY'
...

# Index 5 docs to *each* shard with:
# - the same "top 5" terms in all 5 docs on both shards
# - a common "tail" term in 2 docs on *both* shards
#   - for a total of 4 docs
# - some shard-specific "distracting" terms that each appear in only 3 docs,
#   and always on a single shard
#   - On the 1st shard: there are 5 of these terms, such that 'tail' will be
#     the #11 ranked term (on this shard)
#   - On the 2nd shard: 'tail' will be the #7 ranked term (on this shard)

$ curl -H 'Content-Type: application/json' 'http://localhost:8983/solr/test/update?commit=true' --data-binary '[
{ "id": "1_1", "foo_t": "a1 a2 a3 a4 a5   x1 x2 x3 x4 x5" },
{ "id": "1_2", "foo_t": "a1 a2 a3 a4 a5   x1 x2 x3 x4 x5" },
{ "id": "1_3", "foo_t": "a1 a2 a3 a4 a5   x1 x2 x3 x4 x5" },
{ "id": "1_4", "foo_t": "a1 a2 a3 a4 a5                   tail" },
{ "id": "1_5", "foo_t": "a1 a2 a3 a4 a5                   tail" },
]'
...
$ curl -H 'Content-Type: application/json' 'http://localhost:7574/solr/test/update?commit=true' --data-binary '[
{ "id": "2_1", "foo_t": "a1 a2 a3 a4 a5   yyy" },
{ "id": "2_2", "foo_t": "a1 a2 a3 a4 a5   yyy" },
{ "id": "2_3", "foo_t": "a1 a2 a3 a4 a5   yyy" },
{ "id": "2_4", "foo_t": "a1 a2 a3 a4 a5        tail" },
{ "id": "2_5", "foo_t": "a1 a2 a3 a4 a5        tail" },
]'
...
{noformat}
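
As a quick offline cross-check of the rankings claimed in the comments above, here is a minimal Python sketch that recomputes the per-shard and overall rank of 'tail' from the same documents. It does not talk to Solr; it assumes {{foo_t}} is tokenized on whitespace and that ties are broken by term order (the tie-break is an assumption, though it does not affect the rank of 'tail' here). The helper names are made up for illustration.

```python
from collections import Counter

# Documents as indexed in the steps above (only the foo_t values matter here).
SHARD_1 = ["a1 a2 a3 a4 a5 x1 x2 x3 x4 x5"] * 3 + ["a1 a2 a3 a4 a5 tail"] * 2
SHARD_2 = ["a1 a2 a3 a4 a5 yyy"] * 3 + ["a1 a2 a3 a4 a5 tail"] * 2


def doc_counts(docs):
    """Count how many docs contain each term (a term counts once per doc)."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.split()))
    return counts


def ranked_terms(counts):
    """Order terms by count descending; ties broken by term (an assumption)."""
    return sorted(counts, key=lambda term: (-counts[term], term))


def rank_of(term, counts):
    """1-based position of `term` in the ranked list."""
    return ranked_terms(counts).index(term) + 1


shard1 = doc_counts(SHARD_1)
shard2 = doc_counts(SHARD_2)
merged = shard1 + shard2  # exact collection-wide counts

print(rank_of("tail", shard1), rank_of("tail", shard2), rank_of("tail", merged))
# prints: 11 7 6
```

The three numbers correspond to the #11 and #7 per-shard ranks described above, and to the overall rank that the sanity check queries are meant to confirm.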


*Sanity Check Queries*

With an excessive 'limit' or 'overrequest' we can verify that 'tail' is the #6 
ranked term overall (even with refinement explicitly disabled)

{noformat}
$ curl http://localhost:7574/solr/test/select -d 'q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:7,overrequest:100,refine:false}}'
...
  "response":{"numFound":10,"start":0,"maxScore":1.0,"docs":[]
  },
  "facets":{
    "count":10,
    "foo":{
      "buckets":[{
          "val":"a1",
          "count":10},
        {
          "val":"a2",
          "count":10},
        {
          "val":"a3",
          "count":10},
        {
          "val":"a4",
          "count":10},
        {
          "val":"a5",
          "count":10},
        {
          "val":"tail",
          "count":4},
        {
          "val":"x1",
          "count":3}]}}}

$ curl http://localhost:7574/solr/test/select -d 'q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:100,overrequest:0,refine:false}}'
...
  "facets":{
    "count":10,
    "foo":{
      "buckets":[{
          "val":"a1",
          "count":10},
        {
          "val":"a2",
          "count":10},
        {
          "val":"a3",
          "count":10},
        {
          "val":"a4",
          "count":10},
        {
          "val":"a5",
          "count":10},
        {
          "val":"tail",
          "count":4},
        {
          "val":"x1",
          "count":3},
        ...
{noformat}

Likewise, if we query each shard individually (w/ {{distrib=false}}) we confirm 
that the "tail" term shows up in its expected ranking...

{noformat}
$ curl http://localhost:8983/solr/test/select -d 'distrib=false&q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:11}}'
...
      "buckets":[{
          "val":"a1",
          "count":5},
        {
          "val":"a2",
          "count":5},
        {
          "val":"a3",
          "count":5},
        {
          "val":"a4",
          "count":5},
        {
          "val":"a5",
          "count":5},
        {
          "val":"x1",
          "count":3},
        {
          "val":"x2",
          "count":3},
        {
          "val":"x3",
          "count":3},
        {
          "val":"x4",
          "count":3},
        {
          "val":"x5",
          "count":3},
        {
          "val":"tail",
          "count":2}]}}}

$ curl http://localhost:7574/solr/test/select -d 'distrib=false&q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:7}}'
...
      "buckets":[{
          "val":"a1",
          "count":5},
        {
          "val":"a2",
          "count":5},
        {
          "val":"a3",
          "count":5},
        {
          "val":"a4",
          "count":5},
        {
          "val":"a5",
          "count":5},
        {
          "val":"yyy",
          "count":3},
        {
          "val":"tail",
          "count":2}]}}}
{noformat}



*Queries that Fail*

w/refinement, a limit of 6 (plus the implicit default overrequest) should be 
enough to find 'tail' -- but it's not included in the response from this 
query...

{noformat}
$ curl http://localhost:7574/solr/test/select -d 'q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:6,refine:true}}'
...
      "buckets":[{
          "val":"a1",
          "count":10},
        {
          "val":"a2",
          "count":10},
        {
          "val":"a3",
          "count":10},
        {
          "val":"a4",
          "count":10},
        {
          "val":"a5",
          "count":10},
        {
          "val":"x1",
          "count":3}]}}}
{noformat}

Even if we assume the implicit overrequest calculation may be broken, a "limit" 
of 6 + an explicit overrequest of "1" should be enough to discover 'tail' on 
the 2nd shard, and (w/refinement) it should bubble up into the top 6 -- but 
again, this {{limit:6,overrequest:1}} query doesn't find tail...

{noformat}
$ curl http://localhost:7574/solr/test/select -d 'q=*:*&wt=json&rows=0&json.facet={foo:{type:terms,field:foo_t,limit:6,overrequest:1,refine:true}}'
...
      "buckets":[{
          "val":"a1",
          "count":10},
        {
          "val":"a2",
          "count":10},
        {
          "val":"a3",
          "count":10},
        {
          "val":"a4",
          "count":10},
        {
          "val":"a5",
          "count":10},
        {
          "val":"x1",
          "count":3}]}}}
{noformat}

Here are the log messages from each node when the last request 
({{limit:6,overrequest:1,refine:true}}) was executed...

{noformat}
INFO  - 2017-12-07 00:27:37.821; [c:test s:shardY r:core_node4 x:test_shardY_replica_n2] org.apache.solr.core.SolrCore; [test_shardY_replica_n2]  webapp=/solr path=/select params={df=_text_&distrib=false&_facet_={}&fl=id&fl=score&shards.purpose=1048580&start=0&fsv=true&shard.url=http://127.0.1.1:8983/solr/test_shardY_replica_n2/&rows=0&version=2&q=*:*&json.facet={foo:{type:terms,field:foo_t,limit:6,overrequest:1,refine:true}}&NOW=1512606457819&isShard=true&wt=javabin} hits=5 status=0 QTime=0

==> example/cloud/node2/logs/solr.log <==
INFO  - 2017-12-07 00:27:37.821; [c:test s:shardX r:core_node3 x:test_shardX_replica_n1] org.apache.solr.core.SolrCore; [test_shardX_replica_n1]  webapp=/solr path=/select params={df=_text_&distrib=false&_facet_={}&fl=id&fl=score&shards.purpose=1048580&start=0&fsv=true&shard.url=http://127.0.1.1:7574/solr/test_shardX_replica_n1/&rows=0&version=2&q=*:*&json.facet={foo:{type:terms,field:foo_t,limit:6,overrequest:1,refine:true}}&NOW=1512606457819&isShard=true&wt=javabin} hits=5 status=0 QTime=0
INFO  - 2017-12-07 00:27:37.823; [c:test s:shardX r:core_node3 x:test_shardX_replica_n1] org.apache.solr.core.SolrCore; [test_shardX_replica_n1]  webapp=/solr path=/select params={df=_text_&distrib=false&_facet_={"refine":{"foo":{"_l":["x1"]}}}&shards.purpose=2097152&shard.url=http://127.0.1.1:7574/solr/test_shardX_replica_n1/&rows=0&version=2&q=*:*&json.facet={foo:{type:terms,field:foo_t,limit:6,overrequest:1,refine:true}}&NOW=1512606457819&isShard=true&facet=false&wt=javabin} hits=5 status=0 QTime=0
INFO  - 2017-12-07 00:27:37.824; [c:test s:shardX r:core_node3 x:test_shardX_replica_n1] org.apache.solr.core.SolrCore; [test_shardX_replica_n1]  webapp=/solr path=/select params={q=*:*&json.facet={foo:{type:terms,field:foo_t,limit:6,overrequest:1,refine:true}}&rows=0&wt=json} hits=10 status=0 QTime=5
{noformat}


...note that this appears to show:

* an explicit {{"refine"}} request for {{"_l":\["x1"]}} logged by port 7574
** port 7574 doesn't have the term "x1" at all, so it would not have returned 
it in its initial results
* *NO* indication of attempting to refine "x2", "yyy", or "tail"
** this in spite of the fact that they should all have been in the "top 6+1" 
from one shard, with counts making them competitive in the final results

What strikes me as most odd is that even if there was some sort of "off by 
one" error preventing "x2" & "tail" (which should have been the "last" bucket 
from each of their respective shards) from being refined, "yyy" would have had 
the exact same count, and been in the exact same (shard specific) bucket as 
"x1" -- so why isn't there a request to port #8983 to refine it?!  How is it 
different from "x1" ???
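
To make the expectation above concrete, here is a small Python sketch of the first phase as described in this comment: each shard reports its top "limit + overrequest" terms, and any term reported by one shard but not the other is one I would expect to see a refinement request for. This only models the expectation, not Solr's actual refinement code (which I don't claim to understand); the per-shard counts are copied from the {{distrib=false}} responses above, and breaking ties by term order is an assumption.

```python
from collections import Counter

LIMIT, OVERREQUEST = 6, 1

# Per-shard doc counts, keyed by the port that hosts each shard.
SHARD_COUNTS = {
    "8983": Counter({"a1": 5, "a2": 5, "a3": 5, "a4": 5, "a5": 5,
                     "x1": 3, "x2": 3, "x3": 3, "x4": 3, "x5": 3, "tail": 2}),
    "7574": Counter({"a1": 5, "a2": 5, "a3": 5, "a4": 5, "a5": 5,
                     "yyy": 3, "tail": 2}),
}


def top_terms(counts, n):
    """Top n terms by count descending; ties broken by term (an assumption)."""
    return sorted(counts, key=lambda term: (-counts[term], term))[:n]


def first_phase(shard_counts, n):
    """Map each shard to the {term: count} buckets it would report."""
    return {shard: {term: counts[term] for term in top_terms(counts, n)}
            for shard, counts in shard_counts.items()}


def partially_seen(phase1):
    """Map each term missing from some shard's buckets to those shards."""
    all_terms = set().union(*phase1.values())
    missing = {}
    for term in sorted(all_terms):
        shards = sorted(s for s, buckets in phase1.items() if term not in buckets)
        if shards:
            missing[term] = shards
    return missing


phase1 = first_phase(SHARD_COUNTS, LIMIT + OVERREQUEST)
print(partially_seen(phase1))
# prints: {'tail': ['8983'], 'x1': ['7574'], 'x2': ['7574'], 'yyy': ['8983']}
```

Under this model, four terms have incomplete counts after the first phase, yet the logs show a refinement request for only one of them ("x1", sent to port 7574).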



> json.facet refinement fails to bubble up some long tail (overrequested) terms?
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-11733
>                 URL: https://issues.apache.org/jira/browse/SOLR-11733
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>
> Something wonky is happening with {{json.facet}} refinement.
> "Long Tail" terms that may not be in the "top n" on every shard, but are in 
> the "top n + overrequest" for at least 1 shard aren't getting refined and 
> included in the aggregated response in some cases.
> I don't understand the code enough to explain this, but I have some steps to 
> reproduce that I'll post in a comment shortly


