[ https://issues.apache.org/jira/browse/SOLR-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090909#comment-14090909 ]
Erick Erickson commented on SOLR-6314:
--------------------------------------

bq: If you want to dedup facet parameters for some reason, then it should probably be done in the faceting code.

Yeah, that's exactly what's making me uncomfortable about the patch: it's at such a low level that it affects _everything_. Unintended consequences and all that.

OTOH, what's the case for allowing dups? Do you have specific cases where that's good, or is your comment more of a statement that we shouldn't restrict future possibilities just because I'm not sufficiently imaginative ;) ? I'm really having a tough time imagining scenarios where allowing dups is useful, and I can come up with scenarios where allowing dups is harmful (imagine multiple expensive, identical fq clauses with cache=false, for instance) that would be caught here.

Hmmm, a WARN-level log message is indicated for dups no matter what, I think. The counter-argument is that users should be free to shoot themselves in the foot if they want to. The counter-counter-argument is that when we identify potential traps we should do something about them if we can.

What do you think about this alternative? (Note, I'm not proposing it as much as throwing it out for discussion.) Leave the dup-detection where it is and log a WARN-level message when dups are detected, and move the actual de-duping out to the faceting code. Then de-dupe on a case-by-case basis as situations arise.

Where this started was that the exact same query over the exact same data set returns different results in sharded and non-sharded situations. The results have the same information, just repeated in the single-shard case. Which means that somehow the sharded code manages to ignore the extra entries. I'll look at how in a bit. At any rate, the sharded case manages to avoid returning the data multiple times, so either there's code in there specifically to deal with this or it's happening by chance, which is its own gotcha.
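To make the alternative above concrete, here is a rough sketch of the split between detecting dups (and warning) and de-duping only where a component asks for it. This is not the patch and not Solr code: the class name `DuplicateParamCheck` and both method names are made up for illustration, and it uses `java.util.logging` (where the level is called WARNING) only so that it runs without any dependencies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical helper for discussion only; nothing here is an existing Solr API.
final class DuplicateParamCheck {
    private static final Logger LOG =
        Logger.getLogger(DuplicateParamCheck.class.getName());

    /**
     * Low-level step: report values that occur more than once for a parameter
     * and log a warning, but leave the parameter values themselves untouched.
     */
    static Map<String, Integer> findDuplicates(String paramName, List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        // Keep only the values that were actually repeated.
        counts.values().removeIf(count -> count < 2);
        if (!counts.isEmpty()) {
            LOG.warning("Duplicate values for parameter '" + paramName + "': " + counts);
        }
        return counts;
    }

    /**
     * Case-by-case step: a component (e.g. faceting) that wants de-duped input
     * calls this itself. First-seen order is preserved.
     */
    static List<String> dedupe(List<String> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    public static void main(String[] args) {
        List<String> facetFields =
            List.of("f0_ws", "f0_ws", "f1_ws", "f1_ws", "f1_ws", "f2_ws");
        System.out.println(findDuplicates("facet.field", facetFields));
        System.out.println(dedupe(facetFields));
    }
}
```

With something shaped like this, the warning fires for every parameter (a repeated fq would be flagged too), while only the callers that opt in to `dedupe` see their input changed.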
I've seen some very large queries out in the wild, and in many cases it's hard to spot things like this, so logging a message would help users figure out that their (perhaps machine-generated) code is doing things they _probably_ don't want.

So this is a long-winded way of saying "Hell, I don't know". My _slight_ preference here would be to dedupe as it's being done in this patch (and log warnings when doing so). It just feels "more correct" and may prevent weird behavior in the future. But I'm not adamant about that; if the general consensus is that doing this on a case-by-case basis is a better idea, I can make it so for the facet case.

> Multi-threaded facet counts differ when SolrCloud has >1 shard
> --------------------------------------------------------------
>
>                 Key: SOLR-6314
>                 URL: https://issues.apache.org/jira/browse/SOLR-6314
>             Project: Solr
>          Issue Type: Bug
>          Components: SearchComponents - other, SolrCloud
>    Affects Versions: 5.0
>            Reporter: Vamsee Yarlagadda
>            Assignee: Erick Erickson
>         Attachments: SOLR-6314.patch
>
>
> I am trying to work with multi-threaded faceting on SolrCloud and in the
> process i was hit by some issues.
> I am currently running the below upstream test on different SolrCloud
> configurations and i am getting a different result set per configuration.
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/test/org/apache/solr/request/TestFaceting.java#L654 > Setup: > - *Indexed 50 docs into SolrCloud.* > - *If the SolrCloud has only 1 shard, the facet field query has the below > output (which matches with the expected upstream test output - # facet fields > ~ 50).* > {code} > $ curl > "http://localhost:8983/solr/collection1/select?facet=true&fl=id&indent=true&q=id%3A*&facet.limit=-1&facet.threads=1000&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&rows=1&wt=xml" > <?xml version="1.0" encoding="UTF-8"?> > <response> > <lst name="responseHeader"> > <int name="status">0</int> > <int name="QTime">21</int> > <lst name="params"> > <str name="facet">true</str> > <str name="fl">id</str> > <str name="indent">true</str> > <str name="q">id:*</str> > <str name="facet.limit">-1</str> > <str name="facet.threads">1000</str> > <arr name="facet.field"> > <str>f0_ws</str> > <str>f0_ws</str> > <str>f0_ws</str> > <str>f0_ws</str> > <str>f0_ws</str> > <str>f1_ws</str> > <str>f1_ws</str> > <str>f1_ws</str> > <str>f1_ws</str> > <str>f1_ws</str> > <str>f2_ws</str> > <str>f2_ws</str> > 
<str>f2_ws</str> > <str>f2_ws</str> > <str>f2_ws</str> > <str>f3_ws</str> > <str>f3_ws</str> > <str>f3_ws</str> > <str>f3_ws</str> > <str>f3_ws</str> > <str>f4_ws</str> > <str>f4_ws</str> > <str>f4_ws</str> > <str>f4_ws</str> > <str>f4_ws</str> > <str>f5_ws</str> > <str>f5_ws</str> > <str>f5_ws</str> > <str>f5_ws</str> > <str>f5_ws</str> > <str>f6_ws</str> > <str>f6_ws</str> > <str>f6_ws</str> > <str>f6_ws</str> > <str>f6_ws</str> > <str>f7_ws</str> > <str>f7_ws</str> > <str>f7_ws</str> > <str>f7_ws</str> > <str>f7_ws</str> > <str>f8_ws</str> > <str>f8_ws</str> > <str>f8_ws</str> > <str>f8_ws</str> > <str>f8_ws</str> > <str>f9_ws</str> > <str>f9_ws</str> > <str>f9_ws</str> > <str>f9_ws</str> > <str>f9_ws</str> > </arr> > <str name="wt">xml</str> > <str name="rows">1</str> > </lst> > </lst> > <result name="response" numFound="50" start="0"> > <doc> > <float name="id">0.0</float></doc> > </result> > <lst name="facet_counts"> > <lst name="facet_queries"/> > <lst name="facet_fields"> > <lst name="f0_ws"> > <int name="zero_1">25</int> > <int name="zero_2">25</int> > </lst> > <lst name="f0_ws"> > <int name="zero_1">25</int> > <int name="zero_2">25</int> > </lst> > <lst name="f0_ws"> > <int name="zero_1">25</int> > <int name="zero_2">25</int> > </lst> > <lst name="f0_ws"> > <int name="zero_1">25</int> > <int name="zero_2">25</int> > </lst> > <lst name="f0_ws"> > <int name="zero_1">25</int> > <int name="zero_2">25</int> > </lst> > <lst name="f1_ws"> > <int name="one_1">33</int> > <int name="one_3">17</int> > </lst> > <lst name="f1_ws"> > <int name="one_1">33</int> > <int name="one_3">17</int> > </lst> > <lst name="f1_ws"> > <int name="one_1">33</int> > <int name="one_3">17</int> > </lst> > <lst name="f1_ws"> > <int name="one_1">33</int> > <int name="one_3">17</int> > </lst> > <lst name="f1_ws"> > <int name="one_1">33</int> > <int name="one_3">17</int> > </lst> > <lst name="f2_ws"> > <int name="two_1">37</int> > <int name="two_4">13</int> > </lst> > <lst name="f2_ws"> > 
<int name="two_1">37</int> > <int name="two_4">13</int> > </lst> > <lst name="f2_ws"> > <int name="two_1">37</int> > <int name="two_4">13</int> > </lst> > <lst name="f2_ws"> > <int name="two_1">37</int> > <int name="two_4">13</int> > </lst> > <lst name="f2_ws"> > <int name="two_1">37</int> > <int name="two_4">13</int> > </lst> > <lst name="f3_ws"> > <int name="three_1">40</int> > <int name="three_5">10</int> > </lst> > <lst name="f3_ws"> > <int name="three_1">40</int> > <int name="three_5">10</int> > </lst> > <lst name="f3_ws"> > <int name="three_1">40</int> > <int name="three_5">10</int> > </lst> > <lst name="f3_ws"> > <int name="three_1">40</int> > <int name="three_5">10</int> > </lst> > <lst name="f3_ws"> > <int name="three_1">40</int> > <int name="three_5">10</int> > </lst> > <lst name="f4_ws"> > <int name="four_1">41</int> > <int name="four_6">9</int> > </lst> > <lst name="f4_ws"> > <int name="four_1">41</int> > <int name="four_6">9</int> > </lst> > <lst name="f4_ws"> > <int name="four_1">41</int> > <int name="four_6">9</int> > </lst> > <lst name="f4_ws"> > <int name="four_1">41</int> > <int name="four_6">9</int> > </lst> > <lst name="f4_ws"> > <int name="four_1">41</int> > <int name="four_6">9</int> > </lst> > <lst name="f5_ws"> > <int name="five_1">42</int> > <int name="five_7">8</int> > </lst> > <lst name="f5_ws"> > <int name="five_1">42</int> > <int name="five_7">8</int> > </lst> > <lst name="f5_ws"> > <int name="five_1">42</int> > <int name="five_7">8</int> > </lst> > <lst name="f5_ws"> > <int name="five_1">42</int> > <int name="five_7">8</int> > </lst> > <lst name="f5_ws"> > <int name="five_1">42</int> > <int name="five_7">8</int> > </lst> > <lst name="f6_ws"> > <int name="six_1">43</int> > <int name="six_8">7</int> > </lst> > <lst name="f6_ws"> > <int name="six_1">43</int> > <int name="six_8">7</int> > </lst> > <lst name="f6_ws"> > <int name="six_1">43</int> > <int name="six_8">7</int> > </lst> > <lst name="f6_ws"> > <int name="six_1">43</int> > <int 
name="six_8">7</int> > </lst> > <lst name="f6_ws"> > <int name="six_1">43</int> > <int name="six_8">7</int> > </lst> > <lst name="f7_ws"> > <int name="seven_1">44</int> > <int name="seven_9">6</int> > </lst> > <lst name="f7_ws"> > <int name="seven_1">44</int> > <int name="seven_9">6</int> > </lst> > <lst name="f7_ws"> > <int name="seven_1">44</int> > <int name="seven_9">6</int> > </lst> > <lst name="f7_ws"> > <int name="seven_1">44</int> > <int name="seven_9">6</int> > </lst> > <lst name="f7_ws"> > <int name="seven_1">44</int> > <int name="seven_9">6</int> > </lst> > <lst name="f8_ws"> > <int name="eight_1">45</int> > <int name="eight_10">5</int> > </lst> > <lst name="f8_ws"> > <int name="eight_1">45</int> > <int name="eight_10">5</int> > </lst> > <lst name="f8_ws"> > <int name="eight_1">45</int> > <int name="eight_10">5</int> > </lst> > <lst name="f8_ws"> > <int name="eight_1">45</int> > <int name="eight_10">5</int> > </lst> > <lst name="f8_ws"> > <int name="eight_1">45</int> > <int name="eight_10">5</int> > </lst> > <lst name="f9_ws"> > <int name="nine_1">45</int> > <int name="nine_11">5</int> > </lst> > <lst name="f9_ws"> > <int name="nine_1">45</int> > <int name="nine_11">5</int> > </lst> > <lst name="f9_ws"> > <int name="nine_1">45</int> > <int name="nine_11">5</int> > </lst> > <lst name="f9_ws"> > <int name="nine_1">45</int> > <int name="nine_11">5</int> > </lst> > <lst name="f9_ws"> > <int name="nine_1">45</int> > <int name="nine_11">5</int> > </lst> > </lst> > <lst name="facet_dates"/> > <lst name="facet_ranges"/> > </lst> > </response> > {code} > - *Now, if a create a new collection with 2 shards (>1 shard SolrCloud), the > same above query results in a different output. 
(# facet fields ~ 10 ; > Expected 50)* > {code} > $ curl > "http://localhost:8983/solr/collection1/select?facet=true&fl=id&indent=true&q=id%3A*&facet.limit=-1&facet.threads=1000&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f0_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f1_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f2_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f3_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f4_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f5_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f6_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f7_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f8_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&facet.field=f9_ws&rows=1&wt=xml" > > <?xml version="1.0" encoding="UTF-8"?> > <response> > <lst name="responseHeader"> > <int name="status">0</int> > <int name="QTime">31</int> > <lst name="params"> > <str name="facet">true</str> > <str name="fl">id</str> > <str name="indent">true</str> > <str name="q">id:*</str> > <str name="facet.limit">-1</str> > <str name="facet.threads">1000</str> > <arr name="facet.field"> > <str>f0_ws</str> > <str>f0_ws</str> > <str>f0_ws</str> > <str>f0_ws</str> > <str>f0_ws</str> > <str>f1_ws</str> > <str>f1_ws</str> > <str>f1_ws</str> > <str>f1_ws</str> > <str>f1_ws</str> > <str>f2_ws</str> > <str>f2_ws</str> > <str>f2_ws</str> > <str>f2_ws</str> > <str>f2_ws</str> > <str>f3_ws</str> > <str>f3_ws</str> > <str>f3_ws</str> > <str>f3_ws</str> > <str>f3_ws</str> > <str>f4_ws</str> > <str>f4_ws</str> > <str>f4_ws</str> > <str>f4_ws</str> > <str>f4_ws</str> > <str>f5_ws</str> > <str>f5_ws</str> > 
<str>f5_ws</str> > <str>f5_ws</str> > <str>f5_ws</str> > <str>f6_ws</str> > <str>f6_ws</str> > <str>f6_ws</str> > <str>f6_ws</str> > <str>f6_ws</str> > <str>f7_ws</str> > <str>f7_ws</str> > <str>f7_ws</str> > <str>f7_ws</str> > <str>f7_ws</str> > <str>f8_ws</str> > <str>f8_ws</str> > <str>f8_ws</str> > <str>f8_ws</str> > <str>f8_ws</str> > <str>f9_ws</str> > <str>f9_ws</str> > <str>f9_ws</str> > <str>f9_ws</str> > <str>f9_ws</str> > </arr> > <str name="wt">xml</str> > <str name="rows">1</str> > </lst> > </lst> > <result name="response" numFound="50" start="0" maxScore="1.0"> > <doc> > <float name="id">2.0</float></doc> > </result> > <lst name="facet_counts"> > <lst name="facet_queries"/> > <lst name="facet_fields"> > <lst name="f0_ws"> > <int name="zero_1">25</int> > <int name="zero_2">25</int> > </lst> > <lst name="f1_ws"> > <int name="one_1">33</int> > <int name="one_3">17</int> > </lst> > <lst name="f2_ws"> > <int name="two_1">37</int> > <int name="two_4">13</int> > </lst> > <lst name="f3_ws"> > <int name="three_1">40</int> > <int name="three_5">10</int> > </lst> > <lst name="f4_ws"> > <int name="four_1">41</int> > <int name="four_6">9</int> > </lst> > <lst name="f5_ws"> > <int name="five_1">42</int> > <int name="five_7">8</int> > </lst> > <lst name="f6_ws"> > <int name="six_1">43</int> > <int name="six_8">7</int> > </lst> > <lst name="f7_ws"> > <int name="seven_1">44</int> > <int name="seven_9">6</int> > </lst> > <lst name="f8_ws"> > <int name="eight_1">45</int> > <int name="eight_10">5</int> > </lst> > <lst name="f9_ws"> > <int name="nine_1">45</int> > <int name="nine_11">5</int> > </lst> > </lst> > <lst name="facet_dates"/> > <lst name="facet_ranges"/> > </lst> > </response> > {code} > This behavior is quite strange as it is being dependent on the number of > shards in SolrCloud. It would be great if someone can shed some light on this? 