[ https://issues.apache.org/jira/browse/SOLR-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoss Man updated SOLR-3793:
---------------------------

Description:
Günter Hipler reported on the solr-user mailing list that he was seeing inconsistencies in facet counts compared to the numFound when drilling down onto those facets (using "fq") - in particular: when adding an "fq" such as `fq={!term+f%3DnavNetwork}nebis`, the resulting numFound was higher than the number of docs reported by the facet constraint for nebis in the base request.

I've been able to trivially reproduce this using the example data from Solr 4.0-BETA (details in comment to follow)

Important things to note from Günter's email thread with his assessment of the problem...

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201208.mbox/%3ccam_u7jfdpnrgfmwmntnachcdcjw4yb-rlbbvrw_wp_jdob_...@mail.gmail.com%3E

bq. The behaviour is not consistent. Some of the facets provide the correct result, some not. What I can't say for sure: The behaviour was correct (if I'm not wrong) once the whole index was newly created. After running some updates I got these results.

bq. I'm going to setup a new index with the Lucene 4.0 version from March (to be more exactly: it's version 4.0-2012-03-09_11-29-20) to see what are the results even in case of frequent updates ... the version deployed in march doesn't contain the error I now come across in Beta4.0

Fix Version/s: 4.0

Steps to reproduce...

{panel}
1) Start with a clean install of 4.0-BETA, containing a completely empty example index, and run solr...

{noformat}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example$ ls -a solr/collection1/data/
.  ..
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example$ java -jar start.jar
2012-09-05 12:59:56.596:INFO:oejs.Server:jetty-8.1.2.v20120308
...
{noformat}

2) In another window, index all sample documents...

{noformat}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ java -jar post.jar *.xml
...
{noformat}

3) Observe the results of a simple query faceting on "cat", as well as the results of filtering on one of those cat values...
{noformat}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&facet=true&facet.field=cat&facet.mincount=1&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "cat":[
        "electronics",3,
        "connector",2,
        "music",1]},
    "facet_dates":{},
    "facet_ranges":{}}}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&facet=true&facet.field=cat&facet.mincount=1&wt=json&indent=true&fq=cat:electronics'
{
  "responseHeader":{
    "status":0,
    "QTime":8},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "cat":[
        "electronics",3,
        "connector",2,
        "music",1]},
    "facet_dates":{},
    "facet_ranges":{}}}
{noformat}

4) Re-index some of the sample documents, forcing a new segment to be created, as well as some deletions...

{noformat}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ java -jar post.jar ipod_*S ...
{noformat}

5) observe that while the "simple" results are unchanged, the filtered request now includes duplicate (deleted?) documents in the result set...
{noformat}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&facet=true&facet.field=cat&facet.mincount=1&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "cat":[
        "electronics",3,
        "connector",2,
        "music",1]},
    "facet_dates":{},
    "facet_ranges":{}}}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&facet=true&facet.field=cat&facet.mincount=1&wt=json&indent=true&fq=cat:electronics'
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "response":{"numFound":6,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "cat":[
        "electronics",3,
        "connector",2,
        "music",1]},
    "facet_dates":{},
    "facet_ranges":{}}}
{noformat}
{panel}

Interesting things to note...

1) Stopping & restarting jetty does not make the problem go away, which initially suggested to me that the problem is not related to any sort of stale-caching of filters/docsets. However, if you stop & restart jetty, or even just issue a commit, and then re-issue the same two requests in reverse order, then no duplicates are included. Do another commit, send the requests in the (original) problematic order, and the problem re-appears...
{noformat}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&facet=true&facet.field=cat&facet.mincount=1&wt=json&indent=true&fq=cat:electronics'
{
  "responseHeader":{
    "status":0,
    "QTime":6},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "cat":[
        "electronics",3,
        "connector",2,
        "music",1]},
    "facet_dates":{},
    "facet_ranges":{}}}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&facet=true&facet.field=cat&facet.mincount=1&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "cat":[
        "electronics",3,
        "connector",2,
        "music",1]},
    "facet_dates":{},
    "facet_ranges":{}}}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ java -Ddata=args -jar post.jar '<commit/>'
SimplePostTool version 1.5
POSTing args to http://localhost:8983/solr/update..
COMMITting Solr index changes to http://localhost:8983/solr/update..
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&facet=true&facet.field=cat&facet.mincount=1&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "cat":[
        "electronics",3,
        "connector",2,
        "music",1]},
    "facet_dates":{},
    "facet_ranges":{}}}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&facet=true&facet.field=cat&facet.mincount=1&wt=json&indent=true&fq=cat:electronics'
{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "response":{"numFound":6,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "cat":[
        "electronics",6,
        "connector",4,
        "music",1]},
    "facet_dates":{},
    "facet_ranges":{}}}
{noformat}

2) Optimizing seems to eliminate the problem completely, suggesting that the root cause is definitely related to multiple segments containing deletions.

3) Bizarrely, the problem seems to be specific to faceting: using the same steps, with the same simple queries & fq, but leaving out the facet params, the duplicate documents are not returned...

{noformat}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ java -jar post.jar *.xml
...
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":13},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  }}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&wt=json&indent=true&fq=cat:electronics'
{
  "responseHeader":{
    "status":0,
    "QTime":10},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  }}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ java -jar post.jar ipod_*
...
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  }}
hossman@frisbee:~/tmp/apache-solr-4.0.0-BETA/solr/example/exampledocs$ curl 'http://localhost:8983/solr/select?echoParams=none&q=ipod&rows=5&fl=id&wt=json&indent=true&fq=cat:electronics'
{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"IW-02"},
      {
        "id":"F8V7067-APL-KIT"},
      {
        "id":"MA147LL/A"}]
  }}
{noformat}

> duplicate (deleted) documents included in result set when using field faceting with fq
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-3793
>                 URL: https://issues.apache.org/jira/browse/SOLR-3793
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.0-BETA
>            Reporter: Hoss Man
>            Priority: Blocker
>             Fix For: 4.0
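
The comparison done by eye in the transcripts above can also be scripted. The following is a minimal sketch and not part of the original report: it assumes both responses were fetched with wt=json and use the flat facet list layout shown in the output above, and the helper names flat_facets and check_drilldown are hypothetical. It only inspects the returned page of docs (rows=5 here), so duplicates beyond that page would go unnoticed. The sample data is trimmed from the step 5 responses.

```python
import json
from collections import Counter


def flat_facets(pairs):
    """Turn a flat ["term", count, "term", count, ...] facet list into a dict."""
    return dict(zip(pairs[0::2], pairs[1::2]))


def check_drilldown(base, filtered, field, value):
    """Compare a base response with the same request plus fq=field:value.

    Returns a list of human-readable problems; an empty list means the
    two responses are consistent with each other.
    """
    problems = []
    facets = flat_facets(base["facet_counts"]["facet_fields"][field])
    expected = facets.get(value, 0)
    found = filtered["response"]["numFound"]
    if found != expected:
        problems.append(
            f"numFound={found} but base facet count for {value!r} is {expected}"
        )
    # Only the returned page of docs is checked for repeated ids.
    ids = Counter(doc["id"] for doc in filtered["response"]["docs"])
    dups = sorted(doc_id for doc_id, n in ids.items() if n > 1)
    if dups:
        problems.append(f"duplicate ids in result page: {dups}")
    return problems


# Trimmed copies of the two step 5 responses (base request, then with fq).
base = json.loads("""
{"response": {"numFound": 3, "start": 0, "docs": [
    {"id": "IW-02"}, {"id": "F8V7067-APL-KIT"}, {"id": "MA147LL/A"}]},
 "facet_counts": {"facet_fields": {
    "cat": ["electronics", 3, "connector", 2, "music", 1]}}}
""")
filtered = json.loads("""
{"response": {"numFound": 6, "start": 0, "docs": [
    {"id": "IW-02"}, {"id": "IW-02"},
    {"id": "F8V7067-APL-KIT"}, {"id": "F8V7067-APL-KIT"},
    {"id": "MA147LL/A"}]},
 "facet_counts": {"facet_fields": {
    "cat": ["electronics", 3, "connector", 2, "music", 1]}}}
""")

for problem in check_drilldown(base, filtered, "cat", "electronics"):
    print(problem)
# numFound=6 but base facet count for 'electronics' is 3
# duplicate ids in result page: ['F8V7067-APL-KIT', 'IW-02']
```

Keeping the check as a pure function over already-parsed responses means it can be run against saved output from any of the request orderings described above without a live Solr instance.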