[ https://issues.apache.org/jira/browse/SOLR-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444061#comment-13444061 ]
Yonik Seeley commented on SOLR-3765:
------------------------------------

Thanks for tracking this down Per, I agree this is a bug for multi-collection searches.

> Wrong handling of documents with same id in cross collection searches
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3765
>                 URL: https://issues.apache.org/jira/browse/SOLR-3765
>             Project: Solr
>          Issue Type: Bug
>          Components: search, SolrCloud
>    Affects Versions: 4.0
>         Environment: Self-built version of Solr from 4.x branch (revision )
>            Reporter: Per Steffensen
>              Labels: collections, inconsistency, numFound, search
>
> Dialog with myself from solr-users mailing list:
> Per Steffensen wrote:
> {quote}
> Hi
> Due to what we have seen in recent tests, I am no longer sure how Solr search
> is actually supposed to behave
> * Searching with "distrib=true&q=*:*&rows=10&collection=x,y,z&sort=timestamp asc"
> ** Is Solr supposed to return the 10 documents with the lowest timestamp
> across all documents in all slices of collections x, y and z, or is it
> supposed to just pick 10 random documents from those slices and just sort
> those 10 randomly selected documents?
> ** Put another way - is this search supposed to be consistent, returning
> exactly the same set of documents when performed several times (no documents
> are updated between consecutive searches)?
> {quote}
> Fortunately I believe the answer is that it ought to "return the 10
> documents with the lowest timestamp across all documents in all slices of
> collections x, y and z". The reason I asked was that I got different
> responses for consecutive similar requests. Now I believe it can be explained
> by the bug described below. I guess that when you do cross-collection/shard
> searches, the "request-handling" Solr forwards the query to all involved
> shards simultaneously and merges sub-results into the final result as they
> are returned from the shards.
> Because of the "consider documents with same id as the same document even
> though they come from different collections" bug, it is more or less random
> (depending on which shards respond first/last), for a given id, which
> collection the document with that specific id is taken from. And if
> documents with the same id from different collections have different
> timestamps, it is random where that document ends up in the final sorted
> result.
> So I believe this inconsistency can be explained by the bug described below.
> {quote}
> * A search returns a "numFound" field telling how many documents in total
> match the search criteria, even though not all of those documents are
> returned by the search. It is a crazy question to ask, but I will ask it
> anyway because we actually see a problem with this. Isn't it correct that two
> searches which differ only in the "rows" number (documents to be returned)
> should always return the same value for "numFound"?
> {quote}
> Well, I found out myself what the problem is (or seems to be) - see:
> http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-td2460645.html
> http://lucene.472066.n3.nabble.com/numFound-inconsistent-for-different-rows-param-td3997269.html
> http://lucene.472066.n3.nabble.com/Solr-v3-5-0-numFound-changes-when-paging-through-results-on-8-shard-cluster-td3990400.html
> Until 4.0 this "bug" could be "ignored" because it was ok for a cross-shard
> search to consider documents with identical ids as duplicates and therefore
> return/count only one of them. It is still, in 4.0, ok within the same
> collection, but across collections identical ids should not be considered
> duplicates and should not reduce the number of documents returned/counted.
> So I believe this "feature" has now become a bug in 4.0 when it comes to
> cross-collection searches.
> {quote}
> Thanks!
> Regards, Steff
> {quote}
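
The behaviour described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Solr's actual merge code: `Hit`, `merge`, `by_id` and `by_collection_and_id` are hypothetical names, each shard is assumed to report its full match count but return only its top `rows` hits, and the merger is assumed to keep the first hit it sees for a given key and to decrement the total for every later hit with the same key.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    collection: str
    doc_id: str
    timestamp: int


def by_id(hit):
    # Deduplication key that ignores the collection (the behaviour reported as a bug).
    return hit.doc_id


def by_collection_and_id(hit):
    # Deduplication key that keeps equal ids from different collections apart.
    return (hit.collection, hit.doc_id)


def merge(shard_responses, rows, key):
    """Toy merge of per-shard results, processed in the order the shards 'respond'."""
    num_found = 0
    seen = {}
    for hits in shard_responses:
        num_found += len(hits)  # the shard reports how many documents matched
        top = sorted(hits, key=lambda h: h.timestamp)[:rows]  # but returns only its top rows
        for hit in top:
            if key(hit) in seen:
                num_found -= 1  # treated as a duplicate of an earlier hit
            else:
                seen[key(hit)] = hit
    merged = sorted(seen.values(), key=lambda h: h.timestamp)[:rows]
    return num_found, merged


# Two collections that both contain a document with id "1", with different timestamps.
x = [Hit("x", "1", 50), Hit("x", "2", 20)]
y = [Hit("y", "1", 10), Hit("y", "3", 30)]

for order_name, order in (("x first", [x, y]), ("y first", [y, x])):
    for rows in (1, 10):
        print(order_name, rows, "by id:", merge(order, rows, by_id))
        print(order_name, rows, "by (collection, id):", merge(order, rows, by_collection_and_id))
```

In this model, keying on the id alone makes both the returned documents and the reported total depend on response order and on `rows`, while keying on (collection, id) gives the same total and the same ordering in every case.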