[
https://issues.apache.org/jira/browse/SOLR-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445210#comment-13445210
]
Per Steffensen commented on SOLR-3765:
--------------------------------------
No problem. Glad to help.
We will not be working on a fix. We will do a workaround in our own
application, so that we do not get id clashes across collections. We need to
control ids very strictly in order for our
fail-on-unique-key-constraint-violation to serve its purpose correctly.
Basically we just prefix our ids with the name of the collection. This still
gives us unique-key clashes within a collection, but documents whose ids are
the same except for the collection-name part are no longer treated as
duplicates, so all of them are returned/counted.
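
A minimal sketch of the prefixing workaround described above, in Python. The
separator and the helper names (SEPARATOR, prefixed_id, split_id) are
hypothetical and only for illustration; the comment does not say how the
prefix is actually formatted.

```python
from typing import Tuple

# Hypothetical separator; assumed never to occur in a collection name.
SEPARATOR = ":"


def prefixed_id(collection: str, doc_id: str) -> str:
    """Build an id that cannot clash with ids from other collections."""
    if SEPARATOR in collection:
        raise ValueError("collection name must not contain the separator")
    return f"{collection}{SEPARATOR}{doc_id}"


def split_id(prefixed: str) -> Tuple[str, str]:
    """Recover (collection, original id) from a prefixed id."""
    collection, _, doc_id = prefixed.partition(SEPARATOR)
    return collection, doc_id


print(prefixed_id("x", "42"))  # x:42
print(prefixed_id("y", "42"))  # y:42
print(split_id("x:42"))        # ('x', '42')
```

Two documents with the same original id in the same collection still map to
the same unique key, so the unique-key constraint keeps working there, while
the same original id in collections x and y yields two distinct keys.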
> Wrong handling of documents with same id in cross collection searches
> ---------------------------------------------------------------------
>
> Key: SOLR-3765
> URL: https://issues.apache.org/jira/browse/SOLR-3765
> Project: Solr
> Issue Type: Bug
> Components: search, SolrCloud
> Affects Versions: 4.0
> Environment: Self-built version of Solr from the 4.x branch (revision )
> Reporter: Per Steffensen
> Labels: collections, inconsistency, numFound, search
>
> Dialog with myself from the solr-users mailing list:
> Per Steffensen wrote:
> {quote}
> Hi
> Due to what we have seen in recent tests I am in doubt about how Solr search
> is actually supposed to behave
> * Searching with "distrib=true&q=*:*&rows=10&collection=x,y,z&sort=timestamp
> asc"
> ** Is Solr supposed to return the 10 documents with the lowest timestamp
> across all documents in all slices of collection x, y and z, or is it
> supposed to just pick 10 random documents from those slices and just sort
> those 10 randomly selected documents?
> ** Put another way - is this search supposed to be consistent, returning
> exactly the same set of documents when performed several times (with no
> documents updated between consecutive searches)?
> {quote}
> Fortunately I believe the answer is that it ought to "return the 10
> documents with the lowest timestamp across all documents in all slices of
> collection x, y and z". The reason I asked was that I got different
> responses for consecutive similar requests. Now I believe it can be explained
> by the bug described below. I guess that when you do cross-collection/shard
> searches, the "request-handling" Solr forwards the query to all involved
> shards simultaneously and merges sub-results into the final result as they
> are returned from the shards. Because of the "consider documents with same id
> as the same document even though they come from different collections"-bug it
> is kind of random (depending on which shards respond first/last), for a given
> id, which collection the document with that specific id is taken from. And if
> documents with the same id from different collections have different
> timestamps, it is random where that document ends up in the final sorted
> result.
> So I believe this inconsistency can be explained by the bug described below.
> {quote}
> * A search returns a "numFound"-field telling how many documents in total
> match the search criteria, even though not all those documents are returned
> by the search. It is a crazy question to ask, but I will do it anyway because
> we actually see a problem with this. Isn't it correct that two searches which
> differ only in the "rows"-number (documents to be returned) should always
> return the same value for "numFound"?
> {quote}
> Well I found out myself what the problem is (or seems to be) - see:
> http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-td2460645.html
> http://lucene.472066.n3.nabble.com/numFound-inconsistent-for-different-rows-param-td3997269.html
> http://lucene.472066.n3.nabble.com/Solr-v3-5-0-numFound-changes-when-paging-through-results-on-8-shard-cluster-td3990400.html
> Until 4.0 this "bug" could be "ignored" because it was ok for a cross-shard
> search to consider documents with identical ids as duplicates and therefore
> only return/count one of them. It is still, in 4.0, ok within the same
> collection, but across collections identical ids should not be considered
> duplicates and should not reduce the number of documents returned/counted.
> So I believe this "feature" has now become a bug in 4.0 when it comes to
> cross-collection searches.
> {quote}
> Thanks!
> Regards, Steff
> {quote}
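
The two symptoms described in the issue above can be reproduced with a small
toy model. This is only a sketch under assumptions: it is not Solr's merge
code, and the document tuples and the names shard_response, merge_by_id and
search are made up for illustration. It assumes that every collection reports
its own hit count plus its top "rows" documents, and that the merging node
treats equal ids as the same document, keeping whichever copy arrived first.

```python
from typing import Dict, List, Tuple

# Hypothetical document model: (collection, id, timestamp).
Doc = Tuple[str, str, int]
Response = Tuple[int, List[Doc]]


def shard_response(docs: List[Doc], rows: int) -> Response:
    """A collection reports its total hit count and its top `rows` docs."""
    return len(docs), sorted(docs, key=lambda d: d[2])[:rows]


def merge_by_id(responses: List[Response], rows: int) -> Response:
    """Merge responses in arrival order, treating equal ids as one document."""
    num_found = 0
    seen: Dict[str, Doc] = {}
    for shard_num_found, docs in responses:
        num_found += shard_num_found
        for doc in docs:
            if doc[1] in seen:
                # A "duplicate" is only noticed if it was actually returned.
                num_found -= 1
            else:
                # The first response to arrive wins for this id.
                seen[doc[1]] = doc
    return num_found, sorted(seen.values(), key=lambda d: d[2])[:rows]


def search(arrival_order: List[List[Doc]], rows: int) -> Response:
    """Query every collection, then merge in the order responses arrive."""
    return merge_by_id([shard_response(c, rows) for c in arrival_order], rows)


# Id "1" exists in both collections, with different timestamps.
x = [("x", "1", 10), ("x", "2", 30)]
y = [("y", "1", 50), ("y", "3", 20)]

print(search([x, y], rows=10))  # (3, [('x', '1', 10), ('y', '3', 20), ('x', '2', 30)])
print(search([y, x], rows=10))  # (3, [('y', '3', 20), ('x', '2', 30), ('y', '1', 50)])
print(search([x, y], rows=1))   # (4, [('x', '1', 10)])
```

In this model the first two calls differ only in which collection answers
first, yet id "1" lands at a different position in the sorted result. The
third call shows numFound changing with "rows": with rows=1 the clash on id
"1" is never seen by the merger, so nothing is subtracted from the total.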