[jira] [Commented] (SOLR-3765) Wrong handling of documents with same id in cross collection searches

Per Steffensen (JIRA) Thu, 30 Aug 2012 12:18:09 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445210#comment-13445210
 ]


Per Steffensen commented on SOLR-3765:
--------------------------------------

No problem. Glad to help. 

We will not be working on a fix. We will do a workaround in our own 
application, so that we will not have id-clash across collections. We need to 
control ids very strictly in order for our 
fail-on-unique-key-constraint-violaton to serve its purpose correctly. 
Basically we just prefix our ids with the name of the collection - will still 
provide unique-key-clash within the collection but will not prevent documents 
with same id (except for the collection-name-part) from being returned/counted.
                
> Wrong handling of documents with same id in cross collection searches
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3765
>                 URL: https://issues.apache.org/jira/browse/SOLR-3765
>             Project: Solr
>          Issue Type: Bug
>          Components: search, SolrCloud
>    Affects Versions: 4.0
>         Environment: Self-build version of Solr fra 4.x branch (revision )
>            Reporter: Per Steffensen
>              Labels: collections, inconsistency, numFound, search
>
> Dialog with myself from solr-users mailing list:
> Per Steffensen skrev:
> {quote} 
> Hi
> Due to what we have seen in recent tests I got in doubt how Solr search is 
> actually supposed to behave
> * Searching with "distrib=true&q=*:*&rows=10&collection=x,y,z&sort=timestamp 
> asc"
> ** Is Solr supposed to return the 10 documents with the lowest timestamp 
> across all documents in all slices of collection x, y and z, or is it 
> supposed to just pick 10 random documents from those slices and just sort 
> those 10 randomly selected documents?
> ** Put in another way - is this search supposed to be consistent, returning 
> exactly the same set of documents when performed several times (no documents 
> are updated between consecutive searches)?
> {quote}
> Fortunately I believe the answer is, that it ought to "return the 10 
> documents with the lowest timestamp across all documents in all slices of 
> collection x, y and Z". The reason I asked was because I got different 
> responses for consecutive simular requests. Now I believe it can be explained 
> by the bug described below. I guess they you do cross-collection/shard 
> searches, the "request-handling" Solr forwards the query to all involved 
> shards simultanious and merges sub-results into the final result as they are 
> returned from the shards. Because of the "consider documents with same id as 
> the same document even though the come from different collections"-bug it is 
> kinda random (depending on which shards responds first/last), for a given id, 
> what collection the document with that specific id is taken from. And if 
> documents with the same id from different collections has different timestamp 
> it is random where that document ends up in the final sorted result.
> So i believe this inconsistency can be explained by the bug described below.
> {quote}
> * A search returns a "numFound"-field telling how many documents all in all 
> matches the search-criteria, even though not all those documents are returned 
> by the search. It is a crazy question to ask, but I will do it anyway because 
> we actually see a problem with this. Isnt it correct that two searches which 
> only differs on the "rows"-number (documents to be returned) should always 
> return the same value for "numFound"?
> {quote}
> Well I found out myself what the problem is (or seems to be) - see:
> http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-td2460645.html
> http://lucene.472066.n3.nabble.com/numFound-inconsistent-for-different-rows-param-td3997269.html
> http://lucene.472066.n3.nabble.com/Solr-v3-5-0-numFound-changes-when-paging-through-results-on-8-shard-cluster-td3990400.html
> Until 4.0 this "bug" could be "ignored" because it was ok for a cross-shards 
> search to consider documents with identical id's as dublets and therefore 
> only returning/counting one of them. It is still, in 4.0, ok within the same 
> collection, but across collections identical id's should not be considered 
> dublicates and should not reduce documents returned/counted. So i believe 
> this "feature" has now become a bug in 4.0 when it comes to cross-collections 
> searches.
> {quote}
> Thanks!
> Regards, Steff
> {quote}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-3765) Wrong handling of documents with same id in cross collection searches

Reply via email to