[jira] [Updated] (SOLR-3765) Wrong handling of documents with same id in cross collection searches

Per Steffensen (JIRA) Tue, 28 Aug 2012 23:40:10 -0700

     [ https://issues.apache.org/jira/browse/SOLR-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Per Steffensen updated SOLR-3765:
---------------------------------

    Description: 
Dialogue with myself from the solr-users mailing list:

Per Steffensen wrote:
{quote} 
Hi

Because of what we have seen in recent tests, I am in doubt about how Solr 
search is actually supposed to behave.
* Searching with "distrib=true&q=*:*&rows=10&collection=x,y,z&sort=timestamp asc"
** Is Solr supposed to return the 10 documents with the lowest timestamp across 
all documents in all slices of collections x, y and z, or is it supposed to just 
pick 10 random documents from those slices and sort only those 10 randomly 
selected documents?
** Put another way: is this search supposed to be consistent, returning 
exactly the same set of documents when performed several times (with no 
documents updated between consecutive searches)?
{quote}

Fortunately I believe the answer is that it ought to "return the 10 documents 
with the lowest timestamp across all documents in all slices of collections x, 
y and z". The reason I asked is that I got different responses for 
consecutive similar requests. I now believe this can be explained by the bug 
described below. I guess that when you do cross-collection/shard searches, the 
"request-handling" Solr forwards the query to all involved shards 
simultaneously and merges the sub-results into the final result as they are 
returned from the shards. Because of the "consider documents with the same id 
as the same document even though they come from different collections" bug, it 
is more or less random (depending on which shards respond first/last) which 
collection the document with a given id is taken from. And if documents 
with the same id from different collections have different timestamps, it is 
random where that document ends up in the final sorted result.

So I believe this inconsistency can be explained by the bug described below.
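
The effect guessed at above can be reproduced with a toy model. This is only a sketch under the assumption that the merging node keeps the first copy of each id it receives; it is not Solr's actual merge code, and `merge_by_id` and the sample documents are made up for illustration.

```python
def merge_by_id(shard_responses, rows):
    """Merge per-shard hits, treating equal ids as the same document.

    shard_responses is a list of (collection, docs) pairs in ARRIVAL order.
    """
    seen = {}
    for collection, docs in shard_responses:
        for doc in docs:
            # The first response that delivers an id wins; later copies
            # with the same id are silently dropped.
            seen.setdefault(doc["id"], {**doc, "collection": collection})
    # Sort the surviving documents by timestamp ascending and cut to `rows`.
    return sorted(seen.values(), key=lambda d: d["timestamp"])[:rows]


# Two collections that both contain a document with id "1",
# but with different timestamps.
x = ("x", [{"id": "1", "timestamp": 10}, {"id": "2", "timestamp": 40}])
y = ("y", [{"id": "1", "timestamp": 30}, {"id": "3", "timestamp": 20}])

first = merge_by_id([x, y], rows=2)   # x answers before y
second = merge_by_id([y, x], rows=2)  # y answers before x

print([(d["id"], d["collection"]) for d in first])   # [('1', 'x'), ('3', 'y')]
print([(d["id"], d["collection"]) for d in second])  # [('3', 'y'), ('1', 'y')]
```

The same query over the same data gives a different ordering purely because the shard responses arrived in a different order.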

{quote}
* A search returns a "numFound" field telling how many documents in total 
match the search criteria, even though not all of those documents are returned 
by the search. It is a crazy question to ask, but I will ask it anyway because 
we actually see a problem with this. Isn't it correct that two searches that 
differ only in the "rows" number (documents to be returned) should always 
return the same value for "numFound"?
{quote}

Well I found out myself what the problem is (or seems to be) - see:
http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-td2460645.html
http://lucene.472066.n3.nabble.com/numFound-inconsistent-for-different-rows-param-td3997269.html
http://lucene.472066.n3.nabble.com/Solr-v3-5-0-numFound-changes-when-paging-through-results-on-8-shard-cluster-td3990400.html

Until 4.0 this "bug" could be "ignored" because it was OK for a cross-shard 
search to consider documents with identical ids as duplicates and therefore 
return/count only one of them. In 4.0 that is still OK within the same 
collection, but across collections identical ids should not be considered 
duplicates and should not reduce the number of documents returned/counted. So I 
believe this "feature" has become a bug in 4.0 when it comes to 
cross-collection searches.
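
A small model of why "numFound" can vary with "rows", and of what keying duplicates on collection plus id would change. It assumes that each shard reports its own match count but ships only its top `rows` hits, and that the merging node subtracts one for every duplicate it notices among the shipped hits. That is a simplification for illustration, not Solr's implementation; `merged_num_found` and both key functions are hypothetical names.

```python
def merged_num_found(shards, rows, key):
    """Return the numFound a merging node would report under this model."""
    total = 0
    seen = set()
    for collection, docs in shards:
        total += len(docs)           # the shard's own match count
        for doc in docs[:rows]:      # only the top `rows` hits are shipped
            k = key(collection, doc)
            if k in seen:
                total -= 1           # duplicate noticed: count one fewer
            seen.add(k)
    return total


def by_id(collection, doc):
    # Pre-4.0 style: equal ids are the same document everywhere.
    return doc["id"]


def by_collection_and_id(collection, doc):
    # Equal ids are duplicates only within the same collection.
    return (collection, doc["id"])


shards = [
    ("x", [{"id": "a"}, {"id": "b"}, {"id": "c"}]),
    ("y", [{"id": "c"}, {"id": "b"}, {"id": "a"}]),
]

for rows in (1, 2, 3):
    print(rows,
          merged_num_found(shards, rows, by_id),
          merged_num_found(shards, rows, by_collection_and_id))
# 1 6 6
# 2 5 6
# 3 3 6
```

With id-only keys the count depends on how many hits each shard ships, so it changes with "rows". With (collection, id) keys it stays at 6, while duplicates between shards of the same collection would still be collapsed.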

{quote}
Thanks!

Regards, Steff
{quote}




    
> Wrong handling of documents with same id in cross collection searches
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3765
>                 URL: https://issues.apache.org/jira/browse/SOLR-3765
>             Project: Solr
>          Issue Type: Bug
>          Components: search, SolrCloud
>    Affects Versions: 4.0
>         Environment: Self-built version of Solr from the 4.x branch (revision )
>            Reporter: Per Steffensen
>              Labels: collections, inconsistency, numFound, search
>

