I really think you'll be in a world of hurt if you have the same ID on different shards. I just wouldn't go there. The statement "may be non-deterministic" should be taken to mean that this is just unsupported.
Why is this the case? What is the use-case for putting the same ID on different shards? Because this seems like an XY problem...

Best
Erick

On Wed, Feb 22, 2012 at 4:43 PM, jerry.min...@gmail.com <jerry.min...@gmail.com> wrote:

> Hi,
>
> I stumbled across this thread after running into the same question. The
> answers presented here seem a little vague and I was hoping to renew the
> discussion.
>
> I am using a branch of Solr 4, distributed searching over 12 shards.
> I want the documents in the first shard to always be selected over
> documents that appear in the other 11 shards.
>
> The queries to these shards look something like this:
> "http://solrserver/shard_1_app/select?shards=solr_server:9999/shard_1_app/,solr_server:9999/shard_2_app,
> ... ,solr_server:9999/shard_12_app&q=id:xxxxxxxx"
>
> When I execute a query for an ID that I know exists in shard_1 and another
> shard, I do always get the result from shard 1.
>
> Here are some questions that I have:
>
> 1. Has anyone rigorously tested the comment in the wiki "If docs with
> duplicate unique keys are encountered, Solr will make an attempt to return
> valid results, but the behavior may be non-deterministic."
>
> 2. Who is relying on this behavior (the document of the first shard is
> returned) today? When do you notice the wrong document is selected? Do you
> have a feeling for how frequently your distributed search returns the
> document from a shard other than the first?
>
> 3. Is there a good web source other than the Solr wiki for information
> about Solr distributed queries?
>
> Thanks,
> Jerry M.
>
> On Mon, Aug 8, 2011 at 7:41 PM, simon <mtnes...@gmail.com> wrote:
>
>> I think the first one to respond is indeed the way it works, but
>> that's only deterministic up to a point (if your small index is in the
>> throes of a commit and everything required for a response happens to
>> be cached on the larger shard ... who knows?)
>>
>> On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey <s...@elyograg.org> wrote:
>> > On 8/8/2011 4:07 PM, simon wrote:
>> >>
>> >> Only one should be returned, but it's non-deterministic. See
>> >> http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations
>> >
>> > I had heard it was based on which one responded first. This is part of
>> > why we have a small index that contains the newest content and only
>> > distribute content to the other shards once a day. The hope is that the
>> > small index (less than 1GB, fits into RAM on that virtual machine) will
>> > always respond faster than the other larger shards (over 18GB each). Is
>> > this an incorrect assumption on our part?
>> >
>> > The build system does do everything it can to ensure that periods of
>> > overlap are limited to the time it takes to commit a change across all
>> > of the shards, which should amount to just a few seconds once a day.
>> > There might be situations when the index gets out of whack and we have
>> > duplicate id values for a longer time period, but in practice it hasn't
>> > happened yet.
>> >
>> > Thanks,
>> > Shawn
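
If the first shard really must win, one way to avoid depending on which shard answers first is to query each shard on its own (without the shards parameter) and merge the results client-side with an explicit priority order. The following is only a minimal sketch under assumptions: merge_by_shard_priority is a hypothetical helper, not part of Solr or any client library, each shard's results are assumed to already be parsed into dicts, and the unique key field is assumed to be "id". It also gives up the relevance ordering that a real distributed query would compute across shards.

```python
from typing import Dict, Iterable, List

Doc = Dict[str, object]


def merge_by_shard_priority(
    results_per_shard: Iterable[List[Doc]], key: str = "id"
) -> List[Doc]:
    """Merge per-shard result lists, keeping one document per unique key.

    results_per_shard must be ordered from highest to lowest priority
    (for example shard 1 first). When the same key appears on more than
    one shard, the document from the earliest shard in that order wins.
    """
    merged: Dict[object, Doc] = {}
    for docs in results_per_shard:
        for doc in docs:
            # setdefault only stores the doc if the key has not been seen,
            # so later (lower-priority) shards cannot overwrite earlier ones.
            merged.setdefault(doc[key], doc)
    # Dicts preserve insertion order, so output follows shard priority.
    return list(merged.values())


# Hypothetical, hand-written results standing in for two shard responses.
shard_1 = [{"id": "a", "source": "shard_1"}]
shard_2 = [{"id": "a", "source": "shard_2"}, {"id": "b", "source": "shard_2"}]
winners = merge_by_shard_priority([shard_1, shard_2])
```

Because the priority is encoded in the order of the input list, the outcome does not change with shard response times, which is the property the "first one to respond" behaviour cannot guarantee.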