Re: Same id on two shards

2012-02-23 Thread Erick Erickson
I really think you'll be in a world of hurt if you have the same
ID on different shards. I just wouldn't go there. The statement
"may be non-deterministic" should be taken to mean that this
is just unsupported.
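
If you want to find out whether an index is already in that state, a rough check is to collect the ids held by each shard and look for any id that shows up more than once. The sketch below assumes the id lists have already been fetched (for example by querying each shard on its own with fl=id); the shard names and ids are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_ids(ids_by_shard):
    """Map each id found on more than one shard to the sorted list of shards holding it."""
    shards_for_id = defaultdict(set)
    for shard, ids in ids_by_shard.items():
        for doc_id in ids:
            shards_for_id[doc_id].add(shard)
    return {
        doc_id: sorted(shards)
        for doc_id, shards in shards_for_id.items()
        if len(shards) > 1
    }


# Hypothetical shard names and ids, for illustration only.
ids_by_shard = {
    "shard_1": ["A", "B"],
    "shard_2": ["B", "C"],
    "shard_3": ["B"],
}
duplicates = find_duplicate_ids(ids_by_shard)
```

Any id reported here is one for which a distributed query may return either copy.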

Why is this the case? What is the use-case for putting the
same ID on different shards? Because this seems like
an XY problem...

Best
Erick

On Wed, Feb 22, 2012 at 4:43 PM, jerry.min...@gmail.com wrote:
 Hi,

 I stumbled across this thread after running into the same question. The
 answers presented here seem a little vague and I was hoping to renew the
 discussion.

 I am using a branch of Solr 4, distributed searching over 12 shards.
 I want the documents in the first shard to always be selected over
 documents that appear in the other 11 shards.

 The queries to these shards look something like this:
 http://solrserver/shard_1_app/select?shards=solr_server:/shard_1_app/,solr_server:/shard_2_app,
 ... ,solr_server:/shard_12_app&q=id:

 When I execute a query for an ID that I know exists in shard_1 and another
 shard, I do always get the result from shard 1.

 Here are some questions that I have:
 1. Has anyone rigorously tested the comment in the wiki: "If docs with
 duplicate unique keys are encountered, Solr will make an attempt to return
 valid results, but the behavior may be non-deterministic"?

 2. Who is relying on this behavior (the document of the first shard is
 returned) today? When do you notice the wrong document is selected? Do you
 have a feeling for how frequently your distributed search returns the
 document from a shard other than the first?

 3. Is there a good web source other than the Solr wiki for information
 about Solr distributed queries?


 Thanks,
 Jerry M.


 On Mon, Aug 8, 2011 at 7:41 PM, simon mtnes...@gmail.com wrote:

 I think the first one to respond is indeed the way it works, but
 that's only deterministic up to a point (if your small index is in the
 throes of a commit and everything required for a response happens to
 be cached on the larger shard ... who knows?)

 On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey s...@elyograg.org wrote:
  On 8/8/2011 4:07 PM, simon wrote:
 
  Only one should be returned, but it's non-deterministic. See
  http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations
 
  I had heard it was based on which one responded first.  This is part of why
  we have a small index that contains the newest content and only distribute
  content to the other shards once a day.  The hope is that the small index
  (less than 1GB, fits into RAM on that virtual machine) will always respond
  faster than the other larger shards (over 18GB each).  Is this an incorrect
  assumption on our part?
 
  The build system does do everything it can to ensure that periods of overlap
  are limited to the time it takes to commit a change across all of the
  shards, which should amount to just a few seconds once a day.  There might
  be situations when the index gets out of whack and we have duplicate id
  values for a longer time period, but in practice it hasn't happened yet.
 
  Thanks,
  Shawn
 
 



Re: Same id on two shards

2012-02-22 Thread jerry.min...@gmail.com
Hi,

I stumbled across this thread after running into the same question. The
answers presented here seem a little vague and I was hoping to renew the
discussion.

I am using a branch of Solr 4, distributed searching over 12 shards.
I want the documents in the first shard to always be selected over
documents that appear in the other 11 shards.

The queries to these shards look something like this:
http://solrserver/shard_1_app/select?shards=solr_server:/shard_1_app/,solr_server:/shard_2_app,
... ,solr_server:/shard_12_app&q=id:

When I execute a query for an ID that I know exists in shard_1 and another
shard, I do always get the result from shard 1.
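
For anyone who wants to reproduce this, a query of that shape could be assembled roughly as follows. This is only a sketch: the host, port, core paths, and document id are placeholders I made up, not the real values from my setup.

```python
from urllib.parse import urlencode


def build_distributed_query(base_url, shard_addresses, doc_id):
    """Return a select URL that fans an id lookup out to every listed shard."""
    params = {
        # The shards parameter is a comma-separated list of host:port/path
        # entries, written without the http:// scheme.
        "shards": ",".join(shard_addresses),
        "q": "id:{}".format(doc_id),
    }
    # urlencode percent-escapes the values and joins the parameters with "&".
    return base_url + "/select?" + urlencode(params)


# Hypothetical host, port, and core names, for illustration only.
shards = ["solr_server:8983/solr/shard_{}_app".format(n) for n in range(1, 13)]
url = build_distributed_query(
    "http://solr_server:8983/solr/shard_1_app", shards, "doc42"
)
```

Listing shard_1 first in the shards parameter is simply how my queries happen to be written; whether that ordering influences which duplicate is returned is exactly what I am asking about.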

Here are some questions that I have:
1. Has anyone rigorously tested the comment in the wiki: "If docs with
duplicate unique keys are encountered, Solr will make an attempt to return
valid results, but the behavior may be non-deterministic"?

2. Who is relying on this behavior (the document of the first shard is
returned) today? When do you notice the wrong document is selected? Do you
have a feeling for how frequently your distributed search returns the
document from a shard other than the first?

3. Is there a good web source other than the Solr wiki for information
about Solr distributed queries?


Thanks,
Jerry M.


On Mon, Aug 8, 2011 at 7:41 PM, simon mtnes...@gmail.com wrote:

 I think the first one to respond is indeed the way it works, but
 that's only deterministic up to a point (if your small index is in the
 throes of a commit and everything required for a response happens to
 be cached on the larger shard ... who knows?)

 On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey s...@elyograg.org wrote:
  On 8/8/2011 4:07 PM, simon wrote:
 
  Only one should be returned, but it's non-deterministic. See
  http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations
 
  I had heard it was based on which one responded first.  This is part of why
  we have a small index that contains the newest content and only distribute
  content to the other shards once a day.  The hope is that the small index
  (less than 1GB, fits into RAM on that virtual machine) will always respond
  faster than the other larger shards (over 18GB each).  Is this an incorrect
  assumption on our part?
 
  The build system does do everything it can to ensure that periods of overlap
  are limited to the time it takes to commit a change across all of the
  shards, which should amount to just a few seconds once a day.  There might
  be situations when the index gets out of whack and we have duplicate id
  values for a longer time period, but in practice it hasn't happened yet.
 
  Thanks,
  Shawn
 
 



Re: Same id on two shards

2011-08-08 Thread simon
Only one should be returned, but it's non-deterministic. See
http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

-Simon

On Sat, Aug 6, 2011 at 6:27 AM, Pooja Verlani pooja.verl...@gmail.com wrote:
 Hi,

 We have a multicore Solr setup with 6 cores. We merge the results using the
 shards parameter or a distrib handler.
 My problem is that I might post a document to one of the cores and then
 post it again some days later to another core, as I have a time-sliced
 multicore setup.

 The question is: if I retrieve a document that has been posted to both shards,
 will Solr return only one document or both? And if only one document is
 returned, which one?

 Regards,
 Pooja



Re: Same id on two shards

2011-08-08 Thread Shawn Heisey

On 8/8/2011 4:07 PM, simon wrote:

Only one should be returned, but it's non-deterministic. See
http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations


I had heard it was based on which one responded first.  This is part of 
why we have a small index that contains the newest content and only 
distribute content to the other shards once a day.  The hope is that the 
small index (less than 1GB, fits into RAM on that virtual machine) will 
always respond faster than the other larger shards (over 18GB each).  Is 
this an incorrect assumption on our part?


The build system does do everything it can to ensure that periods of 
overlap are limited to the time it takes to commit a change across all 
of the shards, which should amount to just a few seconds once a day.  
There might be situations when the index gets out of whack and we have 
duplicate id values for a longer time period, but in practice it hasn't 
happened yet.


Thanks,
Shawn



Re: Same id on two shards

2011-08-08 Thread simon
I think the first one to respond is indeed the way it works, but
that's only deterministic up to a point (if your small index is in the
throes of a commit and everything required for a response happens to
be cached on the larger shard ... who knows?)
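
If you need the preference to be deterministic rather than dependent on response timing, one option is to query each shard separately and do the merge on the client in a fixed priority order. The following is only a sketch of that idea, not how Solr merges results internally; the shard names and the "source" field are invented for the example.

```python
def merge_by_shard_priority(results_by_shard, shard_priority):
    """Merge per-shard result lists, keeping the copy from the highest-priority shard."""
    merged = {}
    for shard in shard_priority:
        for doc in results_by_shard.get(shard, []):
            # setdefault keeps the first document seen for an id,
            # so shards listed earlier in shard_priority win.
            merged.setdefault(doc["id"], doc)
    return list(merged.values())


# Hypothetical per-shard results, for illustration only.
results_by_shard = {
    "small_index": [{"id": "A", "source": "small_index"}],
    "large_index": [
        {"id": "A", "source": "large_index"},
        {"id": "B", "source": "large_index"},
    ],
}
merged = merge_by_shard_priority(results_by_shard, ["small_index", "large_index"])
```

Note that a merge like this ignores relevance scores across shards, so it suits id lookups better than ranked searches.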

On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey s...@elyograg.org wrote:
 On 8/8/2011 4:07 PM, simon wrote:

 Only one should be returned, but it's non-deterministic. See

 http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

 I had heard it was based on which one responded first.  This is part of why
 we have a small index that contains the newest content and only distribute
 content to the other shards once a day.  The hope is that the small index
 (less than 1GB, fits into RAM on that virtual machine) will always respond
 faster than the other larger shards (over 18GB each).  Is this an incorrect
 assumption on our part?

 The build system does do everything it can to ensure that periods of overlap
 are limited to the time it takes to commit a change across all of the
 shards, which should amount to just a few seconds once a day.  There might
 be situations when the index gets out of whack and we have duplicate id
 values for a longer time period, but in practice it hasn't happened yet.

 Thanks,
 Shawn




Same id on two shards

2011-08-06 Thread Pooja Verlani
Hi,

We have a multicore Solr setup with 6 cores. We merge the results using the
shards parameter or a distrib handler.
My problem is that I might post a document to one of the cores and then
post it again some days later to another core, as I have a time-sliced
multicore setup.

The question is: if I retrieve a document that has been posted to both shards,
will Solr return only one document or both? And if only one document is
returned, which one?

Regards,
Pooja