Hi, Erick!

That's it! I'm using a custom SolrServer implementation with distributed behavior that routes queries and updates using an in-house round-robin method. I'm doing this myself because I noticed that duplicate documents appear when using the LBHttpSolrServer implementation. Last week I modified my implementation to avoid that, with these changes:

- I have normalized the key field for all documents. Every indexed document must now include an *_id_* field that stores the selected key value. The value is set with a *copyField*.
- When I index a new document, an *HttpSolrServer* from the shard list is selected using a round-robin strategy. Then a field called *_shard_* is set on the *SolrInputDocument*. That field's value identifies the selected shard.
- If a document to be indexed/updated includes the *_shard_* field, the shard it belongs to (*HttpSolrServer*) is selected automatically.
- If a document to be indexed/updated does not include the *_shard_* field, the key value is read from *_id_* in the *SolrInputDocument*. With that key, a distributed search query is executed to retrieve the *_shard_* field. With the *_shard_* field we can then choose the correct shard (*HttpSolrServer*).

It's not good practice and performance isn't the best, but it's safe.

Best regards,

- Luis Cappa

2013/5/26 Erick Erickson <erickerick...@gmail.com>

> Valery:
>
> I share your puzzlement. _If_ you are letting Solr do the document
> routing, and not doing any of the custom routing, then the same unique
> key should be going to the same shard and replacing the previous doc
> with that key.
>
> But, if you're using custom routing, if you've been experimenting with
> different configurations and didn't start over, in general if your
> configuration is in an "interesting" state this could happen.
>
> So in the normal case if you have a document with the same key indexed
> in multiple shards, that would indicate a bug. But there are many
> ways, especially when experimenting, that you could have this happen
> which are _not_ a bug. I'm guessing that Luis may be trying the custom
> routing option maybe?
>
> Best
> Erick
>
> On Fri, May 24, 2013 at 9:09 AM, Valery Giner <valgi...@research.att.com>
> wrote:
> > Shawn,
> >
> > How is it possible for more than one document with the same unique key to
> > appear in the index, even in different shards?
> > Isn't it a bug by definition?
> > What am I missing here?
> >
> > Thanks,
> > Val
> >
> >
> > On 05/23/2013 09:55 AM, Shawn Heisey wrote:
> >>
> >> On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
> >>>
> >>> I've queried each Solr shard server one by one and the total number of
> >>> documents is correct. However, when I change rows parameter from 10 to 100
> >>> the total numFound of documents changes:
> >>
> >> I've seen this problem on the list before and the cause has been
> >> determined each time to be caused by documents with the same uniqueKey
> >> value appearing in more than one shard.
> >>
> >> What I think happens here:
> >>
> >> With rows=10, you get the top ten docs from each of the three shards,
> >> and each shard sends its numFound for that query to the core that's
> >> coordinating the search. The coordinator adds up numFound, looks
> >> through those thirty docs, and arranges them according to the requested
> >> sort order, returning only the top 10. In this case, there happen to be
> >> no duplicates.
> >>
> >> With rows=100, you get a total of 300 docs. This time, duplicates are
> >> found and removed by the coordinator. I think that the coordinator
> >> adjusts the total numFound by the number of duplicate documents it
> >> removed, in an attempt to be more accurate.
> >>
> >> I don't know if adjusting numFound when duplicates are found in a
> >> sharded query is the right thing to do, I'll leave that for smarter
> >> people. Perhaps Solr should return a message with the results saying
> >> that duplicates were found, and if a config option is not enabled, the
> >> server should throw an exception and return a 4xx HTTP error code. One
> >> idea for a config parameter name would be allowShardDuplicates, but
> >> something better can probably be found.
> >>
> >> Thanks,
> >> Shawn
> >>
> >
>

--
- Luis Cappa
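
P.S. For anyone following the thread, the routing steps listed at the top of this message can be sketched roughly as below. This is only an illustration under assumptions, not the actual implementation: documents are plain maps and shards are plain names instead of SolrInputDocument and HttpSolrServer, the class name ShardRouter and the shardLookup function are hypothetical, and shardLookup merely stands in for the distributed search by key. It also assumes that a key the lookup cannot find belongs to a new document, which then goes to the next shard in round-robin order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for the custom router; no SolrJ types are used. */
final class ShardRouter {
    static final String ID_FIELD = "_id_";
    static final String SHARD_FIELD = "_shard_";

    private final List<String> shards;
    private final AtomicInteger next = new AtomicInteger();
    // Assumed lookup: returns the _shard_ value already stored for a key, if any.
    // In the setup described above this would be the distributed search query.
    private final Function<String, Optional<String>> shardLookup;

    ShardRouter(List<String> shards, Function<String, Optional<String>> shardLookup) {
        if (shards.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.shards = new ArrayList<>(shards);
        this.shardLookup = shardLookup;
    }

    /** Chooses the target shard for a document and records it in _shard_. */
    String route(Map<String, Object> doc) {
        // Case 1: the document already says which shard it belongs to.
        Object explicit = doc.get(SHARD_FIELD);
        if (explicit != null) {
            return requireKnown(explicit.toString());
        }
        Object id = doc.get(ID_FIELD);
        if (id == null) {
            throw new IllegalArgumentException("document has no " + ID_FIELD + " field");
        }
        // Case 2: look the key up; Case 3: unknown key, so treat it as new.
        String shard = shardLookup.apply(id.toString())
                .map(this::requireKnown)
                .orElseGet(this::nextRoundRobin);
        doc.put(SHARD_FIELD, shard);
        return shard;
    }

    private String nextRoundRobin() {
        // floorMod keeps the index valid even if the counter overflows.
        return shards.get(Math.floorMod(next.getAndIncrement(), shards.size()));
    }

    private String requireKnown(String shard) {
        if (!shards.contains(shard)) {
            throw new IllegalStateException("unknown shard: " + shard);
        }
        return shard;
    }
}

final class ShardRouterDemo {
    public static void main(String[] args) {
        // A map from key to shard plays the role of the searchable index here.
        Map<String, String> index = new HashMap<>();
        ShardRouter router = new ShardRouter(
                Arrays.asList("shard1", "shard2", "shard3"),
                key -> Optional.ofNullable(index.get(key)));

        Map<String, Object> doc = new HashMap<>();
        doc.put(ShardRouter.ID_FIELD, "doc-A");
        String shard = router.route(doc);
        index.put("doc-A", shard); // "indexing" the document
        System.out.println("doc-A routed to " + shard);
    }
}
```

The point of the sketch is that a key is only ever assigned a shard once: later updates for the same key either carry *_shard_* or recover it through the lookup, so the same key should not end up on two shards. The extra lookup is the performance cost mentioned above.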