Hi, Erick!

That's it! I'm using a custom SolrServer implementation with
distributed behavior that routes queries and updates using an in-house
round-robin method. I'm doing this myself because I noticed that
duplicate documents appear when using the LBHttpSolrServer
implementation. Last week I modified my implementation to avoid that, with
these changes:


   - I have normalized the key field for all documents. Every indexed
   document must now include an *_id_* field that stores the selected key
   value. The value is set with a *copyField*.
   - When I index a new document, an *HttpSolrServer* from the shard list is
   selected using a round-robin strategy. Then a field called *_shard_* is
   set on the *SolrInputDocument*. That field's value records which shard
   was selected.
   - If a document to be indexed/updated includes the *_shard_* field,
   the shard it belongs to (*HttpSolrServer*) is selected automatically.
   - If a document to be indexed/updated does not include the *_shard_*
   field, the key value is read from the *_id_* field of the
   *SolrInputDocument*. A distributed search query on that key then
   retrieves the *_shard_* field, which lets us choose the correct shard
   (*HttpSolrServer*). It's not good practice and performance isn't the
   best, but it's safe.
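
The routing rules above can be sketched roughly as follows. This is only an
illustration in plain Java, not my real code and not SolrJ API: the class
ShardRouter, the Map used in place of a SolrInputDocument, the shard names
and the shardLookup function (standing in for the distributed query by
*_id_*) are all hypothetical. It also assumes that a document whose key is
not found by the lookup is treated as new and gets a round-robin shard.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for the custom routing described above (not SolrJ API). */
class ShardRouter {
    private final List<String> shards;  // e.g. one name or base URL per HttpSolrServer
    private final Function<String, Optional<String>> shardLookup;  // stands in for the distributed query by _id_
    private final AtomicInteger next = new AtomicInteger();

    ShardRouter(List<String> shards, Function<String, Optional<String>> shardLookup) {
        if (shards.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.shards = new ArrayList<>(shards);
        this.shardLookup = shardLookup;
    }

    /** Returns the shard that should receive the document and stamps _shard_ on it. */
    String route(Map<String, Object> doc) {
        Object shard = doc.get("_shard_");
        if (shard != null) {
            return shard.toString();  // the document already knows its shard
        }
        Object id = doc.get("_id_");
        if (id == null) {
            throw new IllegalArgumentException("document has no _id_ field");
        }
        // Look the key up first; fall back to round robin when it is unknown (assumed: new document).
        String chosen = shardLookup.apply(id.toString()).orElseGet(this::nextRoundRobin);
        doc.put("_shard_", chosen);
        return chosen;
    }

    private String nextRoundRobin() {
        int i = Math.floorMod(next.getAndIncrement(), shards.size());
        return shards.get(i);
    }
}
```

The point of looking up the key before falling back to round robin is that
an update for an existing key always lands on the shard that already holds
it, so the old document is replaced instead of duplicated on another shard.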

Best Regards,

- Luis Cappa


2013/5/26 Erick Erickson <erickerick...@gmail.com>

> Valery:
>
> I share your puzzlement. _If_ you are letting Solr do the document
> routing, and not doing any of the custom routing, then the same unique
> key should be going to the same shard and replacing the previous doc
> with that key.
>
> But, if you're using custom routing, if you've been experimenting with
> different configurations and didn't start over, in general if your
> configuration is in an "interesting" state this could happen.
>
> So in the normal case if you have a document with the same key indexed
> in multiple shards, that would indicate a bug. But there are many
> ways, especially when experimenting, that you could have this happen
> which are _not_ a bug. I'm guessing that Luis may be trying the custom
> routing option?
>
> Best
> Erick
>
> On Fri, May 24, 2013 at 9:09 AM, Valery Giner <valgi...@research.att.com>
> wrote:
> > Shawn,
> >
> > How is it possible for more than one document with the same unique key to
> > appear in the index, even in different shards?
> > Isn't it a bug by definition?
> > What am I missing here?
> >
> > Thanks,
> > Val
> >
> >
> > On 05/23/2013 09:55 AM, Shawn Heisey wrote:
> >>
> >> On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
> >>>
> >>> I've queried each Solr shard server one by one and the total number of
> >>> documents is correct. However, when I change the rows parameter from 10
> >>> to 100, the total numFound of documents changes:
> >>
> >> I've seen this problem on the list before, and each time the cause has
> >> been determined to be documents with the same uniqueKey value appearing
> >> in more than one shard.
> >>
> >> What I think happens here:
> >>
> >> With rows=10, you get the top ten docs from each of the three shards,
> >> and each shard sends its numFound for that query to the core that's
> >> coordinating the search.  The coordinator adds up numFound, looks
> >> through those thirty docs, and arranges them according to the requested
> >> sort order, returning only the top 10.  In this case, there happen to be
> >> no duplicates.
> >>
> >> With rows=100, you get a total of 300 docs.  This time, duplicates are
> >> found and removed by the coordinator.  I think that the coordinator
> >> adjusts the total numFound by the number of duplicate documents it
> >> removed, in an attempt to be more accurate.
> >>
> >> I don't know if adjusting numFound when duplicates are found in a
> >> sharded query is the right thing to do, I'll leave that for smarter
> >> people.  Perhaps Solr should return a message with the results saying
> >> that duplicates were found, and if a config option is not enabled, the
> >> server should throw an exception and return a 4xx HTTP error code.  One
> >> idea for a config parameter name would be allowShardDuplicates, but
> >> something better can probably be found.
> >>
> >> Thanks,
> >> Shawn
> >>
> >
>
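
Shawn's guess quoted above (each shard reports its numFound, the coordinator
sums them, then subtracts the duplicates it finds among the rows it merges)
can be simulated with a small sketch. This is only a model of that
hypothesis, not Solr's actual merge code; the class CoordinatorSketch and
the method mergedNumFound are hypothetical names, and each inner list stands
for one shard's matching keys in ranked order.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the hypothesis quoted above; not Solr's real implementation. */
class CoordinatorSketch {
    /**
     * Sums every shard's full match count, then subtracts one for each key that
     * shows up again among the top `rows` results already collected from the shards.
     */
    static long mergedNumFound(List<List<String>> shardResults, int rows) {
        long numFound = 0;
        Set<String> seen = new HashSet<>();
        for (List<String> shard : shardResults) {
            numFound += shard.size();                  // the shard's own numFound
            int limit = Math.min(rows, shard.size());  // only the top rows are sent back
            for (String id : shard.subList(0, limit)) {
                if (!seen.add(id)) {
                    numFound--;                        // duplicate key seen by the coordinator
                }
            }
        }
        return numFound;
    }
}
```

In this model a duplicate that ranks below the requested rows on both shards
is never seen by the coordinator, so a small rows value gives the plain sum
while a larger one gives a lower total, which matches what I observed.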



-- 
- Luis Cappa
