Sounds like you've explicitly routed the same document to two different shards. Document replacement only happens locally within a shard, so having documents with the same ID on two different shards is exactly why you're getting duplicate documents.
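A toy sketch of that failure mode (illustrative only, not Solr code — the shard names and helper functions here are made up): each shard replaces by uniqueKey only within its own index, so if the same ID is routed to two shards, a distributed query merges both copies into the result set.

```python
# Toy model of shard-local document replacement (not Solr internals).
shards = {"shard1": {}, "shard2": {}}

def index(shard, doc):
    # Replacement by uniqueKey happens only inside the shard that
    # receives the document.
    shards[shard][doc["id"]] = doc

def distributed_count():
    # A distributed query merges hits from every shard; it does not
    # deduplicate by uniqueKey across shards.
    return sum(len(s) for s in shards.values())

# Explicitly routing the same document to two different shards...
index("shard1", {"id": "doc42", "version": 1})
index("shard2", {"id": "doc42", "version": 2})

print(distributed_count())    # -> 2 : doc42 is counted twice
print(len(shards["shard1"]))  # -> 1 : the distrib=false view of one shard
```

This is consistent with what you saw: with `distrib=false` each shard looks clean on its own, and the duplicates only appear when results are merged.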
Best,
Erick

On Fri, May 3, 2013 at 3:44 PM, Iker Mtnz. Apellaniz <mitxin...@gmail.com> wrote:
> We are currently using version 4.2.
> We have made tests with a single document and it gives us a count of 2
> documents. But if we force it to shard onto the first machine, the one
> with a single shard, the count gives us 1 document.
> I've tried using the distrib=false parameter; it gives us no duplicate
> documents, but the same document appears to be in two different shards.
>
> Finally, about the separate directories: we have only one data directory
> per physical machine and collection, and I don't see any subfolder for
> the different shards.
>
> Is it possible that we have something wrong with the dataDir
> configuration to use multiple shards on one machine?
>
> <dataDir>${solr.data.dir:}</dataDir>
> <directoryFactory name="DirectoryFactory"
>     class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
>
>
> 2013/5/3 Erick Erickson <erickerick...@gmail.com>
>
>> What version of Solr? The custom routing stuff is quite new, so
>> I'm guessing 4.x?
>>
>> But this shouldn't be happening. The actual index data for the
>> shards should be in separate directories; they just happen to
>> be on the same physical machine.
>>
>> Try querying each one with &distrib=false to see the counts
>> from single shards; that may shed some light on this. It vaguely
>> sounds like you have indexed the same document to both shards
>> somehow...
>>
>> Best,
>> Erick
>>
>> On Fri, May 3, 2013 at 5:28 AM, Iker Mtnz. Apellaniz
>> <mitxin...@gmail.com> wrote:
>> > Hi,
>> > We currently have a SolrCloud implementation running 5 shards on 3
>> > physical machines, so the first machine has shard number 1, the
>> > second machine shards 2 & 4, and the third shards 3 & 5. We noticed
>> > that while querying, numFound decreased when we increased the start
>> > param. After some investigation we found that the documents in
>> > shards 2 to 5 were being counted twice.
>> > Querying shard 2 will give you back the results for shards 2 & 4,
>> > and the same goes for shards 3 & 5. Our guess is that the physical
>> > index for shards 2 & 4 is shared, so the shards don't know which
>> > part of it belongs to each one.
>> > The uniqueKey is correctly defined, and we have tried using a shard
>> > prefix (shard1!docID).
>> >
>> > Is there any way to solve this problem when a single physical
>> > machine hosts several shards?
>> > Is it a "real" problem, or does it just affect facets & numResults?
>> >
>> > Thanks
>> > Iker
>> >
>> > --
>> > /** @author imartinez */
>> > Person me = *new* Developer();
>> > me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
>> > me.setTwit("@mitxino77 <https://twitter.com/mitxino77>");
>> > me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*, World"});
>> > me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
>> > me.setWebs({*urbasaabentura.com, ikertxef.com*});
>> > *return* me;
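On the dataDir question: with `<dataDir>${solr.data.dir:}</dataDir>` and an empty default, each core normally falls back to `data/` under its own instanceDir, so two cores only collide if they also share an instanceDir or if `solr.data.dir` is set to one fixed path for the whole machine. A minimal sketch of a legacy-style 4.x solr.xml giving each core on the same machine its own dataDir (all names and paths here are illustrative, not taken from the thread):

```xml
<!-- solr.xml sketch: two shards of the same collection on one machine,
     each core with its own instanceDir and dataDir (paths illustrative) -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="collection1_shard2" instanceDir="collection1_shard2"
          dataDir="/var/solr/data/collection1_shard2"/>
    <core name="collection1_shard4" instanceDir="collection1_shard4"
          dataDir="/var/solr/data/collection1_shard4"/>
  </cores>
</solr>
```

If both cores resolve to the same index directory, you would see exactly the symptom described above: one physical index, and every shard-level query over it returning the other shard's documents too.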