Let's back up a bit. You say "This seems to cause two replicas to
return different hits depending upon which one is queried."

OK, _how_ are they different? I've been assuming different numbers of
hits. If you're getting the same number of hits but different document
ordering, that's a completely different issue and may be easily
explainable. If this is true, skip the rest of this message. I only
realized we may be using a different definition of "different hits"
part way through writing this reply.

------------------------

Having the timestamp as a string isn't a problem; you can do something
very similar with wildcards or range queries as long as the string
sorts the same way the timestamp would. And it's best that it's created
upstream anyway, since that way it's guaranteed to be the same for the
doc on all replicas.

If the date is in canonical form (YYYY-MM-DDTHH:MM:SSZ) then a simple
copyfield to a date field would do the trick.
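For instance, here's a minimal sketch of the kind of filter query I
mean, run against one core directly. The field name "timestamp_s" and
the cutoff value are made up for illustration; substitute whatever your
ETL job actually writes. The point is just that a range fq works on a
string field as long as the values sort lexically in time order:

    # Sketch only: build a range fq on a lexically-sortable string
    # timestamp. "timestamp_s" and the cutoff are placeholders.
    from urllib.parse import urlencode

    cutoff = "2016-12-14T00:00:00Z"   # some point safely in the past
    params = urlencode({
        "q": "*:*",
        "rows": 0,
        "distrib": "false",                      # query just this core
        "fq": "timestamp_s:[* TO %s]" % cutoff,  # same idea as a date-field range
    })
    print(params)  # append to http://host:port/solr/<core>/select?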

But there's no real reason to do any of that. Given that you see this
when there's no indexing going on, there's no point to those tests;
they were just a way to examine your nodes while there was active
indexing.

How do you fix this problem when you see it? If it goes away by
itself, that would give at least a start on where to look. If you have
to manually intervene, it would be good to know what you do.

The CDCR pattern is that docs flow from the leader on the source
cluster to the leader on the target cluster. Once the target leader
gets the docs, it's supposed to forward them to all of its replicas.

To try to narrow down the issue, next time it occurs can you look at
_both_ the source and target clusters and see if they _both_ show the
same discrepancy? What I'm looking for is whether both are
self-consistent. That is, all the replicas for shardN on the source
cluster show the same number of docs (M), and all the replicas for
shardN on the target cluster show the same number of docs (N). I'm not
as concerned if M != N at this point. Note I'm looking at the number
of hits here, not, say, the document ordering.

To do this you'll have to do the trick I mentioned where you query
each replica separately.
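If it helps, here's a rough sketch of that check in Python (untested;
the host names, ports, and core names are placeholders -- pull the real
replica core URLs for each shard from the admin UI or CLUSTERSTATUS on
both the source and target clusters):

    # Sketch: hit each replica core directly with distrib=false and
    # compare numFound per shard. All URLs below are placeholders.
    import json
    from urllib.request import urlopen
    from urllib.parse import urlencode

    replica_cores = {
        "source/shard1": [
            "http://source-host1:8983/solr/mycoll_shard1_replica1",
            "http://source-host2:8983/solr/mycoll_shard1_replica2",
        ],
        "target/shard1": [
            "http://target-host1:8983/solr/mycoll_shard1_replica1",
            "http://target-host2:8983/solr/mycoll_shard1_replica2",
        ],
    }

    params = urlencode({"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"})

    for shard, cores in replica_cores.items():
        counts = []
        for core in cores:
            with urlopen(core + "/select?" + params) as resp:
                counts.append(json.load(resp)["response"]["numFound"])
        status = "OK" if len(set(counts)) == 1 else "MISMATCH"
        print(shard, counts, status)

If every shard prints "OK" on both clusters (even with M != N between
them), the clusters are at least internally consistent.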

And are you absolutely sure that your different results are coming
from the _same_ cluster? If you're comparing a query from the source
cluster with a query from the target cluster, that's different than if
the queries come from the same cluster.

Best,
Erick

On Wed, Dec 14, 2016 at 2:48 PM, Webster Homer <webster.ho...@sial.com> wrote:
> Thanks for the quick feedback.
>
> We are not doing continuous indexing, we do a complete load once a week and
> then have a daily partial load for any documents that have changed since
> the load. These partial loads take only a few minutes every morning.
>
> The problem is we see this discrepancy long after the data load completes.
>
> We have a source collection that uses cdcr to replicate to the target. I
> see the current=false setting in both the source and target collections.
> Only the target collection is being heavily searched so that is where my
> concern is. So what could cause this kind of issue?
> Do we have a configuration problem?
>
> It doesn't happen all the time, so I don't currently have a reproducible
> test case, yet.
>
> I will see about adding the timestamp, we have one, but it was created as a
> string, and was generated by our ETL job
>
> On Wed, Dec 14, 2016 at 3:42 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> The commit points on different replicas will trip at different wall
>> clock times so the leader and replica may return slightly different
>> results depending on whether doc X was included in the commit on one
>> replica but not on the second. After the _next_ commit interval (2
>> seconds in your case), doc X will be committed on the second replica:
>> that is, it's not lost.
>>
>> Here's a couple of ways to verify:
>>
>> 1> turn off indexing and wait a few seconds. The replicas should have
>> the exact same documents. "A few seconds" is your autocommit (soft in
>> your case) interval + autowarm time. This last is unknown, but you can
>> check your admin/plugins-stats search handler times, it's reported
>> there. Now issue your queries. If the replicas don't report the same
>> docs, that's A Bad Thing and should be worrying. BTW, with a 2 second soft
>> commit interval, which is really aggressive, you _better not_ have
>> very large autowarm intervals!
>>
>> 2> Include a timestamp in your docs when they are indexed. There's an
>> automatic way to do that BTW.... now do your queries and append an FQ
>> clause like &fq=timestamp:[* TO some_point_in_the_past]. The replicas
>> should have the same counts unless you are deleting documents. I
>> mention deletes on the off chance that you're deleting documents that
>> fall in the interval and then the same as above could theoretically
>> occur. Updates should be fine.
>>
>> BTW, I've seen continuous monitoring of this done by automated
>> scripts. The key is to get the shard URL and ping that with
>> &distrib=false. It'll look something like
>> http://host:port/solr/collection_shard1_replica1.... People usually
>> just use *:* and compare numFound.
>>
>> Best,
>> Erick
>>
>>
>>
>> On Wed, Dec 14, 2016 at 1:10 PM, Webster Homer <webster.ho...@sial.com>
>> wrote:
>> > We are using Solr Cloud 6.2
>> >
>> > We have been noticing an issue where the index in a core shows as
>> current =
>> > false
>> >
>> > We have autocommit set for 15 seconds, and soft commit at 2 seconds
>> >
>> > This seems to cause two replicas to return different hits depending upon
>> > which one is queried.
>> >
>> > What would lead to the indexes not being "current"? The documentation on
>> > the meaning of current is vague.
>> >
>> > The collections in our cloud have two shards each with two replicas. I
>> see
>> > this with several of the collections.
>> >
>> > We don't know how they get like this but it's troubling
>> >
>>
>
