Chris,

You are spot on about the timestamps. The date command returns different
times on these VMs, and they are not in sync via NTP. ntpstat shows a
difference of about 8-10 seconds across the 4 VMs, which would have caused
the synchronization issues and marked the replicas as down. This started
only recently; it was working fine earlier. So I will get the clocks fixed
and hope everything falls into place.
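
For reference, this is roughly how the offsets can be compared (the host
names are placeholders, and it assumes ssh access between the VMs):

for h in vm1 vm2 vm3 vm4; do ssh "$h" 'hostname; date -u; ntpstat'; done

Looking at the date output side by side makes the skew obvious.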

Yes, there is one other error as well: a javabin response where character 2
was expected but 60 ("<") was returned, which presumably means an HTML error
page came back instead of a javabin response, in line with the "expecting
json/xml but got html" message you mentioned.

Thanks,
Satya

On Tue, Jun 19, 2018 at 8:01 AM Chris Ulicny <culicny@iq.media> wrote:

> Satya,
>
> There should be some other log messages relevant to the issue you are
> having, something along the lines of "leader cannot
> communicate with follower...publishing replica as down." There is also
> likely a message of "expecting json/xml but got html" in another
> instance's logs.
>
> We've seen this problem in various scenarios in our own clusters, usually
> during high volumes of requests, and what seems to be happening to us is
> the following.
>
> Since authentication is enabled, all requests between nodes must be
> authenticated, and Solr uses a timestamp as part of this (in some way; I'm
> not sure of the details). When the recipient of the request processes it,
> the timestamp is checked to see whether it is within the Time-To-Live (TTL)
> millisecond value (default of 5000). If the timestamp is too old, the
> request is rejected with the above error and a 401 response is returned
> to the sender.
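>
> As a rough back-of-the-envelope illustration (not the actual code inside
> PKIAuthenticationPlugin), the check boils down to "received timestamp minus
> key timestamp must be under the TTL". Plugging in the two timestamps from
> the log line you posted:
>
> echo $(( 1529297151817 - 1529297138215 ))   # prints 13602
>
> 13602 ms is well over the default 5000 ms TTL, so that particular request
> gets rejected with a 401.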
>
> When a request sent from the leader to a follower receives a 401 response,
> the leader sometimes becomes too proactive and declares the replica down.
> In older versions (6.3.0), it seems that the replica will never recover
> automatically (you have to manually delete the down replicas and add new
> ones to fix it). Fortunately, as of 7.2.1 (maybe earlier) the down replicas
> will usually start to recover at some point (and the leaders seem less
> eager to declare replicas down), although we have had cases on 7.2.1 where
> they did not recover after being down for hours.
>
> The likely solution is to increase the TTL value by adding the line
>
> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=######"
>
> to the Solr environment file (solr.in.sh) on each node and restarting them.
> Replace ###### with a millisecond value of your choice; I'd suggest
> increasing it in intervals of 5s to start. If this does not fix your
> problem, then there is likely too much pressure on your hardware for one
> reason or another.
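>
> For example, going from the default 5000 ms to 10000 ms (just an
> illustrative value; tune it to your environment):
>
> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=10000"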
>
> Hopefully that helps.
>
> If anyone with more knowledge about the authentication plugin has
> corrections, wants to fill in gaps, or has an idea for figuring out which
> requests cause this issue, it'd be greatly appreciated.
>
> Best,
> Chris
>
> On Mon, Jun 18, 2018 at 9:38 AM Satya Marivada <satya.chaita...@gmail.com>
> wrote:
>
> > Hi, we are using Solr 6.3.0, and a collection has 3 of 4 replicas down
> > and 1 is up and serving.
> >
> > I see a single-line error repeating in the logs, as below, with no other
> > specific exception apart from it. I am wondering what this message is
> > saying and whether it is the cause of the nodes being down, but I saw
> > that it happened even before the replicas went down.
> >
> > 2018-06-18 04:45:51.818 ERROR (qtp1528637575-27215) [c:poi s:shard1
> > r:core_node5 x:poi_shard1_replica3] o.a.s.s.PKIAuthenticationPlugin
> > Invalid key request timestamp: 1529297138215 , received timestamp:
> > 1529297151817 , TTL: 5000
> >
> > Thanks,
> > Satya
> >
>
