Having time drift longer than the TTL would definitely cause these types of
problems.

In our case, the clusters are time-synchronized and the error is still
encountered periodically.
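
For reference, these are roughly the commands we use to spot-check the
offsets on each VM (which of them applies depends on whether the hosts run
ntpd or chrony):

    ntpstat           # sync status and estimated accuracy of the local clock
    ntpq -pn          # per-peer offsets in milliseconds
    chronyc tracking  # on chrony hosts: the "System time" line shows the offset from NTP

Even with those reporting a synchronized state, the error still shows up
now and then, which is what makes this puzzling.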


On Wed, Jun 20, 2018 at 10:07 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> We've seen this exact issue when the various machines report different
> wall-clock times, so getting those clocks coordinated is definitely the
> very first thing I'd do.
>
> It's particularly annoying because if the clocks are drifting apart
> gradually, your setup can be running fine for days/weeks/months and then
> break all of a sudden....
>
> Best,
> Erick
>
> On Wed, Jun 20, 2018 at 5:39 AM, Satya Marivada
> <satya.chaita...@gmail.com> wrote:
> > Chris,
> >
> > You are spot on with the timestamps. The date command returns different
> > times on these VMs, and they are not in sync via NTP. ntpstat shows a
> > difference of about 8-10 seconds across the 4 VMs, which is well beyond
> > the 5000 ms TTL and would have caused these synchronization issues and
> > marked the replicas as down. This only started happening recently and was
> > working fine earlier. So I will get the clock issue fixed and hope
> > everything falls into place.
> >
> > Yes, there are some other errors saying a javabin character 2 was
> > expected but 60, which is "<", was returned.
> >
> > Thanks,
> > Satya
> >
> > On Tue, Jun 19, 2018 at 8:01 AM Chris Ulicny <culicny@iq.media> wrote:
> >
> >> Satya,
> >>
> >> There should be some other log messages that are relevant to the issue
> >> you are having, something along the lines of "leader cannot communicate
> >> with follower...publishing replica as down." There is also likely a
> >> message of "expecting json/xml but got html" in another instance's logs.
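> >>
> >> A quick way to check for those (the log path below is just the typical
> >> default; adjust it to wherever your instances actually write their logs):
> >>
> >>     grep -iE "publishing.*replica.*down|expecting (json|xml)" \
> >>         /path/to/solr/server/logs/solr.log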
> >>
> >> We've seen this problem in various scenarios in our own clusters,
> >> usually during high volumes of requests, and what seems to be happening
> >> to us is the following.
> >>
> >> Since authentication is enabled, all requests between nodes must be
> >> authenticated, and Solr is using a timestamp to do this (in some way,
> >> not sure on the details). When the recipient of the request processes
> >> it, the timestamp is checked to see if it is within the Time-To-Live
> >> (TTL) millisecond value (default of 5000). If the timestamp is too old,
> >> the request is rejected with the above error and a response of 401 is
> >> delivered to the sender.
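> >>
> >> As a rough sketch of the check (this is the arithmetic as I understand
> >> it, not the plugin's actual code), using the two timestamps from your
> >> log line:
> >>
> >>     ttl=5000                  # default pkiauth.ttl in ms
> >>     key_ts=1529297138215      # timestamp carried with the request's PKI header
> >>     recv_ts=1529297151817     # clock on the node that received the request
> >>     echo $(( recv_ts - key_ts ))   # prints 13602, well over the 5000 ms TTL
> >>     # => the receiver rejects the request with "Invalid key request timestamp" / 401
> >>
> >> Which is presumably why we see it under heavy request volume even when
> >> the clocks agree: anything that delays the request past the TTL trips it.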
> >>
> >> When a request is sent from the leader to the follower and receives a
> >> 401 response, the leader sometimes becomes too aggressive and declares
> >> the replica down. In older versions (6.3.0), it seems that the replica
> >> will never recover automatically (manually deleting the down replicas
> >> and adding new ones fixes it). Fortunately, as of 7.2.1 (maybe earlier)
> >> the down replicas will usually start to recover at some point (and the
> >> leaders seem less eager to declare replicas down). That said, we have had
> >> cases where they did not recover after being down for hours on 7.2.1.
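> >>
> >> For reference, the manual fix can be done with the standard Collections
> >> API, roughly like the sketch below. Host, credentials, and the replica
> >> name are placeholders; the collection/shard just mirror the ones in your
> >> log, and the real replica name comes from CLUSTERSTATUS:
> >>
> >>     # find the down replica's name first
> >>     curl -u user:pass "http://host:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=poi"
> >>     # drop it and add a fresh one
> >>     curl -u user:pass "http://host:8983/solr/admin/collections?action=DELETEREPLICA&collection=poi&shard=shard1&replica=core_node5"
> >>     curl -u user:pass "http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=poi&shard=shard1"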
> >>
> >> Likely the solution to the problem is to increase the TTL value by
> >> adding the line
> >>
> >> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=#####"
> >>
> >> to the solr environment file (solr.in.sh) on each node and restarting
> >> them. Replace ##### with some millisecond value of your choice. I'd
> >> suggest just increasing it by intervals of 5s to start. If this does not
> >> fix your problem, then there is likely too much pressure on your hardware
> >> for some reason or another.
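> >>
> >> For example (the 10000 below is just to illustrate the format, not a
> >> recommendation for your setup):
> >>
> >>     # in solr.in.sh on every node
> >>     SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=10000"   # 10s instead of the 5s default
> >>
> >> and then restart each node (bin/solr restart with your usual options, or
> >> however you normally manage the service) so the new value takes effect.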
> >>
> >> Hopefully that helps.
> >>
> >> If anyone with more knowledge about the authentication plugin has
> >> corrections, wants to fill in gaps, or has an idea for figuring out which
> >> requests cause this issue, it'd be greatly appreciated.
> >>
> >> Best,
> >> Chris
> >>
> >> On Mon, Jun 18, 2018 at 9:38 AM Satya Marivada <
> satya.chaita...@gmail.com>
> >> wrote:
> >>
> >> > Hi, We are using Solr 6.3.0 and a collection has 3 of 4 replicas down
> >> > and 1 is up and serving.
> >> >
> >> > I see a single-line error repeating in the logs, shown below, with no
> >> > other specific exception apart from it. Wondering what this message is
> >> > saying: is it the cause of the nodes being down? Though I saw that it
> >> > happened even before the replicas went down.
> >> >
> >> > 2018-06-18 04:45:51.818 ERROR (qtp1528637575-27215) [c:poi s:shard1
> >> > r:core_node5 x:poi_shard1_replica3] o.a.s.s.PKIAuthenticationPlugin
> >> > Invalid key request timestamp: 1529297138215 , received timestamp:
> >> > 1529297151817 , TTL: 5000
> >> >
> >> > Thanks,
> >> > Satya
> >> >
> >>
>
