Having time drift longer than the TTL would definitely cause these types of problems.
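Before assuming the clocks are fine, it's worth a quick, admittedly rough check that just prints every node's clock side by side (the hostnames here are placeholders for your own VMs, and this assumes GNU date and ssh access between hosts):

    # print each node's UTC epoch time; gaps bigger than the TTL will stand out
    for h in solr-node1 solr-node2 solr-node3 solr-node4; do
      printf '%s ' "$h"; ssh "$h" date -u +%s.%N
    done

Running ntpq -p on each VM will also show the measured offset to its configured time sources.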
In our case, the clusters are time-synchronized and the error is still encountered periodically.

On Wed, Jun 20, 2018 at 10:07 AM Erick Erickson <erickerick...@gmail.com> wrote:
> We've seen this exact issue when the times reported by various
> machines have different wall-clock times, so getting these times
> coordinated is definitely the very first thing I'd do.
>
> It's particularly annoying because if the clocks are drifting apart
> gradually, your setup can be running fine for days/weeks/months and then
> break all of a sudden.
>
> Best,
> Erick
>
> On Wed, Jun 20, 2018 at 5:39 AM, Satya Marivada <satya.chaita...@gmail.com> wrote:
> > Chris,
> >
> > You are spot on with the timestamps. The date command returns different
> > times on these vms, and they are not in sync with ntp. ntpstat reports a
> > difference of about 8-10 seconds across the 4 vms, and that would have
> > caused these synchronization issues and marked the replicas as down.
> > This just happened recently and was working fine earlier. So I will get
> > this issue fixed and hope everything falls into place.
> >
> > Yes, there are some other errors saying that javabin character 2 was
> > expected but 60 was returned, which is "<".
> >
> > Thanks,
> > Satya
> >
> > On Tue, Jun 19, 2018 at 8:01 AM Chris Ulicny <culicny@iq.media> wrote:
> >
> >> Satya,
> >>
> >> There should be some other log messages that are probably relevant to
> >> the issue you are having, something along the lines of "leader cannot
> >> communicate with follower...publishing replica as down." There is also
> >> likely a message of "expecting json/xml but got html" in another
> >> instance's logs.
> >>
> >> We've seen this problem in various scenarios in our own clusters,
> >> usually during high volumes of requests, and what seems to be happening
> >> to us is the following.
> >>
> >> Since authentication is enabled, all requests between nodes must be
> >> authenticated, and Solr is using a timestamp to do this (in some way,
> >> not sure on the details). When the recipient of the request processes
> >> it, the timestamp is checked to see if it is within the Time-To-Live
> >> (TTL) millisecond value (default of 5000). If the timestamp is too old,
> >> the request is rejected with the above error and a response of 401 is
> >> delivered to the sender.
> >>
> >> When a request is sent from the leader to the follower and receives a
> >> 401 response, the leader sometimes becomes too proactive and declares
> >> the replica down. In older versions (6.3.0), it seems that the replica
> >> will never recover automatically (manually delete the down replicas and
> >> add new ones to fix). Fortunately, as of 7.2.1 (maybe earlier) the down
> >> replicas will usually start to recover at some point (and the leaders
> >> seem less proactive about declaring replicas down), although we have had
> >> cases on 7.2.1 where they did not recover after being down for hours.
> >>
> >> Likely the solution to the problem is to increase the TTL value by
> >> adding the line
> >>
> >> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=######"
> >>
> >> to the solr environment file (solr.in.sh) on each node and restarting
> >> them. Replace ##### with some millisecond value of your choice. I'd
> >> suggest just increasing it by intervals of 5s to start (see the example
> >> below). If this does not fix your problem, then there is likely too much
> >> pressure on your hardware for some reason or another.
> >>
> >> Hopefully that helps.
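> >> As a concrete example, raising the default 5000 to 10 seconds would
> >> look like this in solr.in.sh (the 10000 is purely illustrative, pick
> >> whatever value suits your cluster):
> >>
> >> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=10000"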
> >>
> >> If anyone with more knowledge about the authentication plugin has
> >> corrections, wants to fill in gaps, or has an idea of how to figure out
> >> what requests cause this issue, it'd be greatly appreciated.
> >>
> >> Best,
> >> Chris
> >>
> >> On Mon, Jun 18, 2018 at 9:38 AM Satya Marivada <satya.chaita...@gmail.com> wrote:
> >>
> >> > Hi, we are using solr 6.3.0, and a collection has 3 of 4 replicas
> >> > down and 1 is up and serving.
> >> >
> >> > I see a single-line error repeating in the logs, as below, and no
> >> > other specific exception apart from it. Wondering what this message
> >> > is saying and whether it is the cause of the nodes being down, but I
> >> > saw that it happened even before the replicas went down.
> >> >
> >> > 2018-06-18 04:45:51.818 ERROR (qtp1528637575-27215) [c:poi s:shard1
> >> > r:core_node5 x:poi_shard1_replica3] o.a.s.s.PKIAuthenticationPlugin
> >> > Invalid key request timestamp: 1529297138215 , received timestamp:
> >> > 1529297151817 , TTL: 5000
> >> >
> >> > Thanks,
> >> > Satya
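(For what it's worth, the gap in that original log line is 1529297151817 - 1529297138215 = 13602 ms, well outside the 5000 ms TTL, so the rejection itself is consistent with either clock skew or a request that was delayed long enough to go stale.)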