Satya, There should be some other log messages that are probably relevant to the issue you are having. Something along the lines of "leader cannot communicate with follower...publishing replica as down." It's likely there also is a message of "expecting json/xml but got html" in another instance's logs.
We've seen this problem in various scenarios in our own clusters, usually during high volumes of requests, and what seems to be happening to us is the following. Since authentication is enabled, all requests between nodes must be authenticated, and Solr is using a timestamp to do this (in some way, not sure on the details). When the recipient of the request processes it, the timestamp is checked to see if it is within the Time-To-Live (TTL) millisecond value (default of 5000). If the timestamp is too old, the request is rejected with the above error and a response of 401 is delivered to the sender. When a request is sent from the leader to the follower and receives a 401 response, the leader becomes too proactive sometimes and declares the replica down. In older versions (6.3.0), it seems that the replica will never recover automatically (manually delete the down replicas and add new ones to fix). Fortunately, as of 7.2.1 (maybe earlier) the down replicas will usually start to recover at some point (and the leaders seem less proactive to declare replicas down). Although, we have had cases where they did not recover after being down for hours on 7.2.1. Likely the solution to the problem is to increase the TTL value by adding the line SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=######" to the solr environment file (solr.in.sh) on each node and restarting them. Replace ##### with some millisecond value of your choice. I'd suggest just increasing it by intervals of 5s to start. If this does not fix your problem, then there is likely too much pressure on your hardware for some reason or another. Hopefully that helps. If anyone with more knowledge about the authentication plugin has corrections, wants fill in gaps, or has an idea to figure out what requests cause this issue. It'd be greatly appreciated. Best, Chris On Mon, Jun 18, 2018 at 9:38 AM Satya Marivada <satya.chaita...@gmail.com> wrote: > Hi, We are using solr 6.3.0 and a collection has 3 of 4 replicas down and 1 > is up and serving. > > I see a single line error repeating in logs as below. nothing else specific > exception apart from it. Wondering what this below message is saying, is it > the cause of nodes being down, but saw that this happened even before the > repllicas went down. > > 2018-06-18 04:45:51.818 ERROR (qtp1528637575-27215) [c:poi s:shard1 > r:core_node5 x:poi_shard1_replica3] o.a.s.s.PKIAuthenticationPlugin Invalid > key request timestamp: 1529297138215 , received timestamp: 1529297151817 , > TTL: 5000 > > Thanks, > Satya >