Satya,

There should be some other log messages that are probably relevant to the
issue you are having, something along the lines of "leader cannot
communicate with follower...publishing replica as down." There is likely
also an "expecting json/xml but got html" message in another instance's
logs.

We've seen this problem in various scenarios in our own clusters, usually
during high volumes of requests, and what seems to be happening to us is
the following.

Since authentication is enabled, all requests between nodes must be
authenticated, and Solr uses a timestamp as part of this (I'm not sure of
the details). When the recipient processes a request, it checks whether the
timestamp is within the Time-To-Live (TTL) value, which is in milliseconds
and defaults to 5000. If the timestamp is too old, the request is rejected
with the "Invalid key request timestamp" error from your logs, and a 401
response is returned to the sender.
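
As a rough illustration, here is the arithmetic on the two timestamps from
the log line quoted below. The comparison is only my assumption about how
the check behaves, not Solr's actual code, and the variable names are made
up for this sketch.

```shell
# Values copied from the PKIAuthenticationPlugin log line in the quoted message.
request_ts=1529297138215    # "key request timestamp" (ms since epoch)
received_ts=1529297151817   # "received timestamp" (ms since epoch)
ttl_ms=5000                 # "TTL" reported in the same log line

# Assumed check: reject when the request is older than the TTL.
age_ms=$((received_ts - request_ts))
if [ "$age_ms" -gt "$ttl_ms" ]; then
  verdict="rejected"
else
  verdict="accepted"
fi
echo "age=${age_ms}ms ttl=${ttl_ms}ms -> ${verdict}"
```

In that log line the request was about 13.6 seconds old when it was
processed, well past the 5-second TTL.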

When a request from the leader to a follower receives a 401 response, the
leader is sometimes too eager and declares the replica down. In older
versions (6.3.0), it seems that the replica never recovers automatically;
the fix is to manually delete the down replicas and add new ones.
Fortunately, as of 7.2.1 (maybe earlier), down replicas usually start to
recover at some point, and the leaders seem less eager to declare replicas
down. That said, we have had cases on 7.2.1 where replicas did not recover
after being down for hours.

The likely solution is to increase the TTL value by adding the line

SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=######"

to the Solr environment file (solr.in.sh) on each node and restarting the
nodes. Replace ###### with a millisecond value of your choice. I'd suggest
increasing it in steps of 5 seconds (5000 ms) to start. If this does not fix
your problem, then there is likely too much pressure on your hardware for
one reason or another.
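
For example, a first step up from the default might look like the line
below. The 10000 is only an illustrative starting value (default plus 5
seconds), not a tested recommendation.

```shell
# solr.in.sh, on every node; restart each node after the change.
# 10000 ms is an illustrative value: the 5000 ms default plus one 5 s step.
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=10000"
```

If the errors continue, raise the value by another 5000 and restart again.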

Hopefully that helps.

If anyone with more knowledge about the authentication plugin has
corrections, wants to fill in gaps, or has an idea for figuring out which
requests cause this issue, it'd be greatly appreciated.

Best,
Chris

On Mon, Jun 18, 2018 at 9:38 AM Satya Marivada <satya.chaita...@gmail.com>
wrote:

> Hi, We are using solr 6.3.0 and a collection has 3 of 4 replicas down and 1
> is up and serving.
>
> I see a single-line error repeating in the logs, shown below, with no other
> specific exception apart from it. I am wondering what this message means and
> whether it is the cause of the nodes being down, although I saw that it
> happened even before the replicas went down.
>
> 2018-06-18 04:45:51.818 ERROR (qtp1528637575-27215) [c:poi s:shard1
> r:core_node5 x:poi_shard1_replica3] o.a.s.s.PKIAuthenticationPlugin Invalid
> key request timestamp: 1529297138215 , received timestamp: 1529297151817 ,
> TTL: 5000
>
> Thanks,
> Satya
>
