Re: some solr replicas down
On 6/20/2018 6:39 AM, Satya Marivada wrote:
> Yes, there are some other errors saying that javabin character 2 was
> expected but 60, which is "<", was returned.

This happens when the response is an error. Error responses are sent in HTML format (so they render properly when viewed in a browser), and the first character of any HTML-formatted data is the "<" character.

Extreme pauses due to garbage collection are one thing that can cause time discrepancies in Java-based software, even when the systems themselves are in perfect time sync. Imagine a situation where a request is received, GC pauses take up several seconds, and only then is the time data in the request checked against the clock.

If your VMs really are out of time sync, this might apply: the core NTP algorithm will refuse to sync a system that's off by more than 1024 milliseconds. So if the system is off by 8-10 seconds, NTP might not actually be able to get the system synchronized. Redhat-based distributions of Linux deal with this problem at boot time by running ntpdate before ntpd itself, to hard-set the system clock to the exact time returned by an NTP source. Debian-based distributions handle it directly in NTP itself when NTP starts, though I do not know the exact details. Either way, the time on the system is then close enough for the core NTP algorithm to work.

If your system is a version of Linux based on Redhat, you might be able to get NTP working properly with this sequence:

service ntpd stop
service ntpdate start
sleep 10
service ntpd start

On a Debian-based system (including Ubuntu and others), this might be enough:

/etc/init.d/ntp restart

For other operating systems, you'll need to research. Depending on what virtual machine technology you're using, special considerations for NTP might need to be made. I think VMware requires some special config for time synchronization to work properly.

Thanks,
Shawn
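P.S. Before and after restarting NTP, it's worth confirming that the offsets actually shrink. A minimal sketch (the host names are placeholders for your four VMs):

# Compare each VM's idea of the current time, to the nanosecond,
# so relative drift between them is easy to eyeball.
for h in solr-vm1 solr-vm2 solr-vm3 solr-vm4; do
    echo -n "$h: "; ssh "$h" date +%s.%N
done

# On any single VM, "ntpq -p" shows the offset (in milliseconds)
# from each configured NTP peer; it should settle near zero.
ntpq -p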
Re: some solr replicas down
Having time drift longer than the TTL would definitely cause these types of problems. In our case, the clusters are time-synchronized and the error is still encountered periodically.

On Wed, Jun 20, 2018 at 10:07 AM Erick Erickson wrote:
> We've seen this exact issue when the times reported by various
> machines have different wall-clock times, so getting these times
> coordinated is definitely the very first thing I'd do.
Re: some solr replicas down
We've seen this exact issue when the times reported by various machines have different wall-clock times, so getting these times coordinated is definitely the very first thing I'd do.

It's particularly annoying because if the clocks are drifting apart gradually, your setup can be running fine for days/weeks/months and then break all of a sudden.

Best,
Erick

On Wed, Jun 20, 2018 at 5:39 AM, Satya Marivada wrote:
> You are spot on with the timestamps. The date command returns different
> times on these VMs, and they are not in sync with NTP.
Re: some solr replicas down
Chris,

You are spot on with the timestamps. The date command returns different times on these VMs, and they are not in sync with NTP. ntpstat shows a difference of about 8-10 seconds across the 4 VMs, which would have caused these synchronization issues and marked the replicas as down. This happened only recently; it was working fine earlier. So I will get the time sync fixed and hope everything falls into place.

Yes, there are some other errors saying that javabin character 2 was expected but 60, which is "<", was returned.

Thanks,
Satya

On Tue, Jun 19, 2018 at 8:01 AM Chris Ulicny wrote:
> There should be some other log messages that are probably relevant to the
> issue you are having. Something along the lines of "leader cannot
> communicate with follower...publishing replica as down."
Re: some solr replicas down
Satya,

There should be some other log messages that are probably relevant to the issue you are having. Something along the lines of "leader cannot communicate with follower...publishing replica as down." There is also likely a message of "expecting json/xml but got html" in another instance's logs.

We've seen this problem in various scenarios in our own clusters, usually during high volumes of requests, and what seems to be happening to us is the following.

Since authentication is enabled, all requests between nodes must be authenticated, and Solr uses a timestamp to do this (in some way; not sure on the details). When the recipient of the request processes it, the timestamp is checked to see if it is within the Time-To-Live (TTL) millisecond value (default of 5000). If the timestamp is too old, the request is rejected with the above error and a 401 response is delivered to the sender.

When a request sent from the leader to the follower receives a 401 response, the leader sometimes becomes too proactive and declares the replica down. In older versions (6.3.0), it seems that the replica will never recover automatically (manually delete the down replicas and add new ones to fix). Fortunately, as of 7.2.1 (maybe earlier) the down replicas will usually start to recover at some point (and the leaders seem less proactive about declaring replicas down). Although, we have had cases on 7.2.1 where they did not recover after being down for hours.

Likely the solution to the problem is to increase the TTL value by adding the line

SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=##"

to the solr environment file (solr.in.sh) on each node and restarting them. Replace ## with some millisecond value of your choice. I'd suggest increasing it in intervals of 5 seconds to start. If this does not fix your problem, then there is likely too much pressure on your hardware for some reason or another.

Hopefully that helps.

If anyone with more knowledge about the authentication plugin has corrections, wants to fill in gaps, or has ideas on how to figure out which requests cause this issue, it'd be greatly appreciated.

Best,
Chris
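P.S. For concreteness, a minimal sketch of the change described above. The 15000 ms value is only an illustration (the 5000 ms default plus two 5-second increments), not a recommendation, and the port is assumed:

# Append to solr.in.sh on each node; 15000 ms is an example value.
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=15000"

# Then restart each node so the new TTL takes effect:
bin/solr restart -p 8983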
some solr replicas down
Hi,

We are using Solr 6.3.0, and a collection has 3 of 4 replicas down, with 1 up and serving.

I see a single-line error repeating in the logs, shown below, with no other specific exception apart from it. I am wondering what this message is saying: is it the cause of the nodes being down? I saw that it appeared even before the replicas went down.

2018-06-18 04:45:51.818 ERROR (qtp1528637575-27215) [c:poi s:shard1
r:core_node5 x:poi_shard1_replica3] o.a.s.s.PKIAuthenticationPlugin Invalid
key request timestamp: 1529297138215 , received timestamp: 1529297151817 ,
TTL: 5000

Thanks,
Satya
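P.S. The gap between the two timestamps in that log line is easy to check, and it comes out well above the 5000 ms TTL the plugin reports:

# received timestamp minus key request timestamp, in milliseconds
echo $((1529297151817 - 1529297138215))   # prints 13602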