empiredan opened a new issue, #1101:
URL: https://github.com/apache/incubator-pegasus/issues/1101

   ## Background
   
   Recently a cluster on production environment was found that primary meta 
server had frequently disconnected the replica servers, for the reason that the 
duration since the last beacon from each replica server had been received by 
the primary meta server was often greater than the grace period (70+ seconds 
vs. 22 seconds).
   
   The network latency is typically several hundreds of microseconds, which 
means something must have been wrong for this cluster. After trouble shooting 
for the root cause, it was found that there are 2 different NTP servers A and B 
in the configuration. A is slower than B by more than one minute. For example, 
meta server received a beacon at 12:05:25 from A; then the clocks jumped 
suddenly to 12:06:35; the meta server found that it has passed far more than 
the grace period, then disconnected the corresponding replica server.
   
   ## Implementation
   
   The duration since the meta server has received the last beacon for each 
replica server can be added as a gauge, to find the exception in the system 
faster.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to