[GitHub] [accumulo] EdColeman commented on issue #3138: Add shell command for marking tservers as "dead"

GitBox Tue, 20 Dec 2022 12:21:43 -0800


EdColeman commented on issue #3138:
URL: https://github.com/apache/accumulo/issues/3138#issuecomment-1360129408


   I was thinking specifically of systemd restart, but the same caution may 
hold for Puppet and similar tools.  There are different schools of thought.  
Someone could be "strict" and never automatically restart a node without some 
verification, while others could decide that restarting has low enough risk 
that intervention is not required.
   
   Some classes of errors that kill a tserver, such as loss of the ZooKeeper 
lock or an OOM, can likely be handled with a restart, but they should also be 
trended so that underlying problems are not hidden because things "seem to 
work".  Repeatedly failing and then restarting a node can cause a lot of 
tablet migrations and recovery work.
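
   As a rough illustration of trending restarts instead of restarting blindly, 
a supervisor-side check could look something like the sketch below. Everything 
in it is hypothetical: the `RestartTracker` name, the thresholds, and the idea 
of passing timestamps in by hand are assumptions for illustration, not part of 
Accumulo, systemd, or Puppet.

```python
from collections import deque


class RestartTracker:
    """Hypothetical helper: remembers recent restarts of one tserver and
    reports when automatic restarts should stop so a person can look."""

    def __init__(self, max_restarts=3, window_seconds=600.0):
        # Assumed policy: at most `max_restarts` restarts per window.
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._times = deque()

    def _expire(self, now):
        # Forget restarts that are older than the trending window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()

    def record_restart(self, now):
        # `now` is a timestamp in seconds supplied by the caller.
        self._times.append(now)
        self._expire(now)

    def recent_restarts(self, now):
        self._expire(now)
        return len(self._times)

    def should_auto_restart(self, now):
        # False means "this looks like a crash loop, ask for intervention".
        return self.recent_restarts(now) < self.max_restarts
```

   With the defaults above, three restarts inside ten minutes would make 
`should_auto_restart` return `False` until the oldest restart ages out of the 
window, which is one way to keep a "seems to work" restart loop visible.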
   
   One particularly "fun" class of problems is caused by "bad data": maybe it's 
an improper row, or it's an iterator configuration issue. For example, a 
bulk-imported file may contain unprocessable row(s) that trigger a failure.  
Accumulo recovers, the tablet / row migrates to another tserver, and the cycle 
repeats...
   
   In terms of this issue, running the admin stop command or otherwise 
removing the ZooKeeper lock will kill the tserver, but that is different from 
the tserver being marked dead.
   


