EdColeman commented on issue #3138: URL: https://github.com/apache/accumulo/issues/3138#issuecomment-1360129408
I was thinking specifically of systemd restart, but the same caution may hold for Puppet and similar tools.

There are different schools of thought. Someone could be "strict" and never automatically restart a node without some verification, while others could decide that restarting carries low enough risk that intervention is not required.

Some classes of errors that kill a tserver, such as loss of the ZooKeeper lock or an OOM, can likely be handled by a restart, but they should also be trended so that underlying problems are not hidden because things "seem to work". Repeatedly failing and then restarting a node can cause a lot of tablet migrations and recovery work.

One particularly "fun" class of problems involves "bad data": maybe it's an improper row, or it's an iterator configuration issue. For example, a bulk-imported file may have unprocessable row(s) that trigger a failure. Accumulo recovers, the tablet / row migrates, and the cycle repeats.

In terms of this issue, stopping the tserver via the admin stop command or otherwise removing its ZooKeeper lock will kill the tserver, but that is different from being marked dead.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
