[ https://issues.apache.org/jira/browse/ACCUMULO-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389258#comment-14389258 ]
Billie Rinaldi commented on ACCUMULO-3569:
------------------------------------------

I'd rather this be turned off by default.

> Automatically restart accumulo processes intelligently
> ------------------------------------------------------
>
>                 Key: ACCUMULO-3569
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3569
>             Project: Accumulo
>          Issue Type: Bug
>          Components: scripts
>            Reporter: John Vines
>             Fix For: 1.7.0
>
>         Attachments: 0001-ACCUMULO-3569-initial-pass-at-integrating-auto-resta.patch
>
>
> On occasion processes will die, for a variety of reasons. Some of those reasons are critical, whereas others may be due to momentary blips; not all of them warrant keeping the server down and requiring human attention.
>
> With that, I would like to propose a watcher process: an optional component that wraps the calls to the various processes (tserver, master, etc.). This process can watch the processes, collect their exit codes, read their logs, etc., and make intelligent decisions about how to respond. That behavior would include coarse detection of failure types (discussed below) and a configurable policy for how many restart attempts should be made in a given window before giving up entirely.
>
> As for failure types, there are a few archetypal ones that seem to recur regularly and that I think are prime candidates for an initial approach:
>
> ZooKeeper lock lost - this can happen for a variety of reasons, mostly related to network issues or server (tserver or zk node) congestion. These are some of the most common errors and are typically transient. However, if they occur with great frequency, that is a sign of a larger issue that needs to be handled by an administrator.
>
> JVM OOM - there are two situations where these really seem to occur: a system that is simply misconfigured and dies shortly after it starts up, and the case where the system gets slammed in just the right way that objects in our code and/or the iterator stack push the JVM just over its limits. The former will fail quickly and repeatedly when restarted, whereas the latter occurs rarely and will want attention, but doesn't warrant keeping the node offline in the meantime.
>
> Standard shutdown - this is the one case where we don't want the watcher to react automatically, because we want the process to go down. Just a design consideration.
>
> Unexpected exceptions - this is a catch-all for everything else. We can attempt to enumerate them, but they're less common. The watcher would be configured with less tolerance for these, but just because a server goes down due to a random software bug doesn't mean that server should be removed from the cluster unless the failure happens repeatedly (because then it's a sign of a hardware/system issue). We should provide the ability to keep resources available in this space.
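As a concrete sketch of the windowed restart policy the description proposes, the wrapper below restarts the wrapped command on nonzero exit and gives up once it has failed more than MAX_RESTARTS times within WINDOW_SECS seconds. This is an illustration only; the script, its knobs, and its defaults are hypothetical and not taken from the attached patch.

{code}
#!/usr/bin/env bash
# Hypothetical watcher loop: restart the wrapped command on nonzero
# exit, but give up after too many failures in a sliding window.
# MAX_RESTARTS and WINDOW_SECS are made-up knobs, not real Accumulo
# configuration properties.
MAX_RESTARTS=${MAX_RESTARTS:-5}
WINDOW_SECS=${WINDOW_SECS:-300}

restart_times=()

while true; do
  "$@"                             # e.g. bin/accumulo tserver
  code=$?
  if [ "$code" -eq 0 ]; then
    exit 0                         # standard shutdown: stay down
  fi
  now=$(date +%s)
  # Discard restart timestamps that have aged out of the window.
  recent=()
  for t in "${restart_times[@]}"; do
    (( now - t < WINDOW_SECS )) && recent+=("$t")
  done
  restart_times=("${recent[@]}" "$now")
  if (( ${#restart_times[@]} > MAX_RESTARTS )); then
    echo "$1: ${#restart_times[@]} failures within ${WINDOW_SECS}s, giving up" >&2
    exit "$code"
  fi
  echo "$1 exited with code $code, restarting" >&2
done
{code}

Invoked as, say, watcher.sh bin/accumulo tserver, it only ever restarts on failure, so a deliberate clean stop leaves the process down, matching the "standard shutdown" case above.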
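The coarse failure-type detection could hang off the same exit code. Here is a sketch assuming the servers halt with distinguishable codes; the specific values are invented except 0 (clean exit) and 137 (128+SIGKILL, which is what the parent shell sees when the JVM is run with something like -XX:OnOutOfMemoryError="kill -9 %p"):

{code}
# Hypothetical classifier mapping exit codes to the failure types in
# the description. Codes other than 0 and 137 are invented; a real
# integration would need the servers to halt with documented codes.
classify_exit() {
  case "$1" in
    0)   echo "shutdown" ;;      # deliberate stop: never restart
    137) echo "jvm-oom" ;;       # 128+SIGKILL, e.g. an OOM kill hook
    80)  echo "zk-lock-lost" ;;  # invented code for a lost ZK lock
    *)   echo "unexpected" ;;    # catch-all: restart, low tolerance
  esac
}
{code}

Each class could then carry its own MAX_RESTARTS/WINDOW_SECS budget, e.g. a generous window for zk-lock-lost (common and usually transient) and a tight one for jvm-oom, so a misconfigured JVM fails fast instead of flapping.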