[ 
https://issues.apache.org/jira/browse/ACCUMULO-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481627#comment-14481627
 ] 

Dave Marion commented on ACCUMULO-3569:
---------------------------------------

I don't plan on standing in the way here, although it does appear that others 
have raised concerns. I was merely asking whether you had looked at JSW / 
YAJSW; these tools have been around for years. I think JSW's license precludes 
its use, and the license for YAJSW did as well until recently.

> Automatically restart accumulo processes intelligently
> ------------------------------------------------------
>
>                 Key: ACCUMULO-3569
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3569
>             Project: Accumulo
>          Issue Type: Bug
>          Components: scripts
>            Reporter: John Vines
>             Fix For: 1.8.0
>
>         Attachments: 
> 0001-ACCUMULO-3569-initial-pass-at-integrating-auto-resta.patch
>
>
> On occasion a process will die, for a variety of reasons. Some are critical, 
> whereas others may be due to momentary blips; not all of them warrant keeping 
> the server down and requiring human attention.
> With that, I would like to propose a watcher process: an optional component 
> that wraps the calls to the various processes (tserver, master, etc.). This 
> process can watch the processes, get their exit codes, read their logs, etc., 
> and make intelligent decisions about how to behave. This behavior would 
> include coarse detection of failure types (discussed below) and a 
> configurable response policy for how many restart attempts should be made in 
> a given window before giving up entirely.
> As for failure types, there are a few archetypal ones that seem to recur 
> regularly and that I think are prime candidates for an initial approach-
> ZooKeeper lock lost - this can happen for a variety of reasons, mostly 
> related to network issues or server (tserver or zk node) congestion. These 
> are some of the most common errors and are typically transient. However, if 
> these occur with great frequency, it's a sign of a larger issue that needs 
> to be handled by an administrator.
> JVM OOM - there are two cases where these really seem to occur: a system 
> that's simply misconfigured and dies shortly after it starts up, and a system 
> that gets slammed in just the right way so that objects in our code and/or 
> the iterator stack push the JVM just over its limits. The former will fail 
> again quickly when restarted, whereas the latter is something that occurs 
> rarely and will want attention, but doesn't warrant keeping the node offline 
> in the meantime.
> Standard shutdown - this is simply the case where we don't want the watcher 
> to intervene, because we intentionally brought the process down. Just a 
> design consideration.
> Unexpected exceptions - this is a catch-all for everything else. We can 
> attempt to enumerate them, but they're less common. This is something the 
> watcher would be configured to have less tolerance for, but just because a 
> server goes down due to a random software bug doesn't mean that server should 
> be removed from the cluster unless it happens repeatedly (because then it's a 
> sign of a hardware/system issue). We should still provide the ability to keep 
> resources available in this case.
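
For illustration only, here is a minimal sketch (in Java, with entirely 
hypothetical names; none of this exists in Accumulo) of the kind of policy the 
quoted description proposes: classify why a wrapped process exited and allow at 
most a configurable number of restarts inside a sliding time window, with a 
clean shutdown never triggering a restart.

    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Hypothetical restart policy for the proposed watcher process. */
    public class RestartPolicy {

      /** Coarse failure types from the issue description. */
      enum ExitCause { ZOOKEEPER_LOCK_LOST, JVM_OOM, CLEAN_SHUTDOWN, UNEXPECTED }

      private final int maxAttempts;     // e.g. 5 restarts ...
      private final long windowMillis;   // ... within a 10 minute window
      private final Deque<Long> restartTimes = new ArrayDeque<>();

      RestartPolicy(int maxAttempts, long windowMillis) {
        this.maxAttempts = maxAttempts;
        this.windowMillis = windowMillis;
      }

      /** Decide whether the watcher should restart the wrapped process. */
      boolean shouldRestart(ExitCause cause, long nowMillis) {
        if (cause == ExitCause.CLEAN_SHUTDOWN) {
          return false; // the operator asked for it; never intervene
        }
        // Drop restart timestamps that have aged out of the window.
        while (!restartTimes.isEmpty()
            && nowMillis - restartTimes.peekFirst() > windowMillis) {
          restartTimes.pollFirst();
        }
        // Give unexpected failures a tighter budget than transient ones.
        int budget = (cause == ExitCause.UNEXPECTED)
            ? Math.max(1, maxAttempts / 2) : maxAttempts;
        if (restartTimes.size() >= budget) {
          return false; // repeated failures: leave it down for an administrator
        }
        restartTimes.addLast(nowMillis);
        return true;
      }
    }

A real watcher would also have to launch and monitor the child processes (as 
JSW / YAJSW do) and derive the exit cause from exit codes and log output, which 
is where most of the complexity would live.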



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
