[
https://issues.apache.org/jira/browse/HADOOP-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666211#action_12666211
]
Steve Loughran commented on HADOOP-3628:
----------------------------------------
Suresh, I'm finally sitting down to look at your comments; it's going to take me
a while to work through them. Thank you (and Tom!) for the comments.
Some quick answers.
"Should we have a state that captures out of service for maintenance?"
It's kind of tricky to have that on something that is network visible, since a
lot of maintenance makes the program unreachable. At the same time, I could
imagine a cluster being offline to new job submissions, or a filesystem in
read-only mode.
On a related note, I've long wondered if we should have an XHTML format for web
sites to say that they are offline for maintenance in some way that was machine
readable; something you could aggregate and which would include forward
maintenance notices. Something would hit the status pages of the various
machines in your infrastructure and build up a calendar of planned outages,
work out which were the SPOF and aggregate them differently from the redundant
bits. I don't think this is the right time for this, but it's still an idea I'm
fond of.
"-I am not clear on how Failed state transitions to Terminated. If failed state
transitions to terminated, the fact that the service failed will no longer
available?"
Good question. When terminated, a service should shut down its thread and do
any cleanup. But any underlying exception that triggered a failure should still
be in the failureCause field. So provided that a throwable was passed in to
enterFailedState(), the service remembers what happened.
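A minimal sketch of that behaviour, to make the answer concrete. Only failureCause and enterFailedState() come from the comment above; the class name, the State enum, and the other methods are hypothetical and are not the API in the attached patch.

```java
// Hypothetical sketch, not the code in hadoop-3628.patch.
// Only failureCause and enterFailedState() are names taken from the comment.
class SketchService {
    enum State { CREATED, LIVE, FAILED, TERMINATED }

    private State state = State.CREATED;
    // Kept for the lifetime of the object, so it survives FAILED -> TERMINATED.
    private Throwable failureCause;

    synchronized void start() {
        state = State.LIVE;
    }

    synchronized void enterFailedState(Throwable cause) {
        // Remember only the first cause; later failures do not overwrite it.
        if (cause != null && failureCause == null) {
            failureCause = cause;
        }
        state = State.FAILED;
    }

    synchronized void terminate() {
        // A real service would stop its thread and clean up here.
        // failureCause is deliberately left untouched.
        state = State.TERMINATED;
    }

    synchronized State getState() {
        return state;
    }

    synchronized Throwable getFailureCause() {
        return failureCause;
    }
}
```

With this shape, a caller that only ever sees the service in the terminated state can still ask getFailureCause() to find out whether it got there through a failure.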
What I'm not doing (currently) is retaining the entire history. We could do
that if you felt it was useful; build up a list of state transitions and
timestamps, the history.
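If we did retain the history, it could be as small as the following sketch. This is an assumption about one possible shape, not something in the current patch; StateHistory, Entry, record(), and snapshot() are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a transition log; not part of hadoop-3628.patch.
class StateHistory {
    // One recorded transition: the state entered and when it was entered.
    static final class Entry {
        final String state;
        final long timestamp;

        Entry(String state, long timestamp) {
            this.state = state;
            this.timestamp = timestamp;
        }
    }

    private final List<Entry> entries = new ArrayList<Entry>();

    // Called by the service each time it enters a new state.
    synchronized void record(String state) {
        entries.add(new Entry(state, System.currentTimeMillis()));
    }

    // Returns a read-only copy, so callers cannot alter the log and are not
    // affected by later transitions.
    synchronized List<Entry> snapshot() {
        return Collections.unmodifiableList(new ArrayList<Entry>(entries));
    }
}
```

The list grows without bound in this form; a long-lived service that changes state often would probably want to cap it.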
I'm going to look more at your state machine proposal. One thing that worries me
about being able to add new states is how well they aggregate.
> Add a lifecycle interface for Hadoop components: namenodes, job clients, etc.
> -----------------------------------------------------------------------------
>
> Key: HADOOP-3628
> URL: https://issues.apache.org/jira/browse/HADOOP-3628
> Project: Hadoop Core
> Issue Type: Improvement
> Components: dfs, mapred
> Affects Versions: 0.20.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: AbstractHadoopComponent.java, hadoop-3628.patch,
> hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch,
> hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch,
> hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch,
> hadoop-3628.patch, hadoop-lifecycle.pdf, hadoop-lifecycle.sxw
>
>
> I'd like to propose we have a standard interface for hadoop components, the
> things that get started or stopped when you bring up a namenode. Currently,
> some of these classes have a stop() or shutdown() method, with no standard
> name/interface, and there is no way of seeing if they are live, checking
> their health, or shutting them down reliably. Indeed, there is a tendency for
> the spawned threads to not want to die; to require the entire process to be
> killed to stop the workers.
> Having a standard interface would make it easier for
> * management tools to manage the different things
> * monitoring the state of things
> * subclassing
> The last is interesting as right now TaskTracker and JobTracker start up
> threads in their constructor; that's very dangerous as subclasses may have
> their methods called before they are fully initialised. Adding this interface
> would be the right time to clean up the startup process so that subclassing
> is less risky.
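
The constructor problem described in the issue can be illustrated with a small sketch: the constructor starts nothing, and threads are created only in an explicit start() call, by which point any subclass constructor has finished. The Lifecycle interface and Worker class below are hypothetical names chosen for illustration; they are not the interface in the attached patch.

```java
// Hypothetical sketch; not the interface proposed in hadoop-3628.patch.
interface Lifecycle {
    void start();
    boolean isLive();
    void stop();
}

class Worker implements Lifecycle {
    private volatile boolean running;
    private Thread thread;

    Worker() {
        // Deliberately starts no threads: a subclass constructor has not run
        // yet, so calling overridable methods from a thread here is unsafe.
    }

    public synchronized void start() {
        if (thread != null) {
            return; // one-shot: already started (or already stopped)
        }
        running = true;
        thread = new Thread(this::runLoop, "worker");
        thread.setDaemon(true);
        thread.start();
    }

    // Subclasses may override this; it is only called after construction.
    protected void runLoop() {
        while (running) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public synchronized boolean isLive() {
        return thread != null && thread.isAlive();
    }

    public synchronized void stop() {
        running = false;
        if (thread != null) {
            thread.interrupt();
            try {
                thread.join(1000); // wait briefly for the worker to exit
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

A management tool would then drive any component through the same three calls, rather than needing to know whether a given class happens to use stop() or shutdown().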