Enhance YARN service model
--------------------------

                 Key: MAPREDUCE-3939
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3939
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: mrv2
    Affects Versions: 0.24.0, 0.23.2
            Reporter: Steve Loughran
            Assignee: Steve Loughran


Having played the YARN service model, there are some issues
that I've identified based on past work and initial use.

This JIRA issue is an overall one to cover the issues, with solutions pushed 
out to separate JIRAs.

h2. state model prevents stopped state being entered if you could not 
successfully start the service.

In the current lifecycle you cannot stop a service unless it was successfully 
started, but
* {{init()}} may acquire resources that need to be explicitly released
* if the {{start()}} operation fails partway through, the {{stop()}} operation 
may be needed to release resources.

*Fix:* make {{stop()}} a valid state transition from all states and require the 
implementations to be able to stop safely without requiring all fields to be 
non null.

Before anyone points out that the {{stop()}} operations assume that all fields 
are valid; and if called before a {{start()}} they will NPE; MAPREDUCE-3431 
shows that this problem arises today, MAPREDUCE-3502 is a fix for this. It is 
independent of the rest of the issues in this doc but it will aid making 
{{stop()}} execute from all states other than "stopped".

MAPREDUCE-3502 is too big a patch and needs to be broken down for easier review 
and take up; this can be done with issues linked to this one.
h2. AbstractService doesn't prevent duplicate state change requests.

The {{ensureState()}} checks to verify whether or not a state transition is 
allowed from the current state are performed in the base {{AbstractService}} 
class -yet subclasses tend to call this *after* their own {{init()}}, 
{{start()}} & {{stop()}} operations. This means that these operations can be 
performed out of order, and even if the outcome of the call is an exception, 
all actions performed by the subclasses will have taken place. MAPREDUCE-3877 
demonstrates this.

This is a tricky one to address. In HADOOP-3128 I used a base class instead of 
an interface and made the {{init()}}, {{start()}} & {{stop()}} methods 
{{final}}. These methods would do the checks, and then invoke protected inner 
methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to 
retrofit the same behaviour to everything that extends {{AbstractService}} 
-something that must be done before the class is considered stable (because 
once the lifecycle methods are declared final, all subclasses that are out of 
the source tree will need fixing by the respective developers.

h2. AbstractService state change doesn't defend against race conditions.

There's no concurrency locks on the state transitions. Whatever fix for wrong 
state calls is added should correct this to prevent re-entrancy, such as 
{{stop()}} being called from two threads.

h2.  Static methods to choreograph of lifecycle operations

Helper methods to move things through lifecycles. init->start is common, 
stop-if-service!=null another. Some static methods can execute these, and even 
call {{stop()}} if {{init()}} raises an exception. These could go into a class 
{{ServiceOps}} in the same package. These can be used by those services that 
wrap other services, and help manage more robust shutdowns.

h2. state transition failures are something that registered service listeners 
may wish to be informed of.

When a state transition fails a {{RuntimeException}} can be thrown -and the 
service listeners are not informed as the notification point isn't reached. 
They may wish to know this, especially for management and diagnostics.

*Fix:* extend {{ServiceStateChangeListener}} with a callback such as 
{{stateChangeFailed(Service service,Service.State targeted-state, 
RuntimeException e)}} that is invoked from the (final) state change methods in 
the {{AbstractService}} class (once they delegate to their inner 
{{innerStart()}}, {{innerStop()}} methods; make a no-op on the existing 
implementations of the interface.

h2. Service listener failures not handled

Is this an error an error or not? Log and ignore may not be what is desired.

*Proposed:* during {{stop()}} any exception by a listener is caught and 
discarded, to increase the likelihood of a better shutdown, but do not add 
try-catch clauses to the other state changes.

h2. Support static listeners for all AbstractServices

Add support to {{AbstractService}} that allow callers to register listeners for 
all instances. The existing listener interface could be used. This allows 
management tools to hook into the events.

The static listeners would be invoked for all state changes except creation 
(base class shouldn't be handing out references to itself at this point).

These static events could all be async, pushed through a shared 
{{ConcurrentLinkedQueue}}; failures logged at warn and the rest of the 
listeners invoked.

h2. Add some example listeners for management/diagnostics
* event to commons log for humans.
* events for machines hooked up to the JSON logger.
* for testing: something that be told to fail.

h2.  Services should support signal interruptibility

The services would benefit from a way of shutting them down on a kill signal; 
this can be done via a runtime hook. It should not be automatic though, as 
composite services will get into a very complex state during shutdown. Better 
to provide a hook that lets you register/unregister services to terminate, and 
have the relevant {{main()}} entry points tell their root services to register 
themselves.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to