[jira] [Commented] (YARN-45) Scheduler feedback to AM to release containers
[ https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631400#comment-13631400 ] Bikas Saha commented on YARN-45: My personal preference would be to not have an API that is not actionable. If the RM does not have any support for ResourceRequest scenarios, then we can leave that out for later, when such support does arise. Having something out there that does not work may lead to misunderstanding and confusion on the part of YARN app developers. > Scheduler feedback to AM to release containers > -- > > Key: YARN-45 > URL: https://issues.apache.org/jira/browse/YARN-45 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Chris Douglas >Assignee: Carlo Curino > Attachments: YARN-45.patch, YARN-45.patch > > > The ResourceManager strikes a balance between cluster utilization and strict > enforcement of resource invariants in the cluster. Individual allocations of > containers must be reclaimed- or reserved- to restore the global invariants > when cluster load shifts. In some cases, the ApplicationMaster can respond to > fluctuations in resource availability without losing the work already > completed by that task (MAPREDUCE-4584). Supplying it with this information > would be helpful for overall cluster utilization [1]. To this end, we want to > establish a protocol for the RM to ask the AM to release containers. > [1] http://research.yahoo.com/files/yl-2012-003.pdf -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
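As a purely hypothetical illustration of what acting on such scheduler feedback could look like on the AM side, here is a minimal Java sketch. The PreemptionHint type and every method name in it are invented for illustration; YARN-45 only proposes that a protocol of this shape exist, and checkpoint-then-release (MAPREDUCE-4584 style) is one way an AM might react to it.

{code:java}
import java.util.ArrayList;
import java.util.List;

public class PreemptionHandler {

  /** Invented carrier for the RM's "please release these" hint; not a real YARN class. */
  public static class PreemptionHint {
    private final List<String> containerIds;
    public PreemptionHint(List<String> containerIds) { this.containerIds = containerIds; }
    public List<String> getContainerIdsToRelease() { return containerIds; }
  }

  private final List<String> pendingRelease = new ArrayList<String>();

  /** Checkpoint the work running in each container, then queue it for release. */
  public void onPreemptionHint(PreemptionHint hint) {
    for (String containerId : hint.getContainerIdsToRelease()) {
      checkpointWork(containerId);        // save partial output so the task can resume elsewhere
      pendingRelease.add(containerId);    // hand back on the next allocate() heartbeat
    }
  }

  private void checkpointWork(String containerId) {
    // application-specific: persist enough state to resume the preempted work later
  }

  /** Containers to include in the release list of the next AM-RM heartbeat. */
  public List<String> drainReleaseList() {
    List<String> toRelease = new ArrayList<String>(pendingRelease);
    pendingRelease.clear();
    return toRelease;
  }
}
{code}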
[jira] [Commented] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631399#comment-13631399 ] Bikas Saha commented on YARN-445: - Sounds like an enhancement in the NM API. Moving under YARN-386. Please unlink if that is not correct. I can see the use case this seeks to solve. I am wondering what the abstraction is in the general case. That would help us to not change stuff for every similar use case. Keeping platform neutrality would be beneficial so that the use cases continue to work for non-Java AM/tasks or on Windows. > Ability to signal containers > > > Key: YARN-445 > URL: https://issues.apache.org/jira/browse/YARN-445 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.0.5-beta >Reporter: Jason Lowe > > It would be nice if an ApplicationMaster could send signals to containers > such as SIGQUIT, SIGUSR1, etc. > For example, in order to replicate the jstack-on-task-timeout feature > implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an > interface for sending SIGQUIT to a container. For that specific feature we > could implement it as an additional field in the StopContainerRequest. > However that would not address other potential features like the ability for > an AM to trigger jstacks on arbitrary tasks *without* killing them. The > latter feature would be a very useful debugging tool for users who do not > have shell access to the nodes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
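For reference, a minimal POSIX-only sketch of what "signal a container" boils down to at the NM level: deliver SIGQUIT to the container's process, which makes a HotSpot JVM dump its thread stacks to stderr (and hence into the container log). The class, its method name, and the pid-based approach are assumptions for illustration only; as the comment above notes, a real API would need a platform-neutral abstraction that also covers Windows and non-Java containers.

{code:java}
import java.io.IOException;

public final class ContainerSignaller {

  private ContainerSignaller() {}

  /** POSIX only: send SIGQUIT to the given process; a JVM responds with a thread dump on stderr. */
  public static void signalQuit(long containerPid) throws IOException, InterruptedException {
    Process p = new ProcessBuilder("kill", "-QUIT", Long.toString(containerPid))
        .inheritIO()
        .start();
    int exit = p.waitFor();
    if (exit != 0) {
      throw new IOException("kill -QUIT " + containerPid + " exited with status " + exit);
    }
  }
}
{code}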
[jira] [Updated] (YARN-445) Ability to signal containers
[ https://issues.apache.org/jira/browse/YARN-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-445: Issue Type: Sub-task (was: New Feature) Parent: YARN-386 > Ability to signal containers > > > Key: YARN-445 > URL: https://issues.apache.org/jira/browse/YARN-445 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.0.5-beta >Reporter: Jason Lowe > > It would be nice if an ApplicationMaster could send signals to containers > such as SIGQUIT, SIGUSR1, etc. > For example, in order to replicate the jstack-on-task-timeout feature > implemented by MAPREDUCE-1119 in Hadoop 0.21 the NodeManager needs an > interface for sending SIGQUIT to a container. For that specific feature we > could implement it as an additional field in the StopContainerRequest. > However that would not address other potential features like the ability for > an AM to trigger jstacks on arbitrary tasks *without* killing them. The > latter feature would be a very useful debugging tool for users who do not > have shell access to the nodes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-386) [Umbrella] YARN API Changes
[ https://issues.apache.org/jira/browse/YARN-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-386: Summary: [Umbrella] YARN API Changes (was: [Umbrella] YARN API cleanup) > [Umbrella] YARN API Changes > --- > > Key: YARN-386 > URL: https://issues.apache.org/jira/browse/YARN-386 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli > > This is the umbrella ticket to capture any and every API cleanup that we wish > to do before YARN can be deemed beta/stable. Doing this API cleanup now and > ASAP will help us escape the pain of supporting bad APIs in beta/stable > releases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-471) NM does not validate the resource capabilities before it registers with RM
[ https://issues.apache.org/jira/browse/YARN-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631392#comment-13631392 ] Hadoop QA commented on YARN-471: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12578646/YARN-471.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/734//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/734//console This message is automatically generated. > NM does not validate the resource capabilities before it registers with RM > -- > > Key: YARN-471 > URL: https://issues.apache.org/jira/browse/YARN-471 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Labels: usability > Attachments: YARN-471.1.patch > > > Today, an NM could register with -1 memory and -1 cpu with the RM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-358) bundle container classpath in temporary jar on all platforms, not just Windows
[ https://issues.apache.org/jira/browse/YARN-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631390#comment-13631390 ] Hitesh Shah commented on YARN-358: -- @Chris, are we just talking about the command line or does this affect environment variables too? Given that YARN can launch any kind of application (C++/Java/script), what are the areas of concern that need to be addressed for containers to launch correctly on Windows? Should this be a YARN feature, or is it better to hand this off to the application logic to handle correct launching of a container on a particular OS type? > bundle container classpath in temporary jar on all platforms, not just Windows > -- > > Key: YARN-358 > URL: https://issues.apache.org/jira/browse/YARN-358 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: trunk-win >Reporter: Chris Nauroth > > Currently, a Windows-specific code path bundles the classpath into a > temporary jar with a manifest to work around command line length limitations. > This code path does not need to be Windows-specific. We can use the same > approach on all platforms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
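For readers unfamiliar with the trick the issue describes, here is a generic Java sketch (not the NM's actual Windows code path): write a small jar whose manifest Class-Path points at the real classpath entries, then pass only that jar on -classpath, which sidesteps command-line length limits. The class and method names are illustrative.

{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public final class ClasspathJar {

  private ClasspathJar() {}

  /** Writes an empty jar whose manifest Class-Path lists the given entries. */
  public static File create(File dir, String... classpathEntries) throws IOException {
    Manifest manifest = new Manifest();
    Attributes attrs = manifest.getMainAttributes();
    attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
    // Class-Path entries are space separated and resolved relative to the jar's location
    StringBuilder cp = new StringBuilder();
    for (String entry : classpathEntries) {
      if (cp.length() > 0) {
        cp.append(' ');
      }
      cp.append(entry);
    }
    attrs.put(Attributes.Name.CLASS_PATH, cp.toString());

    File jar = File.createTempFile("classpath", ".jar", dir);
    JarOutputStream out = new JarOutputStream(new FileOutputStream(jar), manifest);
    out.close(); // no jar entries needed; the manifest carries the whole classpath
    return jar;
  }
}
{code}

The container would then be launched with something like java -cp classpath1234.jar MainClass, regardless of how long the expanded classpath is.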
[jira] [Resolved] (YARN-55) YARN needs to properly check the NM,AM memory properties in yarn-site.xml and mapred.xml and report errors accordingly.
[ https://issues.apache.org/jira/browse/YARN-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah resolved YARN-55. - Resolution: Invalid Closing this as invalid. If an AM requests more memory than the configured maximum memory for the RM Scheduler, the RM will throw an error. If there are no live nodes capable of handling the memory asked for, that should be looked at in YARN-56. > YARN needs to properly check the NM,AM memory properties in yarn-site.xml and > mapred.xml and report errors accordingly. > --- > > Key: YARN-55 > URL: https://issues.apache.org/jira/browse/YARN-55 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.2-alpha, 0.23.3 > Environment: CentOs6.0, Hadoop2.0.0 Alpha >Reporter: Anil Gupta > Labels: Map, Reduce, YARN > > Please refer to this discussion on the Hadoop Mailing list: > http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/33110 > Summary: > I was running YARN (Hadoop 2.0.0 Alpha) on an 8-datanode, 4-admin-node > Hadoop/HBase cluster. My datanodes only had 3.2 GB of memory. So, I > configured the yarn.nodemanager.resource.memory-mb property in yarn-site.xml > to 1200. After setting the property, if I run any YARN job, the > NodeManager won't be able to start any Map task since by default the > yarn.app.mapreduce.am.resource.mb property is set to 1500 MB in > mapred-site.xml. > Expected Behavior: NodeManager should give an error if > yarn.app.mapreduce.am.resource.mb >= yarn.nodemanager.resource.memory-mb. > Please let me know if more information is required. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
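A hedged sketch of the sanity check the reporter asks for, shown only as the comparison itself; where such a check should live (job client, RM, or nowhere, per the Invalid resolution above) is the open question. The property names are the real ones quoted in the report; the class, the method, and the fallback defaults are illustrative.

{code:java}
import org.apache.hadoop.conf.Configuration;

public final class AmMemoryCheck {

  private AmMemoryCheck() {}

  public static void validate(Configuration conf) {
    // the defaults here are only illustrative fallbacks, not authoritative values
    int nmMemoryMb = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
    int amMemoryMb = conf.getInt("yarn.app.mapreduce.am.resource.mb", 1500);
    if (amMemoryMb > nmMemoryMb) {
      throw new IllegalArgumentException(
          "yarn.app.mapreduce.am.resource.mb (" + amMemoryMb + " MB) exceeds "
          + "yarn.nodemanager.resource.memory-mb (" + nmMemoryMb + " MB); "
          + "no single node could ever host the MapReduce ApplicationMaster");
    }
  }
}
{code}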
[jira] [Updated] (YARN-115) yarn commands shouldn't add "m" to the heapsize
[ https://issues.apache.org/jira/browse/YARN-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-115: - Labels: usability (was: ) > yarn commands shouldn't add "m" to the heapsize > --- > > Key: YARN-115 > URL: https://issues.apache.org/jira/browse/YARN-115 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 0.23.3 >Reporter: Thomas Graves > Labels: usability > > The yarn commands add "m" to the heapsize. This is unlike the hdfs side and > what the old jt/tt used to do. > JAVA_HEAP_MAX="-Xmx""$YARN_RESOURCEMANAGER_HEAPSIZE""m" > JAVA_HEAP_MAX="-Xmx""$YARN_NODEMANAGER_HEAPSIZE""m" > We should not be adding the "m", and should instead allow the user to specify units. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-541) getAllocatedContainers() is not returning all the allocated containers
[ https://issues.apache.org/jira/browse/YARN-541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631386#comment-13631386 ] Hitesh Shah commented on YARN-541: -- @Krishna, could you provide more information: - What scheduler are you using? - Could you attach the application logs as well as the RM's logs? > getAllocatedContainers() is not returning all the allocated containers > -- > > Key: YARN-541 > URL: https://issues.apache.org/jira/browse/YARN-541 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.3-alpha > Environment: Redhat Linux 64-bit >Reporter: Krishna Kishore Bonagiri > > I am running an application that was written and working well with > hadoop-2.0.0-alpha, but when I run the same against 2.0.3-alpha, the > getAllocatedContainers() method called on AMResponse sometimes does not return > all the allocated containers. For example, I request 10 containers and > this method sometimes gives me only 9 containers, and when I looked at the > log of the Resource Manager, the 10th container is also allocated. It happens > only sometimes, randomly, and works fine all other times. If I send one more > request for the remaining container to the RM after it failed to give them the > first time (and before releasing already acquired ones), it could allocate > that container. I am running only one application at a time, but 1000s of > them one after another. > My main worry is, even though the RM's log says that all 10 requested > containers are allocated, the getAllocatedContainers() method is not > returning all of them; it returned only 9, surprisingly. I never saw this > kind of issue in the previous version, i.e. hadoop-2.0.0-alpha. > Thanks, > Kishore > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
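For context, the usual AM-side pattern in this area is to accumulate containers across successive allocate() heartbeats rather than expecting a single response to carry every grant. The sketch below illustrates that pattern only; the AllocateClient interface is invented to keep the example self-contained (it stands in for the AM-RM protocol whose response exposes getAllocatedContainers()), and whether the reporter's missing 10th container would in fact have arrived on a later heartbeat is exactly what the requested logs should show.

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class ContainerAccumulator {

  /** Stand-in for the real AM-RM heartbeat call; not a YARN interface. */
  public interface AllocateClient {
    Collection<String> allocate() throws Exception; // ids of containers granted since the last call
  }

  /** Keep heartbeating until the requested number of containers arrives or the deadline passes. */
  public static List<String> waitForContainers(AllocateClient client, int wanted, long timeoutMillis)
      throws Exception {
    List<String> allocated = new ArrayList<String>();
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (allocated.size() < wanted && System.currentTimeMillis() < deadline) {
      allocated.addAll(client.allocate()); // grants may trickle in over several responses
      Thread.sleep(1000);
    }
    return allocated;
  }
}
{code}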
[jira] [Updated] (YARN-523) Container localization failures aren't reported from NM to RM
[ https://issues.apache.org/jira/browse/YARN-523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-523: - Assignee: Omkar Vinit Joshi > Container localization failures aren't reported from NM to RM > - > > Key: YARN-523 > URL: https://issues.apache.org/jira/browse/YARN-523 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > > This is mainly a pain on crashing AMs, but once we fix this, containers also > can benefit - same fix for both. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-471) NM does not validate the resource capabilities before it registers with RM
[ https://issues.apache.org/jira/browse/YARN-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-471: - Attachment: YARN-471.1.patch > NM does not validate the resource capabilities before it registers with RM > -- > > Key: YARN-471 > URL: https://issues.apache.org/jira/browse/YARN-471 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Hitesh Shah > Labels: usability > Attachments: YARN-471.1.patch > > > Today, an NM could register with -1 memory and -1 cpu with the RM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-471) NM does not validate the resource capabilities before it registers with RM
[ https://issues.apache.org/jira/browse/YARN-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah reassigned YARN-471: Assignee: Hitesh Shah > NM does not validate the resource capabilities before it registers with RM > -- > > Key: YARN-471 > URL: https://issues.apache.org/jira/browse/YARN-471 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Labels: usability > Attachments: YARN-471.1.patch > > > Today, an NM could register with -1 memory and -1 cpu with the RM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-576) RM should not allow registrations from NMs that do not satisfy minimum scheduler allocations
Hitesh Shah created YARN-576: Summary: RM should not allow registrations from NMs that do not satisfy minimum scheduler allocations Key: YARN-576 URL: https://issues.apache.org/jira/browse/YARN-576 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah If the minimum resource allocation configured for the RM scheduler is 1 GB, the RM should drop all NMs that register with a total capacity of less than 1 GB. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-576) RM should not allow registrations from NMs that do not satisfy minimum scheduler allocations
[ https://issues.apache.org/jira/browse/YARN-576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-576: - Labels: newbie (was: ) > RM should not allow registrations from NMs that do not satisfy minimum > scheduler allocations > > > Key: YARN-576 > URL: https://issues.apache.org/jira/browse/YARN-576 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Hitesh Shah > Labels: newbie > > If the minimum resource allocation configured for the RM scheduler is 1 GB, > the RM should drop all NMs that register with a total capacity of less than 1 > GB. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
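A sketch combining the two halves of this validation: the invalid -1 memory / -1 cpu registration from YARN-471 and the below-minimum-allocation case proposed here in YARN-576. The class, the method, and the plain int parameters are illustrative; the real code would work with Resource objects inside the RM's node-registration handling.

{code:java}
public final class NodeRegistrationValidator {

  private NodeRegistrationValidator() {}

  public static void validate(int nodeMemoryMb, int nodeVcores,
      int minAllocMemoryMb, int minAllocVcores) {
    // reject plainly invalid capabilities, e.g. the -1 memory / -1 cpu registration
    if (nodeMemoryMb <= 0 || nodeVcores <= 0) {
      throw new IllegalArgumentException("NM reported invalid capability: memory="
          + nodeMemoryMb + " MB, vcores=" + nodeVcores);
    }
    // reject nodes too small to ever satisfy the scheduler's minimum allocation
    if (nodeMemoryMb < minAllocMemoryMb || nodeVcores < minAllocVcores) {
      throw new IllegalArgumentException("NM capability (" + nodeMemoryMb + " MB, "
          + nodeVcores + " vcores) is below the scheduler minimum allocation ("
          + minAllocMemoryMb + " MB, " + minAllocVcores + " vcores)");
    }
  }
}
{code}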
[jira] [Commented] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631369#comment-13631369 ] Hadoop QA commented on YARN-117: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12578645/YARN-117.5.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/733//console This message is automatically generated. > Enhance YARN service model > -- > > Key: YARN-117 > URL: https://issues.apache.org/jira/browse/YARN-117 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.0.4-alpha >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-117-2.patch, YARN-117-3.patch, YARN-117.4.patch, > YARN-117.5.patch, YARN-117.patch > > > Having played the YARN service model, there are some issues > that I've identified based on past work and initial use. > This JIRA issue is an overall one to cover the issues, with solutions pushed > out to separate JIRAs. > h2. state model prevents stopped state being entered if you could not > successfully start the service. > In the current lifecycle you cannot stop a service unless it was successfully > started, but > * {{init()}} may acquire resources that need to be explicitly released > * if the {{start()}} operation fails partway through, the {{stop()}} > operation may be needed to release resources. > *Fix:* make {{stop()}} a valid state transition from all states and require > the implementations to be able to stop safely without requiring all fields to > be non null. > Before anyone points out that the {{stop()}} operations assume that all > fields are valid; and if called before a {{start()}} they will NPE; > MAPREDUCE-3431 shows that this problem arises today, MAPREDUCE-3502 is a fix > for this. It is independent of the rest of the issues in this doc but it will > aid making {{stop()}} execute from all states other than "stopped". > MAPREDUCE-3502 is too big a patch and needs to be broken down for easier > review and take up; this can be done with issues linked to this one. > h2. AbstractService doesn't prevent duplicate state change requests. > The {{ensureState()}} checks to verify whether or not a state transition is > allowed from the current state are performed in the base {{AbstractService}} > class -yet subclasses tend to call this *after* their own {{init()}}, > {{start()}} & {{stop()}} operations. This means that these operations can be > performed out of order, and even if the outcome of the call is an exception, > all actions performed by the subclasses will have taken place. MAPREDUCE-3877 > demonstrates this. > This is a tricky one to address. In HADOOP-3128 I used a base class instead > of an interface and made the {{init()}}, {{start()}} & {{stop()}} methods > {{final}}. These methods would do the checks, and then invoke protected inner > methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to > retrofit the same behaviour to everything that extends {{AbstractService}} > -something that must be done before the class is considered stable (because > once the lifecycle methods are declared final, all subclasses that are out of > the source tree will need fixing by the respective developers. > h2. AbstractService state change doesn't defend against race conditions. > There's no concurrency locks on the state transitions.
Whatever fix for wrong > state calls is added should correct this to prevent re-entrancy, such as > {{stop()}} being called from two threads. > h2. Static methods to choreograph of lifecycle operations > Helper methods to move things through lifecycles. init->start is common, > stop-if-service!=null another. Some static methods can execute these, and > even call {{stop()}} if {{init()}} raises an exception. These could go into a > class {{ServiceOps}} in the same package. These can be used by those services > that wrap other services, and help manage more robust shutdowns. > h2. state transition failures are something that registered service listeners > may wish to be informed of. > When a state transition fails a {{RuntimeException}} can be thrown -and the > service listeners are not informed as the notification point isn't reached. > They may wish to know this, especially for management and diagnostics. > *Fix:* extend {{ServiceStateChangeListener}} with a callback such as > {{stateChangeFailed(Service service,Service.State targeted-state, > RuntimeException e)}} that is invoked from the (final) state change methods > in the {{AbstractService}} class (once they delegate to their inner > {{innerStart()}}, {{innerStop()}}
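To make the proposed shape concrete, here is a condensed, illustrative sketch of a lifecycle base class along the lines the description argues for: final init()/start()/stop() doing the state checks (and locking) in one place, delegating to protected inner methods, with stop() legal from every state. The names follow the text, but this is not the actual AbstractService code.

{code:java}
public abstract class SketchService {

  public enum State { NOTINITED, INITED, STARTED, STOPPED }

  private State state = State.NOTINITED;

  public final synchronized void init() {
    if (state != State.NOTINITED) {
      throw new IllegalStateException("init() called in state " + state);
    }
    innerInit();
    state = State.INITED;
  }

  public final synchronized void start() {
    if (state != State.INITED) {
      throw new IllegalStateException("start() called in state " + state);
    }
    innerStart();
    state = State.STARTED;
  }

  /** Valid from every state, and a no-op once already stopped. */
  public final synchronized void stop() {
    if (state == State.STOPPED) {
      return;
    }
    try {
      innerStop(); // implementations must cope with fields that were never initialised
    } finally {
      state = State.STOPPED;
    }
  }

  protected void innerInit() {}
  protected void innerStart() {}
  protected void innerStop() {}
}
{code}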
[jira] [Commented] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631364#comment-13631364 ] Steve Loughran commented on YARN-117: - updated patch where {{TestNodeStatusUpdater.testNMConnectionToRM()}} should not fail irrespective of how long the NM's {{init()}} process takes. Until now the custom {{NodeStatusUpdater}} set its clock in the constructor, which was called during the NM's {{init()}}, but the test only set its clock after {{init()}} and before {{start()}}. As a result, if the init took too long, the test would fail saying "the RM took too long", when the delay logic was actually working. The fix is for the custom {{NodeStatusUpdater}} to set its {{waitStartTime}} in its {{innerStart()}}, that is, only when the service is started. This narrows the gap between the test's measured start time and the updater's -though the time to start a couple of services before the updater is started could still be troublesome. Even so, retaining the time measurement logic in the test (rather than just probing the Updater to verify it triggered) is essential to be confident that {{NodeManager().start()}} doesn't complete until the connection has been established. > Enhance YARN service model > -- > > Key: YARN-117 > URL: https://issues.apache.org/jira/browse/YARN-117 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-117-2.patch, YARN-117-3.patch, YARN-117.4.patch, > YARN-117.5.patch, YARN-117.patch > > > Having played the YARN service model, there are some issues > that I've identified based on past work and initial use. > This JIRA issue is an overall one to cover the issues, with solutions pushed > out to separate JIRAs. > h2. state model prevents stopped state being entered if you could not > successfully start the service. > In the current lifecycle you cannot stop a service unless it was successfully > started, but > * {{init()}} may acquire resources that need to be explicitly released > * if the {{start()}} operation fails partway through, the {{stop()}} > operation may be needed to release resources. > *Fix:* make {{stop()}} a valid state transition from all states and require > the implementations to be able to stop safely without requiring all fields to > be non null. > Before anyone points out that the {{stop()}} operations assume that all > fields are valid; and if called before a {{start()}} they will NPE; > MAPREDUCE-3431 shows that this problem arises today, MAPREDUCE-3502 is a fix > for this. It is independent of the rest of the issues in this doc but it will > aid making {{stop()}} execute from all states other than "stopped". > MAPREDUCE-3502 is too big a patch and needs to be broken down for easier > review and take up; this can be done with issues linked to this one. > h2. AbstractService doesn't prevent duplicate state change requests. > The {{ensureState()}} checks to verify whether or not a state transition is > allowed from the current state are performed in the base {{AbstractService}} > class -yet subclasses tend to call this *after* their own {{init()}}, > {{start()}} & {{stop()}} operations. This means that these operations can be > performed out of order, and even if the outcome of the call is an exception, > all actions performed by the subclasses will have taken place. MAPREDUCE-3877 > demonstrates this. > This is a tricky one to address.
In HADOOP-3128 I used a base class instead > of an interface and made the {{init()}}, {{start()}} & {{stop()}} methods > {{final}}. These methods would do the checks, and then invoke protected inner > methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to > retrofit the same behaviour to everything that extends {{AbstractService}} > -something that must be done before the class is considered stable (because > once the lifecycle methods are declared final, all subclasses that are out of > the source tree will need fixing by the respective developers. > h2. AbstractService state change doesn't defend against race conditions. > There's no concurrency locks on the state transitions. Whatever fix for wrong > state calls is added should correct this to prevent re-entrancy, such as > {{stop()}} being called from two threads. > h2. Static methods to choreograph of lifecycle operations > Helper methods to move things through lifecycles. init->start is common, > stop-if-service!=null another. Some static methods can execute these, and > even call {{stop()}} if {{init()}} raises an exception. These could go into a > class {{ServiceOps}} in the same package. These can be used by those services > that wrap other services, and help manage more robust shutdowns. > h2. state tra
[jira] [Updated] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-117: Attachment: YARN-117.5.patch > Enhance YARN service model > -- > > Key: YARN-117 > URL: https://issues.apache.org/jira/browse/YARN-117 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-117-2.patch, YARN-117-3.patch, YARN-117.4.patch, > YARN-117.5.patch, YARN-117.patch > > > Having played the YARN service model, there are some issues > that I've identified based on past work and initial use. > This JIRA issue is an overall one to cover the issues, with solutions pushed > out to separate JIRAs. > h2. state model prevents stopped state being entered if you could not > successfully start the service. > In the current lifecycle you cannot stop a service unless it was successfully > started, but > * {{init()}} may acquire resources that need to be explicitly released > * if the {{start()}} operation fails partway through, the {{stop()}} > operation may be needed to release resources. > *Fix:* make {{stop()}} a valid state transition from all states and require > the implementations to be able to stop safely without requiring all fields to > be non null. > Before anyone points out that the {{stop()}} operations assume that all > fields are valid; and if called before a {{start()}} they will NPE; > MAPREDUCE-3431 shows that this problem arises today, MAPREDUCE-3502 is a fix > for this. It is independent of the rest of the issues in this doc but it will > aid making {{stop()}} execute from all states other than "stopped". > MAPREDUCE-3502 is too big a patch and needs to be broken down for easier > review and take up; this can be done with issues linked to this one. > h2. AbstractService doesn't prevent duplicate state change requests. > The {{ensureState()}} checks to verify whether or not a state transition is > allowed from the current state are performed in the base {{AbstractService}} > class -yet subclasses tend to call this *after* their own {{init()}}, > {{start()}} & {{stop()}} operations. This means that these operations can be > performed out of order, and even if the outcome of the call is an exception, > all actions performed by the subclasses will have taken place. MAPREDUCE-3877 > demonstrates this. > This is a tricky one to address. In HADOOP-3128 I used a base class instead > of an interface and made the {{init()}}, {{start()}} & {{stop()}} methods > {{final}}. These methods would do the checks, and then invoke protected inner > methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to > retrofit the same behaviour to everything that extends {{AbstractService}} > -something that must be done before the class is considered stable (because > once the lifecycle methods are declared final, all subclasses that are out of > the source tree will need fixing by the respective developers. > h2. AbstractService state change doesn't defend against race conditions. > There's no concurrency locks on the state transitions. Whatever fix for wrong > state calls is added should correct this to prevent re-entrancy, such as > {{stop()}} being called from two threads. > h2. Static methods to choreograph of lifecycle operations > Helper methods to move things through lifecycles. init->start is common, > stop-if-service!=null another. Some static methods can execute these, and > even call {{stop()}} if {{init()}} raises an exception.
These could go into a > class {{ServiceOps}} in the same package. These can be used by those services > that wrap other services, and help manage more robust shutdowns. > h2. state transition failures are something that registered service listeners > may wish to be informed of. > When a state transition fails a {{RuntimeException}} can be thrown -and the > service listeners are not informed as the notification point isn't reached. > They may wish to know this, especially for management and diagnostics. > *Fix:* extend {{ServiceStateChangeListener}} with a callback such as > {{stateChangeFailed(Service service,Service.State targeted-state, > RuntimeException e)}} that is invoked from the (final) state change methods > in the {{AbstractService}} class (once they delegate to their inner > {{innerStart()}}, {{innerStop()}} methods; make a no-op on the existing > implementations of the interface. > h2. Service listener failures not handled > Is this an error or not? Log and ignore may not be what is desired. > *Proposed:* during {{stop()}} any exception by a listener is caught and > discarded, to increase the likelihood of a better shutdown, but do not add > try-catch clauses to the other state changes. > h2. Support static listeners for all AbstractServic
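As an illustration of the {{ServiceOps}} idea above, a minimal sketch of the static choreography helpers: init->start with stop-on-failure, and stop-if-service!=null that swallows exceptions so shutdown keeps going. The Service interface here is a small stand-in matching the lifecycle sketched earlier in this thread, not the real YARN interface, and the method names are invented.

{code:java}
public final class ServiceOps {

  /** Minimal stand-in for the service lifecycle; not the real YARN Service interface. */
  public interface Service {
    void init();
    void start();
    void stop();
  }

  private ServiceOps() {}

  /** init() then start(), stopping the service if either step throws so resources are released. */
  public static void deploy(Service service) {
    try {
      service.init();
      service.start();
    } catch (RuntimeException e) {
      stopQuietly(service);
      throw e;
    }
  }

  /** stop-if-service!=null, discarding failures so a best-effort shutdown can continue. */
  public static void stopQuietly(Service service) {
    if (service == null) {
      return;
    }
    try {
      service.stop();
    } catch (RuntimeException ignored) {
      // real code would log; discarded here to keep shutdown going
    }
  }
}
{code}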