[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813735#comment-13813735 ] Bikas Saha commented on YARN-1197: -- Can we make do with just change_succeeded and change_failed lists instead of 4 lists? Using the containerId, the AM can determine which change was an increase and which was a decrease.
{noformat}
+message ChangeContainersResourceResponseProto {
+  repeated ContainerIdProto succeed_increased_containers = 1;
+  repeated ContainerIdProto succeed_decreased_containers = 2;
+  repeated ContainerIdProto failed_increased_containers = 3;
+  repeated ContainerIdProto failed_decreased_containers = 4;
+}
{noformat}
I don't think it's correct for ResourceRequest to be used to increase resources for an allocated container. I was expecting a new optional repeated field of type ResourceChangeContextProto in AllocateRequest. To request an increase in container C's resource, the AM will add a ResourceChangeContextProto for that container in the next AllocateRequest. In AllocateResponse, the type of the increased container should be ResourceIncreaseContextProto, right? Without that, the AM cannot get the new container token for that container. The NM changes also need to handle enforcing the new resource via cgroups etc., in addition to changing the monitoring. This needs to be clarified in the document. > Support changing resources of an allocated container > > > Key: YARN-1197 > URL: https://issues.apache.org/jira/browse/YARN-1197 > Project: Hadoop YARN > Issue Type: Task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197.pdf > > > Currently, YARN cannot support merge several containers in one node to a big > container, which can make us incrementally ask resources, merge them to a > bigger one, and launch our processes. The user scenario is described in the > comments.
-- This message was sent by Atlassian JIRA (v6.1#6144)
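A minimal sketch of the two-list suggestion in the YARN-1197 comment above, assuming the AM keeps its own record of which kind of change it requested per container. The class ChangeResultClassifier, its methods, and the string container ids are hypothetical and are not part of the YARN API; the point is only that a containerId from a succeeded or failed list can be mapped back to increase or decrease on the AM side.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical AM-side helper; not part of the YARN API.
final class ChangeResultClassifier {
    enum ChangeKind { INCREASE, DECREASE }

    // Container id -> kind of change the AM requested and has not yet seen a result for.
    private final Map<String, ChangeKind> pending = new HashMap<>();

    void recordRequest(String containerId, ChangeKind kind) {
        pending.put(containerId, kind);
    }

    // Maps each id from a succeeded (or failed) list back to the requested kind.
    // Ids the AM never asked about are skipped; resolved ids leave the pending map.
    Map<String, ChangeKind> classify(List<String> containerIds) {
        Map<String, ChangeKind> result = new LinkedHashMap<>();
        for (String id : containerIds) {
            ChangeKind kind = pending.remove(id);
            if (kind != null) {
                result.put(id, kind);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ChangeResultClassifier am = new ChangeResultClassifier();
        am.recordRequest("container_01", ChangeKind.INCREASE);
        am.recordRequest("container_02", ChangeKind.DECREASE);
        // Pretend these two ids came back in the succeeded and failed lists.
        System.out.println("succeeded: " + am.classify(Arrays.asList("container_01")));
        System.out.println("failed: " + am.classify(Arrays.asList("container_02")));
    }
}
```

With this bookkeeping on the AM side, the response only needs to carry the outcome (succeeded or failed) per container, which is what the two-list proposal relies on.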
[jira] [Updated] (YARN-1307) Rethink znode structure for RM HA
[ https://issues.apache.org/jira/browse/YARN-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1307: - Attachment: YARN-1307.4.patch Rebased on trunk. Bikas, is the change you mentioned YARN-353? > Rethink znode structure for RM HA > - > > Key: YARN-1307 > URL: https://issues.apache.org/jira/browse/YARN-1307 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1307.1.patch, YARN-1307.2.patch, YARN-1307.3.patch, > YARN-1307.4.patch > > > A rethink of the znode structure for RM HA is proposed in some JIRAs (YARN-659, > YARN-1222). The motivation for this JIRA is quoted from Bikas' comment in > YARN-1222: > {quote} > We should move to creating a node hierarchy for apps such that all znodes for > an app are stored under an app znode instead of the app root znode. This will > help in removeApplication and also in scaling better on ZK. The earlier code > was written this way to ensure create/delete happens under a root znode for > fencing. But given that we have moved to multi-operations globally, this isn't > required anymore. > {quote}
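A small sketch of the per-app znode hierarchy described in the quote above, assuming attempt znodes become children of their application's znode. The class ZnodeLayout, the root path, and the id strings are made up for illustration and are not taken from any patch on this issue.

```java
// Hypothetical path helper illustrating the proposed layout; not from the patch.
final class ZnodeLayout {
    private final String appRoot;

    ZnodeLayout(String appRoot) {
        this.appRoot = appRoot;
    }

    // One znode per application, directly under the app root.
    String appPath(String appId) {
        return appRoot + "/" + appId;
    }

    // Attempt znodes live under their application's znode instead of under the app root,
    // so everything belonging to one app sits in a single subtree.
    String attemptPath(String appId, String attemptId) {
        return appPath(appId) + "/" + attemptId;
    }

    public static void main(String[] args) {
        ZnodeLayout layout = new ZnodeLayout("/rmstore/apps"); // assumed root path
        System.out.println(layout.appPath("application_1_0001"));
        System.out.println(layout.attemptPath("application_1_0001", "appattempt_1_0001_000001"));
    }
}
```

Under such a layout, removing an application amounts to deleting the children of one app znode and then the app znode itself, which is the removeApplication benefit the quote mentions.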
[jira] [Commented] (YARN-979) [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813769#comment-13813769 ] Mayank Bansal commented on YARN-979: bq. I still have one question w.r.t. the annotations of the getter/setter of GetRequest/Response. Some of them are marked as @Stable, and some are marked as @Unstable. In addition, some setters are marked as @Private, and some are marked as @Public. Do you have special consideration here? Maybe we should mark all as @Unstable for the initial AHS? Fixed the annotations Thanks, Mayank > [YARN-321] Add more APIs related to ApplicationAttempt and Container in > ApplicationHistoryProtocol > -- > > Key: YARN-979 > URL: https://issues.apache.org/jira/browse/YARN-979 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-979-1.patch, YARN-979-3.patch, YARN-979-4.patch, > YARN-979-5.patch, YARN-979-6.patch, YARN-979.2.patch > > > ApplicationHistoryProtocol should have the following APIs as well: > * getApplicationAttemptReport > * getApplicationAttempts > * getContainerReport > * getContainers > The corresponding request and response classes need to be added as well. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-979) [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-979: --- Attachment: YARN-979-6.patch Attaching the latest patch. Thanks, Mayank > [YARN-321] Add more APIs related to ApplicationAttempt and Container in > ApplicationHistoryProtocol > -- > > Key: YARN-979 > URL: https://issues.apache.org/jira/browse/YARN-979 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-979-1.patch, YARN-979-3.patch, YARN-979-4.patch, > YARN-979-5.patch, YARN-979-6.patch, YARN-979.2.patch > > > ApplicationHistoryProtocol should have the following APIs as well: > * getApplicationAttemptReport > * getApplicationAttempts > * getContainerReport > * getContainers > The corresponding request and response classes need to be added as well. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-261) Ability to kill AM attempts
[ https://issues.apache.org/jira/browse/YARN-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813780#comment-13813780 ] Hadoop QA commented on YARN-261: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612119/YARN-261--n7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 1 warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapreduce.v2.TestUberAM {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2369//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2369//console This message is automatically generated. 
> Ability to kill AM attempts > --- > > Key: YARN-261 > URL: https://issues.apache.org/jira/browse/YARN-261 > Project: Hadoop YARN > Issue Type: New Feature > Components: api >Affects Versions: 2.0.3-alpha >Reporter: Jason Lowe >Assignee: Andrey Klochkov > Attachments: YARN-261--n2.patch, YARN-261--n3.patch, > YARN-261--n4.patch, YARN-261--n5.patch, YARN-261--n6.patch, > YARN-261--n7.patch, YARN-261.patch > > > It would be nice if clients could ask for an AM attempt to be killed. This > is analogous to the task attempt kill support provided by MapReduce. > This feature would be useful in a scenario where AM retries are enabled, the > AM supports recovery, and a particular AM attempt is stuck. Currently if > this occurs the user's only recourse is to kill the entire application, > requiring them to resubmit a new application and potentially breaking > downstream dependent jobs if it's part of a bigger workflow. Killing the > attempt would allow a new attempt to be started by the RM without killing the > entire application, and if the AM supports recovery it could potentially save > a lot of work. It could also be useful in workflow scenarios where the > failure of the entire application kills the workflow, but the ability to kill > an attempt can keep the workflow going if the subsequent attempt succeeds. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-955) [YARN-321] Implementation of ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813787#comment-13813787 ] Hadoop QA commented on YARN-955: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612123/YARN-955-2.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2371//console This message is automatically generated. > [YARN-321] Implementation of ApplicationHistoryProtocol > --- > > Key: YARN-955 > URL: https://issues.apache.org/jira/browse/YARN-955 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Mayank Bansal > Attachments: YARN-955-1.patch, YARN-955-2.patch > > -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-979) [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813794#comment-13813794 ] Hadoop QA commented on YARN-979: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612134/YARN-979-6.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2373//console This message is automatically generated. > [YARN-321] Add more APIs related to ApplicationAttempt and Container in > ApplicationHistoryProtocol > -- > > Key: YARN-979 > URL: https://issues.apache.org/jira/browse/YARN-979 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-979-1.patch, YARN-979-3.patch, YARN-979-4.patch, > YARN-979-5.patch, YARN-979-6.patch, YARN-979.2.patch > > > ApplicationHistoryProtocol should have the following APIs as well: > * getApplicationAttemptReport > * getApplicationAttempts > * getContainerReport > * getContainers > The corresponding request and response classes need to be added as well. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1307) Rethink znode structure for RM HA
[ https://issues.apache.org/jira/browse/YARN-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813796#comment-13813796 ] Hadoop QA commented on YARN-1307: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612126/YARN-1307.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2372//console This message is automatically generated. > Rethink znode structure for RM HA > - > > Key: YARN-1307 > URL: https://issues.apache.org/jira/browse/YARN-1307 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1307.1.patch, YARN-1307.2.patch, YARN-1307.3.patch, > YARN-1307.4.patch > > > A rethink of the znode structure for RM HA is proposed in some JIRAs (YARN-659, > YARN-1222). The motivation for this JIRA is quoted from Bikas' comment in > YARN-1222: > {quote} > We should move to creating a node hierarchy for apps such that all znodes for > an app are stored under an app znode instead of the app root znode. This will > help in removeApplication and also in scaling better on ZK. The earlier code > was written this way to ensure create/delete happens under a root znode for > fencing. But given that we have moved to multi-operations globally, this isn't > required anymore. > {quote}
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813811#comment-13813811 ] Hou Song commented on YARN-90: -- Thanks for the suggestions. I'm trying to modify my patch, and will upload it soon. However, I don't quite understand what you mean by "expose this end-to-end and not just metrics". We have been using a failed-disk metric in our production cluster for a year, and it's good enough for our rapid disk repair. Enlighten me if you have a better way. > NodeManager should identify failed disks becoming good back again > - > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs a restart. This JIRA is to improve NodeManager to > reuse good disks (which may have been bad some time back).
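A rough sketch of the behaviour the YARN-90 description asks for, assuming a periodic check that re-tests every configured directory rather than only the ones currently considered good. DiskHealthTracker and the injected health predicate are hypothetical stand-ins and do not reflect the NodeManager's actual classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical tracker; the real health probe is replaced by an injected predicate.
final class DiskHealthTracker {
    private final List<String> allDirs;
    private final Predicate<String> isHealthy;
    private final Set<String> goodDirs = new LinkedHashSet<>();
    private final Set<String> failedDirs = new LinkedHashSet<>();

    DiskHealthTracker(List<String> dirs, Predicate<String> isHealthy) {
        this.allDirs = new ArrayList<>(dirs);
        this.isHealthy = isHealthy;
        this.goodDirs.addAll(dirs); // assume everything is good until the first check
    }

    // Re-tests every directory, including previously failed ones,
    // so a repaired disk moves back to the good set without a restart.
    void checkDirs() {
        for (String dir : allDirs) {
            if (isHealthy.test(dir)) {
                failedDirs.remove(dir);
                goodDirs.add(dir);
            } else {
                goodDirs.remove(dir);
                failedDirs.add(dir);
            }
        }
    }

    Set<String> getGoodDirs() {
        return Collections.unmodifiableSet(new LinkedHashSet<>(goodDirs));
    }

    Set<String> getFailedDirs() {
        return Collections.unmodifiableSet(new LinkedHashSet<>(failedDirs));
    }
}
```

The size of the failed set after each pass is also the kind of number a failed-disk metric, like the one mentioned in the comment, could report.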
[jira] [Commented] (YARN-1323) Set HTTPS webapp address along with other RPC addresses in HAUtil
[ https://issues.apache.org/jira/browse/YARN-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813848#comment-13813848 ] Hudson commented on YARN-1323: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #383 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/383/]) YARN-1323. Set HTTPS webapp address along with other RPC addresses in HAUtil (Karthik Kambatla via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1538851) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/HAUtil.java > Set HTTPS webapp address along with other RPC addresses in HAUtil > - > > Key: YARN-1323 > URL: https://issues.apache.org/jira/browse/YARN-1323 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.3.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Labels: ha > Fix For: 2.3.0 > > Attachments: yarn-1323-1.patch > > > YARN-1232 adds the ability to configure multiple RMs, but missed out the > https web app address. Need to add that in. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1388) Fair Scheduler page always displays blank fair share
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813847#comment-13813847 ] Hudson commented on YARN-1388: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #383 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/383/]) YARN-1388. Fair Scheduler page always displays blank fair share (Liyin Liang via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1538855) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/FairSchedulerPage.java > Fair Scheduler page always displays blank fair share > > > Key: YARN-1388 > URL: https://issues.apache.org/jira/browse/YARN-1388 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.2.1 >Reporter: Liyin Liang >Assignee: Liyin Liang > Fix For: 2.2.1 > > Attachments: yarn-1388.diff > > > YARN-1044 fixed the min/max/used resource display problem in the scheduler page. > But "Fair Share" has the same problem and needs to be fixed.
[jira] [Commented] (YARN-1323) Set HTTPS webapp address along with other RPC addresses in HAUtil
[ https://issues.apache.org/jira/browse/YARN-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813902#comment-13813902 ] Hudson commented on YARN-1323: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1600 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1600/]) YARN-1323. Set HTTPS webapp address along with other RPC addresses in HAUtil (Karthik Kambatla via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1538851) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/HAUtil.java > Set HTTPS webapp address along with other RPC addresses in HAUtil > - > > Key: YARN-1323 > URL: https://issues.apache.org/jira/browse/YARN-1323 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.3.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Labels: ha > Fix For: 2.3.0 > > Attachments: yarn-1323-1.patch > > > YARN-1232 adds the ability to configure multiple RMs, but missed out the > https web app address. Need to add that in. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1388) Fair Scheduler page always displays blank fair share
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813901#comment-13813901 ] Hudson commented on YARN-1388: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1600 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1600/]) YARN-1388. Fair Scheduler page always displays blank fair share (Liyin Liang via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1538855) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/FairSchedulerPage.java > Fair Scheduler page always displays blank fair share > > > Key: YARN-1388 > URL: https://issues.apache.org/jira/browse/YARN-1388 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.2.1 >Reporter: Liyin Liang >Assignee: Liyin Liang > Fix For: 2.2.1 > > Attachments: yarn-1388.diff > > > YARN-1044 fixed the min/max/used resource display problem in the scheduler page. > But "Fair Share" has the same problem and needs to be fixed.
[jira] [Commented] (YARN-1388) Fair Scheduler page always displays blank fair share
[ https://issues.apache.org/jira/browse/YARN-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813919#comment-13813919 ] Hudson commented on YARN-1388: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1574 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1574/]) YARN-1388. Fair Scheduler page always displays blank fair share (Liyin Liang via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1538855) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/FairSchedulerPage.java > Fair Scheduler page always displays blank fair share > > > Key: YARN-1388 > URL: https://issues.apache.org/jira/browse/YARN-1388 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.2.1 >Reporter: Liyin Liang >Assignee: Liyin Liang > Fix For: 2.2.1 > > Attachments: yarn-1388.diff > > > YARN-1044 fixed the min/max/used resource display problem in the scheduler page. > But "Fair Share" has the same problem and needs to be fixed.
[jira] [Commented] (YARN-1323) Set HTTPS webapp address along with other RPC addresses in HAUtil
[ https://issues.apache.org/jira/browse/YARN-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813920#comment-13813920 ] Hudson commented on YARN-1323: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1574 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1574/]) YARN-1323. Set HTTPS webapp address along with other RPC addresses in HAUtil (Karthik Kambatla via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1538851) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/HAUtil.java > Set HTTPS webapp address along with other RPC addresses in HAUtil > - > > Key: YARN-1323 > URL: https://issues.apache.org/jira/browse/YARN-1323 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.3.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Labels: ha > Fix For: 2.3.0 > > Attachments: yarn-1323-1.patch > > > YARN-1232 adds the ability to configure multiple RMs, but missed out the > https web app address. Need to add that in. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813933#comment-13813933 ] Wangda Tan commented on YARN-1197: -- [~bikassaha] Actually, I was halfway through implementing these and got pulled away by other work. :-/ On putting the increase request into ResourceRequest, I agree. I spent some time (on the half-baked scheduler supporting increase) proving that putting the increase request into ResourceRequest is NOT good, even though you had mentioned it before :(. The original reason I put the increase request into ResourceRequest is that, literally speaking, an increase request is another form of "resource request": it also asks for more resources, and the only difference is that an increase request adds a restriction to the request. But in YARN's real implementation, making it part of ResourceRequest is problematic, because I need to handle increase cases everywhere in the RM. I think making it a new member of AllocateRequest is a cleaner solution, but it will potentially cause more interface/implementation changes (like SchedulerApplication, YARNScheduler, etc.). I'll continue looking at it before starting to write code. I also agree with your comments on improving the representation of ChangeContainerResourceResponse and on the missing ResourceIncreaseContextProto in AllocateResponse. I'll add my design proposal for handling the new resource in the monitoring module.
Again, your comments are really helpful, and I hope to get more of your ideas :) > Support changing resources of an allocated container > > > Key: YARN-1197 > URL: https://issues.apache.org/jira/browse/YARN-1197 > Project: Hadoop YARN > Issue Type: Task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197.pdf > > > Currently, YARN cannot support merge several containers in one node to a big > container, which can make us incrementally ask resources, merge them to a > bigger one, and launch our processes. The user scenario is described in the > comments.
[jira] [Commented] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814065#comment-13814065 ] Karthik Kambatla commented on YARN-1222: Thanks [~bikassaha] for the close review. We should probably examine the ACL strategy a little more. My reasoning behind allowing the user to configure the ACLs is to avoid security holes. For instance, if we just use yarn.resourcemanager.address and the RM's cluster-time-stamp, third parties (anything but the RM) can retrieve that information and mess with the store. bq. How is the following case going to work? How can the root node acl be set in the conf? Upon active, we have to remove the old RM's cd-acl and set our cd-acl. That cannot be statically set in conf right? The root-node ACLs are per RM instance. They need to be different for it to work. The documentation in yarn-default.xml explains this - we might have to make it even more clear? bq. My concern is that we are only adding new ACLs every time we failover but never deleting them. Is it possible that we end up creating too many ACLs for the root znode and hit ZK issues? I don't think that is possible. On failover, if not configured, we construct the root node ACLs from the same initial ACLs. They do not add up across iterations. The number of ACLs in the list is always bounded by (user-configured-for-store + 1). Am I missing something? For example, if the user doesn't configure any ZK-ACLs, the ACLs for the store are {world:anyone:rwcda} and the ACLs for the root node are always {world:anyone:rwa, active-rm-address:active-rm-timestamp:cd}. bq. For both of the above, can we use well-known prefixes for the root znode acls (rm-admin-acl and rm-cd-acl). We might be able to do that, but the user can realize it in the current implementation by configuring the root ACLs to exactly that. A completely different approach to this would be to use session-based-authentication. The Active session claims create-delete.
However, we might want to do that as a follow-up - it might need some more refactoring of the store to ensure a single session. bq. Can we move this logic into the common RMStateStore and notify it about HA state loss via a standard HA exception. I initially did that, but moved it to ZKRMStateStore because the common RMStateStore is oblivious to the implicit fencing mechanism in the ZKStore. Do you think we should make it aware of fencing - have something like a StoreFencedException? bq. Will the null return make the state store crash? It didn't crash the store in my testing. Will look at it more closely for the next revision of the patch. bq. This and other similar places need an @Private ZKRMStateStore itself is @Private @Unstable. Should we still label the methods @Private? {quote} @Private? {code} + public static String getConfValueForRMInstance(String prefix, {code} {quote} This was intentional - there might be merit to leaving these methods public and marking the class itself @Public @Unstable at the moment. > Make improvements in ZKRMStateStore for fencing > --- > > Key: YARN-1222 > URL: https://issues.apache.org/jira/browse/YARN-1222 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, > yarn-1222-4.patch > > > Using multi-operations for every ZK interaction. > In every operation, automatically creating/deleting a lock znode that is the > child of the root znode. This is to achieve fencing by modifying the > create/delete permissions on the root znode.
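A simplified sketch of why the root-node ACL list in the YARN-1222 comment above stays bounded, assuming ACLs are modelled as plain "scheme:id:perms" strings rather than ZooKeeper ACL objects. RootAclSketch and rootAcls are hypothetical names, and this is not the code from the patch: the root list is always rebuilt from the same store ACLs (with create and delete stripped), plus exactly one create-delete entry for the active RM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical string model of ZK ACLs ("scheme:id:perms"); not the patch's implementation.
final class RootAclSketch {

    // Builds the root-node ACLs from the store ACLs:
    // each store ACL loses its create (c) and delete (d) permissions,
    // and one extra ACL grants create-delete to the active RM only.
    static List<String> rootAcls(List<String> storeAcls, String activeRmId) {
        List<String> result = new ArrayList<>();
        for (String acl : storeAcls) {
            int split = acl.lastIndexOf(':');
            String perms = acl.substring(split + 1).replace("c", "").replace("d", "");
            result.add(acl.substring(0, split + 1) + perms);
        }
        result.add(activeRmId + ":cd");
        return result;
    }

    public static void main(String[] args) {
        List<String> storeAcls = Arrays.asList("world:anyone:rwcda");
        // The result size is storeAcls.size() + 1 no matter how many failovers have happened,
        // because the input is always the initial store ACL list, never the previous root list.
        System.out.println(rootAcls(storeAcls, "active-rm-address:active-rm-timestamp"));
    }
}
```

The bound of (user-configured-for-store + 1) holds here only because the input is never the previous root list; feeding the output back in as input would grow the list by one entry per failover.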
[jira] [Updated] (YARN-953) [YARN-321] Change ResourceManager to use HistoryStorage to log history data
[ https://issues.apache.org/jira/browse/YARN-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-953: - Attachment: YARN-953.5.patch I uploaded a new patch, which coordinates with the change of the writer interface. The basic design is not changed: we still have a RMApplicationHistoryWriter, which handles writing requests from the RM on different threads asynchronously. There are some other major changes:
1. Instead of handling all the writing events in one separate thread, I define a dispatcher vector to improve concurrency, and also ensure the events of one application are scheduled in the same thread. Therefore, the writing events of different applications will be processed concurrently, while events of the same application will be processed in the order in which they are scheduled (it's important to ensure the events scheduled before applicationFinished are processed first).
2. Make sure applicationFinished is called after all applicationAttemptsFinished, especially in the killing case, where RMApp moves to the final state before RMAppAttempt.
3. Improve the test cases.
4. Fix the breakage of other tests in the RM project.
Some things are to be handled separately:
1. We need to make RMContainer have more information to fill ContainerHistoryData. It's going to be done in YARN-974.
2. Like RMStateStore, RMApplicationHistoryWriter needs to flush all the pending events when the RM stops. We can make use of the update in YARN-1121 later.
> [YARN-321] Change ResourceManager to use HistoryStorage to log history data > --- > > Key: YARN-953 > URL: https://issues.apache.org/jira/browse/YARN-953 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: YARN-953-5.patch, YARN-953.1.patch, YARN-953.2.patch, > YARN-953.3.patch, YARN-953.4.patch, YARN-953.5.patch > >
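A minimal sketch of the dispatcher-vector idea described in the YARN-953 update above, assuming each dispatcher is a single-threaded executor and an application is pinned to one of them by hashing its id. MultiDispatcher and its methods are hypothetical and use only java.util.concurrent; the patch itself may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical dispatcher vector: N single-threaded executors, one chosen per application.
final class MultiDispatcher {
    private final List<ExecutorService> dispatchers = new ArrayList<>();

    MultiDispatcher(int size) {
        for (int i = 0; i < size; i++) {
            dispatchers.add(Executors.newSingleThreadExecutor());
        }
    }

    // The same application id always maps to the same dispatcher index.
    int indexFor(String appId) {
        return Math.floorMod(appId.hashCode(), dispatchers.size());
    }

    // Events of one application run on one thread, in submission order;
    // events of different applications may run concurrently on other threads.
    void dispatch(String appId, Runnable event) {
        dispatchers.get(indexFor(appId)).execute(event);
    }

    // Stops accepting events and waits for the queued ones to be processed.
    void stop() {
        for (ExecutorService dispatcher : dispatchers) {
            dispatcher.shutdown();
        }
        for (ExecutorService dispatcher : dispatchers) {
            try {
                dispatcher.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Because each executor has a single worker thread and a FIFO queue, an applicationFinished event submitted last for an application is processed after that application's earlier events, while other applications make progress on the remaining dispatchers.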
[jira] [Updated] (YARN-953) [YARN-321] Enable ResourceManager to write history data
[ https://issues.apache.org/jira/browse/YARN-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-953: - Summary: [YARN-321] Enable ResourceManager to write history data (was: [YARN-321] Change ResourceManager to use HistoryStorage to log history data) > [YARN-321] Enable ResourceManager to write history data > --- > > Key: YARN-953 > URL: https://issues.apache.org/jira/browse/YARN-953 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: YARN-953-5.patch, YARN-953.1.patch, YARN-953.2.patch, > YARN-953.3.patch, YARN-953.4.patch, YARN-953.5.patch > > -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-979) [YARN-321] Add more APIs related to ApplicationAttempt and Container in ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814076#comment-13814076 ] Zhijie Shen commented on YARN-979: -- +1 > [YARN-321] Add more APIs related to ApplicationAttempt and Container in > ApplicationHistoryProtocol > -- > > Key: YARN-979 > URL: https://issues.apache.org/jira/browse/YARN-979 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-979-1.patch, YARN-979-3.patch, YARN-979-4.patch, > YARN-979-5.patch, YARN-979-6.patch, YARN-979.2.patch > > > ApplicationHistoryProtocol should have the following APIs as well: > * getApplicationAttemptReport > * getApplicationAttempts > * getContainerReport > * getContainers > The corresponding request and response classes need to be added as well. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1307) Rethink znode structure for RM HA
[ https://issues.apache.org/jira/browse/YARN-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814078#comment-13814078 ] Bikas Saha commented on YARN-1307: -- YARN-891 changed a lot of stuff > Rethink znode structure for RM HA > - > > Key: YARN-1307 > URL: https://issues.apache.org/jira/browse/YARN-1307 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1307.1.patch, YARN-1307.2.patch, YARN-1307.3.patch, > YARN-1307.4.patch > > > Rethink for znode structure for RM HA is proposed in some JIRAs(YARN-659, > YARN-1222). The motivation of this JIRA is quoted from Bikas' comment in > YARN-1222: > {quote} > We should move to creating a node hierarchy for apps such that all znodes for > an app are stored under an app znode instead of the app root znode. This will > help in removeApplication and also in scaling better on ZK. The earlier code > was written this way to ensure create/delete happens under a root znode for > fencing. But given that we have moved to multi-operations globally, this isnt > required anymore. > {quote} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1266) Adding ApplicationHistoryProtocolPBService to make web apps to work
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814085#comment-13814085 ] Zhijie Shen commented on YARN-1266: --- What's the relationship between this patch and "make web apps to work"? It seems to be related to RPC APIs instead. To make ApplicationHistoryProtocol work, you may need ApplicationHistoryProtocolPBClient as well. > Adding ApplicationHistoryProtocolPBService to make web apps to work > --- > > Key: YARN-1266 > URL: https://issues.apache.org/jira/browse/YARN-1266 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-1266-1.patch, YARN-1266-2.patch > > > Adding ApplicationHistoryProtocolPBService to make web apps to work and > changing yarn to run AHS as a seprate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814090#comment-13814090 ] Vinod Kumar Vavilapalli commented on YARN-90: -

bq. However, I don't quite understand your saying "expose this end-to-end and not just metrics". We have been using the failed-disk metric in our production cluster for a year, and it's good enough for our rapid disk repair. Enlighten me if you have a better way.

I meant that it should be part of the client-side RPC report and JMX as well as the metrics. Doing only one of those is incomplete, so I was suggesting that we do all of that in a separate JIRA.

> NodeManager should identify failed disks becoming good back again > - > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs restart. This JIRA is to improve NodeManager to > reuse good disks(which could be bad some time back).
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814092#comment-13814092 ] Bikas Saha commented on YARN-1222: --

bq. The root-node ACLs are per RM instance. They need to be different for it to work. The documentation in yarn-default.xml explains this - we might have to make it even more clear?

Clarifying it, possibly with an example, would be good.

bq. The number of ACLs in the list is always bounded by (user-configured-for-store + 1). Am I missing something?

I missed that the patch is modifying the base ACL from config and not the actual ACL from the znode. The latter would have increased the count. The former is fine. The current code is good.

Where is the shared rm-admin-acl being set such that both RMs have admin access to the root znode? This probably works because the default is world:anyone. But if that is not the case, and we are using internally generated ACLs, then the RM has to give shared admin access to the other RM when it creates the root znode, right?

bq. Do you think we should make it aware of fencing - have something like a StoreFencedException?

I think it should be aware of when the store is not available to it because it has been fenced out. There are/were comments in the state store error handling about differentiating between exceptions once we have such a differentiation. So we should create a Fenced exception (look at HDFS code for an example). This way all state stores can report this condition for identical handling in the upper layers. We would like to avoid state store impls (which are technically runtime-pluggable pieces) having to understand internal Hadoop code patterns for HA etc.

bq. ZKRMStateStore itself is @Private @Unstable. Should we still label the methods @Private?

At some point ZKRMStateStore will become public/stable, but these methods should remain private for testing, right?
> Make improvements in ZKRMStateStore for fencing > --- > > Key: YARN-1222 > URL: https://issues.apache.org/jira/browse/YARN-1222 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, > yarn-1222-4.patch > > > Using multi-operations for every ZK interaction. > In every operation, automatically creating/deleting a lock znode that is the > child of the root znode. This is to achieve fencing by modifying the > create/delete permissions on the root znode. -- This message was sent by Atlassian JIRA (v6.1#6144)
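The suggestion above, a dedicated fenced exception that every state store can throw so the upper layers handle it uniformly, might look roughly like the following. This is a sketch under assumptions: `StoreFencedException` is the name floated in the quoted question, while `StateStore`, `StoreErrorHandler` and the `Outcome` values are hypothetical and do not come from the patch or from existing Hadoop classes.

```java
// Hypothetical sketch; only the exception name comes from the discussion.
class StoreFencedException extends Exception {
  StoreFencedException(String message) {
    super(message);
  }
}

// Stand-in for a pluggable state store implementation.
interface StateStore {
  void storeApplication(String appId) throws Exception;
}

final class StoreErrorHandler {
  enum Outcome { STORED, FENCED, FAILED }

  // The upper layer distinguishes "fenced out" from any other store error
  // without knowing which store implementation is plugged in.
  static Outcome store(StateStore store, String appId) {
    try {
      store.storeApplication(appId);
      return Outcome.STORED;
    } catch (StoreFencedException e) {
      return Outcome.FENCED;
    } catch (Exception e) {
      return Outcome.FAILED;
    }
  }
}
```

What the RM should do on `FENCED` is left open here, since the thread does not specify it; the point is only that the store reports the condition through one well-known type.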
[jira] [Commented] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814102#comment-13814102 ] Karthik Kambatla commented on YARN-1222: bq. Where is the shared rm-admin-acl being set such that both RMs have admin access to the root znode? The shared rm-admin-acl comes from ZK_RM_STATE_STORE_ACL set by the user or default (world:anyone), if ZK_RM_STATE_STORE_ROOT_NODE_ACL is not set. If ZK_RM_STATE_STORE_ROOT_NODE_ACL is set by the user, it is the user's responsibility to set the ACLs in a way that both RMs have admin access and claim exclusive c-d access. Agree with rest of the points. Will address them in the next revision. > Make improvements in ZKRMStateStore for fencing > --- > > Key: YARN-1222 > URL: https://issues.apache.org/jira/browse/YARN-1222 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, > yarn-1222-4.patch > > > Using multi-operations for every ZK interaction. > In every operation, automatically creating/deleting a lock znode that is the > child of the root znode. This is to achieve fencing by modifying the > create/delete permissions on the root znode. -- This message was sent by Atlassian JIRA (v6.1#6144)
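The fallback described above can be sketched as follows. This is an illustration only: ACL entries are modeled as plain strings instead of ZooKeeper types, and the class `RootAclSketch`, its method and the `digest:rm1` identity are hypothetical. The assumed behavior is that, when no root-node ACL is configured, the store ACLs are reused with create/delete removed and one entry granting create/delete exclusively to this RM is appended, which matches the "user-configured-for-store + 1" bound mentioned earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the ZKRMStateStore code from the patch.
final class RootAclSketch {
  // Simplified ACL entry: an id such as "world:anyone" and a permission
  // string drawn from r(ead), w(rite), c(reate), d(elete), a(dmin).
  static final class Acl {
    final String id;
    final String perms;

    Acl(String id, String perms) {
      this.id = id;
      this.perms = perms;
    }
  }

  static List<Acl> rootNodeAcls(List<Acl> userRootAcls, List<Acl> storeAcls, String rmId) {
    if (userRootAcls != null) {
      // The user configured the root-node ACL explicitly and is responsible
      // for shared admin access and exclusive create/delete access.
      return userRootAcls;
    }
    List<Acl> result = new ArrayList<>();
    for (Acl acl : storeAcls) {
      // Keep the shared permissions but strip create and delete.
      result.add(new Acl(acl.id, acl.perms.replace("c", "").replace("d", "")));
    }
    // Exactly one extra entry: exclusive create/delete for this RM.
    result.add(new Acl(rmId, "cd"));
    return result;
  }
}
```

For a store ACL of `world:anyone` with `rwcda`, this sketch yields `world:anyone` with `rwa` plus the RM's own `cd` entry, so both RMs keep admin access while only one can create or delete under the root znode.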
[jira] [Commented] (YARN-1374) Resource Manager fails to start due to ConcurrentModificationException
[ https://issues.apache.org/jira/browse/YARN-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814104#comment-13814104 ] Hudson commented on YARN-1374: -- FAILURE: Integrated in Hadoop-trunk-Commit #4695 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4695/]) YARN-1374. Changed ResourceManager to start the preemption policy monitors as active services. Contributed by Karthik Kambatla. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1539089) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/TestSchedulingMonitor.java > Resource Manager fails to start due to ConcurrentModificationException > -- > > Key: YARN-1374 > URL: https://issues.apache.org/jira/browse/YARN-1374 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: Devaraj K >Assignee: Karthik Kambatla >Priority: Blocker > Fix For: 2.3.0 > > Attachments: yarn-1374-1.patch, yarn-1374-1.patch > > > Resource Manager is failing to start with the below > ConcurrentModificationException. 
> {code:xml} > 2013-10-30 20:22:42,371 INFO org.apache.hadoop.util.HostsFileReader: > Refreshing hosts (include/exclude) list > 2013-10-30 20:22:42,376 INFO org.apache.hadoop.service.AbstractService: > Service ResourceManager failed in state INITED; cause: > java.util.ConcurrentModificationException > java.util.ConcurrentModificationException > at > java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) > at java.util.AbstractList$Itr.next(AbstractList.java:343) > at > java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1010) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:944) > 2013-10-30 20:22:42,378 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService: > Transitioning to standby > 2013-10-30 20:22:42,378 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMHAProtocolService: > Transitioned to standby > 2013-10-30 20:22:42,378 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting > ResourceManager > java.util.ConcurrentModificationException > at > java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) > at java.util.AbstractList$Itr.next(AbstractList.java:343) > at > java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1010) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:187) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:944) > 2013-10-30 20:22:42,379 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down ResourceManager at HOST-10-18-40-24/10.18.40.24 > / > {code} -- This message was sent by Atlassian JIRA (v6.1#6144)
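The stack trace above shows an iterator over an unmodifiable view failing inside CompositeService.serviceInit. The message does not spell out the root cause, so the following is only a sketch of one way such a trace can arise: a service list is iterated through a live unmodifiable view while an element is added to the backing list. The class `CompositeInitDemo` and the service names are hypothetical, and the snapshot variant is shown for contrast; it is not the fix committed for YARN-1374, which per the Hudson note changed how the preemption policy monitors are started.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical sketch; not Hadoop's CompositeService.
final class CompositeInitDemo {
  private final List<String> services = new ArrayList<>();

  void addService(String name) {
    services.add(name);
  }

  // A read-only view that still reflects later changes to the backing list.
  List<String> getServices() {
    return Collections.unmodifiableList(services);
  }

  // Returns false if iteration fails because a service registered another
  // service while the live view was being iterated.
  boolean initOverLiveView() {
    try {
      for (String service : getServices()) {
        if (service.equals("monitor-parent")) {
          addService("monitor-child"); // structural change during iteration
        }
      }
      return true;
    } catch (ConcurrentModificationException e) {
      return false;
    }
  }

  // Iterating a copy tolerates registrations made during init.
  boolean initOverSnapshot() {
    for (String service : new ArrayList<>(services)) {
      if (service.equals("monitor-parent")) {
        addService("monitor-child");
      }
    }
    return true;
  }
}
```

The view returned by `Collections.unmodifiableList` only blocks writes through the view itself; its iterator delegates to the backing `ArrayList`, whose fail-fast check produces the exception seen in the log.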
[jira] [Updated] (YARN-1266) Adding ApplicationHistoryProtocolPBService
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1266: Summary: Adding ApplicationHistoryProtocolPBService (was: Adding ApplicationHistoryProtocolPBService to make web apps to work) > Adding ApplicationHistoryProtocolPBService > -- > > Key: YARN-1266 > URL: https://issues.apache.org/jira/browse/YARN-1266 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-1266-1.patch, YARN-1266-2.patch > > > Adding ApplicationHistoryProtocolPBService to make web apps to work and > changing yarn to run AHS as a seprate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1266) Adding ApplicationHistoryProtocolPBService to make web apps to work
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814107#comment-13814107 ] Mayank Bansal commented on YARN-1266: - [~zjshen] The description of the JIRA is not correct; I am fixing it. Yes, the patch is about starting the RPC server. The client is used in the CLI and is already covered in that patch. Thanks, Mayank > Adding ApplicationHistoryProtocolPBService to make web apps to work > --- > > Key: YARN-1266 > URL: https://issues.apache.org/jira/browse/YARN-1266 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-1266-1.patch, YARN-1266-2.patch > > > Adding ApplicationHistoryProtocolPBService to make web apps to work and > changing yarn to run AHS as a seprate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814108#comment-13814108 ] Bikas Saha commented on YARN-1222: -- Lets make that clear in the yarn-site/configuration. > Make improvements in ZKRMStateStore for fencing > --- > > Key: YARN-1222 > URL: https://issues.apache.org/jira/browse/YARN-1222 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, > yarn-1222-4.patch > > > Using multi-operations for every ZK interaction. > In every operation, automatically creating/deleting a lock znode that is the > child of the root znode. This is to achieve fencing by modifying the > create/delete permissions on the root znode. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1320) Custom log4j properties in Distributed shell does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1320: Attachment: YARN-1320.8.patch > Custom log4j properties in Distributed shell does not work properly. > > > Key: YARN-1320 > URL: https://issues.apache.org/jira/browse/YARN-1320 > Project: Hadoop YARN > Issue Type: Bug > Components: applications/distributed-shell >Reporter: Tassapol Athiapinya >Assignee: Xuan Gong > Fix For: 2.2.1 > > Attachments: YARN-1320.1.patch, YARN-1320.2.patch, YARN-1320.3.patch, > YARN-1320.4.patch, YARN-1320.4.patch, YARN-1320.4.patch, YARN-1320.4.patch, > YARN-1320.4.patch, YARN-1320.5.patch, YARN-1320.6.patch, YARN-1320.6.patch, > YARN-1320.7.patch, YARN-1320.8.patch > > > Distributed shell cannot pick up custom log4j properties (specified with > -log_properties). It always uses default log4j properties. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1121: -- Attachment: YARN-1121.6.patch

bq. There are 3 new booleans with 8 combinations possible between them

Each of them serves a different purpose; I also added comments in the code:
- drained: indicates that the dispatcher queue's events have been drained and processed.
- drainingStopNeeded (renamed to drainEventsOnStop): a configuration boolean which enables or disables the drain-on-stop functionality.
- drainingStop (renamed to blockNewRequests): only for the purpose of blocking newly arriving events while draining to stop.

bq. Given that storing stuff will be over the network and slow, why not have a wait notify between this thread and the draining thread?

To do that, we may need to add something like "if(queueEmpty) notify" to the dispatcher's runnable. This is likely to be invoked in every normal execution of the dispatch while loop whenever the queue is empty, even when it is not actually in the stop phase, which may create more overhead, as this AsyncDispatcher is used everywhere.

bq. DrainEventHandler sounds misleading.

- Renamed DrainEventHandler to DropEventHandler.

bq. The other thing we can do is take a count of the number of pending events to drain at service stop.

For that, we would change the new logic from blocking new events coming into the queue to processing a fixed number of events out of the queue; again, we may need one more counter to indicate how many events we have processed out of the queue.
Uploaded a new patch to address the comments > RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.2.1 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
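The drain-on-stop behavior discussed above can be sketched in a few lines. This is not the AsyncDispatcher change from YARN-1121.6.patch: the class `DrainingDispatcher` is hypothetical, the flag names `drainEventsOnStop` and `blockNewEvents` follow the thread, and `drained` is derived here from a pending-event counter, which is an assumption of this sketch. The stopping thread polls instead of using wait/notify, in line with the overhead concern raised in the message.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the patched AsyncDispatcher.
final class DrainingDispatcher {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final AtomicInteger pending = new AtomicInteger(); // queued or running
  private final boolean drainEventsOnStop;  // configuration switch
  private volatile boolean blockNewEvents;  // true only while draining to stop
  private volatile boolean stopped;
  private final Thread worker;

  DrainingDispatcher(boolean drainEventsOnStop) {
    this.drainEventsOnStop = drainEventsOnStop;
    this.worker = new Thread(this::dispatchLoop, "dispatcher-sketch");
    this.worker.start();
  }

  private void dispatchLoop() {
    while (!stopped) {
      Runnable event;
      try {
        event = queue.poll(10, TimeUnit.MILLISECONDS);
      } catch (InterruptedException e) {
        return; // stop() interrupts the worker to end the loop
      }
      if (event != null) {
        try {
          event.run();
        } catch (RuntimeException e) {
          // Sketch only: a failing event must not kill the dispatcher.
        } finally {
          pending.decrementAndGet();
        }
      }
    }
  }

  // True once every accepted event has been processed.
  boolean drained() {
    return pending.get() == 0;
  }

  // Rejects events once draining has begun; synchronized so that no event
  // can slip in between the check and the enqueue while stop() flips the flag.
  synchronized boolean handle(Runnable event) {
    if (blockNewEvents) {
      return false;
    }
    pending.incrementAndGet();
    queue.add(event);
    return true;
  }

  void stop() {
    if (drainEventsOnStop) {
      synchronized (this) {
        blockNewEvents = true;
      }
      while (!drained()) { // polling keeps the hot dispatch loop free of notify calls
        try {
          Thread.sleep(10);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    stopped = true;
    worker.interrupt();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

With `drainEventsOnStop` set to false, `stop()` returns without waiting and queued events may be dropped, which is the behavior the configuration switch is meant to make optional.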
[jira] [Commented] (YARN-1320) Custom log4j properties in Distributed shell does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814137#comment-13814137 ] Xuan Gong commented on YARN-1320: -

bq. I doubt if the patch is going to work if the remote file-system is HDFS. The propagation of the log4j properties file is via HDFS and it doesn't look like it is handled correctly. Please check.

I set up a three-node cluster locally and tested it. It works. But I still made a small change. I believe the custom log4j properties should be per application. So I changed the code to upload the file to the user/appname/appid folder (the same location where we upload the AppMaster.jar file) in the file system instead of directly under the /user folder.

> Custom log4j properties in Distributed shell does not work properly. > > > Key: YARN-1320 > URL: https://issues.apache.org/jira/browse/YARN-1320 > Project: Hadoop YARN > Issue Type: Bug > Components: applications/distributed-shell >Reporter: Tassapol Athiapinya >Assignee: Xuan Gong > Fix For: 2.2.1 > > Attachments: YARN-1320.1.patch, YARN-1320.2.patch, YARN-1320.3.patch, > YARN-1320.4.patch, YARN-1320.4.patch, YARN-1320.4.patch, YARN-1320.4.patch, > YARN-1320.4.patch, YARN-1320.5.patch, YARN-1320.6.patch, YARN-1320.6.patch, > YARN-1320.7.patch, YARN-1320.8.patch > > > Distributed shell cannot pick up custom log4j properties (specified with > -log_properties). It always uses default log4j properties.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814138#comment-13814138 ] Jian He commented on YARN-1121: --- bq. drainingStop(renames to blockNewRequests): Typo, in the patch, it's actually named blockNewEvents. > RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.2.1 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814143#comment-13814143 ] Vinod Kumar Vavilapalli commented on YARN-674: -- bq. We were intentionally going through the same submitApplication() method to make sure that all the initialization and setup code paths are consistently followed in both cases by keeping the code path identical as much as possible. I didn't mean to fork the code, but it seems like the patch is doing exactly that. My original intention was to make submitApplicationOnRecovery() call submitApplication(). > Slow or failing DelegationToken renewals on submission itself make RM > unavailable > - > > Key: YARN-674 > URL: https://issues.apache.org/jira/browse/YARN-674 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, > YARN-674.4.patch, YARN-674.5.patch > > > This was caused by YARN-280. A slow or a down NameNode for will make it look > like RM is unavailable as it may run out of RPC handlers due to blocked > client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814150#comment-13814150 ] Hadoop QA commented on YARN-1121: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612223/YARN-1121.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2375//console This message is automatically generated. > RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.2.1 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1320) Custom log4j properties in Distributed shell does not work properly.
[ https://issues.apache.org/jira/browse/YARN-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814152#comment-13814152 ] Hadoop QA commented on YARN-1320: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/1261/YARN-1320.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2374//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2374//console This message is automatically generated. > Custom log4j properties in Distributed shell does not work properly. 
> > > Key: YARN-1320 > URL: https://issues.apache.org/jira/browse/YARN-1320 > Project: Hadoop YARN > Issue Type: Bug > Components: applications/distributed-shell >Reporter: Tassapol Athiapinya >Assignee: Xuan Gong > Fix For: 2.2.1 > > Attachments: YARN-1320.1.patch, YARN-1320.2.patch, YARN-1320.3.patch, > YARN-1320.4.patch, YARN-1320.4.patch, YARN-1320.4.patch, YARN-1320.4.patch, > YARN-1320.4.patch, YARN-1320.5.patch, YARN-1320.6.patch, YARN-1320.6.patch, > YARN-1320.7.patch, YARN-1320.8.patch > > > Distributed shell cannot pick up custom log4j properties (specified with > -log_properties). It always uses default log4j properties. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1121: -- Attachment: YARN-1121.6.patch No idea why Jenkins is not applying the patch; submitting the same patch again. > RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.2.1 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch, > YARN-1121.6.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-987) Adding History Service to use Store and converting Historydata to Report
[ https://issues.apache.org/jira/browse/YARN-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814196#comment-13814196 ] Mayank Bansal commented on YARN-987:

Thanks [~zjshen] for the review.

bq. As we're going to have cache, the abstraction of ApplicationHistoryContext may be necessary. However, one more question here: webUI and services are going to use ApplicationHistoryContext as well, right? If they are, returning report PB is actually not necessary for web. If they're not, webUI and services need a duplicate abstraction of combining cache and store, which is not concise in terms of coding.

As discussed offline, we should be using the history context for both the client and the UI; however, it has one drawback: the UI has to work with proto objects. Otherwise we would need separate classes for the UI, which I think is duplicated work.

bq. Add the config to yarn-default.xml as well. Btw, is "store.class" a bit better, as we have XXXApplicationHistoryStore, not XXXApplicationHistoryStorage?

Done.

bq. Unnecessary code. ApplicationHistoryStore must be a service

Done.

bq. Unnecessary code. ApplicationHistoryStore must be a service

Done.

bq. For ApplicationReport, you may want to get the history data of its last application attempt to fill the empty fields below.

Done.

Thanks, Mayank

> Adding History Service to use Store and converting Historydata to Report > > > Key: YARN-987 > URL: https://issues.apache.org/jira/browse/YARN-987 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-987-1.patch, YARN-987-2.patch, YARN-987-3.patch, > YARN-987-4.patch > >
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-987) Adding History Service to use Store and converting Historydata to Report
[ https://issues.apache.org/jira/browse/YARN-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-987: --- Attachment: YARN-987-5.patch Attaching Latest Patch. Thanks, Mayank > Adding History Service to use Store and converting Historydata to Report > > > Key: YARN-987 > URL: https://issues.apache.org/jira/browse/YARN-987 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-987-1.patch, YARN-987-2.patch, YARN-987-3.patch, > YARN-987-4.patch, YARN-987-5.patch > > -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814202#comment-13814202 ] Hadoop QA commented on YARN-1121: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612228/YARN-1121.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2376//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2376//console This message is automatically generated. 
> RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.2.1 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch, > YARN-1121.6.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-987) Adding History Service to use Store and converting Historydata to Report
[ https://issues.apache.org/jira/browse/YARN-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814209#comment-13814209 ] Hadoop QA commented on YARN-987: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612233/YARN-987-5.patch against trunk revision . {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2377//console This message is automatically generated. > Adding History Service to use Store and converting Historydata to Report > > > Key: YARN-987 > URL: https://issues.apache.org/jira/browse/YARN-987 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-987-1.patch, YARN-987-2.patch, YARN-987-3.patch, > YARN-987-4.patch, YARN-987-5.patch > > -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1390) Add applicationSource to ApplicationSubmissionContext and RMApp
Karthik Kambatla created YARN-1390: -- Summary: Add applicationSource to ApplicationSubmissionContext and RMApp Key: YARN-1390 URL: https://issues.apache.org/jira/browse/YARN-1390 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.2.0 Reporter: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1390) Add applicationSource to ApplicationSubmissionContext and RMApp
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1390: --- Description: In addition to other fields like application-type (added in YARN-563), it is useful to have an applicationSource field to track the source of an application. The application source can be useful in (1) fetching only those applications a user is interested in, (2) potentially adding source-specific optimizations in the future. Example of sources are: User-defined project names, Pig, Hive, Oozie, Sqoop etc. Target Version/s: 2.3.0 Assignee: Karthik Kambatla > Add applicationSource to ApplicationSubmissionContext and RMApp > --- > > Key: YARN-1390 > URL: https://issues.apache.org/jira/browse/YARN-1390 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > In addition to other fields like application-type (added in YARN-563), it is > useful to have an applicationSource field to track the source of an > application. The application source can be useful in (1) fetching only those > applications a user is interested in, (2) potentially adding source-specific > optimizations in the future. > Example of sources are: User-defined project names, Pig, Hive, Oozie, Sqoop > etc. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1390) Add applicationSource to ApplicationSubmissionContext and RMApp
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1390: --- Description: In addition to other fields like application-type (added in YARN-563), it is useful to have an applicationSource field to track the source of an application. The application source can be useful in (1) fetching only those applications a user is interested in, (2) potentially adding source-specific optimizations in the future. Examples of sources are: User-defined project names, Pig, Hive, Oozie, Sqoop etc. was: In addition to other fields like application-type (added in YARN-563), it is useful to have an applicationSource field to track the source of an application. The application source can be useful in (1) fetching only those applications a user is interested in, (2) potentially adding source-specific optimizations in the future. Example of sources are: User-defined project names, Pig, Hive, Oozie, Sqoop etc. > Add applicationSource to ApplicationSubmissionContext and RMApp > --- > > Key: YARN-1390 > URL: https://issues.apache.org/jira/browse/YARN-1390 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.2.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > In addition to other fields like application-type (added in YARN-563), it is > useful to have an applicationSource field to track the source of an > application. The application source can be useful in (1) fetching only those > applications a user is interested in, (2) potentially adding source-specific > optimizations in the future. > Examples of sources are: User-defined project names, Pig, Hive, Oozie, Sqoop > etc. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1391) Lost node list contains many active node with different port
Siqi Li created YARN-1391: - Summary: Lost node list contains many active node with different port Key: YARN-1391 URL: https://issues.apache.org/jira/browse/YARN-1391 Project: Hadoop YARN Issue Type: Bug Reporter: Siqi Li Assignee: Siqi Li -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1391) Lost node list contains many active node with different port
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-1391: -- Description: When restarting node manager, the active node list in webUI will contain duplicate entries. Such two entries have the same host name with different port number. After expiry interval, the older entry will get expired and transitioned to lost node list, and stay there until this node gets restarted again. > Lost node list contains many active node with different port > > > Key: YARN-1391 > URL: https://issues.apache.org/jira/browse/YARN-1391 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siqi Li >Assignee: Siqi Li > > When restarting node manager, the active node list in webUI will contain > duplicate entries. Such two entries have the same host name with different > port number. After expiry interval, the older entry will get expired and > transitioned to lost node list, and stay there until this node gets restarted > again. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1391) Lost node list contains many active node with different port
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814272#comment-13814272 ] Sandy Ryza commented on YARN-1391: -- This is related to YARN-1382 > Lost node list contains many active node with different port > > > Key: YARN-1391 > URL: https://issues.apache.org/jira/browse/YARN-1391 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siqi Li >Assignee: Siqi Li > > When restarting node manager, the active node list in webUI will contain > duplicate entries. Such two entries have the same host name with different > port number. After expiry interval, the older entry will get expired and > transitioned to lost node list, and stay there until this node gets restarted > again. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1391) Lost node list contains many active node with different port
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-1391: -- Component/s: resourcemanager > Lost node list contains many active node with different port > > > Key: YARN-1391 > URL: https://issues.apache.org/jira/browse/YARN-1391 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Siqi Li >Assignee: Siqi Li > > When restarting node manager, the active node list in webUI will contain > duplicate entries. Such two entries have the same host name with different > port number. After expiry interval, the older entry will get expired and > transitioned to lost node list, and stay there until this node gets restarted > again. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1391) Lost node list contains many active node with different port
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-1391: -- Affects Version/s: 2.0.5-alpha > Lost node list contains many active node with different port > > > Key: YARN-1391 > URL: https://issues.apache.org/jira/browse/YARN-1391 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.5-alpha >Reporter: Siqi Li >Assignee: Siqi Li > > When restarting node manager, the active node list in webUI will contain > duplicate entries. Such two entries have the same host name with different > port number. After expiry interval, the older entry will get expired and > transitioned to lost node list, and stay there until this node gets restarted > again. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814274#comment-13814274 ] Hudson commented on YARN-311: - SUCCESS: Integrated in Hadoop-trunk-Commit #4696 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4696/]) YARN-311. RM/scheduler support for dynamic resource configuration. (Junping Du via llu) (llu: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1539134) * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/nodemanager/NodeInfo.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/RMNodeWrapper.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceOption.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ResourceOptionPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerNode.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerNode.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java > Dynamic node resource configuration: core scheduler changes > --- > > Key: YARN-311 > URL: https://issues.apache.org/jira/browse/YARN-311 > Project: Hadoop YARN > Issue Type: 
Sub-task > Components: resourcemanager, scheduler >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-311-v1.patch, YARN-311-v10.patch, > YARN-311-v11.patch, YARN-311-v12.patch, YARN-311-v12b.patch, > YARN-311-v13.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, > YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, > YARN-311-v6.2.patch, YARN-311-v6.patch, YARN-311-v7.patch, YARN-311-v8.patch, > YARN-311-v9.patch > > > As the first step, we go for resource change on RM side and expose admin APIs > (admin protocol, CLI, REST and JMX API) later. In this jira, we will only > contain changes in scheduler. > The flow to update node's resource and awareness in resource scheduling is: > 1. Resource update is through admin API to RM and take effect on RMNodeImpl. > 2. When next NM heartbeat for updating s
[jira] [Updated] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Lu updated YARN-311: - Fix Version/s: 2.3.0 Hadoop Flags: Reviewed Committed to trunk and branch-2. Thanks Junping for the patch! > Dynamic node resource configuration: core scheduler changes > --- > > Key: YARN-311 > URL: https://issues.apache.org/jira/browse/YARN-311 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Junping Du >Assignee: Junping Du > Fix For: 2.3.0 > > Attachments: YARN-311-v1.patch, YARN-311-v10.patch, > YARN-311-v11.patch, YARN-311-v12.patch, YARN-311-v12b.patch, > YARN-311-v13.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, > YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, > YARN-311-v6.2.patch, YARN-311-v6.patch, YARN-311-v7.patch, YARN-311-v8.patch, > YARN-311-v9.patch > > > As the first step, we go for resource change on RM side and expose admin APIs > (admin protocol, CLI, REST and JMX API) later. In this jira, we will only > contain changes in scheduler. > The flow to update node's resource and awareness in resource scheduling is: > 1. Resource update is through admin API to RM and take effect on RMNodeImpl. > 2. When next NM heartbeat for updating status comes, the RMNode's resource > change will be aware and the delta resource is added to schedulerNode's > availableResource before actual scheduling happens. > 3. Scheduler do resource allocation according to new availableResource in > SchedulerNode. > For more design details, please refer proposal and discussions in parent > JIRA: YARN-291. -- This message was sent by Atlassian JIRA (v6.1#6144)
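The three-step flow in the description above can be sketched roughly as follows. This is a minimal illustration only: the class and method names (NodeResourceUpdateSketch, RmNode, SchedulerNode, onHeartbeat) are hypothetical stand-ins rather than the actual YARN classes, and only memory in MB is tracked.

```java
/** Hypothetical, simplified model of the update flow; not the actual YARN classes. */
final class NodeResourceUpdateSketch {

    /** Stand-in for the RM-side node record; tracks memory in MB only. */
    static final class RmNode {
        private int totalMb;

        RmNode(int totalMb) { this.totalMb = totalMb; }

        // Step 1: an admin call changes the node's total resource on the RM side.
        void setTotalMb(int newTotalMb) { this.totalMb = newTotalMb; }

        int getTotalMb() { return totalMb; }
    }

    /** Stand-in for the scheduler's view of the same node. */
    static final class SchedulerNode {
        private int totalMb;
        private int availableMb;

        SchedulerNode(int totalMb) {
            this.totalMb = totalMb;
            this.availableMb = totalMb;
        }

        void allocate(int mb) {
            if (mb > availableMb) {
                throw new IllegalStateException("Not enough available memory");
            }
            availableMb -= mb;
        }

        // Step 2: on the next heartbeat, add the delta to the available resource.
        void onHeartbeat(RmNode rmNode) {
            int delta = rmNode.getTotalMb() - totalMb;
            totalMb += delta;
            availableMb += delta; // negative delta shrinks what can still be allocated
        }

        // Step 3: allocation decisions use the updated available resource.
        boolean canAllocate(int mb) { return mb <= availableMb; }

        int getAvailableMb() { return availableMb; }
    }
}
```

In this sketch the scheduler never reads the new total directly; it only sees the change when a heartbeat arrives, which mirrors the ordering described in steps 1 to 3.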
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814291#comment-13814291 ] Omkar Vinit Joshi commented on YARN-674: Thanks [~bikassaha] for the review. bq. We were intentionally going through the same submitApplication() method to make sure that all the initialization and setup code paths are consistently followed in both cases by keeping the code path identical as much as possible. The RM would submit a recovered application, in essence proxying a user submitting the application. It's a general pattern followed through the recovery logic - to be minimally invasive to the mainline code path so that we can avoid functional bugs as much as possible. Separating them into 2 methods has resulted in code duplication in both methods without any huge benefit that I can see. It also leaves us susceptible to future code changes made in one code path and not the other. I agree with your suggestion; reverting the changes (discussed with [~vinodkv] offline). bq. Why is isSecurityEnabled() being checked at this internal level? The code should not even reach this point if security is not enabled. You have a point; fixing it. bq. Also why is it calling rmContext.getDelegationTokenRenewer().addApplication(event) instead of DelegationTokenRenewer.this.addApplication()? Same for rmContext.getDelegationTokenRenewer().applicationFinished(evt); Makes sense; fixed it. bq. Rename DelegationTokenRenewerThread to not have misleading Thread in the name? Fixed. bq. Can DelegationTokenRenewerAppSubmitEvent event objects have an event type different from VERIFY_AND_START_APPLICATION? If not, we don't need this check and we can change the constructor of DelegationTokenRenewerAppSubmitEvent to not expect an event type argument. It should set the VERIFY_AND_START_APPLICATION within the constructor. Fixed. bq. @Private + @VisibleForTesting??? Fixed. 
> Slow or failing DelegationToken renewals on submission itself make RM > unavailable > - > > Key: YARN-674 > URL: https://issues.apache.org/jira/browse/YARN-674 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, > YARN-674.4.patch, YARN-674.5.patch > > > This was caused by YARN-280. A slow or down NameNode will make it look > like RM is unavailable as it may run out of RPC handlers due to blocked > client submissions.
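The review point about the event-type argument could look roughly like the sketch below: the submit event fixes its own type in the constructor, so callers cannot pass a mismatched one and the handler needs no type check. The class names here (EventTypeSketch, RenewerEvent, AppSubmitEvent) and the second enum constant are hypothetical simplifications, not the classes from the patch.

```java
/** Hypothetical simplification of an event whose type is fixed by its class. */
final class EventTypeSketch {

    enum EventType { VERIFY_AND_START_APPLICATION, FINISH_APPLICATION }

    /** Base event that carries its type. */
    static class RenewerEvent {
        private final EventType type;

        RenewerEvent(EventType type) { this.type = type; }

        EventType getType() { return type; }
    }

    /** The submit event takes no type argument; it always sets the same one. */
    static final class AppSubmitEvent extends RenewerEvent {
        private final String appId;

        AppSubmitEvent(String appId) {
            super(EventType.VERIFY_AND_START_APPLICATION);
            this.appId = appId;
        }

        String getAppId() { return appId; }
    }
}
```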
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.5.patch > Slow or failing DelegationToken renewals on submission itself make RM > unavailable > - > > Key: YARN-674 > URL: https://issues.apache.org/jira/browse/YARN-674 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, > YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch > > > This was caused by YARN-280. A slow or down NameNode will make it look > like RM is unavailable as it may run out of RPC handlers due to blocked > client submissions.
[jira] [Updated] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1222: --- Attachment: yarn-1222-5.patch Here is a patch that moves the fencing and transition to standby logic from ZKRMStateStore to RMStateStore. The store implementations are expected to throw a {{StoreFencedException}} when they are fenced. Uploading this for any quick remarks. Will post another patch to address the suggested cosmetic changes. > Make improvements in ZKRMStateStore for fencing > --- > > Key: YARN-1222 > URL: https://issues.apache.org/jira/browse/YARN-1222 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, > yarn-1222-4.patch, yarn-1222-5.patch > > > Using multi-operations for every ZK interaction. > In every operation, automatically creating/deleting a lock znode that is the > child of the root znode. This is to achieve fencing by modifying the > create/delete permissions on the root znode. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1195) RM may relaunch already KILLED / FAILED jobs after RM restarts
[ https://issues.apache.org/jira/browse/YARN-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814331#comment-13814331 ] Jian He commented on YARN-1195: --- YARN-891 fixed this. Since we store the completed application info for FAILED/KILLED/FINISHED apps on app completion, on restart we just check whether the application is in such a final state; if it is, we do not restart the app. Closed this. > RM may relaunch already KILLED / FAILED jobs after RM restarts > -- > > Key: YARN-1195 > URL: https://issues.apache.org/jira/browse/YARN-1195 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > Just like YARN-540, RM restarts after job killed/failed , but before App > state info is cleaned from store. the next time RM comes back, it will > relaunch the job again.
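The recovery check described in the comment above might be sketched as follows, assuming the store maps application ids to their last recorded state. The names RecoveryFilterSketch, AppState and appsToRelaunch are hypothetical and only illustrate the idea of skipping applications that already reached a final state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: decide which stored applications to relaunch after a restart. */
final class RecoveryFilterSketch {

    enum AppState {
        NEW, RUNNING, FINISHED, FAILED, KILLED;

        /** FINISHED, FAILED and KILLED are treated as final states. */
        boolean isFinal() {
            return this == FINISHED || this == FAILED || this == KILLED;
        }
    }

    /** Returns the ids of applications that are not yet in a final state. */
    static List<String> appsToRelaunch(Map<String, AppState> storedApps) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, AppState> entry : storedApps.entrySet()) {
            if (!entry.getValue().isFinal()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```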
[jira] [Resolved] (YARN-1195) RM may relaunch already KILLED / FAILED jobs after RM restarts
[ https://issues.apache.org/jira/browse/YARN-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He resolved YARN-1195. --- Resolution: Fixed > RM may relaunch already KILLED / FAILED jobs after RM restarts > -- > > Key: YARN-1195 > URL: https://issues.apache.org/jira/browse/YARN-1195 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > Just like YARN-540, RM restarts after job killed/failed , but before App > state info is cleaned from store. the next time RM comes back, it will > relaunch the job again. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814336#comment-13814336 ] Hadoop QA commented on YARN-674: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612257/YARN-674.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2378//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2378//console This message is automatically generated. > Slow or failing DelegationToken renewals on submission itself make RM > unavailable > - > > Key: YARN-674 > URL: https://issues.apache.org/jira/browse/YARN-674 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, > YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch > > > This was caused by YARN-280. 
A slow or down NameNode will make it look > like RM is unavailable as it may run out of RPC handlers due to blocked > client submissions.
[jira] [Commented] (YARN-671) Add an interface on the RM to move NMs into a maintenance state
[ https://issues.apache.org/jira/browse/YARN-671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814346#comment-13814346 ] Cindy Li commented on YARN-671: --- YARN-914 is for graceful decommission, which can tolerate a longer waiting time. As for the resource reported to the scheduler, it seems the patch is already available in trunk, so we can build on that too. Either graceful decommission or the draining method would result in wasted resources on the node. Another JIRA, MAPREDUCE-4710, deals with a similar issue: there, losing map output is tolerated, but it has the benefit of not wasting resources on the node during the whole rolling restart process. > Add an interface on the RM to move NMs into a maintenance state > --- > > Key: YARN-671 > URL: https://issues.apache.org/jira/browse/YARN-671 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.0.4-alpha >Reporter: Siddharth Seth >Assignee: Siddharth Seth >
[jira] [Commented] (YARN-1336) Work-preserving nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814372#comment-13814372 ] Cindy Li commented on YARN-1336: In the case of a rolling upgrade, e.g. when some new configuration or fix is picked up as the node manager restarts, would that cause any issue during the state/work recovery process? > Work-preserving nodemanager restart > --- > > Key: YARN-1336 > URL: https://issues.apache.org/jira/browse/YARN-1336 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe > > This serves as an umbrella ticket for tasks related to work-preserving > nodemanager restart.
[jira] [Reopened] (YARN-1195) RM may relaunch already KILLED / FAILED jobs after RM restarts
[ https://issues.apache.org/jira/browse/YARN-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reopened YARN-1195: --- > RM may relaunch already KILLED / FAILED jobs after RM restarts > -- > > Key: YARN-1195 > URL: https://issues.apache.org/jira/browse/YARN-1195 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > Just like YARN-540, RM restarts after job killed/failed , but before App > state info is cleaned from store. the next time RM comes back, it will > relaunch the job again. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Resolved] (YARN-1195) RM may relaunch already KILLED / FAILED jobs after RM restarts
[ https://issues.apache.org/jira/browse/YARN-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-1195. --- Resolution: Duplicate > RM may relaunch already KILLED / FAILED jobs after RM restarts > -- > > Key: YARN-1195 > URL: https://issues.apache.org/jira/browse/YARN-1195 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > > Just like YARN-540, RM restarts after job killed/failed , but before App > state info is cleaned from store. the next time RM comes back, it will > relaunch the job again. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Reopened] (YARN-1330) Fair Scheduler: defaultQueueSchedulingPolicy does not take effect
[ https://issues.apache.org/jira/browse/YARN-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reopened YARN-1330: -- It looks like this can lead to NPEs in the Fair Scheduler during certain initialization conditions. Will upload an addendum patch. > Fair Scheduler: defaultQueueSchedulingPolicy does not take effect > - > > Key: YARN-1330 > URL: https://issues.apache.org/jira/browse/YARN-1330 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Fix For: 2.2.1 > > Attachments: YARN-1330-1.patch, YARN-1330-1.patch, YARN-1330.patch > > > The defaultQueueSchedulingPolicy property for the Fair Scheduler allocations > file doesn't take effect. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814403#comment-13814403 ] Bikas Saha commented on YARN-674: - The assert doesn't make it to the production jar, so it won't catch anything on the cluster. We need to throw an exception here. If we don't want to crash the RM here, then we can log an error. When the attempt state machine gets the event, it will crash on the async dispatcher thread if the event is not handled in the current state. {code}+assert application.getState() == RMAppState.NEW;{code} > Slow or failing DelegationToken renewals on submission itself make RM > unavailable > - > > Key: YARN-674 > URL: https://issues.apache.org/jira/browse/YARN-674 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, > YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch > > > This was caused by YARN-280. A slow or down NameNode will make it look > like RM is unavailable as it may run out of RPC handlers due to blocked > client submissions.
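To illustrate the point about the assert: Java assert statements are skipped at run time unless the JVM is started with -ea, so a check that must hold on a cluster is usually written as an explicit condition. The sketch below contrasts the two; StateCheckSketch, requireNew and assertNew are hypothetical names, and AppState is a stand-in for RMAppState.

```java
/** Hypothetical sketch contrasting an assert with an explicit state check. */
final class StateCheckSketch {

    /** Stand-in for RMAppState; only a few values are modelled. */
    enum AppState { NEW, SUBMITTED, RUNNING }

    /** Explicit check: runs whether or not assertions are enabled. */
    static void requireNew(AppState state) {
        if (state != AppState.NEW) {
            throw new IllegalStateException(
                "Expected application state NEW but found " + state);
        }
    }

    /** assert-based check: has no effect unless the JVM runs with -ea. */
    static void assertNew(AppState state) {
        assert state == AppState.NEW : "Expected NEW but found " + state;
    }
}
```

If crashing is not desired, the throw in requireNew could be replaced by logging an error, which is the alternative the comment mentions.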
[jira] [Commented] (YARN-1266) Adding ApplicationHistoryProtocolPBService
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814405#comment-13814405 ] Zhijie Shen commented on YARN-1266: --- Then it makes sense. +1 > Adding ApplicationHistoryProtocolPBService > -- > > Key: YARN-1266 > URL: https://issues.apache.org/jira/browse/YARN-1266 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-1266-1.patch, YARN-1266-2.patch > > > Adding ApplicationHistoryProtocolPBService to make web apps work and > changing yarn to run AHS as a separate process
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814410#comment-13814410 ] Bikas Saha commented on YARN-1121: -- bq. To do that, we may need to add things in dispatcher's runnable like "if(queueEmpty) notify", and this is likely to be invoked in every normal execution of the dispatch while loop if queue is empty, even if it's not actually in the stop phase, which may create more overhead, as this AsyncDispatcher is used everywhere. Can this be enabled only when serviceStop sets the drain events flag? In normal situations that flag will not be set. Replacing the eventHandler with a DropEventHandler (instead of GenericEventHandler) may not be enough. Someone may have already gotten a GenericEventHandler object and may send events using that object. So new events will keep getting added to the queue from those cached GenericEventHandler objects. So, I think keeping track of the number of events to drain and draining only that many events will be a more robust solution. > RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.2.1 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch, > YARN-1121.6.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping.
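The "count the events to drain" idea from the comment above can be sketched as below. This is a deliberately simplified, synchronous illustration and not the AsyncDispatcher code: DrainOnStopSketch and its methods are hypothetical names, and events are plain strings. The pending count is snapshotted when the stop begins, so events added afterwards (for example through a cached handler) do not extend the drain.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical, synchronous sketch of draining a fixed number of events on stop. */
final class DrainOnStopSketch {

    private final Queue<String> queue = new ArrayDeque<>();
    private final Consumer<String> handler;

    DrainOnStopSketch(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Callers holding a reference can still enqueue, even while stopping. */
    synchronized void enqueue(String event) {
        queue.add(event);
    }

    /**
     * Snapshots the number of pending events and handles exactly that many.
     * Events enqueued during the drain stay in the queue unhandled.
     */
    synchronized int stopAndDrain() {
        int toDrain = queue.size();
        for (int i = 0; i < toDrain; i++) {
            handler.accept(queue.remove());
        }
        return toDrain;
    }

    synchronized int pending() {
        return queue.size();
    }
}
```

Because the loop bound is fixed before draining starts, the stop is guaranteed to terminate even if producers keep adding events, which is the robustness argument made in the comment.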
[jira] [Commented] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814425#comment-13814425 ] Bikas Saha commented on YARN-1222: -- REQUEST_BY_USER_FORCED is probably not the right choice. {code}+ target.getProxy(getConfig(), 1000).transitionToStandby( + new HAServiceProtocol.StateChangeRequestInfo( + HAServiceProtocol.RequestSource.REQUEST_BY_USER_FORCED)); +} catch (IOException e) { {code} There are finally blocks that call methods like notifyDoneStoringApplicationAttempt(). These end up sending events to the RM modules, which check for the exception and then call terminate for the RM Java process. We probably don't want that to happen, since we simply want to transitionToStandby and discard all the internal state. Thinking aloud, using HAServiceTarget in RMStateStore to transitionToStandby() may not be the right solution. We are effectively doing an internal RPC on an ACL'd protocol. Is it guaranteed to succeed? Should we think of sending an event to the HAProtocolService, or have a reference to the HAProtocolService so that it can be directly notified about this situation? Then the HAProtocolService may transition to standby internally. The store should inform the higher entity about the fenced state and not take action on the higher entity by fencing it. Thoughts? > Make improvements in ZKRMStateStore for fencing > --- > > Key: YARN-1222 > URL: https://issues.apache.org/jira/browse/YARN-1222 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, > yarn-1222-4.patch, yarn-1222-5.patch > > > Using multi-operations for every ZK interaction. > In every operation, automatically creating/deleting a lock znode that is the > child of the root znode. This is to achieve fencing by modifying the > create/delete permissions on the root znode. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814442#comment-13814442 ] Jian He commented on YARN-674: -- Saw that this was changed back to asynchronous submission on recovery. The original intention was to prevent the client from seeing the application as a new application. If recovery is asynchronous, the client can query the application before the recover event gets processed, meaning before the application is fully recovered, as some recovery logic happens when the app is processing the recover event (app.FinalTransition). {code} // Recover the app synchronously, as otherwise client is possible to see // the application not recovered before it is actually recovered because // ClientRMService is already started at this point of time. appImpl.handle(new RMAppEvent(appImpl.getApplicationId(), RMAppEventType.RECOVER)); {code} > Slow or failing DelegationToken renewals on submission itself make RM > unavailable > - > > Key: YARN-674 > URL: https://issues.apache.org/jira/browse/YARN-674 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, > YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch > > > This was caused by YARN-280. A slow or down NameNode will make it look > like RM is unavailable as it may run out of RPC handlers due to blocked > client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-311) Dynamic node resource configuration: core scheduler changes
[ https://issues.apache.org/jira/browse/YARN-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814478#comment-13814478 ] Junping Du commented on YARN-311: - Thanks Luke for the review! > Dynamic node resource configuration: core scheduler changes > --- > > Key: YARN-311 > URL: https://issues.apache.org/jira/browse/YARN-311 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Junping Du >Assignee: Junping Du > Fix For: 2.3.0 > > Attachments: YARN-311-v1.patch, YARN-311-v10.patch, > YARN-311-v11.patch, YARN-311-v12.patch, YARN-311-v12b.patch, > YARN-311-v13.patch, YARN-311-v2.patch, YARN-311-v3.patch, YARN-311-v4.patch, > YARN-311-v4.patch, YARN-311-v5.patch, YARN-311-v6.1.patch, > YARN-311-v6.2.patch, YARN-311-v6.patch, YARN-311-v7.patch, YARN-311-v8.patch, > YARN-311-v9.patch > > > As the first step, we go for resource change on the RM side and expose admin APIs > (admin protocol, CLI, REST and JMX API) later. This jira will only > contain changes in the scheduler. > The flow to update a node's resource and make resource scheduling aware of it is: > 1. The resource update goes through the admin API to the RM and takes effect on RMNodeImpl. > 2. When the next NM heartbeat for updating status comes, the RMNode's resource > change is detected and the delta resource is added to schedulerNode's > availableResource before actual scheduling happens. > 3. The scheduler does resource allocation according to the new availableResource in > SchedulerNode. > For more design details, please refer to the proposal and discussions in the parent > JIRA: YARN-291. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-987) Adding History Service to use Store and converting Historydata to Report
[ https://issues.apache.org/jira/browse/YARN-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814486#comment-13814486 ] Zhijie Shen commented on YARN-987: -- * The unnecessary type casting is still there. {code} + @Override + protected void serviceStart() throws Exception { +LOG.info("Starting ApplicationHistory"); +if (historyStore instanceof Service) { + ((Service) historyStore).start(); +} +super.serviceStart(); + } + + @Override + protected void serviceStop() throws Exception { +LOG.info("Stopping ApplicationHistory"); +if (historyStore != null && historyStore instanceof Service) { + ((Service) historyStore).stop(); +} +super.serviceStop(); + } {code} * lastAttempt can be null, so there should be a null check. Otherwise, an NPE can be expected. Btw, it is not like the other methods, which are straightforward wrappers. Would it be good to write a test case for this one? {code} +ApplicationAttemptHistoryData lastAttempt = getLastAttempt(appHistory +.getApplicationId()); {code} > Adding History Service to use Store and converting Historydata to Report > > > Key: YARN-987 > URL: https://issues.apache.org/jira/browse/YARN-987 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-987-1.patch, YARN-987-2.patch, YARN-987-3.patch, > YARN-987-4.patch, YARN-987-5.patch > > -- This message was sent by Atlassian JIRA (v6.1#6144)
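To illustrate the two review points on YARN-987 above, a small sketch follows. It assumes nothing about the actual patch: LastAttemptSketch, its methods, and the use of String values in a TreeMap are hypothetical stand-ins for the history data types. It shows a null-checked "last attempt" lookup, and that instanceof already evaluates to false for null, which makes a separate != null test redundant.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper; the real history classes are not used here. */
class LastAttemptSketch {

  /** Returns the attempt with the highest id, or null when none was recorded. */
  static String getLastAttempt(TreeMap<Integer, String> attempts) {
    Map.Entry<Integer, String> last = attempts.lastEntry(); // null for an empty map
    return last == null ? null : last.getValue();
  }

  /** Guards against a missing last attempt instead of dereferencing it blindly. */
  static String describeApp(TreeMap<Integer, String> attempts) {
    String lastAttempt = getLastAttempt(attempts);
    if (lastAttempt == null) {
      return "no attempts recorded";
    }
    return "last attempt: " + lastAttempt;
  }

  /** instanceof is false for null, so no extra null check is needed. */
  static boolean isRunnableStore(Object store) {
    return store instanceof Runnable;
  }
}
```

A test case along these lines (empty history, then a history with two attempts) is probably what the "write a test case for this one" remark is asking for.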
[jira] [Updated] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1121: -- Attachment: YARN-1121.7.patch bq. Replacing the eventHandler to DropEventHandler (instead of GenericEventHandler) may not be enough Good catch! The new patch removes the DropEventHandler and just returns from GenericEventHandler if blockNewEvents is true. > RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.2.1 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch, > YARN-1121.6.patch, YARN-1121.7.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814496#comment-13814496 ] Omkar Vinit Joshi commented on YARN-1210: - Thanks [~jianhe] for reviewing it. {code} Instead of passing running containers as parameter in RegisterNodeManagerRequest, is it possible to just call heartBeat immediately after registerCall and then unBlockNewContainerRequests ? That way we can take advantage of the existing heartbeat logic, cover other things like keep app alive for log aggregation after AM container completes. Or at least we can send the list of ContainerStatus(including diagnostics) instead of just container Ids and also the list of keep-alive apps (separate jira)? {code} It makes sense to replace finishedContainers with containerStatuses. bq. Unnecessary import changes in DefaultContainerExecutor.java and LinuxContainerExecutor, ContainerLaunch, ContainersLauncher Actually I wanted that earlier, as I had created a new ExitCode.java and wanted to access it from ResourceTrackerService. Now that we are sending the container status from the node manager itself, that is no longer needed. Fixed it. bq. Finished containers may not necessary be killed. The containers can also normal finish and remain in the NM cache before NM resync. Updated the logic for cleanupContainers on the node manager side. Now we should have all the finishedContainer statuses as they are. bq. wrong LOG class name. :) Fixed it. bq. LogFactory.getLog(RMAppImpl.class); Removed. bq. Isn't always the case that after this patch only the last attempt can be running ? a new attempt will not be launched until the previous attempt reports back it really exits. If this is case, it can be a bug. We may only need to check that if the last attempt is finished or not. It is actually checking for any attempt being in a non-running state. Do you want me to check only the last attempt (by comparing application attempt ids)? bq. 
should we return RUNNING or ACCEPTED for apps that are not in final state ? It's ok to return RUNNING in the scope of this patch because anyways we are launching a new attempt. Later on in working preserving restart, RM can crash before attempt register, attempt can register with RM after RM comes back in which case we can then move app from ACCEPTED to RUNNING? Yes, right now I will keep it as RUNNING only. Today we don't have any information on whether the previous application master started and registered or not. Once we have that information, we can probably do this. > During RM restart, RM should start a new attempt only when previous attempt > exits for real > -- > > Key: YARN-1210 > URL: https://issues.apache.org/jira/browse/YARN-1210 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-1210.1.patch, YARN-1210.2.patch > > > When RM recovers, it can wait for existing AMs to contact RM back and then > kill them forcefully before even starting a new AM. Worst case, RM will start > a new AppAttempt after waiting for 10 mins ( the expiry interval). This way > we'll minimize multiple AMs racing with each other. This can help issues with > downstream components like Pig, Hive and Oozie during RM restart. > In the mean while, new apps will proceed as usual as existing apps wait for > recovery. > This can continue to be useful after work-preserving restart, so that AMs > which can properly sync back up with RM can continue to run and those that > don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1121) RMStateStore should flush all pending store events before closing
[ https://issues.apache.org/jira/browse/YARN-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814512#comment-13814512 ] Hadoop QA commented on YARN-1121: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612294/YARN-1121.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2379//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2379//console This message is automatically generated. 
> RMStateStore should flush all pending store events before closing > - > > Key: YARN-1121 > URL: https://issues.apache.org/jira/browse/YARN-1121 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Bikas Saha >Assignee: Jian He > Fix For: 2.2.1 > > Attachments: YARN-1121.1.patch, YARN-1121.2.patch, YARN-1121.2.patch, > YARN-1121.3.patch, YARN-1121.4.patch, YARN-1121.5.patch, YARN-1121.6.patch, > YARN-1121.6.patch, YARN-1121.7.patch > > > on serviceStop it should wait for all internal pending events to drain before > stopping. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real
[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1210: Attachment: YARN-1210.3.patch > During RM restart, RM should start a new attempt only when previous attempt > exits for real > -- > > Key: YARN-1210 > URL: https://issues.apache.org/jira/browse/YARN-1210 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-1210.1.patch, YARN-1210.2.patch, YARN-1210.3.patch > > > When RM recovers, it can wait for existing AMs to contact RM back and then > kill them forcefully before even starting a new AM. Worst case, RM will start > a new AppAttempt after waiting for 10 mins ( the expiry interval). This way > we'll minimize multiple AMs racing with each other. This can help issues with > downstream components like Pig, Hive and Oozie during RM restart. > In the mean while, new apps will proceed as usual as existing apps wait for > recovery. > This can continue to be useful after work-preserving restart, so that AMs > which can properly sync back up with RM can continue to run and those that > don't are guaranteed to be killed before starting a new attempt. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Resolved] (YARN-832) Update Resource javadoc to clarify units for memory
[ https://issues.apache.org/jira/browse/YARN-832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-832. - Resolution: Duplicate This was fixed in YARN-976 > Update Resource javadoc to clarify units for memory > --- > > Key: YARN-832 > URL: https://issues.apache.org/jira/browse/YARN-832 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha > Labels: newbie > > These values are supposed to be megabytes (need to check MB vs MiB ie 1000 vs > 1024) > /** >* Get memory of the resource. >* @return memory of the resource >*/ > @Public > @Stable > public abstract int getMemory(); > > /** >* Set memory of the resource. >* @param memory memory of the resource >*/ > @Public > @Stable > public abstract void setMemory(int memory); -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814541#comment-13814541 ] Omkar Vinit Joshi commented on YARN-674: Thanks [~jianhe], [~bikassaha]. bq. Saw this is changed back to asynchronous submission on recovery, the original intention was to prevent client from seeing the application as a new application. If asynchronously, the client can query the application before recover event gets processed, meaning before the application is fully recovered as some recover logic happens when app is processing the recover event(app.FinalTransition). Fixed to make sure that it gets updated synchronously. bq. The assert doesn't make it to the production jar - so it won't catch anything on the cluster. Need to throw an exception here. If we don't want to crash the RM here then we can log an error. When the attempt state machine gets the event then it will crash on the async dispatcher thread if the event is not handled in the current state. Discussed with Bikas offline; this is fine. > Slow or failing DelegationToken renewals on submission itself make RM > unavailable > - > > Key: YARN-674 > URL: https://issues.apache.org/jira/browse/YARN-674 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, > YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch > > > This was caused by YARN-280. A slow or down NameNode will make it look > like RM is unavailable as it may run out of RPC handlers due to blocked > client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-674: --- Attachment: YARN-674.6.patch > Slow or failing DelegationToken renewals on submission itself make RM > unavailable > - > > Key: YARN-674 > URL: https://issues.apache.org/jira/browse/YARN-674 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, > YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch > > > This was caused by YARN-280. A slow or down NameNode will make it look > like RM is unavailable as it may run out of RPC handlers due to blocked > client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hou Song updated YARN-90: - Attachment: YARN-90.patch Now I understand, thanks. Please review this patch first; I will open a new jira for the information exposure soon. > NodeManager should identify failed disks becoming good back again > - > > Key: YARN-90 > URL: https://issues.apache.org/jira/browse/YARN-90 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ravi Gummadi > Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, > YARN-90.patch > > > MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes > down, it is marked as failed forever. To reuse that disk (after it becomes > good), NodeManager needs a restart. This JIRA is to improve NodeManager to > reuse good disks (which may have been bad some time back). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-674) Slow or failing DelegationToken renewals on submission itself make RM unavailable
[ https://issues.apache.org/jira/browse/YARN-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814569#comment-13814569 ] Hadoop QA commented on YARN-674: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612310/YARN-674.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2380//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2380//console This message is automatically generated. 
> Slow or failing DelegationToken renewals on submission itself make RM > unavailable > - > > Key: YARN-674 > URL: https://issues.apache.org/jira/browse/YARN-674 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Omkar Vinit Joshi > Attachments: YARN-674.1.patch, YARN-674.2.patch, YARN-674.3.patch, > YARN-674.4.patch, YARN-674.5.patch, YARN-674.5.patch, YARN-674.6.patch > > > This was caused by YARN-280. A slow or down NameNode will make it look > like RM is unavailable as it may run out of RPC handlers due to blocked > client submissions. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-954) [YARN-321] History Service should create the webUI and wire it to HistoryStorage
[ https://issues.apache.org/jira/browse/YARN-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814591#comment-13814591 ] Devaraj K commented on YARN-954: Thanks for reminding Mayank, I will update the patch with the changes. Thanks... > [YARN-321] History Service should create the webUI and wire it to > HistoryStorage > > > Key: YARN-954 > URL: https://issues.apache.org/jira/browse/YARN-954 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Devaraj K > Attachments: YARN-954-3.patch, YARN-954-v0.patch, YARN-954-v1.patch, > YARN-954-v2.patch > > -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-954) [YARN-321] History Service should create the webUI and wire it to HistoryStorage
[ https://issues.apache.org/jira/browse/YARN-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814590#comment-13814590 ] Devaraj K commented on YARN-954: Thanks for reminding Mayank, I will update the patch with the changes. Thanks... > [YARN-321] History Service should create the webUI and wire it to > HistoryStorage > > > Key: YARN-954 > URL: https://issues.apache.org/jira/browse/YARN-954 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Devaraj K > Attachments: YARN-954-3.patch, YARN-954-v0.patch, YARN-954-v1.patch, > YARN-954-v2.patch > > -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1222) Make improvements in ZKRMStateStore for fencing
[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814596#comment-13814596 ] Karthik Kambatla commented on YARN-1222: bq. Thinking aloud, using HAServiceTarget in RMStateStore to transitionToStandby() may not be the right solution. My bad. I should have explained the choice. Post YARN-1318, I think the RMStateStore constructor should take RMContext. Then, we should be able to replace the RPC approach with rmContext.getHAService.transitionToStandby(). bq. The store should inform the higher entity about the fenced state and not take action on the higher entity by fencing it. I think it is a trade-off between pushing higher-level concepts like HA down versus spreading the logic of handling the FencedException across multiple entities. If we push it all the way down to the store implementation (ZKRMStateStore), we can get away with handling it at one location. The other extreme would be to handle it at every location where a store operation is triggered. I think handling it in RMStateStore, and not in an implementation, is a good compromise. A completely different approach might be to keep {{handleStoreFencedException()}} in {{ResourceManager}} and have the store implementation call it when it realizes it got fenced. Thoughts? > Make improvements in ZKRMStateStore for fencing > --- > > Key: YARN-1222 > URL: https://issues.apache.org/jira/browse/YARN-1222 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Karthik Kambatla > Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, > yarn-1222-4.patch, yarn-1222-5.patch > > > Using multi-operations for every ZK interaction. > In every operation, automatically creating/deleting a lock znode that is the > child of the root znode. This is to achieve fencing by modifying the > create/delete permissions on the root znode. -- This message was sent by Atlassian JIRA (v6.1#6144)
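The "store informs the higher entity" option discussed in the YARN-1222 comments above could look roughly like the sketch below. Everything in it is an assumption made for illustration: FencedListener, SketchStateStore, SketchResourceManager and the simulateFenced flag do not exist in YARN, and the real handling of a fenced store involves far more state than a single enum.

```java
/** Hypothetical callback through which a store reports that it was fenced. */
interface FencedListener {
  void onStoreFenced(Exception cause);
}

/** Hypothetical store: it reports fencing and takes no HA action itself. */
class SketchStateStore {
  private final FencedListener listener;

  SketchStateStore(FencedListener listener) {
    this.listener = listener;
  }

  /** The simulateFenced flag stands in for a real fenced-write failure. */
  void store(String key, boolean simulateFenced) {
    if (simulateFenced) {
      listener.onStoreFenced(new IllegalStateException("fenced while storing " + key));
      return;
    }
    // A real store would persist the key here.
  }
}

/** Hypothetical owner of the HA state; it decides how to react to fencing. */
class SketchResourceManager implements FencedListener {
  enum HaState { ACTIVE, STANDBY }

  private HaState state = HaState.ACTIVE;

  @Override
  public void onStoreFenced(Exception cause) {
    state = HaState.STANDBY; // internal transition, no RPC on an ACL'd protocol
  }

  HaState state() {
    return state;
  }
}
```

With this shape the fencing reaction lives in one place (the listener), while the store only knows that somebody wants to be told.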
[jira] [Commented] (YARN-987) Adding History Service to use Store and converting Historydata to Report
[ https://issues.apache.org/jira/browse/YARN-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814605#comment-13814605 ] Vinod Kumar Vavilapalli commented on YARN-987: -- Quickly scanned through the patch, comments: - reduce the scope of methods like getLastAttempt, they don't need to be public. - ApplicationHistoryContext -> ApplicationHistoryManager and ApplicationHistory -> ApplicationHistoryManagerImpl. They aren't just context objects. > Adding History Service to use Store and converting Historydata to Report > > > Key: YARN-987 > URL: https://issues.apache.org/jira/browse/YARN-987 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-987-1.patch, YARN-987-2.patch, YARN-987-3.patch, > YARN-987-4.patch, YARN-987-5.patch > > -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1266) Adding ApplicationHistoryProtocolPBService
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814618#comment-13814618 ] Vinod Kumar Vavilapalli commented on YARN-1266: --- This is not enough, you need a client wrapper too. I think we should just bite the bullet and remove ApplicationHistoryProtocol completely. We can merge the new APIs into ApplicationClientProtocol and take care of ResourceManager implementation of those APIs in a separate JIRA. > Adding ApplicationHistoryProtocolPBService > -- > > Key: YARN-1266 > URL: https://issues.apache.org/jira/browse/YARN-1266 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-1266-1.patch, YARN-1266-2.patch > > > Adding ApplicationHistoryProtocolPBService to make web apps work and > changing yarn to run AHS as a separate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-955) [YARN-321] Implementation of ApplicationHistoryProtocol
[ https://issues.apache.org/jira/browse/YARN-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814622#comment-13814622 ] Zhijie Shen commented on YARN-955: -- 1. Not necessary. The default value can be read from yarn-default.xml. The problem is that you cannot specify the prefix variables like that in the xml file. This default URI will be a relative path based on the current directory. {code} + public static final String DEFAULT_FS_HISTORY_STORE_URI = "tmp"; {code} 2. Maybe just call it AHS_ADDRESS {code} + public static final String AHS_HISTORY_ADDRESS = AHS_PREFIX + "address"; {code} 3. The nested class is not necessary. ApplicationHistoryClientService can implement ApplicationHistoryProtocol directly. {code} + private class ApplicationHSClientProtocolHandler implements + ApplicationHistoryProtocol { {code} 4. Unnecessary wrapper. Please place the simple statement directly in the callers. Same for getApplications. {code} +public List getApplicationAttempts( +ApplicationId appId) throws IOException { + List appAttemptReports = new ArrayList( + history.getApplicationAttempts(appId).values()); + return appAttemptReports; +} {code} 5. Personally, I think returning empty collections is fine to indicate no results. Otherwise, the caller always needs to check for null first. {code} + } else { +response.setApplicationList(null); + } {code} 6. Why do you want two references pointing to the same object? {code} +historyService = createApplicationHistory(); +historyContext = (ApplicationHistoryContext) historyService; {code} 7. In the original design, we said we're going to make AHS a service of RM, though it should be independent enough. In this patch, I can see AHS is going to be a completely independent process. So far, it should be OK, because AHS needs nothing from RM. 
However, I'm expecting some more security work to do if AHS is a separate process, as AHS and RM will not share the common context, and may be launched by different users. [~vinodkv], do you have any opinion about service or process? Anyway, if we decide to make AHS a process now, this patch should also include the shell script to launch AHS. > [YARN-321] Implementation of ApplicationHistoryProtocol > --- > > Key: YARN-955 > URL: https://issues.apache.org/jira/browse/YARN-955 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Mayank Bansal > Attachments: YARN-955-1.patch, YARN-955-2.patch > > -- This message was sent by Atlassian JIRA (v6.1#6144)
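Point 5 of the YARN-955 review above (empty collections rather than null) can be sketched as follows. The class AttemptListingSketch, the method listAttempts and the String-based map are hypothetical and are not taken from the patch; the only claim is the general Java idiom.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper showing the "empty collection, never null" convention. */
class AttemptListingSketch {

  /** Returns the attempt reports, or an immutable empty list when there are none. */
  static List<String> listAttempts(Map<String, String> attemptsById) {
    if (attemptsById == null || attemptsById.isEmpty()) {
      return Collections.emptyList();
    }
    return new ArrayList<>(attemptsById.values());
  }
}
```

Callers can then iterate over the result or call size() directly, without a null check at every call site.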
[jira] [Commented] (YARN-1266) Adding ApplicationHistoryProtocolPBService
[ https://issues.apache.org/jira/browse/YARN-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814639#comment-13814639 ] Zhijie Shen commented on YARN-1266: --- bq. This is not enough, you need a client wrapper too. I think [~mayank_bansal] has it in another Jira. It seems that the work of ApplicationHistoryProtocol has been split into the following Jiras: YARN-979, YARN-1266, YARN-955 and YARN-967, and the client wrapper is in YARN-967. [~mayank_bansal], please correct me if I'm wrong. bq. I think we should just bite the bullet and remove ApplicationHistoryProtocol completely. We can merge the new APIs into ApplicationClientProtocol and take care of ResourceManager implementation of those APIs in a separate JIRA. Do you mean that we have a single RPC interface, and the server-side implementation will redirect the query of completed applications/attempts/containers to AHS, right? If so, I think it makes sense, and probably simplifies the problem. However, I still have one concern about the independence of AHS. Let's say we want AHS to be a separate process like JHS in the future (or maybe now, see my comments in YARN-955): when RM is stopped, AHS cannot be accessed via the RPC interface. > Adding ApplicationHistoryProtocolPBService > -- > > Key: YARN-1266 > URL: https://issues.apache.org/jira/browse/YARN-1266 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-1266-1.patch, YARN-1266-2.patch > > > Adding ApplicationHistoryProtocolPBService to make web apps work and > changing yarn to run AHS as a separate process -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-978) [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
[ https://issues.apache.org/jira/browse/YARN-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814653#comment-13814653 ]

Vinod Kumar Vavilapalli commented on YARN-978:
----------------------------------------------

Thanks for the reviews, Zhijie. Thanks also to Xuan for reviewing the earlier patches.

> [YARN-321] Adding ApplicationAttemptReport and Protobuf implementation
> ----------------------------------------------------------------------
>
> Key: YARN-978
> URL: https://issues.apache.org/jira/browse/YARN-978
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Mayank Bansal
> Assignee: Mayank Bansal
> Fix For: YARN-321
>
> Attachments: YARN-978-1.patch, YARN-978.10.patch, YARN-978.2.patch,
> YARN-978.3.patch, YARN-978.4.patch, YARN-978.5.patch, YARN-978.6.patch,
> YARN-978.7.patch, YARN-978.8.patch, YARN-978.9.patch
>
>
> We don't have ApplicationAttemptReport and Protobuf implementation.
> Adding that.
> Thanks,
> Mayank
[jira] [Updated] (YARN-1307) Rethink znode structure for RM HA
[ https://issues.apache.org/jira/browse/YARN-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi OZAWA updated YARN-1307:
---------------------------------

Attachment: YARN-1307.4-2.patch

> Rethink znode structure for RM HA
> ---------------------------------
>
> Key: YARN-1307
> URL: https://issues.apache.org/jira/browse/YARN-1307
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Tsuyoshi OZAWA
> Assignee: Tsuyoshi OZAWA
> Attachments: YARN-1307.1.patch, YARN-1307.2.patch, YARN-1307.3.patch,
> YARN-1307.4-2.patch, YARN-1307.4.patch
>
>
> A rethink of the znode structure for RM HA has been proposed in some JIRAs (YARN-659,
> YARN-1222). The motivation of this JIRA is quoted from Bikas' comment in
> YARN-1222:
> {quote}
> We should move to creating a node hierarchy for apps such that all znodes for
> an app are stored under an app znode instead of the app root znode. This will
> help in removeApplication and also in scaling better on ZK. The earlier code
> was written this way to ensure create/delete happens under a root znode for
> fencing. But given that we have moved to multi-operations globally, this isn't
> required anymore.
> {quote}
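The layout change described in the quoted comment can be sketched as follows. This is an illustration only: it models znodes as a plain set of path strings rather than using a ZooKeeper client, and the root path, the ID formats and the function names (`flat_paths`, `nested_paths`, `remove_application`) are assumptions, not taken from the patch.

```python
# Hypothetical sketch; paths and helper names are assumptions for illustration.
ROOT = "/rmstore/apps"


def flat_paths(app_id, attempt_ids):
    """Earlier layout: app and attempt znodes are all direct children of the root."""
    return [f"{ROOT}/{app_id}"] + [f"{ROOT}/{a}" for a in attempt_ids]


def nested_paths(app_id, attempt_ids):
    """Proposed layout: attempt znodes live under their application's znode."""
    return [f"{ROOT}/{app_id}"] + [f"{ROOT}/{app_id}/{a}" for a in attempt_ids]


def remove_application(znodes, app_id):
    """Drop the app znode and its whole subtree, given only the app ID."""
    app_path = f"{ROOT}/{app_id}"
    return {
        p for p in znodes
        if p != app_path and not p.startswith(app_path + "/")
    }


znodes = set(nested_paths("application_1", ["attempt_1_1", "attempt_1_2"]))
znodes |= set(nested_paths("application_2", ["attempt_2_1"]))
remaining = remove_application(znodes, "application_1")
```

In the nested layout, removal needs only the application ID, because every related znode sits under one subtree. In the flat layout the caller would have to enumerate each attempt znode separately, and the root znode accumulates one child per application and per attempt.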