[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005674#comment-14005674 ] Tsuyoshi OZAWA commented on YARN-1474: -- I'm rebasing a patch on YARN-2017. Please wait a moment. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005667#comment-14005667 ] Tsuyoshi OZAWA commented on YARN-2017: -- Good job! > Merge some of the common lib code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Fix For: 2.5.0 > > Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, > YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch, YARN-2017.6.patch, > YARN-2017.6.patch, YARN-2017.7.patch > > > A bunch of same code is repeated among schedulers, e.g: between > FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005632#comment-14005632 ] Vinod Kumar Vavilapalli commented on YARN-2017: --- +1, looks good. Checking this in. > Merge some of the common lib code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, > YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch, YARN-2017.6.patch, > YARN-2017.6.patch, YARN-2017.7.patch > > > A bunch of same code is repeated among schedulers, e.g: between > FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005602#comment-14005602 ] Hong Zhiguo commented on YARN-2088: --- +1 I guess the failure before this patch is caused by builder.clearApplicationTags() not being called in setApplicationTags() or mergeLocalToBuilder(). The two lines below could be removed as well. {code} public GetApplicationsRequestProto getProto() { mergeLocalToProto(); -proto = viaProto ? proto : builder.build(); -viaProto = true; return proto; } {code} > Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder > > > Key: YARN-2088 > URL: https://issues.apache.org/jira/browse/YARN-2088 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Binglin Chang >Assignee: Binglin Chang > Attachments: YARN-2088.v1.patch > > > Some fields (set, list) are added to the proto builders many times; we need to > clear those fields before adding, otherwise the resulting proto contains > duplicated contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
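The clear-before-add pattern discussed above can be sketched without the generated protobuf classes. This is a minimal, illustrative model (MergeSketch and its Builder are hypothetical stand-ins, not the real GetApplicationsRequestPBImpl): if mergeLocalToBuilder() re-adds the locally cached repeated field without clearing it first, every call to getProto() accumulates duplicates.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSketch {
    // Stand-in for a generated proto builder's repeated field.
    static class Builder {
        final List<String> tags = new ArrayList<>();
        void addAllApplicationTags(List<String> t) { tags.addAll(t); }
        void clearApplicationTags() { tags.clear(); }
    }

    static final List<String> localTags = List.of("prod", "batch");

    // Buggy merge: repeated calls duplicate the cached tags.
    static void mergeBuggy(Builder b) {
        b.addAllApplicationTags(localTags);
    }

    // Fixed merge: clear the builder field before re-adding.
    static void mergeFixed(Builder b) {
        b.clearApplicationTags();
        b.addAllApplicationTags(localTags);
    }

    public static void main(String[] args) {
        Builder buggy = new Builder();
        mergeBuggy(buggy);
        mergeBuggy(buggy);          // e.g. getProto() called twice
        System.out.println(buggy.tags.size());  // 4 -- duplicated

        Builder fixed = new Builder();
        mergeFixed(fixed);
        mergeFixed(fixed);
        System.out.println(fixed.tags.size());  // 2 -- stable
    }
}
```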
[jira] [Resolved] (YARN-2094) how to enable job counters for mapreduce or applications
[ https://issues.apache.org/jira/browse/YARN-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-2094. --- Resolution: Invalid Closing it as invalid. Please ask such questions on the user mailing lists. Thanks. > how to enable job counters for mapreduce or applications > > > Key: YARN-2094 > URL: https://issues.apache.org/jira/browse/YARN-2094 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Nikhil Mulley > > Hi, > I was looking at MapReduce jobs in my YARN setup and was wondering about the > jobcounters. I do not see the jobcounters for the mapreduce applications. > When I browse through the web page for job counters, there are no job > counters. Is there a specific setting to enable the application/job counters > in YARN? Please let me know. > thanks, > Nikhil -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005604#comment-14005604 ] Hadoop QA commented on YARN-2017: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646173/YARN-2017.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3787//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3787//console This message is automatically generated. 
> Merge some of the common lib code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, > YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch, YARN-2017.6.patch, > YARN-2017.6.patch, YARN-2017.7.patch > > > A bunch of same code is repeated among schedulers, e.g: between > FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2017: -- Attachment: YARN-2017.7.patch > Merge some of the common lib code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, > YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch, YARN-2017.6.patch, > YARN-2017.6.patch, YARN-2017.7.patch > > > A bunch of same code is repeated among schedulers, e.g: between > FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005583#comment-14005583 ] Hadoop QA commented on YARN-2017: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646169/YARN-2017.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3786//console This message is automatically generated. > Merge some of the common lib code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, > YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch, YARN-2017.6.patch, > YARN-2017.6.patch > > > A bunch of same code is repeated among schedulers, e.g: between > FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2094) how to enable job counters for mapreduce or applications
[ https://issues.apache.org/jira/browse/YARN-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005575#comment-14005575 ] Rohith commented on YARN-2094: -- Hi Nikhil, Welcome to the Hadoop community. bq. When I browse through the web page for job counters, there are no job counters. Which web page are you browsing? The Counters link is available on the HistoryServer web page, in the top-left Job dropdown menu. Make sure the history server is running. You can access the job counters page at *http:///jobhistory/jobcounters/* For questions like this, please post to the Hadoop user mailing list. To subscribe, follow the link http://hadoop.apache.org/mailing_lists.html#User > how to enable job counters for mapreduce or applications > > > Key: YARN-2094 > URL: https://issues.apache.org/jira/browse/YARN-2094 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Nikhil Mulley > > Hi, > I was looking at MapReduce jobs in my YARN setup and was wondering about the > jobcounters. I do not see the jobcounters for the mapreduce applications. > When I browse through the web page for job counters, there are no job > counters. Is there a specific setting to enable the application/job counters > in YARN? Please let me know. > thanks, > Nikhil -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005574#comment-14005574 ] Junping Du commented on YARN-2017: -- Seems Jenkins was not started automatically. Kicking off the test manually. > Merge some of the common lib code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, > YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch, YARN-2017.6.patch, > YARN-2017.6.patch > > > A bunch of same code is repeated among schedulers, e.g: between > FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2017: -- Attachment: YARN-2017.6.patch Same patch re-attached to kick Jenkins. > Merge some of the common lib code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, > YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch, YARN-2017.6.patch, > YARN-2017.6.patch > > > A bunch of same code is repeated among schedulers, e.g: between > FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005564#comment-14005564 ] Jian He commented on YARN-2074: --- bq. Use this condition to decide whether this RMAppAttempt is isLastAttempt Actually, the isLastAttempt boolean is not used to determine whether to restart the AM; the method getAttemptFailureCount is used for that. Will rename this boolean flag to avoid confusion. bq. maybe we could use a more general way to check whether the AM is isPreempted (check ContainerExitStatus instead) Thinking about this: to do it, we need to persist the ContainerExitStatus in the state store as well. > Preemption of AM containers shouldn't count towards AM failures > --- > > Key: YARN-2074 > URL: https://issues.apache.org/jira/browse/YARN-2074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Jian He > Attachments: YARN-2074.1.patch, YARN-2074.2.patch > > > One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM > containers getting preempted shouldn't count towards AM failures and thus > shouldn't eventually fail applications. > We should explicitly handle AM container preemption/kill as a separate issue > and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005544#comment-14005544 ] Tsuyoshi OZAWA commented on YARN-1474: -- [~kkambatl], thanks for your review. {quote} And, let us handle the incompatible change to reinitialize in a separate JIRA. {quote} I agree with this point. Fixed the following points in the latest patch: 1. Moved the part corresponding to if (!initialized) to {{serviceInit()}}, moving the initialization code into {{initScheduler}} and {{startThreads}} to avoid code duplication. 2. Changed serviceInit and serviceStart to call {{initScheduler}} and {{startThreads}} instead of calling {{reinitialize()}}. 3. For the individual threads in the schedulers, init them in serviceInit(), but call thread.start() in serviceStart(). 4. Fixed serviceStop() for CS. 5. Fixed tests based on your idea. > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
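The init/start split described in points 1-3 above can be sketched in isolation. This is a minimal, hypothetical model (SchedulerServiceSketch is not the actual scheduler or AbstractService code): state is built in serviceInit(), background threads are started only in serviceStart(), and serviceStop() shuts them down.

```java
public class SchedulerServiceSketch {
    private Thread updateThread;
    private volatile boolean running;

    public void serviceInit() {
        // initScheduler(): read config, build internal state -- the
        // background thread is created here but not yet started.
        updateThread = new Thread(() -> {
            while (running) {
                try {
                    Thread.sleep(10); // periodic update work would go here
                } catch (InterruptedException e) {
                    return; // asked to stop
                }
            }
        });
    }

    public void serviceStart() {
        // startThreads(): only now does the update thread begin running.
        running = true;
        updateThread.start();
    }

    public void serviceStop() {
        running = false;
        updateThread.interrupt();
        try {
            updateThread.join();
        } catch (InterruptedException ignored) {
        }
    }

    public boolean isStopped() {
        return !updateThread.isAlive();
    }

    public static void main(String[] args) {
        SchedulerServiceSketch s = new SchedulerServiceSketch();
        s.serviceInit();
        s.serviceStart();
        s.serviceStop();
        System.out.println(s.isStopped()); // true
    }
}
```

Keeping thread creation out of serviceStart() and thread.start() out of serviceInit() is what lets tests initialize a scheduler without spinning up its background machinery.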
[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1368: -- Attachment: YARN-1368.3.patch Thanks Wangda for the review! The new patch also addresses the comments. bq. Should we change Resource(1024, 1) to its actual resource? Fixed. bq. For recoverContainersOnNode, is it possible NODE_ADDED happened before APP_ADDED? Not possible: APP_ADDED happens synchronously before ResourceTrackerService is started. bq. It may be better to use the two-parameter assertEquals, because delta is 0 and they are two doubles. Fixed the delta value to be 1e-8. bq. Why split AMContainerCrashedTransition into two transitions and set their states to RUNNING/LAUNCHED differently? To capture completed containers in the RUNNING/LAUNCHED states and reuse the common code. > Common work to re-populate containers’ state into scheduler > --- > > Key: YARN-1368 > URL: https://issues.apache.org/jira/browse/YARN-1368 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Jian He > Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.3.patch, > YARN-1368.combined.001.patch, YARN-1368.preliminary.patch > > > YARN-1367 adds support for the NM to tell the RM about all currently running > containers upon registration. The RM needs to send this information to the > schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover > the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2049) Delegation token stuff for the timeline sever
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2049: -- Attachment: YARN-2049.5.patch The history daemon in MiniYarnCluster is also affected by the changes. Fixed it accordingly. > Delegation token stuff for the timeline sever > - > > Key: YARN-2049 > URL: https://issues.apache.org/jira/browse/YARN-2049 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, > YARN-2049.4.patch, YARN-2049.5.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1368: -- Attachment: (was: YARN-1368.3.patch) > Common work to re-populate containers’ state into scheduler > --- > > Key: YARN-1368 > URL: https://issues.apache.org/jira/browse/YARN-1368 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Jian He > Attachments: YARN-1368.1.patch, YARN-1368.2.patch, > YARN-1368.combined.001.patch, YARN-1368.preliminary.patch > > > YARN-1367 adds support for the NM to tell the RM about all currently running > containers upon registration. The RM needs to send this information to the > schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover > the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers
[ https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2017: -- Attachment: YARN-2017.6.patch Thanks Vinod for the review! Addressed the comments. bq. The new node classes have a lot of getReservedContainer() calls which can be replaced by a single call assigned to a local variable. FicaSchedulerNode#reserveResource: the parameter reservedContainer is renamed to container, and similarly for FSSchedulerNode; a single getReservedContainer() call is now made upfront. Also suppressed the findbugs warnings. > Merge some of the common lib code in schedulers > --- > > Key: YARN-2017 > URL: https://issues.apache.org/jira/browse/YARN-2017 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch, > YARN-2017.4.patch, YARN-2017.4.patch, YARN-2017.5.patch, YARN-2017.6.patch > > > A bunch of same code is repeated among schedulers, e.g: between > FicaSchedulerNode and FSSchedulerNode. It's good to merge and share them in a > common base. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005511#comment-14005511 ] Jian He commented on YARN-1368: --- The new patch is rebased on YARN-2017 and created a new ContainerRecoveryReport record in NM-RM protocol to include the container resource capability. > Common work to re-populate containers’ state into scheduler > --- > > Key: YARN-1368 > URL: https://issues.apache.org/jira/browse/YARN-1368 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Jian He > Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.3.patch, > YARN-1368.combined.001.patch, YARN-1368.preliminary.patch > > > YARN-1367 adds support for the NM to tell the RM about all currently running > containers upon registration. The RM needs to send this information to the > schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover > the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler
[ https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1368: -- Attachment: YARN-1368.3.patch > Common work to re-populate containers’ state into scheduler > --- > > Key: YARN-1368 > URL: https://issues.apache.org/jira/browse/YARN-1368 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bikas Saha >Assignee: Jian He > Attachments: YARN-1368.1.patch, YARN-1368.2.patch, YARN-1368.3.patch, > YARN-1368.combined.001.patch, YARN-1368.preliminary.patch > > > YARN-1367 adds support for the NM to tell the RM about all currently running > containers upon registration. The RM needs to send this information to the > schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover > the current allocation state of the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005493#comment-14005493 ] Junping Du commented on YARN-1338: -- Thanks for addressing my comments, [~jlowe]! Some additional comments: I think currently we are using initStorage(conf) to create DB items for storing NMState when the NM starts for the first time, and the same method for locating DB items when the NM restarts. Do we have any code to destroy the DB items for NMState when the NM is decommissioned (i.e., not expecting a short-term restart)? If not, when the NM is recommissioned - which should be recognized as a fresh node - it will still have stale NMState info if NM_RECOVERY_DIR and DB_NAME have not changed. Am I missing anything here? In LocalResourcesTrackerImpl#recoverResource() {code} +incrementFileCountForLocalCacheDirectory(localDir.getParent()); {code} Given that localDir is already the parent of localPath, maybe we should just increment localDir rather than its parent? I didn't see a unit test that checks the file count for the resource directory after recovery. Maybe we should add one? > Recover localized resource cache state upon nodemanager restart > --- > > Key: YARN-1338 > URL: https://issues.apache.org/jira/browse/YARN-1338 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1338.patch, YARN-1338v2.patch, > YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch > > > Today when node manager restarts we clean up all the distributed cache files > from disk. This is definitely not ideal from 2 aspects. > * For work preserving restart we definitely want them as running containers > are using them > * For even non work preserving restart this will be useful in the sense that > we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1913: -- Attachment: YARN-1913.patch Initial patch for review. Adds a queueMaxAMShare configuration for each queue and updates MaxRunningAppsEnforcer.java to take the AM share into account. Instead of using accurate AM resource usage, it uses a simpler approximation: max_app_limited_by_AM = (queue.queueMaxAMShare * queue.maxShare) / scheduler.minAllocation. > With Fair Scheduler, cluster can logjam when all resources are consumed by AMs > -- > > Key: YARN-1913 > URL: https://issues.apache.org/jira/browse/YARN-1913 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Karthik Kambatla > Attachments: YARN-1913.patch > > > It's possible to deadlock a cluster by submitting many applications at once, > and have all cluster resources taken up by AMs. > One solution is for the scheduler to limit resources taken up by AMs, as a > percentage of total cluster resources, via a "maxApplicationMasterShare" > config. -- This message was sent by Atlassian JIRA (v6.2#6252)
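The approximation in the comment above is simple arithmetic, sketched here with illustrative names (AmShareCap and its parameters are hypothetical, not the actual FairScheduler fields): rather than tracking exact AM resource usage, cap the number of concurrently running apps by assuming each AM consumes at least the scheduler's minimum allocation.

```java
public class AmShareCap {
    // max_app_limited_by_AM = (queueMaxAMShare * queue maxShare) / minAllocation,
    // all resource sizes expressed here in MB for simplicity.
    static int maxAppsByAmShare(double queueMaxAMShare,
                                int queueMaxShareMb,
                                int minAllocationMb) {
        return (int) ((queueMaxAMShare * queueMaxShareMb) / minAllocationMb);
    }

    public static void main(String[] args) {
        // e.g. a queue capped at 100 GB, AM share 0.5, 1 GB min allocation:
        // 0.5 * 102400 / 1024 = 50 concurrently running apps at most.
        System.out.println(maxAppsByAmShare(0.5, 102400, 1024)); // 50
    }
}
```

Because an AM may use more than the minimum allocation, this bound is optimistic, which is why it is described as an easier alternative to accounting for actual AM usage.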
[jira] [Assigned] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan reassigned YARN-1913: - Assignee: Wei Yan (was: Karthik Kambatla) > With Fair Scheduler, cluster can logjam when all resources are consumed by AMs > -- > > Key: YARN-1913 > URL: https://issues.apache.org/jira/browse/YARN-1913 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.3.0 >Reporter: bc Wong >Assignee: Wei Yan > Attachments: YARN-1913.patch > > > It's possible to deadlock a cluster by submitting many applications at once, > and have all cluster resources taken up by AMs. > One solution is for the scheduler to limit resources taken up by AMs, as a > percentage of total cluster resources, via a "maxApplicationMasterShare" > config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2049) Delegation token stuff for the timeline sever
[ https://issues.apache.org/jira/browse/YARN-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2049: -- Attachment: YARN-2049.4.patch Updated the patch given YARN-1938 is committed > Delegation token stuff for the timeline sever > - > > Key: YARN-2049 > URL: https://issues.apache.org/jira/browse/YARN-2049 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2049.1.patch, YARN-2049.2.patch, YARN-2049.3.patch, > YARN-2049.4.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2094) how to enable job counters for mapreduce or applications
Nikhil Mulley created YARN-2094: --- Summary: how to enable job counters for mapreduce or applications Key: YARN-2094 URL: https://issues.apache.org/jira/browse/YARN-2094 Project: Hadoop YARN Issue Type: Bug Reporter: Nikhil Mulley Hi, I was looking at MapReduce jobs in my YARN setup and was wondering about the jobcounters. I do not see the jobcounters for the mapreduce applications. When I browse through the web page for job counters, there are no job counters. Is there a specific setting to enable the application/job counters in YARN? Please let me know. thanks, Nikhil -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2082) Support for alternative log aggregation mechanism
[ https://issues.apache.org/jira/browse/YARN-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005427#comment-14005427 ] Zhijie Shen commented on YARN-2082: --- Just thinking out loud: instead of making another store based on HBase to host the aggregated logs, is it possible to reuse the timeline store? I think the event-stream data model should be suitable in this case, and there's pending work to scale out the timeline store with HBase as well (YARN-2032). The additional benefit is that the interfaces for publishing and querying the data are ready, and we just need to change the hook or wrap them into a log aggregation plugin. > Support for alternative log aggregation mechanism > - > > Key: YARN-2082 > URL: https://issues.apache.org/jira/browse/YARN-2082 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Ming Ma > > I will post a more detailed design later. Here is the brief summary and would > like to get early feedback. > Problem Statement: > The current implementation of log aggregation creates one HDFS file for each > {application, nodemanager} pair. These files are relatively small, in the range of > 1-2 MB. In a large cluster with lots of applications and many nodemanagers, it > ends up creating lots of small files in HDFS. This puts pressure on the HDFS > NN in the following ways. > 1. It increases NN memory size. This is mitigated by having the history server > delete old log files in HDFS. > 2. Runtime RPC hit on HDFS. Each log aggregation file introduces several NN > RPCs such as create, getAdditionalBlock, complete, and rename. When the cluster > is busy, such RPC traffic has an impact on NN performance. > In addition, to support non-MR applications on YARN, we might need to support > aggregation for long running applications. > Design choices: > 1. Don't aggregate all the logs, as in YARN-221. > 2. Create a dedicated HDFS namespace used only for log aggregation. > 3. Write logs to some key-value store like HBase. 
HBase's RPC hit on NN will > be much less. > 4. Decentralize the application-level log aggregation to NMs. All logs for a > given application are aggregated first by a dedicated NM before they are pushed > to HDFS. > 5. Have NMs aggregate logs on a regular basis; each of these log files will > have data from different applications, and there needs to be some index for > quick lookup. > Proposal: > 1. Make YARN log aggregation pluggable for both the read and write paths. Note > that Hadoop FileSystem provides an abstraction, and we could ask an alternative > log aggregator to implement a compatible FileSystem, but that seems to be overkill. > 2. Provide a log aggregation plugin that writes to HBase. The schema design > needs to support efficient reads on a per-application as well as a per > application+container basis; in addition, it shouldn't create hotspots in a > cluster where certain users might create more jobs than others. For example, > we can use hash($user + $applicationId) + containerId as the row key. -- This message was sent by Atlassian JIRA (v6.2#6252)
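The row-key scheme at the end of the proposal can be sketched as follows. This is a hypothetical illustration (LogRowKey and its formatting choices are not from the patch): a bounded hash of user+applicationId spreads writes across regions to avoid hotspotting, while keeping all containers of one application adjacent for efficient per-application scans.

```java
public class LogRowKey {
    // hash($user + $applicationId) + containerId, rendered as a fixed-width
    // numeric prefix so keys sort by prefix first. A real schema would use a
    // stable hash (e.g. an MD5 prefix) rather than String.hashCode().
    static String rowKey(String user, String applicationId, String containerId) {
        int prefix = Math.floorMod((user + applicationId).hashCode(), 1 << 16);
        return String.format("%05d!%s!%s", prefix, applicationId, containerId);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("alice",
                "application_1400092144371_0003",
                "container_1400092144371_0003_01_000001"));
    }
}
```

All rows for one application share the same prefix and applicationId, so a per-application read is a single contiguous scan; a per-container read narrows it further by the containerId suffix.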
[jira] [Commented] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005409#comment-14005409 ] Sandy Ryza commented on YARN-2012: -- My thinking is that QueuePlacementRule.assignAppToQueue should return "" (pass) if the queue returned by the default rule is not configured and create is false. I think this is a rare case that could only be a result of misconfiguration, so it's not worth adding any special handling that complicates the logic. > Fair Scheduler : Default rule in queue placement policy can take a queue as > an optional attribute > - > > Key: YARN-2012 > URL: https://issues.apache.org/jira/browse/YARN-2012 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt > > > Currently the 'default' rule in the queue placement policy, if applied, puts the app in the > root.default queue. It would be great if we could make the 'default' rule > optionally point to a different queue as the default queue. This queue should be > an existing queue; if not, we fall back to the root.default queue, hence keeping > this rule terminal. > This default queue can be a leaf queue, or it can also be a parent queue if > the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
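The fallback behavior proposed in the comment above can be sketched in a few lines. This is an illustrative model only (the real QueuePlacementRule API takes different arguments): when the default rule points at a queue that is not configured and create is false, the rule returns "" to pass, letting the policy fall through instead of failing.

```java
import java.util.Set;

public class DefaultRuleSketch {
    // Returns the target queue, or "" ("pass") when the configured default
    // queue does not exist and the rule is not allowed to create it.
    static String assignAppToQueue(String defaultQueue,
                                   boolean create,
                                   Set<String> configuredQueues) {
        if (create || configuredQueues.contains(defaultQueue)) {
            return defaultQueue;
        }
        return ""; // pass: fall back (e.g. ultimately to root.default)
    }

    public static void main(String[] args) {
        Set<String> queues = Set.of("root.default", "root.adhoc");
        // Configured queue: placed directly.
        System.out.println(assignAppToQueue("root.adhoc", false, queues));
        // Misconfigured default target with create=false: rule passes.
        System.out.println(assignAppToQueue("root.etl", false, queues));
    }
}
```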
[jira] [Commented] (YARN-2093) Fair Scheduler IllegalStateException after upgrade from 2.2.0 to 2.4.1-SNAP
[ https://issues.apache.org/jira/browse/YARN-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005407#comment-14005407 ] Jon Bringhurst commented on YARN-2093: -- RM-HA is enabled. This only happened on the first start after upgrading from 2.2.0. Starting the RM again after the first start works without error. I haven't tried to do an upgrade again, so I'm not sure if it's reproducible. [ https://issues.apache.org/jira/browse/YARN-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005397#comment-14005397 ] Sandy Ryza commented on YARN-2093: -- Thanks for reporting this Jon. Did this occur in an RM-HA setup? Is it reproducible? -- This message was sent by Atlassian JIRA (v6.2#6252) > Fair Scheduler IllegalStateException after upgrade from 2.2.0 to 2.4.1-SNAP > --- > > Key: YARN-2093 > URL: https://issues.apache.org/jira/browse/YARN-2093 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Jon Bringhurst > > After upgrading from 2.2.0 to 2.4.1-SNAP, I ran into the following on startup: > {noformat} > 21:19:34,308 INFO RMAppAttemptImpl:659 - > appattempt_1400092144371_0003_09 State change from SUBMITTED to SCHEDULED > 21:19:34,309 INFO RMAppAttemptImpl:659 - > appattempt_1400092144371_0004_08 State change from SUBMITTED to SCHEDULED > 21:19:34,310 INFO RMAppAttemptImpl:659 - > appattempt_1400092144371_0003_10 State change from SUBMITTED to SCHEDULED > 21:19:34,310 INFO RMAppAttemptImpl:659 - > appattempt_1400092144371_0003_11 State change from SUBMITTED to SCHEDULED > 21:19:34,317 INFO FairScheduler:673 - Added Application Attempt > appattempt_1400092144371_0004_09 to scheduler from user: > samza-perf-playground > 21:19:34,318 INFO FairScheduler:673 - Added Application Attempt > appattempt_1400092144371_0004_10 to scheduler from user: > samza-perf-playground > 21:19:34,318 INFO RMAppAttemptImpl:659 - > appattempt_1400092144371_0004_09 State change from 
SUBMITTED to SCHEDULED > 21:19:34,318 INFO FairScheduler:733 - Application > appattempt_1400092144371_0003_05 is done. finalState=FAILED > 21:19:34,319 INFO RMAppAttemptImpl:659 - > appattempt_1400092144371_0004_10 State change from SUBMITTED to SCHEDULED > 21:19:34,319 INFO AppSchedulingInfo:108 - Application > application_1400092144371_0003 requests cleared > 21:19:34,319 INFO FairScheduler:673 - Added Application Attempt > appattempt_1400092144371_0004_11 to scheduler from user: > samza-perf-playground > 21:19:34,320 INFO FairScheduler:733 - Application > appattempt_1400092144371_0003_06 is done. finalState=FAILED > 21:19:34,320 INFO AppSchedulingInfo:108 - Application > application_1400092144371_0003 requests cleared > 21:19:34,320 INFO RMAppAttemptImpl:659 - > appattempt_1400092144371_0004_11 State change from SUBMITTED to SCHEDULED > 21:19:34,323 FATAL ResourceManager:600 - Error in handling event type > APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Given app to remove > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp@429f809d > does not exist in queue [root.samza-perf-playground, demand= vCores:0>, running=, share=, > w=] > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:93) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:774) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1201) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) > at java.lang.Thread.run(Thread.java:744) > 21:19:34,330 INFO ResourceManager:604 - Exiting, bbye.. 
> 21:19:34,335 INFO log:67 - Stopped SelectChannelConnector@:8088 > 21:19:34,437 INFO Server:2398 - Stopping server on 8033 > 21:19:34,438 INFO Server:694 - Stopping IPC Server listener on 8033 > {noformat} > Last commit message for this build is (branch-2.4 on > github.com/apache/hadoop-common): > {noformat} > commit 09e24d5519187c0db67aacc1992be5d43829aa1e > Author: Arpit Agarwal > Date: Tue May 20 20:18:46 2014 + > HADOOP-10562. Fix CHANGES.txt entry again > > git-svn-id: > https://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.4@1596389 > 13f79535-47bb-0310-9956-ffa450edef68 > {noformat} -- This message wa
[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005402#comment-14005402 ] Hadoop QA commented on YARN-2054: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646115/yarn-2054-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3785//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3785//console This message is automatically generated. 
> Poor defaults for YARN ZK configs for retries and retry-inteval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch, yarn-2054-2.patch > > > Currently, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulative 1000 seconds before the RM gives up trying to > connect to ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
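The arithmetic behind the complaint, plus one way an interval could be derived from the session timeout as the attached patch's summary suggests (the derivation here is a sketch of the idea, not the patch's actual formula):

```java
public class ZkRetryDefaults {
    // Total time the RM keeps retrying before giving up.
    static long cumulativeWaitMs(int numRetries, long retryIntervalMs) {
        return numRetries * retryIntervalMs;
    }

    // One plausible fix: spread the retries across the ZK session timeout
    // instead of configuring count and interval independently.
    static long derivedIntervalMs(long sessionTimeoutMs, int numRetries) {
        return Math.max(1, sessionTimeoutMs / numRetries);
    }

    public static void main(String[] args) {
        // The defaults in question: 500 retries x 2000 ms = 1000 seconds.
        System.out.println(cumulativeWaitMs(500, 2000) / 1000 + " s");
        // e.g. a 10 s session timeout spread across 10 retries -> 1000 ms each.
        System.out.println(derivedIntervalMs(10_000, 10) + " ms");
    }
}
```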
[jira] [Commented] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
[ https://issues.apache.org/jira/browse/YARN-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005398#comment-14005398 ] Hadoop QA commented on YARN-2089: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646089/yarn-2089.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3784//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3784//console This message is automatically generated. 
> FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing > audience annotations > --- > > Key: YARN-2089 > URL: https://issues.apache.org/jira/browse/YARN-2089 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: zhihai xu > Labels: newbie > Attachments: yarn-2089.patch > > > We should mark QueuePlacementPolicy and QueuePlacementRule with audience > annotations @Private @Unstable -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1474: - Attachment: (was: YARN-1474.15.patch) > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1474) Make schedulers services
[ https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1474: - Attachment: YARN-1474.15.patch > Make schedulers services > > > Key: YARN-1474 > URL: https://issues.apache.org/jira/browse/YARN-1474 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.3.0, 2.4.0 >Reporter: Sandy Ryza >Assignee: Tsuyoshi OZAWA > Attachments: YARN-1474.1.patch, YARN-1474.10.patch, > YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, > YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.2.patch, YARN-1474.3.patch, > YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, > YARN-1474.8.patch, YARN-1474.9.patch > > > Schedulers currently have a reinitialize but no start and stop. Fitting them > into the YARN service model would make things more coherent. -- This message was sent by Atlassian JIRA (v6.2#6252)
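For context, the change YARN-1474 proposes — giving schedulers explicit init/start/stop in addition to reinitialize — follows the usual service-lifecycle pattern. This minimal sketch is illustrative only and does not use YARN's actual AbstractService/CompositeService classes:

```java
public class SchedulerService {
    enum State { NOTINITED, INITED, STARTED, STOPPED }

    private State state = State.NOTINITED;

    // The lifecycle the YARN service model provides; a scheduler today only
    // has reinitialize(), so fitting it into this model adds explicit
    // start/stop hooks for threads and resources.
    void init()  { check(State.NOTINITED); state = State.INITED;  /* read config */ }
    void start() { check(State.INITED);    state = State.STARTED; /* spawn update thread */ }
    void stop()  { state = State.STOPPED;  /* join threads, release resources */ }

    // Still available while running: reload queues/allocations in place.
    void reinitialize() {
        if (state != State.STARTED) throw new IllegalStateException("in " + state);
    }

    State getState() { return state; }

    private void check(State expected) {
        if (state != expected) throw new IllegalStateException("in " + state);
    }

    public static void main(String[] args) {
        SchedulerService s = new SchedulerService();
        s.init(); s.start(); s.reinitialize(); s.stop();
        System.out.println(s.getState());
    }
}
```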
[jira] [Commented] (YARN-1868) YARN status web ui does not show correctly in IE 11
[ https://issues.apache.org/jira/browse/YARN-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005387#comment-14005387 ] Mike Liddell commented on YARN-1868: I didn't see this mentioned: A specific workaround in IE 11 is Settings|Compatibility View Settings|Display intranet sites in Compatibility View -> False. > YARN status web ui does not show correctly in IE 11 > --- > > Key: YARN-1868 > URL: https://issues.apache.org/jira/browse/YARN-1868 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 3.0.0 >Reporter: Chuan Liu >Assignee: Chuan Liu > Attachments: YARN-1868.1.patch, YARN-1868.2.patch, YARN-1868.patch, > YARN_status.png > > > The YARN status web ui does not show correctly in IE 11. The drop-down menus > for app entries are not shown. Also, the navigation menu displays incorrectly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1938) Kerberos authentication for the timeline server
[ https://issues.apache.org/jira/browse/YARN-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005369#comment-14005369 ] Vinod Kumar Vavilapalli commented on YARN-1938: --- Looks good. +1. Checking this in.. > Kerberos authentication for the timeline server > --- > > Key: YARN-1938 > URL: https://issues.apache.org/jira/browse/YARN-1938 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-1938.1.patch, YARN-1938.2.patch, YARN-1938.3.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005370#comment-14005370 ] Xuan Gong commented on YARN-2074: - Comments: 1. {code} RMAppAttempt attempt = new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService, submissionContext, conf, maxAppAttempts <= attempts.size()); {code} Using this condition to decide whether this RMAppAttempt is the last attempt does not sound right to me. For example, if we set maxAppAttempts to 3 but the previous 2 AMs were preempted, then based on the condition you set here, the next RMAppAttempt is the last attempt? If that attempt fails, the whole application will be marked as failed. 2. {code} public boolean isPreempted() { return getDiagnostics().contains(SchedulerUtils.PREEMPTED_CONTAINER); } {code} It is fine to use this to check isPreempted. But per https://issues.apache.org/jira/browse/YARN-614, that ticket basically says we should separate hardware failures or YARN issues from AM failures and not count them as AM failures. I think preemption of the AM is one of them. So maybe we could use a more general way to check whether the AM was preempted (check ContainerExitStatus instead?) > Preemption of AM containers shouldn't count towards AM failures > --- > > Key: YARN-2074 > URL: https://issues.apache.org/jira/browse/YARN-2074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Jian He > Attachments: YARN-2074.1.patch, YARN-2074.2.patch > > > One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM > containers getting preempted shouldn't count towards AM failures and thus > shouldn't eventually fail applications. > We should explicitly handle AM container preemption/kill as a separate issue > and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
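The more general check suggested in the comment above — looking at the container exit status rather than grepping the diagnostics string — might look like this; the -102 value mirrors ContainerExitStatus.PREEMPTED in Hadoop 2.x and is reproduced here as an assumption so the sketch stays self-contained:

```java
public class PreemptionCheck {
    // Assumed to mirror YARN's ContainerExitStatus.PREEMPTED (-102 in 2.x).
    // Checking the exit status is more robust than string-matching the
    // diagnostics, which can change wording or be truncated.
    static final int PREEMPTED = -102;

    static boolean isPreempted(int containerExitStatus) {
        return containerExitStatus == PREEMPTED;
    }

    public static void main(String[] args) {
        System.out.println(isPreempted(-102)); // preempted AM container
        System.out.println(isPreempted(0));    // clean exit
    }
}
```

An exit-status check also generalizes naturally to the other "not the AM's fault" cases YARN-614 describes, since those get their own exit codes.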
[jira] [Commented] (YARN-2070) DistributedShell publishes unfriendly user information to the timeline server
[ https://issues.apache.org/jira/browse/YARN-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005340#comment-14005340 ] Zhijie Shen commented on YARN-2070: --- bq. the Server itself is going to start injecting a user-name that is the sole authority. In YARN-1937, I try to keep users away from the system information (entity owner here), and it will be removed before the entity/event is returned to the user. > DistributedShell publishes unfriendly user information to the timeline server > - > > Key: YARN-2070 > URL: https://issues.apache.org/jira/browse/YARN-2070 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Robert Kanter >Priority: Minor > Labels: newbie > Attachments: YARN-2070.patch > > > Below is the code that uses the string form of the current user object as the "user" > value. > {code} > entity.addPrimaryFilter("user", UserGroupInformation.getCurrentUser() > .toString()); > {code} > When we use kerberos authentication, it's going to output the full name, such > as "zjshen/localhost@LOCALHOST (auth.KERBEROS)". It is not user friendly for > searching by the primary filters. It's better to use shortUserName instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
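The shortUserName fix suggested in the description amounts to trimming the Kerberos principal down to its leading component. This standalone sketch approximates what UserGroupInformation.getShortUserName() does under default auth_to_local rules; the real implementation is rule-driven and more involved:

```java
public class ShortUserName {
    // Illustrative only: reduce a principal like "zjshen/localhost@LOCALHOST"
    // to "zjshen" by cutting at the first '/' (instance) or '@' (realm).
    static String shortUserName(String principal) {
        int end = principal.length();
        int slash = principal.indexOf('/');
        int at = principal.indexOf('@');
        if (slash >= 0) end = Math.min(end, slash);
        if (at >= 0) end = Math.min(end, at);
        return principal.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(shortUserName("zjshen/localhost@LOCALHOST")); // zjshen
        System.out.println(shortUserName("alice@EXAMPLE.COM"));          // alice
    }
}
```

The short name is stable across auth mechanisms, which is what makes it usable as a primary-filter value.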
[jira] [Commented] (YARN-2092) Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/YARN-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005335#comment-14005335 ] Zhijie Shen commented on YARN-2092: --- {code} 2014-05-19 20:09:07,933 FATAL [HistoryEventHandlingThread] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[HistoryEventHandlingThread,5,main] threw an Error. Shutting down now... java.lang.NoSuchMethodError: org.codehaus.jackson.map.ObjectMapper.setSerializationInclusion(Lorg/codehaus/jackson/map/annotate/JsonSerialize$Inclusion;)Lorg/codehaus/jackson/map/ObjectMapper; at org.apache.hadoop.yarn.webapp.YarnJacksonJaxbJsonProvider.locateMapper(YarnJacksonJaxbJsonProvider.java:54) at org.codehaus.jackson.jaxrs.JacksonJsonProvider.writeTo(JacksonJsonProvider.java:488) {code} JacksonJsonProvider is in jackson-jaxrs while ObjectMapper is in jackson-mapper-asl. If I understand correctly, it looks like the two libs' versions don't match. Hadoop uses the following 4 jackson libs. {code} org.codehaus.jackson jackson-mapper-asl 1.9.13 org.codehaus.jackson jackson-core-asl 1.9.13 org.codehaus.jackson jackson-jaxrs 1.9.13 org.codehaus.jackson jackson-xc 1.9.13 {code} Given that Tez includes all 4 of these jars at 1.8.8 in its classpath, whether placed before or after the Hadoop classpath, there shouldn't be a mismatch. On the other hand, if Tez includes only some of these 4 jars and puts those libs before the Hadoop libs, this problem will occur. Say: {code} cp=...:jackson-jaxrs-1.8.8.jar:jackson-xc-1.8.3.jar:jackson-jaxrs-1.9.13.jar:jackson-xc-1.9.13.jar:jackson-mapper-asl-1.9.13.jar:jackson-xc-1.9.13.jar:... {code} > Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to > 2.5.0-SNAPSHOT > > > Key: YARN-2092 > URL: https://issues.apache.org/jira/browse/YARN-2092 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Hitesh Shah > > Came across this when trying to integrate with the timeline server. 
Using a > 1.8.8 dependency of jackson works fine against 2.4.0 but fails against > 2.5.0-SNAPSHOT which needs 1.9.13. This is in the scenario where the user > jars are first in the classpath. -- This message was sent by Atlassian JIRA (v6.2#6252)
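When diagnosing this kind of mismatch, it can help to ask the JVM which jar actually supplied a class. This is a generic reflection-based check, not anything specific to YARN or Tez:

```java
public class WhichJar {
    // Reports where a class was loaded from -- handy for confirming which of
    // several jackson-* versions on the classpath wins the race.
    static String locationOf(String className) {
        try {
            Class<?> c = Class.forName(className);
            java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap/platform class, no CodeSource)"
                               : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // In the scenario above one would pass
        // "org.codehaus.jackson.map.ObjectMapper" and inspect the jar path.
        System.out.println(locationOf("java.lang.String"));
    }
}
```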
[jira] [Commented] (YARN-2070) DistributedShell publishes unfriendly user information to the timeline server
[ https://issues.apache.org/jira/browse/YARN-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005320#comment-14005320 ] Vinod Kumar Vavilapalli commented on YARN-2070: --- With some of the tickets under YARN-1935, the Server itself is going to start injecting a user-name that is the sole authority. Given that, should we consider dropping this completely? > DistributedShell publishes unfriendly user information to the timeline server > - > > Key: YARN-2070 > URL: https://issues.apache.org/jira/browse/YARN-2070 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Robert Kanter >Priority: Minor > Labels: newbie > Attachments: YARN-2070.patch > > > Below is the code that uses the string form of the current user object as the "user" > value. > {code} > entity.addPrimaryFilter("user", UserGroupInformation.getCurrentUser() > .toString()); > {code} > When we use kerberos authentication, it's going to output the full name, such > as "zjshen/localhost@LOCALHOST (auth.KERBEROS)". It is not user friendly for > searching by the primary filters. It's better to use shortUserName instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval
[ https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2054: --- Attachment: yarn-2054-2.patch A patch that sets the retry interval based on the session timeout, number of retries and whether HA is enabled. > Poor defaults for YARN ZK configs for retries and retry-inteval > --- > > Key: YARN-2054 > URL: https://issues.apache.org/jira/browse/YARN-2054 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-2054-1.patch, yarn-2054-2.patch > > > Currently, we have the following default values: > # yarn.resourcemanager.zk-num-retries - 500 > # yarn.resourcemanager.zk-retry-interval-ms - 2000 > This leads to a cumulative 1000 seconds before the RM gives up trying to > connect to ZK. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2093) Fair Scheduler IllegalStateException after upgrade from 2.2.0 to 2.4.1-SNAP
[ https://issues.apache.org/jira/browse/YARN-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jon Bringhurst updated YARN-2093: - Description: After upgrading from 2.2.0 to 2.4.1-SNAP, I ran into the following on startup: {noformat} 21:19:34,308 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_09 State change from SUBMITTED to SCHEDULED 21:19:34,309 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_08 State change from SUBMITTED to SCHEDULED 21:19:34,310 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_10 State change from SUBMITTED to SCHEDULED 21:19:34,310 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_11 State change from SUBMITTED to SCHEDULED 21:19:34,317 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_09 to scheduler from user: samza-perf-playground 21:19:34,318 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_10 to scheduler from user: samza-perf-playground 21:19:34,318 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_09 State change from SUBMITTED to SCHEDULED 21:19:34,318 INFO FairScheduler:733 - Application appattempt_1400092144371_0003_05 is done. finalState=FAILED 21:19:34,319 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_10 State change from SUBMITTED to SCHEDULED 21:19:34,319 INFO AppSchedulingInfo:108 - Application application_1400092144371_0003 requests cleared 21:19:34,319 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_11 to scheduler from user: samza-perf-playground 21:19:34,320 INFO FairScheduler:733 - Application appattempt_1400092144371_0003_06 is done. 
finalState=FAILED 21:19:34,320 INFO AppSchedulingInfo:108 - Application application_1400092144371_0003 requests cleared 21:19:34,320 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_11 State change from SUBMITTED to SCHEDULED 21:19:34,323 FATAL ResourceManager:600 - Error in handling event type APP_ATTEMPT_REMOVED to the scheduler java.lang.IllegalStateException: Given app to remove org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp@429f809d does not exist in queue [root.samza-perf-playground, demand=, running=, share=, w=] at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:93) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:774) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1201) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) at java.lang.Thread.run(Thread.java:744) 21:19:34,330 INFO ResourceManager:604 - Exiting, bbye.. 21:19:34,335 INFO log:67 - Stopped SelectChannelConnector@:8088 21:19:34,437 INFO Server:2398 - Stopping server on 8033 21:19:34,438 INFO Server:694 - Stopping IPC Server listener on 8033 {noformat} Last commit message for this build is (branch-2.4 on github.com/apache/hadoop-common): {noformat} commit 09e24d5519187c0db67aacc1992be5d43829aa1e Author: Arpit Agarwal Date: Tue May 20 20:18:46 2014 + HADOOP-10562. 
Fix CHANGES.txt entry again git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.4@1596389 13f79535-47bb-0310-9956-ffa450edef68 {noformat} was: After upgrading from 2.2.0 to 2.4.1-SNAP, I ran into the following on startup: {noformat} 21:19:34,308 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_09 State change from SUBMITTED to SCHEDULED 21:19:34,309 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_08 State change from SUBMITTED to SCHEDULED 21:19:34,310 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_10 State change from SUBMITTED to SCHEDULED 21:19:34,310 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_11 State change from SUBMITTED to SCHEDULED 21:19:34,317 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_09 to scheduler from user: samza-perf-playground 21:19:34,318 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_10 to scheduler from user: samza-perf-playground 21:19:34,318 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_09 State change from SUBMITTED to SCHEDULED 21:19:34,318 INFO FairScheduler:733 - Application appattempt_1400092144371_0003_05 is done. finalState=FAILED 21:19:34,319 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_10 State change from SUBMITTED to SCHEDULED 21:19:34,319 INFO AppSchedulingIn
[jira] [Updated] (YARN-2093) Fair Scheduler IllegalStateException after upgrade from 2.2.0 to 2.4.1-SNAP
[ https://issues.apache.org/jira/browse/YARN-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jon Bringhurst updated YARN-2093: - Description: After upgrading from 2.2.0 to 2.4.1-SNAP, I ran into the following on startup: {noformat} 21:19:34,308 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_09 State change from SUBMITTED to SCHEDULED 21:19:34,309 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_08 State change from SUBMITTED to SCHEDULED 21:19:34,310 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_10 State change from SUBMITTED to SCHEDULED 21:19:34,310 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_11 State change from SUBMITTED to SCHEDULED 21:19:34,317 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_09 to scheduler from user: samza-perf-playground 21:19:34,318 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_10 to scheduler from user: samza-perf-playground 21:19:34,318 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_09 State change from SUBMITTED to SCHEDULED 21:19:34,318 INFO FairScheduler:733 - Application appattempt_1400092144371_0003_05 is done. finalState=FAILED 21:19:34,319 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_10 State change from SUBMITTED to SCHEDULED 21:19:34,319 INFO AppSchedulingInfo:108 - Application application_1400092144371_0003 requests cleared 21:19:34,319 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_11 to scheduler from user: samza-perf-playground 21:19:34,320 INFO FairScheduler:733 - Application appattempt_1400092144371_0003_06 is done. 
finalState=FAILED 21:19:34,320 INFO AppSchedulingInfo:108 - Application application_1400092144371_0003 requests cleared 21:19:34,320 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_11 State change from SUBMITTED to SCHEDULED 21:19:34,323 FATAL ResourceManager:600 - Error in handling event type APP_ATTEMPT_REMOVED to the scheduler java.lang.IllegalStateException: Given app to remove org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp@429f809d does not exist in queue [root.samza-perf-playground, demand=, running=, share=, w=] at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:93) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:774) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1201) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) at java.lang.Thread.run(Thread.java:744) 21:19:34,330 INFO ResourceManager:604 - Exiting, bbye.. 21:19:34,335 INFO log:67 - Stopped SelectChannelConnector@:8088 21:19:34,437 INFO Server:2398 - Stopping server on 8033 21:19:34,438 INFO Server:694 - Stopping IPC Server listener on 8033 {noformat} Last commit message is (branch-2.4 on github.com/apache/hadoop-common): {noformat} commit 09e24d5519187c0db67aacc1992be5d43829aa1e Author: Arpit Agarwal Date: Tue May 20 20:18:46 2014 + HADOOP-10562. 
Fix CHANGES.txt entry again git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.4@1596389 13f79535-47bb-0310-9956-ffa450edef68 {noformat} was: After upgrading from 2.2.0 to 2.4.1-SNAP, I ran into the following on startup: {noformat} 21:19:34,308 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_09 State change from SUBMITTED to SCHEDULED 21:19:34,309 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_08 State change from SUBMITTED to SCHEDULED 21:19:34,310 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_10 State change from SUBMITTED to SCHEDULED 21:19:34,310 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_11 State change from SUBMITTED to SCHEDULED 21:19:34,317 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_09 to scheduler from user: samza-perf-playground 21:19:34,318 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_10 to scheduler from user: samza-perf-playground 21:19:34,318 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_09 State change from SUBMITTED to SCHEDULED 21:19:34,318 INFO FairScheduler:733 - Application appattempt_1400092144371_0003_05 is done. finalState=FAILED 21:19:34,319 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_10 State change from SUBMITTED to SCHEDULED 21:19:34,319 INFO AppSchedulingInfo:108 - Applica
[jira] [Created] (YARN-2093) Fair Scheduler IllegalStateException after upgrade from 2.2.0 to 2.4.1-SNAP
Jon Bringhurst created YARN-2093: Summary: Fair Scheduler IllegalStateException after upgrade from 2.2.0 to 2.4.1-SNAP Key: YARN-2093 URL: https://issues.apache.org/jira/browse/YARN-2093 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.1 Reporter: Jon Bringhurst After upgrading from 2.2.0 to 2.4.1-SNAP, I ran into the following on startup: {noformat} 21:19:34,308 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_09 State change from SUBMITTED to SCHEDULED 21:19:34,309 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_08 State change from SUBMITTED to SCHEDULED 21:19:34,310 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_10 State change from SUBMITTED to SCHEDULED 21:19:34,310 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0003_11 State change from SUBMITTED to SCHEDULED 21:19:34,317 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_09 to scheduler from user: samza-perf-playground 21:19:34,318 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_10 to scheduler from user: samza-perf-playground 21:19:34,318 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_09 State change from SUBMITTED to SCHEDULED 21:19:34,318 INFO FairScheduler:733 - Application appattempt_1400092144371_0003_05 is done. finalState=FAILED 21:19:34,319 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_10 State change from SUBMITTED to SCHEDULED 21:19:34,319 INFO AppSchedulingInfo:108 - Application application_1400092144371_0003 requests cleared 21:19:34,319 INFO FairScheduler:673 - Added Application Attempt appattempt_1400092144371_0004_11 to scheduler from user: samza-perf-playground 21:19:34,320 INFO FairScheduler:733 - Application appattempt_1400092144371_0003_06 is done. 
finalState=FAILED 21:19:34,320 INFO AppSchedulingInfo:108 - Application application_1400092144371_0003 requests cleared 21:19:34,320 INFO RMAppAttemptImpl:659 - appattempt_1400092144371_0004_11 State change from SUBMITTED to SCHEDULED 21:19:34,323 FATAL ResourceManager:600 - Error in handling event type APP_ATTEMPT_REMOVED to the scheduler java.lang.IllegalStateException: Given app to remove org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp@429f809d does not exist in queue [root.samza-perf-playground, demand=, running=, share=, w=] at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:93) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:774) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1201) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:591) at java.lang.Thread.run(Thread.java:744) 21:19:34,330 INFO ResourceManager:604 - Exiting, bbye.. 21:19:34,335 INFO log:67 - Stopped selectchannelconnec...@eat1-app587.stg.linkedin.com:8088 21:19:34,437 INFO Server:2398 - Stopping server on 8033 21:19:34,438 INFO Server:694 - Stopping IPC Server listener on 8033 {noformat} Last commit message is (branch-2.4 on github.com/apache/hadoop-common): {noformat} commit 09e24d5519187c0db67aacc1992be5d43829aa1e Author: Arpit Agarwal Date: Tue May 20 20:18:46 2014 + HADOOP-10562. Fix CHANGES.txt entry again git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-2.4@1596389 13f79535-47bb-0310-9956-ffa450edef68 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2092) Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/YARN-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005286#comment-14005286 ] Hitesh Shah commented on YARN-2092: --- See https://issues.apache.org/jira/browse/TEZ-1066?focusedCommentId=14002674&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14002674 for the stack trace when trying to run against 2.5.0-SNAPSHOT with jackson 1.8.8 jars. > Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to > 2.5.0-SNAPSHOT > > > Key: YARN-2092 > URL: https://issues.apache.org/jira/browse/YARN-2092 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Hitesh Shah > > Came across this when trying to integrate with the timeline server. Using a > 1.8.8 dependency of jackson works fine against 2.4.0 but fails against > 2.5.0-SNAPSHOT which needs 1.9.13. This is in the scenario where the user > jars are first in the classpath. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2092) Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT
Hitesh Shah created YARN-2092: - Summary: Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT Key: YARN-2092 URL: https://issues.apache.org/jira/browse/YARN-2092 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Came across this when trying to integrate with the timeline server. Using a 1.8.8 dependency of jackson works fine against 2.4.0 but fails against 2.5.0-SNAPSHOT which needs 1.9.13. This is in the scenario where the user jars are first in the classpath. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-2091: - Summary: Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters (was: Add ContainerExitStatus.KILL_EXCEECED_MEMORY and pass it to app masters) > Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters > --- > > Key: YARN-2091 > URL: https://issues.apache.org/jira/browse/YARN-2091 > Project: Hadoop YARN > Issue Type: Task >Reporter: Bikas Saha > > Currently, the AM cannot programmatically determine if the task was killed > due to using excessive memory. The NM kills it without passing this > information in the container status back to the RM. So the AM cannot take any > action here. The jira tracks adding this exit status and passing it from the > NM to the RM and then the AM. In general, there may be other such actions > taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2091) Add ContainerExitStatus.KILL_EXCEECED_MEMORY and pass it to app masters
Bikas Saha created YARN-2091: Summary: Add ContainerExitStatus.KILL_EXCEECED_MEMORY and pass it to app masters Key: YARN-2091 URL: https://issues.apache.org/jira/browse/YARN-2091 Project: Hadoop YARN Issue Type: Task Reporter: Bikas Saha Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
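Once such an exit status exists, the AM could branch on it when a container completes. Below is a minimal consumer-side sketch; the constant name and value are assumptions (this JIRA is what would define the real ones), not the actual YARN API:

```java
public class ExitStatusDemo {
    // Hypothetical constant; the real name/value would be defined by this JIRA
    // in ContainerExitStatus, not here.
    static final int KILL_EXCEEDED_MEMORY = -104;

    // AM-side reaction once the status is propagated NM -> RM -> AM.
    static String classify(int exitStatus) {
        if (exitStatus == KILL_EXCEEDED_MEMORY) {
            return "retry with a larger container";
        }
        return "generic failure handling";
    }

    public static void main(String[] args) {
        System.out.println(classify(KILL_EXCEEDED_MEMORY));
        System.out.println(classify(1));
    }
}
```

Without a distinct status, both cases above fall into the generic branch, which is exactly the opacity the issue describes.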
[jira] [Created] (YARN-2090) If Kerberos Authentication is enabled, MapReduce job is failing on reducer phase
Victor Kim created YARN-2090: Summary: If Kerberos Authentication is enabled, MapReduce job is failing on reducer phase Key: YARN-2090 URL: https://issues.apache.org/jira/browse/YARN-2090 Project: Hadoop YARN Issue Type: Bug Components: applications, nodemanager Affects Versions: 2.4.0 Environment: hadoop: 2.4.0.2.1.2.0 Reporter: Victor Kim Priority: Critical I have a 3-node cluster configuration: 1 ResourceManager and 3 NodeManagers, Kerberos is enabled, have hdfs, yarn, mapred principals\keytabs. ResourceManager and NodeManager are run under the yarn user, using the yarn Kerberos principal. Use case 1: WordCount, submit job using yarn UGI (i.e. superuser, the one having Kerberos principal on all boxes). Result: job successfully completed. Use case 2: WordCount, submit job using LDAP user impersonation via yarn UGI. Result: Map tasks are completed SUCCESSfully; Reduce task fails with ShuffleError Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES (see the stack trace below). The use case with user impersonation used to work on earlier versions, without YARN (with JT&TT). I found a similar issue with Kerberos AUTH involved here: https://groups.google.com/forum/#!topic/nosql-databases/tGDqs75ACqQ And here https://issues.apache.org/jira/browse/MAPREDUCE-4030 it's marked as resolved, which is not the case when Kerberos Authentication is enabled. The exception trace from YarnChild JVM: 2014-05-21 12:49:35,687 FATAL [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Shuffle failed with too many fetch failures and insufficient progress! 
2014-05-21 12:49:35,688 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3 at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:323) at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:245) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:347) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
[ https://issues.apache.org/jira/browse/YARN-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005214#comment-14005214 ] Karthik Kambatla commented on YARN-2089: (actually, let us wait for Jenkins even though the changes are not really code) > FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing > audience annotations > --- > > Key: YARN-2089 > URL: https://issues.apache.org/jira/browse/YARN-2089 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: zhihai xu > Labels: newbie > Attachments: yarn-2089.patch > > > We should mark QueuePlacementPolicy and QueuePlacementRule with audience > annotations @Private @Unstable -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
[ https://issues.apache.org/jira/browse/YARN-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005211#comment-14005211 ] Karthik Kambatla commented on YARN-2089: Looks good to me as well. +1. Checking this in. > FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing > audience annotations > --- > > Key: YARN-2089 > URL: https://issues.apache.org/jira/browse/YARN-2089 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: zhihai xu > Labels: newbie > Attachments: yarn-2089.patch > > > We should mark QueuePlacementPolicy and QueuePlacementRule with audience > annotations @Private @Unstable -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
[ https://issues.apache.org/jira/browse/YARN-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005206#comment-14005206 ] Sandy Ryza commented on YARN-2089: -- +1 > FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing > audience annotations > --- > > Key: YARN-2089 > URL: https://issues.apache.org/jira/browse/YARN-2089 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: zhihai xu > Labels: newbie > Attachments: yarn-2089.patch > > > We should mark QueuePlacementPolicy and QueuePlacementRule with audience > annotations @Private @Unstable -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2070) DistributedShell publishes unfriendly user information to the timeline server
[ https://issues.apache.org/jira/browse/YARN-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2070: Attachment: YARN-2070.patch > DistributedShell publishes unfriendly user information to the timeline server > - > > Key: YARN-2070 > URL: https://issues.apache.org/jira/browse/YARN-2070 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Robert Kanter >Priority: Minor > Labels: newbie > Attachments: YARN-2070.patch > > > Below is the code that uses the string of the current user object as the "user" > value. > {code} > entity.addPrimaryFilter("user", UserGroupInformation.getCurrentUser() > .toString()); > {code} > When we use Kerberos authentication, it's going to output the full name, such > as "zjshen/localhost@LOCALHOST (auth.KERBEROS)". It is not user-friendly for > searching by the primary filters. It's better to use shortUserName instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
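The suggested fix boils down to publishing the short name ("zjshen") instead of the full principal. A self-contained sketch of that derivation follows; `shortUserName` here is a hypothetical stand-in for Hadoop's `UserGroupInformation.getShortUserName()`, which in a real deployment resolves the short name via auth_to_local rules rather than simple string slicing:

```java
public class ShortUserName {
    // Hypothetical stand-in for UGI.getShortUserName(): strip the
    // "/host" and "@REALM" components from a Kerberos principal such as
    // "zjshen/localhost@LOCALHOST", leaving just "zjshen".
    public static String shortUserName(String principal) {
        int end = principal.length();
        int slash = principal.indexOf('/');
        int at = principal.indexOf('@');
        if (slash >= 0) end = Math.min(end, slash);
        if (at >= 0) end = Math.min(end, at);
        return principal.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(shortUserName("zjshen/localhost@LOCALHOST")); // zjshen
        System.out.println(shortUserName("alice@EXAMPLE.COM"));          // alice
        System.out.println(shortUserName("bob"));                        // bob
    }
}
```

The short form makes primary-filter lookups stable regardless of whether Kerberos or simple auth is in use.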
[jira] [Updated] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
[ https://issues.apache.org/jira/browse/YARN-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2089: Attachment: yarn-2089.patch > FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing > audience annotations > --- > > Key: YARN-2089 > URL: https://issues.apache.org/jira/browse/YARN-2089 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: zhihai xu > Labels: newbie > Attachments: yarn-2089.patch > > > We should mark QueuePlacementPolicy and QueuePlacementRule with audience > annotations @Private @Unstable -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
[ https://issues.apache.org/jira/browse/YARN-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu reassigned YARN-2089: --- Assignee: zhihai xu > FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing > audience annotations > --- > > Key: YARN-2089 > URL: https://issues.apache.org/jira/browse/YARN-2089 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: zhihai xu > Labels: newbie > > We should mark QueuePlacementPolicy and QueuePlacementRule with audience > annotations @Private @Unstable -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute
[ https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004968#comment-14004968 ] Ashwin Shankar commented on YARN-2012: -- [~sandyr] Thanks for looking into this. bq. I think it's a little confusing for the rule to fall back to "default". Can we let this part be handled by the "create" logic in assignAppToQueue? Sure, but just to clarify: are you saying that in QueuePlacementRule.assignAppToQueue we should return "root.default" if the queue returned by the Default rule is not configured and create is false? i.e., basically, in the "create" logic, we won't cause a skip to occur if the rule is a Default rule; instead we return root.default. > Fair Scheduler : Default rule in queue placement policy can take a queue as > an optional attribute > - > > Key: YARN-2012 > URL: https://issues.apache.org/jira/browse/YARN-2012 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt > > > Currently the 'default' rule in the queue placement policy, if applied, puts the app in > the root.default queue. It would be great if we could make the 'default' rule > optionally point to a different queue as the default queue. This queue should be > an existing queue; if not, we fall back to the root.default queue, hence keeping > this rule terminal. > This default queue can be a leaf queue or it can also be a parent queue if > the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864). -- This message was sent by Atlassian JIRA (v6.2#6252)
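The fallback order being discussed can be sketched as a small decision function. Names and signature below are illustrative, not the actual FairScheduler API; the point is only the order: the configured default queue if it exists (or create is allowed), else root.default, keeping the rule terminal:

```java
import java.util.HashSet;
import java.util.Set;

public class DefaultRuleDemo {
    // Illustrative version of the "create" logic under discussion:
    // return the configured default queue when it exists or may be
    // created; otherwise fall back to root.default instead of skipping.
    static String assignAppToQueue(String configuredDefault,
                                   Set<String> existingQueues,
                                   boolean create) {
        if (existingQueues.contains(configuredDefault) || create) {
            return configuredDefault;
        }
        return "root.default"; // terminal fallback keeps the rule terminal
    }

    public static void main(String[] args) {
        Set<String> queues = new HashSet<>();
        queues.add("root.adhoc");
        System.out.println(assignAppToQueue("root.adhoc", queues, false));   // root.adhoc
        System.out.println(assignAppToQueue("root.missing", queues, false)); // root.default
    }
}
```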
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004927#comment-14004927 ] Bikas Saha commented on YARN-1366: -- I mean, what will go wrong if we allow unregister without register? Is it fundamentally wrong? > ApplicationMasterService should Resync with the AM upon allocate call after > restart > --- > > Key: YARN-1366 > URL: https://issues.apache.org/jira/browse/YARN-1366 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Rohith > Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, > YARN-1366.prototype.patch, YARN-1366.prototype.patch > > > The ApplicationMasterService currently sends a resync response to which the > AM responds by shutting down. The AM behavior is expected to change to > resyncing with the RM. Resync means resetting the allocate RPC > sequence number to 0 and the AM should send its entire outstanding request to > the RM. Note that if the AM is making its first allocate call to the RM then > things should proceed like normal without needing a resync. The RM will > return all containers that have completed since the RM last synced with the > AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2070) DistributedShell publishes unfriendly user information to the timeline server
[ https://issues.apache.org/jira/browse/YARN-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter reassigned YARN-2070: --- Assignee: Robert Kanter > DistributedShell publishes unfriendly user information to the timeline server > - > > Key: YARN-2070 > URL: https://issues.apache.org/jira/browse/YARN-2070 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Robert Kanter >Priority: Minor > Labels: newbie > > Below is the code that uses the string of the current user object as the "user" > value. > {code} > entity.addPrimaryFilter("user", UserGroupInformation.getCurrentUser() > .toString()); > {code} > When we use Kerberos authentication, it's going to output the full name, such > as "zjshen/localhost@LOCALHOST (auth.KERBEROS)". It is not user-friendly for > searching by the primary filters. It's better to use shortUserName instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004900#comment-14004900 ] Anubhav Dhoot commented on YARN-1365: - The failed test has race conditions that i am fixing. > ApplicationMasterService to allow Register and Unregister of an app that was > running before restart > --- > > Key: YARN-1365 > URL: https://issues.apache.org/jira/browse/YARN-1365 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1365.001.patch, YARN-1365.002.patch, > YARN-1365.initial.patch > > > For an application that was running before restart, the > ApplicationMasterService currently throws an exception when the app tries to > make the initial register or final unregister call. These should succeed and > the RMApp state machine should transition to completed like normal. > Unregistration should succeed for an app that the RM considers complete since > the RM may have died after saving completion in the store but before > notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2089) FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations
Anubhav Dhoot created YARN-2089: --- Summary: FairScheduler: QueuePlacementPolicy and QueuePlacementRule are missing audience annotations Key: YARN-2089 URL: https://issues.apache.org/jira/browse/YARN-2089 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.4.0 Reporter: Anubhav Dhoot We should mark QueuePlacementPolicy and QueuePlacementRule with audience annotations @Private @Unstable -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004803#comment-14004803 ] Binglin Chang commented on YARN-2088: - Based on recent bugs related to API records/PBImpl, I have some doubts about the general patterns used in PBImpls (Java fields mixed with proto objects, cached state), which cause lots of redundant code and confusion; changes to that code are a mental challenge and can easily introduce new bugs... > Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder > > > Key: YARN-2088 > URL: https://issues.apache.org/jira/browse/YARN-2088 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Binglin Chang >Assignee: Binglin Chang > Attachments: YARN-2088.v1.patch > > > Some fields (set, list) are added to proto builders many times; we need to > clear those fields before adding, otherwise the resulting proto contains extra > contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binglin Chang updated YARN-2088: Attachment: YARN-2088.v1.patch Bugs and fixes: 1. clear the builder before adding Sets/Lists 2. remove unnecessary maybeInitBuilder in mergeLocalToBuilder 3. we don't need to construct Iterable manually, just use the guava library 4. the property limit is not set properly in mergeLocalToBuilder; this may cause the limit property to be reset to Long.MAX... 5. add a test assertion in TestGetApplicationsRequest to verify the bug. Ran the test on my local laptop; the test failed before the patch and succeeded after the patch. > Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder > > > Key: YARN-2088 > URL: https://issues.apache.org/jira/browse/YARN-2088 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Binglin Chang >Assignee: Binglin Chang > Attachments: YARN-2088.v1.patch > > > Some fields (set, list) are added to proto builders many times; we need to > clear those fields before adding, otherwise the resulting proto contains extra > contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
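Fix #1 (clear before add) guards against the append-only semantics of repeated fields in protobuf builders. A toy illustration follows, with a hand-written builder standing in for the generated GetApplicationsRequestProto.Builder (which behaves the same way for repeated fields: addAll* appends, it does not replace):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for a generated protobuf builder: addAll* appends to the
// repeated field rather than replacing it, which is the root of the bug.
class ToyBuilder {
    final List<String> users = new ArrayList<>();
    void addAllUsers(List<String> u) { users.addAll(u); }
    void clearUsers() { users.clear(); }
}

public class MergeDemo {
    static final List<String> local = Arrays.asList("alice", "bob");

    // Buggy pattern: merging twice duplicates the repeated field contents.
    static void mergeBuggy(ToyBuilder b) { b.addAllUsers(local); }

    // Fixed pattern: clear the repeated field before re-adding local state.
    static void mergeFixed(ToyBuilder b) { b.clearUsers(); b.addAllUsers(local); }

    public static void main(String[] args) {
        ToyBuilder buggy = new ToyBuilder();
        mergeBuggy(buggy); mergeBuggy(buggy);
        System.out.println(buggy.users.size()); // 4 -- duplicated entries

        ToyBuilder fixed = new ToyBuilder();
        mergeFixed(fixed); mergeFixed(fixed);
        System.out.println(fixed.users.size()); // 2 -- merge is idempotent
    }
}
```

Calling merge more than once is exactly what happens when a PBImpl rebuilds its proto after local mutations, so the clear-before-add makes the merge idempotent.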
[jira] [Created] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
Binglin Chang created YARN-2088: --- Summary: Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder Key: YARN-2088 URL: https://issues.apache.org/jira/browse/YARN-2088 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Some fields (set, list) are added to proto builders many times; we need to clear those fields before adding, otherwise the resulting proto contains extra contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2084) YARN to support REST APIs in AMs
[ https://issues.apache.org/jira/browse/YARN-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004609#comment-14004609 ] Steve Loughran commented on YARN-2084: -- Note that for Slider we worked around the proxy problems with our own AM-side filter; we don't want to encourage that. > YARN to support REST APIs in AMs > > > Key: YARN-2084 > URL: https://issues.apache.org/jira/browse/YARN-2084 > Project: Hadoop YARN > Issue Type: New Feature > Components: webapp >Affects Versions: 2.4.0 >Reporter: Steve Loughran > > Having built a REST API in a YARN app, we've had to work around a few > quirks/issues that could be addressed centrally > # proxy & filter not allowing PUT/POST/DELETE operations > # NotFound exceptions incompatible with text/plain responses > This JIRA exists to cover them and any other issues. It'll probably need some > tests too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2087) YARN proxy doesn't relay verbs other than GET
Steve Loughran created YARN-2087: Summary: YARN proxy doesn't relay verbs other than GET Key: YARN-2087 URL: https://issues.apache.org/jira/browse/YARN-2087 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Steve Loughran the {{WebAppProxy}} class only proxies GET requests, REST verbs PUT, DELETE and POST aren't handled. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2086) AmIpFilter to support REST APIs
Steve Loughran created YARN-2086: Summary: AmIpFilter to support REST APIs Key: YARN-2086 URL: https://issues.apache.org/jira/browse/YARN-2086 Project: Hadoop YARN Issue Type: Sub-task Reporter: Steve Loughran The {{AmIpFilter}} doesn't like REST APIs, as all operations are redirected to the proxy as a 302. Even if the proxy did relay all verbs, the filter would need to return a 307 and hope the client was able to re-issue the verb. The alternative is to have a dedicated part of the webapp be unproxied, which we did with a custom filter to not relay "/ws/*", or even allow apps to register a REST endpoint directly, either in the AppReport data, or via the YARN-913 registry. -- This message was sent by Atlassian JIRA (v6.2#6252)
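The 302-vs-307 point follows from HTTP semantics: on a 302, many clients rewrite non-GET methods to GET, while a 307 obliges the client to re-issue the same verb to the new location. A minimal sketch of a verb-aware status choice (illustrative only, not the actual AmIpFilter code):

```java
public class RedirectStatus {
    // A 302 is safe for GET/HEAD, where the replayed method is the same
    // either way. For PUT/POST/DELETE, a 307 (Temporary Redirect) is
    // required so the client re-issues the original verb and body.
    static int redirectStatusFor(String method) {
        return (method.equals("GET") || method.equals("HEAD")) ? 302 : 307;
    }

    public static void main(String[] args) {
        System.out.println(redirectStatusFor("GET"));    // 302
        System.out.println(redirectStatusFor("PUT"));    // 307
        System.out.println(redirectStatusFor("DELETE")); // 307
    }
}
```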
[jira] [Created] (YARN-2085) GenericExceptionHandler can't report into TEXT/PLAIN responses
Steve Loughran created YARN-2085: Summary: GenericExceptionHandler can't report into TEXT/PLAIN responses Key: YARN-2085 URL: https://issues.apache.org/jira/browse/YARN-2085 Project: Hadoop YARN Issue Type: Sub-task Components: webapp Affects Versions: 2.4.0 Reporter: Steve Loughran As seen in SLIDER-51, exceptions (like NotFound) can't be mapped into text/plain responses. It may be that the {{Response.status(s).entity(exception).build()}} logic just doesn't work for plaintext, in which case the handler should detect an unsupported mime type and just return the error code with an empty body. That might be the best approach for other binaries too. Or: simply catch the marshalling exception and downgrade to an empty-body status code. This would be a more graceful fallback, as it would catch all marshalling issues and return the original error code to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
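The proposed fallback (catch the marshalling failure, downgrade to an empty body, preserve the status code) can be sketched in plain Java; `render()` below is a hypothetical stand-in for the JAX-RS entity writer, not the real GenericExceptionHandler:

```java
public class ErrorFallback {
    // Hypothetical stand-in for the JAX-RS entity marshaller: it can
    // serialize the exception as JSON but has no writer for other types.
    static String render(String mimeType, Exception e) {
        if (mimeType.equals("application/json")) {
            return "{\"exception\":\"" + e.getClass().getSimpleName() + "\"}";
        }
        // Simulate the marshalling failure seen for text/plain.
        throw new IllegalStateException("no writer for " + mimeType);
    }

    // Returns the body to send; the original status code is kept either way.
    static String respond(String mimeType, Exception e) {
        try {
            return render(mimeType, e);
        } catch (IllegalStateException marshallingFailure) {
            return ""; // graceful fallback: empty body, original status code
        }
    }

    public static void main(String[] args) {
        Exception notFound = new RuntimeException("not found");
        System.out.println(respond("application/json", notFound));
        System.out.println("[" + respond("text/plain", notFound) + "]");
    }
}
```

The client still gets the correct status code (404, 500, ...) for any Accept type; only the descriptive body is lost when no writer exists.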
[jira] [Created] (YARN-2084) YARN to support REST APIs in AMs
Steve Loughran created YARN-2084: Summary: YARN to support REST APIs in AMs Key: YARN-2084 URL: https://issues.apache.org/jira/browse/YARN-2084 Project: Hadoop YARN Issue Type: New Feature Components: webapp Affects Versions: 2.4.0 Reporter: Steve Loughran Having built a REST API in a YARN app, we've had to work around a few quirks/issues that could be addressed centrally # proxy & filter not allowing PUT/POST/DELETE operations # NotFound exceptions incompatible with text/plain responses This JIRA exists to cover them and any other issues. It'll probably need some tests too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2082) Support for alternative log aggregation mechanism
[ https://issues.apache.org/jira/browse/YARN-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004478#comment-14004478 ] Steve Loughran commented on YARN-2082: -- We do have log issues in long-lived apps, but different ones: YARN-1104 and YARN- ; look at those. For those services we don't want the logs aggregated at the end of the run, but rather streamed off while the app runs. I don't know if this plugin mechanism would help at that phase, unless the logs were being snapshotted and rolled out. > Support for alternative log aggregation mechanism > - > > Key: YARN-2082 > URL: https://issues.apache.org/jira/browse/YARN-2082 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Ming Ma > > I will post a more detailed design later. Here is the brief summary; I would > like to get early feedback. > Problem Statement: > The current implementation of log aggregation creates one HDFS file for each > {application, nodemanager} pair. These files are relatively small, in the range of > 1-2 MB. In a large cluster with lots of applications and many nodemanagers, it > ends up creating lots of small files in HDFS. This puts pressure on the HDFS > NN in the following ways. > 1. It increases NN memory usage. This is mitigated by having the history server > delete old log files in HDFS. > 2. Runtime RPC load on HDFS. Each log aggregation file introduces several NN > RPCs such as create, getAdditionalBlock, complete, rename. When the cluster > is busy, this RPC load has an impact on NN performance. > In addition, to support non-MR applications on YARN, we might need to support > aggregation for long running applications. > Design choices: > 1. Don't aggregate all the logs, as in YARN-221. > 2. Create a dedicated HDFS namespace used only for log aggregation. > 3. Write logs to some key-value store like HBase. HBase's RPC load on the NN will > be much less. > 4. Decentralize the application-level log aggregation to NMs. 
All logs for a > given application are aggregated first by a dedicated NM before they are pushed > to HDFS. > 5. Have NMs aggregate logs on a regular basis; each of these log files will > have data from different applications, and there needs to be some index for > quick lookup. > Proposal: > 1. Make YARN log aggregation pluggable for both the read and write path. Note > that Hadoop FileSystem provides an abstraction and we could ask an alternative > log aggregator to implement a compatible FileSystem, but that seems to be overkill. > 2. Provide a log aggregation plugin that writes to HBase. The schema design > needs to support efficient reads on a per-application as well as a per > application+container basis; in addition, it shouldn't create hotspots in a > cluster where certain users might create more jobs than others. For example, > we can use hash($user + $applicationId) + containerId as the row key. -- This message was sent by Atlassian JIRA (v6.2#6252)
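The row-key scheme proposed above — a hash of user plus applicationId, followed by the container id — can be sketched as follows. The hash function, prefix length, and separator are illustrative assumptions, not the actual design:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch of the proposed HBase row key: a hash prefix of
// (user + applicationId) spreads a heavy user's rows across regions to
// avoid hotspots, while the containerId suffix keeps rows for one
// application's containers adjacent for cheap per-container reads.
public class LogRowKey {
    public static String rowKey(String user, String applicationId,
                                String containerId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                (user + applicationId).getBytes(StandardCharsets.UTF_8));
            // Hex-encode the first 4 bytes of the digest as the salt prefix.
            StringBuilder prefix = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                prefix.append(String.format("%02x", digest[i]));
            }
            return prefix + "!" + containerId;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // MD5 is always available
        }
    }
}
```

Because the prefix is derived from the full (user, applicationId) pair, all rows for one application share a prefix and scan together, while different applications from the same busy user land on different regions.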
[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not be assigned more containers when its usedResource has reached the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated YARN-2083: -- Attachment: YARN-2083.patch > In fair scheduler, Queue should not be assigned more containers when its > usedResource has reached the maxResource limit > --- > > Key: YARN-2083 > URL: https://issues.apache.org/jira/browse/YARN-2083 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.3.0 >Reporter: Yi Tian > Labels: assignContainer, fair, scheduler > Fix For: 2.3.0 > > Attachments: YARN-2083.patch > > > In fair scheduler, FSParentQueue and FSLeafQueue do an > assignContainerPreCheck to guarantee that the queue is not over its limit. > But the fitsIn function in Resource.java does not return false when > usedResource equals maxResource. > I think we should create a new function "fitsInWithoutEqual" instead of > "fitsIn" in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2083) In fair scheduler, Queue should not be assigned more containers when its usedResource has reached the maxResource limit
Yi Tian created YARN-2083: - Summary: In fair scheduler, Queue should not be assigned more containers when its usedResource has reached the maxResource limit Key: YARN-2083 URL: https://issues.apache.org/jira/browse/YARN-2083 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Yi Tian In fair scheduler, FSParentQueue and FSLeafQueue do an assignContainerPreCheck to guarantee that the queue is not over its limit. But the fitsIn function in Resource.java does not return false when usedResource equals maxResource. I think we should create a new function "fitsInWithoutEqual" instead of "fitsIn" in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
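The distinction the reporter draws — fitsIn still returning true when usedResource exactly equals maxResource, so one more container can be assigned at the limit — can be sketched with a strict variant. The fields are simplified to memory and vcores; this is not the real Resource API:

```java
// Simplified sketch of the check discussed in YARN-2083. A fitsIn using
// <= treats a queue whose usage exactly equals its max as still fitting,
// so the scheduler may assign it another container; the proposed strict
// variant using < stops assignment once the limit is reached.
public class SchedulerCheck {
    // Mirrors the existing <= semantics of Resource.fitsIn.
    public static boolean fitsIn(int usedMem, int usedCores,
                                 int maxMem, int maxCores) {
        return usedMem <= maxMem && usedCores <= maxCores;
    }

    // Proposed "fitsInWithoutEqual": false once usage reaches the limit.
    public static boolean fitsInWithoutEqual(int usedMem, int usedCores,
                                             int maxMem, int maxCores) {
        return usedMem < maxMem && usedCores < maxCores;
    }
}
```

At exactly the limit the two checks diverge, which is precisely the assignContainerPreCheck case the issue describes.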