[jira] [Commented] (YARN-4673) race condition in ResourceTrackerService#nodeHeartBeat while processing deduplicated msg
[ https://issues.apache.org/jira/browse/YARN-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166867#comment-15166867 ] Tsuyoshi Ozawa commented on YARN-4673: -- Hi [~sandflee], thank you for the contribution. Could you explain the cause of the deadlock? That would help us review your patch faster and more accurately. > race condition in ResourceTrackerService#nodeHeartBeat while processing > deduplicated msg > > > Key: YARN-4673 > URL: https://issues.apache.org/jira/browse/YARN-4673 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-4673.01.patch > > > we could add a lock like ApplicationMasterService#allocate -- This message was sent by Atlassian JIRA (v6.3.4#6332)
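For readers following along: the patch itself is not quoted here, but the general shape of the fix suggested in the issue (a per-node lock, analogous to ApplicationMasterService#allocate's per-attempt lock) can be sketched as follows. All class and field names below are illustrative, not the actual ResourceTrackerService code.

```java
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatTracker {
    // Last response id seen per node, also used as the per-node lock object.
    // Names are illustrative, not Hadoop's.
    private final ConcurrentHashMap<String, int[]> lastResponse = new ConcurrentHashMap<>();

    /** Returns true if this heartbeat is new, false if it is a duplicate. */
    public boolean process(String nodeId, int responseId) {
        int[] lock = lastResponse.computeIfAbsent(nodeId, k -> new int[]{-1});
        // All checks and updates for one node happen under the same lock, so a
        // retransmitted (deduplicated) heartbeat cannot interleave with the
        // original heartbeat's processing and corrupt the per-node state.
        synchronized (lock) {
            if (responseId == lock[0]) {
                return false; // duplicate: resend last response, do not reprocess
            }
            lock[0] = responseId;
            return true;
        }
    }
}
```

Without the synchronized block, two threads handling the original and the retransmitted heartbeat could both pass the duplicate check, which is the race the jira describes.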
[jira] [Updated] (YARN-4731) Linux container executor fails to delete nmlocal folders
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4731: --- Description: Enable LCE and CGroups, then submit a MapReduce job.
{noformat}
2016-02-24 18:56:46,889 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01
2016-02-24 18:56:46,894 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Output:
main : command provided 3
main : run as user is dsperf
main : requested yarn user is dsperf
failed to rmdir job.jar: Not a directory
Error while deleting /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: 20 (Not a directory)
Full command array for failed execution: [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, dsperf, dsperf, 3, /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01]
2016-02-24 18:56:46,894 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: DeleteAsUser for /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 returned with exit code: 255
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=255:
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569)
    at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: ExitCodeException exitCode=255:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
    at org.apache.hadoop.util.Shell.run(Shell.java:838)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
    ... 10 more
{noformat}
As a result, the NodeManager local directories are not deleted for each application:
{noformat}
total 36
drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./
drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../
-rw--- 1 hdfs hadoop 340 Feb 25 08:25 container_tokens
lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.jar -> /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/11/job.jar/
lrwxrwxrwx 1 hdfs hadoop 111 Feb 25 08:25 job.xml -> /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/13/job.xml*
drwxr-s--- 2 hdfs hadoop 4096 Feb 25 08:25 jobSubmitDir/
-rwx-- 1 hdfs hadoop 5348 Feb 25 08:25 launch_container.sh*
drwxr-s--- 2 hdfs hadoop 4096 Feb 25 08:25 tmp/
{noformat}
[jira] [Commented] (YARN-4673) race condition in ResourceTrackerService#nodeHeartBeat while processing deduplicated msg
[ https://issues.apache.org/jira/browse/YARN-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166842#comment-15166842 ] Hadoop QA commented on YARN-4673: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 12s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 10s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 1 new + 16 unchanged - 3 fixed = 17 total (was 19) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 25s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 69m 29s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 34s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 155m 36s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | | Unread field:ResourceTrackerService.java:[line 623] | | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourc
[jira] [Resolved] (YARN-4730) YARN preemption based on instantaneous fair share
[ https://issues.apache.org/jira/browse/YARN-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph resolved YARN-4730. - Resolution: Duplicate of YARN-2026 > YARN preemption based on instantaneous fair share > - > > Key: YARN-4730 > URL: https://issues.apache.org/jira/browse/YARN-4730 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Prabhu Joseph > > On a big cluster with a total resource of 10 TB and 3000 cores, and the Fair > Scheduler configured with 230 queues running about 6 jobs a day [all 230 > queues are critical, so minResource is the same for all]: when a Spark job > runs on queue A, occupies the entire cluster resource, and does not release > any of it, a job submitted to queue B gets via preemption only the fair > share, which is <10TB, 3000> / 230 = <45 GB, 13 cores>. That is a very small > fair share for a queue shared by many applications. > Preemption should instead reclaim the instantaneous fair share, that is > <10TB, 3000> / 2 (active queues) = <5TB, 1500 cores>, so that the first job > cannot hog the entire cluster resource and subsequent jobs run fine. > This issue only appears when the number of queues is very high. With few > queues, preempting the (static) fair share would suffice, since that share is > already high. But with very many queues, preemption should try to reclaim the > instantaneous fair share. > Note: Configuring optimal maxResources for 230 queues is difficult, and > constraining queues with maxResource would leave cluster resources idle most > of the time. > There are thousands of Spark jobs, so asking each user to restrict the > number of executors is also difficult. > Preempting the instantaneous fair share overcomes the above issues.
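The arithmetic in the report above can be checked with a small sketch. The method names are hypothetical (not FairScheduler API), and 10 TB is taken as 10485760 MB:

```java
public class FairShare {
    /** Static fair share: capacity divided across all configured queues.
     *  With 230 queues this is the small share the reporter complains about. */
    public static long staticShareMb(long clusterMb, int totalQueues) {
        return clusterMb / totalQueues;
    }

    /** Instantaneous fair share: capacity divided across active queues only.
     *  With 2 active queues each gets half the cluster. */
    public static long instantaneousShareMb(long clusterMb, int activeQueues) {
        return clusterMb / activeQueues;
    }
}
```

So 10485760 MB / 230 queues is roughly 45 GB per queue, while dividing by the 2 active queues yields 5 TB each, which is the behavior the reporter asks preemption to target.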
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166819#comment-15166819 ] Hadoop QA commented on YARN-4720: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 40s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 27s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 48s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | 
{color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 13s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: patch generated 1 new + 18 unchanged - 1 fixed = 19 total (was 19) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 58s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 47s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 19s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 32m 46s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12789865/YARN-4720.04.patch | | JIRA Issue | YARN-4720 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux c3b3cba6bf60 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patch
[jira] [Commented] (YARN-4735) Remove stale LogAggregationReport from NM's context
[ https://issues.apache.org/jira/browse/YARN-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166778#comment-15166778 ] Karthik Kambatla commented on YARN-4735: We have seen this issue as well. > Remove stale LogAggregationReport from NM's context > --- > > Key: YARN-4735 > URL: https://issues.apache.org/jira/browse/YARN-4735 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jun Gong >Assignee: Jun Gong > > {quote} > All LogAggregationReport(current and previous) are only added to > *context.getLogAggregationStatusForApps*, and never removed. > So for long running service, the LogAggregationReport list NM sends to RM > will grow over time. > {quote} > Per discussion in YARN-4720, we need to remove stale LogAggregationReports from > the NM's context.
[jira] [Updated] (YARN-4731) Linux container executor fails on DeleteAsUser
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4731: --- Priority: Critical (was: Major) > Linux container executor fails on DeleteAsUser > -- > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Priority: Critical > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager local directory are not getting deleted for each > application -- This message was sent by Atlassian JIRA (v6.3.4#6332)
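The "failed to rmdir job.jar: Not a directory" line above is the key symptom: job.jar is a symlink to a directory, and rmdir(2) on a symlink fails with ENOTDIR. A deletion routine therefore has to unlink symlinks rather than follow or rmdir them. The Java sketch below illustrates the idea only; the real container-executor is native C code, and all names here are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

public class DeleteTree {
    /** Recursively delete a path; symlinks are unlinked, never followed.
     *  Sketch of the fix idea only, not the container-executor's logic. */
    public static void delete(Path p) throws IOException {
        // Classify without following links: a symlink to a directory must be
        // unlinked (like unlink(2)), not removed with rmdir(2).
        if (Files.isSymbolicLink(p) || !Files.isDirectory(p, LinkOption.NOFOLLOW_LINKS)) {
            Files.deleteIfExists(p); // removes the link itself, target untouched
            return;
        }
        try (DirectoryStream<Path> children = Files.newDirectoryStream(p)) {
            for (Path c : children) {
                delete(c);
            }
        }
        Files.delete(p); // rmdir on a real, now-empty directory
    }
}
```

Applied to the listing above, this would unlink the job.jar and job.xml symlinks and leave the filecache targets in place, instead of failing the whole DeleteAsUser task.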
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166741#comment-15166741 ] Jun Gong commented on YARN-4720: Thanks for the suggestion. Attached a new patch to address it. {quote} ah, that is a good point. So for long running service, the LogAggregationReport list NM sends to RM will grow over time. Sounds like a bug; but not something related to this jira. Jun Gong, you want to open a separate jira for that? {quote} Thanks for the confirmation. Just created YARN-4735 to address it. > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch, > YARN-4720.03.patch, YARN-4720.04.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. > if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat}
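The saving this jira is after (skipping the create/getfileinfo/delete round trip to the NameNode when {{pendingContainerInThisCycle}} is empty) can be modeled with a toy operation counter. The class and the op counts below are purely illustrative, not AppLogAggregatorImpl's actual accounting:

```java
import java.util.List;

public class AggregationCycle {
    /** Before the fix: the writer is always created, the empty tmp file is
     *  probed with exists(), and then renamed or deleted -- three NN round
     *  trips per cycle regardless of whether anything was uploaded.
     *  (Counts are simplified for illustration.) */
    public static int opsBeforeFix(List<String> pendingContainersInThisCycle) {
        int ops = 1; // create tmp log file
        ops += 1;    // exists() on tmp file
        ops += 1;    // rename on success, delete otherwise
        return ops;
    }

    /** After the fix: return early when there is nothing to upload,
     *  issuing zero NameNode operations for an idle cycle. */
    public static int opsAfterFix(List<String> pendingContainersInThisCycle) {
        if (pendingContainersInThisCycle.isEmpty()) {
            return 0; // no writer created, no tmp file to probe or remove
        }
        return opsBeforeFix(pendingContainersInThisCycle);
    }
}
```

For a long-running service whose containers rarely produce new logs, the per-cycle difference (3 vs 0 NN calls in this toy model) is what accumulates into the unnecessary NameNode load described above.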
[jira] [Created] (YARN-4735) Remove stale LogAggregationReport from NM's context
Jun Gong created YARN-4735: -- Summary: Remove stale LogAggregationReport from NM's context Key: YARN-4735 URL: https://issues.apache.org/jira/browse/YARN-4735 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Assignee: Jun Gong {quote} All LogAggregationReport(current and previous) are only added to *context.getLogAggregationStatusForApps*, and never removed. So for long running service, the LogAggregationReport list NM sends to RM will grow over time. {quote} Per discussion in YARN-4720, we need to remove stale LogAggregationReports from the NM's context.
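The fix direction described above (dropping reports once they are no longer needed, so the heartbeat payload stops growing) can be sketched as follows. The class is a hypothetical stand-in for the NM context, not the actual Context API:

```java
import java.util.HashMap;
import java.util.Map;

public class NMContextSketch {
    // Hypothetical stand-in for context.getLogAggregationStatusForApps():
    // appId -> latest log aggregation status for that app.
    private final Map<String, String> logAggregationStatusForApps = new HashMap<>();

    /** Record a report to be sent to the RM on the next heartbeat. */
    public void report(String appId, String status) {
        logAggregationStatusForApps.put(appId, status);
    }

    /** Once the RM has received the reports for a finished app, drop them
     *  so the list the NM sends does not grow without bound. */
    public void onRmReceived(String appId) {
        logAggregationStatusForApps.remove(appId);
    }

    public int pendingReports() {
        return logAggregationStatusForApps.size();
    }
}
```

Without the removal step, a long-running NM accumulates one entry per application forever, which is exactly the unbounded growth the jira describes.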
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166740#comment-15166740 ] Naganarasimha G R commented on YARN-4712: - One way to avoid this (in particular for CPU usage) is to multiply by 100, floor it, and then cast it to int. But we need to think further about whether YARN-4053's approach is a limitation for others loading the metrics, as it does not support decimals! cc [~sjlee0] & [~varun_saxena] > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > There are 2 issues with CPU usage collection: > * I observed that many times the CPU usage obtained from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor still does > the calculation {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > / resourceCalculatorPlugin.getNumProcessors()}}, because of which the UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is never > triggered. Proper checks need to be added. > * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but > ContainerMonitor publishes decimal values for the CPU usage.
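The multiply-by-100-and-floor conversion proposed in the comment above, with the UNAVAILABLE (-1) sentinel propagated instead of being scaled, might look like the following. The class and method names are illustrative, not YARN code:

```java
public class CpuMetric {
    // Mirrors ResourceCalculatorProcessTree.UNAVAILABLE.
    public static final int UNAVAILABLE = -1;

    /** Scale a fractional per-core CPU percentage to an integer metric
     *  (multiply by 100 and floor) so LongConverter can store it, while
     *  propagating the UNAVAILABLE sentinel unchanged. */
    public static long toMetric(float cpuUsagePercentPerCore) {
        if (cpuUsagePercentPerCore == UNAVAILABLE) {
            return UNAVAILABLE; // never publish a scaled bogus negative value
        }
        return (long) Math.floor(cpuUsagePercentPerCore * 100f);
    }
}
```

This addresses both reported problems at once: the sentinel check happens before any arithmetic, and the published value is an integer in hundredths of a percent, so no decimal ever reaches the Long-only metric column.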
[jira] [Updated] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-4720: --- Attachment: YARN-4720.04.patch > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch, > YARN-4720.03.patch, YARN-4720.04.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. > if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166710#comment-15166710 ] Hadoop QA commented on YARN-4720: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 29s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 29s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | 
{color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 14s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: patch generated 1 new + 17 unchanged - 1 fixed = 18 total (was 18) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 8s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 31s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 34m 51s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12789856/YARN-4720.03.patch | | JIRA Issue | YARN-4720 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 1e6841db56b4 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchpr
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166691#comment-15166691 ] Ming Ma commented on YARN-4720: --- ah, that is a good point. So for long running service, the {{LogAggregationReport}} list NM sends to RM will grow over time. Sounds like a bug; but not something related to this jira. [~hex108], you want to open a separate jira for that? To have it send RUNNING report for all scenarios, how about moving the following block to finally? {noformat} LogAggregationStatus logAggregationStatus = logAggregationSucceedInThisCycle ? LogAggregationStatus.RUNNING : LogAggregationStatus.RUNNING_WITH_FAILURE; sendLogAggregationReport(logAggregationStatus, diagnosticMessage); {noformat} Instead of creating a new {{operateWriterFailed}}, maybe it can reuse {{logAggregationSucceedInThisCycle}} instead. > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch, > YARN-4720.03.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. 
> if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
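Ming Ma's suggestion above (send the RUNNING/RUNNING_WITH_FAILURE report from a finally block so it goes out on every path) can be sketched as follows. Only {{sendLogAggregationReport}}, {{logAggregationSucceedInThisCycle}} and the two status values come from the comment; the surrounding stand-in class is hypothetical, not the real AppLogAggregatorImpl:

```java
// Hedged sketch: the status report moves into finally, so a report is sent
// whether the upload cycle succeeds or fails. Stand-in class, not YARN code.
public class LogAggregationCycleSketch {
    enum LogAggregationStatus { RUNNING, RUNNING_WITH_FAILURE }

    static LogAggregationStatus lastReported;

    static void sendLogAggregationReport(LogAggregationStatus s, String diag) {
        lastReported = s;  // stand-in for the real NM -> RM report
    }

    static void uploadLogsForContainers(boolean failUpload) {
        boolean logAggregationSucceedInThisCycle = true;
        String diagnosticMessage = "";
        try {
            if (failUpload) {
                logAggregationSucceedInThisCycle = false;
                diagnosticMessage = "upload failed";
                throw new RuntimeException("simulated writer failure");
            }
        } catch (RuntimeException e) {
            // swallowed for the sketch; real code would log and continue
        } finally {
            // moved here per the review comment: runs on all code paths
            LogAggregationStatus status = logAggregationSucceedInThisCycle
                ? LogAggregationStatus.RUNNING
                : LogAggregationStatus.RUNNING_WITH_FAILURE;
            sendLogAggregationReport(status, diagnosticMessage);
        }
    }

    public static void main(String[] args) {
        uploadLogsForContainers(false);
        System.out.println(lastReported);
        uploadLogsForContainers(true);
        System.out.println(lastReported);
    }
}
```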
[jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166682#comment-15166682 ] Naganarasimha G R commented on YARN-4712: - Thanks [~sunilg]. Yes, the first scenario is the same as that jira: we should not proceed with the calculation (dividing by the number of processors) if the usage is -1. Hope to see that jira committed. Second, we need to discuss whether long is sufficient or whether we need to support double. > CPU Usage Metric is not captured properly in YARN-2928 > -- > > Key: YARN-4712 > URL: https://issues.apache.org/jira/browse/YARN-4712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > > There are 2 issues with CPU usage collection: > * I was able to observe that many times the CPU usage got from > {{pTree.getCpuUsagePercent()}} is > ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor still does > the calculation, i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore > /resourceCalculatorPlugin.getNumProcessors()}}, because of which the UNAVAILABLE > check in {{NMTimelinePublisher.reportContainerResourceUsage}} is not > encountered. So proper checks need to be added. > * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but > ContainersMonitor is publishing decimal values for the CPU usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
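The guard described above (skip the divide-by-processors step when {{pTree.getCpuUsagePercent()}} returns ResourceCalculatorProcessTree.UNAVAILABLE, i.e. -1) could look like this minimal sketch. The class and method below are illustrative only, not the actual ContainersMonitor code:

```java
// Hedged sketch: never divide a sentinel value by the processor count,
// so UNAVAILABLE (-1) propagates unchanged to the downstream check.
public class CpuUsageGuardSketch {
    static final float UNAVAILABLE = -1.0F;  // mirrors ResourceCalculatorProcessTree.UNAVAILABLE

    // Returns UNAVAILABLE untouched; otherwise normalizes by processor count.
    static float cpuUsageTotalCoresPercentage(float cpuUsagePercentPerCore,
                                              int numProcessors) {
        if (cpuUsagePercentPerCore == UNAVAILABLE || numProcessors <= 0) {
            return UNAVAILABLE;  // do not divide a sentinel value
        }
        return cpuUsagePercentPerCore / numProcessors;
    }

    public static void main(String[] args) {
        System.out.println(cpuUsageTotalCoresPercentage(200.0F, 4));
        System.out.println(cpuUsageTotalCoresPercentage(UNAVAILABLE, 4));
    }
}
```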
[jira] [Updated] (YARN-4673) race condition in ResourceTrackerService#nodeHeartBeat while processing deduplicated msg
[ https://issues.apache.org/jira/browse/YARN-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4673: --- Attachment: YARN-4673.01.patch > race condition in ResourceTrackerService#nodeHeartBeat while processing > deduplicated msg > > > Key: YARN-4673 > URL: https://issues.apache.org/jira/browse/YARN-4673 > Project: Hadoop YARN > Issue Type: Bug >Reporter: sandflee >Assignee: sandflee > Attachments: YARN-4673.01.patch > > > we could add a lock like ApplicationMasterService#allocate -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4731) Linux container executor fails on DeleteAsUser
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1516#comment-1516 ] Bibin A Chundatt commented on YARN-4731: *Command array logs* /opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor dsperf dsperf 3 /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0002/container_e02_1456319010019_0002_01_01 main : command provided 3 main : run as user is dsperf main : requested yarn user is dsperf failed to rmdir job.jar: Not a directory Error while deleting /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0002/container_e02_1456319010019_0002_01_01: 20 (Not a directory) > Linux container executor fails on DeleteAsUser > -- > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. 
Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} > As a result nodemanager local directory are not getting deleted for each > application -- This message was sent by Atlassian JIRA (v6.3.4#6332)
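The "failed to rmdir job.jar: Not a directory" line above is characteristic of calling rmdir(2) on an entry that is a plain file or symlink rather than a directory. A minimal Java analogue of the needed file-type check (illustrative only; the real container-executor is native C code, and the exact cause here is still under investigation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

// Hedged sketch: a tree-deletion step must distinguish directories
// (rmdir-like) from files/symlinks (unlink-like); treating job.jar as a
// directory is the failure mode suggested by the log above.
public class RmdirSketch {
    // Deletes the entry and reports which branch handled it.
    static String deleteEntry(Path p) throws IOException {
        if (Files.isDirectory(p, LinkOption.NOFOLLOW_LINKS)) {
            Files.delete(p);   // empty directory: analogous to rmdir(2)
            return "rmdir";
        }
        Files.delete(p);       // file or symlink: analogous to unlink(2)
        return "unlink";
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmdir-sketch");
        Path file = Files.createFile(dir.resolve("job.jar"));
        System.out.println(deleteEntry(file));
        System.out.println(deleteEntry(dir));
    }
}
```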
[jira] [Commented] (YARN-4729) SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE
[ https://issues.apache.org/jira/browse/YARN-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166648#comment-15166648 ] Hudson commented on YARN-4729: -- FAILURE: Integrated in Hadoop-trunk-Commit #9366 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9366/]) YARN-4729. SchedulerApplicationAttempt#getTotalRequiredResources can (kasha: rev c684f2b007a4808dafbe1c1d3ce01758e281d329) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/CHANGES.txt > SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE > -- > > Key: YARN-4729 > URL: https://issues.apache.org/jira/browse/YARN-4729 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.7.2 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Fix For: 2.9.0 > > Attachments: yarn-4729.patch > > > SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE. We > saw this in a unit test failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166644#comment-15166644 ] Jun Gong commented on YARN-4720: Thanks for explaining. Attach a new patch to fix it. {quote} Yes, NM can send several {{LogAggregationReport}}s in the list which is ordered; that is the API between NM and RM. Then on RM side, it will retrieve all elements from the list. {quote} IIUC all LogAggregationReport(current and previous) are only added to 'context.getLogAggregationStatusForApps', and never removed. > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch, > YARN-4720.03.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. > if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-4720: --- Attachment: YARN-4720.03.patch > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch, > YARN-4720.03.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. > if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166633#comment-15166633 ] Bikas Saha commented on YARN-1040: -- I am sorry if I caused a digression by mentioning Slider etc. I am not sure the upgrade scenario is the only one for this jira since this jira covers a broader set. Even without upgrades apps can change the processes they are running in a container without having to lose the container allocation. Identical calls of primitives could be used without the notion of upgrade. E.g. start a Java process first for a Java task, then launch a python process for a Python task. To the NM this is identical to starting v1 and then starting v2. So while it makes sense for the second one to use an API called upgrade, it may not for the first one. (Unrelated to this jira, IMO, YARN should allow upgrade of app code without losing containers but not necessarily understand it deeply. E.g. YARN need not assume that upgrade will need additional resource or try to acquire them transparently for the application.) For the purpose of this jira here is what my thoughts are when I had opened YARN-1292 to delink process lifecycle from container. 1) new API - acquireContainer - means ask for the allocated resource. The API has a flag to specify whether process exit implies releaseContainer. This is for backwards compatibility with a default of true. Apps that want to continue to use that behavior can explicitly pass true when using the new API and is mainly for reducing number of RPCs for apps like MR/Tez etc. 2) new API - startProcess - means start the remote process 3) new API - stopProcess - means stop the remote process 4) new API - releaseContainer - means release the allocated resource 5) Potentially a new API for localization, though in theory, this could be separate. 
Since this fine-grained control makes the protocol chatty, we can reduce the RPC traffic by having a new NM RPC, say NMCommand, that takes a sequence of API primitives that can be sent in 1 RPC. So the current API of startContainer effectively becomes NMCommand(1, 2) and stopContainer becomes NMCommand(3, 4). This can be leveraged for backwards compatibility and rolling upgrades. The above items would effectively delink process and container lifecycle and close out this jira. This provides the fine-grained control in core YARN that can be used for various scenarios, e.g. upgrades, without YARN understanding the scenarios. If we need to add higher level notions for upgrades etc. then those could be done as separate items. I hope that helps make my thoughts concrete within the scope of this jira. > De-link container life cycle from the process and add ability to execute > multiple processes in the same long-lived container > > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. > We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
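The four primitives and the batched NMCommand proposed in the comment above can be sketched as below. These are proposed APIs, not existing YARN interfaces; the in-memory stand-in only illustrates how the current startContainer/stopContainer calls compose from the primitives in one RPC:

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of the proposed NM primitives (1-4 in the comment) and the
// batched NMCommand RPC. Purely illustrative; no real RPC is made.
public class ContainerPrimitivesSketch {
    enum Primitive { ACQUIRE_CONTAINER, START_PROCESS, STOP_PROCESS, RELEASE_CONTAINER }

    // One "RPC" carrying an ordered sequence of primitives, cutting chatter.
    static List<Primitive> nmCommand(Primitive... ops) {
        return Arrays.asList(ops);  // a real NM would apply these in order
    }

    // Backwards-compatible composites: startContainer == NMCommand(1, 2).
    static List<Primitive> startContainer() {
        return nmCommand(Primitive.ACQUIRE_CONTAINER, Primitive.START_PROCESS);
    }

    // stopContainer == NMCommand(3, 4).
    static List<Primitive> stopContainer() {
        return nmCommand(Primitive.STOP_PROCESS, Primitive.RELEASE_CONTAINER);
    }

    public static void main(String[] args) {
        System.out.println(startContainer());
        System.out.println(stopContainer());
    }
}
```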
[jira] [Commented] (YARN-556) [Umbrella] RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166612#comment-15166612 ] Rohith Sharma K S commented on YARN-556: You are probably hitting one of the following issues: YARN-2340, YARN-2308, or YARN-4000. Did any queue configuration change after restart? > [Umbrella] RM Restart phase 2 - Work preserving restart > --- > > Key: YARN-556 > URL: https://issues.apache.org/jira/browse/YARN-556 > Project: Hadoop YARN > Issue Type: New Feature > Components: graceful, resourcemanager, rolling upgrade >Reporter: Bikas Saha > Attachments: Work Preserving RM Restart.pdf, > WorkPreservingRestartPrototype.001.patch, YARN-1372.prelim.patch > > > YARN-128 covered storing the state needed for the RM to recover critical > information. This umbrella jira will track changes needed to recover the > running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4624) NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI
[ https://issues.apache.org/jira/browse/YARN-4624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166593#comment-15166593 ] Rohith Sharma K S commented on YARN-4624: - thanks [~brahmareddy] for providing patch. one nit: why are we using wrapper Float instead of primitive float? > NPE in PartitionQueueCapacitiesInfo while accessing Schduler UI > --- > > Key: YARN-4624 > URL: https://issues.apache.org/jira/browse/YARN-4624 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: SchedulerUIWithOutLabelMapping.png, YARN-2674-002.patch, > YARN-4624-003.patch, YARN-4624.patch > > > Scenario: > === > Configure nodelables and add to cluster > Start the cluster > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.PartitionQueueCapacitiesInfo.getMaxAMLimitPercentage(PartitionQueueCapacitiesInfo.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:163) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithPartition(CapacitySchedulerPage.java:105) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.render(CapacitySchedulerPage.java:94) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueueBlock.render(CapacitySchedulerPage.java:293) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at > org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$LI._(Hamlet.java:7702) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:447) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
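Rohith's nit above ties directly to the NPE in the stack trace: auto-unboxing a wrapper Float that was never set (i.e. is null) throws NullPointerException, which a primitive float cannot do. A hedged, self-contained illustration; the field and method names below are hypothetical, not the actual PartitionQueueCapacitiesInfo code:

```java
// Hedged sketch: unboxing a null Float throws NPE; a null guard (or a
// primitive float with a default) avoids it.
public class WrapperUnboxSketch {
    static Float maxAMLimitPercentage;  // never set for some partitions -> null

    static float unsafeRead() {
        return maxAMLimitPercentage;    // auto-unboxing null throws NPE
    }

    static float safeRead() {
        // primitive-style default sidesteps the unboxing NPE entirely
        return maxAMLimitPercentage == null ? 0f : maxAMLimitPercentage;
    }

    public static void main(String[] args) {
        System.out.println(safeRead());
        try {
            unsafeRead();
        } catch (NullPointerException expected) {
            System.out.println("NPE on unboxing null Float");
        }
    }
}
```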
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166575#comment-15166575 ] Ming Ma commented on YARN-4720: --- It seems that {{LogAggregationStatus.RUNNING}} implies the log aggregation service is running; it doesn't necessarily mean the NM actually aggregated any logs. So if the long running service is running and hasn't generated any logs since it started, it is better to return {{LogAggregationStatus.RUNNING}}. Yes, NM can send several {{LogAggregationReport}}s in the list, which is ordered; that is the API between NM and RM. Then on the RM side, it will retrieve all elements from the list. > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. 
> {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166567#comment-15166567 ] Jun Gong commented on YARN-3998: Thanks [~vinodkv] for explaining it. {quote} My point was mainly about creating and reusing a common policy-framework even if the actual policies may not be entirely reused. We should seriously consider this instead of creating adhoc APIs for custom hard-coded policies. {quote} Yes, it will be better if we could reuse a common policy-framework; we might need to discuss it more. {quote} I'm okay creating separate JIRAs under YARN-3998 if you both think of doing so, but treat (some of the above) as blockers for releasing this feature. Given that, does it make sense to work on this in a branch? {quote} I could address these blocker problems in this issue if needed. [~vvasudev] Could you share your thoughts please? Thanks. > Add retry-times to let NM re-launch container when it fails to run > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4731) Linux container executor fails on DeleteAsUser
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4731: --- Description: Enable LCE and CGroups Submit a mapreduce job {noformat} 2016-02-24 18:56:46,889 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 2016-02-24 18:56:46,894 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Output: main : command provided 3 main : run as user is dsperf main : requested yarn user is dsperf failed to rmdir job.jar: Not a directory Error while deleting /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: 20 (Not a directory) Full command array for failed execution: [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, dsperf, dsperf, 3, /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] 2016-02-24 18:56:46,894 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: DeleteAsUser for /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 returned with exit code: 255 org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=255: at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: ExitCodeException exitCode=255: at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) at org.apache.hadoop.util.Shell.run(Shell.java:838) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) ... 10 more {noformat} As a result nodemanager local directory are not getting deleted for each applicaton was: Enable LCE and CGroups Submit a mapreduce job {noformat} 2016-02-24 18:56:46,889 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 2016-02-24 18:56:46,894 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. 
Privileged Execution Operation Output: main : command provided 3 main : run as user is dsperf main : requested yarn user is dsperf failed to rmdir job.jar: Not a directory Error while deleting /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: 20 (Not a directory) Full command array for failed execution: [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, dsperf, dsperf, 3, /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] 2016-02-24 18:56:46,894 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: DeleteAsUser for /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 returned with exit code: 255 org.apache.hadoop.yarn.
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166538#comment-15166538 ] Jun Gong commented on YARN-4720: Thanks [~mingma] for review and comments. {quote} When pendingContainerInThisCycle is empty, NM will skip sending the LogAggregationReport with LogAggregationStatus.RUNNING. It means for a long running service, it is possible for a yarn client to get LogAggregationStatus.NOT_START when it calls ApplicationClientProtocol#getApplicationReport if the long running service doesn't generate any log. Without the patch, NM will send LogAggregationStatus.RUNNING regardless. So it might be better to still send LogAggregationStatus.RUNNING regardless. {quote} Yes, it is actually a different behavior. LogAggregationReport is a report of the current status; is it necessary to send a report if NM has not actually done log aggregation? BTW: I noticed that there is no cleanup for previous LogAggregationReport; there is only 'this.context.getLogAggregationStatusForApps().add()' and no 'remove'. Is this deliberate? {quote} When LogWriter creation throws exception and appFinished is true, NM will send a LogAggregationReport with LogAggregationStatus.SUCCEEDED. Without the patch, NM won't send any final LogAggregationReport. Maybe it is better to update the patch to send LogAggregationStatus.FAILED for such scenario. {quote} I will update the patch to address it. > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. 
> * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. > if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
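The fix Ming Ma describes can be sketched with a toy model. This is illustrative only, not the actual AppLogAggregatorImpl code: `UploadCycle` and `namenodeOps` are made-up names, and the counter stands in for the create/getfileinfo/delete NN round trips. The point is simply that bailing out before constructing the writer avoids all NN traffic when there is nothing to upload.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the log-aggregation upload cycle (illustrative names only).
// The guard at the top of uploadLogs() is the proposed change: when no
// containers are pending, no remote writer is created, so no NN calls occur.
public class UploadCycle {
    int namenodeOps = 0; // counts simulated NameNode round trips

    void uploadLogs(List<String> pendingContainers) {
        if (pendingContainers.isEmpty()) {
            return; // proposed guard: nothing to aggregate, skip NN entirely
        }
        namenodeOps++; // create the remote tmp log file (LogWriter constructor)
        for (String container : pendingContainers) {
            // ... append this container's logs to the writer ...
        }
        namenodeOps++; // exists() check plus rename()/delete() of the tmp file
    }

    public static void main(String[] args) {
        UploadCycle cycle = new UploadCycle();
        cycle.uploadLogs(Collections.emptyList());
        System.out.println(cycle.namenodeOps); // 0: guard skipped all NN work
        cycle.uploadLogs(Arrays.asList("container_01"));
        System.out.println(cycle.namenodeOps); // 2: create + rename/delete
    }
}
```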
[jira] [Updated] (YARN-4731) Linux container executor fails on DeleteAsUser
[ https://issues.apache.org/jira/browse/YARN-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-4731: --- Summary: Linux container executor fails on DeleteAsUser (was: Linux container executor exception on DeleteAsUser) > Linux container executor fails on DeleteAsUser > -- > > Key: YARN-4731 > URL: https://issues.apache.org/jira/browse/YARN-4731 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt > > Enable LCE and CGroups > Submit a mapreduce job > {noformat} > 2016-02-24 18:56:46,889 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > 2016-02-24 18:56:46,894 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: > Shell execution returned exit code: 255. Privileged Execution Operation > Output: > main : command provided 3 > main : run as user is dsperf > main : requested yarn user is dsperf > failed to rmdir job.jar: Not a directory > Error while deleting > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: > 20 (Not a directory) > Full command array for failed execution: > [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, > dsperf, dsperf, 3, > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] > 2016-02-24 18:56:46,894 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: > DeleteAsUser for > /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 > returned with exit code: 255 > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=255: > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) > at > org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) > at org.apache.hadoop.util.Shell.run(Shell.java:838) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) > ... 10 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
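For context on the "20 (Not a directory)" in the log above: 20 is errno ENOTDIR on Linux, raised when the native container-executor descends into `job.jar` as if it were a directory when it is in fact a regular file (or a symlink to one). The same failure mode is observable from Java via `NotDirectoryException` (the class name `NotDirDemo` below is just for this sketch):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NotDirectoryException;
import java.nio.file.Path;

// Demonstrates the ENOTDIR failure mode from the log: trying to enumerate
// a regular file as a directory fails, just as the native rmdir walk does.
public class NotDirDemo {
    public static boolean failsAsNotDirectory(Path p) throws IOException {
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(p)) {
            return false; // p really was a directory
        } catch (NotDirectoryException e) {
            return true;  // the NIO equivalent of errno 20 (ENOTDIR)
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("job", ".jar");
        // A regular file cannot be walked as a directory:
        System.out.println(failsAsNotDirectory(file)); // true
        Files.delete(file);
    }
}
```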
[jira] [Commented] (YARN-4729) SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE
[ https://issues.apache.org/jira/browse/YARN-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166526#comment-15166526 ] Brahma Reddy Battula commented on YARN-4729: [~kasha] thanks for reporting and working on this. +1 LGTM (non-binding). > SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE > -- > > Key: YARN-4729 > URL: https://issues.apache.org/jira/browse/YARN-4729 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.7.2 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4729.patch > > > SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE. We > saw this in a unit test failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4359) Update LowCost agents logic to take advantage of YARN-4358
[ https://issues.apache.org/jira/browse/YARN-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166498#comment-15166498 ] Hadoop QA commented on YARN-4359: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 25s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | 
{color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 19s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 19s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 18s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 83 new + 25 unchanged - 1 fixed = 108 total (was 26) {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 19s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 40 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 1s {color} | {color:red} The patch has 68 line(s) with tabs. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. 
{color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 50s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_72 with JDK v1.8.0_72 generated 3 new + 100 unchanged - 0 fixed = 103 total (was 100) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 1m 29s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_95 with JDK v1.7.0_95 generated 4 new + 2 unchanged - 0 fixed = 6 total (was 2) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 14s {color} | {color:red} hadoop-yarn-server-resou
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166487#comment-15166487 ] Sangjin Lee commented on YARN-3863: --- I completed one full pass over the patch (it's large!), and I wouldn't call it complete yet. I may follow up with more comments as I delve more into it. I'd welcome others' reviews too! Here are the comments from this review. (TimelineEntityFilters.java) - l.48: typo: "a entity type" -> "the entity type" - There are multiple places where a space is missing before an opening parenthesis ("("). I also saw it in other files. You want to have a space before the opening parenthesis. - l.51: make it a link - l.59: typo: "a entity type" -> "the entity type" - l.69: typo: "a info key" -> "the info key" - l.81: make it a link - l.91: make it a link - l.99: make it a link (TimelineReaderWebServicesUtils.java) - l.94: I'm not really sure what this change is intended to do. The goal is to do an equality filter against multiple values, right? Why do we need a separate {{parseMetricsFilters()}} method for this? What's changed? - l.257: Why is it GREATER_OR_EQUAL instead of EQUAL? - This is more of a question. Is a list of multiple equality filters the same as the multi-val equality filter? If not, how are they different? (TimelineCompareFilter.java) - nit: let's make the member variables final (TimelineFilter.java) - l.52: the name "MULTIVAL_EQUALITY" is a bit confusing, and it took me a little bit to see this means equality with an element in the set (I thought it was multiple key-value equality). Is this essentially an "in the set" comparison? I wonder if there could be a better name? The same goes for {{TimelineMultiValueEqualityFilter}}. (TimelineFilterUtils.java) - l.104: can {{createSingleColValueFiltersByRange()}} be refactored to call {{createHBaseSingleColValueFilter()}}? - l.107: dead code? 
(HBaseTimelineWriterImpl.java) - Is this basically improving the code by using the strongly typed methods for bytes? As mentioned in a previous comment, these changes (this and {{\*Column\*}} changes) seem orthogonal. Would it be possible to isolate these changes from the main changes? - l.448: it should simply be an {{else if}} (TimelineStorageUtils.java) - There are many places here and elsewhere where {{equals()}} is used to compare enums. All the enum comparisons should simply use "==". - see my previous comment about refactoring to make these methods simpler and easier to read (GenericEntityReader.java) - l.260: I know this is happening deep inside the method, but it seems like a bit of an anti-pattern that we have to reference whether something is an application v. entity. There are multiple places in {{GenericEntityReader}} for this (basically each place where {{ApplicationColumn\*}} is used). I know there is already a precedent (I introduced it :(), but now it's gone full bloom. This makes the line between {{GenericEntityReader}} and {{ApplicationEntityReader}} quite blurry. Would it be possible to refactor these so that application behavior goes into {{ApplicationEntityReader}}? I haven't thought through what kind of refactoring would make that separation possible, but it would be great if you could come up with ideas to retain separation between {{GenericEntityReader}} and {{ApplicationEntityReader}}. - l.532: This is an interesting point. Should we categorically disallow any multi-entity reads without a filter? Is it an obvious requirement? I understand we already set some default values (e.g. created time, etc.) so this might be a moot point, but do we need to check for it when some defaults are set anyway? (TestHBaseTimelineStorage.java) - I think we went back and forth on this, but this test is getting really long now. Should we consider breaking it up in some fashion? 
I think we originally broke it up as a reader test and a writer test, and then combined them into one again. Would there be some value in separating them (with possibly a common base class)? Or we could break it down along different types of entities? I'm open to ideas. (TimelineExistsFilter.java) - l.32-33: nit: make them final (TimelineMultiValueEqualityFilter.java) - The name is a bit confusing (see above) > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.
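Sangjin's point above about enum comparisons can be shown in a few lines. This is a generic illustration, not YARN code: for enum constants, `==` gives the same answer as `equals()` but is null-safe and cannot accidentally compare across types.

```java
// Why "==" is preferred over equals() for enum comparisons: enums are
// singletons, so reference equality is exact, and "==" never throws NPE.
public class EnumCompare {
    enum Op { EQUAL, NOT_EQUAL } // illustrative enum, not a YARN type

    static boolean isEqualOp(Op op) {
        return op == Op.EQUAL; // safe even when op is null
    }

    public static void main(String[] args) {
        System.out.println(isEqualOp(Op.EQUAL)); // true
        System.out.println(isEqualOp(null));     // false, no NPE
        // By contrast, ((Op) null).equals(Op.EQUAL) would throw
        // NullPointerException, and "==" is also checked at compile time
        // against comparing unrelated enum types.
    }
}
```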
[jira] [Commented] (YARN-4734) Merge branch:YARN-3368 to trunk
[ https://issues.apache.org/jira/browse/YARN-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166482#comment-15166482 ] Hadoop QA commented on YARN-4734: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s {color} | {color:red} YARN-4734 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12789821/YARN-4734.1.patch | | JIRA Issue | YARN-4734 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/10629/console | | Powered by | Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Merge branch:YARN-3368 to trunk > --- > > Key: YARN-4734 > URL: https://issues.apache.org/jira/browse/YARN-4734 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4734.1.patch > > > YARN-2928 branch is planned to merge back to trunk shortly, it depends on > changes of YARN-3368. This JIRA is to track the merging task. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4734) Merge branch:YARN-3368 to trunk
[ https://issues.apache.org/jira/browse/YARN-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4734: - Summary: Merge branch:YARN-3368 to trunk (was: Merge YARN-3368 commit to trunk) > Merge branch:YARN-3368 to trunk > --- > > Key: YARN-4734 > URL: https://issues.apache.org/jira/browse/YARN-4734 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4734.1.patch > > > YARN-2928 branch is planned to merge back to trunk shortly, it depends on > changes of YARN-3368. This JIRA is to track the merging task. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166469#comment-15166469 ] Hadoop QA commented on YARN-4723: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 9s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 46s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | 
{color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 3 new + 66 unchanged - 0 fixed = 69 total (was 66) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 42s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 13s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 152m 24s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12789649/YARN-4723.001.
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166462#comment-15166462 ] Arun Suresh commented on YARN-1040: --- So that we are on the same page, if we were to separate what needs to be in YARN vs. what Slider etc. should handle, I'd say: *YARN* * Container Upgrade primitive: ** provide AM with APIs (via NMClient) to upgrade the Container. ** API takes 1) new {{ContainerLaunchContext}} and 2) a policy viz. *In-place* (localize v2 in parallel, start v2, stop v1) or *New+rollback* (stop v1, localize v2, start v2) + (start v1 if start v2 fails) *or* a list of primitive composable commands if the above policies don't cover the use case. ** should negotiate Resource increase for in-place upgrade with RM prior to upgrade via YARN-1197 (or perhaps use OPPORTUNISTIC containers to negotiate locally at the NM for the resource spike needed for upgrade, once YARN-2877 is ready) *Slider / or something similar* * Application upgrade primitive ** Upgrade Orchestration Policy: Allow applications deployed via Slider to specify the order in which tasks/roles are upgraded (or started) ** Allow applications to specify how containers of each role are upgraded ** Actually call the YARN container upgrade APIs (described above) to perform upgrade of each container in the user-specified order/policy. Makes sense? > De-link container life cycle from the process and add ability to execute > multiple processes in the same long-lived container > > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. 
> This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. > We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
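The ordering difference between Arun's two proposed policies can be sketched as follows. Everything here is hypothetical: `UpgradePlan`, `Policy`, and the step strings are stand-ins invented for this illustration and exist nowhere in YARN's NMClient; the point is only that *In-place* overlaps localization of v2 with v1 still serving, while *New+rollback* stops v1 first and restarts it if v2 fails to start.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (no real YARN API): the sequence of container
// operations each upgrade policy would issue.
public class UpgradePlan {
    enum Policy { IN_PLACE, NEW_WITH_ROLLBACK }

    static List<String> steps(Policy p, boolean v2StartFails) {
        List<String> s = new ArrayList<>();
        if (p == Policy.IN_PLACE) {
            s.add("localize v2"); // v1 keeps serving during localization
            s.add("start v2");
            s.add("stop v1");
        } else {
            s.add("stop v1");     // downtime begins here
            s.add("localize v2");
            s.add("start v2");
            if (v2StartFails) {
                s.add("start v1"); // rollback to the old version
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(steps(Policy.IN_PLACE, false));
        System.out.println(steps(Policy.NEW_WITH_ROLLBACK, true));
    }
}
```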
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166456#comment-15166456 ] Sangjin Lee commented on YARN-3863: --- {quote} Yes, code is similar. We are looping over a filter list and then checking the operator while doing processing for an individual filter. I thought about it but then the issue in moving it into a common area is that the data structures which hold events, configs, metrics, etc. are not the same. We can however do one thing and that is to pass the TimelineEntity object itself into a common function (for all filters) and also pass something, say an enum indicating what kind of filter we are intending to match (name it as something like TimelineEntityFiltersType). Then based on this enum value, get the appropriate item (configs, metrics, etc.) from the passed entity. This way we can move common logic to a specific method which can in turn call the appropriate method to process based on filter type (say equality filter, multivalue equality filter, etc.). Does this sound fine? {quote} I think so. I'll need to see the changes in code to get a better sense, though. > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true i.e. only AND operation is supported. We can > support OR operation for the filters as well. 
Additionally as primary backend > implementation is HBase, we can design our filters in a manner, where they > closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
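The enum-dispatch refactoring discussed in the YARN-3863 comment above can be sketched roughly as follows. All names here (the simplified TimelineEntity, TimelineEntityFiltersType, matchEqualityFilter) are illustrative assumptions, not the actual YARN-3863 code:

```java
import java.util.Map;

public class FilterDispatchSketch {
  // Which part of the entity a filter list applies to (the enum Varun
  // proposes naming something like TimelineEntityFiltersType).
  enum TimelineEntityFiltersType { CONFIG, INFO, METRIC }

  // Simplified stand-in for the real TimelineEntity: the three data
  // structures are not the same type, which is what motivates the enum.
  static class TimelineEntity {
    Map<String, String> configs;
    Map<String, Object> info;
    Map<String, Long> metrics;
  }

  // Common matching logic: pick the right map off the entity based on the
  // enum, then apply the same equality check regardless of filter kind.
  static boolean matchEqualityFilter(TimelineEntity entity,
      TimelineEntityFiltersType type, String key, Object expected) {
    Object actual;
    switch (type) {
      case CONFIG: actual = entity.configs.get(key); break;
      case INFO:   actual = entity.info.get(key);    break;
      case METRIC: actual = entity.metrics.get(key); break;
      default: return false;
    }
    return expected.equals(actual);
  }
}
```

The same dispatch would let multivalue-equality or existence filters share one looping method instead of duplicating the loop per data structure.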
[jira] [Updated] (YARN-4734) Merge YARN-3368 commit to trunk
[ https://issues.apache.org/jira/browse/YARN-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4734: - Attachment: YARN-4734.1.patch Attached patch for merge. > Merge YARN-3368 commit to trunk > --- > > Key: YARN-4734 > URL: https://issues.apache.org/jira/browse/YARN-4734 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-4734.1.patch > > > YARN-2928 branch is planned to merge back to trunk shortly, it depends on > changes of YARN-3368. This JIRA is to track the merging task. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4734) Merge YARN-3368 commit to trunk
Wangda Tan created YARN-4734: Summary: Merge YARN-3368 commit to trunk Key: YARN-4734 URL: https://issues.apache.org/jira/browse/YARN-4734 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan YARN-2928 branch is planned to merge back to trunk soon, it depends on changes of YARN-3368. This JIRA is to track the merging task. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4734) Merge YARN-3368 commit to trunk
[ https://issues.apache.org/jira/browse/YARN-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4734: - Description: YARN-2928 branch is planned to merge back to trunk shortly, it depends on changes of YARN-3368. This JIRA is to track the merging task. (was: YARN-2928 branch is planned to merge back to trunk soon, it depends on changes of YARN-3368. This JIRA is to track the merging task.) > Merge YARN-3368 commit to trunk > --- > > Key: YARN-4734 > URL: https://issues.apache.org/jira/browse/YARN-4734 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > > YARN-2928 branch is planned to merge back to trunk shortly, it depends on > changes of YARN-3368. This JIRA is to track the merging task. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4733) [YARN-3368] Commit initial web UI patch to branch: YARN-3368
[ https://issues.apache.org/jira/browse/YARN-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-4733. -- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: YARN-3368 This JIRA is created to track commit of initial web UI patch to branch YARN-3368. See [commit|https://github.com/apache/hadoop/commit/8ef2e8f1218a7be112ababccfde112c16ba48aa5] > [YARN-3368] Commit initial web UI patch to branch: YARN-3368 > > > Key: YARN-4733 > URL: https://issues.apache.org/jira/browse/YARN-4733 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: YARN-3368 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4359) Update LowCost agents logic to take advantage of YARN-4358
[ https://issues.apache.org/jira/browse/YARN-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166439#comment-15166439 ] Ishai Menache commented on YARN-4359: - Will add additional tests soon. > Update LowCost agents logic to take advantage of YARN-4358 > -- > > Key: YARN-4359 > URL: https://issues.apache.org/jira/browse/YARN-4359 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Carlo Curino >Assignee: Ishai Menache > Attachments: YARN-4359.0.patch > > > Given the improvements of YARN-4358, the LowCost agent should be improved to > leverage this, and operate on RLESparseResourceAllocation (ideally leveraging > the improvements of YARN-3454 to compute available resources) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4359) Update LowCost agents logic to take advantage of YARN-4358
[ https://issues.apache.org/jira/browse/YARN-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ishai Menache updated YARN-4359: Attachment: YARN-4359.0.patch First version of the patch. > Update LowCost agents logic to take advantage of YARN-4358 > -- > > Key: YARN-4359 > URL: https://issues.apache.org/jira/browse/YARN-4359 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Carlo Curino >Assignee: Ishai Menache > Attachments: YARN-4359.0.patch > > > Given the improvements of YARN-4358, the LowCost agent should be improved to > leverage this, and operate on RLESparseResourceAllocation (ideally leveraging > the improvements of YARN-3454 to compute available resources) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4733) [YARN-3368] Commit initial web UI patch to branch: YARN-3368
Wangda Tan created YARN-4733: Summary: [YARN-3368] Commit initial web UI patch to branch: YARN-3368 Key: YARN-4733 URL: https://issues.apache.org/jira/browse/YARN-4733 Project: Hadoop YARN Issue Type: Task Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4733) [YARN-3368] Commit initial web UI patch to branch: YARN-3368
[ https://issues.apache.org/jira/browse/YARN-4733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-4733: - Issue Type: Sub-task (was: Task) Parent: YARN-3368 > [YARN-3368] Commit initial web UI patch to branch: YARN-3368 > > > Key: YARN-4733 > URL: https://issues.apache.org/jira/browse/YARN-4733 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4097) Create POC timeline web UI with new YARN web UI framework
[ https://issues.apache.org/jira/browse/YARN-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166386#comment-15166386 ] Li Lu commented on YARN-4097: - BTW, I haven't fine-tuned the styles of our current pages. More decorations will be very helpful. > Create POC timeline web UI with new YARN web UI framework > - > > Key: YARN-4097 > URL: https://issues.apache.org/jira/browse/YARN-4097 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu > Labels: yarn-2928-1st-milestone > Attachments: Screen Shot 2016-02-24 at 15.57.38.png, Screen Shot > 2016-02-24 at 15.57.53.png, Screen Shot 2016-02-24 at 15.58.08.png, Screen > Shot 2016-02-24 at 15.58.26.png > > > As planned, we need to try out the new YARN web UI framework and implement > timeline v2 web UI on top of it. This JIRA proposes to build the basic active > flow and application lists of the timeline data. We can add more content > after we get used to this framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4097) Create POC timeline web UI with new YARN web UI framework
[ https://issues.apache.org/jira/browse/YARN-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-4097: Attachment: Screen Shot 2016-02-24 at 15.58.26.png Screen Shot 2016-02-24 at 15.58.08.png Screen Shot 2016-02-24 at 15.57.53.png Screen Shot 2016-02-24 at 15.57.38.png OK, here are some screenshots of the current POC for the ATS-related web UI pages. Note that we're affected by YARN-4700 in the flow activity list (there are multiple items for the same flow due to different cluster ids). The flow and flowrun pages are still very immature, but I left room to further integrate more data from aggregations. The application page in the new YARN UI currently only reads data from the RM; right now I simply link to that page, pending future integration. For the near-term next steps, we may want to: # Provide a "dashboard" in the flow activity page, summarizing flow activities. # Integrate flowrun/flow-level aggregation data into the flow and flowrun pages. # Integrate ATS v2 web services into the application page, so that once an application has finished, data can be read from the timeline server rather than the RM. We may want to extend this practice to the attempt and container pages. > Create POC timeline web UI with new YARN web UI framework > - > > Key: YARN-4097 > URL: https://issues.apache.org/jira/browse/YARN-4097 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Li Lu >Assignee: Li Lu > Labels: yarn-2928-1st-milestone > Attachments: Screen Shot 2016-02-24 at 15.57.38.png, Screen Shot > 2016-02-24 at 15.57.53.png, Screen Shot 2016-02-24 at 15.58.08.png, Screen > Shot 2016-02-24 at 15.58.26.png > > > As planned, we need to try out the new YARN web UI framework and implement > timeline v2 web UI on top of it. This JIRA proposes to build the basic active > flow and application lists of the timeline data. We can add more content > after we get used to this framework. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4697) NM aggregation thread pool is not bound by limits
[ https://issues.apache.org/jira/browse/YARN-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166325#comment-15166325 ] Hudson commented on YARN-4697: -- FAILURE: Integrated in Hadoop-trunk-Commit #9364 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9364/]) YARN-4697. NM aggregation thread pool is not bound by limits (haibochen (rkanter: rev 954dd57043d2de4f962876c1b89753bfc7e4ce55) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java > NM aggregation thread pool is not bound by limits > - > > Key: YARN-4697 > URL: https://issues.apache.org/jira/browse/YARN-4697 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Haibo Chen >Assignee: Haibo Chen > Fix For: 2.9.0 > > Attachments: yarn4697.001.patch, yarn4697.002.patch, > yarn4697.003.patch, yarn4697.004.patch > > > In LogAggregationService.java we create a threadpool to upload logs from > the nodemanager to HDFS if log aggregation is turned on. This is a cached > threadpool which, based on the javadoc, is an unlimited pool of threads. > If we have had a problem with log aggregation, this could cause > a problem on restart. The number of threads created at that point could be > huge and will put a large load on the NameNode, and in the worst case could even > bring it down due to file descriptor issues. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
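The fix direction in YARN-4697, bounding the upload pool, can be sketched as below. The default size and factory method name are assumptions for illustration, not the exact values or keys added by the patch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedPoolSketch {
  // Illustrative default; the actual default lives in yarn-default.xml.
  static final int DEFAULT_POOL_SIZE = 100;

  // Before: Executors.newCachedThreadPool() could spawn one thread per
  // pending app on restart, with no upper bound.
  // After: a fixed-size pool caps concurrent uploads to the NameNode.
  static ExecutorService createUploaderPool(int configuredSize) {
    int size = configuredSize > 0 ? configuredSize : DEFAULT_POOL_SIZE;
    return Executors.newFixedThreadPool(size);
  }
}
```

With a fixed pool, a backlog of apps awaiting aggregation queues up behind a bounded number of worker threads instead of flooding the NameNode with concurrent create/rename calls.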
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15165668#comment-15165668 ] Bikas Saha commented on YARN-1040: -- Agree with your scenarios. I am trying to figure out a way by which this does not become a YARN problem (both initial work and ongoing maintenance). E.g. we don't know for sure that the resource needs to be x, 2x or 3x. This is an allocation decision and cannot be made without the RM's blessing. And increasing container resources is already work in progress and may become another NM primitive. Next, what is the ordering for the tasks during an upgrade? We could implement one of many possibilities, but then be stuck with bug-fixing or improving it, and potentially use that as a precedent to implement yet another upgrade policy. Hence my suggestion of creating composable primitives that can be used to easily implement these flows, leaving it to the apps to determine the exact upgrade paths. Perhaps Slider is a better place to wrap different upgrade possibilities using the composable primitives, e.g. SliderStopAllUpgradePolicy or SliderConcurrentUpgradePolicy. Or they could be provided as helper libs in YARN/NMClient so apps don't have to compose the primitives from scratch. The main aim is to keep core YARN/NM simple by creating primitives and layering complexity on top. This approach may be simpler and more incremental to develop, test and deploy. Of course, these are my personal design views :) Thoughts? 
> De-link container life cycle from the process and add ability to execute > multiple processes in the same long-lived container > > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. > We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4680) TimerTasks leak in ATS V1.5 Writer
[ https://issues.apache.org/jira/browse/YARN-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15165655#comment-15165655 ] Hudson commented on YARN-4680: -- FAILURE: Integrated in Hadoop-trunk-Commit #9363 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9363/]) YARN-4680. TimerTasks leak in ATS V1.5 Writer. (Xuan Gong via (gtcarrera9: rev 9e0f7b8b69ead629f999aa86c8fb7eb581e175d8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/FileSystemTimelineWriter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt > TimerTasks leak in ATS V1.5 Writer > -- > > Key: YARN-4680 > URL: https://issues.apache.org/jira/browse/YARN-4680 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Xuan Gong >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-4680.1.patch, YARN-4680.20160108.patch, > YARN-4680.20160109.patch, YARN-4680.20160222.patch > > > We have seen TimerTasks leak, which could cause the application server to go down (such > as the Oozie server going down due to too many active threads). > Although we have fixed some potential leak situations at the upper application > level, such as > https://issues.apache.org/jira/browse/MAPREDUCE-6618 > https://issues.apache.org/jira/browse/MAPREDUCE-6621, we still cannot > guarantee that the issue is fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4701) When task logs are not available, port 8041 is referenced instead of port 8042
[ https://issues.apache.org/jira/browse/YARN-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15165651#comment-15165651 ] Robert Kanter commented on YARN-4701: - LGTM +1 pending Jenkins > When task logs are not available, port 8041 is referenced instead of port 8042 > -- > > Key: YARN-4701 > URL: https://issues.apache.org/jira/browse/YARN-4701 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: yarn4701.001.patch, yarn4701.002.patch, > yarn4701.003.patch, yarn4701.004.patch > > > Accessing logs for a task attempt in the workflow tool in Hue shows "Logs > not available for attempt_1433822010707_0001_m_00_0. Aggregation may not > be complete, Check back later or try the nodemanager at > quickstart.cloudera:8041" > If the user follows that link, he/she will get "It looks like you are making > an HTTP request to a Hadoop IPC port. This is not the correct port for the > web interface on this daemon." > We should update the message to use the correct HTTP port. We could also make > it more convenient by providing the application's specific page at the NM as well, > instead of just the NM's main page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163918#comment-15163918 ] Ming Ma commented on YARN-4720: --- Thanks [~hex108] for the update. The patch looks good overall. It does change the following behaviors. * When {{pendingContainerInThisCycle}} is empty, NM will skip sending the {{LogAggregationReport}} with {{LogAggregationStatus.RUNNING}}. It means for a long running service, it is possible for a yarn client to get {{LogAggregationStatus.NOT_START}} when it calls {{ApplicationClientProtocol#getApplicationReport}} if the long running service doesn't generate any log. Without the patch, NM will send {{LogAggregationStatus.RUNNING}} regardless. So it might be better to still send {{LogAggregationStatus.RUNNING}} regardless. * When {{LogWriter}} creation throws exception and {{appFinished}} is true, NM will send a {{LogAggregationReport}} with {{LogAggregationStatus.SUCCEEDED}}. Without the patch, NM won't send any final {{LogAggregationReport}}. Maybe it is better to update the patch to send {{LogAggregationStatus.FAILED}} for such a scenario. > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. 
> > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. > if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
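A minimal sketch of the guard discussed above: only perform NameNode operations when the cycle actually has containers to upload. The types here are simplified stand-ins for the real AppLogAggregatorImpl internals, not the actual patch:

```java
import java.util.List;

public class SkipEmptyCycleSketch {
  // Stand-in for the create/rename/delete calls that hit the NameNode.
  interface NameNodeOps {
    void createWriter();
    void finishWriter();
  }

  // Returns true if an upload cycle actually ran.
  static boolean uploadLogs(List<String> pendingContainersInThisCycle,
      NameNodeOps nn) {
    if (pendingContainersInThisCycle.isEmpty()) {
      // No new local logs: skip the create/getfileinfo/delete NN calls
      // entirely instead of creating a writer and deleting the file later.
      return false;
    }
    nn.createWriter();
    // ... aggregate each pending container's logs here ...
    nn.finishWriter();
    return true;
  }
}
```

The point of the guard is that a long-running service with no new logs, or an app whose containers are all skipped by policy, triggers zero NN round-trips per cycle.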
[jira] [Commented] (YARN-4729) SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE
[ https://issues.apache.org/jira/browse/YARN-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163879#comment-15163879 ] Robert Kanter commented on YARN-4729: - +1 LGTM > SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE > -- > > Key: YARN-4729 > URL: https://issues.apache.org/jira/browse/YARN-4729 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.7.2 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4729.patch > > > SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE. We > saw this in a unit test failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4730) YARN preemption based on instantaneous fair share
[ https://issues.apache.org/jira/browse/YARN-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163873#comment-15163873 ] Karthik Kambatla commented on YARN-4730: IIRC, FairScheduler preemption is based on the instantaneous fair share. The steady fair share is used only for WebUI purposes. In your case, I would think minshare preemption kicks in because you specify min resources for all queues. Isn't it expected that all queues get the same resources, the sum of which is the cluster resources? Do you expect allocations different from minshare? > YARN preemption based on instantaneous fair share > - > > Key: YARN-4730 > URL: https://issues.apache.org/jira/browse/YARN-4730 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Prabhu Joseph > > On a big cluster with a total cluster resource of 10TB, 3000 cores and Fair > Scheduler having 230 queues and a total of 6 jobs run a day. [ all 230 queues > are very critical and hence the minResource is the same for all]. In this case, > when a Spark job is run on queue A and occupies the entire cluster > resource without releasing any resource, another job submitted to queue > B through preemption gets only the fair share, which is <10TB , 3000> / 230 > = <45 GB , 13 cores>, a very small fair share for a queue shared by many > applications. > Preemption should get the instantaneous fair share, that is <10TB, 3000> > / 2 (active queues) = 5TB and 1500 cores, so that the first job won't hog the > entire cluster resource and the subsequent jobs also run fine. > This issue arises only when the number of queues is very high. With a small > number of queues, preemption getting the fair share would suffice as the fair > share will be high. But with too many queues, preemption > should try to get the instantaneous fair share. 
> Note: Configuring optimal maxResources for 230 queues is difficult, and > constraining the queues using maxResource will leave cluster > resources idle most of the time. > There are 1000s of Spark jobs, so asking each user to restrict the > number of executors is also difficult. > Preempting the instantaneous fair share will help to overcome the above issues. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
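The arithmetic in the YARN-4730 description can be checked in a few lines. The helper below is purely illustrative, not FairScheduler code; it just divides the cluster memory by the queue count, first by all 230 configured queues (steady share, roughly 45 GB) and then by the 2 active queues (instantaneous share, 5 TB):

```java
public class FairShareSketch {
  // Steady fair share divides by every configured queue; instantaneous
  // fair share divides by only the queues that currently have demand.
  static long fairShareMb(long clusterMb, int queueCount) {
    return clusterMb / queueCount;
  }

  public static void main(String[] args) {
    long clusterMb = 10L * 1024 * 1024; // 10 TB expressed in MB
    // Across all 230 queues: ~45590 MB, i.e. roughly the <45 GB, 13 cores>
    // per-queue share quoted in the description.
    System.out.println(fairShareMb(clusterMb, 230) + " MB");
    // Across the 2 active queues: 5 TB per queue.
    System.out.println(fairShareMb(clusterMb, 2) + " MB");
  }
}
```

The gap between the two numbers (45 GB vs. 5 TB) is exactly why the reporter wants preemption to target the instantaneous share when most queues are idle.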
[jira] [Updated] (YARN-4730) YARN preemption based on instantaneous fair share
[ https://issues.apache.org/jira/browse/YARN-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-4730: --- Component/s: fairscheduler > YARN preemption based on instantaneous fair share > - > > Key: YARN-4730 > URL: https://issues.apache.org/jira/browse/YARN-4730 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Prabhu Joseph > > On a big cluster with a total cluster resource of 10TB, 3000 cores and Fair > Scheduler having 230 queues and a total of 6 jobs run a day. [ all 230 queues > are very critical and hence the minResource is the same for all]. In this case, > when a Spark job is run on queue A and occupies the entire cluster > resource without releasing any resource, another job submitted to queue > B through preemption gets only the fair share, which is <10TB , 3000> / 230 > = <45 GB , 13 cores>, a very small fair share for a queue shared by many > applications. > Preemption should get the instantaneous fair share, that is <10TB, 3000> > / 2 (active queues) = 5TB and 1500 cores, so that the first job won't hog the > entire cluster resource and the subsequent jobs also run fine. > This issue arises only when the number of queues is very high. With a small > number of queues, preemption getting the fair share would suffice as the fair > share will be high. But with too many queues, preemption > should try to get the instantaneous fair share. > Note: Configuring optimal maxResources for 230 queues is difficult, and > constraining the queues using maxResource will leave cluster > resources idle most of the time. > There are 1000s of Spark jobs, so asking each user to restrict the > number of executors is also difficult. > Preempting the instantaneous fair share will help to overcome the above issues. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163864#comment-15163864 ] Arun Suresh commented on YARN-1040: --- Thanks for the feedback, [~bikassaha]. I understand we might not want to place artificial constraints on apps; I was just trying to scope out the bare-minimum effort required specifically for long-running container upgrades. That said, I'm all for going the whole hog (allow 0 or 1+ processes) if that is easier. Some thoughts specifically with regard to container upgrade: # If we allow multiple processes per container, we might need {{startProcess()}} to return a *processId* which can subsequently be used by the AM to address the process in calls like {{stopProcess()}}. This might complicate the AM's state, and maybe we can leave it out of the first cut. # w.r.t. resource re-localization, as per YARN-4597, we are exploring localization as a service and possibly re-localization on the fly. # I like the idea of clubbing multiple API calls in the same RPC. But should *upgrade* be a first-class semantic, or should it be expressed as a {{localize v2, start v2, stop v1}} API combo? One reason to distinguish may be the case of having both versions up at the same time until the new version stabilizes... in an upgrade case, the container should probably be allowed to go to 2x its allocated resource limit for a period of time, but in the case where we are just starting 2 processes, this should probably not be allowed. 
> De-link container life cycle from the process and add ability to execute > multiple processes in the same long-lived container > > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. > We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4729) SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE
[ https://issues.apache.org/jira/browse/YARN-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163860#comment-15163860 ] Karthik Kambatla commented on YARN-4729: Test failures are not related. > SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE > -- > > Key: YARN-4729 > URL: https://issues.apache.org/jira/browse/YARN-4729 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.7.2 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > Attachments: yarn-4729.patch > > > SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE. We > saw this in a unit test failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated YARN-4723: -- Attachment: YARN-4723.001.patch Attaching preliminary patch based on approach#2 by [~jlowe]. Also added the change to not put such a node in active RMNodes map but in inactive map. > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > Attachments: YARN-4723.001.patch > > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163729#comment-15163729 ] Jason Lowe commented on YARN-4723: -- I haven't looked at it in detail, but can we simply avoid doing any node update processing for dummy nodes (e.g.: port == -1) when processing the decommission transition? > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: > org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:658) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:647) > at > 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:9335) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:144) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:175) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:96) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:608) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server.call(Server.java:2267) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:648) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:615) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2217) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
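Jason Lowe's suggestion above can be sketched as a small guard. This is a hedged illustration only: the NodeId class below is a minimal stand-in for the real YARN record, and the helper name is an assumption, not code from any patch.

```java
// Minimal stand-in for the YARN NodeId record (not the real class).
class NodeId {
    final String host;
    final int port;

    NodeId(String host, int port) {
        this.host = host;
        this.port = port;
    }
}

class DecommissionHelper {
    // Dummy/unknown nodes are identified by a sentinel port of -1.
    static final int UNKNOWN_PORT = -1;

    // Only real, reported nodes should generate node-update processing during
    // the decommission transition; placeholder nodes are skipped, which avoids
    // feeding an UnknownNodeId into the protobuf merge path.
    static boolean shouldProcessNodeUpdate(NodeId id) {
        return id.port != UNKNOWN_PORT;
    }
}
```

A caller in the decommission transition would simply test `shouldProcessNodeUpdate` before building a node report.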
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163708#comment-15163708 ] Bikas Saha commented on YARN-1040: -- I am not sure we need to place (somewhat artificial) constraints on the app when it's not clear that they practically affect YARN. 1) A container with no process should be allowed. Apps could terminate all running tasks of version A, then start running tasks of version B when they are not backwards compatible. 2) A container should be allowed to run multiple processes. This is similar to the existing case of a process spawning more processes; it differs in that the NM has to add the new process to existing monitoring/cgroups etc. 3) StartProcess should be allowed with no process actually started. This will allow apps to localize new resources to an existing container. Alternatively, we could create a new localization API that's delinked from starting the process. But re-localization is an important related feature that we should look at supporting via this work, because currently it does not work since it's tied to starting the process. 4) Most current apps are already communicating directly with their tasks and hence can shut them down when they are not needed. However, as suggested above, it may be useful for the NM to provide a feature whereby the previous task can be shut down when a new task request is received. Alternatively, the NM could provide a stopProcess API to make that explicit. IMO all of this should be allowed. The timeline could be different, with some features being allowed earlier and some later based on implementation effort. Thinking ahead, it may be useful for the NM to accept a series of API calls within the same RPC (with the current mechanism supported as a single command entity for backwards compatibility). Then we will not have to build a lot of logic into the NM. The app can get all features by composing a multi-command entity. E.g.
Current start process = {acquire, localize, start} // where acquire means start container
Current shutdown process = {stop, release} // where release means give up container
Only localize = {localize}
Start another process = {localize, start}
Start another process after shutting down the first process = {stop, start} or {stop, localize, start}
Start another process and then shut down the first process = {start, stop}
New container shutdown = {release} // at this point there may be 0 or more processes running, all of which will be stopped
> De-link container life cycle from the process and add ability to execute > multiple processes in the same long-lived container > > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. > We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
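The multi-command composition above can be modeled in a few lines. This is a sketch of the idea only; the command names and methods are assumptions for illustration, not a real NM API.

```java
import java.util.List;

// Primitive commands the NM could accept as an ordered list per request.
enum Cmd { ACQUIRE, LOCALIZE, START, STOP, RELEASE }

class ContainerOps {
    // Current "start process" = {acquire, localize, start}.
    static List<Cmd> startProcess() {
        return List.of(Cmd.ACQUIRE, Cmd.LOCALIZE, Cmd.START);
    }

    // Re-localize resources without starting a new process.
    static List<Cmd> localizeOnly() {
        return List.of(Cmd.LOCALIZE);
    }

    // Replace the running task: stop the old one, localize, start the new one.
    static List<Cmd> restartWithNewResources() {
        return List.of(Cmd.STOP, Cmd.LOCALIZE, Cmd.START);
    }
}
```

The appeal of the design is visible here: the NM only needs to execute a list of primitives, and apps compose whatever lifecycle they need.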
[jira] [Created] (YARN-4732) *ProcessTree classes have too many whitespace issues
Karthik Kambatla created YARN-4732: -- Summary: *ProcessTree classes have too many whitespace issues Key: YARN-4732 URL: https://issues.apache.org/jira/browse/YARN-4732 Project: Hadoop YARN Issue Type: Improvement Reporter: Karthik Kambatla Priority: Trivial *ProcessTree classes have too many whitespace issues - extra newlines between methods, spaces in empty lines etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4732) *ProcessTree classes have too many whitespace issues
[ https://issues.apache.org/jira/browse/YARN-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-4732: - Assignee: Haibo Chen > *ProcessTree classes have too many whitespace issues > > > Key: YARN-4732 > URL: https://issues.apache.org/jira/browse/YARN-4732 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Karthik Kambatla >Assignee: Haibo Chen >Priority: Trivial > Labels: newbie > > *ProcessTree classes have too many whitespace issues - extra newlines between > methods, spaces in empty lines etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters
[ https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163573#comment-15163573 ] Karthik Kambatla commented on YARN-3304: [~djp] - did we ever file follow up JIRAs to delete these deprecated methods from trunk? If not, I would like to file them and get the trunk changes in. > ResourceCalculatorProcessTree#getCpuUsagePercent default return value is > inconsistent with other getters > > > Key: YARN-3304 > URL: https://issues.apache.org/jira/browse/YARN-3304 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du >Priority: Blocker > Fix For: 2.7.0 > > Attachments: YARN-3304-appendix-v2.patch, > YARN-3304-appendix-v3.patch, YARN-3304-appendix-v4.patch, > YARN-3304-appendix.patch, YARN-3304-v2.patch, YARN-3304-v3.patch, > YARN-3304-v4-boolean-way.patch, YARN-3304-v4-negative-way-MR.patch, > YARN-3304-v4-negtive-value-way.patch, YARN-3304-v6-no-rename.patch, > YARN-3304-v6-with-rename.patch, YARN-3304-v7.patch, YARN-3304-v8.patch, > YARN-3304.patch, yarn-3304-5.patch > > > Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for > unavailable case while other resource metrics are return 0 in the same case > which sounds inconsistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1040) De-link container life cycle from the process and add ability to execute multiple processes in the same long-lived container
[ https://issues.apache.org/jira/browse/YARN-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163564#comment-15163564 ] Arun Suresh commented on YARN-1040: --- Spent some time going through the conversation (this one as well as YARN-1404). Given that this has been tracked as a requirement for in-place application upgrades, and it has been some time since any activity was posted here, [~bikassaha] / [~vinodkv] / [~hitesh] / [~tucu00] / [~steve_l], can you kindly clarify the following? # Are we still trying to handle the case where we have > 1 processes running against a container *at the same time*? # Have we decided that allowing a container with 0 processes running is a bad idea? From the context of getting application upgrades working, I guess 1) can be relaxed to exactly 1 process running under a container, but the AM has the option of explicitly starting it via the {{startProcess(containerLaunchContext)}} API Bikas mentioned (an additional constraint could be that startProcess has to be called within a timeout if no ContainerLaunchContext has been provided with the initial {{startContainer()}}, else the NM will deem the container dead). In addition, I was also thinking # If a process is already running in the container when a {{startProcess(ContainerLaunchContext)}} is received, then the first process is killed and another is started using the new {{ContainerLaunchContext}}. # Maybe we can refine the above by adding an {{upgradeProcess(ContainerLaunchContext)}} API that can additionally take a policy like: ## auto-rollback if the new process does not start within a timeout. ## Rollback could either mean keeping the old process running until the upgraded process is up -or-, if we want to preserve the semantics of only 1 process per container, first kill the old process and try to start the new one, and on failure restart the old version. 
If everyone is ok with the above, I volunteer to either post a preliminary patch or, if the details get dicier during investigation, put up a doc. Thoughts? > De-link container life cycle from the process and add ability to execute > multiple processes in the same long-lived container > > > Key: YARN-1040 > URL: https://issues.apache.org/jira/browse/YARN-1040 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 3.0.0 >Reporter: Steve Loughran > > The AM should be able to exec >1 process in a container, rather than have the > NM automatically release the container when the single process exits. > This would let an AM restart a process on the same container repeatedly, > which for HBase would offer locality on a restarted region server. > We may also want the ability to exec multiple processes in parallel, so that > something could be run in the container while a long-lived process was > already running. This can be useful in monitoring and reconfiguring the > long-lived process, as well as shutting it down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
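The auto-rollback policy floated above can be sketched as a small decision routine. Everything here (the Process interface, method names, the Runnable for restarting the old version) is a hypothetical model of the proposal, not NM code.

```java
class UpgradeManager {
    enum Outcome { UPGRADED, ROLLED_BACK }

    interface Process {
        // True if the process reports healthy within the given timeout.
        boolean isUpWithin(long timeoutMs);
    }

    // Preserving 1-process-per-container semantics: the old process has
    // already been stopped before this is called; if the new process fails to
    // come up within the timeout, the old version is restarted.
    static Outcome upgrade(Process newProc, Runnable restartOldVersion,
                           long timeoutMs) {
        if (newProc.isUpWithin(timeoutMs)) {
            return Outcome.UPGRADED;   // new version took over
        }
        restartOldVersion.run();       // auto-rollback on timeout
        return Outcome.ROLLED_BACK;
    }
}
```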
[jira] [Updated] (YARN-4699) Scheduler UI and REST o/p is not in sync when -replaceLabelsOnNode is used to change label of a node
[ https://issues.apache.org/jira/browse/YARN-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-4699: -- Attachment: 0001-YARN-4699.patch As I see it, if we can update usedCapacity of the label while changing label on a node, this issue can be fixed. Tested various cases mentioned in the patch, and with this fix, it comes up correctly. Attaching this patch for an initial review. [~leftnoteasy] thoughts? > Scheduler UI and REST o/p is not in sync when -replaceLabelsOnNode is used to > change label of a node > > > Key: YARN-4699 > URL: https://issues.apache.org/jira/browse/YARN-4699 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.7.2 >Reporter: Sunil G >Assignee: Sunil G >Priority: Critical > Attachments: 0001-YARN-4699.patch, AfterAppFInish-LabelY-Metrics.png, > ForLabelX-AfterSwitch.png, ForLabelY-AfterSwitch.png > > > Scenario is as follows: > a. 2 nodes are available in the cluster (node1 with label "x", node2 with > label "y") > b. Submit an application to node1 for label "x". > c. Change node1 label to "y" by using *replaceLabelsOnNode* command. > d. Verify Scheduler UI for metrics such as "Used Capacity", "Absolute > Capacity" etc. "x" still shows some capacity. > e. Change node1 label back to "x" and verify UI and REST o/p > Output: > 1. "Used Capacity", "Absolute Capacity" etc are not decremented once labels > is changed for a node. > 2. UI tab for respective label shows wrong GREEN color in these cases. > 3. REST o/p is wrong for each label after executing above scenario. > Attaching screen shots also. This ticket will try to cover UI and REST o/p > fix when label is changed runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
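The bookkeeping Sunil describes can be illustrated with a toy accounting class, under assumed names: when a node's label is replaced, the usage charged to the old label must move to the new label, or the old label's "Used Capacity" never drops back.

```java
import java.util.HashMap;
import java.util.Map;

class LabelUsage {
    private final Map<String, Integer> usedMbByLabel = new HashMap<>();

    // Charge container usage (in MB) against a label.
    void charge(String label, int mb) {
        usedMbByLabel.merge(label, mb, Integer::sum);
    }

    // Transfer a node's used resources when its label is replaced, so the old
    // label's usedCapacity is decremented and the new label's incremented.
    void replaceLabelOnNode(String oldLabel, String newLabel, int nodeUsedMb) {
        usedMbByLabel.merge(oldLabel, -nodeUsedMb, Integer::sum);
        usedMbByLabel.merge(newLabel, nodeUsedMb, Integer::sum);
    }

    int usedMb(String label) {
        return usedMbByLabel.getOrDefault(label, 0);
    }
}
```

Without the transfer step, label "x" in the reported scenario keeps showing stale used capacity after the node moves to "y".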
[jira] [Commented] (YARN-4718) Rename variables in SchedulerNode to reduce ambiguity post YARN-1011
[ https://issues.apache.org/jira/browse/YARN-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163441#comment-15163441 ] Hadoop QA commented on YARN-4718: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 9s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 16 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 34s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | 
{color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 13s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 13s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 23s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 15 new + 493 unchanged - 16 fixed = 508 total (was 509) {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 18s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 19s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. 
{color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 1m 27s {color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_95 with JDK v1.7.0_95 generated 2 new + 2 unchanged - 0 fixed = 4 total (was 2) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 13s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 17s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 16s {color} | {color:green} Patch does not generate ASF License warnings. {colo
[jira] [Commented] (YARN-4722) AsyncDispatcher logs redundant event queue sizes
[ https://issues.apache.org/jira/browse/YARN-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163406#comment-15163406 ] Jason Lowe commented on YARN-4722: -- Thanks, Sangjin! > AsyncDispatcher logs redundant event queue sizes > > > Key: YARN-4722 > URL: https://issues.apache.org/jira/browse/YARN-4722 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.8.0, 2.7.3, 2.9.0, 2.6.5 > > Attachments: YARN-4722.001.patch > > > A fairly common occurrence in RM logs is a string of redundant event-queue > logs like the following which does little except bloat the logs: > {noformat} > 2016-02-23 08:00:00,948 [IPC Server handler 36 on 8030] INFO > event.AsyncDispatcher: Size of event-queue is 1000 > 2016-02-23 08:00:00,948 [IPC Server handler 36 on 8030] INFO > event.AsyncDispatcher: Size of event-queue is 1000 > 2016-02-23 08:00:00,948 [IPC Server handler 36 on 8030] INFO > event.AsyncDispatcher: Size of event-queue is 1000 > 2016-02-23 08:00:00,948 [IPC Server handler 36 on 8030] INFO > event.AsyncDispatcher: Size of event-queue is 1000 > 2016-02-23 08:00:00,948 [IPC Server handler 36 on 8030] INFO > event.AsyncDispatcher: Size of event-queue is 1000 > [...] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
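One way to curb the duplicate logging shown in the description is to remember the last queue size logged and stay quiet while it is unchanged, so a queue stuck at 1000 produces one line instead of one per event. This is an illustrative sketch only; the committed patch may differ in detail.

```java
class QueueSizeLogger {
    private static final int INTERVAL = 1000;  // report at multiples of this
    private int lastSizeLogged = -1;
    private final StringBuilder out = new StringBuilder();

    // Called once per enqueued event with the current queue size; appends a
    // log line only when the size reaches a multiple of INTERVAL that has not
    // already been reported.
    void onEvent(int queueSize) {
        if (queueSize != 0 && queueSize % INTERVAL == 0
                && queueSize != lastSizeLogged) {
            out.append("Size of event-queue is ").append(queueSize).append('\n');
            lastSizeLogged = queueSize;
        }
    }

    String logged() {
        return out.toString();
    }
}
```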
[jira] [Commented] (YARN-4722) AsyncDispatcher logs redundant event queue sizes
[ https://issues.apache.org/jira/browse/YARN-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163389#comment-15163389 ] Hudson commented on YARN-4722: -- FAILURE: Integrated in Hadoop-trunk-Commit #9360 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9360/]) YARN-4722. AsyncDispatcher logs redundant event queue sizes (Jason Lowe (sjlee: rev 553b591ba06bbf0b18dca674d25a48218fed0a26) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java > AsyncDispatcher logs redundant event queue sizes > > > Key: YARN-4722 > URL: https://issues.apache.org/jira/browse/YARN-4722 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-4722.001.patch > > > A fairly common occurrence in RM logs is a string of redundant event-queue > logs like the following which does little except bloat the logs: > {noformat} > 2016-02-23 08:00:00,948 [IPC Server handler 36 on 8030] INFO > event.AsyncDispatcher: Size of event-queue is 1000 > 2016-02-23 08:00:00,948 [IPC Server handler 36 on 8030] INFO > event.AsyncDispatcher: Size of event-queue is 1000 > 2016-02-23 08:00:00,948 [IPC Server handler 36 on 8030] INFO > event.AsyncDispatcher: Size of event-queue is 1000 > 2016-02-23 08:00:00,948 [IPC Server handler 36 on 8030] INFO > event.AsyncDispatcher: Size of event-queue is 1000 > 2016-02-23 08:00:00,948 [IPC Server handler 36 on 8030] INFO > event.AsyncDispatcher: Size of event-queue is 1000 > [...] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4731) Linux container executor exception on DeleteAsUser
Bibin A Chundatt created YARN-4731: -- Summary: Linux container executor exception on DeleteAsUser Key: YARN-4731 URL: https://issues.apache.org/jira/browse/YARN-4731 Project: Hadoop YARN Issue Type: Bug Reporter: Bibin A Chundatt Enable LCE and CGroups Submit a mapreduce job {noformat} 2016-02-24 18:56:46,889 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 2016-02-24 18:56:46,894 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Output: main : command provided 3 main : run as user is dsperf main : requested yarn user is dsperf failed to rmdir job.jar: Not a directory Error while deleting /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01: 20 (Not a directory) Full command array for failed execution: [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, dsperf, dsperf, 3, /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01] 2016-02-24 18:56:46,894 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: DeleteAsUser for /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_01 returned with exit code: 255 org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=255: at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173) at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569) at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: ExitCodeException exitCode=255: at org.apache.hadoop.util.Shell.runCommand(Shell.java:927) at org.apache.hadoop.util.Shell.run(Shell.java:838) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150) ... 10 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163327#comment-15163327 ] Hadoop QA commented on YARN-4720: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 2s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 49s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | 
{color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 13s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: patch generated 1 new + 17 unchanged - 1 fixed = 18 total (was 18) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 0s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 55s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 22s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 33m 44s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12789595/YARN-4720.02.patch | | JIRA Issue | YARN-4720 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux f0125aa6792a 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchpr
[jira] [Commented] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163307#comment-15163307 ] Hadoop QA commented on YARN-4696: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 14s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 3s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 44s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 21s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 55s {color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 28s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage in trunk has 1 extant Findbugs warnings. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 50s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 9s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 1s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 1s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 40s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 40s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 3s {color} | {color:red} root: patch generated 5 new + 29 unchanged - 0 fixed = 34 total (was 29) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 55s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 50s {color} | {color:red} hadoop-common-project/hadoop-common generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 23s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 47s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 10s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 57s {color} | {color:green} hadoop-common in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 53s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 44s {color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the patch passed with JDK v1.8.0_72. {color} | |
[jira] [Commented] (YARN-4723) NodesListManager$UnknownNodeId ClassCastException
[ https://issues.apache.org/jira/browse/YARN-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163294#comment-15163294 ] Kuhu Shukla commented on YARN-4723: --- The primary reason for this failure is the {{UnknownNodeId}} object. Even if we do not put this dummy nodeId in the active RMNodes, and instead put it in inactiveRMNodes, the transition from NEW to DECOMMISSIONED that makes the node unusable (NODE_UNUSABLE) will trigger a NODE_UPDATE, which would in turn populate the {{updatedNodes}} in the AllocateResponse. {code} @Override public void handle(NodesListManagerEvent event) { RMNode eventNode = event.getNode(); switch (event.getType()) { case NODE_UNUSABLE: LOG.debug(eventNode + " reported unusable"); unusableRMNodesConcurrentSet.add(eventNode); for(RMApp app: rmContext.getRMApps().values()) { if (!app.isAppFinalStateStored()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_UNUSABLE)); } } {code} That being said, we should not add the node to the active list, but the way to solve this problem is to get rid of UnknownNodeId and use anonymous classes to initialize these dummy nodes. For the unit test, I did call {{allocate}} for this scenario but that did not replicate the issue until I explicitly set the updatedNodes to an UnknownNodeId object. Asking [~jlowe], [~templedf] for comments and corrections. 
Excerpt from a sample test : {code} AllocateRequest allocateRequest = Records.newRecord(AllocateRequest.class); AllocateResponse resp = rmClient.allocate(allocateRequest); NodeReport report = new NodeReportPBImpl(); report.setNodeId(new NodesListManager.UnknownNodeId("host2")); List reports = new ArrayList(); reports.add(report); resp.setUpdatedNodes(reports); allocateRequest = Records.newRecord(AllocateRequest.class); YarnServiceProtos.AllocateResponseProto p = ((AllocateResponsePBImpl) resp).getProto(); {code} Proposed change in NodesListManager.java: {code} private void setDecomissionedNMs() { Set excludeList = hostsReader.getExcludedHosts(); for (final String host : excludeList) { NodeId nodeId = makeUnknownNodeId(host); RMNodeImpl rmNode = new RMNodeImpl(nodeId, rmContext, host, -1, -1, makeUnknownNode(host), null, null); rmContext.getInactiveRMNodes().putIfAbsent(rmNode.getNodeID().getHost(),rmNode); rmNode.handle(new RMNodeEvent(rmNode.getNodeID(), RMNodeEventType .DECOMMISSION)); } } {code} {code} Node makeUnknownNode(final String host) { return new Node() { @Override public String getNetworkLocation() { return null; } @Override public void setNetworkLocation(String location) { } @Override public String getName() { return host; } @Override public Node getParent() { return null; } @Override public void setParent(Node parent) { } @Override public int getLevel() { return 0; } @Override public void setLevel(int i) { } }; } {code} > NodesListManager$UnknownNodeId ClassCastException > - > > Key: YARN-4723 > URL: https://issues.apache.org/jira/browse/YARN-4723 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Jason Lowe >Assignee: Kuhu Shukla >Priority: Critical > > Saw the following in an RM log: > {noformat} > 2016-02-16 22:55:35,207 [IPC Server handler 5 on 8030] WARN ipc.Server: IPC > Server handler 5 on 8030, call > org.apache.hadoop.ipc.ProtobufRpcEngine$Server@6c403aff > java.lang.ClassCastException: 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId > cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:247) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:271) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:220) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:712) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:68) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.n
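The failure pattern in the stack trace above is generic to the protobuf record classes: the merge path casts every record to its PB implementation, so any other `NodeId` subclass (such as `UnknownNodeId`) fails the cast. A stripped-down, self-contained illustration — the stub classes here only stand in for the real Hadoop types:

```java
public class CastFailure {
    static abstract class NodeId { }
    static class NodeIdPBImpl extends NodeId { }
    // Stands in for NodesListManager.UnknownNodeId: a NodeId subclass
    // that is not backed by a protobuf implementation.
    static class UnknownNodeId extends NodeId { }

    // Mirrors the shape of NodeReportPBImpl.mergeLocalToBuilder, which
    // assumes every NodeId it sees is a NodeIdPBImpl.
    static void mergeLocalToBuilder(NodeId id) {
        NodeIdPBImpl pb = (NodeIdPBImpl) id; // ClassCastException for UnknownNodeId
    }

    public static void main(String[] args) {
        try {
            mergeLocalToBuilder(new UnknownNodeId());
            System.out.println("no exception");
        } catch (ClassCastException e) {
            System.out.println("ClassCastException"); // prints this branch
        }
    }
}
```

This is why the proposed fix keeps the dummy node out of any path that serializes a `NodeReport` back to protobuf.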
[jira] [Commented] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163282#comment-15163282 ] Jun Gong commented on YARN-4720: Thanks [~mingma] for the review. Attach a new patch to address above problems. > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. > if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
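A minimal, self-contained sketch of the guard the description calls for — only create the remote writer (and pay the create/getfileinfo/delete NN round-trips) when the cycle actually has containers to upload. The stub method below is illustrative, not the real {{AppLogAggregatorImpl}} code:

```java
import java.util.Collections;
import java.util.List;

public class AggregationCycle {
    // Returns true only when a remote LogWriter would actually be needed,
    // i.e. at least one container has logs pending in this cycle.
    static boolean shouldCreateWriter(List<String> pendingContainerInThisCycle) {
        // Empty cycle: skip writer creation entirely, avoiding the
        // unnecessary create/getfileinfo/delete NN calls described above.
        return !pendingContainerInThisCycle.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(shouldCreateWriter(Collections.emptyList()));
        System.out.println(shouldCreateWriter(
            java.util.Arrays.asList("container_01")));
    }
}
```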
[jira] [Updated] (YARN-4720) Skip unnecessary NN operations in log aggregation
[ https://issues.apache.org/jira/browse/YARN-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-4720: --- Attachment: YARN-4720.02.patch > Skip unnecessary NN operations in log aggregation > - > > Key: YARN-4720 > URL: https://issues.apache.org/jira/browse/YARN-4720 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ming Ma >Assignee: Jun Gong > Attachments: YARN-4720.01.patch, YARN-4720.02.patch > > > Log aggregation service could have unnecessary NN operations in the following > scenarios: > * No new local log has been created since the last upload for the long > running service scenario. > * NM uses {{ContainerLogAggregationPolicy}} that skips log aggregation for > certain containers. > In the following code snippet, even though {{pendingContainerInThisCycle}} is > empty, it still creates the writer and then removes the file later. Thus it > introduces unnecessary create/getfileinfo/delete NN calls when NM doesn't > aggregate logs for an app. > > {noformat} > AppLogAggregatorImpl.java > .. > writer = > new LogWriter(this.conf, this.remoteNodeTmpLogFileForApp, > this.userUgi); > .. > for (ContainerId container : pendingContainerInThisCycle) { > .. > } > .. > if (remoteFS.exists(remoteNodeTmpLogFileForApp)) { > if (rename) { > remoteFS.rename(remoteNodeTmpLogFileForApp, renamedPath); > } else { > remoteFS.delete(remoteNodeTmpLogFileForApp, false); > } > } > .. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163273#comment-15163273 ] Hadoop QA commented on YARN-4696: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 25s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage in trunk has 1 extant Findbugs warnings. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 53s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 4s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 1s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 50s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 32s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 3 new + 29 unchanged - 0 fixed = 32 total (was 29) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 9s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 24s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 49s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 0s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 53s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 54s {color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 45s {color} | {color:green} hadoop-yarn-server-timeline-pluginstorage in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 9s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v
[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably
[ https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163232#comment-15163232 ] Steve Loughran commented on YARN-4705: -- All we need to know is "does flush() write data back so that other code can eventually see it?" > ATS 1.5 parse pipeline to consider handling open() events recoverably > - > > Key: YARN-4705 > URL: https://issues.apache.org/jira/browse/YARN-4705 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Priority: Minor > > During one of my own timeline test runs, I've been seeing a stack trace > warning that the CRC check failed in Filesystem.open() file; something the FS > was ignoring. > Even though its swallowed (and probably not the cause of my test failure), > looking at the code in {{LogInfo.parsePath()}} that it considers a failure to > open a file as unrecoverable. > on some filesystems, this may not be the case, i.e. if its open for writing > it may not be available for reading; checksums maybe a similar issue. > Perhaps a failure at open() should be viewed as recoverable while the app is > still running? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably
[ https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163222#comment-15163222 ] jay vyas commented on YARN-4705: Ah, the GlusterFS consistency model? From my experience it's not strongly consistent all the time in all cases. I'd cc [~chenh] and @childsb as well on this ... they are currently working on these filesystems. > ATS 1.5 parse pipeline to consider handling open() events recoverably > - > > Key: YARN-4705 > URL: https://issues.apache.org/jira/browse/YARN-4705 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Priority: Minor > > During one of my own timeline test runs, I've been seeing a stack trace > warning that the CRC check failed in Filesystem.open() file; something the FS > was ignoring. > Even though its swallowed (and probably not the cause of my test failure), > looking at the code in {{LogInfo.parsePath()}} that it considers a failure to > open a file as unrecoverable. > on some filesystems, this may not be the case, i.e. if its open for writing > it may not be available for reading; checksums maybe a similar issue. > Perhaps a failure at open() should be viewed as recoverable while the app is > still running? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-556) [Umbrella] RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163215#comment-15163215 ] Johannes Zillmann commented on YARN-556: Mmh, I have a test cluster where the ResourceManager fails to start after a crash. Whether only the ResourceManager is started or the whole of YARN, we always get the following exception: {quote} 2016-02-24 15:37:22,474 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:recover(796)) - Recovering attempt: appattempt_1456252782760_0018_01 with final state: null 2016-02-24 15:37:22,474 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1456252782760_0018_01 2016-02-24 15:37:22,474 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1456252782760_0018_01 2016-02-24 15:37:22,474 INFO resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(670)) - Registering app attempt : appattempt_1456252782760_0018_01 2016-02-24 15:37:22,475 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(594)) - Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:734) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1089) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1038) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1002) at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:755) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:831) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:101) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:846) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:836) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:413) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1207) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:590) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1014) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1051) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1047) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1047) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(Res
[jira] [Commented] (YARN-4705) ATS 1.5 parse pipeline to consider handling open() events recoverably
[ https://issues.apache.org/jira/browse/YARN-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163205#comment-15163205 ] Steve Loughran commented on YARN-4705: -- YARN-4696 contains my current logic to handle failures to parse things: if the JSON parser fails, an info message is printed when we know the file is non-empty (i.e. either length > 0 or offset > 0). I think there are some possible race conditions in the code as-is; certainly FNFEs ought to downgrade to info. For other IOEs, I think they should be caught & logged per file, rather than stopping the entire scan loop. Otherwise bad permissions on one file would be enough to break the scanning. Regarding trying to work with Raw vs HDFS...I've not been able to get at raw, and am trying to disable caching in file://, but am close to accepting defeat and spinning up a single mini YARN cluster across all my test cases. That, or add a config option to turn off checksumming in localFS. The logic is there, but you can only set it in an FS instance, which must be used directly or propagated to the code-under-test via the FS cache. The local FS does work for picking up completed work; the problem is that, as flush() doesn't, it doesn't reliably read the updates of incomplete jobs. And when it does, unless the JSON is aligned on a buffer boundary, the parser is going to fail, which is going to lead to lots and lots of info messages, unless the logging is tuned further to only log if the last operation was not a failure. We only really need to worry about other cross-cluster filesystems for production use here. Single node with local FS? Use the 1.0 APIs. Production: a distributed FS which is required to implement flush() (even a delayed/async flush) if you want to see incomplete applications. I believe GlusterFS supports that, as does any POSIX FS if the checksum FS doesn't get in the way. What does [~jayunit100] have to say about his filesystem's consistency model? 
It will mean that the object stores, S3 and swift can't work as destinations for logs. They are dangerous anyway as if the app crashes before {{out.close()}} is called *all* data is lost. If we care about that, then you'd really want to write to an FS (local or HDFS) then copy to the blobstore for long-term histories. > ATS 1.5 parse pipeline to consider handling open() events recoverably > - > > Key: YARN-4705 > URL: https://issues.apache.org/jira/browse/YARN-4705 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Priority: Minor > > During one of my own timeline test runs, I've been seeing a stack trace > warning that the CRC check failed in Filesystem.open() file; something the FS > was ignoring. > Even though its swallowed (and probably not the cause of my test failure), > looking at the code in {{LogInfo.parsePath()}} that it considers a failure to > open a file as unrecoverable. > on some filesystems, this may not be the case, i.e. if its open for writing > it may not be available for reading; checksums maybe a similar issue. > Perhaps a failure at open() should be viewed as recoverable while the app is > still running? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
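The per-file failure policy sketched in the comment above — FNFE on a still-running app is recoverable and should be retried on the next scan, while other IOEs are logged for that file only so one bad file cannot stop the scan loop — can be written as a small, self-contained classifier (stub enum and names are illustrative, not the real {{LogInfo}} code):

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class RecoverableScan {
    enum Outcome { RETRY_LATER, SKIPPED }

    // Classify a per-file open()/parse failure: FileNotFoundException on a
    // still-running app is recoverable (the writer may not have made the
    // file visible yet), anything else is logged and skipped for this file
    // only, so the scan loop keeps going.
    static Outcome classify(IOException e, boolean appStillRunning) {
        if (e instanceof FileNotFoundException && appStillRunning) {
            return Outcome.RETRY_LATER;
        }
        return Outcome.SKIPPED;
    }

    public static void main(String[] args) {
        System.out.println(classify(new FileNotFoundException("f"), true));  // RETRY_LATER
        System.out.println(classify(new IOException("permission denied"), true)); // SKIPPED
        System.out.println(classify(new FileNotFoundException("f"), false)); // SKIPPED
    }
}
```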
[jira] [Commented] (YARN-4680) TimerTasks leak in ATS V1.5 Writer
[ https://issues.apache.org/jira/browse/YARN-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163192#comment-15163192 ] Junping Du commented on YARN-4680: -- +1 on latest patch. [~gtCarrera9], please go ahead and commit this patch to trunk, branch-2 and branch-2.8. > TimerTasks leak in ATS V1.5 Writer > -- > > Key: YARN-4680 > URL: https://issues.apache.org/jira/browse/YARN-4680 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-4680.1.patch, YARN-4680.20160108.patch, > YARN-4680.20160109.patch, YARN-4680.20160222.patch > > > We have seen TimerTasks leak which could cause the application server to go down (such > as the Oozie server going down due to too many active threads). > Although we have fixed some potential leak situations at the upper application > level, such as > https://issues.apache.org/jira/browse/MAPREDUCE-6618 > https://issues.apache.org/jira/browse/MAPREDUCE-6621, we still cannot > guarantee that we fixed the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4680) TimerTasks leak in ATS V1.5 Writer
[ https://issues.apache.org/jira/browse/YARN-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4680: - Target Version/s: 2.8.0 > TimerTasks leak in ATS V1.5 Writer > -- > > Key: YARN-4680 > URL: https://issues.apache.org/jira/browse/YARN-4680 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-4680.1.patch, YARN-4680.20160108.patch, > YARN-4680.20160109.patch, YARN-4680.20160222.patch > > > We have seen TimerTasks leak which could cause the application server to go down (such > as the Oozie server going down due to too many active threads). > Although we have fixed some potential leak situations at the upper application > level, such as > https://issues.apache.org/jira/browse/MAPREDUCE-6618 > https://issues.apache.org/jira/browse/MAPREDUCE-6621, we still cannot > guarantee that we fixed the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-4696: - Attachment: YARN-4696-008.patch Patch -008. This removes a subclass of RawLocalFileSystem that I'd been trying to instantiate directly. That doesn't work...I won't go into the details. Note also that patch -007 # has the code to remember the cache option before the {{FileSystemTimelineWriter}} gets a file, and restores it after # has commented out the entire action of disabling the cache. Why #2? It's to try to get a local FS with checksumming disabled picked up in test cases. I've not got that working. Why #1? Because some other part of the JVM may want caching, and so they won't want this class disabling it for them. I'm assuming that the caching was disabled to ensure that if this class closed the FS instance, it wouldn't affect other users of a shared instance; if so, the solution there is: don't close the FS when the service is stopped. We can rely on Hadoop itself to stop all filesystems in JVM shutdown. Of course, if the concern is that it's other bits of code closing the FS, that's harder. In such a case, if I do manage to get my local FS test working, then we may need a test-time option to not disable the cache. > EntityGroupFSTimelineStore to work in the absence of an RM > -- > > Key: YARN-4696 > URL: https://issues.apache.org/jira/browse/YARN-4696 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-4696-001.patch, YARN-4696-002.patch, > YARN-4696-003.patch, YARN-4696-005.patch, YARN-4696-006.patch, > YARN-4696-007.patch, YARN-4696-008.patch > > > {{EntityGroupFSTimelineStore}} now depends on an RM being up and running; the > configuration pointing to it. This is a new change, and impacts testing where > you have historically been able to test without an RM running. 
> The sole purpose of the probe is to automatically determine if an app is > running; it falls back to "unknown" if not. If the RM connection was > optional, the "unknown" codepath could be called directly, relying on age of > file as a metric of completion > Options > # add a flag to disable RM connect > # skip automatically if RM not defined/set to 0.0.0.0 > # disable retries on yarn client IPC; if it fails, tag app as unknown. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
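For anyone reproducing the test setup discussed above: Hadoop's FileSystem.get() honours a per-scheme cache-disable switch, `fs.<scheme>.impl.disable.cache`, so a test-time override for the local scheme would be a core-site.xml fragment like the one below. Whether the final patch toggles this knob (or does so in code via a Configuration object) is exactly what the comment leaves open; this is only the standard mechanism, not a description of the patch.

```xml
<!-- Bypass the shared FileSystem cache for file:// URIs in tests, so each
     FileSystem.get() call returns a fresh, independently configurable
     instance (e.g. one with checksumming disabled). -->
<property>
  <name>fs.file.impl.disable.cache</name>
  <value>true</value>
</property>
```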
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-4696: - Attachment: YARN-4696-007.patch Patch 007: files that are stat-ed as empty are not skipped, but no attempt is made to log a parse problem if the length is 0 and no data has ever been read from it before (i.e. offset=0). > EntityGroupFSTimelineStore to work in the absence of an RM > -- > > Key: YARN-4696 > URL: https://issues.apache.org/jira/browse/YARN-4696 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: YARN-4696-001.patch, YARN-4696-002.patch, > YARN-4696-003.patch, YARN-4696-005.patch, YARN-4696-006.patch, > YARN-4696-007.patch > > > {{EntityGroupFSTimelineStore}} now depends on an RM being up and running; the > configuration pointing to it. This is a new change, and impacts testing where > you have historically been able to test without an RM running. > The sole purpose of the probe is to automatically determine if an app is > running; it falls back to "unknown" if not. If the RM connection was > optional, the "unknown" codepath could be called directly, relying on age of > file as a metric of completion > Options > # add a flag to disable RM connect > # skip automatically if RM not defined/set to 0.0.0.0 > # disable retries on yarn client IPC; if it fails, tag app as unknown. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4630) Remove useless boxing/unboxing code (Hadoop YARN)
[ https://issues.apache.org/jira/browse/YARN-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163144#comment-15163144 ] Akira AJISAKA commented on YARN-4630: - Found an unnecessary repeated call of {{ApplicationAttemptId#compareTo}}.
{code:title=ContainerId.java}
public int compareTo(ContainerId other) {
  if (this.getApplicationAttemptId().compareTo(
      other.getApplicationAttemptId()) == 0) {
    return Long.compare(getContainerId(), other.getContainerId());
  } else {
    return this.getApplicationAttemptId().compareTo(
        other.getApplicationAttemptId());
  }
}
{code}
Hi [~sarutak], would you keep the value of {{this.getApplicationAttemptId().compareTo(other.getApplicationAttemptId())}} and reuse it as follows?
{code}
public int compareTo(ContainerId other) {
  int result = this.getApplicationAttemptId().compareTo(
      other.getApplicationAttemptId());
  if (result == 0) {
    return Long.compare(getContainerId(), other.getContainerId());
  } else {
    return result;
  }
}
{code}
> Remove useless boxing/unboxing code (Hadoop YARN) > - > > Key: YARN-4630 > URL: https://issues.apache.org/jira/browse/YARN-4630 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Priority: Minor > Attachments: YARN-4630.0.patch > > > There are lots of places where useless boxing/unboxing occur. > To avoid performance issues, let's remove them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
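The refactoring suggested in the comment above (compute the attempt comparison once and reuse it) can be exercised outside YARN with a minimal stand-in. The attempt and container ids are modelled here as plain longs, since the real ApplicationAttemptId class is not needed to show the point; the class name is illustrative.

```java
// Minimal stand-in for the proposed ContainerId#compareTo: the attempt
// comparison result is cached and reused instead of being computed twice,
// and Long.compare is used so no boxing occurs.
public class ContainerIdCompare {
    static int compare(long attemptA, long containerA,
                       long attemptB, long containerB) {
        // Compute the (potentially expensive) attempt comparison only once.
        int result = Long.compare(attemptA, attemptB);
        if (result == 0) {
            // Same attempt: order by container id, still without boxing.
            return Long.compare(containerA, containerB);
        }
        return result;
    }
}
```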
[jira] [Commented] (YARN-4630) Remove useless boxing/unboxing code (Hadoop YARN)
[ https://issues.apache.org/jira/browse/YARN-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163125#comment-15163125 ] Akira AJISAKA commented on YARN-4630: - bq. can I check this since it seems to include the changes against ContainerId? Okay.
{code:title=ContainerId.java}
 public int compareTo(ContainerId other) {
   if (this.getApplicationAttemptId().compareTo(
       other.getApplicationAttemptId()) == 0) {
-    return Long.valueOf(getContainerId())
-        .compareTo(Long.valueOf(other.getContainerId()));
+    return Long.compare(getContainerId(), other.getContainerId());
{code}
IMO, the change is safe since it only removes unnecessary boxing/unboxing. > Remove useless boxing/unboxing code (Hadoop YARN) > - > > Key: YARN-4630 > URL: https://issues.apache.org/jira/browse/YARN-4630 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Priority: Minor > Attachments: YARN-4630.0.patch > > > There are lots of places where useless boxing/unboxing occur. > To avoid performance issues, let's remove them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4630) Remove useless boxing/unboxing code (Hadoop YARN)
[ https://issues.apache.org/jira/browse/YARN-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163046#comment-15163046 ] Tsuyoshi Ozawa commented on YARN-4630: -- Hey Akira, can I check this since it seems to include the changes against ContainerId? It has an impact against RM-HA. > Remove useless boxing/unboxing code (Hadoop YARN) > - > > Key: YARN-4630 > URL: https://issues.apache.org/jira/browse/YARN-4630 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Priority: Minor > Attachments: YARN-4630.0.patch > > > There are lots of places where useless boxing/unboxing occur. > To avoid performance issues, let's remove them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-4630) Remove useless boxing/unboxing code (Hadoop YARN)
[ https://issues.apache.org/jira/browse/YARN-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163046#comment-15163046 ] Tsuyoshi Ozawa edited comment on YARN-4630 at 2/24/16 2:15 PM: --- Hey Akira, can I check this since it seems to include the changes against ContainerId? It has an impact against RM-HA. was (Author: ozawa): Hey Akira, can I check this since it seems to include the changes againstContainerId? It has an impact against RM-HA. > Remove useless boxing/unboxing code (Hadoop YARN) > - > > Key: YARN-4630 > URL: https://issues.apache.org/jira/browse/YARN-4630 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Priority: Minor > Attachments: YARN-4630.0.patch > > > There are lots of places where useless boxing/unboxing occur. > To avoid performance issue, let's remove them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3863) Support complex filters in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162920#comment-15162920 ] Varun Saxena commented on YARN-3863: Thanks [~sjlee0] for the comments. bq. If I'm reading this right, the key changes seem to be in TimelineStorageUtils The changes in TimelineStorageUtils would primarily be used by the FS implementation, because in the FS implementation filters are applied locally. The major change from an HBase implementation perspective is the xxxEntityReader classes, where we create a filter list based on the filters. However, for relation filters and event filters we cannot create an HBase filter to filter out rows, because of the way relations and events are stored. So the logic for relations and events is to fetch only the required columns (as required by the filters) if those fields are not to be retrieved; I am basically trying to trim down the data brought over from the backend. For relations and events, filters are then applied locally (even for the HBase storage implementation). For other filters, in the HBase implementation we no longer apply filters locally and it is all handled through HBase filters. Sorry for missing out on adding detailed comments in TimelineStorageUtils. I agree the code can be refactored there to make it more readable. bq. Also, these methods seem to have similar code. Any possibility of refactoring the common logic? Yes, the code is similar: we loop over a filter list and then check the operator while processing each individual filter. I thought about it, but the issue in moving it into a common area is that the data structures which hold events, configs, metrics, etc. are not the same. We can, however, pass the TimelineEntity object itself into a common function (for all filters) along with something like an enum indicating what kind of filter we intend to match (named something like TimelineEntityFiltersType).
Then, based on this enum value, get the appropriate item (configs, metrics, etc.) from the passed entity. This way we can move the common logic to a single method, which can in turn call the appropriate method to process based on filter type (say equality filter, multivalue equality filter, etc.). Does this sound fine? > Support complex filters in TimelineReader > - > > Key: YARN-3863 > URL: https://issues.apache.org/jira/browse/YARN-3863 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Labels: yarn-2928-1st-milestone > Attachments: YARN-3863-YARN-2928.v2.01.patch, > YARN-3863-YARN-2928.v2.02.patch, YARN-3863-feature-YARN-2928.wip.003.patch, > YARN-3863-feature-YARN-2928.wip.01.patch, > YARN-3863-feature-YARN-2928.wip.02.patch, > YARN-3863-feature-YARN-2928.wip.04.patch, > YARN-3863-feature-YARN-2928.wip.05.patch > > > Currently filters in timeline reader will return an entity only if all the > filter conditions hold true, i.e. only the AND operation is supported. We can > support the OR operation for the filters as well. Additionally, as the primary > backend implementation is HBase, we can design our filters in a manner where > they closely resemble HBase Filters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
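The refactoring floated in the comment above (pass the entity plus an enum such as the suggested TimelineEntityFiltersType, then dispatch to the right collection) could look roughly like the sketch below. The entity class is a simplified stand-in for TimelineEntity, and all names other than the proposed enum are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Rough sketch of the proposed common filter-matching path: one entry point
// takes the entity and a filter-type enum, and dispatches to the collection
// that the filter applies to, so the equality logic lives in one place.
public class FilterMatchSketch {
    enum TimelineEntityFiltersType { CONFIG, METRIC, EVENT }

    // Simplified stand-in for TimelineEntity.
    static class Entity {
        Map<String, String> configs = new HashMap<>();
        Map<String, Object> metrics = new HashMap<>();
        Set<String> eventIds = new HashSet<>();
    }

    // Common equality check: the enum decides which collection to consult,
    // so each per-collection loop no longer needs its own copy of the logic.
    static boolean matchEquality(Entity e, TimelineEntityFiltersType type,
                                 String key, Object expected) {
        switch (type) {
            case CONFIG:
                return expected != null && expected.equals(e.configs.get(key));
            case METRIC:
                return expected != null && expected.equals(e.metrics.get(key));
            case EVENT:
                // Event filters only check for the event's presence.
                return e.eventIds.contains(key);
            default:
                return false;
        }
    }
}
```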
[jira] [Commented] (YARN-4333) Fair scheduler should support preemption within queue
[ https://issues.apache.org/jira/browse/YARN-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162904#comment-15162904 ] Ashwin Shankar commented on YARN-4333: -- I'm out of town until end of this week. Will take a look when I get back. Thanks! > Fair scheduler should support preemption within queue > - > > Key: YARN-4333 > URL: https://issues.apache.org/jira/browse/YARN-4333 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.6.0 >Reporter: Tao Jie >Assignee: Tao Jie > Attachments: YARN-4333.001.patch, YARN-4333.002.patch, > YARN-4333.003.patch > > > Now each app in the fair scheduler is allocated its fairshare; however, fairshare > resources are not ensured even if fairSharePreemption is enabled. > Consider: > 1. When the cluster is idle, we submit app1 to queueA, which takes the maxResource > of queueA. > 2. Then the cluster becomes busy, but app1 does not release any resources, so > queueA's resource usage is over its fairshare. > 3. Then we submit app2 (maybe with higher priority) to queueA. Now app2 has > its own fairshare, but cannot obtain any resources, since queueA is still > over its fairshare and resources will not be assigned to queueA anymore. Also, > preemption is not triggered in this case. > So we should allow preemption within a queue, when an app is starved for fairshare. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4630) Remove useless boxing/unboxing code (Hadoop YARN)
[ https://issues.apache.org/jira/browse/YARN-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162897#comment-15162897 ] Hadoop QA commented on YARN-4630: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 51s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 58s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 12s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 8s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 51s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 36s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | 
{color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 3s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 39s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 51s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 6s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 6s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 2s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 26s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 7m 7s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 24s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 5m 0s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 55s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 3s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 21s {color} | {color:green} hadoop-yarn-server-web-proxy in the patch passed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m 22s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 64m 40s {color} | {color:red} hadoop-yar
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162794#comment-15162794 ] Junping Du commented on YARN-1489: -- bq. That and the "Old running containers don't know where the new AM is running." issue is big enough that we shouldn't close this umbrella as done. I don't think we have an open JIRA under this umbrella to track this issue. Is this a specific issue for MR (like we discussed on MAPREDUCE-6608) or a generic issue for other frameworks (Spark, etc.) too? YARN-4602 was filed to track this issue as a generic problem for message passing between containers. > [Umbrella] Work-preserving ApplicationMaster restart > > > Key: YARN-1489 > URL: https://issues.apache.org/jira/browse/YARN-1489 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > Attachments: Work preserving AM restart.pdf > > > Today if AMs go down, > - RM kills all the containers of that ApplicationAttempt > - New ApplicationAttempt doesn't know where the previous containers are > running > - Old running containers don't know where the new AM is running. > We need to fix this to enable work-preserving AM restart. The latter two > potentially can be done at the app level, but it is good to have a common > solution for all apps where-ever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4729) SchedulerApplicationAttempt#getTotalRequiredResources can throw an NPE
[ https://issues.apache.org/jira/browse/YARN-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162702#comment-15162702 ] Hadoop QA commented on YARN-4729: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 47s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 11s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 36s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 19s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 150m 19s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:0ca8df7 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12789374/yarn-4729.patch
[jira] [Commented] (YARN-4634) Scheduler UI/Metrics need to consider cases like non-queue label mappings
[ https://issues.apache.org/jira/browse/YARN-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162699#comment-15162699 ] Hadoop QA commented on YARN-4634: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 41s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | 
{color:green} javadoc {color} | {color:green} 0m 25s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 18s {color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: patch generated 2 new + 200 unchanged - 0 fixed = 202 total (was 200) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.8.0_72 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 0s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_72. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 11s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 18s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 149m 24s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_72 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | JDK v1.7.0_95 Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | \\ \\ || Subsystem || Report/Notes || |