[jira] [Commented] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614391#comment-16614391
 ] 

Hadoop QA commented on YARN-8771:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
30s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  3s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  5s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 79m 
13s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
36s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}138m 28s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8771 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12939643/YARN-8771.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux c61ad89506f8 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ef5c776 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21835/testReport/ |
| Max. process+thread count | 861 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21835/console |

[jira] [Commented] (YARN-8720) CapacityScheduler does not enforce yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb/vcores when configured

2018-09-13 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614361#comment-16614361
 ] 

Weiwei Yang commented on YARN-8720:
---

Thanks [~tarunparimi], if this is already handled, I am fine with the patch. +1

Thanks!

> CapacityScheduler does not enforce 
> yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb/vcores when 
> configured
> 
>
> Key: YARN-8720
> URL: https://issues.apache.org/jira/browse/YARN-8720
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8720.001.patch, YARN-8720.002.patch
>
>
> The value of 
> yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb/vcores is not 
> strictly enforced when applications request containers. An 
> InvalidResourceRequestException is thrown only when the ResourceRequest is 
> greater than the global value of yarn.scheduler.maximum-allocation-mb/vcores. 
> So for an example configuration such as the one below,
>  
> {code:java}
> yarn.scheduler.maximum-allocation-mb=4096
> yarn.scheduler.capacity.root.test.maximum-allocation-mb=2048
> {code}
>  
> The below DSShell command runs successfully and asks for an AM container of 
> size 4096MB, which is greater than the max of 2048MB configured for the test queue.
> {code:java}
> yarn jar $YARN_HOME/hadoop-yarn-applications-distributedshell.jar 
> -num_containers 1 -jar 
> $YARN_HOME/hadoop-yarn-applications-distributedshell.jar -shell_command 
> "sleep 60" -container_memory=4096 -master_memory=4096 -queue=test{code}
> Instead, the application should not be launched and should fail with an 
> InvalidResourceRequestException. The child container, however, will be 
> requested with size 2048MB because the DSShell AppMaster performs the below 
> check before making the ResourceRequest ask to the RM.
> {code:java}
> // A resource ask cannot exceed the max.
> if (containerMemory > maxMem) {
>  LOG.info("Container memory specified above max threshold of cluster."
>  + " Using max value." + ", specified=" + containerMemory + ", max="
>  + maxMem);
>  containerMemory = maxMem;
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8759) Copy of "resource-types.xml" is not deleted if test fails, causes other test failures

2018-09-13 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614353#comment-16614353
 ] 

Manikandan R commented on YARN-8759:


[~bsteinbach] Thanks for the patch.

A couple of minor nits:
 # Do we need to set the *File variable to null in TestClientRMService#tearDown? If 
yes, can we do the same in the other class as well?
 # Can we add the same checks in TestRMAdminCLI#tearDown as well? (A minimal 
sketch of such a tearDown is below.)
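
A minimal sketch of what I have in mind, assuming the test keeps the copied file 
in a field (the name resourceTypesFile below is hypothetical):
{code:java}
import java.io.File;
import org.junit.After;

public class ResourceTypesCleanupSketch {
  // Hypothetical field holding the copy of resource-types.xml made during setup.
  private File resourceTypesFile;

  @After
  public void tearDown() {
    // @After runs even when the test body fails, so the stray copy cannot
    // leak into later tests and break them with the wrong configuration.
    if (resourceTypesFile != null && resourceTypesFile.exists()) {
      resourceTypesFile.delete();
    }
    resourceTypesFile = null;
  }
}
{code}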

> Copy of "resource-types.xml" is not deleted if test fails, causes other test 
> failures
> -
>
> Key: YARN-8759
> URL: https://issues.apache.org/jira/browse/YARN-8759
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Antal Bálint Steinbach
>Assignee: Antal Bálint Steinbach
>Priority: Major
> Attachments: YARN-8759.001.patch, YARN-8759.002.patch, 
> YARN-8759.003.patch
>
>
> resource-types.xml is copied to the test machine in several tests, but it is 
> deleted only at the end of the test. If a test fails, the file is not deleted 
> and other tests fail because of the wrong configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8720) CapacityScheduler does not enforce yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb/vcores when configured

2018-09-13 Thread Tarun Parimi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614334#comment-16614334
 ] 

Tarun Parimi commented on YARN-8720:


Hi [~sunilg], [~cheersyang] 

Thanks a lot for taking a look at the patch. I have now checked that if 
queue_max_allocation > global_max_allocation is set, then a queue refresh / RM 
restart fails because the following check is already present in 
{{CapacitySchedulerConfiguration#getMaximumAllocationPerQueue}}:
{code:java}
if (maxAllocationMbPerQueue > clusterMax.getMemorySize()
 || maxAllocationVcoresPerQueue > clusterMax.getVirtualCores()) {
 throw new IllegalArgumentException(
 "Queue maximum allocation cannot be larger than the cluster setting"
 + " for queue " + queue
 + " max allocation per queue: " + result
 + " cluster setting: " + clusterMax);
}{code}
 

> CapacityScheduler does not enforce 
> yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb/vcores when 
> configured
> 
>
> Key: YARN-8720
> URL: https://issues.apache.org/jira/browse/YARN-8720
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8720.001.patch, YARN-8720.002.patch
>
>
> The value of 
> yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb/vcores is not 
> strictly enforced when applications request containers. An 
> InvalidResourceRequestException is thrown only when the ResourceRequest is 
> greater than the global value of yarn.scheduler.maximum-allocation-mb/vcores. 
> So for an example configuration such as the one below,
>  
> {code:java}
> yarn.scheduler.maximum-allocation-mb=4096
> yarn.scheduler.capacity.root.test.maximum-allocation-mb=2048
> {code}
>  
> The below DSShell command runs successfully and asks for an AM container of 
> size 4096MB, which is greater than the max of 2048MB configured for the test queue.
> {code:java}
> yarn jar $YARN_HOME/hadoop-yarn-applications-distributedshell.jar 
> -num_containers 1 -jar 
> $YARN_HOME/hadoop-yarn-applications-distributedshell.jar -shell_command 
> "sleep 60" -container_memory=4096 -master_memory=4096 -queue=test{code}
> Instead, the application should not be launched and should fail with an 
> InvalidResourceRequestException. The child container, however, will be 
> requested with size 2048MB because the DSShell AppMaster performs the below 
> check before making the ResourceRequest ask to the RM.
> {code:java}
> // A resource ask cannot exceed the max.
> if (containerMemory > maxMem) {
>  LOG.info("Container memory specified above max threshold of cluster."
>  + " Using max value." + ", specified=" + containerMemory + ", max="
>  + maxMem);
>  containerMemory = maxMem;
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8720) CapacityScheduler does not enforce yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb/vcores when configured

2018-09-13 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614297#comment-16614297
 ] 

Weiwei Yang commented on YARN-8720:
---

Hi [~tarunparimi], [~sunilg]

Should we take a min(global_max_allocation, queue_max_allocation) and enforce 
the check against this value? That ensures a request cannot violate either 
limit. In other words, if the config looks like the following
 * Global: 2048mb
 * Queue root.test: 3072mb

a request asking for 3072mb on queue root.test should be rejected. Please let me 
know if this makes sense.
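
For what it's worth, a minimal sketch of that min-based check (class and method 
names are mine and purely illustrative; this is not the actual patch):
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException;

// Sketch: clamp the enforced limit to the smaller of the global and per-queue
// maximum allocations, then reject any ask that exceeds it in any dimension.
public final class EffectiveMaxAllocationSketch {
  public static void validate(Resource ask, Resource globalMax, Resource queueMax)
      throws InvalidResourceRequestException {
    long maxMem = Math.min(globalMax.getMemorySize(), queueMax.getMemorySize());
    int maxVcores = Math.min(globalMax.getVirtualCores(),
        queueMax.getVirtualCores());
    if (ask.getMemorySize() > maxMem || ask.getVirtualCores() > maxVcores) {
      throw new InvalidResourceRequestException("Requested " + ask
          + " exceeds effective maximum <memory: " + maxMem
          + ", vCores: " + maxVcores + ">");
    }
  }
}
{code}
With the example above, min(2048mb, 3072mb) = 2048mb, so a 3072mb ask on queue 
root.test would be rejected.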

Thanks

> CapacityScheduler does not enforce 
> yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb/vcores when 
> configured
> 
>
> Key: YARN-8720
> URL: https://issues.apache.org/jira/browse/YARN-8720
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8720.001.patch, YARN-8720.002.patch
>
>
> The value of 
> yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb/vcores is not 
> strictly enforced when applications request containers. An 
> InvalidResourceRequestException is thrown only when the ResourceRequest is 
> greater than the global value of yarn.scheduler.maximum-allocation-mb/vcores. 
> So for an example configuration such as the one below,
>  
> {code:java}
> yarn.scheduler.maximum-allocation-mb=4096
> yarn.scheduler.capacity.root.test.maximum-allocation-mb=2048
> {code}
>  
> The below DSShell command runs successfully and asks for an AM container of 
> size 4096MB, which is greater than the max of 2048MB configured for the test queue.
> {code:java}
> yarn jar $YARN_HOME/hadoop-yarn-applications-distributedshell.jar 
> -num_containers 1 -jar 
> $YARN_HOME/hadoop-yarn-applications-distributedshell.jar -shell_command 
> "sleep 60" -container_memory=4096 -master_memory=4096 -queue=test{code}
> Instead, the application should not be launched and should fail with an 
> InvalidResourceRequestException. The child container, however, will be 
> requested with size 2048MB because the DSShell AppMaster performs the below 
> check before making the ResourceRequest ask to the RM.
> {code:java}
> // A resource ask cannot exceed the max.
> if (containerMemory > maxMem) {
>  LOG.info("Container memory specified above max threshold of cluster."
>  + " Using max value." + ", specified=" + containerMemory + ", max="
>  + maxMem);
>  containerMemory = maxMem;
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-13 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Attachment: YARN-8771.002.patch

> CapacityScheduler fails to unreserve when cluster resource contains empty 
> resource type
> ---
>
> Key: YARN-8771
> URL: https://issues.apache.org/jira/browse/YARN-8771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8771.001.patch, YARN-8771.002.patch
>
>
> We found this problem when the cluster was almost, but not fully, exhausted 
> (93% used): the scheduler kept allocating for an app but always failed to 
> commit, which can block requests from other apps and leave part of the 
> cluster resource unused.
> To reproduce this problem:
> (1) use DominantResourceCalculator
> (2) the cluster resource has an empty resource type, for example gpu=0
> (3) the scheduler allocates a container for app1, which has reserved 
> containers and whose queue limit or user limit is reached (used + required > limit). 
> Reference codes in RegularContainerAllocator#assignContainer:
> {code:java}
> boolean needToUnreserve =
> Resources.greaterThan(rc, clusterResource,
> resourceNeedToUnReserve, Resources.none());
> {code}
> The value of resourceNeedToUnReserve can be <8GB, -6 cores, 0 gpu>, and the 
> result of {{Resources#greaterThan}} will then be false when using 
> DominantResourceCalculator.
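
For illustration only (plain Java, not the YARN API and not the patch): the 
description says the dominant-resource comparison of <8GB, -6 cores, 0 gpu> 
against an empty resource comes out false when the cluster has a zero-capacity 
type, whereas a simple componentwise view of the same vector still has a 
positive component that should trigger unreserving:
{code:java}
// Toy componentwise view of the vector from the description. It only shows
// that <8GB, -6 cores, 0 gpu> has a positive component (the 8GB of memory),
// which is the signal the unreserve path is looking for.
public class UnreserveSignalSketch {
  static boolean anyComponentPositive(long[] resourceNeedToUnReserve) {
    for (long component : resourceNeedToUnReserve) {
      if (component > 0) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    long[] needToUnreserve = {8L * 1024, -6, 0}; // memory (MB), vcores, gpu
    System.out.println(anyComponentPositive(needToUnreserve)); // prints true
  }
}
{code}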



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8729) Node status updater thread could be lost after it is restarted

2018-09-13 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614289#comment-16614289
 ] 

Tao Yang commented on YARN-8729:


Thanks [~cheersyang], [~ebadger] !

> Node status updater thread could be lost after it is restarted
> --
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 2.9.2, 3.0.4, 3.1.2, 2.8.6
>
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed 
> after the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the 
> isStopped flag is not set back to false before {{statusUpdater.start()}} is 
> executed, so if the newly started thread immediately sees isStopped==true, it 
> will exit without any log.
> The key code in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}
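
As a self-contained illustration of the race described above (a toy, not the 
actual NodeStatusUpdaterImpl code):
{code:java}
// Toy reproduction of the ordering bug: if the worker checks a stop flag that
// is cleared only after start(), the restarted worker can observe the stale
// value and exit silently, which is how the updater thread gets "lost".
public class RestartOrderingSketch {
  private volatile boolean isStopped = true; // true while the old thread is stopping

  void restartBuggy() {
    Thread worker = new Thread(() -> {
      if (isStopped) {
        return; // exits silently; the thread is lost
      }
      // ... heartbeat loop would run here ...
    });
    worker.start();     // the worker may check the flag before the next line runs
    isStopped = false;  // too late for a worker that already saw isStopped == true
  }

  void restartFixed() {
    isStopped = false;  // clear the flag before starting the new thread
    Thread worker = new Thread(() -> {
      if (!isStopped) {
        // ... heartbeat loop would run here ...
      }
    });
    worker.start();
  }

  public static void main(String[] args) {
    new RestartOrderingSketch().restartBuggy();
    new RestartOrderingSketch().restartFixed();
  }
}
{code}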



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8734) Readiness check for remote service

2018-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614246#comment-16614246
 ] 

Hadoop QA commented on YARN-8734:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
14s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  0s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
34s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
10s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 17s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services: 
The patch generated 6 new + 33 unchanged - 0 fixed = 39 total (was 33) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 36s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 12m 
21s{color} | {color:green} hadoop-yarn-services-core in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
50s{color} | {color:green} hadoop-yarn-services-api in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 71m 13s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8734 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12939633/YARN-8734.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux cbd2979bb521 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision 

[jira] [Commented] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period

2018-09-13 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614226#comment-16614226
 ] 

Chandni Singh commented on YARN-8706:
-

[~eyang] Thanks for looking at it. Yes, I should not have replaced it. It was 
just a small change, so I overwrote the patch file. 

> DelayedProcessKiller is executed for Docker containers even though docker 
> stop sends a KILL signal after the specified grace period
> ---
>
> Key: YARN-8706
> URL: https://issues.apache.org/jira/browse/YARN-8706
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Fix For: 3.2.0
>
> Attachments: YARN-8706.001.patch, YARN-8706.002.patch, 
> YARN-8706.003.patch, YARN-8706.004.patch, YARN-8706.addendum.001.patch
>
>
> {{DockerStopCommand}} adds a grace period of 10 seconds.
> 10 seconds is also the default grace time used by docker stop
>  [https://docs.docker.com/engine/reference/commandline/stop/]
> Documentation of docker stop:
> {quote}the main process inside the container will receive {{SIGTERM}}, and 
> after a grace period, {{SIGKILL}}.
> {quote}
> There is a {{DelayedProcessKiller}} in {{ContainerExecutor}} which executes 
> for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By 
> default this is set to {{250 milliseconds}}, so irrespective of the 
> container type it will always get executed.
>  
> For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} 
> after the grace period:
> - when sleepDelayBeforeSigKill > 10 seconds, there is no point in 
> executing DelayedProcessKiller
> - when sleepDelayBeforeSigKill < 1 second, the grace period should be 
> the smallest value, which is 1 second, because we are forcing a kill 
> after 250 ms anyway
>  
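
A small sketch of the grace-period rule described above (illustrative only, not 
the committed change):
{code:java}
// Derive the docker stop grace period (in whole seconds) from
// sleepDelayBeforeSigKill, never going below the 1-second minimum mentioned
// above, so docker stop itself delivers the SIGKILL instead of relying on
// DelayedProcessKiller.
public class DockerGracePeriodSketch {
  static int gracePeriodSeconds(long sleepDelayBeforeSigKillMs) {
    return (int) Math.max(1, sleepDelayBeforeSigKillMs / 1000);
  }

  public static void main(String[] args) {
    System.out.println(gracePeriodSeconds(250));    // 1 (clamped up to 1 second)
    System.out.println(gracePeriodSeconds(15_000)); // 15
  }
}
{code}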



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period

2018-09-13 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614215#comment-16614215
 ] 

Eric Yang commented on YARN-8706:
-

[~csingh] +1 for addendum patch 001.  It would be nice not to replace a patch 
after it has been posted; that helps reviewers know which version of the patch 
was reviewed.  I will commit tomorrow if no other issues are found.

> DelayedProcessKiller is executed for Docker containers even though docker 
> stop sends a KILL signal after the specified grace period
> ---
>
> Key: YARN-8706
> URL: https://issues.apache.org/jira/browse/YARN-8706
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Fix For: 3.2.0
>
> Attachments: YARN-8706.001.patch, YARN-8706.002.patch, 
> YARN-8706.003.patch, YARN-8706.004.patch, YARN-8706.addendum.001.patch
>
>
> {{DockerStopCommand}} adds a grace period of 10 seconds.
> 10 seconds is also the default grace time used by docker stop
>  [https://docs.docker.com/engine/reference/commandline/stop/]
> Documentation of docker stop:
> {quote}the main process inside the container will receive {{SIGTERM}}, and 
> after a grace period, {{SIGKILL}}.
> {quote}
> There is a {{DelayedProcessKiller}} in {{ContainerExecutor}} which executes 
> for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By 
> default this is set to {{250 milliseconds}}, so irrespective of the 
> container type it will always get executed.
>  
> For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} 
> after the grace period:
> - when sleepDelayBeforeSigKill > 10 seconds, there is no point in 
> executing DelayedProcessKiller
> - when sleepDelayBeforeSigKill < 1 second, the grace period should be 
> the smallest value, which is 1 second, because we are forcing a kill 
> after 250 ms anyway
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period

2018-09-13 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8706:

Attachment: YARN-8706.addendum.001.patch

> DelayedProcessKiller is executed for Docker containers even though docker 
> stop sends a KILL signal after the specified grace period
> ---
>
> Key: YARN-8706
> URL: https://issues.apache.org/jira/browse/YARN-8706
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Fix For: 3.2.0
>
> Attachments: YARN-8706.001.patch, YARN-8706.002.patch, 
> YARN-8706.003.patch, YARN-8706.004.patch, YARN-8706.addendum.001.patch
>
>
> {{DockerStopCommand}} adds a grace period of 10 seconds.
> 10 seconds is also the default grace time used by docker stop
>  [https://docs.docker.com/engine/reference/commandline/stop/]
> Documentation of docker stop:
> {quote}the main process inside the container will receive {{SIGTERM}}, and 
> after a grace period, {{SIGKILL}}.
> {quote}
> There is a {{DelayedProcessKiller}} in {{ContainerExecutor}} which executes 
> for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By 
> default this is set to {{250 milliseconds}}, so irrespective of the 
> container type it will always get executed.
>  
> For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} 
> after the grace period:
> - when sleepDelayBeforeSigKill > 10 seconds, there is no point in 
> executing DelayedProcessKiller
> - when sleepDelayBeforeSigKill < 1 second, the grace period should be 
> the smallest value, which is 1 second, because we are forcing a kill 
> after 250 ms anyway
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period

2018-09-13 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8706:

Attachment: (was: YARN-8706.addendum.001.patch)

> DelayedProcessKiller is executed for Docker containers even though docker 
> stop sends a KILL signal after the specified grace period
> ---
>
> Key: YARN-8706
> URL: https://issues.apache.org/jira/browse/YARN-8706
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Fix For: 3.2.0
>
> Attachments: YARN-8706.001.patch, YARN-8706.002.patch, 
> YARN-8706.003.patch, YARN-8706.004.patch, YARN-8706.addendum.001.patch
>
>
> {{DockerStopCommand}} adds a grace period of 10 seconds.
> 10 seconds is also the default grace time used by docker stop
>  [https://docs.docker.com/engine/reference/commandline/stop/]
> Documentation of docker stop:
> {quote}the main process inside the container will receive {{SIGTERM}}, and 
> after a grace period, {{SIGKILL}}.
> {quote}
> There is a {{DelayedProcessKiller}} in {{ContainerExecutor}} which executes 
> for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By 
> default this is set to {{250 milliseconds}}, so irrespective of the 
> container type it will always get executed.
>  
> For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} 
> after the grace period:
> - when sleepDelayBeforeSigKill > 10 seconds, there is no point in 
> executing DelayedProcessKiller
> - when sleepDelayBeforeSigKill < 1 second, the grace period should be 
> the smallest value, which is 1 second, because we are forcing a kill 
> after 250 ms anyway
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8734) Readiness check for remote service

2018-09-13 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614199#comment-16614199
 ] 

Eric Yang commented on YARN-8734:
-

Patch 001 is a proof of concept to prevent a service from deploying until its 
remote_service_dependencies are satisfied.  The user can express remote 
dependencies like:

{code}
{
  "name": "sleeper-service",
  "version": "5",
  "remote_service_dependencies": ["my-remote-service"],
  "components" :
  [
{
  "name": "ping",
  "number_of_containers": 2,
  "artifact": {
"id": "hadoop/centos:latest",
"type": "DOCKER"
  },
  "launch_command": "sleep,1",
  "resource": {
"cpus": 1,
"memory": "256"
  }
}
  ]
}
{code}

The only limitation of the remote service dependency is that the dependent 
service must be owned by the same user.  YARN restricts who can see the status 
of other jobs, which is why user A cannot obtain the status of user B's 
service.  This limitation is likely to be resolved by the YARN-8733 remote 
service port check.

> Readiness check for remote service
> --
>
> Key: YARN-8734
> URL: https://issues.apache.org/jira/browse/YARN-8734
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn-native-services
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: Dependency check vs.pdf, YARN-8734.001.patch
>
>
> When a service is deploying, there can be a remote service dependency. It 
> would be nice to be able to describe ZooKeeper as a dependent service and, 
> once that service has reached a stable state, then deploy HBase.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8734) Readiness check for remote service

2018-09-13 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8734:

Attachment: YARN-8734.001.patch

> Readiness check for remote service
> --
>
> Key: YARN-8734
> URL: https://issues.apache.org/jira/browse/YARN-8734
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn-native-services
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: Dependency check vs.pdf, YARN-8734.001.patch
>
>
> When a service is deploying, there can be a remote service dependency. It 
> would be nice to be able to describe ZooKeeper as a dependent service and, 
> once that service has reached a stable state, then deploy HBase.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8755) Add clean up for FederationStore apps

2018-09-13 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614168#comment-16614168
 ] 

Botong Huang commented on YARN-8755:


Sure, thx [~subru] and [~bibinchundatt] for the comment!

> Add clean up for FederationStore apps
> -
>
> Key: YARN-8755
> URL: https://issues.apache.org/jira/browse/YARN-8755
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Priority: Major
>
> We should add clean-up logic for the application-to-home-cluster mapping in 
> the federation state store. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8755) Add clean up for FederationStore apps

2018-09-13 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614158#comment-16614158
 ] 

Subru Krishnan commented on YARN-8755:
--

Thanks [~bibinchundatt]!

I see that [~botong] is working on addressing your feedback. I do have a 
request: can both of you check whether YARN-6648 needs to be updated with your 
comments, and also include that as part of YARN-7599?

> Add clean up for FederationStore apps
> -
>
> Key: YARN-8755
> URL: https://issues.apache.org/jira/browse/YARN-8755
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Priority: Major
>
> We should add clean-up logic for the application-to-home-cluster mapping in 
> the federation state store. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-09-13 Thread Subru Krishnan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614143#comment-16614143
 ] 

Subru Krishnan commented on YARN-7592:
--

Thanks [~jira.shegalov] for raising this, and [~bibinchundatt] and 
[~rahulanand90] for the detailed analysis.

[~bibinchundatt], I agree that this is related to YARN-8434. It looks like in 
our test setup we specify {{FederationRMFailoverProxyProvider}} for the non-HA 
setup and ConfiguredRMFailoverProxyProvider for the HA setup in yarn-site.

Before we change the Server/Client proxies, is it possible to remove the 
*yarn.federation.enabled* flag from yarn-site and check? After (re)looking at 
the code, that flag may not be necessary in NMs (only in RMs).

> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: IssueReproduce.patch
>
>
> yarn.federation.failover.enabled should be documented in yarn-default.xml. I 
> am also not sure why it should be true by default and force the HA retry 
> policy in {{RMProxy#createRMProxy}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8773) Blacklisting support for scheduling AMs

2018-09-13 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8773:

Description: MapReduce jobs failed with both AM attempts failing on the same 
node - the node had some issue. Both AM attempts are placed on the same node as 
there is no blacklisting feature. The customer is expecting a fix for YARN-2005 
+ YARN-4389.   (was: MapReduce jobs failed with both AM attempts failing on the 
same node - the node had some issue. Both AM attempts are placed on the same 
node as there is no blacklisting feature. The customer is expecting a fix for 
YARN-2005 + YARN-4389. Is it possible to backport it to HDP-2.2.9, and do we 
have any better workaround to avoid this issue? 

{code}
"2018-08-18 11:32:57,855 WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=edwaefrp  
  OPERATION=Application Finished - Failed TARGET=RMAppManager 
RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED   
PERMISSIONS=Application application_1529242338015_465184 failed 2 times due to 
AM Container for appattempt_1529242338015_465184_02 exited with  exitCode: 
-1000
For more detailed output, check application tracking 
page:https://ma4-gbihrcp-lnn14.corp.apple.com:8078/proxy/application_1529242338015_465184/Then,
 click on links to logs of each attempt.
Diagnostics: Error while running command to get file permissions : 
ExitCodeException exitCode=139:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1103)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1411)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1375)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$1000(ResourceLocalizationService.java:139)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1102)
Failing this attempt. Failing the application.  
APPID=application_1529242338015_465184","2018-08-18T11:32:57.855+","ma4-gbihrcp-lnn14.corp.apple.com","gbi_hadoop_prod_hrc_core",16,"/ngs/app/yarn/hadoop/logs/yarn/yarn-yarn-resourcemanager-ma4-gbihrcp-lnn14.corp.apple.com.log","yarn_resourcemanager_log","in-gncs-159.corp.apple.com",
"2018-08-18 11:32:57,855 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
application_1529242338015_465184 failed 2 times due to AM Container for 
appattempt_1529242338015_465184_02 exited with  exitCode: -1000
For more detailed output, check application tracking 
page:https://ma4-gbihrcp-lnn14.corp.apple.com:8078/proxy/application_1529242338015_465184/Then,
 click on links to logs of each attempt.
Diagnostics: Error while running command to get file permissions : 
ExitCodeException exitCode=139:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1103)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1411)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1375)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$1000(ResourceLocalizationService.java:139)
at 

[jira] [Resolved] (YARN-8773) Blacklisting support for scheduling AMs

2018-09-13 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-8773.
-
Resolution: Invalid

> Blacklisting support for scheduling AMs 
> 
>
> Key: YARN-8773
> URL: https://issues.apache.org/jira/browse/YARN-8773
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
>Reporter: Prabhu Joseph
>Assignee: Wangda Tan
>Priority: Major
>
> MapReduce jobs failed with both AM attempts failing on the same node - the 
> node had some issue. Both AM attempts are placed on the same node as there is 
> no blacklisting feature. The customer is expecting a fix for YARN-2005 + 
> YARN-4389. Is it possible to backport it to HDP-2.2.9, and do we have any 
> better workaround to avoid this issue? 
> {code}
> "2018-08-18 11:32:57,855 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=edwaefrp
> OPERATION=Application Finished - Failed TARGET=RMAppManager 
> RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED   
> PERMISSIONS=Application application_1529242338015_465184 failed 2 times due 
> to AM Container for appattempt_1529242338015_465184_02 exited with  
> exitCode: -1000
> For more detailed output, check application tracking 
> page:https://ma4-gbihrcp-lnn14.corp.apple.com:8078/proxy/application_1529242338015_465184/Then,
>  click on links to logs of each attempt.
> Diagnostics: Error while running command to get file permissions : 
> ExitCodeException exitCode=139:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1103)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1411)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1375)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$1000(ResourceLocalizationService.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1102)
> Failing this attempt. Failing the application.  
> APPID=application_1529242338015_465184","2018-08-18T11:32:57.855+","ma4-gbihrcp-lnn14.corp.apple.com","gbi_hadoop_prod_hrc_core",16,"/ngs/app/yarn/hadoop/logs/yarn/yarn-yarn-resourcemanager-ma4-gbihrcp-lnn14.corp.apple.com.log","yarn_resourcemanager_log","in-gncs-159.corp.apple.com",
> "2018-08-18 11:32:57,855 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
> application_1529242338015_465184 failed 2 times due to AM Container for 
> appattempt_1529242338015_465184_02 exited with  exitCode: -1000
> For more detailed output, check application tracking 
> page:https://ma4-gbihrcp-lnn14.corp.apple.com:8078/proxy/application_1529242338015_465184/Then,
>  click on links to logs of each attempt.
> Diagnostics: Error while running command to get file permissions : 
> ExitCodeException exitCode=139:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1103)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1411)
> at 
> 

[jira] [Created] (YARN-8773) Blacklisting support for scheduling AMs for Apple HDP-2.2.9

2018-09-13 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-8773:
---

 Summary: Blacklisting support for scheduling AMs for Apple 
HDP-2.2.9
 Key: YARN-8773
 URL: https://issues.apache.org/jira/browse/YARN-8773
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
Reporter: Prabhu Joseph
Assignee: Wangda Tan


MapReduce jobs failed with both AM attempts failing on the same node - the node 
had some issue. Both AM attempts are placed on the same node as there is no 
blacklisting feature. The customer is expecting a fix for YARN-2005 + YARN-4389. 
Is it possible to backport it to HDP-2.2.9, and do we have any better workaround 
to avoid this issue? 

{code}
"2018-08-18 11:32:57,855 WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=edwaefrp  
  OPERATION=Application Finished - Failed TARGET=RMAppManager 
RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED   
PERMISSIONS=Application application_1529242338015_465184 failed 2 times due to 
AM Container for appattempt_1529242338015_465184_02 exited with  exitCode: 
-1000
For more detailed output, check application tracking 
page:https://ma4-gbihrcp-lnn14.corp.apple.com:8078/proxy/application_1529242338015_465184/Then,
 click on links to logs of each attempt.
Diagnostics: Error while running command to get file permissions : 
ExitCodeException exitCode=139:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1103)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1411)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1375)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$1000(ResourceLocalizationService.java:139)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1102)
Failing this attempt. Failing the application.  
APPID=application_1529242338015_465184","2018-08-18T11:32:57.855+","ma4-gbihrcp-lnn14.corp.apple.com","gbi_hadoop_prod_hrc_core",16,"/ngs/app/yarn/hadoop/logs/yarn/yarn-yarn-resourcemanager-ma4-gbihrcp-lnn14.corp.apple.com.log","yarn_resourcemanager_log","in-gncs-159.corp.apple.com",
"2018-08-18 11:32:57,855 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
application_1529242338015_465184 failed 2 times due to AM Container for 
appattempt_1529242338015_465184_02 exited with  exitCode: -1000
For more detailed output, check application tracking 
page:https://ma4-gbihrcp-lnn14.corp.apple.com:8078/proxy/application_1529242338015_465184/Then,
 click on links to logs of each attempt.
Diagnostics: Error while running command to get file permissions : 
ExitCodeException exitCode=139:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1103)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1411)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1375)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$1000(ResourceLocalizationService.java:139)
at 

[jira] [Updated] (YARN-8773) Blacklisting support for scheduling AMs

2018-09-13 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8773:

Summary: Blacklisting support for scheduling AMs   (was: Blacklisting 
support for scheduling AMs for Apple HDP-2.2.9)

> Blacklisting support for scheduling AMs 
> 
>
> Key: YARN-8773
> URL: https://issues.apache.org/jira/browse/YARN-8773
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
>Reporter: Prabhu Joseph
>Assignee: Wangda Tan
>Priority: Major
>
> MapReduce jobs failed with both AM attempts failing on the same node - the 
> node had some issue. Both AM attempts are placed on the same node as there is 
> no blacklisting feature. The customer is expecting a fix for YARN-2005 + 
> YARN-4389. Is it possible to backport it to HDP-2.2.9, and do we have any 
> better workaround to avoid this issue? 
> {code}
> "2018-08-18 11:32:57,855 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=edwaefrp
> OPERATION=Application Finished - Failed TARGET=RMAppManager 
> RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED   
> PERMISSIONS=Application application_1529242338015_465184 failed 2 times due 
> to AM Container for appattempt_1529242338015_465184_02 exited with  
> exitCode: -1000
> For more detailed output, check application tracking 
> page:https://ma4-gbihrcp-lnn14.corp.apple.com:8078/proxy/application_1529242338015_465184/Then,
>  click on links to logs of each attempt.
> Diagnostics: Error while running command to get file permissions : 
> ExitCodeException exitCode=139:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1103)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1411)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1375)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$1000(ResourceLocalizationService.java:139)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1102)
> Failing this attempt. Failing the application.  
> APPID=application_1529242338015_465184","2018-08-18T11:32:57.855+","ma4-gbihrcp-lnn14.corp.apple.com","gbi_hadoop_prod_hrc_core",16,"/ngs/app/yarn/hadoop/logs/yarn/yarn-yarn-resourcemanager-ma4-gbihrcp-lnn14.corp.apple.com.log","yarn_resourcemanager_log","in-gncs-159.corp.apple.com",
> "2018-08-18 11:32:57,855 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
> application_1529242338015_465184 failed 2 times due to AM Container for 
> appattempt_1529242338015_465184_02 exited with  exitCode: -1000
> For more detailed output, check application tracking 
> page:https://ma4-gbihrcp-lnn14.corp.apple.com:8078/proxy/application_1529242338015_465184/Then,
>  click on links to logs of each attempt.
> Diagnostics: Error while running command to get file permissions : 
> ExitCodeException exitCode=139:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1103)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1411)
> at 
> 

[jira] [Commented] (YARN-8772) Annotation javax.annotation.Generated has moved

2018-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614029#comment-16614029
 ] 

Hadoop QA commented on YARN-8772:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
 3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 47s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
18s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
18s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core:
 The patch generated 0 new + 142 unchanged - 12 fixed = 142 total (was 154) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 26s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
18s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 12m 
39s{color} | {color:green} hadoop-yarn-services-core in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 65m 57s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8772 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12939597/YARN-8772.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux dd133ae159b7 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 
10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 250b500 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21833/testReport/ |
| Max. process+thread count | 740 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core
 U: 

[jira] [Commented] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614021#comment-16614021
 ] 

Hadoop QA commented on YARN-8771:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
26s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m  6s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
11s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 28s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 2 new + 24 unchanged - 0 fixed = 26 total (was 24) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 40s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 72m  6s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}120m 18s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestIncreaseAllocationExpirer
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8771 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12939543/YARN-8771.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 34b8283948e4 4.4.0-133-generic #159-Ubuntu SMP Fri Aug 10 
07:31:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e1b242a |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 

[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-09-13 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614019#comment-16614019
 ] 

Jason Lowe commented on YARN-8648:
--

Thanks for updating the patch!  +1 lgtm.  Waiting to hear back from 
[~billie.rinaldi] about the docker-in-docker use case before committing.


> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch, YARN-8648.002.patch, 
> YARN-8648.003.patch, YARN-8648.004.patch, YARN-8648.005.patch, 
> YARN-8648.006.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So, for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.
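
For illustration only (this is not the attached patch; the mount point, hierarchy name and helper class are assumptions), a minimal sketch of the kind of cleanup being discussed: after a container exits, walk every cgroup controller and remove the per-container directory that docker created via {{--cgroup-parent}} but never removed.

{code:java}
import java.io.File;

public class LeakedCgroupCleaner {
  // Assumed mount point and hierarchy name; in practice these come from the
  // yarn.nodemanager.linux-container-executor.cgroups.* configuration.
  private static final String CGROUP_ROOT = "/sys/fs/cgroup";
  private static final String HIERARCHY = "hadoop-yarn";

  public static void cleanup(String containerId) {
    File[] controllers = new File(CGROUP_ROOT).listFiles(File::isDirectory);
    if (controllers == null) {
      return;
    }
    for (File controller : controllers) {
      File leaked = new File(new File(controller, HIERARCHY), containerId);
      // cgroupfs allows rmdir on a cgroup with no tasks and no child cgroups,
      // even though control files appear inside it; delete() maps to rmdir for
      // directories and simply returns false if the cgroup is still busy.
      if (leaked.isDirectory() && !leaked.delete()) {
        System.err.println("Could not remove leaked cgroup: " + leaked);
      }
    }
  }
}
{code}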



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8045) Reduce log output from container status calls

2018-09-13 Thread Shane Kumpf (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613969#comment-16613969
 ] 

Shane Kumpf commented on YARN-8045:
---

Thanks for the patch, [~ccondit-target]. This is much better. +1 I'll commit 
this shortly.

> Reduce log output from container status calls
> -
>
> Key: YARN-8045
> URL: https://issues.apache.org/jira/browse/YARN-8045
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Shane Kumpf
>Assignee: Craig Condit
>Priority: Major
> Attachments: YARN-8045.001.patch
>
>
> Each time a container's status is returned a log entry is produced in the NM 
> from {{ContainerManagerImpl}}. The container status includes the diagnostics 
> field for the container. If the diagnostics field contains an exception, it 
> can appear as if the exception is logged repeatedly every second. The 
> diagnostics message can also span many lines, which puts pressure on the logs 
> and makes it harder to read.
> For example:
> {code}
> 2018-03-17 22:01:11,632 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_e01_1521323860653_0001_01_05
> 2018-03-17 22:01:11,632 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_e01_1521323860653_0001_01_05, ExecutionType: GUARANTEED, State: 
> RUNNING, Capability: , Diagnostics: [2018-03-17 
> 22:01:00.675]Exception from container-launch.
> Container id: container_e01_1521323860653_0001_01_05
> Exit code: -1
> Exception message: 
> Shell ouput: 
> [2018-03-17 22:01:00.750]Diagnostic message from attempt :
> [2018-03-17 22:01:00.750]Container exited with a non-zero exit code -1.
> , ExitStatus: -1, IP: null, Host: null, ContainerSubState: SCHEDULED]
> {code}
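
One way to picture the reduction under review (a sketch only, with hypothetical class and method names, not the actual YARN-8045 patch): keep the full diagnostics at DEBUG and emit a short, single-line summary at INFO.

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ContainerStatusLogging {
  private static final Logger LOG =
      LoggerFactory.getLogger(ContainerStatusLogging.class);
  private static final int MAX_DIAG_CHARS = 120;

  static void logStatus(String containerId, String state, String diagnostics) {
    if (LOG.isDebugEnabled()) {
      // Full, possibly multi-line diagnostics only when explicitly requested.
      LOG.debug("Returning status for {}: state={}, diagnostics={}",
          containerId, state, diagnostics);
      return;
    }
    // Collapse newlines and cap the length so repeated status calls do not
    // flood the NM log with the same stack trace every second.
    String diag = diagnostics == null ? "" : diagnostics.replace('\n', ' ');
    if (diag.length() > MAX_DIAG_CHARS) {
      diag = diag.substring(0, MAX_DIAG_CHARS) + "...";
    }
    LOG.info("Returning status for {}: state={}, diagnostics[truncated]={}",
        containerId, state, diag);
  }
}
{code}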



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8680) YARN NM: Implement Iterable Abstraction for LocalResourceTracker state

2018-09-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613967#comment-16613967
 ] 

Hudson commented on YARN-8680:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14948 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14948/])
YARN-8680. YARN NM: Implement Iterable Abstraction for (jlowe: rev 
250b50018e8c94d8ca83ff981b01f26bb68c0842)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java


> YARN NM: Implement Iterable Abstraction for LocalResourceTracker state
> --
>
> Key: YARN-8680
> URL: https://issues.apache.org/jira/browse/YARN-8680
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Pradeep Ambati
>Assignee: Pradeep Ambati
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8680.00.patch, YARN-8680.01.patch, 
> YARN-8680.02.patch, YARN-8680.03.patch, YARN-8680.04.patch
>
>
> Similar to YARN-8242, implement iterable abstraction for 
> LocalResourceTrackerState to load completed and in-progress resources when 
> needed, rather than loading them all at once for the respective state.
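
The shape of that abstraction, as a rough sketch with hypothetical names (the real changes live in NMStateStoreService and its implementations):

{code:java}
import java.io.IOException;
import java.util.Iterator;

/**
 * Lazily exposes recovered localized-resource records so callers pull entries
 * on demand instead of the state store materializing full lists up front.
 */
interface RecoveredResourceState<T> {
  /** Iterate resources whose localization completed. */
  Iterator<T> getCompletedResourcesIterator() throws IOException;

  /** Iterate resources whose localization was still in progress. */
  Iterator<T> getInProgressResourcesIterator() throws IOException;
}
{code}

A caller can then recover one record at a time, keeping memory bounded regardless of how many resources a tracker holds.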



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8772) Annotation javax.annotation.Generated has moved

2018-09-13 Thread Andrew Purtell (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated YARN-8772:
-
Attachment: YARN-8772.patch

> Annotation javax.annotation.Generated has moved
> ---
>
> Key: YARN-8772
> URL: https://issues.apache.org/jira/browse/YARN-8772
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api
>Affects Versions: 3.1.1
>Reporter: Andrew Purtell
>Priority: Minor
> Attachments: YARN-8772.patch
>
>
> YARN compilation with Java 11 fails because the annotation 
> javax.annotation.Generated has moved. It is now 
> javax.annotation.processing.Generated. A simple substitution will break 
> compilation with older JDKs, so it seems best to remove the annotations, which 
> are documentation only, not functional.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8772) Annotation javax.annotation.Generated has moved

2018-09-13 Thread Andrew Purtell (JIRA)
Andrew Purtell created YARN-8772:


 Summary: Annotation javax.annotation.Generated has moved
 Key: YARN-8772
 URL: https://issues.apache.org/jira/browse/YARN-8772
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 3.1.1
Reporter: Andrew Purtell


YARN compilation with Java 11 fails because the annotation 
javax.annotation.Generated has moved. It is now 
javax.annotation.processing.Generated. A simple substitution will break 
compilation with older JDKs, so it seems best to remove the annotations, which 
are documentation only, not functional.
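
As an illustration (hypothetical generated class, not taken from the patch), the incompatibility and the proposed fix look like this:

{code:java}
// Compiles on JDK 8 but not on JDK 11, where javax.annotation.Generated no
// longer exists on the default class path:
//   import javax.annotation.Generated;
//   @Generated("some-code-generator")
//
// The JDK 9+ replacement is javax.annotation.processing.Generated, which older
// JDKs do not have. Since the annotation is informational only, dropping it
// lets the class compile everywhere:
public class GeneratedDto {
  private String value;

  public String getValue() { return value; }
  public void setValue(String value) { this.value = value; }
}
{code}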



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8680) YARN NM: Implement Iterable Abstraction for LocalResourceTracker state

2018-09-13 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-8680:
-
Summary: YARN NM: Implement Iterable Abstraction for LocalResourceTracker 
state  (was: YARN NM: Implement Iterable Abstraction for 
LocalResourceTrackerstate)

> YARN NM: Implement Iterable Abstraction for LocalResourceTracker state
> --
>
> Key: YARN-8680
> URL: https://issues.apache.org/jira/browse/YARN-8680
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Pradeep Ambati
>Assignee: Pradeep Ambati
>Priority: Critical
> Attachments: YARN-8680.00.patch, YARN-8680.01.patch, 
> YARN-8680.02.patch, YARN-8680.03.patch, YARN-8680.04.patch
>
>
> Similar to YARN-8242, implement iterable abstraction for 
> LocalResourceTrackerState to load completed and in-progress resources when 
> needed, rather than loading them all at once for the respective state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8765) Extend YARN to support SET type of resources

2018-09-13 Thread Suma Shivaprasad (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613926#comment-16613926
 ] 

Suma Shivaprasad commented on YARN-8765:


Thanks [~cheersyang] for starting this and uploading the design.

It would be useful to add a RANGE resource type for resources such as IPs and 
ports, since these are usually specified and dealt with as ranges, which would 
make them easier to manage.

With RANGE, the value could either be a single range expression (e.g. a CIDR 
range) or a lower bound and an upper bound. The RM could host a plugin to 
validate the values specified for such resource types (e.g. validate a CIDR 
range).

We may also need to support the case where IP ranges are not restricted to the 
IPs reported by individual NMs and IPs can move across hosts, provided the 
underlying network model supports it. In that case the RM would manage a 
globally routable IP range and, when containers are allocated, assign an 
available IP from this range.

Thoughts?
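
To make the suggestion concrete, a rough sketch of what a RANGE value with a pluggable validator might look like (hypothetical API, not part of YARN's resource model):

{code:java}
public final class RangeResourceValue {
  private final long lowerBound;
  private final long upperBound;

  public RangeResourceValue(long lowerBound, long upperBound) {
    if (lowerBound > upperBound) {
      throw new IllegalArgumentException("lower bound exceeds upper bound");
    }
    this.lowerBound = lowerBound;
    this.upperBound = upperBound;
  }

  public boolean contains(long value) {
    return value >= lowerBound && value <= upperBound;
  }

  /** Hook for an RM-side plugin to validate requested ranges, e.g. a CIDR block. */
  public interface Validator {
    boolean isValid(RangeResourceValue requested, RangeResourceValue available);
  }
}
{code}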






> Extend YARN to support SET type of resources
> 
>
> Key: YARN-8765
> URL: https://issues.apache.org/jira/browse/YARN-8765
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Weiwei Yang
>Priority: Major
>
> YARN-3926 evolves a new resource model in YARN by providing a general 
> resource definition mechanism. However right now only COUNTABLE type is 
> supported. To support resources that cannot be declared with a single value, 
> propose to add a SET type. This will extend YARN to manage IP address 
> resources. Design doc is attached 
> [here|https://docs.google.com/document/d/1U9hj1xX9a3c_xT_X4EP_YC0fZ7ItD5X-rB0bYne-waU/edit?usp=sharing].
> This feature is split from YARN-8446.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7599) [GPG] ApplicationCleaner in Global Policy Generator

2018-09-13 Thread Botong Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613901#comment-16613901
 ] 

Botong Huang commented on YARN-7599:


Thanks [~bibinchundatt] for the comments! Yes, I will make the policy 
configurable.

About the race condition: the v2 patch not only looks at the expiry time, but 
also at the apps reported by the Router, which is the union of the apps in all 
YarnRM memory. Instead of excluding only the running apps from the Router (good 
catch!), I should have excluded all apps that are still in YarnRM memory. That 
should eliminate the race condition you mentioned. What do you think?
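
In other words (sketch only, hypothetical names), the deletion candidates would be the state-store records minus every application any YarnRM still remembers:

{code:java}
import java.util.HashSet;
import java.util.Set;

final class ApplicationCleanerLogic {
  static <T> Set<T> candidatesForDeletion(Set<T> appsInStateStore,
      Set<T> appsKnownToAnyRM) {
    Set<T> candidates = new HashSet<>(appsInStateStore);
    // Keep anything still present in some RM's memory, whatever its state;
    // only fully forgotten applications are eligible for cleanup.
    candidates.removeAll(appsKnownToAnyRM);
    return candidates;
  }
}
{code}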

> [GPG] ApplicationCleaner in Global Policy Generator
> ---
>
> Key: YARN-7599
> URL: https://issues.apache.org/jira/browse/YARN-7599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
>  Labels: federation, gpg
> Attachments: YARN-7599-YARN-7402.v1.patch, 
> YARN-7599-YARN-7402.v2.patch
>
>
> In Federation, we need a cleanup service for StateStore as well as Yarn 
> Registry. For the former, we need to remove old application records. For the 
> latter, failed and killed applications might leave records in the Yarn 
> Registry (see YARN-6128). We plan to do both cleanup work in 
> ApplicationCleaner in GPG



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period

2018-09-13 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8706:

Attachment: YARN-8706.addendum.001.patch

> DelayedProcessKiller is executed for Docker containers even though docker 
> stop sends a KILL signal after the specified grace period
> ---
>
> Key: YARN-8706
> URL: https://issues.apache.org/jira/browse/YARN-8706
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Fix For: 3.2.0
>
> Attachments: YARN-8706.001.patch, YARN-8706.002.patch, 
> YARN-8706.003.patch, YARN-8706.004.patch, YARN-8706.addendum.001.patch
>
>
> {{DockerStopCommand}} adds a grace period of 10 seconds.
> 10 seconds is also the default grace period used by docker stop
>  [https://docs.docker.com/engine/reference/commandline/stop/]
> Documentation of docker stop:
> {quote}the main process inside the container will receive {{SIGTERM}}, and 
> after a grace period, {{SIGKILL}}.
> {quote}
> There is a {{DelayedProcessKiller}} in {{ContainerExecutor}} which executes 
> for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By 
> default this is set to {{250 milliseconds}}, so irrespective of the 
> container type it will always get executed.
>  
> For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} 
> after the grace period:
> - when sleepDelayBeforeSigKill > 10 seconds, there is no point in 
> executing DelayedProcessKiller
> - when sleepDelayBeforeSigKill < 1 second, the grace period should be 
> the smallest value, which is 1 second, because we are forcing a kill 
> after 250 ms anyway
>  
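
A minimal sketch of that clamping logic (illustrative names, not the committed change): derive the docker stop grace period from sleepDelayBeforeSigKill with a 1-second floor, which makes DelayedProcessKiller redundant for docker containers.

{code:java}
final class DockerStopGracePeriod {
  static int gracePeriodSeconds(long sleepDelayBeforeSigKillMs) {
    // docker stop accepts whole seconds and interprets the value as the time
    // between SIGTERM and SIGKILL; 1 second is the smallest useful period.
    long seconds = sleepDelayBeforeSigKillMs / 1000;
    return (int) Math.max(1L, seconds);
  }
}
{code}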



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8680) YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate

2018-09-13 Thread Pradeep Ambati (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613886#comment-16613886
 ] 

Pradeep Ambati commented on YARN-8680:
--

Thanks for the review [~jlowe].

> YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate
> -
>
> Key: YARN-8680
> URL: https://issues.apache.org/jira/browse/YARN-8680
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Pradeep Ambati
>Assignee: Pradeep Ambati
>Priority: Critical
> Attachments: YARN-8680.00.patch, YARN-8680.01.patch, 
> YARN-8680.02.patch, YARN-8680.03.patch, YARN-8680.04.patch
>
>
> Similar to YARN-8242, implement iterable abstraction for 
> LocalResourceTrackerState to load completed and in-progress resources when 
> needed, rather than loading them all at once for the respective state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8680) YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate

2018-09-13 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613878#comment-16613878
 ] 

Jason Lowe commented on YARN-8680:
--

Thanks for updating the patch!  +1 lgtm.  Committing this.

> YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate
> -
>
> Key: YARN-8680
> URL: https://issues.apache.org/jira/browse/YARN-8680
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Pradeep Ambati
>Assignee: Pradeep Ambati
>Priority: Critical
> Attachments: YARN-8680.00.patch, YARN-8680.01.patch, 
> YARN-8680.02.patch, YARN-8680.03.patch, YARN-8680.04.patch
>
>
> Similar to YARN-8242, implement iterable abstraction for 
> LocalResourceTrackerState to load completed and in-progress resources when 
> needed, rather than loading them all at once for the respective state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period

2018-09-13 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh reopened YARN-8706:
-

docker-util.c only permits these formats. 
 {code}
"{{.State.Status}}",
"{{range(.NetworkSettings.Networks)}}{{.IPAddress}},{{end}}{{.Config.Hostname}}"
{code}

It needs to include the below:
{code}
{{.State.Status}},{{.Config.StopSignal}}
{code}

> DelayedProcessKiller is executed for Docker containers even though docker 
> stop sends a KILL signal after the specified grace period
> ---
>
> Key: YARN-8706
> URL: https://issues.apache.org/jira/browse/YARN-8706
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Fix For: 3.2.0
>
> Attachments: YARN-8706.001.patch, YARN-8706.002.patch, 
> YARN-8706.003.patch, YARN-8706.004.patch
>
>
> {{DockerStopCommand}} adds a grace period of 10 seconds.
> 10 seconds is also the default grace period used by docker stop
>  [https://docs.docker.com/engine/reference/commandline/stop/]
> Documentation of docker stop:
> {quote}the main process inside the container will receive {{SIGTERM}}, and 
> after a grace period, {{SIGKILL}}.
> {quote}
> There is a {{DelayedProcessKiller}} in {{ContainerExecutor}} which executes 
> for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By 
> default this is set to {{250 milliseconds}}, so irrespective of the 
> container type it will always get executed.
>  
> For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} 
> after the grace period:
> - when sleepDelayBeforeSigKill > 10 seconds, there is no point in 
> executing DelayedProcessKiller
> - when sleepDelayBeforeSigKill < 1 second, the grace period should be 
> the smallest value, which is 1 second, because we are forcing a kill 
> after 250 ms anyway
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613834#comment-16613834
 ] 

Hadoop QA commented on YARN-8648:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
25s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
 1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 17s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
22s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 28s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m  
4s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
22s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 68m 49s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8648 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12939580/YARN-8648.006.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  cc  |
| uname | Linux 2ec2c6408a2d 4.4.0-133-generic #159-Ubuntu SMP Fri Aug 10 
07:31:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e1b242a |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21831/testReport/ |
| Max. process+thread count | 465 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21831/console |
| Powered by | Apache Yetus 0.8.0   

[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-09-13 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613828#comment-16613828
 ] 

Bibin A Chundatt commented on YARN-7592:


Thank you [~rahulanand90] for the detailed analysis.

[~subru], it seems that in the single-RM case the {{FederationRMFailoverProxyProvider}} 
configuration works for {{ResourceTracker}}, but it fails in the case of an *RM HA* 
cluster.

As discussed in YARN-8434, either separate configurations are required for ServerProxy 
and ClientProxy, or federationUtils should use the extended ClientRMProxy.



> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: IssueReproduce.patch
>
>
> yarn.federation.failover.enabled should be documented in yarn-default.xml. I 
> am also not sure why it should be true by default and force the HA retry 
> policy in {{RMProxy#createRMProxy}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8769) [Submarine] Allow user to specify customized quicklink(s) when submit Submarine job

2018-09-13 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613809#comment-16613809
 ] 

Sunil Govindan commented on YARN-8769:
--

This change looks fine to me. I'll do some dry runs.

> [Submarine] Allow user to specify customized quicklink(s) when submit 
> Submarine job
> ---
>
> Key: YARN-8769
> URL: https://issues.apache.org/jira/browse/YARN-8769
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8769.001.patch
>
>
> This will be helpful when user submit a job and some links need to be shown 
> on YARN UI2 (service page). For example, user can specify a quick link to 
> Zeppelin notebook UI when a Zeppelin notebook got launched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-09-13 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613729#comment-16613729
 ] 

Jim Brennan commented on YARN-8648:
---

[~jlowe] thanks for the review.   I've made the change you suggested in patch 
006.

 

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch, YARN-8648.002.patch, 
> YARN-8648.003.patch, YARN-8648.004.patch, YARN-8648.005.patch, 
> YARN-8648.006.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So, for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker

2018-09-13 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8648:
--
Attachment: YARN-8648.006.patch

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch, YARN-8648.002.patch, 
> YARN-8648.003.patch, YARN-8648.004.patch, YARN-8648.005.patch, 
> YARN-8648.006.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So, for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6456) Allow administrators to set a single ContainerRuntime for all containers

2018-09-13 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613718#comment-16613718
 ] 

Eric Yang commented on YARN-6456:
-

{quote}The allowed image check is optional and disabled by default. If we 
decide that the limit-images-within-a-trusted-registry feature requires a 
separate registry and updating container-executor.cfg to support that use-case 
then I agree the allowed images property is extraneous.{quote}

The registry check and the image check both happen prior to pulling the image from 
the registry, because each is only a string comparison.  It might be more effective 
to refine the check_trusted_image function to also check for allowed images.  It 
would be nice to keep the same business logic close together for maintainability.  
Otherwise, IMHO, it would be a headache to have to restart the node manager to add 
more allowed images to yarn-site.xml.

> Allow administrators to set a single ContainerRuntime for all containers
> 
>
> Key: YARN-6456
> URL: https://issues.apache.org/jira/browse/YARN-6456
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Miklos Szegedi
>Assignee: Craig Condit
>Priority: Major
>  Labels: Docker
> Attachments: YARN-6456-ForceDockerRuntimeIfSupported.patch, 
> YARN-6456.001.patch, YARN-6456.002.patch, YARN-6456.003.patch
>
>
>  
> With LCE, there are multiple ContainerRuntimes available for handling 
> different types of containers; default, docker, java sandbox. Admins should 
> have the ability to override the user decision and set a single global 
> ContainerRuntime to be used for all containers.
> Original Description:
> {quote}One reason to use Docker containers is to be able to isolate different 
> workloads, even if they run as the same user.
> I have noticed some issues in the current design:
>  1. DockerLinuxContainerRuntime mounts containerLocalDirs 
> {{nm-local-dir/usercache/user/appcache/application_1491598755372_0011/}} and 
> userLocalDirs {{nm-local-dir/usercache/user/}}, so that a container can see 
> and modify the files of another container. I think the application file cache 
> directory should be enough for the container to run in most of the cases.
>  2. The whole cgroups directory is mounted. Would the container directory be 
> enough?
>  3. There is no way to enforce exclusive use of Docker for all containers. 
> There should be an option that it is not the user but the admin that requires 
> to use Docker.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6456) Allow administrators to set a single ContainerRuntime for all containers

2018-09-13 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613676#comment-16613676
 ] 

Jason Lowe commented on YARN-6456:
--

Thanks for updating the patch!  +1 lgtm.

I don't see the allowed images property as being a security feature, rather a 
way for admins to limit the images within an already trusted registry.  For 
example, we publish our Docker images to a shared registry, and we only want 
some of the images there to be used in YARN.  The restriction for us is a 
concern with clusters hammering the registry when they don't ask for preloaded 
images rather than thinking the image itself is an attack vector.  Dropping all 
capabilities, using seccomp profiles, and restricting bind mounts should cover 
the concerns with the image contents, so it's more about controlling the images 
within a registry for bandwidth, have-we-tested-this, etc. concerns.

The allowed image check is optional and disabled by default.  If we decide that 
the limit-images-within-a-trusted-registry feature requires a separate registry 
and updating container-executor.cfg to support that use-case then I agree the 
allowed images property is extraneous.


> Allow administrators to set a single ContainerRuntime for all containers
> 
>
> Key: YARN-6456
> URL: https://issues.apache.org/jira/browse/YARN-6456
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Miklos Szegedi
>Assignee: Craig Condit
>Priority: Major
>  Labels: Docker
> Attachments: YARN-6456-ForceDockerRuntimeIfSupported.patch, 
> YARN-6456.001.patch, YARN-6456.002.patch, YARN-6456.003.patch
>
>
>  
> With LCE, there are multiple ContainerRuntimes available for handling 
> different types of containers; default, docker, java sandbox. Admins should 
> have the ability to override the user decision and set a single global 
> ContainerRuntime to be used for all containers.
> Original Description:
> {quote}One reason to use Docker containers is to be able to isolate different 
> workloads, even if they run as the same user.
> I have noticed some issues in the current design:
>  1. DockerLinuxContainerRuntime mounts containerLocalDirs 
> {{nm-local-dir/usercache/user/appcache/application_1491598755372_0011/}} and 
> userLocalDirs {{nm-local-dir/usercache/user/}}, so that a container can see 
> and modify the files of another container. I think the application file cache 
> directory should be enough for the container to run in most of the cases.
>  2. The whole cgroups directory is mounted. Would the container directory be 
> enough?
>  3. There is no way to enforce exclusive use of Docker for all containers. 
> There should be an option that it is not the user but the admin that requires 
> to use Docker.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8748) Javadoc warnings within the nodemanager package

2018-09-13 Thread Shane Kumpf (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613661#comment-16613661
 ] 

Shane Kumpf commented on YARN-8748:
---

Thanks for the contribution, [~ccondit-target]. It's a bummer we need to 
introduce new warnings to address these warnings, but I see what you mean. +1 
I'll commit this shortly.

> Javadoc warnings within the nodemanager package
> ---
>
> Key: YARN-8748
> URL: https://issues.apache.org/jira/browse/YARN-8748
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Shane Kumpf
>Assignee: Craig Condit
>Priority: Trivial
> Attachments: YARN-8748.001.patch
>
>
> There are a number of javadoc warnings in trunk in classes under the 
> nodemanager package. These should be addressed or suppressed.
> {code:java}
> [WARNING] Javadoc Warnings
> [WARNING] 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java:93:
>  warning - Tag @see: reference not found: 
> ContainerLaunch.ShellScriptBuilder#listDebugInformation
> [WARNING] 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/JavaSandboxLinuxContainerRuntime.java:118:
>  warning - YarnConfiguration#YARN_CONTAINER_SANDBOX (referenced by @value 
> tag) is an unknown reference.
> [WARNING] 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/JavaSandboxLinuxContainerRuntime.java:118:
>  warning - YarnConfiguration#YARN_CONTAINER_SANDBOX_FILE_PERMISSIONS 
> (referenced by @value tag) is an unknown reference.
> [WARNING] 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/JavaSandboxLinuxContainerRuntime.java:118:
>  warning - YarnConfiguration#YARN_CONTAINER_SANDBOX_POLICY (referenced by 
> @value tag) is an unknown reference.
> [WARNING] 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/JavaSandboxLinuxContainerRuntime.java:118:
>  warning - YarnConfiguration#YARN_CONTAINER_SANDBOX_WHITELIST_GROUP 
> (referenced by @value tag) is an unknown reference.
> [WARNING] 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/JavaSandboxLinuxContainerRuntime.java:118:
>  warning - YarnConfiguration#YARN_CONTAINER_SANDBOX_POLICY_GROUP_PREFIX 
> (referenced by @value tag) is an unknown reference.
> [WARNING] 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/JavaSandboxLinuxContainerRuntime.java:211:
>  warning - YarnConfiguration#YARN_CONTAINER_SANDBOX_WHITELIST_GROUP 
> (referenced by @value tag) is an unknown reference.
> [WARNING] 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/JavaSandboxLinuxContainerRuntime.java:211:
>  warning - NMContainerPolicyUtils#SECURITY_FLAG (referenced by @value tag) is 
> an unknown reference.
> [WARNING] 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TrafficControlBandwidthHandlerImpl.java:248:
>  warning - @return tag has no arguments.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8729) Node status updater thread could be lost after it is restarted

2018-09-13 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613643#comment-16613643
 ] 

Weiwei Yang commented on YARN-8729:
---

Pushed to trunk, cherry-picked to branch-3.1, branch-3.0, branch-2.9 and 
branch-2.8. Thanks for the contribution [~Tao Yang] and thanks for the review 
[~ebadger].

> Node status updater thread could be lost after it is restarted
> --
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 2.9.2, 3.0.4, 3.1.2, 2.8.6
>
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before executing {{statusUpdater.start()}}, 
> so if the newly started thread immediately sees isStopped==true, it 
> will exit without any log.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}
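
For clarity, the corrected ordering implied by the inline comment above would be (same fragment as in the quoted description, reordered; not a complete method):

{code:java}
 statusUpdater.join();
 registerWithRM();
 statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
 this.isStopped = false;   // reset before start(), so the new thread cannot
                           // observe the stale flag and exit silently
 statusUpdater.start();
 LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
{code}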



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8729) Node status updater thread could be lost after it is restarted

2018-09-13 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8729:
--
Fix Version/s: 3.0.4

> Node status updater thread could be lost after it is restarted
> --
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 2.9.2, 3.0.4, 3.1.2, 2.8.6
>
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without 
> logging anything.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8729) Node status updater thread could be lost after it is restarted

2018-09-13 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8729:
--
Fix Version/s: 3.1.2

> Node status updater thread could be lost after it is restarted
> --
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 2.9.2, 3.0.4, 3.1.2, 2.8.6
>
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without 
> logging anything.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8759) Copy of "resource-types.xml" is not deleted if test fails, causes other test failures

2018-09-13 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613640#comment-16613640
 ] 

Sunil Govindan commented on YARN-8759:
--

Thanks [~bsteinbach]. Looks good to me.

Will commit later today if no objections.

> Copy of "resource-types.xml" is not deleted if test fails, causes other test 
> failures
> -
>
> Key: YARN-8759
> URL: https://issues.apache.org/jira/browse/YARN-8759
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Antal Bálint Steinbach
>Assignee: Antal Bálint Steinbach
>Priority: Major
> Attachments: YARN-8759.001.patch, YARN-8759.002.patch, 
> YARN-8759.003.patch
>
>
> resource-types.xml is copied to the test machine in several tests, but it is 
> deleted only at the end of the test. If the test fails, the file is not deleted 
> and other tests fail afterwards because of the stale configuration.
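
A common way to guarantee the cleanup regardless of the test outcome is a JUnit {{@After}} (or try/finally) block. A minimal sketch follows, assuming JUnit 4; the class name and file location are hypothetical and are not taken from the actual patch.

{code:java}
import java.io.File;
import org.junit.After;
import org.junit.Test;

public class ResourceTypesCleanupSketch {
  // Hypothetical location; the real tests copy resource-types.xml onto the
  // test classpath, and the exact path depends on the test setup.
  private final File resourceTypesFile =
      new File("target/test-classes/resource-types.xml");

  @After
  public void cleanup() {
    // Runs even when the test body throws, so a failing test cannot leave
    // a stale resource-types.xml behind to break later tests.
    if (resourceTypesFile.exists()) {
      resourceTypesFile.delete();
    }
  }

  @Test
  public void testSomethingThatNeedsResourceTypes() throws Exception {
    // ... copy resource-types.xml into place and run the scenario ...
  }
}
{code}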



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5592) Add support for dynamic resource updates with multiple resource types

2018-09-13 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613629#comment-16613629
 ] 

Sunil Govindan commented on YARN-5592:
--

Even if we limit this to the add operation alone, as [~leftnoteasy] noted, there 
will be an impact on the scheduler: we assume all resource objects carry the same 
resource information.

I think let's take a pause on this for now, because any change to the resource 
comparison logic in DominantRC or the Resources class will have a huge impact on CS 
container allocation performance. Thanks [~leftnoteasy] for pointing it out.

[~maniraj...@gmail.com], thoughts? I think we can live with restarting the RM for 
now, but a sudden change would affect performance a lot.

> Add support for dynamic resource updates with multiple resource types
> -
>
> Key: YARN-5592
> URL: https://issues.apache.org/jira/browse/YARN-5592
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Varun Vasudev
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-5592-design-2.docx
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8729) Node status updater thread could be lost after it is restarted

2018-09-13 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8729:
--
Affects Version/s: (was: 3.2.0)
   2.8.0

> Node status updater thread could be lost after it is restarted
> --
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 2.9.2, 2.8.6
>
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without 
> logging anything.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8729) Node status updater thread could be lost after it is restarted

2018-09-13 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8729:
--
Fix Version/s: 2.9.2

> Node status updater thread could be lost after it is restarted
> --
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 2.9.2, 2.8.6
>
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without 
> logging anything.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8757) [Submarine] Add Tensorboard component when --tensorboard is specified

2018-09-13 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613623#comment-16613623
 ] 

Sunil Govindan commented on YARN-8757:
--

Thanks [~leftnoteasy]

Some comments

1.
{code:java}
113 tensorboardDockerImage = parsedCommandLine.getOptionValue(
114 CliConstants.TENSORBOARD_DOCKER_IMAGE);{code}
In the above segment, we also have to handle the default case, correct?

2. In {{updateParametersByParsedCommandline}}, 
{{setWorkerResource(workerResource)}} is removed, is this intentional?

3. 
{code:java}
467 // Add tensorboard to quicklink
468 String tensorboardLink = "http://; + YarnServiceUtils.getDNSName(
469 parameters.getName(), TaskType.TENSORBOARD.getComponentName(), 0,
470 getUserName(), getDNSDomain(), 6006);{code}
Could we take *http*, *6006*, etc. from a common config endpoint or from default 
constants? (See the sketch after these comments.)

4. I think we can publish the Tensorboard link in all cases when the user asks for 
Tensorboard; we may not need to check for verbose.

Tests look good to me. Thank you.
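
Regarding comment 3, a minimal sketch of pulling the hardcoded scheme and Tensorboard port into named constants. The class and constant names below are illustrative only; they do not exist in the Submarine code base.

{code:java}
// Illustrative constants; names are hypothetical, not existing Submarine code.
public final class TensorboardQuicklinkSketch {
  private static final String QUICKLINK_SCHEME = "http://";
  private static final int DEFAULT_TENSORBOARD_PORT = 6006;

  private TensorboardQuicklinkSketch() { }

  /** Builds the quicklink from a DNS name instead of hardcoding scheme/port inline. */
  static String tensorboardLink(String dnsName) {
    return QUICKLINK_SCHEME + dnsName + ":" + DEFAULT_TENSORBOARD_PORT;
  }

  public static void main(String[] args) {
    System.out.println(tensorboardLink("tensorboard-0.myjob.user.example.com"));
  }
}
{code}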

> [Submarine] Add Tensorboard component when --tensorboard is specified
> -
>
> Key: YARN-8757
> URL: https://issues.apache.org/jira/browse/YARN-8757
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8757.001.patch, YARN-8757.002.patch
>
>
> We need to have a Tensorboard component when --tensorboard is specified. And 
> we need to set quicklinks to let users view tensorboard.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8729) Node status updater thread could be lost after it is restarted

2018-09-13 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8729:
--
Fix Version/s: 2.8.6

> Node status updater thread could be lost after it is restarted
> --
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 2.8.6
>
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without 
> logging anything.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8729) Node status updater thread could be lost after it is restarted

2018-09-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613616#comment-16613616
 ] 

Hudson commented on YARN-8729:
--

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #14946 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14946/])
YARN-8729. Node status updater thread could be lost after it is (wwei: rev 
39c1ea1ed454b6c61f0985fc951f20913ed964fb)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java


> Node status updater thread could be lost after it is restarted
> --
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without 
> logging anything.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8729) Node status updater thread could be lost after it is restarted

2018-09-13 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8729:
--
Summary: Node status updater thread could be lost after it is restarted  
(was: Node status updater thread could be lost after it restarted)

> Node status updater thread could be lost after it is restarted
> --
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without 
> logging anything.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8729) Node status updater thread could be lost after it restarted

2018-09-13 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613560#comment-16613560
 ] 

Weiwei Yang commented on YARN-8729:
---

Thanks [~ebadger], [~Tao Yang], I will commit this shortly. 

> Node status updater thread could be lost after it restarted
> ---
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without 
> logging anything.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8729) Node status updater thread could be lost after it restarted

2018-09-13 Thread Eric Badger (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613557#comment-16613557
 ] 

Eric Badger commented on YARN-8729:
---

+1 (non-binding) from me

> Node status updater thread could be lost after it restarted
> ---
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without 
> logging anything.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8767) TestStreamingStatus fails

2018-09-13 Thread Andras Bokor (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613458#comment-16613458
 ] 

Andras Bokor commented on YARN-8767:


It seems that if yarn.resourcemanager.address is not set, 
org.apache.hadoop.yarn.client.ClientRMProxy#getRMAddress will fall back to the 
default port, which is 8032, but the RM actually starts somewhere around 5.
We have to set "yarn.resourcemanager.address" for the job. 
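
A minimal sketch of what setting "yarn.resourcemanager.address" for the job could look like in test code. The port value is a placeholder; in a real test it should be whatever address the test ResourceManager actually binds to.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmAddressConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Point the client at the address the test RM actually listens on,
    // instead of letting the client fall back to 0.0.0.0:8032.
    // "localhost:54321" is a placeholder value for illustration only.
    conf.set(YarnConfiguration.RM_ADDRESS, "localhost:54321");
    System.out.println(conf.get(YarnConfiguration.RM_ADDRESS));
  }
}
{code}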

> TestStreamingStatus fails
> -
>
> Key: YARN-8767
> URL: https://issues.apache.org/jira/browse/YARN-8767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Andras Bokor
>Assignee: Andras Bokor
>Priority: Major
> Attachments: YARN-8767.001.patch, YARN-8767.002.patch
>
>
> The test tries to connect to RM through 0.0.0.0:8032, but it cannot.
> On the console I see the following error message:
> {code}Your endpoint configuration is wrong; For more details see:  
> http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking 
> ApplicationClientProtocolPBClientImpl.getNewApplication over null after 1 
> failover attempts. Trying to failover after sleeping for 44892ms.{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8767) TestStreamingStatus fails

2018-09-13 Thread Andras Bokor (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Bokor updated YARN-8767:
---
Attachment: YARN-8767.002.patch

> TestStreamingStatus fails
> -
>
> Key: YARN-8767
> URL: https://issues.apache.org/jira/browse/YARN-8767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Andras Bokor
>Assignee: Andras Bokor
>Priority: Major
> Attachments: YARN-8767.001.patch, YARN-8767.002.patch
>
>
> The test tries to connect to RM through 0.0.0.0:8032, but it cannot.
> On the console I see the following error message:
> {code}Your endpoint configuration is wrong; For more details see:  
> http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking 
> ApplicationClientProtocolPBClientImpl.getNewApplication over null after 1 
> failover attempts. Trying to failover after sleeping for 44892ms.{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8630) ATSv2 REST APIs should honor filter-entity-list-by-user in non-secure cluster when ACls are enabled

2018-09-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613448#comment-16613448
 ] 

Hudson commented on YARN-8630:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14944 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14944/])
YARN-8630. ATSv2 REST APIs should honor filter-entity-list-by-user in (sunilg: 
rev f4bda5e8e9fee6c5a0dda7c79ef14e73aec20e7e)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/reader/TimelineReaderWebServices.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/reader/TestTimelineReaderWebServicesBasicAcl.java


> ATSv2 REST APIs should honor filter-entity-list-by-user in non-secure cluster 
> when ACls are enabled
> ---
>
> Key: YARN-8630
> URL: https://issues.apache.org/jira/browse/YARN-8630
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8630.01.patch
>
>
> It is observed that ATSv2 REST endpoints are not honoring 
> *yarn.webapp.filter-entity-list-by-user* in a non-secure cluster when ACLs are 
> enabled. 
> The issue can be seen if the static web app filter is not configured in a 
> non-secure cluster.
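
For context, a minimal sketch of the configuration combination described above. The property name yarn.webapp.filter-entity-list-by-user is taken from the description; the boolean values are illustrative only.

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AtsAclFilterConfigSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Non-secure cluster with ACLs enabled...
    conf.setBoolean(YarnConfiguration.YARN_ACL_ENABLE, true);
    // ...and entity filtering by user requested for the timeline reader.
    conf.setBoolean("yarn.webapp.filter-entity-list-by-user", true);
    System.out.println(
        conf.getBoolean("yarn.webapp.filter-entity-list-by-user", false));
  }
}
{code}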



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8767) TestStreamingStatus fails

2018-09-13 Thread Andras Bokor (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Bokor updated YARN-8767:
---
Attachment: YARN-8767.001.patch

> TestStreamingStatus fails
> -
>
> Key: YARN-8767
> URL: https://issues.apache.org/jira/browse/YARN-8767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Andras Bokor
>Assignee: Andras Bokor
>Priority: Major
> Attachments: YARN-8767.001.patch
>
>
> The test tries to connect to RM through 0.0.0.0:8032, but it cannot.
> On the console I see the following error message:
> {code}Your endpoint configuration is wrong; For more details see:  
> http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking 
> ApplicationClientProtocolPBClientImpl.getNewApplication over null after 1 
> failover attempts. Trying to failover after sleeping for 44892ms.{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8767) TestStreamingStatus fails

2018-09-13 Thread Andras Bokor (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Bokor updated YARN-8767:
---
Description: 
The test tries to connect to RM through 0.0.0.0:8032, but it cannot.

On the console I see the following error message:

{code}Your endpoint configuration is wrong; For more details see:  
http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking 
ApplicationClientProtocolPBClientImpl.getNewApplication over null after 1 
failover attempts. Trying to failover after sleeping for 44892ms.{code}

  was:
The test tries to connect to RM through 0.0.0.0:8032, but it cannot.

On the console I see the following error message:

{code}Your endpoint configuration is wrong; For more details see:  
http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking 
ApplicationClientProtocolPBClientImpl.getNewApplication over null after 1 
failover attempts. Trying to failover after sleeping for 44892ms.{code}

Do I miss some configuration?


> TestStreamingStatus fails
> -
>
> Key: YARN-8767
> URL: https://issues.apache.org/jira/browse/YARN-8767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Andras Bokor
>Assignee: Andras Bokor
>Priority: Major
>
> The test tries to connect to RM through 0.0.0.0:8032, but it cannot.
> On the console I see the following error message:
> {code}Your endpoint configuration is wrong; For more details see:  
> http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking 
> ApplicationClientProtocolPBClientImpl.getNewApplication over null after 1 
> failover attempts. Trying to failover after sleeping for 44892ms.{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3879) [Storage implementation] Create HDFS backing storage implementation for ATS reads

2018-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613350#comment-16613350
 ] 

Hadoop QA commented on YARN-3879:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
27s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 11s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
20s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 24s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
18s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m  
0s{color} | {color:green} hadoop-yarn-server-timelineservice in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 52m 54s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-3879 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12939542/YARN-3879.005.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux af2cf0c8a4f4 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 
14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / c6e19db |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21830/testReport/ |
| Max. process+thread count | 336 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21830/console |
| Powered by | Apache Yetus 0.8.0  

[jira] [Commented] (YARN-3841) [Storage implementation] Adding retry semantics to HDFS backing storage

2018-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613347#comment-16613347
 ] 

Hadoop QA commented on YARN-3841:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 24m 
42s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 59s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
18s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 12s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice:
 The patch generated 2 new + 1 unchanged - 1 fixed = 3 total (was 2) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 26s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
16s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m  
2s{color} | {color:green} hadoop-yarn-server-timelineservice in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
39s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 77m 33s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-3841 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12939532/YARN-3841.005.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux ad5d56788244 4.4.0-133-generic #159-Ubuntu SMP Fri Aug 10 
07:31:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / c6e19db |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/21829/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21829/testReport/ |
| Max. process+thread count | 451 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice
 

[jira] [Commented] (YARN-8720) CapacityScheduler does not enforce yarn.scheduler.capacity..maximum-allocation-mb/vcores when configured

2018-09-13 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613332#comment-16613332
 ] 

Sunil Govindan commented on YARN-8720:
--

Hi [~tarunparimi]

{{yarn.scheduler.capacity..maximum-allocation-mb}} and 
{{yarn.scheduler.capacity.maximum-allocation-mb}} are both valid. 

And {{Resource getMaximumResourceCapability(String queueName)}} from 
YarnScheduler takes care of falling back to the latter configuration.
{code:java}
public Resource getMaximumResourceCapability(String queueName) {
  CSQueue queue = getQueue(queueName);
  if (queue == null) {
LOG.error("Unknown queue: " + queueName);
return getMaximumResourceCapability();
  }
  if (!(queue instanceof LeafQueue)) {
LOG.error("queue " + queueName + " is not an leaf queue");
return getMaximumResourceCapability();
  }
  return ((LeafQueue)queue).getMaximumAllocation();
}{code}
Hence I think it's fine to have your modification. [~cheersyang], could you also 
please take a look at this?

> CapacityScheduler does not enforce 
> yarn.scheduler.capacity..maximum-allocation-mb/vcores when 
> configured
> 
>
> Key: YARN-8720
> URL: https://issues.apache.org/jira/browse/YARN-8720
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8720.001.patch, YARN-8720.002.patch
>
>
> The value of 
> yarn.scheduler.capacity..maximum-allocation-mb/vcores is not 
> strictly enforced when applications request containers. An 
> InvalidResourceRequestException is thrown only when the ResourceRequest is 
> greater than the global value of yarn.scheduler.maximum-allocation-mb/vcores. 
> So for an example configuration such as below,
>  
> {code:java}
> yarn.scheduler.maximum-allocation-mb=4096
> yarn.scheduler.capacity.root.test.maximum-allocation-mb=2048
> {code}
>  
> The below DSShell command runs successfully and asks for an AM container of size 
> 4096MB, which is greater than the max of 2048MB configured for the test queue.
> {code:java}
> yarn jar $YARN_HOME/hadoop-yarn-applications-distributedshell.jar 
> -num_containers 1 -jar 
> $YARN_HOME/hadoop-yarn-applications-distributedshell.jar -shell_command 
> "sleep 60" -container_memory=4096 -master_memory=4096 -queue=test{code}
> Instead, it should not launch the application and should fail with 
> InvalidResourceRequestException. The child container, however, will be 
> requested with size 2048MB, as the DSShell AppMaster does the below check before 
> making the ResourceRequest ask to the RM.
> {code:java}
> // A resource ask cannot exceed the max.
> if (containerMemory > maxMem) {
>  LOG.info("Container memory specified above max threshold of cluster."
>  + " Using max value." + ", specified=" + containerMemory + ", max="
>  + maxMem);
>  containerMemory = maxMem;
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8729) Node status updater thread could be lost after it restarted

2018-09-13 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613300#comment-16613300
 ] 

Weiwei Yang commented on YARN-8729:
---

+1 on my side.

[~ebadger], are you fine with this patch?

> Node status updater thread could be lost after it restarted
> ---
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without 
> logging anything.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-13 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613299#comment-16613299
 ] 

Tao Yang commented on YARN-8771:


Attached v1 patch for review. 
[~cheersyang], can you help review this patch when you have time? Thanks

> CapacityScheduler fails to unreserve when cluster resource contains empty 
> resource type
> ---
>
> Key: YARN-8771
> URL: https://issues.apache.org/jira/browse/YARN-8771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8771.001.patch
>
>
> We found this problem when the cluster was almost, but not fully, exhausted 
> (93% used): the scheduler kept allocating for an app but always failed to 
> commit, which can block requests from other apps and leave part of the cluster 
> resource unusable.
> To reproduce this problem:
> (1) use DominantResourceCalculator
> (2) the cluster resource has an empty resource type, for example gpu=0
> (3) the scheduler allocates a container for app1, which has reserved containers 
> and whose queue limit or user limit is reached (used + required > limit). 
> Reference code in RegularContainerAllocator#assignContainer:
> {code:java}
> boolean needToUnreserve =
>     Resources.greaterThan(rc, clusterResource,
>         resourceNeedToUnReserve, Resources.none());
> {code}
> The value of resourceNeedToUnReserve can be <8GB, -6 cores, 0 gpu>; the result 
> of {{Resources#greaterThan}} will be false when using DominantResourceCalculator.
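
To make the failure mode concrete, here is a standalone illustration in plain Java. This is not the Hadoop DominantResourceCalculator code, and the division-by-zero reading is an assumption based on the description: when one cluster resource is 0, a dominant-share style comparison can produce NaN, and every comparison against NaN is false, so the "greater than none" check never fires.

{code:java}
// Standalone illustration only; it mimics the shape of the comparison
// described above, not the actual Hadoop implementation.
public class DominantShareSketch {

  // dominant share = max over resource types of used[i] / clusterTotal[i]
  static double dominantShare(double[] used, double[] clusterTotal) {
    double share = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < used.length; i++) {
      share = Math.max(share, used[i] / clusterTotal[i]); // 0/0 -> NaN
    }
    return share;
  }

  public static void main(String[] args) {
    double[] cluster = {1024 * 100, 96, 0};          // memory, vcores, gpu=0
    double[] needToUnreserve = {8 * 1024, -6, 0};    // <8GB, -6 cores, 0 gpu>
    double[] none = {0, 0, 0};

    double lhs = dominantShare(needToUnreserve, cluster); // NaN because of 0/0
    double rhs = dominantShare(none, cluster);            // also NaN

    // Any comparison involving NaN is false, so "needToUnreserve > none"
    // never holds and the unreserve path is skipped.
    System.out.println("lhs=" + lhs + " rhs=" + rhs + " lhs>rhs=" + (lhs > rhs));
  }
}
{code}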



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-13 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Attachment: YARN-8771.001.patch

> CapacityScheduler fails to unreserve when cluster resource contains empty 
> resource type
> ---
>
> Key: YARN-8771
> URL: https://issues.apache.org/jira/browse/YARN-8771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8771.001.patch
>
>
> We found this problem when the cluster was almost, but not fully, exhausted 
> (93% used): the scheduler kept allocating for an app but always failed to 
> commit, which can block requests from other apps and leave part of the cluster 
> resource unusable.
> To reproduce this problem:
> (1) use DominantResourceCalculator
> (2) the cluster resource has an empty resource type, for example gpu=0
> (3) the scheduler allocates a container for app1, which has reserved containers 
> and whose queue limit or user limit is reached (used + required > limit). 
> Reference code in RegularContainerAllocator#assignContainer:
> {code:java}
> boolean needToUnreserve =
>     Resources.greaterThan(rc, clusterResource,
>         resourceNeedToUnReserve, Resources.none());
> {code}
> The value of resourceNeedToUnReserve can be <8GB, -6 cores, 0 gpu>; the result 
> of {{Resources#greaterThan}} will be false when using DominantResourceCalculator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3879) [Storage implementation] Create HDFS backing storage implementation for ATS reads

2018-09-13 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613296#comment-16613296
 ] 

Abhishek Modi commented on YARN-3879:
-

Thanks [~vrushalic] for the review. I have submitted an updated patch with fixes 
for the review comments.

> [Storage implementation] Create HDFS backing storage implementation for ATS 
> reads
> -
>
> Key: YARN-3879
> URL: https://issues.apache.org/jira/browse/YARN-3879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Tsuyoshi Ozawa
>Assignee: Abhishek Modi
>Priority: Major
>  Labels: YARN-5355
> Attachments: YARN-3879-YARN-7055.001.patch, YARN-3879.001.patch, 
> YARN-3879.002.patch, YARN-3879.003.patch, YARN-3879.004.patch, 
> YARN-3879.005.patch
>
>
> Reader version of YARN-3841



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-3879) [Storage implementation] Create HDFS backing storage implementation for ATS reads

2018-09-13 Thread Abhishek Modi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Modi updated YARN-3879:

Attachment: YARN-3879.005.patch

> [Storage implementation] Create HDFS backing storage implementation for ATS 
> reads
> -
>
> Key: YARN-3879
> URL: https://issues.apache.org/jira/browse/YARN-3879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Tsuyoshi Ozawa
>Assignee: Abhishek Modi
>Priority: Major
>  Labels: YARN-5355
> Attachments: YARN-3879-YARN-7055.001.patch, YARN-3879.001.patch, 
> YARN-3879.002.patch, YARN-3879.003.patch, YARN-3879.004.patch, 
> YARN-3879.005.patch
>
>
> Reader version of YARN-3841



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8729) Node status updater thread could be lost after it restarted

2018-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613294#comment-16613294
 ] 

Hadoop QA commented on YARN-8729:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
 4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  4s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 11s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 20m  
7s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 79m 41s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8729 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12939524/YARN-8729.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 84088a0956a7 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / c6e19db |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21828/testReport/ |
| Max. process+thread count | 300 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21828/console |
| Powered by | Apache Yetus 0.8.0   

[jira] [Created] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-13 Thread Tao Yang (JIRA)
Tao Yang created YARN-8771:
--

 Summary: CapacityScheduler fails to unreserve when cluster 
resource contains empty resource type
 Key: YARN-8771
 URL: https://issues.apache.org/jira/browse/YARN-8771
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.2.0
Reporter: Tao Yang
Assignee: Tao Yang


We found this problem when the cluster was almost, but not fully, exhausted (93% 
used): the scheduler kept allocating for an app but always failed to commit, which 
can block requests from other apps and leave part of the cluster resource unusable.

To reproduce this problem:
(1) use DominantResourceCalculator
(2) the cluster resource has an empty resource type, for example gpu=0
(3) the scheduler allocates a container for app1, which has reserved containers and 
whose queue limit or user limit is reached (used + required > limit). 

Reference code in RegularContainerAllocator#assignContainer:
{code:java}
boolean needToUnreserve =
    Resources.greaterThan(rc, clusterResource,
        resourceNeedToUnReserve, Resources.none());
{code}
The value of resourceNeedToUnReserve can be <8GB, -6 cores, 0 gpu>; the result of 
{{Resources#greaterThan}} will be false when using DominantResourceCalculator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3841) [Storage implementation] Adding retry semantics to HDFS backing storage

2018-09-13 Thread Abhishek Modi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613236#comment-16613236
 ] 

Abhishek Modi commented on YARN-3841:
-

Thanks [~vrushalic] for the review. Attached a new patch with fixes for the review 
comments.

> [Storage implementation] Adding retry semantics to HDFS backing storage
> ---
>
> Key: YARN-3841
> URL: https://issues.apache.org/jira/browse/YARN-3841
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Tsuyoshi Ozawa
>Assignee: Abhishek Modi
>Priority: Major
>  Labels: YARN-5355
> Attachments: YARN-3841-YARN-7055.002.patch, YARN-3841.001.patch, 
> YARN-3841.002.patch, YARN-3841.003.patch, YARN-3841.004.patch, 
> YARN-3841.005.patch
>
>
> HDFS backing storage is useful for the following scenarios.
> 1. For Hadoop clusters which don't run HBase.
> 2. For fallback from HBase when the HBase cluster is temporarily unavailable. 
> Quoting the ATS design document of YARN-2928:
> {quote}
> In the case the HBase
> storage is not available, the plugin should buffer the writes temporarily 
> (e.g. HDFS), and flush
> them once the storage comes back online. Reading and writing to hdfs as the 
> the backup storage
> could potentially use the HDFS writer plugin unless the complexity of 
> generalizing the HDFS
> writer plugin for this purpose exceeds the benefits of reusing it here.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-3841) [Storage implementation] Adding retry semantics to HDFS backing storage

2018-09-13 Thread Abhishek Modi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Modi updated YARN-3841:

Attachment: YARN-3841.005.patch

> [Storage implementation] Adding retry semantics to HDFS backing storage
> ---
>
> Key: YARN-3841
> URL: https://issues.apache.org/jira/browse/YARN-3841
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Tsuyoshi Ozawa
>Assignee: Abhishek Modi
>Priority: Major
>  Labels: YARN-5355
> Attachments: YARN-3841-YARN-7055.002.patch, YARN-3841.001.patch, 
> YARN-3841.002.patch, YARN-3841.003.patch, YARN-3841.004.patch, 
> YARN-3841.005.patch
>
>
> HDFS backing storage is useful for the following scenarios:
> 1. For Hadoop clusters which don't run HBase.
> 2. As a fallback from HBase when the HBase cluster is temporarily unavailable. 
> Quoting the ATS design document of YARN-2928:
> {quote}
> In the case the HBase
> storage is not available, the plugin should buffer the writes temporarily 
> (e.g. HDFS), and flush
> them once the storage comes back online. Reading and writing to hdfs as the 
> the backup storage
> could potentially use the HDFS writer plugin unless the complexity of 
> generalizing the HDFS
> writer plugin for this purpose exceeds the benefits of reusing it here.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8635) Container Resource localization fails if umask is 077

2018-09-13 Thread Bilwa S T (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613234#comment-16613234
 ] 

Bilwa S T commented on YARN-8635:
-

[~bibinchundatt] Please review

> Container Resource localization fails if umask is 077
> -
>
> Key: YARN-8635
> URL: https://issues.apache.org/jira/browse/YARN-8635
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-8635-001.patch
>
>
> {code}
> java.io.IOException: Application application_1533652359071_0001 
> initialization failed (exitCode=255) with output: main : command provided 0
> main : run as user is mapred
> main : requested yarn user is mapred
> Path 
> /opt/HA/OSBR310/nmlocal/usercache/mapred/appcache/application_1533652359071_0001
>  has permission 700 but needs permission 750.
> Did not create any app directories
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:411)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
> Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=255:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:180)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:402)
> ... 1 more
> Caused by: ExitCodeException exitCode=255:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:1009)
> at org.apache.hadoop.util.Shell.run(Shell.java:902)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
> ... 2 more
> 2018-08-08 17:43:26,918 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e04_1533652359071_0001_01_27 transitioned from 
> LOCALIZING to LOCALIZATION_FAILED
> 2018-08-08 17:43:26,916 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
> from container container_e04_1533652359071_0001_01_31 startLocalizer is : 
> 255
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=255:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:180)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:402)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
> Caused by: ExitCodeException exitCode=255:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:1009)
> at org.apache.hadoop.util.Shell.run(Shell.java:902)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
> ... 2 more
> 2018-08-08 17:43:26,923 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed for containe
> {code}
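
For illustration only, and not necessarily what YARN-8635-001.patch does: the sketch 
below shows the generic umask behavior behind the log above. A directory created while 
the process umask is 077 ends up with mode 700 even if 750 is requested at creation 
time, because mkdir's mode argument is masked by the umask, whereas an explicit 
chmod-style call afterwards applies 750 exactly. The path used here is hypothetical.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class UmaskSketch {
  public static void main(String[] args) throws IOException {
    Path appDir = Paths.get("/tmp/umask-sketch-appdir"); // hypothetical path
    Set<PosixFilePermission> perm750 = PosixFilePermissions.fromString("rwxr-x---");

    // Creating with a permission attribute still goes through mkdir(2), whose
    // mode argument is masked by the process umask; with umask 077 this yields 700.
    Files.createDirectory(appDir, PosixFilePermissions.asFileAttribute(perm750));
    System.out.println("after create: "
        + PosixFilePermissions.toString(Files.getPosixFilePermissions(appDir)));

    // Setting the permissions explicitly (a chmod) is independent of the umask,
    // so 750 (rwxr-x---) is applied exactly.
    Files.setPosixFilePermissions(appDir, perm750);
    System.out.println("after chmod:  "
        + PosixFilePermissions.toString(Files.getPosixFilePermissions(appDir)));

    Files.delete(appDir); // clean up
  }
}
{code}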



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8635) Container Resource localization fails if umask is 077

2018-09-13 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613183#comment-16613183
 ] 

Hadoop QA commented on YARN-8635:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
33m  4s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 38s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 
39s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 67m 49s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8635 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12939514/YARN-8635-001.patch |
| Optional Tests |  dupname  asflicense  compile  cc  mvnsite  javac  unit  |
| uname | Linux 1ed0f0792853 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / c6e19db |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21827/testReport/ |
| Max. process+thread count | 303 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21827/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Container Resource localization fails if umask is 077
> -
>
> Key: YARN-8635
> URL: https://issues.apache.org/jira/browse/YARN-8635
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-8635-001.patch
>
>
> {code}
> java.io.IOException: Application application_1533652359071_0001 
> initialization failed (exitCode=255) with output: main : command provided 0
> main : run as user is mapred
> main : requested yarn user is 

[jira] [Updated] (YARN-8729) Node status updater thread could be lost after it restarted

2018-09-13 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8729:
--
Attachment: YARN-8729.002.patch

> Node status updater thread could be lost after it restarted
> ---
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8729.001.patch, YARN-8729.001.patch, 
> YARN-8729.002.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> the thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not reset to false before {{statusUpdater.start()}} is executed, so if the 
> new thread starts immediately and sees isStopped==true, it will exit without any 
> log.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   //this line should be moved before 
> statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}
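
Below is a standalone sketch of this race and of the reordering suggested in the 
comment above (reset the stop flag before starting the new thread); the class and 
method names are illustrative and are not the actual NodeStatusUpdaterImpl code:
{code:java}
// Illustrative only: a restartable worker with the same start/stop-flag pattern
// as the node status updater described above.
public class RestartableWorker {
  private volatile boolean isStopped = false;
  private Thread worker;

  private void runLoop() {
    while (!isStopped) {
      try {
        Thread.sleep(100); // stand-in for "send node status to the RM"
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
    // A freshly restarted thread that still sees isStopped==true falls through
    // here immediately and exits without logging anything.
  }

  public synchronized void start() {
    isStopped = false; // reset BEFORE starting, as suggested in the comment above
    worker = new Thread(this::runLoop, "Node Status Updater");
    worker.start();
  }

  public synchronized void restart() throws InterruptedException {
    isStopped = true;  // ask the current thread to stop
    worker.join();     // wait for it to exit
    start();           // flag is reset before the new thread begins running
  }

  public static void main(String[] args) throws InterruptedException {
    RestartableWorker w = new RestartableWorker();
    w.start();
    w.restart();
    Thread.sleep(300);
    System.out.println("updater alive after restart: " + w.worker.isAlive()); // true
  }
}
{code}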



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8635) Container Resource localization fails if umask is 077

2018-09-13 Thread Bilwa S T (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613111#comment-16613111
 ] 

Bilwa S T commented on YARN-8635:
-

Thanks [~bibinchundatt] for reporting the issue

> Container Resource localization fails if umask is 077
> -
>
> Key: YARN-8635
> URL: https://issues.apache.org/jira/browse/YARN-8635
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-8635-001.patch
>
>
> {code}
> java.io.IOException: Application application_1533652359071_0001 
> initialization failed (exitCode=255) with output: main : command provided 0
> main : run as user is mapred
> main : requested yarn user is mapred
> Path 
> /opt/HA/OSBR310/nmlocal/usercache/mapred/appcache/application_1533652359071_0001
>  has permission 700 but needs permission 750.
> Did not create any app directories
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:411)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
> Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=255:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:180)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:402)
> ... 1 more
> Caused by: ExitCodeException exitCode=255:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:1009)
> at org.apache.hadoop.util.Shell.run(Shell.java:902)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
> ... 2 more
> 2018-08-08 17:43:26,918 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e04_1533652359071_0001_01_27 transitioned from 
> LOCALIZING to LOCALIZATION_FAILED
> 2018-08-08 17:43:26,916 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
> from container container_e04_1533652359071_0001_01_31 startLocalizer is : 
> 255
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=255:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:180)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:402)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
> Caused by: ExitCodeException exitCode=255:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:1009)
> at org.apache.hadoop.util.Shell.run(Shell.java:902)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
> ... 2 more
> 2018-08-08 17:43:26,923 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed for containe
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8635) Container Resource localization fails if umask is 077

2018-09-13 Thread Bilwa S T (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bilwa S T updated YARN-8635:

Attachment: YARN-8635-001.patch

> Container Resource localization fails if umask is 077
> -
>
> Key: YARN-8635
> URL: https://issues.apache.org/jira/browse/YARN-8635
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-8635-001.patch
>
>
> {code}
> java.io.IOException: Application application_1533652359071_0001 
> initialization failed (exitCode=255) with output: main : command provided 0
> main : run as user is mapred
> main : requested yarn user is mapred
> Path 
> /opt/HA/OSBR310/nmlocal/usercache/mapred/appcache/application_1533652359071_0001
>  has permission 700 but needs permission 750.
> Did not create any app directories
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:411)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
> Caused by: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=255:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:180)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:402)
> ... 1 more
> Caused by: ExitCodeException exitCode=255:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:1009)
> at org.apache.hadoop.util.Shell.run(Shell.java:902)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
> ... 2 more
> 2018-08-08 17:43:26,918 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e04_1533652359071_0001_01_27 transitioned from 
> LOCALIZING to LOCALIZATION_FAILED
> 2018-08-08 17:43:26,916 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
> from container container_e04_1533652359071_0001_01_31 startLocalizer is : 
> 255
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=255:
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:180)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:402)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
> Caused by: ExitCodeException exitCode=255:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:1009)
> at org.apache.hadoop.util.Shell.run(Shell.java:902)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
> ... 2 more
> 2018-08-08 17:43:26,923 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed for containe
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-09-13 Thread Hu Ziqian (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613077#comment-16613077
 ] 

Hu Ziqian commented on YARN-8234:
-

[~rohithsharma], sorry for the late reply.

I added patch 004 with the following changes:
 # changed putEntity to use put instead of offer
 # changed the configs to use the prefix yarn.resourcemanager.system-metrics-publisher.*
 # changed the entityQueue size to batchSize and removed bufferSize
 # stop the superclass first in serviceStop()

Can you review it again?

Also, I found that Hadoop QA failed on YARN-8234-branch-2.8.3.004.patch; do I need 
to move it to branch-2.8.4?

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Critical
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, 
> YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch, 
> YARN-8234.003.patch, YARN-8234.004.patch
>
>
> When the system metrics publisher is enabled, RM pushes events to the timeline 
> server via its RESTful API. If the cluster load is heavy, many events are sent to 
> the timeline server and the timeline server's event handler thread gets locked. 
> YARN-7266 discusses the details of this problem. Because of the lock, the 
> timeline server can't receive events as fast as RM generates them, and lots of 
> timeline events stay in RM's memory. Eventually, those events consume all of 
> RM's memory and RM starts a full GC (which causes a JVM stop-the-world pause and 
> a timeout from RM to ZooKeeper) or even hits an OOM.
> The main problem here is that the timeline server can't receive events as fast 
> as RM generates them. Currently, the RM system metrics publisher puts only one 
> event in each request, and most of the time is spent handling the HTTP headers 
> and the network connection on the timeline side; only a small fraction is spent 
> dealing with the timeline event itself, which is the truly valuable part.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches via one request. With a 
> batch size of 1000, in our experiments the speed at which the timeline server 
> receives events improved by 100x. We have deployed this feature in our production 
> environment, which accepts 2 apps in one hour, and it works fine.
> We add the following configuration:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of events 
> the system metrics publisher sends in one request. The default value is 1000.
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in the system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch 
> publishing is enabled, we must avoid the publisher waiting for a batch to fill up 
> and holding events in the buffer for a long time, so we add another thread that 
> sends the events in the buffer periodically. This config sets the interval of that 
> periodic sending thread. The default value is 60s.
>  
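
A minimal, self-contained sketch of the batching design described in the issue above 
(an event buffer drained in batches of batch-size, plus a periodic flush every 
interval-seconds). The class name, the String stand-in for timeline events, and the 
sendBatch call are illustrative assumptions, not the actual YARN-8234 patch:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: buffers events, sends them in batches, and also flushes
// on a timer so events are never held in the buffer for too long.
public class BatchingPublisherSketch {
  private final LinkedBlockingQueue<String> buffer;   // stand-in for timeline events
  private final int batchSize;
  private final ScheduledExecutorService flusher =
      Executors.newSingleThreadScheduledExecutor();

  public BatchingPublisherSketch(int bufferSize, int batchSize, long intervalSeconds) {
    this.buffer = new LinkedBlockingQueue<>(bufferSize);
    this.batchSize = batchSize;
    // periodic flush, analogous to interval-seconds above
    flusher.scheduleAtFixedRate(this::flush, intervalSeconds, intervalSeconds,
        TimeUnit.SECONDS);
  }

  // Blocking put (rather than offer), in the spirit of the change listed above.
  public void publish(String event) throws InterruptedException {
    buffer.put(event);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  private synchronized void flush() {
    List<String> batch = new ArrayList<>(batchSize);
    buffer.drainTo(batch, batchSize);
    if (!batch.isEmpty()) {
      sendBatch(batch); // one request for the whole batch instead of one per event
    }
  }

  // Placeholder for a single REST call carrying many events.
  private void sendBatch(List<String> batch) {
    System.out.println("sending " + batch.size() + " events in one request");
  }

  public void stop() {
    flusher.shutdown();
    flush(); // drain whatever is left in the buffer
  }

  public static void main(String[] args) throws InterruptedException {
    BatchingPublisherSketch p = new BatchingPublisherSketch(10000, 1000, 60);
    for (int i = 0; i < 2500; i++) {
      p.publish("event-" + i);   // two full batches of 1000 get sent here
    }
    p.stop();                    // remaining 500 events flushed on stop
  }
}
{code}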



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8769) [Submarine] Allow user to specify customized quicklink(s) when submit Submarine job

2018-09-13 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613057#comment-16613057
 ] 

Sunil Govindan commented on YARN-8769:
--

Thanks [~leftnoteasy]. I'll help review both.

> [Submarine] Allow user to specify customized quicklink(s) when submit 
> Submarine job
> ---
>
> Key: YARN-8769
> URL: https://issues.apache.org/jira/browse/YARN-8769
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8769.001.patch
>
>
> This will be helpful when a user submits a job and some links need to be shown 
> on YARN UI2 (service page). For example, a user can specify a quick link to the 
> Zeppelin notebook UI when a Zeppelin notebook gets launched.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org