[jira] [Updated] (YARN-8508) On NodeManager container gets cleaned up before its pid file is created

2018-07-31 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8508:
-
Fix Version/s: (was: 3.1.2)
   3.1.1

> On NodeManager container gets cleaned up before its pid file is created
> ---
>
> Key: YARN-8508
> URL: https://issues.apache.org/jira/browse/YARN-8508
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8505.001.patch, YARN-8505.002.patch
>
>
> The GPU is not released even though the container using it is being killed
> {code}
> 2018-07-06 05:22:26,201 INFO  container.ContainerImpl 
> (ContainerImpl.java:handle(2093)) - Container 
> container_e20_1530854311763_0006_01_01 transitioned from RUNNING to 
> KILLING
> 2018-07-06 05:22:26,250 INFO  container.ContainerImpl 
> (ContainerImpl.java:handle(2093)) - Container 
> container_e20_1530854311763_0006_01_02 transitioned from RUNNING to 
> KILLING
> 2018-07-06 05:22:26,251 INFO  application.ApplicationImpl 
> (ApplicationImpl.java:handle(632)) - Application 
> application_1530854311763_0006 transitioned from RUNNING to 
> FINISHING_CONTAINERS_WAIT
> 2018-07-06 05:22:26,251 INFO  launcher.ContainerLaunch 
> (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container 
> container_e20_1530854311763_0006_01_02
> 2018-07-06 05:22:31,358 INFO  launcher.ContainerLaunch 
> (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for 
> container_e20_1530854311763_0006_01_02. Waited for 5000 ms.
> 2018-07-06 05:22:31,358 WARN  launcher.ContainerLaunch 
> (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid 
> file created container_e20_1530854311763_0006_01_02
> 2018-07-06 05:22:31,359 INFO  launcher.ContainerLaunch 
> (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, 
> but docker container request detected. Attempting to reap container 
> container_e20_1530854311763_0006_01_02
> 2018-07-06 05:22:31,494 INFO  nodemanager.LinuxContainerExecutor 
> (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : 
> /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/launch_container.sh
> 2018-07-06 05:22:31,500 INFO  nodemanager.LinuxContainerExecutor 
> (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : 
> /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/container_tokens
> 2018-07-06 05:22:31,510 INFO  container.ContainerImpl 
> (ContainerImpl.java:handle(2093)) - Container 
> container_e20_1530854311763_0006_01_01 transitioned from KILLING to 
> CONTAINER_CLEANEDUP_AFTER_KILL
> 2018-07-06 05:22:31,510 INFO  container.ContainerImpl 
> (ContainerImpl.java:handle(2093)) - Container 
> container_e20_1530854311763_0006_01_02 transitioned from KILLING to 
> CONTAINER_CLEANEDUP_AFTER_KILL
> 2018-07-06 05:22:31,512 INFO  container.ContainerImpl 
> (ContainerImpl.java:handle(2093)) - Container 
> container_e20_1530854311763_0006_01_01 transitioned from 
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2018-07-06 05:22:31,513 INFO  container.ContainerImpl 
> (ContainerImpl.java:handle(2093)) - Container 
> container_e20_1530854311763_0006_01_02 transitioned from 
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2018-07-06 05:22:38,955 INFO  container.ContainerImpl 
> (ContainerImpl.java:handle(2093)) - Container 
> container_e20_1530854311763_0007_01_02 transitioned from NEW to SCHEDULED
> {code}
> A new container requesting GPUs fails to launch
> {code}
> 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor 
> (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - 
> ResourceHandlerChain.preStart() failed!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  Failed to find enough GPUs, 
> requestor=container_e20_1530854311763_0007_01_02, #RequestedGPUs=2, 
> #availableGpus=1
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75)
>   
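
To make the race in the log above concrete: cleanup gives up waiting for the pid file after 5000 ms, so the container process is never signalled and its GPU assignment is never released, which is why the next GPU request fails. Below is a minimal, self-contained sketch of that kind of bounded wait; it is not the actual ContainerLaunch code, and the class name, method names, and path are hypothetical.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class PidFileWait {

  /**
   * Poll for a pid file until it appears and is non-empty, or the timeout
   * elapses. Returns the pid if found; empty otherwise. The empty case is the
   * race shown in the log above: cleanup proceeds without ever having a pid.
   */
  static Optional<String> waitForPidFile(Path pidFile, long timeoutMs, long pollMs)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (Files.exists(pidFile)) {
        String pid = new String(Files.readAllBytes(pidFile)).trim();
        if (!pid.isEmpty()) {
          return Optional.of(pid);
        }
      }
      Thread.sleep(pollMs);
    }
    return Optional.empty();
  }

  public static void main(String[] args) throws Exception {
    Path pidFile = Paths.get("/tmp/container_example.pid"); // hypothetical path
    Optional<String> pid = waitForPidFile(pidFile, 5000, 100);
    if (pid.isPresent()) {
      System.out.println("Would signal pid " + pid.get() + " before cleanup");
    } else {
      System.out.println("No pid file after 5000 ms; cleanup would proceed "
          + "without killing the process, so resources such as GPUs can leak");
    }
  }
}
{code}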

[jira] [Commented] (YARN-8508) On NodeManager container gets cleaned up before its pid file is created

2018-07-31 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564256#comment-16564256
 ] 

Wangda Tan commented on YARN-8508:
--

Committed to branch-3.1.1, thanks [~csingh]!

> On NodeManager container gets cleaned up before its pid file is created
> ---
>
> Key: YARN-8508
> URL: https://issues.apache.org/jira/browse/YARN-8508
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8505.001.patch, YARN-8505.002.patch

[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2018-07-31 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564254#comment-16564254
 ] 

Wangda Tan commented on YARN-8546:
--

Committed to branch-3.1.1, thanks [~Tao Yang]/[~cheersyang]

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster has 
> 70200 vcores and each task requests 100 vcores, so I expected a total of 702 
> containers to be allocated, but eventually there were only 701. The last 
> container could not be allocated because the queue's used resource was updated 
> to more than 100%.
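
A toy illustration of the accounting problem described above; this is not the CapacityScheduler code, just a sketch showing why releasing the same reserved container twice leaves the queue's used-resource counter permanently skewed, and how an idempotency guard avoids it.

{code:java}
import java.util.HashSet;
import java.util.Set;

/** Toy queue-usage tracker (not the real scheduler code). */
public class QueueUsage {
  private long usedVcores;
  private final Set<String> live = new HashSet<>();

  void allocate(String containerId, long vcores) {
    if (live.add(containerId)) {
      usedVcores += vcores;
    }
  }

  void release(String containerId, long vcores) {
    // Guarding on membership makes release idempotent. Without this check, a
    // second release of the same reserved container adjusts the counter twice,
    // and from then on the queue's reported usage no longer matches reality.
    if (live.remove(containerId)) {
      usedVcores -= vcores;
    }
  }

  public static void main(String[] args) {
    QueueUsage q = new QueueUsage();
    q.allocate("container_1", 100);
    q.release("container_1", 100);
    q.release("container_1", 100); // duplicate release is ignored
    System.out.println("used vcores = " + q.usedVcores); // 0, not -100
  }
}
{code}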






[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation

2018-07-31 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8301:
-
Fix Version/s: (was: 3.1.2)
   3.1.1

> Yarn Service Upgrade: Add documentation
> ---
>
> Key: YARN-8301
> URL: https://issues.apache.org/jira/browse/YARN-8301
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8301.001.patch, YARN-8301.002.patch, 
> YARN-8301.003.patch, YARN-8301.004.patch, YARN-8301.005.patch, 
> YARN-8301.006.patch, YARN-8301.007.patch
>
>
> Add documentation for yarn service upgrade.






[jira] [Commented] (YARN-8528) Final states in ContainerAllocation might be modified externally causing unexpected allocation results

2018-07-31 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564253#comment-16564253
 ] 

Wangda Tan commented on YARN-8528:
--

Committed to branch-3.1.1, thanks [~cheersyang]

> Final states in ContainerAllocation might be modified externally causing 
> unexpected allocation results
> --
>
> Key: YARN-8528
> URL: https://issues.apache.org/jira/browse/YARN-8528
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8528.001.patch
>
>
> ContainerAllocation.LOCALITY_SKIPPED is final static, and its .state should 
> always be AllocationState.LOCALITY_SKIPPED. However, this variable is public 
> and is accidentally changed to AllocationState.APP_SKIPPED in 
> RegularContainerAllocator under certain conditions. Once that happens, all 
> following LOCALITY_SKIPPED situations will be treated as APP_SKIPPED.
> Similar risks exist for 
> ContainerAllocation.PRIORITY_SKIPPED/APP_SKIPPED/QUEUE_SKIPPED. 
> ContainerAllocation.state should be private and should not be changed. If 
> changes are needed, a new ContainerAllocation should be created.
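
The pattern described above, a public static "constant" whose mutable field a caller can change, is easy to reproduce in isolation. The sketch below is not the real ContainerAllocation class, but it shows why mutating the shared instance corrupts every later use of it, and why making the state immutable (or creating a new result object instead) is the fix direction the description suggests.

{code:java}
/** Minimal illustration of a shared static result object with mutable state. */
public class AllocationResult {
  enum AllocationState { LOCALITY_SKIPPED, APP_SKIPPED, PRIORITY_SKIPPED, ALLOCATED }

  // The reference is final, but the object's internal state is still mutable
  // and reachable, so any caller can corrupt the shared "constant".
  static final AllocationResult LOCALITY_SKIPPED =
      new AllocationResult(AllocationState.LOCALITY_SKIPPED);

  AllocationState state; // making this private and final removes this class of bug

  AllocationResult(AllocationState state) { this.state = state; }

  public static void main(String[] args) {
    // A caller "adjusts" the shared constant instead of creating a new result...
    LOCALITY_SKIPPED.state = AllocationState.APP_SKIPPED;
    // ...and every later user of LOCALITY_SKIPPED now sees the wrong state.
    System.out.println(LOCALITY_SKIPPED.state); // APP_SKIPPED, not LOCALITY_SKIPPED
  }
}
{code}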






[jira] [Updated] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2018-07-31 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8546:
-
Fix Version/s: (was: 3.1.2)
   3.1.1

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster has 
> 70200 vcores and each task requests 100 vcores, so I expected a total of 702 
> containers to be allocated, but eventually there were only 701. The last 
> container could not be allocated because the queue's used resource was updated 
> to more than 100%.






[jira] [Updated] (YARN-8528) Final states in ContainerAllocation might be modified externally causing unexpected allocation results

2018-07-31 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8528:
-
Fix Version/s: (was: 3.1.2)
   3.1.1

> Final states in ContainerAllocation might be modified externally causing 
> unexpected allocation results
> --
>
> Key: YARN-8528
> URL: https://issues.apache.org/jira/browse/YARN-8528
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8528.001.patch
>
>
> ContainerAllocation.LOCALITY_SKIPPED is final static, and its .state should 
> always be AllocationState.LOCALITY_SKIPPED. However, this variable is public 
> and is accidentally changed to AllocationState.APP_SKIPPED in 
> RegularContainerAllocator under certain conditions. Once that happens, all 
> following LOCALITY_SKIPPED situations will be treated as APP_SKIPPED.
> Similar risks exist for 
> ContainerAllocation.PRIORITY_SKIPPED/APP_SKIPPED/QUEUE_SKIPPED. 
> ContainerAllocation.state should be private and should not be changed. If 
> changes are needed, a new ContainerAllocation should be created.






[jira] [Commented] (YARN-8559) Expose mutable-conf scheduler's configuration in RM /scheduler-conf endpoint

2018-07-31 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564213#comment-16564213
 ] 

Wangda Tan commented on YARN-8559:
--

Thanks [~cheersyang], 

Some suggestions:

1) Instead of creating a new DAO, can we reuse existing logic similar to 
org.apache.hadoop.conf.ConfServlet?

2) Also, it might be better to restrict access to this REST endpoint to admins 
only, since it includes some sensitive information such as ACLs. For scheduler 
information, users can access /scheduler.

> Expose mutable-conf scheduler's configuration in RM /scheduler-conf endpoint
> 
>
> Key: YARN-8559
> URL: https://issues.apache.org/jira/browse/YARN-8559
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Anna Savarin
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8559.001.patch, YARN-8559.002.patch, 
> YARN-8559.003.patch
>
>
> All Hadoop services provide a set of common endpoints (/stacks, /logLevel, 
> /metrics, /jmx, /conf).  In the case of the Resource Manager, part of the 
> configuration comes from the scheduler being used.  Currently, these 
> configuration key/values are not exposed through the /conf endpoint, thereby 
> revealing an incomplete configuration picture. 
> Make an improvement and expose the scheduling configuration info through the 
> RM's /conf endpoint.
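
As a rough illustration of suggestion 2) in the comment above, a hypothetical JAX-RS resource that restricts the endpoint to admins could look like the following. SchedulerConfResource, isAdmin(), and renderConfigurationAsXml() are placeholders, not the actual RM web services code.

{code:java}
import javax.servlet.http.HttpServletRequest;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/scheduler-conf")
public class SchedulerConfResource {

  @GET
  @Produces(MediaType.APPLICATION_XML)
  public Response getSchedulerConf(@Context HttpServletRequest req) {
    if (!isAdmin(req.getRemoteUser())) {
      // Sensitive entries such as ACLs are only returned to admins.
      return Response.status(Response.Status.FORBIDDEN)
          .entity("Only admins may read the scheduler configuration").build();
    }
    // Placeholder for reusing ConfServlet-style rendering of the mutable
    // scheduler configuration.
    return Response.ok(renderConfigurationAsXml()).build();
  }

  private boolean isAdmin(String user) {
    return "admin".equals(user); // placeholder for the RM's real ACL check
  }

  private String renderConfigurationAsXml() {
    return "<configuration/>";   // placeholder payload
  }
}
{code}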






[jira] [Commented] (YARN-8606) Opportunistic scheduling doesn't work after failover

2018-07-31 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564089#comment-16564089
 ] 

Wangda Tan commented on YARN-8606:
--

[~bibinchundatt], is this a recent regression? If so, which JIRA broke the use 
case?

> Opportunistic scheduling doesn't work after failover
> ---
>
> Key: YARN-8606
> URL: https://issues.apache.org/jira/browse/YARN-8606
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8606.001.patch
>
>
> The EventDispatcher for opportunistic scheduling is added to the RM composite 
> service and not the RMActiveService composite service, causing the dispatcher 
> to be started only once, at RM start, rather than on each transition to active.
> Issue credits: Rakesh
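
A toy model (not the ResourceManager code) of the distinction the report draws: a dispatcher registered with the always-on RM services is started once per process, while one registered with the active services would be restarted on every transition to active.

{code:java}
import java.util.ArrayList;
import java.util.List;

public class FailoverModel {
  interface Service { void start(); void stop(); }

  static class Dispatcher implements Service {
    private final String name;
    Dispatcher(String name) { this.name = name; }
    public void start() { System.out.println(name + " started"); }
    public void stop()  { System.out.println(name + " stopped"); }
  }

  // Rough stand-ins for the two composite services mentioned above.
  static final List<Service> alwaysOnServices = new ArrayList<>();
  static final List<Service> activeServices = new ArrayList<>();

  static void processStart()        { alwaysOnServices.forEach(Service::start); }
  static void transitionToActive()  { activeServices.forEach(Service::start); }
  static void transitionToStandby() { activeServices.forEach(Service::stop); }

  public static void main(String[] args) {
    // Registered here, the dispatcher starts exactly once, at process start...
    alwaysOnServices.add(new Dispatcher("dispatcher (always-on services)"));
    // ...registered here, it would be restarted on each transition to active.
    activeServices.add(new Dispatcher("dispatcher (active services)"));

    processStart();
    transitionToActive();
    transitionToStandby(); // failover away
    transitionToActive();  // become active again: only the second copy restarts
  }
}
{code}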






[jira] [Commented] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied

2018-07-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562483#comment-16562483
 ] 

Wangda Tan commented on YARN-8509:
--

[~eepayne], if you have some bandwidth, could you help check this patch?

> Fix UserLimit calculation for preemption to balance scenario after queue 
> satisfied  
> 
>
> Key: YARN-8509
> URL: https://issues.apache.org/jira/browse/YARN-8509
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Attachments: YARN-8509.001.patch, YARN-8509.002.patch
>
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate the 
> total pending resource based on the user-limit percent and user-limit factor, 
> which caps the pending resource for each user to the minimum of the user-limit 
> pending and the actual pending. This prevents the queue from taking more 
> pending resource to achieve queue balance after all queues are satisfied with 
> their ideal allocation.
>
> We need to change the logic to let the queue's pending resource go beyond the 
> user limit.
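
To see what the capping does to the queue-level number, here is a toy calculation with hypothetical per-user pending values; it is not the LeafQueue code, just a contrast between the capped total described above and an uncapped one.

{code:java}
import java.util.Map;

public class PendingWithUserLimit {

  // Current behaviour described above: each user's pending is capped at the
  // user limit, so queue-level pending never reflects demand beyond it.
  static long totalPendingCapped(Map<String, Long> pendingByUser, long userLimit) {
    return pendingByUser.values().stream()
        .mapToLong(p -> Math.min(p, userLimit)).sum();
  }

  // Proposed direction: let queue-level pending go beyond the user limit so
  // preemption-to-balance can still see the remaining demand.
  static long totalPendingUncapped(Map<String, Long> pendingByUser) {
    return pendingByUser.values().stream().mapToLong(Long::longValue).sum();
  }

  public static void main(String[] args) {
    Map<String, Long> pending = Map.of("alice", 300L, "bob", 50L); // hypothetical
    long userLimit = 100L;
    System.out.println("capped   = " + totalPendingCapped(pending, userLimit)); // 150
    System.out.println("uncapped = " + totalPendingUncapped(pending));          // 350
  }
}
{code}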






[jira] [Updated] (YARN-8603) [UI2] Latest run application should be listed first in the RM UI

2018-07-30 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8603:
-
Reporter: Sumana Sathish  (was: Akhil PB)

> [UI2] Latest run application should be listed first in the RM UI
> 
>
> Key: YARN-8603
> URL: https://issues.apache.org/jira/browse/YARN-8603
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Sumana Sathish
>Assignee: Akhil PB
>Priority: Major
> Attachments: YARN-8603.001.patch
>
>







[jira] [Commented] (YARN-8591) [ATSv2] NPE while checking for entity acl in non-secure cluster

2018-07-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562368#comment-16562368
 ] 

Wangda Tan commented on YARN-8591:
--

Updated the fix version to 3.1.2 given this doesn't exist in branch-3.1.1.

> [ATSv2] NPE while checking for entity acl in non-secure cluster
> ---
>
> Key: YARN-8591
> URL: https://issues.apache.org/jira/browse/YARN-8591
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelinereader, timelineserver
>Reporter: Akhil PB
>Assignee: Rohith Sharma K S
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8591.01.patch
>
>
> {code:java}
> GET 
> http://ctr-e138-1518143905142-417433-01-04.hwx.site:8198/ws/v2/timeline/apps/application_1532578985272_0002/entities/YARN_CONTAINER?fields=ALL&_=1532670071899{code}
> {code:java}
> 2018-07-27 05:32:03,468 WARN  webapp.GenericExceptionHandler 
> (GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR
> javax.ws.rs.WebApplicationException: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.timelineservice.reader.TimelineReaderWebServices.handleException(TimelineReaderWebServices.java:196)
> at 
> org.apache.hadoop.yarn.server.timelineservice.reader.TimelineReaderWebServices.getEntities(TimelineReaderWebServices.java:624)
> at 
> org.apache.hadoop.yarn.server.timelineservice.reader.TimelineReaderWebServices.getEntities(TimelineReaderWebServices.java:474)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
> at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
> at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
> at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
> at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
> at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
> at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
> at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
> at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
> at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
> at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at 
> org.apache.hadoop.yarn.server.timelineservice.reader.security.TimelineReaderWhitelistAuthorizationFilter.doFilter(TimelineReaderWhitelistAuthorizationFilter.java:85)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.security.http.CrossOriginFilter.doFilter(CrossOriginFilter.java:98)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>  

[jira] [Updated] (YARN-8508) On NodeManager container gets cleaned up before its pid file is created

2018-07-30 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8508:
-
Priority: Critical  (was: Major)

> On NodeManager container gets cleaned up before its pid file is created
> ---
>
> Key: YARN-8508
> URL: https://issues.apache.org/jira/browse/YARN-8508
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8505.001.patch, YARN-8505.002.patch

[jira] [Commented] (YARN-8508) On NodeManager container gets cleaned up before its pid file is created

2018-07-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562364#comment-16562364
 ] 

Wangda Tan commented on YARN-8508:
--

I think it is important to get this backported to branch-3.1.1. I'm going to do 
this in a couple of hours; please let me know if you think differently.

cc: [~csingh], [~eyang]

> On NodeManager container gets cleaned up before its pid file is created
> ---
>
> Key: YARN-8508
> URL: https://issues.apache.org/jira/browse/YARN-8508
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8505.001.patch, YARN-8505.002.patch

[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed

2018-07-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562358#comment-16562358
 ] 

Wangda Tan commented on YARN-8545:
--

I think it is important to get this backported to branch-3.1.1. I'm going to do 
this in a couple of hours; please let me know if you think differently.

cc: [~csingh], [~eyang]

> YARN native service should return container if launch failed
> 
>
> Key: YARN-8545
> URL: https://issues.apache.org/jira/browse/YARN-8545
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8545.001.patch
>
>
> In some cases, a container launch may fail but the container will not be 
> properly returned to the RM.
> This can happen when the AM tries to prepare the container launch context but 
> fails without sending it to the NM (once the container launch context is sent 
> to the NM, the NM will report the failed container to the RM).
> Exception like: 
> {code:java}
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
>   at 
> org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
>   at 
> org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
>   at 
> org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
>   at 
> org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745){code}
> Even after preparation of the container launch context fails, the AM keeps 
> trying to monitor the container's readiness:
> {code:java}
> 2018-07-17 18:42:57,518 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
> Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 
> 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP 
> presence", exception="java.io.IOException: primary-worker-0: IP is not 
> available yet"
> ...{code}
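
A sketch of the requested behaviour, assuming an AM that uses AMRMClientAsync. The Launcher interface and the method names are placeholders rather than the actual yarn-native-services code; the point is only that a container whose launch context was never sent to the NM has to be released back to the RM explicitly.

{code:java}
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

public class LaunchOrRelease {

  /** Placeholder for whatever the AM does to build and submit the launch context. */
  interface Launcher {
    void buildLaunchContextAndStart(Container container) throws IOException;
  }

  static void launchContainer(AMRMClientAsync<?> amRmClient,
                              Launcher launcher,
                              Container container) {
    try {
      launcher.buildLaunchContextAndStart(container);
    } catch (IOException e) {
      // Nothing reached the NM, so the NM will never report this container as
      // failed; give it back to the RM so the allocation is not leaked, and
      // stop running readiness probes against a container that never started.
      amRmClient.releaseAssignedContainer(container.getId());
    }
  }
}
{code}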






[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling

2018-07-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562357#comment-16562357
 ] 

Wangda Tan commented on YARN-8546:
--

I think it is important to get this backported to branch-3.1.1. I'm going to do 
this in a couple of hours; please let me know if you think differently.

cc: [~cheersyang], [~Tao Yang]

> Resource leak caused by a reserved container being released more than once 
> under async scheduling
> -
>
> Key: YARN-8546
> URL: https://issues.apache.org/jira/browse/YARN-8546
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Major
>  Labels: global-scheduling
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8546.001.patch
>
>
> I was able to reproduce this issue by starting a job that keeps requesting 
> containers until it uses up the cluster's available resources. My cluster has 
> 70200 vcores and each task requests 100 vcores, so I expected a total of 702 
> containers to be allocated, but eventually there were only 701. The last 
> container could not be allocated because the queue's used resource was updated 
> to more than 100%.






[jira] [Commented] (YARN-8301) Yarn Service Upgrade: Add documentation

2018-07-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562354#comment-16562354
 ] 

Wangda Tan commented on YARN-8301:
--

Updated the fix version to 3.1.2 given this doesn't exist in branch-3.1.1.
But I think it is important to get it backported to branch-3.1.1. I'm going to 
do this in a couple of hours; please let me know if you think differently.

cc: [~csingh], [~eyang]

> Yarn Service Upgrade: Add documentation
> ---
>
> Key: YARN-8301
> URL: https://issues.apache.org/jira/browse/YARN-8301
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8301.001.patch, YARN-8301.002.patch, 
> YARN-8301.003.patch, YARN-8301.004.patch, YARN-8301.005.patch, 
> YARN-8301.006.patch, YARN-8301.007.patch
>
>
> Add documentation for yarn service upgrade.






[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation

2018-07-30 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8301:
-
Fix Version/s: (was: 3.1.1)
   3.1.2

> Yarn Service Upgrade: Add documentation
> ---
>
> Key: YARN-8301
> URL: https://issues.apache.org/jira/browse/YARN-8301
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8301.001.patch, YARN-8301.002.patch, 
> YARN-8301.003.patch, YARN-8301.004.patch, YARN-8301.005.patch, 
> YARN-8301.006.patch, YARN-8301.007.patch
>
>
> Add documentation for yarn service upgrade.






[jira] [Commented] (YARN-8528) Final states in ContainerAllocation might be modified externally causing unexpected allocation results

2018-07-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562350#comment-16562350
 ] 

Wangda Tan commented on YARN-8528:
--

Updated the fix version to 3.1.2 given this doesn't exist in branch-3.1.1.
But I think it is important to get it backported to branch-3.1.1. I'm going to 
do this in a couple of hours; please let me know if you think differently.

cc: [~cheersyang]

> Final states in ContainerAllocation might be modified externally causing 
> unexpected allocation results
> --
>
> Key: YARN-8528
> URL: https://issues.apache.org/jira/browse/YARN-8528
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8528.001.patch
>
>
> ContainerAllocation.LOCALITY_SKIPPED is final static, and its .state should 
> always be AllocationState.LOCALITY_SKIPPED. However, this variable is public 
> and is accidentally changed to AllocationState.APP_SKIPPED in 
> RegularContainerAllocator under certain conditions. Once that happens, all 
> following LOCALITY_SKIPPED situations will be treated as APP_SKIPPED.
> Similar risks exist for 
> ContainerAllocation.PRIORITY_SKIPPED/APP_SKIPPED/QUEUE_SKIPPED. 
> ContainerAllocation.state should be private and should not be changed. If 
> changes are needed, a new ContainerAllocation should be created.






[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation

2018-07-30 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8301:
-
Priority: Critical  (was: Major)

> Yarn Service Upgrade: Add documentation
> ---
>
> Key: YARN-8301
> URL: https://issues.apache.org/jira/browse/YARN-8301
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8301.001.patch, YARN-8301.002.patch, 
> YARN-8301.003.patch, YARN-8301.004.patch, YARN-8301.005.patch, 
> YARN-8301.006.patch, YARN-8301.007.patch
>
>
> Add documentation for yarn service upgrade.






[jira] [Updated] (YARN-8528) Final states in ContainerAllocation might be modified externally causing unexpected allocation results

2018-07-30 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8528:
-
Fix Version/s: (was: 3.1.1)
   3.1.2

> Final states in ContainerAllocation might be modified externally causing 
> unexpected allocation results
> --
>
> Key: YARN-8528
> URL: https://issues.apache.org/jira/browse/YARN-8528
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8528.001.patch
>
>
> ContainerAllocation.LOCALITY_SKIPPED is final static, and its .state should 
> always be AllocationState.LOCALITY_SKIPPED. However, this variable is public 
> and is accidentally changed to AllocationState.APP_SKIPPED in 
> RegularContainerAllocator under certain conditions. Once that happens, all 
> following LOCALITY_SKIPPED situations will be treated as APP_SKIPPED.
> Similar risks exist for 
> ContainerAllocation.PRIORITY_SKIPPED/APP_SKIPPED/QUEUE_SKIPPED. 
> ContainerAllocation.state should be private and should not be changed. If 
> changes are needed, a new ContainerAllocation should be created.






[jira] [Updated] (YARN-8508) On NodeManager container gets cleaned up before its pid file is created

2018-07-30 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8508:
-
Summary: On NodeManager container gets cleaned up before its pid file is 
created  (was: GPU  does not get released even though the container is killed)

> On NodeManager container gets cleaned up before its pid file is created
> ---
>
> Key: YARN-8508
> URL: https://issues.apache.org/jira/browse/YARN-8508
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8505.001.patch, YARN-8505.002.patch

[jira] [Updated] (YARN-8330) Avoid publishing reserved container to ATS from RM

2018-07-30 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8330:
-
Summary: Avoid publishing reserved container to ATS from RM  (was: Avoid 
publishing reserved container to ATS)

> Avoid publishing reserved container to ATS from RM
> --
>
> Key: YARN-8330
> URL: https://issues.apache.org/jira/browse/YARN-8330
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Yesha Vora
>Assignee: Suma Shivaprasad
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8330.1.patch, YARN-8330.2.patch, YARN-8330.3.patch, 
> YARN-8330.4.patch
>
>
> Steps:
> launch Hbase tarball app
> list containers for hbase tarball app
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list 
> appattempt_1525463491331_0006_01
> WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of 
> YARN_LOG_DIR.
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History 
> server at xxx/xxx:10200
> 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> Total number of containers :5
> Container-IdStart Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018  
>  N/A RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa
> 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03
> Fri May 04 22:34:26 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01
> Fri May 04 22:34:15 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05
> Fri May 04 22:34:56 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa
> 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04
> Fri May 04 22:34:56 + 2018   N/A
> nullxxx:25454  http://xxx:8042
> http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code}
> Total expected containers = 4 (3 component containers + 1 AM). Instead, the RM 
> lists 5 containers.
> container_e06_1525463491331_0006_01_04 is in a null state.
> The YARN service used containers 02, 03, and 05 for the components. There are 
> no logs in the NM or AM related to container 04. Only one line is printed in 
> the RM log:
> {code}
> 2018-05-04 22:34:56,618 INFO  rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(489)) - 
> container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to 
> RESERVED{code}
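
A toy filter (not the actual RM timeline publisher) showing the idea behind the updated summary: skip containers that are only RESERVED when publishing, so listings built from ATS data do not show a phantom container in a null state like the one above.

{code:java}
import java.util.List;

public class ReservedContainerFilter {
  enum RMContainerState { NEW, RESERVED, ALLOCATED, ACQUIRED, RUNNING, COMPLETED }

  static class RMContainerInfo {
    final String id;
    final RMContainerState state;
    RMContainerInfo(String id, RMContainerState state) {
      this.id = id;
      this.state = state;
    }
  }

  static void publishToTimeline(List<RMContainerInfo> containers) {
    for (RMContainerInfo c : containers) {
      if (c.state == RMContainerState.RESERVED) {
        continue; // reservation is RM-internal bookkeeping, not a real container
      }
      System.out.println("publishing " + c.id + " in state " + c.state);
    }
  }

  public static void main(String[] args) {
    publishToTimeline(List.of(
        new RMContainerInfo("container_..._01", RMContainerState.RUNNING),
        new RMContainerInfo("container_..._04", RMContainerState.RESERVED)));
  }
}
{code}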






[jira] [Updated] (YARN-8330) Avoid publishing reserved container to ATS

2018-07-30 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8330:
-
Summary: Avoid publishing reserved container to ATS  (was: An extra 
container got launched by RM for yarn-service)

> Avoid publishing reserved container to ATS
> --
>
> Key: YARN-8330
> URL: https://issues.apache.org/jira/browse/YARN-8330
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Yesha Vora
>Assignee: Suma Shivaprasad
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8330.1.patch, YARN-8330.2.patch, YARN-8330.3.patch, 
> YARN-8330.4.patch
>
>
> Steps:
> launch Hbase tarball app
> list containers for hbase tarball app
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list 
> appattempt_1525463491331_0006_01
> WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of 
> YARN_LOG_DIR.
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History 
> server at xxx/xxx:10200
> 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> Total number of containers :5
> Container-IdStart Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018  
>  N/A RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa
> 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03
> Fri May 04 22:34:26 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01
> Fri May 04 22:34:15 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05
> Fri May 04 22:34:56 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa
> 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04
> Fri May 04 22:34:56 + 2018   N/A
> nullxxx:25454  http://xxx:8042
> http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code}
> Total expected containers = 4 ( 3 components container + 1 am). Instead, RM 
> is listing 5 containers. 
> container_e06_1525463491331_0006_01_04 is in null state.
> Yarn service utilized container 02, 03, 05 for component. There is no log 
> available in NM & AM related to container 04. Only one line in RM log is 
> printed
> {code}
> 2018-05-04 22:34:56,618 INFO  rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(489)) - 
> container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to 
> RESERVED{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562288#comment-16562288
 ] 

Wangda Tan commented on YARN-8418:
--

Thanks [~bibinchundatt], +1 to the latest patch. It would be ideal if you could 
share some test results for the ver.9 patch.

[~rohithsharma], any comments on the ver.9 patch?

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, 
> YARN-8418.006.patch, YARN-8418.007.patch, YARN-8418.008.patch, 
> YARN-8418.009.patch
>
>
> If log aggregation fails init createApp directory container logs could get 
> leaked in NM directory
> For log running application restart of NM after token renewal this case is 
> possible/  Application submission with invalid delegation token



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562026#comment-16562026
 ] 

Wangda Tan commented on YARN-8418:
--

[~rohithsharma], regarding the YARN-4984 revert: initially I had doubts as 
well, but now I'm convinced. See my comment at: 
https://issues.apache.org/jira/browse/YARN-8418?focusedCommentId=16551034&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16551034

Regarding the delegation token change, let's wait for [~bibinchundatt]'s response. 

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, 
> YARN-8418.006.patch, YARN-8418.007.patch, YARN-8418.008.patch
>
>
> If log aggregation fails init createApp directory container logs could get 
> leaked in NM directory
> For log running application restart of NM after token renewal this case is 
> possible/  Application submission with invalid delegation token



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-29 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561188#comment-16561188
 ] 

Wangda Tan commented on YARN-8418:
--

Thanks [~bibinchundatt] for updating the patch and [~suma.shivaprasad] for code 
review.

One minor comment:

In ContainerManagerImpl#handleCredentialUpdate, instead of calling the following 
logic: 
{code}
Set<ApplicationId> invalidTokenApps = logHandler.getInvalidTokenApps();
for (ApplicationId app : invalidTokenApps) {
  if (context.getSystemCredentialsForApps().get(app) != null) {
dispatcher.getEventHandler()
.handle(new LogHandlerTokenUpdatedEvent(app));
  }
}
{code} 

Would it be better to just send a LogHandlerTokenUpdatedEvent to the log handler 
(with no need to specify an app id), and then have the LogHandler loop over the 
invalidTokenApps and update the tokens? (A sketch of this shape is included below.)

Benefits of doing this: 
- ContainerManagerImpl doesn't have to know about invalid apps, which gives better 
encapsulation.
- It avoids the race condition where an app gets added to invalidTokenApps while 
handleCredentialUpdate is looping over the apps.
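
To make the suggested shape concrete, here is a minimal, self-contained sketch; the 
class and method names are illustrative stand-ins, not the actual YARN classes or 
anything from the patch:

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the suggested flow: the container manager fires a single
// "token updated" notification, and the log handler owns the invalid-token
// app set and decides which apps to retry. All names are illustrative.
public class TokenUpdateSketch {

  static class LogHandler {
    private final Set<String> invalidTokenApps = ConcurrentHashMap.newKeySet();

    void markInvalidToken(String appId) {
      invalidTokenApps.add(appId);
    }

    // Single entry point, invoked when a LogHandlerTokenUpdatedEvent arrives.
    void onTokenUpdated() {
      for (String appId : invalidTokenApps) {
        // The real code would check the refreshed credentials here and
        // re-initialize app log aggregation for appId.
        System.out.println("retrying log aggregation init for " + appId);
      }
      invalidTokenApps.clear();
    }
  }

  public static void main(String[] args) {
    LogHandler handler = new LogHandler();
    handler.markInvalidToken("application_1");
    // ContainerManagerImpl#handleCredentialUpdate would reduce to the
    // equivalent of this single call, with no per-app loop of its own.
    handler.onTokenUpdated();
  }
}
{code}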


> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, 
> YARN-8418.006.patch, YARN-8418.007.patch
>
>
> If log aggregation fails init createApp directory container logs could get 
> leaked in NM directory
> For log running application restart of NM after token renewal this case is 
> possible/  Application submission with invalid delegation token



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8563) [Submarine] Support users to specify Python/TF package/version/dependencies for training job.

2018-07-21 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8563:
-
Summary: [Submarine] Support users to specify Python/TF 
package/version/dependencies for training job.  (was: Support users to specify 
Python/TF package/version/dependencies for training job.)

> [Submarine] Support users to specify Python/TF package/version/dependencies 
> for training job.
> -
>
> Key: YARN-8563
> URL: https://issues.apache.org/jira/browse/YARN-8563
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Priority: Major
>
> YARN-8561 assumes all Python / Tensorflow dependencies will be packed to 
> docker image. In practice, user doesn't want to build docker image. Instead, 
> user can provide python package / dependencies (like .whl), Python and TF 
> version. And Submarine can localize specified dependencies to prebuilt base 
> Docker images.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8561) [Submarine] Add initial implementation: training job submission and job history retrieve.

2018-07-21 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8561:
-
Summary: [Submarine] Add initial implementation: training job submission 
and job history retrieve.  (was: Add submarine initial implementation: training 
job submission and job history retrieve.)

> [Submarine] Add initial implementation: training job submission and job 
> history retrieve.
> -
>
> Key: YARN-8561
> URL: https://issues.apache.org/jira/browse/YARN-8561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8561.001.patch
>
>
> Added following parts:
> 1) New subcomponent of YARN, under applications/ project. 
> 2) Tensorflow training job submission, including training (single node and 
> distributed). 
> - Supported Docker container. 
> - Support GPU isolation. 
> - Support YARN registry DNS.
> 3) Retrieve job history.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8563) Support users to specify Python/TF package/version/dependencies for training job.

2018-07-21 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8563:


 Summary: Support users to specify Python/TF 
package/version/dependencies for training job.
 Key: YARN-8563
 URL: https://issues.apache.org/jira/browse/YARN-8563
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan


YARN-8561 assumes all Python / Tensorflow dependencies will be packed into the 
Docker image. In practice, users don't always want to build a Docker image. Instead, 
a user can provide Python packages / dependencies (like .whl files) plus the Python 
and TF versions, and Submarine can localize the specified dependencies onto prebuilt 
base Docker images.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8558) NM recovery level db not cleaned up properly on container finish

2018-07-21 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551755#comment-16551755
 ] 

Wangda Tan commented on YARN-8558:
--

[~bibinchundatt], I think we should have a follow-up Jira to make sure all 
container-related keys can be grouped, so we don't need to worry about manually 
adding keys to delete in the future (a toy illustration of the grouping idea 
follows below).

Overall the patch looks good.

There are some other CONTAINER_-related fields that are not included in your patch, 
like CONTAINER_TOKENS_KEY_PREFIX. Could you double-check whether they're required or 
not?

cc: [~sunil.gov...@gmail.com]
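
As a rough illustration of the grouping idea above, here is a toy model using a 
plain TreeMap rather than the actual NMLeveldbStateStoreService API; the key names 
are made up:

{code:java}
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of keeping all container-related state under one key
// prefix, so cleanup is a single prefix scan instead of a hand-maintained
// list of per-field deletes. Key names below are made up.
public class ContainerKeyPrefixSketch {
  public static void main(String[] args) {
    TreeMap<String, String> store = new TreeMap<>();
    String prefix = "ContainerManager/containers/container_01/";
    store.put(prefix + "request", "...");
    store.put(prefix + "diagnostics", "...");
    store.put(prefix + "tokens", "...");
    store.put("ContainerManager/containers/container_02/request", "...");

    // Remove every key that starts with the finished container's prefix.
    Map<String, String> finished =
        store.subMap(prefix, prefix + Character.MAX_VALUE);
    finished.clear();

    System.out.println(store.keySet()); // only container_02's keys remain
  }
}
{code}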

> NM recovery level db not cleaned up properly on container finish
> 
>
> Key: YARN-8558
> URL: https://issues.apache.org/jira/browse/YARN-8558
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8558.001.patch
>
>
> {code}
> 2018-07-20 16:49:23,117 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Application application_1531994217928_0054 transitioned from NEW to INITING
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_18 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_19 with incomplete 
> records
> 2018-07-20 16:49:23,204 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_20 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_21 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_22 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_23 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_24 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_25 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_38 with incomplete 
> records
> 2018-07-20 16:49:23,205 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_39 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_41 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_44 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_46 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_49 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_52 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_54 with incomplete 
> records
> 2018-07-20 16:49:23,206 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_73 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService:
>  Remove container container_1531994217928_0001_01_74 with incomplete 
> records
> 2018-07-20 16:49:23,207 WARN 
> org.apache

[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-07-20 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8135:
-
Attachment: (was: YARN-8135.poc.001.patch)

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> Description:
> *Goals:*
>  - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support run distributed Tensorflow jobs with simple configs.
>  - Support run user-specified Docker images.
>  - Support specify GPU and other resources.
>  - Support launch tensorboard if user specified.
>  - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle can let human to explore deep 
> places. B-)
> h3. {color:#FF}Please refer to on-going design doc, and add your 
> thoughts: 
> {color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-07-20 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551490#comment-16551490
 ] 

Wangda Tan commented on YARN-8135:
--

Discussed with many folks; thanks for the input from: 

 [~sunilg], [~jhung], [~oliverhuh...@gmail.com], [~erwaman], [~yanboliang], 
[~zhz], [~vinodkv], Xun Liu, [~shaneku...@gmail.com] and many others.

I just put the initial patch on YARN-8561 to get early feedback. I tested the 
patch on a 3.1.0 cluster and it runs fine. Please let me know your thoughts.

I'm going on vacation next week, so please expect some delay in my 
responses.

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8135.poc.001.patch
>
>
> Description:
> *Goals:*
>  - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support run distributed Tensorflow jobs with simple configs.
>  - Support run user-specified Docker images.
>  - Support specify GPU and other resources.
>  - Support launch tensorboard if user specified.
>  - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle can let human to explore deep 
> places. B-)
> h3. {color:#FF}Please refer to on-going design doc, and add your 
> thoughts: 
> {color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8561) Add submarine initial implementation: training job submission and job history retrieve.

2018-07-20 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551487#comment-16551487
 ] 

Wangda Tan commented on YARN-8561:
--

Attached the initial patch to get some early feedback. Please refer to 
{{QuickStart.md}} for overall usage, and {{DeveloperGuide.md}} for basic 
development-related information.

> Add submarine initial implementation: training job submission and job history 
> retrieve.
> ---
>
> Key: YARN-8561
> URL: https://issues.apache.org/jira/browse/YARN-8561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8561.001.patch
>
>
> Added following parts:
> 1) New subcomponent of YARN, under applications/ project. 
> 2) Tensorflow training job submission, including training (single node and 
> distributed). 
> - Supported Docker container. 
> - Support GPU isolation. 
> - Support YARN registry DNS.
> 3) Retrieve job history.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8561) Add submarine initial implementation: training job submission and job history retrieve.

2018-07-20 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8561:
-
Attachment: YARN-8561.001.patch

> Add submarine initial implementation: training job submission and job history 
> retrieve.
> ---
>
> Key: YARN-8561
> URL: https://issues.apache.org/jira/browse/YARN-8561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8561.001.patch
>
>
> Added following parts:
> 1) New subcomponent of YARN, under applications/ project. 
> 2) Tensorflow training job submission, including training (single node and 
> distributed). 
> - Supported Docker container. 
> - Support GPU isolation. 
> - Support YARN registry DNS.
> 3) Retrieve job history.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8561) Add submarine initial implementation: training job submission and job history retrieve.

2018-07-20 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8561:


 Summary: Add submarine initial implementation: training job 
submission and job history retrieve.
 Key: YARN-8561
 URL: https://issues.apache.org/jira/browse/YARN-8561
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan


Added the following parts:
1) A new subcomponent of YARN, under the applications/ project.

2) Tensorflow training job submission, including single-node and distributed 
training.
- Supports Docker containers.
- Supports GPU isolation.
- Supports YARN registry DNS.

3) Job history retrieval.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8360) Yarn service conflict between restart policy and NM configuration

2018-07-20 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8360:
-
Target Version/s: 3.2.0, 3.1.2
Priority: Critical  (was: Major)

> Yarn service conflict between restart policy and NM configuration 
> --
>
> Key: YARN-8360
> URL: https://issues.apache.org/jira/browse/YARN-8360
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8360.1.patch
>
>
> For the below spec, the service will not stop even after container failures 
> because of the NM auto retry properties :
>  * "yarn.service.container-failure.retry.max": 1,
>  * "yarn.service.container-failure.validity-interval-ms": 5000
>  The NM will continue auto-restarting containers.
>  {{fail_after 20}} fails after 20 seconds. Since the validity failure 
> interval is 5 seconds, NM will auto restart the container.
> {code:java}
> {
>   "name": "fail-demo2",
>   "version": "1.0.0",
>   "components" :
>   [
> {
>   "name": "comp1",
>   "number_of_containers": 1,
>   "launch_command": "fail_after 20",
>   "restart_policy": "NEVER",
>   "resource": {
> "cpus": 1,
> "memory": "256"
>   },
>   "configuration": {
> "properties": {
>   "yarn.service.container-failure.retry.max": 1,
>   "yarn.service.container-failure.validity-interval-ms": 5000
> }
>   }
> }
>   ]
> }
> {code}
> If {{restart_policy}} is NEVER, then the service should stop after the 
> container fails.
> Since we have introduced, the service level Restart Policies, I think we 
> should make the NM auto retry configurations part of the {{RetryPolicy}} and 
> get rid of all {{yarn.service.container-failure.**}} properties. Otherwise it 
> gets confusing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8544) [DS] AM registration fails when hadoop authorization is enabled

2018-07-20 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8544:
-
Target Version/s: 3.2.0, 3.1.2  (was: 3.1.1)

> [DS] AM registration fails when hadoop authorization is enabled
> ---
>
> Key: YARN-8544
> URL: https://issues.apache.org/jira/browse/YARN-8544
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8544.001.patch
>
>
> Application master fails to register when hadoop authorization is enabled.
> DistributedSchedulingAMProtocol connection authorization fails are RM side  
> Issue credits: [~BilwaST]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8480) Add boolean option for resources

2018-07-20 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551245#comment-16551245
 ] 

Wangda Tan commented on YARN-8480:
--

[~templedf],
{quote}It also looks to me like the PCM is an out-of-band scheduling process 
that skips the usual scheduling process.
{quote}
There are two approaches to PCM allocation: one is the 
{{PlacementConstraintProcessor}}, which is out-of-band; the other is the 
scheduler path, which happens inside the normal scheduler loop. You can check 
http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html
 for more details.

When we made the scheduler changes for placement constraints, we put most of the 
logic into common parts like AppSchedulingInfo, so other scheduler implementations 
can use it directly. The hardest part is done; you only need to add some glue code 
to make FS work.

If you really don't want to use the new SchedulingRequest, adding node 
attributes to ResourceRequest could also be considered (I don't particularly like 
that approach, but it is acceptable to me).
By going with the approach of supporting node attributes via ResourceRequest, the 
changes in FS should be minimal, and all the other logic (how to assign node 
attributes to nodes, how nodes report node attributes to the RM, how the web UIs 
render attributes, etc.) can be reused.
Looping in other folks who are working on node attribute support, for more 
thoughts: [~sunilg], [~cheersyang], [~Naganarasimha], [~bibinchundatt].

> Add boolean option for resources
> 
>
> Key: YARN-8480
> URL: https://issues.apache.org/jira/browse/YARN-8480
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8480.001.patch, YARN-8480.002.patch
>
>
> Make it possible to define a resource with a boolean value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8330) An extra container got launched by RM for yarn-service

2018-07-20 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551137#comment-16551137
 ] 

Wangda Tan commented on YARN-8330:
--

[~rohithsharma], 

To me it is fine that we publish ALLOCATED/ACQUIRED containers to ATS, since they 
are user-visible containers. In contrast, a RESERVED container is not visible to 
the app (a toy filter illustrating this is below).
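
To illustrate, here is a toy filter along those lines; it is not the actual RM 
SystemMetricsPublisher code, and the enum and method names are made up:

{code:java}
import java.util.EnumSet;

// Toy model of publishing only user-visible container states to ATS and
// skipping RESERVED containers. Names are illustrative, not YARN classes.
public class ReservedFilterSketch {
  enum ContainerState { NEW, RESERVED, ALLOCATED, ACQUIRED, RUNNING }

  static final EnumSet<ContainerState> PUBLISHABLE =
      EnumSet.of(ContainerState.ALLOCATED, ContainerState.ACQUIRED,
          ContainerState.RUNNING);

  static void maybePublish(String containerId, ContainerState state) {
    if (PUBLISHABLE.contains(state)) {
      System.out.println("publish to ATS: " + containerId + " in " + state);
    }
  }

  public static void main(String[] args) {
    maybePublish("container_..._02", ContainerState.ACQUIRED);  // published
    maybePublish("container_..._04", ContainerState.RESERVED);  // skipped
  }
}
{code}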

> An extra container got launched by RM for yarn-service
> --
>
> Key: YARN-8330
> URL: https://issues.apache.org/jira/browse/YARN-8330
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Yesha Vora
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8330.1.patch, YARN-8330.2.patch
>
>
> Steps:
> launch Hbase tarball app
> list containers for hbase tarball app
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list 
> appattempt_1525463491331_0006_01
> WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of 
> YARN_LOG_DIR.
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History 
> server at xxx/xxx:10200
> 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> Total number of containers :5
> Container-IdStart Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018  
>  N/A RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa
> 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03
> Fri May 04 22:34:26 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01
> Fri May 04 22:34:15 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05
> Fri May 04 22:34:56 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa
> 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04
> Fri May 04 22:34:56 + 2018   N/A
> nullxxx:25454  http://xxx:8042
> http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code}
> Total expected containers = 4 ( 3 components container + 1 am). Instead, RM 
> is listing 5 containers. 
> container_e06_1525463491331_0006_01_04 is in null state.
> Yarn service utilized container 02, 03, 05 for component. There is no log 
> available in NM & AM related to container 04. Only one line in RM log is 
> printed
> {code}
> 2018-05-04 22:34:56,618 INFO  rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(489)) - 
> container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to 
> RESERVED{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8559) Expose scheduling configuration info in Resource Manager's /conf endpoint

2018-07-20 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551066#comment-16551066
 ] 

Wangda Tan commented on YARN-8559:
--

[~cheersyang] / [~banditka], 

Actually, the scheduler-conf endpoint was not intended for getting the scheduler's 
configs; it is an API for updating scheduler configs using PUT. I think we should 
add a GET API to the same endpoint. Thoughts? (A hedged sketch of such a GET call 
is below.)
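
As a rough illustration only: the snippet below assumes a GET handler would be 
added at the existing /ws/v1/cluster/scheduler-conf path and that the RM runs at 
rm-host:8088; both are assumptions, since today the endpoint is only documented for 
PUT-based mutation.

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the proposed read-only access to scheduler configuration.
// The URL and the availability of GET on this endpoint are assumptions.
public class SchedulerConfGetSketch {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://rm-host:8088/ws/v1/cluster/scheduler-conf");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Accept", "application/json");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);   // would print the scheduler config payload
      }
    }
  }
}
{code}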

> Expose scheduling configuration info in Resource Manager's /conf endpoint
> -
>
> Key: YARN-8559
> URL: https://issues.apache.org/jira/browse/YARN-8559
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Anna Savarin
>Priority: Minor
>
> All Hadoop services provide a set of common endpoints (/stacks, /logLevel, 
> /metrics, /jmx, /conf).  In the case of the Resource Manager, part of the 
> configuration comes from the scheduler being used.  Currently, these 
> configuration key/values are not exposed through the /conf endpoint, thereby 
> revealing an incomplete configuration picture. 
> Make an improvement and expose the scheduling configuration info through the 
> RM's /conf endpoint.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-20 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551057#comment-16551057
 ] 

Wangda Tan commented on YARN-8418:
--

On the logic of the patch, a couple of comments: 

1) Is it possible to send the "new token arrived" message to 
LogAggregationService instead of handling it inside ContainerManagerImpl? 

2) In addition to the above suggestion: ContainerManagerImpl should not 
tell LogAggregationService that "app log aggregation is enabled". Instead, it 
should notify LogAggregationService that a new token arrived and let 
LogAggregationService make the decision. 

3) For the following logic:
{code:java}
375 AppLogAggregator aggregator = appLogAggregators.get(appId);
376 if (aggregator != null) {{code}
When could this be null? 

4) For the following logic:
{code:java}
259 if (e.getCause() instanceof SecretManager.InvalidToken) {
260 aggregationDisabledApps.add(appId);
261 }{code}
Then we should call it "invalidTokenApps" or some other more appropriate name. We 
want to distinguish this scenario from other kinds of invalid apps.

5) What happens if the token arrives after the AppLogAggregator has been removed 
from the context? Is that possible? If yes, are we going to remove the log dir in 
this case?

6) Have you done tests on a real cluster to prove it works? Just to make sure we're 
pushing the right fix to 3.1.1, given we don't have much time before the RC. 

[~sunil.gov...@gmail.com], [~suma.shivaprasad], could you help review the 
patch and the related logic, since I will be off after tomorrow?

 

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch
>
>
> If log aggregation fails init createApp directory container logs could get 
> leaked in NM directory
> For log running application restart of NM after token renewal this case is 
> possible/  Application submission with invalid delegation token



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-20 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551034#comment-16551034
 ] 

Wangda Tan commented on YARN-8418:
--

[~bibinchundatt], 
{quote}As part of YARN-4984 we disabled the thread / not sure was a leak.. 
{quote}
I see; I checked the history of the related fixes. 

There are 3 issues related to the problem:
1) YARN-4697 bounds the maximum number of threads that can be used by 
LogAggregationService.
2) YARN-4325 fixes a problem where a completed-app event is sent to 
LogAggregationService again.
3) YARN-4984 fixes a corner case where, even if creation of the log aggregation 
app dir fails, LogAggregationService still uses a new thread to do the follow-up 
work.

Given we have #1, I think the thread from #3 is not a big issue here; we need it to 
do app cleanup, and it is also needed to upload logs once a new token is received. 
Now I'm convinced that we should revert part of the changes in YARN-4984 and do the 
right fix here.

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch
>
>
> If log aggregation fails init createApp directory container logs could get 
> leaked in NM directory
> For log running application restart of NM after token renewal this case is 
> possible/  Application submission with invalid delegation token



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549964#comment-16549964
 ] 

Wangda Tan commented on YARN-8418:
--

[~bibinchundatt], I'm a bit hesitant to get this patch committed since it 
seems to revert the changes of YARN-4984. But this is a valid issue; is it possible 
to delay creating the app log aggregation thread until after the token is received? 
Otherwise, I would prefer to move this to 3.1.2, since it could introduce a thread 
leak.

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch
>
>
> If log aggregation fails init createApp directory container logs could get 
> leaked in NM directory
> For log running application restart of NM after token renewal this case is 
> possible/  Application submission with invalid delegation token



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8418:
-
Target Version/s: 3.1.1  (was: 3.1.2)

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch
>
>
> If log aggregation fails init createApp directory container logs could get 
> leaked in NM directory
> For log running application restart of NM after token renewal this case is 
> possible/  Application submission with invalid delegation token



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8418:
-
Target Version/s: 3.1.2  (was: 3.1.1)

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch
>
>
> If log aggregation fails init createApp directory container logs could get 
> leaked in NM directory
> For log running application restart of NM after token renewal this case is 
> possible/  Application submission with invalid delegation token



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8418:
-
Target Version/s: 3.1.1  (was: 3.1.2)

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch
>
>
> If log aggregation fails init createApp directory container logs could get 
> leaked in NM directory
> For log running application restart of NM after token renewal this case is 
> possible/  Application submission with invalid delegation token



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"

2018-07-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549901#comment-16549901
 ] 

Wangda Tan commented on YARN-8474:
--

Bulk update: moved all 3.1.1 non-blocker issues, please move back if it is a 
blocker.

> sleeper service fails to launch with "Authentication Required"
> --
>
> Key: YARN-8474
> URL: https://issues.apache.org/jira/browse/YARN-8474
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Sumana Sathish
>Assignee: Eric Yang
>Priority: Critical
> Attachments: YARN-8474.001.patch, YARN-8474.002.patch, 
> YARN-8474.003.patch, YARN-8474.004.patch
>
>
> Sleeper job fails with Authentication required.
> {code}
> yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition 
> from local FS: /a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-07-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549902#comment-16549902
 ] 

Wangda Tan commented on YARN-8234:
--

Bulk update: moved all 3.1.1 non-blocker issues, please move back if it is a 
blocker.

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Critical
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, 
> YARN-8234.001.patch, YARN-8234.002.patch, YARN-8234.003.patch
>
>
> When system metrics publisher is enabled, RM will push events to timeline 
> server via restful api. If the cluster load is heavy, many events are sent to 
> timeline server and the timeline server's event handler thread locked. 
> YARN-7266 talked about the detail of this problem. Because of the lock, 
> timeline server can't receive event as fast as it generated in RM and lots of 
> timeline event stays in RM's memory. Finally, those events will consume all 
> RM's memory and RM will start a full gc (which cause an JVM stop-world and 
> cause a timeout from rm to zookeeper) or even get an OOM. 
> The main problem here is that timeline can't receive timeline server's event 
> as fast as it generated. Now, RM system metrics publisher put only one event 
> in a request, and most time costs on handling http header or some thing about 
> the net connection on timeline side. Only few time is spent on dealing with 
> the timeline event which is truly valuable.
> In this issue, we add a buffer in system metrics publisher and let publisher 
> send events to timeline server in batch via one request. When sets the batch 
> size to 1000, in out experiment the speed of the timeline server receives 
> events has 100x improvement. We have implement this function int our product 
> environment which accepts 2 app's in one hour and it works fine.
> We add following configuration:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the size of 
> system metrics publisher sending events in one request. Default value is 1000
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: When 
> enable batch publishing, we must avoid that the publisher waits for a batch 
> to be filled up and hold events in buffer for long time. So we add another 
> thread which send event's in the buffer periodically. This config sets the 
> interval of the cyclical sending thread. The default value is 60s.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8474) sleeper service fails to launch with "Authentication Required"

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8474:
-
Target Version/s: 3.2.0, 3.1.2  (was: 3.2.0, 3.1.1)

> sleeper service fails to launch with "Authentication Required"
> --
>
> Key: YARN-8474
> URL: https://issues.apache.org/jira/browse/YARN-8474
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Sumana Sathish
>Assignee: Eric Yang
>Priority: Critical
> Attachments: YARN-8474.001.patch, YARN-8474.002.patch, 
> YARN-8474.003.patch, YARN-8474.004.patch
>
>
> Sleeper job fails with Authentication required.
> {code}
> yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition 
> from local FS: /a/YarnServiceLogs/sleeper-orig.json
>  18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8234:
-
Target Version/s: 3.1.2  (was: 3.1.1, 3.1.2)

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Critical
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, 
> YARN-8234.001.patch, YARN-8234.002.patch, YARN-8234.003.patch
>
>
> When system metrics publisher is enabled, RM will push events to timeline 
> server via restful api. If the cluster load is heavy, many events are sent to 
> timeline server and the timeline server's event handler thread locked. 
> YARN-7266 talked about the detail of this problem. Because of the lock, 
> timeline server can't receive event as fast as it generated in RM and lots of 
> timeline event stays in RM's memory. Finally, those events will consume all 
> RM's memory and RM will start a full gc (which cause an JVM stop-world and 
> cause a timeout from rm to zookeeper) or even get an OOM. 
> The main problem here is that timeline can't receive timeline server's event 
> as fast as it generated. Now, RM system metrics publisher put only one event 
> in a request, and most time costs on handling http header or some thing about 
> the net connection on timeline side. Only few time is spent on dealing with 
> the timeline event which is truly valuable.
> In this issue, we add a buffer in system metrics publisher and let publisher 
> send events to timeline server in batch via one request. When sets the batch 
> size to 1000, in out experiment the speed of the timeline server receives 
> events has 100x improvement. We have implement this function int our product 
> environment which accepts 2 app's in one hour and it works fine.
> We add following configuration:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the size of 
> system metrics publisher sending events in one request. Default value is 1000
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: When 
> enable batch publishing, we must avoid that the publisher waits for a batch 
> to be filled up and hold events in buffer for long time. So we add another 
> thread which send event's in the buffer periodically. This config sets the 
> interval of the cyclical sending thread. The default value is 60s.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8234:
-
Target Version/s: 3.1.1, 3.1.2  (was: 3.1.1)

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Critical
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, 
> YARN-8234.001.patch, YARN-8234.002.patch, YARN-8234.003.patch
>
>
> When the system metrics publisher is enabled, the RM pushes events to the 
> timeline server via a RESTful API. If the cluster load is heavy, many events 
> are sent to the timeline server and the timeline server's event handler thread 
> gets locked; YARN-7266 discusses the details of this problem. Because of the 
> lock, the timeline server can't receive events as fast as they are generated 
> in the RM, and lots of timeline events pile up in the RM's memory. Eventually 
> those events consume all of the RM's memory and the RM starts a full GC (which 
> causes a JVM stop-the-world pause and a timeout from the RM to ZooKeeper) or 
> even hits an OOM. 
> The main problem is that the timeline server can't receive events as fast as 
> the RM generates them. Today the RM system metrics publisher puts only one 
> event in each request, so most of the time on the timeline side is spent 
> handling HTTP headers and the network connection; only a small fraction is 
> spent on processing the timeline event itself, which is the truly valuable 
> work.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches via a single request. 
> With the batch size set to 1000, in our experiment the rate at which the 
> timeline server receives events improved by 100x. We have implemented this 
> function in our production environment, which accepts 2 apps in one hour, and 
> it works fine.
> We add the following configuration:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of 
> events the system metrics publisher sends in one request. The default value is 
> 1000.
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in the system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch 
> publishing is enabled, the publisher must not hold events in the buffer for a 
> long time waiting for a batch to fill up, so another thread sends the buffered 
> events periodically. This config sets the interval of that periodic sending 
> thread. The default value is 60s.
>  
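
For illustration only, here is a minimal sketch of the batching mechanism
described above. It is not the attached patch; the class and method names
(BatchingTimelinePublisher, flush, sendToTimelineServer) are made up, and the
YARN configuration keys named in the description are only referenced in
comments.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: buffer events and flush them to ATS in batches. */
public class BatchingTimelinePublisher {
  // Corresponds to ...system-metrics-publisher.buffer-size: bounded buffer.
  private final BlockingQueue<Object> buffer;
  // Corresponds to ...system-metrics-publisher.batch-size: max per request.
  private final int batchSize;
  private final ScheduledExecutorService flusher =
      Executors.newSingleThreadScheduledExecutor();

  public BatchingTimelinePublisher(int bufferSize, int batchSize,
      long intervalSeconds) {
    this.buffer = new LinkedBlockingQueue<>(bufferSize);
    this.batchSize = batchSize;
    // Corresponds to ...system-metrics-publisher.interval-seconds:
    // periodically flush partial batches so events never sit in the buffer
    // for too long.
    flusher.scheduleAtFixedRate(this::flush, intervalSeconds,
        intervalSeconds, TimeUnit.SECONDS);
  }

  public void publish(Object event) throws InterruptedException {
    buffer.put(event);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  private synchronized void flush() {
    List<Object> batch = new ArrayList<>(batchSize);
    buffer.drainTo(batch, batchSize);
    if (!batch.isEmpty()) {
      sendToTimelineServer(batch); // one REST request for the whole batch
    }
  }

  private void sendToTimelineServer(List<Object> batch) {
    // Placeholder: the real publisher would POST the batch to the timeline
    // server's REST endpoint in a single request.
  }
}
{code}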



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8514) YARN RegistryDNS throws NPE when Kerberos tgt expires

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8514:
-
Target Version/s: 3.2.0, 3.1.2  (was: 3.2.0, 3.1.1)

> YARN RegistryDNS throws NPE when Kerberos tgt expires
> -
>
> Key: YARN-8514
> URL: https://issues.apache.org/jira/browse/YARN-8514
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.9.1, 3.0.1, 3.0.2, 2.9.2
>Reporter: Eric Yang
>Priority: Critical
>
> After Kerberos ticket expires, RegistryDNS throws NPE error:
> {code:java}
> 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler 
> (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT 
> Renewer for rm/host1.example@example.com,5,main] threw an Exception.
> java.lang.NullPointerException
>         at 
> javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482)
>         at 
> org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894)
>         at java.lang.Thread.run(Thread.java:745){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8242:
-
Target Version/s: 3.2.0, 3.1.2  (was: 3.2.0)

> YARN NM: OOM error while reading back the state store on recovery
> -
>
> Key: YARN-8242
> URL: https://issues.apache.org/jira/browse/YARN-8242
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2
>Reporter: Kanwaljeet Sachdev
>Priority: Critical
> Attachments: YARN-8242.001.patch, YARN-8242.002.patch, 
> YARN-8242.003.patch
>
>
> On startup the NM reads its state store and builds a list of applications in 
> the state store to process. If the number of applications in the state store 
> is large and they have a lot of "state" attached to them, the NM can run out 
> of memory and never get to the point where it can start processing the 
> recovery.
> Since it never starts the recovery, there is no way for the NM to ever get 
> past this point; it requires an increase in heap size to get the NM started.
>  
> Following is the stack trace
> {code:java}
> at java.lang.OutOfMemoryError. (OutOfMemoryError.java:48) at 
> com.google.protobuf.ByteString.copyFrom (ByteString.java:192) at 
> com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47069) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47014) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47102) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47097) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:41016) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:40942) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41080) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41075) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24517) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24464) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24568) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24563) at 
> com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141) 
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom
>  (YarnServiceProtos.java:24739) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState
>  (NMLeveldbStateStoreService.java:217) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState
>  (NMLeveldbStateStoreService.java:170) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover
>  (ContainerManagerImpl.java:253) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit
>  (ContainerManagerImpl.java:237) at 
> org.apache.hadoop.service.AbstractService.init (AbstractService.java:163) at 
> org.apache.hadoop.service.CompositeService.serviceInit 
> (CompositeService.java:107) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit 
> (NodeManager.java:255) at org.apache.hadoop.service.AbstractService.init 
> (AbstractService.java:163) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager 
> (NodeManager.java:474) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main 
> (NodeManager.java:521){code}
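
One possible mitigation direction, sketched here only for illustration and not
taken from the attached patches: iterate over the recovered container records
one at a time instead of materializing the full recovery list in memory first.
The interface and method names below are hypothetical.

{code:java}
import java.io.IOException;
import java.util.Iterator;

/**
 * Hypothetical sketch: stream recovered container records so the NM's heap
 * usage during recovery stays bounded, instead of building a full list of
 * all applications/containers before processing starts.
 */
public class StreamingRecovery {

  /** Stand-in for whatever record the state store yields per container. */
  interface RecoveredContainerRecord { }

  interface ContainerStateIterator
      extends Iterator<RecoveredContainerRecord>, AutoCloseable { }

  interface RecoveryHandler {
    void recover(RecoveredContainerRecord record) throws IOException;
  }

  public void recoverAll(ContainerStateIterator it, RecoveryHandler handler)
      throws Exception {
    try (ContainerStateIterator containers = it) {
      while (containers.hasNext()) {
        // Each record is parsed, handed off, and becomes eligible for GC
        // before the next one is read from the state store.
        handler.recover(containers.next());
      }
    }
  }
}
{code}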



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-07-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549898#comment-16549898
 ] 

Wangda Tan commented on YARN-8242:
--

Bulk update: moved all 3.1.1 non-blocker issues; please move this back if it is 
a blocker.

> YARN NM: OOM error while reading back the state store on recovery
> -
>
> Key: YARN-8242
> URL: https://issues.apache.org/jira/browse/YARN-8242
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2
>Reporter: Kanwaljeet Sachdev
>Priority: Critical
> Attachments: YARN-8242.001.patch, YARN-8242.002.patch, 
> YARN-8242.003.patch
>
>
> On startup the NM reads its state store and builds a list of applications in 
> the state store to process. If the number of applications in the state store 
> is large and they have a lot of "state" attached to them, the NM can run out 
> of memory and never get to the point where it can start processing the 
> recovery.
> Since it never starts the recovery, there is no way for the NM to ever get 
> past this point; it requires an increase in heap size to get the NM started.
>  
> Following is the stack trace
> {code:java}
> at java.lang.OutOfMemoryError. (OutOfMemoryError.java:48) at 
> com.google.protobuf.ByteString.copyFrom (ByteString.java:192) at 
> com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47069) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47014) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47102) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47097) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:41016) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:40942) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41080) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41075) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24517) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24464) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24568) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24563) at 
> com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141) 
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom
>  (YarnServiceProtos.java:24739) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState
>  (NMLeveldbStateStoreService.java:217) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState
>  (NMLeveldbStateStoreService.java:170) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover
>  (ContainerManagerImpl.java:253) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit
>  (ContainerManagerImpl.java:237) at 
> org.apache.hadoop.service.AbstractService.init (AbstractService.java:163) at 
> org.apache.hadoop.service.CompositeService.serviceInit 
> (CompositeService.java:107) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit 
> (NodeManager.java:255) at org.apache.hadoop.service.AbstractService.init 
> (AbstractService.java:163) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager 
> (NodeManager.java:474) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main 
> (NodeManager.java:521){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Updated] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8242:
-
Target Version/s: 3.2.0  (was: 3.2.0, 3.1.1)

> YARN NM: OOM error while reading back the state store on recovery
> -
>
> Key: YARN-8242
> URL: https://issues.apache.org/jira/browse/YARN-8242
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2
>Reporter: Kanwaljeet Sachdev
>Priority: Critical
> Attachments: YARN-8242.001.patch, YARN-8242.002.patch, 
> YARN-8242.003.patch
>
>
> On startup the NM reads its state store and builds a list of applications in 
> the state store to process. If the number of applications in the state store 
> is large and they have a lot of "state" attached to them, the NM can run out 
> of memory and never get to the point where it can start processing the 
> recovery.
> Since it never starts the recovery, there is no way for the NM to ever get 
> past this point; it requires an increase in heap size to get the NM started.
>  
> Following is the stack trace
> {code:java}
> at java.lang.OutOfMemoryError. (OutOfMemoryError.java:48) at 
> com.google.protobuf.ByteString.copyFrom (ByteString.java:192) at 
> com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47069) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. 
> (YarnProtos.java:47014) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47102) at 
> org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom
>  (YarnProtos.java:47097) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:41016) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. 
> (YarnProtos.java:40942) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41080) at 
> org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom
>  (YarnProtos.java:41075) at com.google.protobuf.CodedInputStream.readMessage 
> (CodedInputStream.java:309) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24517) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.
>  (YarnServiceProtos.java:24464) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24568) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom
>  (YarnServiceProtos.java:24563) at 
> com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141) 
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193) at 
> com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49) at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom
>  (YarnServiceProtos.java:24739) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState
>  (NMLeveldbStateStoreService.java:217) at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState
>  (NMLeveldbStateStoreService.java:170) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover
>  (ContainerManagerImpl.java:253) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit
>  (ContainerManagerImpl.java:237) at 
> org.apache.hadoop.service.AbstractService.init (AbstractService.java:163) at 
> org.apache.hadoop.service.CompositeService.serviceInit 
> (CompositeService.java:107) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit 
> (NodeManager.java:255) at org.apache.hadoop.service.AbstractService.init 
> (AbstractService.java:163) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager 
> (NodeManager.java:474) at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main 
> (NodeManager.java:521){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8015) Support inter-app placement constraints in AppPlacementAllocator

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8015:
-
Target Version/s: 3.2.0  (was: 3.1.2)

> Support inter-app placement constraints in AppPlacementAllocator
> 
>
> Key: YARN-8015
> URL: https://issues.apache.org/jira/browse/YARN-8015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Attachments: YARN-8015.001.patch, YARN-8015.002.patch
>
>
> AppPlacementAllocator currently only supports intra-app anti-affinity 
> placement constraints, once YARN-8002 and YARN-8013 are resolved, it needs to 
> support inter-app constraints too. Also, this may require some refactoring on 
> the existing code logic. Use this JIRA to track.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8015) Support inter-app placement constraints in AppPlacementAllocator

2018-07-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549897#comment-16549897
 ] 

Wangda Tan commented on YARN-8015:
--

Bulk update: moved all 3.1.1 non-blocker issues; please move this back if it is 
a blocker.

> Support inter-app placement constraints in AppPlacementAllocator
> 
>
> Key: YARN-8015
> URL: https://issues.apache.org/jira/browse/YARN-8015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Attachments: YARN-8015.001.patch, YARN-8015.002.patch
>
>
> AppPlacementAllocator currently only supports intra-app anti-affinity 
> placement constraints, once YARN-8002 and YARN-8013 are resolved, it needs to 
> support inter-app constraints too. Also, this may require some refactoring on 
> the existing code logic. Use this JIRA to track.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8015) Support inter-app placement constraints in AppPlacementAllocator

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8015:
-
Target Version/s: 3.1.2  (was: 3.1.1)

> Support inter-app placement constraints in AppPlacementAllocator
> 
>
> Key: YARN-8015
> URL: https://issues.apache.org/jira/browse/YARN-8015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Attachments: YARN-8015.001.patch, YARN-8015.002.patch
>
>
> AppPlacementAllocator currently only supports intra-app anti-affinity 
> placement constraints, once YARN-8002 and YARN-8013 are resolved, it needs to 
> support inter-app constraints too. Also, this may require some refactoring on 
> the existing code logic. Use this JIRA to track.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app

2018-07-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8418:
-
Target Version/s: 3.1.2  (was: 3.1.1)

> App local logs could leaked if log aggregation fails to initialize for the app
> --
>
> Key: YARN-8418
> URL: https://issues.apache.org/jira/browse/YARN-8418
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0, 3.0.0-alpha1
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-8418.001.patch, YARN-8418.002.patch, 
> YARN-8418.003.patch, YARN-8418.004.patch
>
>
> If log aggregation fails to init the createApp directory, container logs 
> could get leaked in the NM directory.
> For a long-running application, this case is possible when the NM restarts 
> after token renewal, or when an application is submitted with an invalid 
> delegation token.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8544) [DS] AM registration fails when hadoop authorization is enabled

2018-07-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549715#comment-16549715
 ] 

Wangda Tan commented on YARN-8544:
--

[~bibinchundatt], thanks for working on the issue. 

Is this a regression in 3.1.x? If yes, could u add a link to the patch that 
breaks the behavior? If no, given this patch adds new fields, which I think 
makes it a new feature, should we move it to 3.2.0 to de-risk the 3.1.1 
release? 

> [DS] AM registration fails when hadoop authorization is enabled
> ---
>
> Key: YARN-8544
> URL: https://issues.apache.org/jira/browse/YARN-8544
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8544.001.patch
>
>
> Application master fails to register when hadoop authorization is enabled.
> DistributedSchedulingAMProtocol connection authorization fails at the RM side.  
> Issue credits: [~BilwaST]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8541) RM startup failure on recovery after user deletion

2018-07-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549710#comment-16549710
 ] 

Wangda Tan commented on YARN-8541:
--

Thanks [~bibinchundatt] for reporting this issue. So what is the behavior after 
this patch? We're going to fail the individual app instead of the whole RM, 
correct? 

[~sunilg], could u help to review and get this patch committed?
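
For context, a minimal sketch (with made-up names, not the actual RMAppManager
code) of the behavior the comment above asks about: catch a per-application
recovery failure and fail only that application, rather than aborting the whole
RM transition to active.

{code:java}
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: recover applications one by one and mark only the
 * application whose recovery throws as failed, instead of letting a single
 * bad app (e.g. one submitted by a since-deleted user) bring down the RM.
 */
public class PerAppRecovery {

  interface AppState { String appId(); }

  interface Recoverer { void recover(AppState app) throws Exception; }

  public List<String> recoverAll(List<AppState> apps, Recoverer recoverer) {
    List<String> failed = new ArrayList<>();
    for (AppState app : apps) {
      try {
        recoverer.recover(app);
      } catch (Exception e) {
        // e.g. "No groups found for user ..." because the user was deleted.
        System.err.println("Failed to recover " + app.appId() + ": " + e
            + "; marking it failed and continuing with the rest");
        failed.add(app.appId());
      }
    }
    return failed;
  }
}
{code}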

> RM startup failure on recovery after user deletion
> --
>
> Key: YARN-8541
> URL: https://issues.apache.org/jira/browse/YARN-8541
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.0
>Reporter: yimeng
>Assignee: Bibin A Chundatt
>Priority: Blocker
> Attachments: YARN-8541.001.patch, YARN-8541.002.patch
>
>
> My Hadoop version is 3.1.0. I found a problem where RM startup fails on 
> recovery, with the following test steps:
> 1. Create a user "user1" that has permission to submit apps.
> 2. Use user1 to submit a job and wait for the job to finish.
> 3. Delete user "user1".
> 4. Restart YARN.
> 5. The RM restart fails.
> RM logs:
> 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized root queue 
> root: numChildQueue= 3, capacity=1.0, absoluteCapacity=1.0, 
> usedResources=usedCapacity=0.0, numApps=0, 
> numContainers=0 | CapacitySchedulerQueueManager.java:163
> 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized queue 
> mappings, override: false | UserGroupMappingPlacementRule.java:232
> 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized 
> CapacityScheduler with calculator=class 
> org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, 
> minimumAllocation=<>, maximumAllocation=< vCores:32>>, asynchronousScheduling=false, asyncScheduleInterval=5ms | 
> CapacityScheduler.java:392
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | dynamic-resources.xml not 
> found | Configuration.java:2767
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | Initializing AMS 
> Processing chain. Root 
> Processor=[org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor].
>  | AMSProcessingChain.java:62
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | disabled placement 
> handler will be used, all scheduling requests will be rejected. | 
> ApplicationMasterService.java:130
> 2018-07-16 16:24:59,709 | INFO | main-EventThread | Adding 
> [org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor]
>  tp top of AMS Processing chain. | AMSProcessingChain.java:75
> 2018-07-16 16:24:59,713 | WARN | main-EventThread | Exception handling the 
> winning of election | ActiveStandbyElector.java:897
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
>  at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:893)
>  at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473)
>  at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:728)
>  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:600)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>  ... 4 more
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application 
> application_1531624956005_0001 submitted by user super reason: No groups 
> found for user super
>  at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1204)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1245)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1241)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1241)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToAc

[jira] [Commented] (YARN-7974) Allow updating application tracking url after registration

2018-07-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549668#comment-16549668
 ] 

Wangda Tan commented on YARN-7974:
--

LGTM +1, thanks [~jhung] for the patch.

If Jenkins comes with green color, I will commit the patch by tomorrow if no 
objections.

> Allow updating application tracking url after registration
> --
>
> Key: YARN-7974
> URL: https://issues.apache.org/jira/browse/YARN-7974
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-7974.001.patch, YARN-7974.002.patch, 
> YARN-7974.003.patch, YARN-7974.004.patch, YARN-7974.005.patch, 
> YARN-7974.006.patch
>
>
> Normally an application's tracking url is set on AM registration. We have a 
> use case for updating the tracking url after registration (e.g. the UI is 
> hosted on one of the containers).
> The approach is for the AM to update the tracking url on a heartbeat to the 
> RM, and to add a related API in AMRMClient.
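
A hedged sketch of how an AM might use such an API; the updateTrackingUrl call
below reflects the approach described in this issue, and its exact name and
signature in the committed patch may differ. Hostnames and ports are
placeholders.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

/** Illustrative AM-side usage sketch, not the actual patch. */
public class TrackingUrlUpdateExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    AMRMClient<ContainerRequest> amRmClient = AMRMClient.createAMRMClient();
    amRmClient.init(conf);
    amRmClient.start();

    // Initial tracking URL supplied at registration, as usual.
    amRmClient.registerApplicationMaster("am-host", 0,
        "http://am-host:8080/placeholder");

    // Later, once the real UI container is up, update the tracking URL;
    // the new value is carried to the RM on a subsequent allocate heartbeat.
    amRmClient.updateTrackingUrl("http://ui-container-host:9090/");

    // ... run the application, heartbeat via allocate(...), etc. ...

    amRmClient.unregisterApplicationMaster(
        FinalApplicationStatus.SUCCEEDED, "done", null);
    amRmClient.stop();
  }
}
{code}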



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8545) YARN native service should return container if launch failed

2018-07-17 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8545:
-
Description: 
In some cases, container launch may fail but the container will not be properly 
returned to the RM. 

This can happen when the AM tries to prepare the container launch context but 
fails without sending it to the NM (once the container launch context is sent 
to the NM, the NM will report the failed container to the RM).

Exception like: 
{code:java}
java.io.FileNotFoundException: File does not exist: 
hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at 
org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
at 
org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
at 
org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
at 
org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745){code}
And even after the container launch context preparation failed, the AM still 
tries to monitor the container's readiness:
{code:java}
2018-07-17 18:42:57,518 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 
18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP 
presence", exception="java.io.IOException: primary-worker-0: IP is not 
available yet"

...{code}

  was:
In some cases, container launch may fail but container will not be properly 
returned to RM. 

This could happen when AM trying to prepare container launch context but failed 
w/o sending container launch context to NM (Once container launch context is 
sent to NM, NM will report failed container to RM).

Exception like: 
{code:java}
java.io.FileNotFoundException: File does not exist: 
hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at 
org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
at 
org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
at 
org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
at 
org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745){code}


> YARN native service should return container if launch failed
> 
>
> Key: YARN-8545
> URL: https://issues.apache.org/jira/browse/YARN-8545
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Priority: Critical
>
> In some cases, container launch may fail but the container will not be 
> properly returned to the RM. 
> This can happen when the AM tries to prepare the container launch context but 
> fails without sending it to the NM (once the container launch context is sent 
> to the NM, the NM will report the failed container to the RM).
> Exception like: 
> {code:java}
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
>   at 
> org.a

[jira] [Created] (YARN-8545) YARN native service should return container if launch failed

2018-07-17 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8545:


 Summary: YARN native service should return container if launch 
failed
 Key: YARN-8545
 URL: https://issues.apache.org/jira/browse/YARN-8545
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan


In some cases, container launch may fail but the container will not be properly 
returned to the RM. 

This can happen when the AM tries to prepare the container launch context but 
fails without sending it to the NM (once the container launch context is sent 
to the NM, the NM will report the failed container to the RM).

Exception like: 
{code:java}
java.io.FileNotFoundException: File does not exist: 
hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at 
org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
at 
org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
at 
org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
at 
org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745){code}
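
A hedged sketch of the fix direction described above: if building the launch
context throws before anything is sent to the NM, hand the container back to
the RM instead of leaving the readiness monitor waiting on it. The class and
helper names below are illustrative, not the yarn-service code.

{code:java}
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;

/**
 * Illustrative sketch only: release a container back to the RM when the AM
 * fails to prepare its launch context (so it was never sent to the NM and
 * the NM will never report it as failed).
 */
public class LaunchOrRelease {

  private final AMRMClientAsync<?> amRmClient;
  private final NMClientAsync nmClient;

  public LaunchOrRelease(AMRMClientAsync<?> amRmClient,
      NMClientAsync nmClient) {
    this.amRmClient = amRmClient;
    this.nmClient = nmClient;
  }

  public void launch(Container container) {
    try {
      // Hypothetical helper: this is where e.g. the FileNotFoundException
      // from buildContainerLaunchContext() would surface.
      nmClient.startContainerAsync(container, buildLaunchContext(container));
    } catch (Exception e) {
      // The launch context was never sent to the NM, so nothing will report
      // this container as failed; return it to the RM explicitly so the
      // component instance can be rescheduled.
      amRmClient.releaseAssignedContainer(container.getId());
    }
  }

  private ContainerLaunchContext buildLaunchContext(Container container)
      throws Exception {
    // Placeholder for localizing resources, writing run-*.sh, etc.
    throw new UnsupportedOperationException("sketch only");
  }
}
{code}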



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration

2018-07-17 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8361:
-
Target Version/s: 3.2.0  (was: 3.2.0, 3.1.1)

> Change App Name Placement Rule to use App Name instead of App Id for 
> configuration
> --
>
> Key: YARN-8361
> URL: https://issues.apache.org/jira/browse/YARN-8361
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8361.001.patch, YARN-8361.002.patch, 
> YARN-8361.003.patch
>
>
> In YARN-8016, we exposed a framework that lets users specify custom placement 
> rules through CS configuration, and also added a new placement rule that maps 
> specific apps to queues. However, the strategy implemented in YARN-8016 used 
> the application id, which makes this config hard for users to work with. In 
> this JIRA, we change the mapping to use the application name. More 
> specifically:
> 1. AppNamePlacementRule used the app id when specifying queue mapping 
> placement rules; it should use the app name instead.
> 2. Change the documentation as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration

2018-07-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8361:
-
Fix Version/s: 3.2.0

> Change App Name Placement Rule to use App Name instead of App Id for 
> configuration
> --
>
> Key: YARN-8361
> URL: https://issues.apache.org/jira/browse/YARN-8361
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8361.001.patch, YARN-8361.002.patch, 
> YARN-8361.003.patch
>
>
> In YARN-8016, we exposed a framework that lets users specify custom placement 
> rules through CS configuration, and also added a new placement rule that maps 
> specific apps to queues. However, the strategy implemented in YARN-8016 used 
> the application id, which makes this config hard for users to work with. In 
> this JIRA, we change the mapping to use the application name. More 
> specifically:
> 1. AppNamePlacementRule used the app id when specifying queue mapping 
> placement rules; it should use the app name instead.
> 2. Change the documentation as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration

2018-07-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545572#comment-16545572
 ] 

Wangda Tan commented on YARN-8361:
--

Thanks [~Zian Chen] for the patch and thanks for the reviews from 
[~suma.shivaprasad]; just pushed to trunk. 

This patch doesn't apply to branch-3.1, could u provide an updated patch for 
branch-3.1?

> Change App Name Placement Rule to use App Name instead of App Id for 
> configuration
> --
>
> Key: YARN-8361
> URL: https://issues.apache.org/jira/browse/YARN-8361
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8361.001.patch, YARN-8361.002.patch, 
> YARN-8361.003.patch
>
>
> In YARN-8016, we exposed a framework that lets users specify custom placement 
> rules through CS configuration, and also added a new placement rule that maps 
> specific apps to queues. However, the strategy implemented in YARN-8016 used 
> the application id, which makes this config hard for users to work with. In 
> this JIRA, we change the mapping to use the application name. More 
> specifically:
> 1. AppNamePlacementRule used the app id when specifying queue mapping 
> placement rules; it should use the app name instead.
> 2. Change the documentation as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8524) Single parameter Resource / LightWeightResource constructor looks confusing

2018-07-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545534#comment-16545534
 ] 

Wangda Tan commented on YARN-8524:
--

Thanks [~snemeth] for the patch, LGTM too. Will commit soon.

> Single parameter Resource / LightWeightResource constructor looks confusing
> ---
>
> Key: YARN-8524
> URL: https://issues.apache.org/jira/browse/YARN-8524
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8524.001.patch, YARN-8524.002.patch, 
> YARN-8524.003.patch
>
>
> The single parameter (long) constructor in Resource / LightWeightResource 
> sets all resource components to the same value.
> Since there are other constructors in these classes with (long, int) 
> parameters where the semantics are different, it could be confusing for the 
> users.
> The perfect place to create such a resource would be in the Resources class, 
> with a method named like "createResourceWithSameValue".
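
A minimal sketch of the factory method suggested above. The name
createResourceWithSameValue is taken from the description, not from the
committed code, and the real patch may place it differently or also cover
additional registered resource types.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;

public final class ResourceFactorySketch {

  private ResourceFactorySketch() { }

  /**
   * Sketch of the proposed helper: build a Resource whose components all
   * carry the same value, making the intent explicit at the call site
   * instead of relying on the single-parameter constructor.
   */
  public static Resource createResourceWithSameValue(long value) {
    // Memory (MB) and vcores both set to the same value; custom resource
    // types registered via ResourceUtils would need the same treatment in
    // a real implementation.
    return Resource.newInstance(value, (int) value);
  }
}
{code}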



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration

2018-07-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543711#comment-16543711
 ] 

Wangda Tan commented on YARN-8361:
--

LGTM +1, thanks [~Zian Chen] for the patch and [~suma.shivaprasad] for the 
reviews. Will commit today if no objections.

> Change App Name Placement Rule to use App Name instead of App Id for 
> configuration
> --
>
> Key: YARN-8361
> URL: https://issues.apache.org/jira/browse/YARN-8361
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
> Attachments: YARN-8361.001.patch, YARN-8361.002.patch, 
> YARN-8361.003.patch
>
>
> In YARN-8016, we exposed a framework that lets users specify custom placement 
> rules through CS configuration, and also added a new placement rule that maps 
> specific apps to queues. However, the strategy implemented in YARN-8016 used 
> the application id, which makes this config hard for users to work with. In 
> this JIRA, we change the mapping to use the application name. More 
> specifically:
> 1. AppNamePlacementRule used the app id when specifying queue mapping 
> placement rules; it should use the app name instead.
> 2. Change the documentation as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

2018-07-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543689#comment-16543689
 ] 

Wangda Tan commented on YARN-8513:
--

[~cyfdecyf], I couldn't find the error message in the latest codebase, so I'm 
not sure whether this is still a problem in the latest release (3.1.0). We have 
made many fixes to CapacityScheduler scheduling in the last several months 
after YARN-5139, and I believe many of them have not been backported to 2.9.1. 
Could u check whether the problem still exists in 3.1.0, if possible?

> CapacityScheduler infinite loop when queue is near fully utilized
> -
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.1
> Environment: Ubuntu 14.04.5
> YARN is configured with one label and 5 queues.
>Reporter: Chen Yufei
>Priority: Major
>
> ResourceManager sometimes does not respond to any request when a queue is 
> nearly fully utilized. Sending SIGTERM won't stop the RM; only SIGKILL can. 
> After an RM restart, it can recover running jobs and start accepting new ones.
>  
> It seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used= 
> cluster=}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_01 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource= type=NODE_LOCAL 
> requestedPartition=}}
>  
> I have encountered this problem several times after upgrading to YARN 2.9.1, 
> while the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a 
> similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM

2018-07-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543683#comment-16543683
 ] 

Wangda Tan commented on YARN-8511:
--

Thanks [~cheersyang], 

The latest patch LGTM, +1. Will commit today if no objections. [~asuresh] / 
[~kkaranasos], wanna take a look before commit?

 

> When AM releases a container, RM removes allocation tags before it is 
> released by NM
> 
>
> Key: YARN-8511
> URL: https://issues.apache.org/jira/browse/YARN-8511
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8511.001.patch, YARN-8511.002.patch, 
> YARN-8511.003.patch, YARN-8511.004.patch
>
>
> Users leverage placement constraints (PC) with allocation tags to avoid port 
> conflicts between apps, but we found that they sometimes still get port 
> conflicts. This is an issue similar to YARN-4148: the RM immediately removes 
> allocation tags once AM#allocate asks to release a container, but the 
> container on the NM has some delay until it is actually killed and releases 
> the port. We should let the RM remove allocation tags AFTER the NM confirms 
> the containers are released. 
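
A hedged, simplified sketch of the ordering change described above. The real
scheduler code paths differ; the interface and event handler names here are
placeholders, not the patch.

{code:java}
/**
 * Illustrative sketch only: defer allocation-tag removal until the NM has
 * confirmed the container is gone, rather than removing tags as soon as the
 * AM asks to release the container.
 */
public class TagLifecycleSketch {

  interface AllocationTagsManager {
    void removeTags(String containerId);
  }

  private final AllocationTagsManager tagsManager;

  public TagLifecycleSketch(AllocationTagsManager tagsManager) {
    this.tagsManager = tagsManager;
  }

  /** AM#allocate asked to release the container: do NOT drop tags yet. */
  public void onReleaseRequested(String containerId) {
    // The container may still be running on the NM and holding its ports.
  }

  /** NM reported the container as completed: now it is safe to drop tags. */
  public void onContainerCompletedByNM(String containerId) {
    tagsManager.removeTags(containerId);
  }
}
{code}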



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-07-12 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542363#comment-16542363
 ] 

Wangda Tan commented on YARN-8135:
--

Added Google doc link to Design doc.

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8135.poc.001.patch
>
>
> Description:
> *Goals:*
>  - Allow infra engineers / data scientists to run *unmodified* Tensorflow 
> jobs on YARN.
>  - Allow jobs easy access to data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support running distributed Tensorflow jobs with simple configs.
>  - Support running user-specified Docker images.
>  - Support specifying GPU and other resources.
>  - Support launching tensorboard if the user specifies it.
>  - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because a submarine is the only vehicle that can let humans explore deep 
> places. B-)
> h3. {color:#FF}Please refer to on-going design doc, and add your 
> thoughts: 
> {color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8330) An extra container got launched by RM for yarn-service

2018-07-12 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542327#comment-16542327
 ] 

Wangda Tan edited comment on YARN-8330 at 7/12/18 11:35 PM:


Trying to remember this issue and posting it here before I forget again: 
- The issue is caused by sending container information to ATS inside 
RMContainerImpl's constructor; we should not do that: 
{code}
 // If saveNonAMContainerMetaInfo is true, store system metrics for all
 // containers. If false, and if this container is marked as the AM, metrics
 // will still be published for this container, but that calculation happens
 // later.
 if (saveNonAMContainerMetaInfo && null != container.getId()) {
 rmContext.getSystemMetricsPublisher().containerCreated(
 this, this.creationTime);
 }
{code}


was (Author: leftnoteasy):
Trying to remember this issue, and post it here before forgot again: 
- The issue is caused by we send container information to ATS inside 
RMContainerImpl's constructor, we should not do it: 
{code}
 // If saveNonAMContainerMetaInfo is true, store system metrics for all
 // containers. If false, and if this container is marked as the AM, metrics
 // will still be published for this container, but that calculation happens
 // later.
 if (saveNonAMContainerMetaInfo && null != container.getId()) {
 rmContext.getSystemMetricsPublisher().containerCreated(
 this, this.creationTime);
 }
{code

> An extra container got launched by RM for yarn-service
> --
>
> Key: YARN-8330
> URL: https://issues.apache.org/jira/browse/YARN-8330
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Yesha Vora
>Assignee: Suma Shivaprasad
>Priority: Critical
>
> Steps:
> launch Hbase tarball app
> list containers for hbase tarball app
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list 
> appattempt_1525463491331_0006_01
> WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of 
> YARN_LOG_DIR.
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History 
> server at xxx/xxx:10200
> 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> Total number of containers :5
> Container-IdStart Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018  
>  N/A RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa
> 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03
> Fri May 04 22:34:26 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01
> Fri May 04 22:34:15 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05
> Fri May 04 22:34:56 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa
> 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04
> Fri May 04 22:34:56 + 2018   N/A
> nullxxx:25454  http://xxx:8042
> http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code}
> Total expected containers = 4 (3 component containers + 1 AM). Instead, the 
> RM is listing 5 containers. 
> container_e06_1525463491331_0006_01_04 is in null state.
> Yarn service utilized containers 02, 03, and 05 for components. There is no 
> log available in the NM & AM related to container 04. Only one line is 
> printed in the RM log:
> {code}

[jira] [Commented] (YARN-8330) An extra container got launched by RM for yarn-service

2018-07-12 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542327#comment-16542327
 ] 

Wangda Tan commented on YARN-8330:
--

Trying to remember this issue and posting it here before I forget again: 
- The issue is caused by sending container information to ATS inside 
RMContainerImpl's constructor; we should not do that: 
{code}
 // If saveNonAMContainerMetaInfo is true, store system metrics for all
 // containers. If false, and if this container is marked as the AM, metrics
 // will still be published for this container, but that calculation happens
 // later.
 if (saveNonAMContainerMetaInfo && null != container.getId()) {
 rmContext.getSystemMetricsPublisher().containerCreated(
 this, this.creationTime);
 }
{code}

> An extra container got launched by RM for yarn-service
> --
>
> Key: YARN-8330
> URL: https://issues.apache.org/jira/browse/YARN-8330
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Yesha Vora
>Assignee: Suma Shivaprasad
>Priority: Critical
>
> Steps:
> launch Hbase tarball app
> list containers for hbase tarball app
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list 
> appattempt_1525463491331_0006_01
> WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of 
> YARN_LOG_DIR.
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of 
> YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of 
> YARN_PID_DIR.
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History 
> server at xxx/xxx:10200
> 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> Total number of containers :5
> Container-IdStart Time Finish Time   
> StateHost   Node Http Address 
>LOG-URL
> container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018  
>  N/A RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa
> 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03
> Fri May 04 22:34:26 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01
> Fri May 04 22:34:15 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05
> Fri May 04 22:34:56 + 2018   N/A 
> RUNNINGxxx:25454  http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa
> 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - 
> run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04
> Fri May 04 22:34:56 + 2018   N/A
> nullxxx:25454  http://xxx:8042
> http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code}
> Total expected containers = 4 (3 component containers + 1 AM). Instead, the RM 
> is listing 5 containers. 
> container_e06_1525463491331_0006_01_04 is in null state.
> The YARN service used containers 02, 03, and 05 for components. There is no log 
> available in the NM & AM related to container 04. Only one line is printed in 
> the RM log:
> {code}
> 2018-05-04 22:34:56,618 INFO  rmcontainer.RMContainerImpl 
> (RMContainerImpl.java:handle(489)) - 
> container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to 
> RESERVED{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8522) Application fails with InvalidResourceRequestException

2018-07-12 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-8522:


Assignee: Zian Chen

> Application fails with InvalidResourceRequestException
> --
>
> Key: YARN-8522
> URL: https://issues.apache.org/jira/browse/YARN-8522
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Zian Chen
>Priority: Major
>
> Launch multiple streaming apps simultaneously. Sometimes one of the 
> applications fails with the stack trace below.
> {code}
> 18/07/02 07:14:32 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Call From xx.xx.xx.xx/xx.xx.xx.xx to 
> xx.xx.xx.xx:8032 failed on connection exception: java.net.ConnectException: 
> Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.submitApplication over null. Retrying 
> after sleeping for 3ms.
> 18/07/02 07:14:32 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: 
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, only one resource request with * is allowed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:502)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:389)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:320)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:645)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
>  on [rm2], so propagating back to caller.
> 18/07/02 07:14:32 INFO mapreduce.JobSubmitter: Cleaning up the staging area 
> /user/hrt_qa/.staging/job_1530515284077_0007
> 18/07/02 07:14:32 ERROR streaming.StreamJob: Error Launching job : 
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, only one resource request with * is allowed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:502)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:389)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:320)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:645)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> Streaming Command Failed!{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7556) Fair scheduler configuration should allow resource types in the minResources and maxResources properties

2018-07-12 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541999#comment-16541999
 ] 

Wangda Tan commented on YARN-7556:
--

Thanks [~snemeth]

> Fair scheduler configuration should allow resource types in the minResources 
> and maxResources properties
> 
>
> Key: YARN-7556
> URL: https://issues.apache.org/jira/browse/YARN-7556
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Affects Versions: 3.0.0-beta1
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-7556.001.patch, YARN-7556.002.patch, 
> YARN-7556.003.patch, YARN-7556.004.patch, YARN-7556.005.patch, 
> YARN-7556.006.patch, YARN-7556.007.patch, YARN-7556.008.patch, 
> YARN-7556.009.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7481) Gpu locality support for Better AI scheduling

2018-07-12 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541998#comment-16541998
 ] 

Wangda Tan commented on YARN-7481:
--

[~qinc...@microsoft.com], is there any detailed plan for how to better integrate 
this with resource types (see my previous comment)?

> Gpu locality support for Better AI scheduling
> -
>
> Key: YARN-7481
> URL: https://issues.apache.org/jira/browse/YARN-7481
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api, RM, yarn
>Affects Versions: 2.7.2
>Reporter: Chen Qingcha
>Priority: Major
> Attachments: GPU locality support for Job scheduling.pdf, 
> hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, 
> hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> We enhance Hadoop with GPU support for better AI job scheduling. 
> Currently, YARN-3926 also supports GPU scheduling, which treats GPUs as a 
> countable resource. 
> However, GPU placement is also very important for deep learning jobs to run 
> efficiently.
>  For example, a 2-GPU job running on GPUs {0,1} can be faster than one running 
> on GPUs {0,7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not.
>  We add support to Hadoop 2.7.2 to enable GPU locality scheduling, which 
> supports fine-grained GPU placement. 
> A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage 
> and locality information on a node (up to 64 GPUs per node). '1' means 
> available and '0' otherwise in the corresponding bit position.
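
As a rough illustration of the bitmap encoding described in the summary above 
(a sketch only; the class and method names below are hypothetical and not part 
of the attached patches):
{code}
// Hypothetical sketch: bit i == 1 means GPU i on the node is available,
// 0 means it is allocated (up to 64 GPUs per node).
public final class GpuBitmapUtil {

  /** Number of GPUs currently available on the node. */
  public static int availableGpus(long gpuBitmap) {
    return Long.bitCount(gpuBitmap);
  }

  /** Whether GPU index i (0..63) is available according to the bitmap. */
  public static boolean isAvailable(long gpuBitmap, int i) {
    return ((gpuBitmap >>> i) & 1L) == 1L;
  }

  /** Mark GPU index i as allocated by clearing its bit. */
  public static long allocate(long gpuBitmap, int i) {
    return gpuBitmap & ~(1L << i);
  }
}
{code}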



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM

2018-07-12 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541973#comment-16541973
 ] 

Wangda Tan commented on YARN-8511:
--

Thanks [~cheersyang] for the explanation. I completely missed YARN-4148.

So #1 no longer looks like a problem.

#2 still exists; I think we can handle it separately. 

Regarding the implementation, I see that the only purpose of the changes added to RMNode 
and its subclasses is to access the tags manager. Instead of doing this, would it be 
better to pass RMContext to SchedulerNode, so that SchedulerNode can directly access 
RMContext like SchedulerAppAttempt does?
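
To make that suggestion concrete, a minimal sketch of what passing RMContext into 
SchedulerNode could look like (the constructor shape and helper below are assumed 
purely for illustration; the real wiring may differ):
{code}
// Sketch only: let SchedulerNode reach the allocation tags manager through
// RMContext, similar to how SchedulerApplicationAttempt accesses RMContext.
public abstract class SchedulerNode {
  private final RMNode rmNode;
  private final RMContext rmContext;   // assumed new field

  public SchedulerNode(RMNode node, RMContext rmContext) {
    this.rmNode = node;
    this.rmContext = rmContext;
  }

  protected AllocationTagsManager getAllocationTagsManager() {
    return rmContext.getAllocationTagsManager();
  }
}
{code}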

 

> When AM releases a container, RM removes allocation tags before it is 
> released by NM
> 
>
> Key: YARN-8511
> URL: https://issues.apache.org/jira/browse/YARN-8511
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8511.001.patch, YARN-8511.002.patch
>
>
> Users leverage PC with allocation tags to avoid port conflicts between apps, 
> but we found that sometimes they still get port conflicts. This is a similar 
> issue to YARN-4148: the RM immediately removes allocation tags once 
> AM#allocate asks to release a container, while the container on the NM has some 
> delay until it is actually killed and the port is released. We should let the RM 
> remove allocation tags AFTER the NM confirms the containers are released. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8480) Add boolean option for resources

2018-07-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541156#comment-16541156
 ] 

Wangda Tan commented on YARN-8480:
--

[~cheersyang], 

What [~templedf] / [~snemeth] proposed is to make the resource behave like a label. For 
example, a node can report that it has resource: memory=2048,vcore=3,has_java=true, 
and an AM can request a resource with ...has_java=true. Allocating a container on the 
node will not update the node's available resource; as long as the node has enough 
memory and vcores, it can allocate more than one container with a has_java=true resource 
request. This is an alternative way to represent node labels using resource 
types.
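
A hypothetical illustration of these label-like semantics (has_java is not an 
existing resource type, and the helper below is not a real API; it only sketches 
the matching behavior described above):
{code}
// Sketch only: the countable part is checked by subtraction-style fitting,
// while boolean "resources" are a pure match that never gets deducted.
boolean fits(Resource nodeAvailable, Resource request,
    Map<String, Boolean> nodeBooleanResources,
    Map<String, Boolean> requestedBooleanResources) {
  if (!Resources.fitsIn(request, nodeAvailable)) {     // memory, vcores, ...
    return false;
  }
  for (Map.Entry<String, Boolean> e : requestedBooleanResources.entrySet()) {
    if (e.getValue()
        && !Boolean.TRUE.equals(nodeBooleanResources.get(e.getKey()))) {
      return false;                                    // e.g. has_java=true
    }
  }
  return true;  // nothing is subtracted for the boolean entries
}
{code}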

> Add boolean option for resources
> 
>
> Key: YARN-8480
> URL: https://issues.apache.org/jira/browse/YARN-8480
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8480.001.patch, YARN-8480.002.patch
>
>
> Make it possible to define a resource with a boolean value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8505) AMLimit and userAMLimit check should be skipped for unmanaged AM

2018-07-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541155#comment-16541155
 ] 

Wangda Tan commented on YARN-8505:
--

[~bibinchundatt]  / [~Tao Yang] / [~cheersyang], 

I would prefer to add a new limit to control #maximum-concurrently-activated-apps 
within a queue, and to skip updating the AMLimit when an unmanaged AM is launched.
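
A minimal sketch of the "skip AMLimit for unmanaged AMs" part (the isUnmanagedAM() 
accessor and the exact call site in LeafQueue#activateApplications are assumptions 
for illustration only):
{code}
// Sketch only: a guard LeafQueue#activateApplications could consult so that
// unmanaged AMs bypass amLimit / userAMLimit accounting, since their AM
// container does not consume cluster resources.
private boolean countsTowardsAmLimit(FiCaSchedulerApp app) {
  // In the real code the flag would likely come from the application's
  // ApplicationSubmissionContext.
  return !app.isUnmanagedAM();
}
{code}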

> AMLimit and userAMLimit check should be skipped for unmanaged AM
> 
>
> Key: YARN-8505
> URL: https://issues.apache.org/jira/browse/YARN-8505
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 2.9.2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8505.001.patch
>
>
> The AMLimit and userAMLimit checks in LeafQueue#activateApplications should be 
> skipped for unmanaged AMs, whose resources are not taken from the YARN cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM

2018-07-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541153#comment-16541153
 ] 

Wangda Tan commented on YARN-8511:
--

[~cheersyang], 

Thanks for reporting and working on this issue; it is a valid issue, and we have seen 
it in other places as well.  

For example, with exclusively used resource types like GPU, we could allocate a 
container to a node before the previous container has completed. Memory has the 
same issue.

I'm not sure if your patch works, since {{SchedulerNode#releaseContainer}} 
can be invoked in scenarios such as when an AM releases a container via the 
allocate call, or when an app attempt finishes. The scheduler could still place a new 
container on a node before the old one is terminated by the NM.

Instead, I think we should have a hook to handle such events inside 
{{AbstractYarnScheduler#nodeUpdate}}, as sketched below.
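
A rough sketch of the kind of hook meant here (the method names and the way 
completed containers are pulled from the heartbeat are assumptions; this is not 
an actual patch):
{code}
// Sketch only: react to containers the NM has confirmed as completed during
// the heartbeat, and only then release allocation tags / deduct exclusively
// used resources such as GPUs or ports.
protected void nodeUpdate(RMNode nm) {
  // ... existing heartbeat handling ...

  for (ContainerStatus completed : completedContainersFrom(nm)) {
    onContainerActuallyFinished(nm.getNodeID(), completed.getContainerId());
  }
}

// Hypothetical helper: extract completed-container statuses from the node's
// pulled container updates.
protected abstract List<ContainerStatus> completedContainersFrom(RMNode nm);

// Hypothetical hook: the place where allocation-tag removal and resource
// deduction would live, now that the container is really gone on the NM.
protected abstract void onContainerActuallyFinished(NodeId nodeId,
    ContainerId containerId);
{code}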

However, we still have two issues: 
1) If we deduct resources after the container actually finishes, it is possible that the 
scheduler application attempt has already finished. In that case, the scheduler is not 
able to deduct resources (the scheduler relies on SchedulerApplicationAttempt to 
locate the RMContainer). I'm not sure whether this impacts allocation tags or not.
2) It is also possible that the NM spends too much time terminating containers; 
in our docker-in-docker setup, we observed the OS taking several minutes to 
terminate a container. The NM could also report a container as DONE before it is 
actually terminated (another bug). YARN-8508 is caused by this issue.
 
 

> When AM releases a container, RM removes allocation tags before it is 
> released by NM
> 
>
> Key: YARN-8511
> URL: https://issues.apache.org/jira/browse/YARN-8511
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8511.001.patch, YARN-8511.002.patch
>
>
> Users leverage PC with allocation tags to avoid port conflicts between apps, 
> but we found that sometimes they still get port conflicts. This is a similar 
> issue to YARN-4148: the RM immediately removes allocation tags once 
> AM#allocate asks to release a container, while the container on the NM has some 
> delay until it is actually killed and the port is released. We should let the RM 
> remove allocation tags AFTER the NM confirms the containers are released. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8480) Add boolean option for resources

2018-07-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540897#comment-16540897
 ] 

Wangda Tan commented on YARN-8480:
--

btw, [~templedf], I know you ran into some trouble supporting the node partition 
concept in FS (YARN-2497). Node attributes should be much easier to support 
because there is no resource sharing required under node attributes; everything 
comes down to FCFS. Basically you should only need to add an if check before 
deciding to allocate container X on node Y. 
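
Something along these lines should be all that is needed (a sketch with assumed 
names; the real node-attribute support under YARN-3409 has richer matching 
expressions than this exact-match check):
{code}
// Sketch only: a plain FCFS-style guard before placing container X on node Y.
private boolean nodeSatisfiesAttributes(Map<String, String> required,
    Map<String, String> nodeAttributes) {
  for (Map.Entry<String, String> e : required.entrySet()) {
    if (!e.getValue().equals(nodeAttributes.get(e.getKey()))) {
      return false;  // node lacks a requested attribute value, skip this node
    }
  }
  return true;       // purely a filter; no resource sharing or accounting
}
{code}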

> Add boolean option for resources
> 
>
> Key: YARN-8480
> URL: https://issues.apache.org/jira/browse/YARN-8480
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8480.001.patch, YARN-8480.002.patch
>
>
> Make it possible to define a resource with a boolean value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8480) Add boolean option for resources

2018-07-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540895#comment-16540895
 ] 

Wangda Tan commented on YARN-8480:
--

[~templedf], 

If this only changed the Fair Scheduler, I would be fine with that. However, this touches 
Resource/ResourceInformation/RMAdminCLI/ResourceCalculator/the CapacityScheduler 
implementation, etc. 

If you take a closer look at the YARN-3409 APIs, it should not be hard at 
all. It should definitely be cheaper than adding a new resource type.

> Add boolean option for resources
> 
>
> Key: YARN-8480
> URL: https://issues.apache.org/jira/browse/YARN-8480
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8480.001.patch, YARN-8480.002.patch
>
>
> Make it possible to define a resource with a boolean value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7974) Allow updating application tracking url after registration

2018-07-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540879#comment-16540879
 ] 

Wangda Tan commented on YARN-7974:
--

[~oliverhuh...@gmail.com], [~jhung],  

Thanks for updating the patch; in general, it looks good. 

Several minor comments: 

1) 

{code}
public abstract void updateTrackingUrl(String trackingUrl); 
{code}

This should have a default (maybe empty) implementation. Given that AMRMClient/Async are 
public/stable APIs, we don't want to break the build for any app that extends these 
classes. 

2) The updateTrackingUrl should be marked as public/unstable. 

3) Should we explicitly compare the content of the new tracking URL with the old 
tracking URL? Right now we only check != null, which may not be enough. 

{code} 
// Update tracking url if changed and save it to state store
String newTrackingUrl = statusUpdateEvent.getTrackingUrl();
if (newTrackingUrl != null) {
{code}
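
For point 3, a minimal sketch of the stricter check (the originalTrackingUrl field 
name is assumed from the surrounding class, and the state-store call is elided):
{code}
// Sketch only: update and persist the tracking URL only when it actually
// changed, instead of on every non-null heartbeat value.
String newTrackingUrl = statusUpdateEvent.getTrackingUrl();
if (newTrackingUrl != null
    && !newTrackingUrl.equals(this.originalTrackingUrl)) {
  this.originalTrackingUrl = newTrackingUrl;
  // ... save the updated attempt state to the RM state store ...
}
{code}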


> Allow updating application tracking url after registration
> --
>
> Key: YARN-7974
> URL: https://issues.apache.org/jira/browse/YARN-7974
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-7974.001.patch, YARN-7974.002.patch, 
> YARN-7974.003.patch, YARN-7974.004.patch, YARN-7974.005.patch
>
>
> Normally an application's tracking url is set on AM registration. We have a 
> use case for updating the tracking url after registration (e.g. the UI is 
> hosted on one of the containers).
> The approach is for the AM to update the tracking URL on heartbeat to the RM, and to 
> add a related API in AMRMClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling

2018-07-11 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7481:
-
Fix Version/s: (was: 2.7.2)

> Gpu locality support for Better AI scheduling
> -
>
> Key: YARN-7481
> URL: https://issues.apache.org/jira/browse/YARN-7481
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api, RM, yarn
>Affects Versions: 2.7.2
>Reporter: Chen Qingcha
>Priority: Major
> Attachments: GPU locality support for Job scheduling.pdf, 
> hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, 
> hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> We enhance Hadoop with GPU support for better AI job scheduling. 
> Currently, YARN-3926 also supports GPU scheduling, which treats GPUs as a 
> countable resource. 
> However, GPU placement is also very important for deep learning jobs to run 
> efficiently.
>  For example, a 2-GPU job running on GPUs {0,1} can be faster than one running 
> on GPUs {0,7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not.
>  We add support to Hadoop 2.7.2 to enable GPU locality scheduling, which 
> supports fine-grained GPU placement. 
> A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage 
> and locality information on a node (up to 64 GPUs per node). '1' means 
> available and '0' otherwise in the corresponding bit position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8480) Add boolean option for resources

2018-07-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540810#comment-16540810
 ] 

Wangda Tan commented on YARN-8480:
--

[~snemeth]/[~templedf]. 

To me, we should move this to node attributes: we already have most things ready 
under the YARN-3409 branch. I don't think we should duplicate the same things 
in this JIRA. 

For stuff added to the Resource class, it does not necessarily have to be countable 
(definition of countable according to the YARN-3926 design doc):
{code} 
When we speak of countable resources, we refer to resource­types where the 
allocation and release of resources is a simple subtraction and addition 
operation.
{code}
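
For contrast, a quick sketch of what "countable" means in practice, using the 
existing Resources utility methods (illustration only, not tied to any patch here):
{code}
// Countable semantics: allocation and release are plain subtraction/addition.
Resource nodeAvailable = Resource.newInstance(8192, 8);  // 8192 MB, 8 vcores
Resource request = Resource.newInstance(2048, 2);

Resources.subtractFrom(nodeAvailable, request);  // allocate -> 6144 MB, 6 vcores
Resources.addTo(nodeAvailable, request);         // release  -> 8192 MB, 8 vcores

// A boolean "has_java=true" entry has no such arithmetic, which is why it fits
// node attributes better than the Resource class.
{code}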

I'm supportive of adding new resource types like set, range, or hierarchical (such as the 
disk resource isolation mentioned by [~cheersyang]). Pure label 
resource types should go to node attributes / node partitions.

> Add boolean option for resources
> 
>
> Key: YARN-8480
> URL: https://issues.apache.org/jira/browse/YARN-8480
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8480.001.patch, YARN-8480.002.patch
>
>
> Make it possible to define a resource with a boolean value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7481) Gpu locality support for Better AI scheduling

2018-07-11 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540501#comment-16540501
 ] 

Wangda Tan commented on YARN-7481:
--

[~qinc...@microsoft.com], I saw that you have kept updating patches over the last 
several months. Given that the proposed approach conflicts with the community's existing 
solution, are there any plans to merge this with the community solution?

> Gpu locality support for Better AI scheduling
> -
>
> Key: YARN-7481
> URL: https://issues.apache.org/jira/browse/YARN-7481
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api, RM, yarn
>Affects Versions: 2.7.2
>Reporter: Chen Qingcha
>Priority: Major
> Fix For: 2.7.2
>
> Attachments: GPU locality support for Job scheduling.pdf, 
> hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, 
> hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> We enhance Hadoop with GPU support for better AI job scheduling. 
> Currently, YARN-3926 also supports GPU scheduling, which treats GPUs as a 
> countable resource. 
> However, GPU placement is also very important for deep learning jobs to run 
> efficiently.
>  For example, a 2-GPU job running on GPUs {0,1} can be faster than one running 
> on GPUs {0,7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not.
>  We add support to Hadoop 2.7.2 to enable GPU locality scheduling, which 
> supports fine-grained GPU placement. 
> A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage 
> and locality information on a node (up to 64 GPUs per node). '1' means 
> available and '0' otherwise in the corresponding bit position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards

2018-07-10 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539027#comment-16539027
 ] 

Wangda Tan commented on YARN-8512:
--

Patch LGTM as well, thanks [~rohithsharma] for the fix. 

> ATSv2 entities are not published to HBase from second attempt onwards
> -
>
> Key: YARN-8512
> URL: https://issues.apache.org/jira/browse/YARN-8512
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3
>Reporter: Yesha Vora
>Assignee: Rohith Sharma K S
>Priority: Major
> Attachments: YARN-8512.01.patch, YARN-8512.02.patch, 
> YARN-8512.03.patch
>
>
> It is observed that the 1st attempt's master container dies and the 2nd attempt's 
> master container is launched on an NM where old containers are running but no 
> master container is. 
> ||Attempt||NM1||NM2||Action||
> |attempt-1|master container i.e container-1-1|container-1-2|master container 
> died|
> |attempt-2|NA|container-1-2 and master container container-2-1|NA|
> In the above scenario, the NM doesn't identify the flowContext and will log the 
> message below:
> {noformat}
> 2018-07-10 00:44:38,285 WARN  storage.HBaseTimelineWriterImpl 
> (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: 
> flowName=null appId=application_1531175172425_0001 userId=hbase 
> clusterId=yarn-cluster . Not proceeding with writing to hbase
> 2018-07-10 00:44:38,560 WARN  storage.HBaseTimelineWriterImpl 
> (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: 
> flowName=null appId=application_1531175172425_0001 userId=hbase 
> clusterId=yarn-cluster . Not proceeding with writing to hbase
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied

2018-07-09 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537854#comment-16537854
 ] 

Wangda Tan commented on YARN-8509:
--

Also downgraded the priority to Major and removed 3.1.1 from the target version.

> Fix UserLimit calculation for preemption to balance scenario after queue 
> satisfied  
> 
>
> Key: YARN-8509
> URL: https://issues.apache.org/jira/browse/YARN-8509
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total 
> pending resource based on user-limit percent and user-limit factor which will 
> cap pending resource for each user to the minimum of user-limit pending and 
> actual pending. This will prevent queue from taking more pending resource and 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied

2018-07-09 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537853#comment-16537853
 ] 

Wangda Tan commented on YARN-8509:
--

[~Zian Chen], the fix version is only set once the patch is committed; "target 
version" should be set for new issues.

> Fix UserLimit calculation for preemption to balance scenario after queue 
> satisfied  
> 
>
> Key: YARN-8509
> URL: https://issues.apache.org/jira/browse/YARN-8509
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Critical
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total 
> pending resource based on user-limit percent and user-limit factor which will 
> cap pending resource for each user to the minimum of user-limit pending and 
> actual pending. This will prevent queue from taking more pending resource and 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied

2018-07-09 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8509:
-
Priority: Major  (was: Critical)

> Fix UserLimit calculation for preemption to balance scenario after queue 
> satisfied  
> 
>
> Key: YARN-8509
> URL: https://issues.apache.org/jira/browse/YARN-8509
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total 
> pending resource based on user-limit percent and user-limit factor which will 
> cap pending resource for each user to the minimum of user-limit pending and 
> actual pending. This will prevent queue from taking more pending resource and 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied

2018-07-09 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8509:
-
Target Version/s: 3.2.0, 3.1.2  (was: 3.2.0, 3.1.1)

> Fix UserLimit calculation for preemption to balance scenario after queue 
> satisfied  
> 
>
> Key: YARN-8509
> URL: https://issues.apache.org/jira/browse/YARN-8509
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total 
> pending resource based on user-limit percent and user-limit factor which will 
> cap pending resource for each user to the minimum of user-limit pending and 
> actual pending. This will prevent queue from taking more pending resource and 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied

2018-07-09 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8509:
-
Target Version/s: 3.2.0, 3.1.1

> Fix UserLimit calculation for preemption to balance scenario after queue 
> satisfied  
> 
>
> Key: YARN-8509
> URL: https://issues.apache.org/jira/browse/YARN-8509
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Critical
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total 
> pending resource based on user-limit percent and user-limit factor which will 
> cap pending resource for each user to the minimum of user-limit pending and 
> actual pending. This will prevent queue from taking more pending resource and 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied

2018-07-09 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8509:
-
Fix Version/s: (was: 3.1.1)
   (was: 3.2.0)

> Fix UserLimit calculation for preemption to balance scenario after queue 
> satisfied  
> 
>
> Key: YARN-8509
> URL: https://issues.apache.org/jira/browse/YARN-8509
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Critical
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total 
> pending resource based on user-limit percent and user-limit factor which will 
> cap pending resource for each user to the minimum of user-limit pending and 
> actual pending. This will prevent queue from taking more pending resource and 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe

2018-07-09 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537244#comment-16537244
 ] 

Wangda Tan commented on YARN-8506:
--

Rebased to latest trunk. (002)

> Make GetApplicationsRequestPBImpl thread safe
> -
>
> Key: YARN-8506
> URL: https://issues.apache.org/jira/browse/YARN-8506
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8506.001.patch, YARN-8506.002.patch
>
>
> When GetApplicationsRequestPBImpl is used in a multi-threaded environment, 
> exceptions like the one below will occur because we don't protect write ops.
> {code}
> java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at java.util.ArrayList.addAll(ArrayList.java:613)
>   at 
> com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132)
>   at 
> com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327)
>   at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69)
> {code}
> We need to make GetApplicationsRequestPBImpl thread safe. We saw this issue 
> happen frequently when RequestHedgingRMFailoverProxyProvider is being used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe

2018-07-09 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8506:
-
Attachment: YARN-8506.002.patch

> Make GetApplicationsRequestPBImpl thread safe
> -
>
> Key: YARN-8506
> URL: https://issues.apache.org/jira/browse/YARN-8506
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8506.001.patch, YARN-8506.002.patch
>
>
> When GetApplicationsRequestPBImpl is used in a multi-threaded environment, 
> exceptions like the one below will occur because we don't protect write ops.
> {code}
> java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at java.util.ArrayList.addAll(ArrayList.java:613)
>   at 
> com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132)
>   at 
> com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327)
>   at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69)
> {code}
> We need to make GetApplicationsRequestPBImpl thread safe. We saw this issue 
> happen frequently when RequestHedgingRMFailoverProxyProvider is being used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe

2018-07-09 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8506:
-
Priority: Critical  (was: Blocker)

> Make GetApplicationsRequestPBImpl thread safe
> -
>
> Key: YARN-8506
> URL: https://issues.apache.org/jira/browse/YARN-8506
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8506.001.patch
>
>
> When GetApplicationsRequestPBImpl is used in a multi-threaded environment, 
> exceptions like the one below will occur because we don't protect write ops.
> {code}
> java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at java.util.ArrayList.addAll(ArrayList.java:613)
>   at 
> com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132)
>   at 
> com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327)
>   at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69)
> {code}
> We need to make GetApplicationsRequestPBImpl thread safe. We saw this issue 
> happen frequently when RequestHedgingRMFailoverProxyProvider is being used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe

2018-07-09 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8506:
-
Attachment: YARN-8506.001.patch

> Make GetApplicationsRequestPBImpl thread safe
> -
>
> Key: YARN-8506
> URL: https://issues.apache.org/jira/browse/YARN-8506
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-8506.001.patch
>
>
> When GetApplicationsRequestPBImpl is used in a multi-threaded environment, 
> exceptions like the one below will occur because we don't protect write ops.
> {code}
> java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at java.util.ArrayList.addAll(ArrayList.java:613)
>   at 
> com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132)
>   at 
> com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327)
>   at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69)
> {code}
> We need to make GetApplicationsRequestPBImpl thread safe. We saw this issue 
> happen frequently when RequestHedgingRMFailoverProxyProvider is being used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe

2018-07-09 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8506:


 Summary: Make GetApplicationsRequestPBImpl thread safe
 Key: YARN-8506
 URL: https://issues.apache.org/jira/browse/YARN-8506
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Wangda Tan
Assignee: Wangda Tan


When GetApplicationsRequestPBImpl is used in a multi-threaded environment, 
exceptions like the one below will occur because we don't protect write ops.

{code}
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at java.util.ArrayList.addAll(ArrayList.java:613)
at 
com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132)
at 
com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123)
at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327)
at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69)
{code}

We need to make GetApplicationsRequestPBImpl thread safe. We saw this issue 
happen frequently when RequestHedgingRMFailoverProxyProvider is being used.
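
One possible shape of the fix, sketched for illustration only (the actual patch may 
use finer-grained locking; the fields and helpers shown follow the usual PBImpl 
pattern and the rest of the class is elided):
{code}
// Sketch only: serialize the read-modify-write of the protobuf builder so
// concurrent getProto()/setter calls cannot corrupt the underlying lists.
public synchronized GetApplicationsRequestProto getProto() {
  mergeLocalToProto();            // merges local fields under the lock
  proto = viaProto ? proto : builder.build();
  viaProto = true;
  return proto;
}

public synchronized void setApplicationTags(Set<String> tags) {
  maybeInitBuilder();
  this.applicationTags = tags;    // mutation also happens under the lock
}

// ... the remaining getters/setters are made synchronized the same way ...
{code}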



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


