[jira] [Updated] (YARN-8508) On NodeManager container gets cleaned up before its pid file is created
[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8508: - Fix Version/s: (was: 3.1.2) 3.1.1 > On NodeManager container gets cleaned up before its pid file is created > --- > > Key: YARN-8508 > URL: https://issues.apache.org/jira/browse/YARN-8508 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sumana Sathish >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8505.001.patch, YARN-8505.002.patch > > > GPU failed to release even though the container using it is being killed > {Code} > 2018-07-06 05:22:26,201 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,250 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,251 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1530854311763_0006 transitioned from RUNNING to > FINISHING_CONTAINERS_WAIT > 2018-07-06 05:22:26,251 INFO launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,358 INFO launcher.ContainerLaunch > (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for > container_e20_1530854311763_0006_01_02. Waited for 5000 ms. > 2018-07-06 05:22:31,358 WARN launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid > file created container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,359 INFO launcher.ContainerLaunch > (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, > but docker container request detected. 
Attempting to reap container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,494 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/launch_container.sh > 2018-07-06 05:22:31,500 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/container_tokens > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,512 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:31,513 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:38,955 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0007_01_02 transitioned from NEW to SCHEDULED > {Code} > New container requesting for GPU fails to launch > {code} > 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - > ResourceHandlerChain.preStart() failed! > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: > Failed to find enough GPUs, > requestor=container_e20_1530854311763_0007_01_02, #RequestedGPUs=2, > #availableGpus=1 > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75) >
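The race described above is that cleanup starts before the container's pid file exists, so the NodeManager gives up after 5000 ms and resources such as assigned GPUs are never returned. Below is a minimal, JDK-only sketch of the poll-with-deadline pattern the log refers to; the class name, pid-file location, and fallback behaviour are hypothetical illustrations, not the actual ContainerLaunch code.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Sketch of "wait for the pid file, then fall back to a no-pid cleanup path".
 * All names and paths here are hypothetical.
 */
public class PidFileWaitSketch {

  /** Poll for the pid file, returning the pid or null if it never appears. */
  static String waitForPid(Path pidFile, long timeoutMs)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (Files.exists(pidFile)) {
        return Files.readString(pidFile).trim();
      }
      Thread.sleep(100); // back off briefly between checks
    }
    return null; // mirrors "Could not get pid ... Waited for 5000 ms."
  }

  public static void main(String[] args) throws Exception {
    Path pidFile = Paths.get("/tmp/container_example.pid"); // hypothetical location
    String pid = waitForPid(pidFile, 5000);
    if (pid == null) {
      // Without a pid the process cannot be signalled; any resources bound to the
      // container (for example assigned GPUs) still have to be released explicitly,
      // otherwise they leak exactly as the log above shows.
      System.out.println("pid file never created; falling back to no-pid reap path");
    } else {
      System.out.println("would signal pid " + pid);
    }
  }
}
{code}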
[jira] [Commented] (YARN-8508) On NodeManager container gets cleaned up before its pid file is created
[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564256#comment-16564256 ] Wangda Tan commented on YARN-8508: -- Committed to branch-3.1.1, thanks [~csingh]! > On NodeManager container gets cleaned up before its pid file is created > --- > > Key: YARN-8508 > URL: https://issues.apache.org/jira/browse/YARN-8508 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sumana Sathish >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8505.001.patch, YARN-8505.002.patch > > > GPU failed to release even though the container using it is being killed > {Code} > 2018-07-06 05:22:26,201 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,250 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,251 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1530854311763_0006 transitioned from RUNNING to > FINISHING_CONTAINERS_WAIT > 2018-07-06 05:22:26,251 INFO launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,358 INFO launcher.ContainerLaunch > (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for > container_e20_1530854311763_0006_01_02. Waited for 5000 ms. > 2018-07-06 05:22:31,358 WARN launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid > file created container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,359 INFO launcher.ContainerLaunch > (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, > but docker container request detected. 
Attempting to reap container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,494 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/launch_container.sh > 2018-07-06 05:22:31,500 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/container_tokens > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,512 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:31,513 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:38,955 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0007_01_02 transitioned from NEW to SCHEDULED > {Code} > New container requesting for GPU fails to launch > {code} > 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - > ResourceHandlerChain.preStart() failed! > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: > Failed to find enough GPUs, > requestor=container_e20_1530854311763_0007_01_02, #RequestedGPUs=2, > #availableGpus=1 > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preSt
[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling
[ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564254#comment-16564254 ] Wangda Tan commented on YARN-8546: -- Committed to branch-3.1.1, thanks [~Tao Yang]/[~cheersyang] > Resource leak caused by a reserved container being released more than once > under async scheduling > - > > Key: YARN-8546 > URL: https://issues.apache.org/jira/browse/YARN-8546 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Tao Yang >Priority: Major > Labels: global-scheduling > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8546.001.patch > > > I was able to reproduce this issue by starting a job, and this job keeps > requesting containers until it uses up cluster available resource. My cluster > has 70200 vcores, and each task it applies for 100 vcores, I was expecting > total 702 containers can be allocated but eventually there was only 701. The > last container could not get allocated because queue used resource is updated > to be more than 100%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8301: - Fix Version/s: (was: 3.1.2) 3.1.1 > Yarn Service Upgrade: Add documentation > --- > > Key: YARN-8301 > URL: https://issues.apache.org/jira/browse/YARN-8301 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8301.001.patch, YARN-8301.002.patch, > YARN-8301.003.patch, YARN-8301.004.patch, YARN-8301.005.patch, > YARN-8301.006.patch, YARN-8301.007.patch > > > Add documentation for yarn service upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8528) Final states in ContainerAllocation might be modified externally causing unexpected allocation results
[ https://issues.apache.org/jira/browse/YARN-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564253#comment-16564253 ] Wangda Tan commented on YARN-8528: -- Committed to branch-3.1.1, thanks [~cheersyang] > Final states in ContainerAllocation might be modified externally causing > unexpected allocation results > -- > > Key: YARN-8528 > URL: https://issues.apache.org/jira/browse/YARN-8528 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Xintong Song >Assignee: Xintong Song >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8528.001.patch > > > ContainerAllocation.LOCALITY_SKIPPED is final static, and its .state should > always be AllocationState.LOCALITY_SKIPPED. However, this variable is public > and is accidentally changed to AllocationState.APP_SKIPPED in > RegularContainerAllocator under certain conditions. Once that happens, all > following LOCALITY_SKIPPED situations will be treated as APP_SKIPPED. > Similar risks exist for > ContainerAllocation.PRIORITY_SKIPPED/APP_SKIPPED/QUEUE_SKIPPED. > ContainerAllocation.state should be private and should not be changed. If > changes are needed, a new ContainerAllocation should be created. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
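The description above is essentially a shared-mutable-state bug: a public field on a static constant is mutated by one code path and silently changes what the constant means for every later caller. A small, self-contained Java sketch of the pitfall and of the immutable shape the description recommends (the names are illustrative, not the real ContainerAllocation API):
{code:java}
public class SharedResultSketch {

  enum AllocationState { LOCALITY_SKIPPED, APP_SKIPPED }

  // Buggy shape: a public mutable field on a shared constant.
  static class MutableResult {
    public AllocationState state;
    MutableResult(AllocationState state) { this.state = state; }
  }
  static final MutableResult LOCALITY_SKIPPED = new MutableResult(AllocationState.LOCALITY_SKIPPED);

  // Fixed shape: the state is private and final; callers derive a new instance instead of mutating.
  static final class ImmutableResult {
    private final AllocationState state;
    ImmutableResult(AllocationState state) { this.state = state; }
    AllocationState getState() { return state; }
    ImmutableResult withState(AllocationState newState) { return new ImmutableResult(newState); }
  }
  static final ImmutableResult LOCALITY_SKIPPED_OK = new ImmutableResult(AllocationState.LOCALITY_SKIPPED);

  public static void main(String[] args) {
    // One caller "adjusts" the shared constant...
    LOCALITY_SKIPPED.state = AllocationState.APP_SKIPPED;
    // ...and every later LOCALITY_SKIPPED result is now misreported.
    System.out.println("shared mutable constant now reports: " + LOCALITY_SKIPPED.state);

    // With the immutable shape the constant cannot drift; callers get a fresh object.
    ImmutableResult adjusted = LOCALITY_SKIPPED_OK.withState(AllocationState.APP_SKIPPED);
    System.out.println("immutable constant still reports: " + LOCALITY_SKIPPED_OK.getState()
        + ", caller-local copy reports: " + adjusted.getState());
  }
}
{code}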
[jira] [Updated] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling
[ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8546: - Fix Version/s: (was: 3.1.2) 3.1.1 > Resource leak caused by a reserved container being released more than once > under async scheduling > - > > Key: YARN-8546 > URL: https://issues.apache.org/jira/browse/YARN-8546 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Tao Yang >Priority: Major > Labels: global-scheduling > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8546.001.patch > > > I was able to reproduce this issue by starting a job, and this job keeps > requesting containers until it uses up cluster available resource. My cluster > has 70200 vcores, and each task it applies for 100 vcores, I was expecting > total 702 containers can be allocated but eventually there was only 701. The > last container could not get allocated because queue used resource is updated > to be more than 100%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8528) Final states in ContainerAllocation might be modified externally causing unexpected allocation results
[ https://issues.apache.org/jira/browse/YARN-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8528: - Fix Version/s: (was: 3.1.2) 3.1.1 > Final states in ContainerAllocation might be modified externally causing > unexpected allocation results > -- > > Key: YARN-8528 > URL: https://issues.apache.org/jira/browse/YARN-8528 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Xintong Song >Assignee: Xintong Song >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8528.001.patch > > > ContainerAllocation.LOCALITY_SKIPPED is final static, and its .state should > always be AllocationState.LOCALITY_SKIPPED. However, this variable is public > and is accidentally changed to AllocationState.APP_SKIPPED in > RegularContainerAllocator under certain conditions. Once that happens, all > following LOCALITY_SKIPPED situations will be treated as APP_SKIPPED. > Similar risks exist for > ContainerAllocation.PRIORITY_SKIPPED/APP_SKIPPED/QUEUE_SKIPPED. > ContainerAllocation.state should be private and should not be changed. If > changes are needed, a new ContainerAllocation should be created. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8559) Expose mutable-conf scheduler's configuration in RM /scheduler-conf endpoint
[ https://issues.apache.org/jira/browse/YARN-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564213#comment-16564213 ] Wangda Tan commented on YARN-8559: -- Thanks [~cheersyang]. Some suggestions: 1) Instead of creating a new DAO, can we use the existing logic similar to org.apache.hadoop.conf.ConfServlet? 2) Also, it might be better to restrict access to this REST endpoint to admins only, since it includes sensitive information such as ACLs. For general scheduler information, users can still access /scheduler. > Expose mutable-conf scheduler's configuration in RM /scheduler-conf endpoint > > > Key: YARN-8559 > URL: https://issues.apache.org/jira/browse/YARN-8559 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Anna Savarin >Assignee: Weiwei Yang >Priority: Major > Attachments: YARN-8559.001.patch, YARN-8559.002.patch, > YARN-8559.003.patch > > > All Hadoop services provide a set of common endpoints (/stacks, /logLevel, > /metrics, /jmx, /conf). In the case of the Resource Manager, part of the > configuration comes from the scheduler being used. Currently, these > configuration key/values are not exposed through the /conf endpoint, thereby > revealing an incomplete configuration picture. > Make an improvement and expose the scheduling configuration info through the > RM's /conf endpoint. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
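The second suggestion above amounts to gating the configuration endpoint on an admin check before any sensitive keys are returned. A hedged, JDK-only sketch of that idea follows; the admin set, the exception-based rejection, and the method names are assumptions for illustration, not the YARN web service implementation.
{code:java}
import java.util.Properties;
import java.util.Set;

/** Illustrative admin gate in front of a configuration read; not the RM code. */
public class ConfEndpointAclSketch {

  private final Set<String> adminUsers;

  ConfEndpointAclSketch(Set<String> adminUsers) {
    this.adminUsers = adminUsers;
  }

  /** Return the scheduler configuration, refusing non-admin callers. */
  Properties getSchedulerConf(String callerUser, Properties schedulerConf) {
    if (!adminUsers.contains(callerUser)) {
      // In a real REST endpoint this would map to a 403 Forbidden response.
      throw new SecurityException("User " + callerUser
          + " is not permitted to read scheduler configuration");
    }
    return schedulerConf;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("yarn.scheduler.capacity.root.acl_submit_applications", "admins");

    ConfEndpointAclSketch endpoint = new ConfEndpointAclSketch(Set.of("yarn"));
    System.out.println(endpoint.getSchedulerConf("yarn", conf)); // allowed
    try {
      endpoint.getSchedulerConf("guest", conf);                  // rejected
    } catch (SecurityException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
{code}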
[jira] [Commented] (YARN-8606) Opportunistic scheduling doesn't work after failover
[ https://issues.apache.org/jira/browse/YARN-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564089#comment-16564089 ] Wangda Tan commented on YARN-8606: -- [~bibinchundatt], is this a recent regression? If yes, which JIRA broke the use case? > Opportunistic scheduling doesn't work after failover > --- > > Key: YARN-8606 > URL: https://issues.apache.org/jira/browse/YARN-8606 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Blocker > Attachments: YARN-8606.001.patch > > > The EventDispatcher for opportunistic scheduling is added to the RM composite service > and not to the RMActiveService composite service, causing the dispatcher to be started > only once on RM restart. > Issue credits: Rakesh -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
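The description above hinges on which composite service owns the dispatcher: children of the always-on parent are started once for the process lifetime, while children of the active-services composite are stopped and restarted on every failover. The toy sketch below models that difference with plain JDK classes; it is not the YARN service framework, and the service names are invented.
{code:java}
import java.util.ArrayList;
import java.util.List;

public class CompositeServiceSketch {

  interface Service { void start(); void stop(); }

  static class Composite implements Service {
    private final String name;
    private final List<Service> children = new ArrayList<>();
    Composite(String name) { this.name = name; }
    void add(Service s) { children.add(s); }
    public void start() { System.out.println(name + " starting"); children.forEach(Service::start); }
    public void stop() { System.out.println(name + " stopping"); children.forEach(Service::stop); }
  }

  static Service named(String name) {
    return new Service() {
      public void start() { System.out.println("  " + name + " started"); }
      public void stop() { System.out.println("  " + name + " stopped"); }
    };
  }

  public static void main(String[] args) {
    Composite rm = new Composite("RM (always on)");
    Composite activeServices = new Composite("RMActiveServices");
    rm.add(activeServices);

    // Bug shape: dispatcher registered with the always-on parent.
    rm.add(named("opportunistic-dispatcher (parent)"));
    // Fix shape: dispatcher registered with the active composite so it restarts on failover.
    activeServices.add(named("opportunistic-dispatcher (active)"));

    rm.start();
    // Simulated failover: only the active composite is cycled; the parent keeps running,
    // so the dispatcher registered with the parent is never restarted.
    activeServices.stop();
    activeServices.start();
  }
}
{code}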
[jira] [Commented] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied
[ https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562483#comment-16562483 ] Wangda Tan commented on YARN-8509: -- [~eepayne], if you have some bandwidth, could you help check this patch? > Fix UserLimit calculation for preemption to balance scenario after queue > satisfied > > > Key: YARN-8509 > URL: https://issues.apache.org/jira/browse/YARN-8509 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Attachments: YARN-8509.001.patch, YARN-8509.002.patch > > > In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total > pending resource based on user-limit percent and user-limit factor, which caps > pending resource for each user to the minimum of user-limit pending and > actual pending. This prevents the queue from taking more pending resource to > achieve queue balance after every queue is satisfied with its ideal allocation. > > We need to change the logic to let queue pending go beyond the user limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
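A small numeric sketch of the two calculations contrasted above: capping each user's pending resource at its user limit versus summing the raw pending. This is illustrative only; the real logic in LeafQueue#getTotalPendingResourcesConsideringUserLimit works on Resource objects and per-user headroom, not the flat longs used here.
{code:java}
import java.util.Map;

public class UserLimitPendingSketch {

  /** Current behaviour: each user's pending is capped at its user-limit headroom. */
  static long totalPendingCapped(Map<String, Long> pendingByUser, long userLimit) {
    return pendingByUser.values().stream()
        .mapToLong(p -> Math.min(p, userLimit))
        .sum();
  }

  /** Proposed direction: let queue-level pending go beyond the per-user limit. */
  static long totalPendingUncapped(Map<String, Long> pendingByUser) {
    return pendingByUser.values().stream().mapToLong(Long::longValue).sum();
  }

  public static void main(String[] args) {
    // Two users each ask for 60 units of pending resource; the per-user limit is 40.
    Map<String, Long> pending = Map.of("userA", 60L, "userB", 60L);
    System.out.println("capped total   = " + totalPendingCapped(pending, 40)); // 80
    System.out.println("uncapped total = " + totalPendingUncapped(pending));   // 120
  }
}
{code}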
[jira] [Updated] (YARN-8603) [UI2] Latest run application should be listed first in the RM UI
[ https://issues.apache.org/jira/browse/YARN-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8603: - Reporter: Sumana Sathish (was: Akhil PB) > [UI2] Latest run application should be listed first in the RM UI > > > Key: YARN-8603 > URL: https://issues.apache.org/jira/browse/YARN-8603 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Sumana Sathish >Assignee: Akhil PB >Priority: Major > Attachments: YARN-8603.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8591) [ATSv2] NPE while checking for entity acl in non-secure cluster
[ https://issues.apache.org/jira/browse/YARN-8591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562368#comment-16562368 ] Wangda Tan commented on YARN-8591: -- Updated fixed version to 3.1.2 given this don't exist in branch-3.1.1 > [ATSv2] NPE while checking for entity acl in non-secure cluster > --- > > Key: YARN-8591 > URL: https://issues.apache.org/jira/browse/YARN-8591 > Project: Hadoop YARN > Issue Type: Bug > Components: timelinereader, timelineserver >Reporter: Akhil PB >Assignee: Rohith Sharma K S >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8591.01.patch > > > {code:java} > GET > http://ctr-e138-1518143905142-417433-01-04.hwx.site:8198/ws/v2/timeline/apps/application_1532578985272_0002/entities/YARN_CONTAINER?fields=ALL&_=1532670071899{code} > {code:java} > 2018-07-27 05:32:03,468 WARN webapp.GenericExceptionHandler > (GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR > javax.ws.rs.WebApplicationException: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.timelineservice.reader.TimelineReaderWebServices.handleException(TimelineReaderWebServices.java:196) > at > org.apache.hadoop.yarn.server.timelineservice.reader.TimelineReaderWebServices.getEntities(TimelineReaderWebServices.java:624) > at > org.apache.hadoop.yarn.server.timelineservice.reader.TimelineReaderWebServices.getEntities(TimelineReaderWebServices.java:474) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.timelineservice.reader.security.TimelineReaderWhitelistAuthorizationFilter.doFilter(TimelineReaderWhitelistAuthorizationFilter.java:85) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.apache.hadoop.security.http.CrossOriginFilter.doFilter(CrossOriginFilter.java:98) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) >
[jira] [Updated] (YARN-8508) On NodeManager container gets cleaned up before its pid file is created
[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8508: - Priority: Critical (was: Major) > On NodeManager container gets cleaned up before its pid file is created > --- > > Key: YARN-8508 > URL: https://issues.apache.org/jira/browse/YARN-8508 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sumana Sathish >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8505.001.patch, YARN-8505.002.patch > > > GPU failed to release even though the container using it is being killed > {Code} > 2018-07-06 05:22:26,201 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,250 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,251 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1530854311763_0006 transitioned from RUNNING to > FINISHING_CONTAINERS_WAIT > 2018-07-06 05:22:26,251 INFO launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,358 INFO launcher.ContainerLaunch > (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for > container_e20_1530854311763_0006_01_02. Waited for 5000 ms. > 2018-07-06 05:22:31,358 WARN launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid > file created container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,359 INFO launcher.ContainerLaunch > (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, > but docker container request detected. 
Attempting to reap container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,494 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/launch_container.sh > 2018-07-06 05:22:31,500 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/container_tokens > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,512 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:31,513 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:38,955 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0007_01_02 transitioned from NEW to SCHEDULED > {Code} > New container requesting for GPU fails to launch > {code} > 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - > ResourceHandlerChain.preStart() failed! > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: > Failed to find enough GPUs, > requestor=container_e20_1530854311763_0007_01_02, #RequestedGPUs=2, > #availableGpus=1 > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75) > at > org.apache.hadoop.
[jira] [Commented] (YARN-8508) On NodeManager container gets cleaned up before its pid file is created
[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562364#comment-16562364 ] Wangda Tan commented on YARN-8508: -- I think it is important to get it backported to branch-3.1.1, I'm going to do this in a couple of hours, please let me know if you think different. cc: [~csingh], [~eyang] > On NodeManager container gets cleaned up before its pid file is created > --- > > Key: YARN-8508 > URL: https://issues.apache.org/jira/browse/YARN-8508 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sumana Sathish >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8505.001.patch, YARN-8505.002.patch > > > GPU failed to release even though the container using it is being killed > {Code} > 2018-07-06 05:22:26,201 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,250 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,251 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1530854311763_0006 transitioned from RUNNING to > FINISHING_CONTAINERS_WAIT > 2018-07-06 05:22:26,251 INFO launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,358 INFO launcher.ContainerLaunch > (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for > container_e20_1530854311763_0006_01_02. Waited for 5000 ms. > 2018-07-06 05:22:31,358 WARN launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid > file created container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,359 INFO launcher.ContainerLaunch > (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, > but docker container request detected. 
Attempting to reap container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,494 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/launch_container.sh > 2018-07-06 05:22:31,500 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/container_tokens > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,512 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:31,513 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:38,955 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0007_01_02 transitioned from NEW to SCHEDULED > {Code} > New container requesting for GPU fails to launch > {code} > 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - > ResourceHandlerChain.preStart() failed! > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: > Failed to find enough GPUs, > requestor=container_e20_1530854311763_0007_01_02, #RequestedGPUs=2, > #availableGpus=1 > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandler
[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed
[ https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562358#comment-16562358 ] Wangda Tan commented on YARN-8545: -- I think it is important to get it backported to branch-3.1.1, I'm going to do this in a couple of hours, please let me know if you think different. cc: [~csingh], [~eyang] > YARN native service should return container if launch failed > > > Key: YARN-8545 > URL: https://issues.apache.org/jira/browse/YARN-8545 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8545.001.patch > > > In some cases, container launch may fail but container will not be properly > returned to RM. > This could happen when AM trying to prepare container launch context but > failed w/o sending container launch context to NM (Once container launch > context is sent to NM, NM will report failed container to RM). > Exception like: > {code:java} > java.io.FileNotFoundException: File does not exist: > hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591) > at > org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388) > at > org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253) > at > org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152) > at > org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745){code} > And even after container launch context prepare failed, AM still trying to > monitor container's readiness: > {code:java} > 2018-07-17 18:42:57,518 [pool-7-thread-1] INFO monitor.ServiceMonitor - > Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 > 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP > presence", exception="java.io.IOException: primary-worker-0: IP is not > available yet" > ...{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
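The fix direction named in the title above is that the AM should hand the container back to the RM when launch preparation fails before anything reaches the NM, since in that case the NM will never report the failure. The sketch below shows that shape with stand-in interfaces; it is not the yarn-service API, and the interface and method names are assumptions.
{code:java}
public class ReleaseOnLaunchFailureSketch {

  interface ContainerLauncher { void buildAndLaunch(String containerId) throws Exception; }
  interface SchedulerClient { void releaseContainer(String containerId); }

  static void launchComponentInstance(String containerId,
                                      ContainerLauncher launcher,
                                      SchedulerClient scheduler) {
    try {
      launcher.buildAndLaunch(containerId);
    } catch (Exception e) {
      // The NM never saw this container, so it will never report a failure;
      // return it to the RM explicitly so the allocation is not leaked and
      // readiness probing does not keep running against a dead instance.
      System.out.println("launch prep failed for " + containerId + ": " + e.getMessage());
      scheduler.releaseContainer(containerId);
    }
  }

  public static void main(String[] args) {
    ContainerLauncher failingLauncher = id -> {
      throw new java.io.FileNotFoundException("launch script does not exist");
    };
    SchedulerClient scheduler = id -> System.out.println("released " + id + " back to the RM");
    launchComponentInstance("container_example_000001", failingLauncher, scheduler);
  }
}
{code}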
[jira] [Commented] (YARN-8546) Resource leak caused by a reserved container being released more than once under async scheduling
[ https://issues.apache.org/jira/browse/YARN-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562357#comment-16562357 ] Wangda Tan commented on YARN-8546: -- I think it is important to get it backported to branch-3.1.1; I'm going to do this in a couple of hours, please let me know if you think differently. cc: [~cheersyang], [~Tao Yang] > Resource leak caused by a reserved container being released more than once > under async scheduling > - > > Key: YARN-8546 > URL: https://issues.apache.org/jira/browse/YARN-8546 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Tao Yang >Priority: Major > Labels: global-scheduling > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8546.001.patch > > > I was able to reproduce this issue by starting a job, and this job keeps > requesting containers until it uses up cluster available resource. My cluster > has 70200 vcores, and each task it applies for 100 vcores, I was expecting > total 702 containers can be allocated but eventually there was only 701. The > last container could not get allocated because queue used resource is updated > to be more than 100%. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
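The summary above describes a reserved container being released more than once under async scheduling, which skews the queue's resource accounting. One common way to make such a release idempotent is a compare-and-set "released" flag, sketched below with plain JDK concurrency primitives; this is an illustration of the hazard, not the CapacityScheduler patch.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class IdempotentReleaseSketch {

  static final AtomicLong queueUsedVcores = new AtomicLong(200);

  static class ReservedContainer {
    private final long vcores;
    private final AtomicBoolean released = new AtomicBoolean(false);
    ReservedContainer(long vcores) { this.vcores = vcores; }

    void release() {
      // Only the first caller wins; later callers are no-ops, so the queue's
      // accounting is adjusted exactly once per container.
      if (released.compareAndSet(false, true)) {
        queueUsedVcores.addAndGet(-vcores);
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ReservedContainer c = new ReservedContainer(100);
    // Two async scheduling threads race to release the same reserved container.
    Thread t1 = new Thread(c::release);
    Thread t2 = new Thread(c::release);
    t1.start(); t2.start();
    t1.join(); t2.join();
    // Without the guard the accounting would be adjusted twice (0 instead of 100),
    // which is the kind of drift the issue describes.
    System.out.println("queue used vcores after release: " + queueUsedVcores.get());
  }
}
{code}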
[jira] [Commented] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562354#comment-16562354 ] Wangda Tan commented on YARN-8301: -- Updated the fix version to 3.1.2 given this doesn't exist in branch-3.1.1. But I think it is important to get it backported to branch-3.1.1; I'm going to do this in a couple of hours, please let me know if you think differently. cc: [~csingh], [~eyang] > Yarn Service Upgrade: Add documentation > --- > > Key: YARN-8301 > URL: https://issues.apache.org/jira/browse/YARN-8301 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8301.001.patch, YARN-8301.002.patch, > YARN-8301.003.patch, YARN-8301.004.patch, YARN-8301.005.patch, > YARN-8301.006.patch, YARN-8301.007.patch > > > Add documentation for yarn service upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8301: - Fix Version/s: (was: 3.1.1) 3.1.2 > Yarn Service Upgrade: Add documentation > --- > > Key: YARN-8301 > URL: https://issues.apache.org/jira/browse/YARN-8301 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8301.001.patch, YARN-8301.002.patch, > YARN-8301.003.patch, YARN-8301.004.patch, YARN-8301.005.patch, > YARN-8301.006.patch, YARN-8301.007.patch > > > Add documentation for yarn service upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8528) Final states in ContainerAllocation might be modified externally causing unexpected allocation results
[ https://issues.apache.org/jira/browse/YARN-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562350#comment-16562350 ] Wangda Tan commented on YARN-8528: -- Updated the fix version to 3.1.2 given this doesn't exist in branch-3.1.1. But I think it is important to get it backported to branch-3.1.1; I'm going to do this in a couple of hours, please let me know if you think differently. cc: [~cheersyang] > Final states in ContainerAllocation might be modified externally causing > unexpected allocation results > -- > > Key: YARN-8528 > URL: https://issues.apache.org/jira/browse/YARN-8528 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Xintong Song >Assignee: Xintong Song >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8528.001.patch > > > ContainerAllocation.LOCALITY_SKIPPED is final static, and its .state should > always be AllocationState.LOCALITY_SKIPPED. However, this variable is public > and is accidentally changed to AllocationState.APP_SKIPPED in > RegularContainerAllocator under certain conditions. Once that happens, all > following LOCALITY_SKIPPED situations will be treated as APP_SKIPPED. > Similar risks exist for > ContainerAllocation.PRIORITY_SKIPPED/APP_SKIPPED/QUEUE_SKIPPED. > ContainerAllocation.state should be private and should not be changed. If > changes are needed, a new ContainerAllocation should be created. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8301: - Priority: Critical (was: Major) > Yarn Service Upgrade: Add documentation > --- > > Key: YARN-8301 > URL: https://issues.apache.org/jira/browse/YARN-8301 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8301.001.patch, YARN-8301.002.patch, > YARN-8301.003.patch, YARN-8301.004.patch, YARN-8301.005.patch, > YARN-8301.006.patch, YARN-8301.007.patch > > > Add documentation for yarn service upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8528) Final states in ContainerAllocation might be modified externally causing unexpected allocation results
[ https://issues.apache.org/jira/browse/YARN-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8528: - Fix Version/s: (was: 3.1.1) 3.1.2 > Final states in ContainerAllocation might be modified externally causing > unexpected allocation results > -- > > Key: YARN-8528 > URL: https://issues.apache.org/jira/browse/YARN-8528 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Xintong Song >Assignee: Xintong Song >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8528.001.patch > > > ContainerAllocation.LOCALITY_SKIPPED is final static, and its .state should > always be AllocationState.LOCALITY_SKIPPED. However, this variable is public > and is accidentally changed to AllocationState.APP_SKIPPED in > RegularContainerAllocator under certain conditions. Once that happens, all > following LOCALITY_SKIPPED situations will be treated as APP_SKIPPED. > Similar risks exist for > ContainerAllocation.PRIORITY_SKIPPED/APP_SKIPPED/QUEUE_SKIPPED. > ContainerAllocation.state should be private and should not be changed. If > changes are needed, a new ContainerAllocation should be created. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8508) On NodeManager container gets cleaned up before its pid file is created
[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8508: - Summary: On NodeManager container gets cleaned up before its pid file is created (was: GPU does not get released even though the container is killed) > On NodeManager container gets cleaned up before its pid file is created > --- > > Key: YARN-8508 > URL: https://issues.apache.org/jira/browse/YARN-8508 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sumana Sathish >Assignee: Chandni Singh >Priority: Major > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8505.001.patch, YARN-8505.002.patch > > > GPU failed to release even though the container using it is being killed > {Code} > 2018-07-06 05:22:26,201 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,250 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,251 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1530854311763_0006 transitioned from RUNNING to > FINISHING_CONTAINERS_WAIT > 2018-07-06 05:22:26,251 INFO launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,358 INFO launcher.ContainerLaunch > (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for > container_e20_1530854311763_0006_01_02. Waited for 5000 ms. > 2018-07-06 05:22:31,358 WARN launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid > file created container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,359 INFO launcher.ContainerLaunch > (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, > but docker container request detected. 
Attempting to reap container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,494 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/launch_container.sh > 2018-07-06 05:22:31,500 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/container_tokens > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,512 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:31,513 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:38,955 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0007_01_02 transitioned from NEW to SCHEDULED > {Code} > New container requesting for GPU fails to launch > {code} > 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - > ResourceHandlerChain.preStart() failed! > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: > Failed to find enough GPUs, > requestor=container_e20_1530854311763_0007_01_02, #RequestedGPUs=2, > #availableGpus=1 > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98) > at > org.apache.hadoop.yarn.server.nodemanager.containe
[jira] [Updated] (YARN-8330) Avoid publishing reserved container to ATS from RM
[ https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8330: - Summary: Avoid publishing reserved container to ATS from RM (was: Avoid publishing reserved container to ATS) > Avoid publishing reserved container to ATS from RM > -- > > Key: YARN-8330 > URL: https://issues.apache.org/jira/browse/YARN-8330 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Yesha Vora >Assignee: Suma Shivaprasad >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8330.1.patch, YARN-8330.2.patch, YARN-8330.3.patch, > YARN-8330.4.patch > > > Steps: > launch Hbase tarball app > list containers for hbase tarball app > {code} > /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list > appattempt_1525463491331_0006_01 > WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of > YARN_LOG_DIR. > WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of > YARN_LOGFILE. > WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of > YARN_PID_DIR. > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History > server at xxx/xxx:10200 > 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Total number of containers :5 > Container-IdStart Time Finish Time > StateHost Node Http Address >LOG-URL > container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa > 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03 > Fri May 04 22:34:26 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01 > Fri May 04 22:34:15 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05 > Fri May 04 22:34:56 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa > 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04 > Fri May 04 22:34:56 + 2018 N/A > nullxxx:25454 http://xxx:8042 > http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code} > Total expected containers = 4 ( 3 components container + 1 am). Instead, RM > is listing 5 containers. > container_e06_1525463491331_0006_01_04 is in null state. > Yarn service utilized container 02, 03, 05 for component. There is no log > available in NM & AM related to container 04. 
Only one line in RM log is > printed > {code} > 2018-05-04 22:34:56,618 INFO rmcontainer.RMContainerImpl > (RMContainerImpl.java:handle(489)) - > container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to > RESERVED{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
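The retitled summary above ("Avoid publishing reserved container to ATS from RM") implies filtering out containers that never left the reservation stage before the RM publishes container events. A hedged, self-contained sketch of that filter follows; the enum, the ContainerInfo holder, and the method names are stand-ins, not the RM's metrics publisher.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipReservedPublishSketch {

  enum RMContainerState { NEW, RESERVED, RUNNING, COMPLETED }

  static class ContainerInfo {
    final String id;
    final RMContainerState state;
    ContainerInfo(String id, RMContainerState state) { this.id = id; this.state = state; }
  }

  /** Keep only containers that actually left the reservation stage. */
  static List<ContainerInfo> selectPublishable(List<ContainerInfo> containers) {
    List<ContainerInfo> result = new ArrayList<>();
    for (ContainerInfo c : containers) {
      if (c.state != RMContainerState.RESERVED && c.state != RMContainerState.NEW) {
        result.add(c);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<ContainerInfo> all = Arrays.asList(
        new ContainerInfo("container_..._01", RMContainerState.RUNNING),
        new ContainerInfo("container_..._04", RMContainerState.RESERVED));
    // Only the running container is published; the reserved one never reaches ATS,
    // so it cannot show up as a phantom "null"-state entry in container listings.
    for (ContainerInfo c : selectPublishable(all)) {
      System.out.println("publish " + c.id);
    }
  }
}
{code}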
[jira] [Updated] (YARN-8330) Avoid publishing reserved container to ATS
[ https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8330: - Summary: Avoid publishing reserved container to ATS (was: An extra container got launched by RM for yarn-service) > Avoid publishing reserved container to ATS > -- > > Key: YARN-8330 > URL: https://issues.apache.org/jira/browse/YARN-8330 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Yesha Vora >Assignee: Suma Shivaprasad >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8330.1.patch, YARN-8330.2.patch, YARN-8330.3.patch, > YARN-8330.4.patch > > > Steps: > launch Hbase tarball app > list containers for hbase tarball app > {code} > /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list > appattempt_1525463491331_0006_01 > WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of > YARN_LOG_DIR. > WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of > YARN_LOGFILE. > WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of > YARN_PID_DIR. > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History > server at xxx/xxx:10200 > 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Total number of containers :5 > Container-IdStart Time Finish Time > StateHost Node Http Address >LOG-URL > container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa > 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03 > Fri May 04 22:34:26 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01 > Fri May 04 22:34:15 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05 > Fri May 04 22:34:56 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa > 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04 > Fri May 04 22:34:56 + 2018 N/A > nullxxx:25454 http://xxx:8042 > http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code} > Total expected containers = 4 ( 3 components container + 1 am). Instead, RM > is listing 5 containers. > container_e06_1525463491331_0006_01_04 is in null state. > Yarn service utilized container 02, 03, 05 for component. There is no log > available in NM & AM related to container 04. 
Only one line in RM log is > printed > {code} > 2018-05-04 22:34:56,618 INFO rmcontainer.RMContainerImpl > (RMContainerImpl.java:handle(489)) - > container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to > RESERVED{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562288#comment-16562288 ] Wangda Tan commented on YARN-8418: -- Thanks [~bibinchundatt], +1 to the latest patch, it would be ideal if you could share some test results of the ver.9 patch. [~rohithsharma], any comments on the ver.9 patch? > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, > YARN-8418.006.patch, YARN-8418.007.patch, YARN-8418.008.patch, > YARN-8418.009.patch > > > If log aggregation fails init createApp directory container logs could get > leaked in NM directory > For log running application restart of NM after token renewal this case is > possible/ Application submission with invalid delegation token -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562026#comment-16562026 ] Wangda Tan commented on YARN-8418: -- [~rohithsharma], regarding the YARN-4984 revert, I initially had doubts as well, but now I'm convinced. See my comment at: https://issues.apache.org/jira/browse/YARN-8418?focusedCommentId=16551034&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16551034 Regarding the delegation token change, let's wait for [~bibinchundatt]'s response. > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, > YARN-8418.006.patch, YARN-8418.007.patch, YARN-8418.008.patch > > > If log aggregation fails init createApp directory container logs could get > leaked in NM directory > For log running application restart of NM after token renewal this case is > possible/ Application submission with invalid delegation token -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561188#comment-16561188 ] Wangda Tan commented on YARN-8418: -- Thanks [~bibinchundatt] for updating the patch and [~suma.shivaprasad] for the code review. One minor comment: in ContainerManagerImpl#handleCredentialUpdate, instead of calling the following logic: {code} Set invalidTokenApps = logHandler.getInvalidTokenApps(); for (ApplicationId app : invalidTokenApps) { if (context.getSystemCredentialsForApps().get(app) != null) { dispatcher.getEventHandler() .handle(new LogHandlerTokenUpdatedEvent(app)); } } {code} would it be better to just send a single LogHandlerTokenUpdatedEvent to the log handler (with no need to specify an app id), and inside the LogHandler, loop over the invalidTokenApps and update their tokens? Benefits of doing this: - ContainerManagerImpl doesn't have to know about invalid apps, which gives better encapsulation. - It avoids the race condition where an app gets added to invalidApps while handleCredentialUpdate is looping over apps. > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch, YARN-8418.005.patch, > YARN-8418.006.patch, YARN-8418.007.patch > > > If log aggregation fails init createApp directory container logs could get > leaked in NM directory > For log running application restart of NM after token renewal this case is > possible/ Application submission with invalid delegation token -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
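To make the refactor suggested in the comment above concrete, here is a minimal, self-contained sketch of that direction: the container manager fires a single app-agnostic token-updated notification, and the log handler walks its own invalid-token apps. All class, field, and method names below (LogHandler, onTokenUpdated, retryLogAggregation, the String-keyed credential map) are hypothetical simplifications for illustration, not the actual NM classes or the patch's code.
{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class TokenUpdateSketch {

  /** Stand-in for the log handler side (e.g. the log aggregation service). */
  static class LogHandler {
    private final Set<String> invalidTokenApps = ConcurrentHashMap.newKeySet();
    private final Map<String, String> systemCredentialsForApps;

    LogHandler(Map<String, String> systemCredentialsForApps) {
      this.systemCredentialsForApps = systemCredentialsForApps;
    }

    void markTokenInvalid(String appId) {
      invalidTokenApps.add(appId);
    }

    /** One app-agnostic "token updated" event; the handler owns the bookkeeping. */
    void onTokenUpdated() {
      for (String appId : invalidTokenApps) {
        String creds = systemCredentialsForApps.get(appId);
        if (creds != null) {
          retryLogAggregation(appId, creds);
          invalidTokenApps.remove(appId);
        }
      }
    }

    private void retryLogAggregation(String appId, String creds) {
      System.out.println("Retrying log aggregation for " + appId);
    }
  }

  /** Stand-in for the container manager's credential-update path. */
  static class ContainerManager {
    private final LogHandler logHandler;

    ContainerManager(LogHandler logHandler) {
      this.logHandler = logHandler;
    }

    void handleCredentialUpdate() {
      // No per-app loop and no knowledge of invalid apps here any more.
      logHandler.onTokenUpdated();
    }
  }

  public static void main(String[] args) {
    Map<String, String> creds = new ConcurrentHashMap<>();
    creds.put("application_1", "fresh-delegation-token");

    LogHandler handler = new LogHandler(creds);
    handler.markTokenInvalid("application_1");
    new ContainerManager(handler).handleCredentialUpdate();
  }
}
{code}
The design point is that only the log handler reads or mutates the invalid-token set, which removes the race condition described in the comment.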
[jira] [Updated] (YARN-8563) [Submarine] Support users to specify Python/TF package/version/dependencies for training job.
[ https://issues.apache.org/jira/browse/YARN-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8563: - Summary: [Submarine] Support users to specify Python/TF package/version/dependencies for training job. (was: Support users to specify Python/TF package/version/dependencies for training job.) > [Submarine] Support users to specify Python/TF package/version/dependencies > for training job. > - > > Key: YARN-8563 > URL: https://issues.apache.org/jira/browse/YARN-8563 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Priority: Major > > YARN-8561 assumes all Python / Tensorflow dependencies will be packed to > docker image. In practice, user doesn't want to build docker image. Instead, > user can provide python package / dependencies (like .whl), Python and TF > version. And Submarine can localize specified dependencies to prebuilt base > Docker images. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8561) [Submarine] Add initial implementation: training job submission and job history retrieve.
[ https://issues.apache.org/jira/browse/YARN-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8561: - Summary: [Submarine] Add initial implementation: training job submission and job history retrieve. (was: Add submarine initial implementation: training job submission and job history retrieve.) > [Submarine] Add initial implementation: training job submission and job > history retrieve. > - > > Key: YARN-8561 > URL: https://issues.apache.org/jira/browse/YARN-8561 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Attachments: YARN-8561.001.patch > > > Added following parts: > 1) New subcomponent of YARN, under applications/ project. > 2) Tensorflow training job submission, including training (single node and > distributed). > - Supported Docker container. > - Support GPU isolation. > - Support YARN registry DNS. > 3) Retrieve job history. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8563) Support users to specify Python/TF package/version/dependencies for training job.
Wangda Tan created YARN-8563: Summary: Support users to specify Python/TF package/version/dependencies for training job. Key: YARN-8563 URL: https://issues.apache.org/jira/browse/YARN-8563 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan YARN-8561 assumes all Python / Tensorflow dependencies will be packed to docker image. In practice, user doesn't want to build docker image. Instead, user can provide python package / dependencies (like .whl), Python and TF version. And Submarine can localize specified dependencies to prebuilt base Docker images. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8558) NM recovery level db not cleaned up properly on container finish
[ https://issues.apache.org/jira/browse/YARN-8558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551755#comment-16551755 ] Wangda Tan commented on YARN-8558: -- [~bibinchundatt], I think we should have a follow up Jira to make sure all container related keys can be grouped so we don't need to worry about manually adding keys to delete in the future. Overall patch looks good. There're some other CONTAINER_ related fields are not included in your patch, like CONTAINER_TOKENS_KEY_PREFIX. Could u double confirm if they're required or not? cc: [~sunil.gov...@gmail.com] > NM recovery level db not cleaned up properly on container finish > > > Key: YARN-8558 > URL: https://issues.apache.org/jira/browse/YARN-8558 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8558.001.patch > > > {code} > 2018-07-20 16:49:23,117 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Application application_1531994217928_0054 transitioned from NEW to INITING > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_18 with incomplete > records > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_19 with incomplete > records > 2018-07-20 16:49:23,204 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_20 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_21 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_22 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_23 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_24 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_25 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_38 with incomplete > records > 2018-07-20 16:49:23,205 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_39 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_41 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_44 with incomplete > records > 2018-07-20 16:49:23,206 WARN > 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_46 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_49 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_52 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_54 with incomplete > records > 2018-07-20 16:49:23,206 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_73 with incomplete > records > 2018-07-20 16:49:23,207 WARN > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService: > Remove container container_1531994217928_0001_01_74 with incomplete > records > 2018-07-20 16:49:23,207 WARN > org.apache
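On the follow-up idea of grouping all container-related keys so nothing has to be deleted by hand, the sketch below shows what a prefix-based cleanup could look like with the iq80 leveldb API that the NM state store already builds on. The key layout (ContainerManager/containers/<containerId>/...) and the helper name are illustrative assumptions, not the exact NMLeveldbStateStoreService schema.
{code:java}
import java.io.IOException;
import java.util.Map;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.WriteBatch;

import static org.fusesource.leveldbjni.JniDBFactory.asString;
import static org.fusesource.leveldbjni.JniDBFactory.bytes;

/**
 * Illustrative prefix-grouped cleanup: if every key belonging to a container
 * lives under "ContainerManager/containers/<containerId>/...", finishing a
 * container needs only one prefix scan instead of a hand-maintained list of
 * key suffixes.
 */
public final class ContainerKeyCleanup {

  private static final String CONTAINERS_KEY_PREFIX =
      "ContainerManager/containers/";

  private ContainerKeyCleanup() {
  }

  /** Deletes every key stored under the given container's prefix. */
  public static void removeContainerKeys(DB db, String containerId)
      throws IOException {
    String prefix = CONTAINERS_KEY_PREFIX + containerId + "/";
    try (WriteBatch batch = db.createWriteBatch();
         DBIterator it = db.iterator()) {
      it.seek(bytes(prefix));
      while (it.hasNext()) {
        Map.Entry<byte[], byte[]> entry = it.peekNext();
        if (!asString(entry.getKey()).startsWith(prefix)) {
          break;  // left this container's key range
        }
        batch.delete(entry.getKey());
        it.next();
      }
      db.write(batch);
    }
  }
}
{code}
With all per-container state under one prefix, a newly introduced key is removed automatically at container finish, so nobody has to remember to extend a deletion list in the future.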
[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop
[ https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8135: - Attachment: (was: YARN-8135.poc.001.patch) > Hadoop {Submarine} Project: Simple and scalable deployment of deep learning > training / serving jobs on Hadoop > - > > Key: YARN-8135 > URL: https://issues.apache.org/jira/browse/YARN-8135 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > > Description: > *Goals:* > - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs > on YARN. > - Allow jobs easy access data/models in HDFS and other storages. > - Can launch services to serve Tensorflow/MXNet models. > - Support run distributed Tensorflow jobs with simple configs. > - Support run user-specified Docker images. > - Support specify GPU and other resources. > - Support launch tensorboard if user specified. > - Support customized DNS name for roles (like tensorboard.$user.$domain:6006) > *Why this name?* > - Because Submarine is the only vehicle can let human to explore deep > places. B-) > h3. {color:#FF}Please refer to on-going design doc, and add your > thoughts: > {color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop
[ https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551490#comment-16551490 ] Wangda Tan commented on YARN-8135: -- Discussed with many folks; thanks for input from: [~sunilg], [~jhung], [~oliverhuh...@gmail.com], [~erwaman], [~yanboliang], [~zhz], [~vinodkv], Xun Liu, [~shaneku...@gmail.com] and many others. I just put the initial patch on YARN-8561 to get early feedback. I tested the patch on a 3.1.0 cluster and it runs fine. Please let me know your thoughts. I'm going to be on vacation next week, so please expect some delay in my responses. > Hadoop {Submarine} Project: Simple and scalable deployment of deep learning > training / serving jobs on Hadoop > - > > Key: YARN-8135 > URL: https://issues.apache.org/jira/browse/YARN-8135 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Attachments: YARN-8135.poc.001.patch > > > Description: > *Goals:* > - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs > on YARN. > - Allow jobs easy access data/models in HDFS and other storages. > - Can launch services to serve Tensorflow/MXNet models. > - Support run distributed Tensorflow jobs with simple configs. > - Support run user-specified Docker images. > - Support specify GPU and other resources. > - Support launch tensorboard if user specified. > - Support customized DNS name for roles (like tensorboard.$user.$domain:6006) > *Why this name?* > - Because Submarine is the only vehicle can let human to explore deep > places. B-) > h3. {color:#FF}Please refer to on-going design doc, and add your > thoughts: > {color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8561) Add submarine initial implementation: training job submission and job history retrieve.
[ https://issues.apache.org/jira/browse/YARN-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551487#comment-16551487 ] Wangda Tan commented on YARN-8561: -- Attached initial patch to get some early feedbacks. Please refer to {{QuickStart.md}} for overall usage, and {{DeveloperGuide.md}} for basic developing-related information. > Add submarine initial implementation: training job submission and job history > retrieve. > --- > > Key: YARN-8561 > URL: https://issues.apache.org/jira/browse/YARN-8561 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Attachments: YARN-8561.001.patch > > > Added following parts: > 1) New subcomponent of YARN, under applications/ project. > 2) Tensorflow training job submission, including training (single node and > distributed). > - Supported Docker container. > - Support GPU isolation. > - Support YARN registry DNS. > 3) Retrieve job history. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8561) Add submarine initial implementation: training job submission and job history retrieve.
[ https://issues.apache.org/jira/browse/YARN-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8561: - Attachment: YARN-8561.001.patch > Add submarine initial implementation: training job submission and job history > retrieve. > --- > > Key: YARN-8561 > URL: https://issues.apache.org/jira/browse/YARN-8561 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Attachments: YARN-8561.001.patch > > > Added following parts: > 1) New subcomponent of YARN, under applications/ project. > 2) Tensorflow training job submission, including training (single node and > distributed). > - Supported Docker container. > - Support GPU isolation. > - Support YARN registry DNS. > 3) Retrieve job history. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8561) Add submarine initial implementation: training job submission and job history retrieve.
Wangda Tan created YARN-8561: Summary: Add submarine initial implementation: training job submission and job history retrieve. Key: YARN-8561 URL: https://issues.apache.org/jira/browse/YARN-8561 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Added following parts: 1) New subcomponent of YARN, under applications/ project. 2) Tensorflow training job submission, including training (single node and distributed). - Supported Docker container. - Support GPU isolation. - Support YARN registry DNS. 3) Retrieve job history. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8360) Yarn service conflict between restart policy and NM configuration
[ https://issues.apache.org/jira/browse/YARN-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8360: - Target Version/s: 3.2.0, 3.1.2 Priority: Critical (was: Major) > Yarn service conflict between restart policy and NM configuration > -- > > Key: YARN-8360 > URL: https://issues.apache.org/jira/browse/YARN-8360 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Chandni Singh >Assignee: Suma Shivaprasad >Priority: Critical > Attachments: YARN-8360.1.patch > > > For the below spec, the service will not stop even after container failures > because of the NM auto retry properties : > * "yarn.service.container-failure.retry.max": 1, > * "yarn.service.container-failure.validity-interval-ms": 5000 > The NM will continue auto-restarting containers. > {{fail_after 20}} fails after 20 seconds. Since the validity failure > interval is 5 seconds, NM will auto restart the container. > {code:java} > { > "name": "fail-demo2", > "version": "1.0.0", > "components" : > [ > { > "name": "comp1", > "number_of_containers": 1, > "launch_command": "fail_after 20", > "restart_policy": "NEVER", > "resource": { > "cpus": 1, > "memory": "256" > }, > "configuration": { > "properties": { > "yarn.service.container-failure.retry.max": 1, > "yarn.service.container-failure.validity-interval-ms": 5000 > } > } > } > ] > } > {code} > If {{restart_policy}} is NEVER, then the service should stop after the > container fails. > Since we have introduced, the service level Restart Policies, I think we > should make the NM auto retry configurations part of the {{RetryPolicy}} and > get rid of all {{yarn.service.container-failure.**}} properties. Otherwise it > gets confusing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8544) [DS] AM registration fails when hadoop authorization is enabled
[ https://issues.apache.org/jira/browse/YARN-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8544: - Target Version/s: 3.2.0, 3.1.2 (was: 3.1.1) > [DS] AM registration fails when hadoop authorization is enabled > --- > > Key: YARN-8544 > URL: https://issues.apache.org/jira/browse/YARN-8544 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Blocker > Attachments: YARN-8544.001.patch > > > Application master fails to register when hadoop authorization is enabled. > DistributedSchedulingAMProtocol connection authorization fails are RM side > Issue credits: [~BilwaST] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551245#comment-16551245 ] Wangda Tan commented on YARN-8480: -- [~templedf], {quote}It also looks to me like the PCM is an out-of-band scheduling process that skips the usual scheduling process. {quote} There are two approaches to PCM allocation: one is the {{PlacementConstraintProcessor}}, which is out-of-band, and the other is the scheduler path, which happens inside the normal scheduler loop. You can check http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html for more details. When we made the scheduler changes for placement constraints, we put most of the logic into common parts like AppSchedulingInfo, so other scheduler implementations can use it directly. The hardest part is done; you only need to add some glue code to make FS work. If you really don't want to use the new SchedulingRequest, adding node attributes to ResourceRequest could also be considered (I don't particularly like that approach, but it is acceptable to me). By supporting node attributes through ResourceRequest, the changes in FS should be minimal, and all the other logic, such as how node attributes are assigned to nodes, how nodes report attributes to the RM, and how the web UI renders attributes, can be reused. Looping in other folks who are working on node attribute support for more thoughts: [~sunilg], [~cheersyang], [~Naganarasimha], [~bibinchundatt]. > Add boolean option for resources > > > Key: YARN-8480 > URL: https://issues.apache.org/jira/browse/YARN-8480 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Daniel Templeton >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8480.001.patch, YARN-8480.002.patch > > > Make it possible to define a resource with a boolean value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
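For readers following the SchedulingRequest discussion above, here is a rough sketch of what a tag-based placement-constraint request looks like, loosely modeled on the PlacementConstraints documentation linked in the comment; treat the exact builder and helper signatures as approximate rather than authoritative.
{code:java}
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceSizing;
import org.apache.hadoop.yarn.api.records.SchedulingRequest;
import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import org.apache.hadoop.yarn.api.resource.PlacementConstraints;

import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;

public final class SchedulingRequestExample {

  private SchedulingRequestExample() {
  }

  /**
   * Asks for one 1 GB / 1 vcore container tagged "hbase-m", anti-affine to
   * other containers carrying the same tag (at most one per node).
   */
  public static SchedulingRequest antiAffineHbaseMaster() {
    PlacementConstraint antiAffinity = PlacementConstraints.build(
        PlacementConstraints.targetNotIn(NODE, allocationTag("hbase-m")));

    return SchedulingRequest.newBuilder()
        .allocationRequestId(1L)
        .priority(Priority.newInstance(1))
        .allocationTags(Collections.singleton("hbase-m"))
        .placementConstraintExpression(antiAffinity)
        .resourceSizing(
            ResourceSizing.newInstance(1, Resource.newInstance(1024, 1)))
        .build();
  }
}
{code}
A constraint like this is carried by SchedulingRequest rather than ResourceRequest, which is the distinction the comment draws for the FS discussion.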
[jira] [Commented] (YARN-8330) An extra container got launched by RM for yarn-service
[ https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551137#comment-16551137 ] Wangda Tan commented on YARN-8330: -- [~rohithsharma], To me it is fine that we publish ALLOCATED/ACQUIRED container to ATS since it is a user-visible container. Comparing to that, RESERVED container is not visible to app. > An extra container got launched by RM for yarn-service > -- > > Key: YARN-8330 > URL: https://issues.apache.org/jira/browse/YARN-8330 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Yesha Vora >Assignee: Suma Shivaprasad >Priority: Critical > Attachments: YARN-8330.1.patch, YARN-8330.2.patch > > > Steps: > launch Hbase tarball app > list containers for hbase tarball app > {code} > /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list > appattempt_1525463491331_0006_01 > WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of > YARN_LOG_DIR. > WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of > YARN_LOGFILE. > WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of > YARN_PID_DIR. > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History > server at xxx/xxx:10200 > 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Total number of containers :5 > Container-IdStart Time Finish Time > StateHost Node Http Address >LOG-URL > container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa > 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03 > Fri May 04 22:34:26 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01 > Fri May 04 22:34:15 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05 > Fri May 04 22:34:56 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa > 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04 > Fri May 04 22:34:56 + 2018 N/A > nullxxx:25454 http://xxx:8042 > http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code} > Total expected containers = 4 ( 3 components container + 1 am). Instead, RM > is listing 5 containers. > container_e06_1525463491331_0006_01_04 is in null state. > Yarn service utilized container 02, 03, 05 for component. There is no log > available in NM & AM related to container 04. 
Only one line in RM log is > printed > {code} > 2018-05-04 22:34:56,618 INFO rmcontainer.RMContainerImpl > (RMContainerImpl.java:handle(489)) - > container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to > RESERVED{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8559) Expose scheduling configuration info in Resource Manager's /conf endpoint
[ https://issues.apache.org/jira/browse/YARN-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551066#comment-16551066 ] Wangda Tan commented on YARN-8559: -- [~cheersyang] / [~banditka], Actually, the scheduler-conf endpoint was not intended for retrieving the scheduler's configs; it is an API for updating scheduler configs via PUT. I think we should add a GET API to the same endpoint. Thoughts? > Expose scheduling configuration info in Resource Manager's /conf endpoint > - > > Key: YARN-8559 > URL: https://issues.apache.org/jira/browse/YARN-8559 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Anna Savarin >Priority: Minor > > All Hadoop services provide a set of common endpoints (/stacks, /logLevel, > /metrics, /jmx, /conf). In the case of the Resource Manager, part of the > configuration comes from the scheduler being used. Currently, these > configuration key/values are not exposed through the /conf endpoint, thereby > revealing an incomplete configuration picture. > Make an improvement and expose the scheduling configuration info through the > RM's /conf endpoint. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
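To illustrate the proposal, below is a client-side sketch of the suggested GET on the same scheduler-conf endpoint (the PUT form is the existing mutation API). The RM host and port are placeholders, and the GET behavior shown here is the proposed addition, not a shipped feature at the time of this comment.
{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Fetches the scheduler configuration via the proposed GET endpoint. */
public final class SchedulerConfGet {

  public static void main(String[] args) throws IOException {
    // Placeholder RM address; the path mirrors the existing mutation endpoint.
    URL url = new URL("http://rm-host:8088/ws/v1/cluster/scheduler-conf");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Accept", "application/xml");

    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);  // dump the returned scheduler configuration
      }
    } finally {
      conn.disconnect();
    }
  }
}
{code}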
[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551057#comment-16551057 ] Wangda Tan commented on YARN-8418: -- Regarding the logic of the patch, a couple of comments: 1) Is it possible to send the "new token arrived" message to LogAggregationService instead of handling it inside ContainerManagerImpl? 2) In addition to the above suggestion, ContainerManagerImpl should not tell LogAggregationService that "app log aggregation enabled". Instead, it should notify LogAggregationService that a new token arrived and let LogAggregationService make the decision. 3) Following logic: {code:java} 375 AppLogAggregator aggregator = appLogAggregators.get(appId); 376 if (aggregator != null) {{code} When could this be null? 4) Following logic: {code:java} 259 if (e.getCause() instanceof SecretManager.InvalidToken) { 260 aggregationDisabledApps.add(appId); 261 }{code} Then we should call it "invalidTokenApps" or some more appropriate name. We want to distinguish between this scenario and other invalid apps. 5) What happens if the token arrives after the AppLogAggregator is removed from the context? Is that possible? If yes, are we going to remove the log dir in this case? 6) Have you done tests in a real cluster to prove it works? Just to make sure we're pushing the right fix into 3.1.1, given we don't have much time before the RC. [~sunil.gov...@gmail.com], [~suma.shivaprasad], could you help review the patch and related logic since I will be off after tomorrow? > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch > > > If log aggregation fails init createApp directory container logs could get > leaked in NM directory > For log running application restart of NM after token renewal this case is > possible/ Application submission with invalid delegation token -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551034#comment-16551034 ] Wangda Tan commented on YARN-8418: -- [~bibinchundatt], {quote}As part of YARN-4984 we disabled the thread / not sure was a leak.. {quote} I see, I checked the history of the related fixes. There are 3 issues related to the problem: 1) YARN-4697 bounds the maximum number of threads that can be used by LogAggregationService. 2) YARN-4325 fixes a problem where a completed-app event is sent to LogAggregationService again. 3) YARN-4984 fixes a corner case where, even if creation of the log aggregation app dir fails, LogAggregationService still uses a new thread to do the remaining work. Given we have #1, I think #3 is not a big issue here; we need that thread to do app cleanup, and it is also needed to upload logs once a new token is received. Now I'm convinced that we should revert part of the changes in YARN-4984 and do the right fix here. > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch > > > If log aggregation fails init createApp directory container logs could get > leaked in NM directory > For log running application restart of NM after token renewal this case is > possible/ Application submission with invalid delegation token -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
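As a side note on item #1, the kind of bound YARN-4697 introduced can be pictured as a fixed-size pool for per-app aggregation work; the class below is only an illustration of that idea under that assumption, with a made-up name and no relation to the NM's real thread-pool code or configuration keys.
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Per-app log aggregation work runs on a pool with a hard upper bound. */
public final class BoundedAggregatorPool {

  private final ThreadPoolExecutor pool;

  public BoundedAggregatorPool(int maxThreads) {
    // Core == max so the pool never grows past the configured bound;
    // extra app aggregators simply queue until a thread frees up.
    this.pool = new ThreadPoolExecutor(maxThreads, maxThreads,
        60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    this.pool.allowCoreThreadTimeOut(true);
  }

  public void submitAppAggregator(String appId, Runnable aggregator) {
    System.out.println("Scheduling log aggregation for " + appId);
    pool.execute(aggregator);
  }

  public void shutdown() {
    pool.shutdown();
  }
}
{code}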
[jira] [Commented] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549964#comment-16549964 ] Wangda Tan commented on YARN-8418: -- [~bibinchundatt], I'm a bit hesitant to get this patch committed since it seems to revert changes from YARN-4984. But this is a valid issue; is it possible to delay creating the app log aggregation thread until after the token is received? Otherwise, I would prefer to move this to 3.1.2 since it could introduce a thread leak. > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch > > > If log aggregation fails init createApp directory container logs could get > leaked in NM directory > For log running application restart of NM after token renewal this case is > possible/ Application submission with invalid delegation token -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8418: - Target Version/s: 3.1.1 (was: 3.1.2) > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch > > > If log aggregation fails init createApp directory container logs could get > leaked in NM directory > For log running application restart of NM after token renewal this case is > possible/ Application submission with invalid delegation token -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8418: - Target Version/s: 3.1.2 (was: 3.1.1) > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch > > > If log aggregation fails init createApp directory container logs could get > leaked in NM directory > For log running application restart of NM after token renewal this case is > possible/ Application submission with invalid delegation token -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8418) App local logs could leaked if log aggregation fails to initialize for the app
[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8418: - Target Version/s: 3.1.1 (was: 3.1.2) > App local logs could leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch > > > If log aggregation fails init createApp directory container logs could get > leaked in NM directory > For log running application restart of NM after token renewal this case is > possible/ Application submission with invalid delegation token -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549901#comment-16549901 ] Wangda Tan commented on YARN-8474: -- Bulk update: moved all 3.1.1 non-blocker issues, please move back if it is a blocker. > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Sumana Sathish >Assignee: Eric Yang >Priority: Critical > Attachments: YARN-8474.001.patch, YARN-8474.002.patch, > YARN-8474.003.patch, YARN-8474.004.patch > > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch
[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549902#comment-16549902 ] Wangda Tan commented on YARN-8234: -- Bulk update: moved all 3.1.1 non-blocker issues, please move back if it is a blocker. > Improve RM system metrics publisher's performance by pushing events to > timeline server in batch > --- > > Key: YARN-8234 > URL: https://issues.apache.org/jira/browse/YARN-8234 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, timelineserver >Affects Versions: 2.8.3 >Reporter: Hu Ziqian >Assignee: Hu Ziqian >Priority: Critical > Attachments: YARN-8234-branch-2.8.3.001.patch, > YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, > YARN-8234.001.patch, YARN-8234.002.patch, YARN-8234.003.patch > > > When system metrics publisher is enabled, RM will push events to timeline > server via restful api. If the cluster load is heavy, many events are sent to > timeline server and the timeline server's event handler thread locked. > YARN-7266 talked about the detail of this problem. Because of the lock, > timeline server can't receive event as fast as it generated in RM and lots of > timeline event stays in RM's memory. Finally, those events will consume all > RM's memory and RM will start a full gc (which cause an JVM stop-world and > cause a timeout from rm to zookeeper) or even get an OOM. > The main problem here is that timeline can't receive timeline server's event > as fast as it generated. Now, RM system metrics publisher put only one event > in a request, and most time costs on handling http header or some thing about > the net connection on timeline side. Only few time is spent on dealing with > the timeline event which is truly valuable. > In this issue, we add a buffer in system metrics publisher and let publisher > send events to timeline server in batch via one request. When sets the batch > size to 1000, in out experiment the speed of the timeline server receives > events has 100x improvement. We have implement this function int our product > environment which accepts 2 app's in one hour and it works fine. > We add following configuration: > * yarn.resourcemanager.system-metrics-publisher.batch-size: the size of > system metrics publisher sending events in one request. Default value is 1000 > * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the > event buffer in system metrics publisher. > * yarn.resourcemanager.system-metrics-publisher.interval-seconds: When > enable batch publishing, we must avoid that the publisher waits for a batch > to be filled up and hold events in buffer for long time. So we add another > thread which send event's in the buffer periodically. This config sets the > interval of the cyclical sending thread. The default value is 60s. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
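The batching mechanism described in the quoted report can be sketched as follows: events accumulate in a buffer and are flushed to the timeline server in one request either when the batch fills up or when the periodic flush interval elapses. Class and method names below are illustrative stand-ins, not the actual patch code.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BatchingTimelinePublisher {

  private final int batchSize;
  private final List<String> buffer = new ArrayList<>();
  private final ScheduledExecutorService flusher =
      Executors.newSingleThreadScheduledExecutor();

  public BatchingTimelinePublisher(int batchSize, long flushIntervalSeconds) {
    this.batchSize = batchSize;
    // Periodic flush so a half-full batch never waits forever.
    flusher.scheduleWithFixedDelay(this::flush,
        flushIntervalSeconds, flushIntervalSeconds, TimeUnit.SECONDS);
  }

  /** Called for every timeline event generated on the RM side. */
  public synchronized void publish(String event) {
    buffer.add(event);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  /** Sends all buffered events in a single request. */
  public synchronized void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    List<String> batch = new ArrayList<>(buffer);
    buffer.clear();
    sendToTimelineServer(batch);
  }

  private void sendToTimelineServer(List<String> batch) {
    // Placeholder for one REST call carrying the whole batch.
    System.out.println("Posting " + batch.size() + " events in one request");
  }

  public void stop() {
    flusher.shutdown();
    flush();
  }
}
{code}
Sizing the batch (for example, 1000 events) trades a little extra latency for far fewer HTTP round trips, which is where the reported speedup comes from.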
[jira] [Updated] (YARN-8474) sleeper service fails to launch with "Authentication Required"
[ https://issues.apache.org/jira/browse/YARN-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8474: - Target Version/s: 3.2.0, 3.1.2 (was: 3.2.0, 3.1.1) > sleeper service fails to launch with "Authentication Required" > -- > > Key: YARN-8474 > URL: https://issues.apache.org/jira/browse/YARN-8474 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.0 >Reporter: Sumana Sathish >Assignee: Eric Yang >Priority: Critical > Attachments: YARN-8474.001.patch, YARN-8474.002.patch, > YARN-8474.003.patch, YARN-8474.004.patch > > > Sleeper job fails with Authentication required. > {code} > yarn app -launch sl1 a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:43 INFO client.ApiServiceClient: Loading service definition > from local FS: /a/YarnServiceLogs/sleeper-orig.json > 18/06/28 22:00:44 ERROR client.ApiServiceClient: Authentication required > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch
[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8234: - Target Version/s: 3.1.2 (was: 3.1.1, 3.1.2) > Improve RM system metrics publisher's performance by pushing events to > timeline server in batch > --- > > Key: YARN-8234 > URL: https://issues.apache.org/jira/browse/YARN-8234 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, timelineserver >Affects Versions: 2.8.3 >Reporter: Hu Ziqian >Assignee: Hu Ziqian >Priority: Critical > Attachments: YARN-8234-branch-2.8.3.001.patch, > YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, > YARN-8234.001.patch, YARN-8234.002.patch, YARN-8234.003.patch > > > When system metrics publisher is enabled, RM will push events to timeline > server via restful api. If the cluster load is heavy, many events are sent to > timeline server and the timeline server's event handler thread locked. > YARN-7266 talked about the detail of this problem. Because of the lock, > timeline server can't receive event as fast as it generated in RM and lots of > timeline event stays in RM's memory. Finally, those events will consume all > RM's memory and RM will start a full gc (which cause an JVM stop-world and > cause a timeout from rm to zookeeper) or even get an OOM. > The main problem here is that timeline can't receive timeline server's event > as fast as it generated. Now, RM system metrics publisher put only one event > in a request, and most time costs on handling http header or some thing about > the net connection on timeline side. Only few time is spent on dealing with > the timeline event which is truly valuable. > In this issue, we add a buffer in system metrics publisher and let publisher > send events to timeline server in batch via one request. When sets the batch > size to 1000, in out experiment the speed of the timeline server receives > events has 100x improvement. We have implement this function int our product > environment which accepts 2 app's in one hour and it works fine. > We add following configuration: > * yarn.resourcemanager.system-metrics-publisher.batch-size: the size of > system metrics publisher sending events in one request. Default value is 1000 > * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the > event buffer in system metrics publisher. > * yarn.resourcemanager.system-metrics-publisher.interval-seconds: When > enable batch publishing, we must avoid that the publisher waits for a batch > to be filled up and hold events in buffer for long time. So we add another > thread which send event's in the buffer periodically. This config sets the > interval of the cyclical sending thread. The default value is 60s. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch
[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8234: - Target Version/s: 3.1.1, 3.1.2 (was: 3.1.1) > Improve RM system metrics publisher's performance by pushing events to > timeline server in batch > --- > > Key: YARN-8234 > URL: https://issues.apache.org/jira/browse/YARN-8234 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, timelineserver >Affects Versions: 2.8.3 >Reporter: Hu Ziqian >Assignee: Hu Ziqian >Priority: Critical > Attachments: YARN-8234-branch-2.8.3.001.patch, > YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, > YARN-8234.001.patch, YARN-8234.002.patch, YARN-8234.003.patch > > > When system metrics publisher is enabled, RM will push events to timeline > server via restful api. If the cluster load is heavy, many events are sent to > timeline server and the timeline server's event handler thread locked. > YARN-7266 talked about the detail of this problem. Because of the lock, > timeline server can't receive event as fast as it generated in RM and lots of > timeline event stays in RM's memory. Finally, those events will consume all > RM's memory and RM will start a full gc (which cause an JVM stop-world and > cause a timeout from rm to zookeeper) or even get an OOM. > The main problem here is that timeline can't receive timeline server's event > as fast as it generated. Now, RM system metrics publisher put only one event > in a request, and most time costs on handling http header or some thing about > the net connection on timeline side. Only few time is spent on dealing with > the timeline event which is truly valuable. > In this issue, we add a buffer in system metrics publisher and let publisher > send events to timeline server in batch via one request. When sets the batch > size to 1000, in out experiment the speed of the timeline server receives > events has 100x improvement. We have implement this function int our product > environment which accepts 2 app's in one hour and it works fine. > We add following configuration: > * yarn.resourcemanager.system-metrics-publisher.batch-size: the size of > system metrics publisher sending events in one request. Default value is 1000 > * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the > event buffer in system metrics publisher. > * yarn.resourcemanager.system-metrics-publisher.interval-seconds: When > enable batch publishing, we must avoid that the publisher waits for a batch > to be filled up and hold events in buffer for long time. So we add another > thread which send event's in the buffer periodically. This config sets the > interval of the cyclical sending thread. The default value is 60s. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8514) YARN RegistryDNS throws NPE when Kerberos tgt expires
[ https://issues.apache.org/jira/browse/YARN-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8514: - Target Version/s: 3.2.0, 3.1.2 (was: 3.2.0, 3.1.1) > YARN RegistryDNS throws NPE when Kerberos tgt expires > - > > Key: YARN-8514 > URL: https://issues.apache.org/jira/browse/YARN-8514 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.9.1, 3.0.1, 3.0.2, 2.9.2 >Reporter: Eric Yang >Priority: Critical > > After Kerberos ticket expires, RegistryDNS throws NPE error: > {code:java} > 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler > (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT > Renewer for rm/host1.example@example.com,5,main] threw an Exception. > java.lang.NullPointerException > at > javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) > at > org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) > at java.lang.Thread.run(Thread.java:745){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery
[ https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8242: - Target Version/s: 3.2.0, 3.1.2 (was: 3.2.0) > YARN NM: OOM error while reading back the state store on recovery > - > > Key: YARN-8242 > URL: https://issues.apache.org/jira/browse/YARN-8242 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2 >Reporter: Kanwaljeet Sachdev >Priority: Critical > Attachments: YARN-8242.001.patch, YARN-8242.002.patch, > YARN-8242.003.patch > > > On startup the NM reads its state store and builds a list of application in > the state store to process. If the number of applications in the state store > is large and have a lot of "state" connected to it the NM can run OOM and > never get to the point that it can start processing the recovery. > Since it never starts the recovery there is no way for the NM to ever pass > this point. It will require a change in heap size to get the NM started. > > Following is the stack trace > {code:java} > at java.lang.OutOfMemoryError. (OutOfMemoryError.java:48) at > com.google.protobuf.ByteString.copyFrom (ByteString.java:192) at > com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. > (YarnProtos.java:47069) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. > (YarnProtos.java:47014) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom > (YarnProtos.java:47102) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom > (YarnProtos.java:47097) at com.google.protobuf.CodedInputStream.readMessage > (CodedInputStream.java:309) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. > (YarnProtos.java:41016) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. > (YarnProtos.java:40942) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom > (YarnProtos.java:41080) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom > (YarnProtos.java:41075) at com.google.protobuf.CodedInputStream.readMessage > (CodedInputStream.java:309) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto. > (YarnServiceProtos.java:24517) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto. 
> (YarnServiceProtos.java:24464) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom > (YarnServiceProtos.java:24568) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom > (YarnServiceProtos.java:24563) at > com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141) > at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176) at > com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188) at > com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193) at > com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom > (YarnServiceProtos.java:24739) at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState > (NMLeveldbStateStoreService.java:217) at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState > (NMLeveldbStateStoreService.java:170) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover > (ContainerManagerImpl.java:253) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit > (ContainerManagerImpl.java:237) at > org.apache.hadoop.service.AbstractService.init (AbstractService.java:163) at > org.apache.hadoop.service.CompositeService.serviceInit > (CompositeService.java:107) at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit > (NodeManager.java:255) at org.apache.hadoop.service.AbstractService.init > (AbstractService.java:163) at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager > (NodeManager.java:474) at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main > (NodeManager.java:521){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery
[ https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549898#comment-16549898 ] Wangda Tan commented on YARN-8242: -- Bulk update: moved all 3.1.1 non-blocker issues, please move back if it is a blocker. > YARN NM: OOM error while reading back the state store on recovery > - > > Key: YARN-8242 > URL: https://issues.apache.org/jira/browse/YARN-8242 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2 >Reporter: Kanwaljeet Sachdev >Priority: Critical > Attachments: YARN-8242.001.patch, YARN-8242.002.patch, > YARN-8242.003.patch > > > On startup the NM reads its state store and builds a list of application in > the state store to process. If the number of applications in the state store > is large and have a lot of "state" connected to it the NM can run OOM and > never get to the point that it can start processing the recovery. > Since it never starts the recovery there is no way for the NM to ever pass > this point. It will require a change in heap size to get the NM started. > > Following is the stack trace > {code:java} > at java.lang.OutOfMemoryError. (OutOfMemoryError.java:48) at > com.google.protobuf.ByteString.copyFrom (ByteString.java:192) at > com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. > (YarnProtos.java:47069) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. > (YarnProtos.java:47014) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom > (YarnProtos.java:47102) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom > (YarnProtos.java:47097) at com.google.protobuf.CodedInputStream.readMessage > (CodedInputStream.java:309) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. > (YarnProtos.java:41016) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. > (YarnProtos.java:40942) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom > (YarnProtos.java:41080) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom > (YarnProtos.java:41075) at com.google.protobuf.CodedInputStream.readMessage > (CodedInputStream.java:309) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto. > (YarnServiceProtos.java:24517) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto. 
> (YarnServiceProtos.java:24464) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom > (YarnServiceProtos.java:24568) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom > (YarnServiceProtos.java:24563) at > com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141) > at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176) at > com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188) at > com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193) at > com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom > (YarnServiceProtos.java:24739) at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState > (NMLeveldbStateStoreService.java:217) at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState > (NMLeveldbStateStoreService.java:170) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover > (ContainerManagerImpl.java:253) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit > (ContainerManagerImpl.java:237) at > org.apache.hadoop.service.AbstractService.init (AbstractService.java:163) at > org.apache.hadoop.service.CompositeService.serviceInit > (CompositeService.java:107) at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit > (NodeManager.java:255) at org.apache.hadoop.service.AbstractService.init > (AbstractService.java:163) at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager > (NodeManager.java:474) at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main > (NodeManager.java:521){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mai
[jira] [Updated] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery
[ https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8242: - Target Version/s: 3.2.0 (was: 3.2.0, 3.1.1) > YARN NM: OOM error while reading back the state store on recovery > - > > Key: YARN-8242 > URL: https://issues.apache.org/jira/browse/YARN-8242 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2 >Reporter: Kanwaljeet Sachdev >Priority: Critical > Attachments: YARN-8242.001.patch, YARN-8242.002.patch, > YARN-8242.003.patch > > > On startup the NM reads its state store and builds a list of application in > the state store to process. If the number of applications in the state store > is large and have a lot of "state" connected to it the NM can run OOM and > never get to the point that it can start processing the recovery. > Since it never starts the recovery there is no way for the NM to ever pass > this point. It will require a change in heap size to get the NM started. > > Following is the stack trace > {code:java} > at java.lang.OutOfMemoryError. (OutOfMemoryError.java:48) at > com.google.protobuf.ByteString.copyFrom (ByteString.java:192) at > com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. > (YarnProtos.java:47069) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto. > (YarnProtos.java:47014) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom > (YarnProtos.java:47102) at > org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom > (YarnProtos.java:47097) at com.google.protobuf.CodedInputStream.readMessage > (CodedInputStream.java:309) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. > (YarnProtos.java:41016) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto. > (YarnProtos.java:40942) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom > (YarnProtos.java:41080) at > org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom > (YarnProtos.java:41075) at com.google.protobuf.CodedInputStream.readMessage > (CodedInputStream.java:309) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto. > (YarnServiceProtos.java:24517) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto. 
> (YarnServiceProtos.java:24464) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom > (YarnServiceProtos.java:24568) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom > (YarnServiceProtos.java:24563) at > com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141) > at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176) at > com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188) at > com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193) at > com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49) at > org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom > (YarnServiceProtos.java:24739) at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState > (NMLeveldbStateStoreService.java:217) at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState > (NMLeveldbStateStoreService.java:170) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover > (ContainerManagerImpl.java:253) at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit > (ContainerManagerImpl.java:237) at > org.apache.hadoop.service.AbstractService.init (AbstractService.java:163) at > org.apache.hadoop.service.CompositeService.serviceInit > (CompositeService.java:107) at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit > (NodeManager.java:255) at org.apache.hadoop.service.AbstractService.init > (AbstractService.java:163) at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager > (NodeManager.java:474) at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main > (NodeManager.java:521){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8015) Support inter-app placement constraints in AppPlacementAllocator
[ https://issues.apache.org/jira/browse/YARN-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8015: - Target Version/s: 3.2.0 (was: 3.1.2) > Support inter-app placement constraints in AppPlacementAllocator > > > Key: YARN-8015 > URL: https://issues.apache.org/jira/browse/YARN-8015 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Critical > Attachments: YARN-8015.001.patch, YARN-8015.002.patch > > > AppPlacementAllocator currently only supports intra-app anti-affinity > placement constraints, once YARN-8002 and YARN-8013 are resolved, it needs to > support inter-app constraints too. Also, this may require some refactoring on > the existing code logic. Use this JIRA to track. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8015) Support inter-app placement constraints in AppPlacementAllocator
[ https://issues.apache.org/jira/browse/YARN-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549897#comment-16549897 ] Wangda Tan commented on YARN-8015: -- Bulk update: moved all 3.1.1 non-blocker issues, please move back if it is a blocker. > Support inter-app placement constraints in AppPlacementAllocator > > > Key: YARN-8015 > URL: https://issues.apache.org/jira/browse/YARN-8015 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Critical > Attachments: YARN-8015.001.patch, YARN-8015.002.patch > > > AppPlacementAllocator currently only supports intra-app anti-affinity > placement constraints, once YARN-8002 and YARN-8013 are resolved, it needs to > support inter-app constraints too. Also, this may require some refactoring on > the existing code logic. Use this JIRA to track. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8015) Support inter-app placement constraints in AppPlacementAllocator
[ https://issues.apache.org/jira/browse/YARN-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8015: - Target Version/s: 3.1.2 (was: 3.1.1) > Support inter-app placement constraints in AppPlacementAllocator > > > Key: YARN-8015 > URL: https://issues.apache.org/jira/browse/YARN-8015 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Critical > Attachments: YARN-8015.001.patch, YARN-8015.002.patch > > > AppPlacementAllocator currently only supports intra-app anti-affinity > placement constraints, once YARN-8002 and YARN-8013 are resolved, it needs to > support inter-app constraints too. Also, this may require some refactoring on > the existing code logic. Use this JIRA to track. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8418) App local logs could be leaked if log aggregation fails to initialize for the app

[ https://issues.apache.org/jira/browse/YARN-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8418: - Target Version/s: 3.1.2 (was: 3.1.1) > App local logs could be leaked if log aggregation fails to initialize for the app > -- > > Key: YARN-8418 > URL: https://issues.apache.org/jira/browse/YARN-8418 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-8418.001.patch, YARN-8418.002.patch, > YARN-8418.003.patch, YARN-8418.004.patch > > > If log aggregation fails to init the createApp directory, container logs could get > leaked in the NM directory. > For a long running application this can happen when the NM is restarted after token renewal, or when an application is submitted with an invalid delegation token. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8544) [DS] AM registration fails when hadoop authorization is enabled
[ https://issues.apache.org/jira/browse/YARN-8544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549715#comment-16549715 ] Wangda Tan commented on YARN-8544: -- [~bibinchundatt], thanks for working on the issue. Is this a regression in 3.1.x? If yes, could u add a link to the patch that breaks the behavior. If no, given that this patch adds new fields, which I think makes it a new feature, should we move it to 3.2.0 to de-risk the 3.1.1 release? > [DS] AM registration fails when hadoop authorization is enabled > --- > > Key: YARN-8544 > URL: https://issues.apache.org/jira/browse/YARN-8544 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Blocker > Attachments: YARN-8544.001.patch > > > Application master fails to register when hadoop authorization is enabled. > DistributedSchedulingAMProtocol connection authorization fails at the RM side. > Issue credits: [~BilwaST] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8541) RM startup failure on recovery after user deletion
[ https://issues.apache.org/jira/browse/YARN-8541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549710#comment-16549710 ] Wangda Tan commented on YARN-8541: -- Thanks [~bibinchundatt] for reporting this issue, so what is the behavior after this patch? We're going to fail individual app instead of whole RM, correct? [~sunilg], could u help to review and get this patch committed? > RM startup failure on recovery after user deletion > -- > > Key: YARN-8541 > URL: https://issues.apache.org/jira/browse/YARN-8541 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: yimeng >Assignee: Bibin A Chundatt >Priority: Blocker > Attachments: YARN-8541.001.patch, YARN-8541.002.patch > > > My hadoop version 3.1.0. I found that a problem RM startup failure on > recovery as the follow test step: > 1.create a user "user1" have the permisson to submit app. > 2.use user1 to submit a job ,wait job finished. > 3.delete user "user1" > 4.restart yarn > 5.the RM restart failed > RM logs: > 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized root queue > root: numChildQueue= 3, capacity=1.0, absoluteCapacity=1.0, > usedResources=usedCapacity=0.0, numApps=0, > numContainers=0 | CapacitySchedulerQueueManager.java:163 > 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized queue > mappings, override: false | UserGroupMappingPlacementRule.java:232 > 2018-07-16 16:24:59,708 | INFO | main-EventThread | Initialized > CapacityScheduler with calculator=class > org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, > minimumAllocation=<>, maximumAllocation=< vCores:32>>, asynchronousScheduling=false, asyncScheduleInterval=5ms | > CapacityScheduler.java:392 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | dynamic-resources.xml not > found | Configuration.java:2767 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | Initializing AMS > Processing chain. Root > Processor=[org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor]. > | AMSProcessingChain.java:62 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | disabled placement > handler will be used, all scheduling requests will be rejected. | > ApplicationMasterService.java:130 > 2018-07-16 16:24:59,709 | INFO | main-EventThread | Adding > [org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor] > tp top of AMS Processing chain. 
| AMSProcessingChain.java:75 > 2018-07-16 16:24:59,713 | WARN | main-EventThread | Exception handling the > winning of election | ActiveStandbyElector.java:897 > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:893) > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:728) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:600) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325) > at > org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application > application_1531624956005_0001 submitted by user super reason: No groups > found for user super > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1204) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1245) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1241) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1241) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToAc
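To make the question above concrete, here is a minimal sketch of wrapping per-application recovery so that one unresolvable user does not abort the RM's transition to active, under the assumption that the agreed behavior is to fail only the offending application; AppState, AppManager and the method names are hypothetical stand-ins, not the RMAppManager API.
{code:java}
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PerAppRecoverySketch {
  private static final Logger LOG = LoggerFactory.getLogger(PerAppRecoverySketch.class);

  /** Hypothetical stand-ins for the recovered application state and the submit path. */
  interface AppState { String getApplicationId(); }
  interface AppManager {
    void recoverApplication(AppState state) throws Exception; // may fail, e.g. "No groups found for user"
    void failApplication(AppState state, String diagnostics);
  }

  public static void recoverAll(AppManager manager, List<AppState> storedApps) {
    for (AppState state : storedApps) {
      try {
        manager.recoverApplication(state);
      } catch (Exception e) {
        // Fail just this application instead of aborting the RM's transition to active.
        LOG.warn("Failed to recover {}; marking it as failed", state.getApplicationId(), e);
        manager.failApplication(state, "Recovery failed: " + e.getMessage());
      }
    }
  }
}
{code}
The real patch may instead reject the app earlier in the submit path; the point is only that the exception is contained per application.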
[jira] [Commented] (YARN-7974) Allow updating application tracking url after registration
[ https://issues.apache.org/jira/browse/YARN-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549668#comment-16549668 ] Wangda Tan commented on YARN-7974: -- LGTM +1, thanks [~jhung] for the patch. If Jenkins comes with green color, I will commit the patch by tomorrow if no objections. > Allow updating application tracking url after registration > -- > > Key: YARN-7974 > URL: https://issues.apache.org/jira/browse/YARN-7974 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Attachments: YARN-7974.001.patch, YARN-7974.002.patch, > YARN-7974.003.patch, YARN-7974.004.patch, YARN-7974.005.patch, > YARN-7974.006.patch > > > Normally an application's tracking url is set on AM registration. We have a > use case for updating the tracking url after registration (e.g. the UI is > hosted on one of the containers). > Approach is for AM to update tracking url on heartbeat to RM, and add related > API in AMRMClient. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
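To illustrate the intended usage, a minimal AM-side sketch follows; it assumes the new call surfaces on AMRMClient as updateTrackingUrl(String) as the description suggests, and the host names and ports are placeholders.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class TrackingUrlUpdateExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    AMRMClient<AMRMClient.ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();

    // Register with a placeholder tracking URL first.
    rmClient.registerApplicationMaster("am-host", 0, "http://placeholder:8080");

    // Later, once the container hosting the UI is known, switch the URL.
    String uiAddress = "http://ui-container-host:31337"; // hypothetical address
    rmClient.updateTrackingUrl(uiAddress); // assumed new API from this JIRA

    // The update rides along on the next heartbeat to the RM.
    rmClient.allocate(0.1f);
  }
}
{code}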
[jira] [Updated] (YARN-8545) YARN native service should return container if launch failed
[ https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8545: - Description: In some cases, container launch may fail but container will not be properly returned to RM. This could happen when AM trying to prepare container launch context but failed w/o sending container launch context to NM (Once container launch context is sent to NM, NM will report failed container to RM). Exception like: {code:java} java.io.FileNotFoundException: File does not exist: hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591) at org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388) at org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253) at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152) at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745){code} And even after container launch context prepare failed, AM still trying to monitor container's readiness: {code:java} 2018-07-17 18:42:57,518 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: primary-worker-0: IP is not available yet" ...{code} was: In some cases, container launch may fail but container will not be properly returned to RM. This could happen when AM trying to prepare container launch context but failed w/o sending container launch context to NM (Once container launch context is sent to NM, NM will report failed container to RM). 
Exception like: {code:java} java.io.FileNotFoundException: File does not exist: hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591) at org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388) at org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253) at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152) at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745){code} > YARN native service should return container if launch failed > > > Key: YARN-8545 > URL: https://issues.apache.org/jira/browse/YARN-8545 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Priority: Critical > > In some cases, container launch may fail but container will not be properly > returned to RM. > This could happen when AM trying to prepare container launch context but > failed w/o sending container launch context to NM (Once container launch > context is sent to NM, NM will report failed container to RM). > Exception like: > {code:java} > java.io.FileNotFoundException: File does not exist: > hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583) > at > org.a
[jira] [Created] (YARN-8545) YARN native service should return container if launch failed
Wangda Tan created YARN-8545: Summary: YARN native service should return container if launch failed Key: YARN-8545 URL: https://issues.apache.org/jira/browse/YARN-8545 Project: Hadoop YARN Issue Type: Task Reporter: Wangda Tan In some cases, container launch may fail but container will not be properly returned to RM. This could happen when AM trying to prepare container launch context but failed w/o sending container launch context to NM (Once container launch context is sent to NM, NM will report failed container to RM). Exception like: {code:java} java.io.FileNotFoundException: File does not exist: hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591) at org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388) at org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253) at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152) at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
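A minimal sketch of the fix idea, not the service AM's actual ContainerLaunchService: if building the launch context throws before anything reaches the NM, hand the container back to the RM explicitly so it is not leaked. The SafeContainerLauncher class and its method names are illustrative.
{code:java}
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;

public class SafeContainerLauncher {
  private final AMRMClientAsync<?> amRMClient;
  private final NMClientAsync nmClient;

  public SafeContainerLauncher(AMRMClientAsync<?> amRMClient, NMClientAsync nmClient) {
    this.amRMClient = amRMClient;
    this.nmClient = nmClient;
  }

  /** Launches a container, returning it to the RM if launch-context preparation fails. */
  public void launch(Container container) {
    ContainerLaunchContext ctx;
    try {
      ctx = buildContainerLaunchContext(container); // may throw, e.g. FileNotFoundException on HDFS
    } catch (IOException e) {
      // The launch context never reached the NM, so the NM will never report this
      // container as failed. Give it back to the RM explicitly.
      amRMClient.releaseAssignedContainer(container.getId());
      return;
    }
    nmClient.startContainerAsync(container, ctx);
  }

  private ContainerLaunchContext buildContainerLaunchContext(Container container) throws IOException {
    // Placeholder for the localization / script generation done by the service AM.
    throw new IOException("File does not exist: ...");
  }
}
{code}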
[jira] [Updated] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration
[ https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8361: - Target Version/s: 3.2.0 (was: 3.2.0, 3.1.1) > Change App Name Placement Rule to use App Name instead of App Id for > configuration > -- > > Key: YARN-8361 > URL: https://issues.apache.org/jira/browse/YARN-8361 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8361.001.patch, YARN-8361.002.patch, > YARN-8361.003.patch > > > in YARN-8016, we expose a framework to let user specify custom placement rule > through CS configuration, and also add a new placement rule which is mapping > specific app with queues. However, the strategy implemented in YARN-8016 was > using application id which is hard for user to use this config. In this JIRA, > we are changing the mapping to use application name. More specifically, > 1. AppNamePlacementRule used app id while specifying queue mapping placement > rules, should change to app name > 2. Change documentation as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration
[ https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8361: - Fix Version/s: 3.2.0 > Change App Name Placement Rule to use App Name instead of App Id for > configuration > -- > > Key: YARN-8361 > URL: https://issues.apache.org/jira/browse/YARN-8361 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8361.001.patch, YARN-8361.002.patch, > YARN-8361.003.patch > > > in YARN-8016, we expose a framework to let user specify custom placement rule > through CS configuration, and also add a new placement rule which is mapping > specific app with queues. However, the strategy implemented in YARN-8016 was > using application id which is hard for user to use this config. In this JIRA, > we are changing the mapping to use application name. More specifically, > 1. AppNamePlacementRule used app id while specifying queue mapping placement > rules, should change to app name > 2. Change documentation as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration
[ https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545572#comment-16545572 ] Wangda Tan commented on YARN-8361: -- Thanks [~Zian Chen] for the patch and thanks reviews from [~suma.shivaprasad], just pushed to trunk. This patch doesn't apply to branch-3.1, could u update the patch to branch-3.1? > Change App Name Placement Rule to use App Name instead of App Id for > configuration > -- > > Key: YARN-8361 > URL: https://issues.apache.org/jira/browse/YARN-8361 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8361.001.patch, YARN-8361.002.patch, > YARN-8361.003.patch > > > in YARN-8016, we expose a framework to let user specify custom placement rule > through CS configuration, and also add a new placement rule which is mapping > specific app with queues. However, the strategy implemented in YARN-8016 was > using application id which is hard for user to use this config. In this JIRA, > we are changing the mapping to use application name. More specifically, > 1. AppNamePlacementRule used app id while specifying queue mapping placement > rules, should change to app name > 2. Change documentation as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8524) Single parameter Resource / LightWeightResource constructor looks confusing
[ https://issues.apache.org/jira/browse/YARN-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545534#comment-16545534 ] Wangda Tan commented on YARN-8524: -- Thanks [~snemeth] for the patch, LGTM too. Will commit soon. > Single parameter Resource / LightWeightResource constructor looks confusing > --- > > Key: YARN-8524 > URL: https://issues.apache.org/jira/browse/YARN-8524 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8524.001.patch, YARN-8524.002.patch, > YARN-8524.003.patch > > > The single parameter (long) constructor in Resource / LightWeightResource > sets all resource components to the same value. > Since there are other constructors in these classes with (long, int) > parameters where the semantics are different, it could be confusing for the > users. > The perfect place to create such a resource would be in the Resources class, > with a method named like "createResourceWithSameValue". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
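A sketch of the helper suggested in the description, placed in a factory-style class; the method name comes from the JIRA text, while the body (iterating the known resource types and setting each to the same value) is an assumption about how the final patch might look.
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceInformation;

public final class ResourceFactory {
  private ResourceFactory() {
  }

  /**
   * Creates a Resource whose every known resource type (memory, vcores and any
   * configured custom types) is set to the same value, making the intent explicit
   * at the call site instead of relying on a single-argument constructor.
   */
  public static Resource createResourceWithSameValue(long value) {
    // Note: vcores is an int in the public API, so the value is narrowed here.
    Resource resource = Resource.newInstance(value, (int) value);
    for (ResourceInformation info : resource.getResources()) {
      resource.setResourceValue(info.getName(), value);
    }
    return resource;
  }
}
{code}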
[jira] [Commented] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration
[ https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543711#comment-16543711 ] Wangda Tan commented on YARN-8361: -- LGTM +1, thanks [~Zian Chen] and reviews from [~suma.shivaprasad]. Will commit today if no objections. > Change App Name Placement Rule to use App Name instead of App Id for > configuration > -- > > Key: YARN-8361 > URL: https://issues.apache.org/jira/browse/YARN-8361 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Attachments: YARN-8361.001.patch, YARN-8361.002.patch, > YARN-8361.003.patch > > > in YARN-8016, we expose a framework to let user specify custom placement rule > through CS configuration, and also add a new placement rule which is mapping > specific app with queues. However, the strategy implemented in YARN-8016 was > using application id which is hard for user to use this config. In this JIRA, > we are changing the mapping to use application name. More specifically, > 1. AppNamePlacementRule used app id while specifying queue mapping placement > rules, should change to app name > 2. Change documentation as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized
[ https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543689#comment-16543689 ] Wangda Tan commented on YARN-8513: -- [~cyfdecyf], I couldn't find the error message on the latest codebase. Not sure if this still a problem in latest release (3.1.0). We have many fixes in the last several months for CapacityScheduler scheduling after YARN-5139, I believe many of them are not backported to 2.9.1. Could u check if the problem still exists in 3.1.0 if possible? > CapacityScheduler infinite loop when queue is near fully utilized > - > > Key: YARN-8513 > URL: https://issues.apache.org/jira/browse/YARN-8513 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.9.1 > Environment: Ubuntu 14.04.5 > YARN is configured with one label and 5 queues. >Reporter: Chen Yufei >Priority: Major > > ResourceManager does not respond to any request when queue is near fully > utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM > restart, it can recover running jobs and start accepting new ones. > > Seems like CapacityScheduler is in an infinite loop printing out the > following log messages (more than 25,000 lines in a second): > > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.99816763 > absoluteUsedCapacity=0.99816763 used= > cluster=}} > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal}} > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1530619767030_1652_01 > container=null > queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943 > clusterResource= type=NODE_LOCAL > requestedPartition=}} > > I encounter this problem several times after upgrading to YARN 2.9.1, while > the same configuration works fine under version 2.7.3. > > YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a > similar problem. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543683#comment-16543683 ] Wangda Tan commented on YARN-8511: -- Thanks [~cheersyang], The latest patch LGTM, +1. Will commit today if no objections. [~asuresh] / [~kkaranasos], wanna take a look before commit? > When AM releases a container, RM removes allocation tags before it is > released by NM > > > Key: YARN-8511 > URL: https://issues.apache.org/jira/browse/YARN-8511 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Attachments: YARN-8511.001.patch, YARN-8511.002.patch, > YARN-8511.003.patch, YARN-8511.004.patch > > > User leverages PC with allocation tags to avoid port conflicts between apps, > we found sometimes they still get port conflicts. This is a similar issue > like YARN-4148. Because RM immediately removes allocation tags once > AM#allocate asks to release a container, however container on NM has some > delay until it actually gets killed and released the port. We should let RM > remove allocation tags AFTER NM confirms the containers are released. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop
[ https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542363#comment-16542363 ] Wangda Tan commented on YARN-8135: -- Added Google doc link to Design doc. > Hadoop {Submarine} Project: Simple and scalable deployment of deep learning > training / serving jobs on Hadoop > - > > Key: YARN-8135 > URL: https://issues.apache.org/jira/browse/YARN-8135 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Attachments: YARN-8135.poc.001.patch > > > Description: > *Goals:* > - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs > on YARN. > - Allow jobs easy access data/models in HDFS and other storages. > - Can launch services to serve Tensorflow/MXNet models. > - Support run distributed Tensorflow jobs with simple configs. > - Support run user-specified Docker images. > - Support specify GPU and other resources. > - Support launch tensorboard if user specified. > - Support customized DNS name for roles (like tensorboard.$user.$domain:6006) > *Why this name?* > - Because Submarine is the only vehicle can let human to explore deep > places. B-) > h3. {color:#FF}Please refer to on-going design doc, and add your > thoughts: > {color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8330) An extra container got launched by RM for yarn-service
[ https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542327#comment-16542327 ] Wangda Tan edited comment on YARN-8330 at 7/12/18 11:35 PM: Trying to remember this issue, and post it here before forgot again: - The issue is caused by we send container information to ATS inside RMContainerImpl's constructor, we should not do it: {code} // If saveNonAMContainerMetaInfo is true, store system metrics for all // containers. If false, and if this container is marked as the AM, metrics // will still be published for this container, but that calculation happens // later. if (saveNonAMContainerMetaInfo && null != container.getId()) { rmContext.getSystemMetricsPublisher().containerCreated( this, this.creationTime); } {code} was (Author: leftnoteasy): Trying to remember this issue, and post it here before forgot again: - The issue is caused by we send container information to ATS inside RMContainerImpl's constructor, we should not do it: {code} // If saveNonAMContainerMetaInfo is true, store system metrics for all // containers. If false, and if this container is marked as the AM, metrics // will still be published for this container, but that calculation happens // later. if (saveNonAMContainerMetaInfo && null != container.getId()) { rmContext.getSystemMetricsPublisher().containerCreated( this, this.creationTime); } {code > An extra container got launched by RM for yarn-service > -- > > Key: YARN-8330 > URL: https://issues.apache.org/jira/browse/YARN-8330 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Yesha Vora >Assignee: Suma Shivaprasad >Priority: Critical > > Steps: > launch Hbase tarball app > list containers for hbase tarball app > {code} > /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list > appattempt_1525463491331_0006_01 > WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of > YARN_LOG_DIR. > WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of > YARN_LOGFILE. > WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of > YARN_PID_DIR. > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 
> 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History > server at xxx/xxx:10200 > 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Total number of containers :5 > Container-IdStart Time Finish Time > StateHost Node Http Address >LOG-URL > container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa > 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03 > Fri May 04 22:34:26 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01 > Fri May 04 22:34:15 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05 > Fri May 04 22:34:56 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa > 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04 > Fri May 04 22:34:56 + 2018 N/A > nullxxx:25454 http://xxx:8042 > http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code} > Total expected containers = 4 ( 3 components container + 1 am). Instead, RM > is listing 5 containers. > container_e06_1525463491331_0006_01_04 is in null state. > Yarn service utilized container 02, 03, 05 for component. There is no log > available in NM & AM related to container 04. Only one line in RM log is > printed > {code}
[jira] [Commented] (YARN-8330) An extra container got launched by RM for yarn-service
[ https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542327#comment-16542327 ] Wangda Tan commented on YARN-8330: -- Trying to remember this issue, and post it here before forgot again: - The issue is caused by we send container information to ATS inside RMContainerImpl's constructor, we should not do it: {code} // If saveNonAMContainerMetaInfo is true, store system metrics for all // containers. If false, and if this container is marked as the AM, metrics // will still be published for this container, but that calculation happens // later. if (saveNonAMContainerMetaInfo && null != container.getId()) { rmContext.getSystemMetricsPublisher().containerCreated( this, this.creationTime); } {code > An extra container got launched by RM for yarn-service > -- > > Key: YARN-8330 > URL: https://issues.apache.org/jira/browse/YARN-8330 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Yesha Vora >Assignee: Suma Shivaprasad >Priority: Critical > > Steps: > launch Hbase tarball app > list containers for hbase tarball app > {code} > /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list > appattempt_1525463491331_0006_01 > WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of > YARN_LOG_DIR. > WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of > YARN_LOGFILE. > WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of > YARN_PID_DIR. > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History > server at xxx/xxx:10200 > 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Total number of containers :5 > Container-IdStart Time Finish Time > StateHost Node Http Address >LOG-URL > container_e06_1525463491331_0006_01_02Fri May 04 22:34:26 + 2018 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_02/hrt_qa > 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_03 > Fri May 04 22:34:26 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_03/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_01 > Fri May 04 22:34:15 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_01/hrt_qa > 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_05 > Fri May 04 22:34:56 + 2018 N/A > RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_05/hrt_qa > 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - > run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_04 > Fri May 04 22:34:56 + 2018 N/A > nullxxx:25454 http://xxx:8042 > http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_04/container_e06_1525463491331_0006_01_04/hrt_qa{code} > Total expected containers = 4 ( 3 components container + 1 am). Instead, RM > is listing 5 containers. > container_e06_1525463491331_0006_01_04 is in null state. 
> Yarn service utilized container 02, 03, 05 for component. There is no log > available in NM & AM related to container 04. Only one line in RM log is > printed > {code} > 2018-05-04 22:34:56,618 INFO rmcontainer.RMContainerImpl > (RMContainerImpl.java:handle(489)) - > container_e06_1525463491331_0006_01_04 Container Transitioned from NEW to > RESERVED{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
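To show the shape of the suggestion above (publishing from the constructor makes even transient, e.g. RESERVED-only, containers visible to ATS), here is a small self-contained sketch; SystemMetricsPublisher and RMContext are reduced to hypothetical stand-in interfaces, and the real fix inside RMContainerImpl may look quite different.
{code:java}
public class ContainerMetricsPublishSketch {
  /** Hypothetical stand-ins for the real YARN types, to keep the sketch self-contained. */
  interface SystemMetricsPublisher {
    void containerCreated(Object container, long creationTime);
  }

  interface RMContext {
    SystemMetricsPublisher getSystemMetricsPublisher();
  }

  static class TrackedContainer {
    private final RMContext rmContext;
    private final long creationTime = System.currentTimeMillis();

    TrackedContainer(RMContext rmContext) {
      this.rmContext = rmContext;
      // Intentionally no publishing here: constructing the object (for example for a
      // short-lived RESERVED proposal) should not advertise a "real" container to ATS.
    }

    /** Invoked by the state machine only once the container is actually allocated. */
    void onAllocated() {
      rmContext.getSystemMetricsPublisher().containerCreated(this, creationTime);
    }
  }
}
{code}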
[jira] [Assigned] (YARN-8522) Application fails with InvalidResourceRequestException
[ https://issues.apache.org/jira/browse/YARN-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-8522: Assignee: Zian Chen > Application fails with InvalidResourceRequestException > -- > > Key: YARN-8522 > URL: https://issues.apache.org/jira/browse/YARN-8522 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Zian Chen >Priority: Major > > Launch multiple streaming app simultaneously. Here, sometimes one of the > application fails with below stack trace. > {code} > 18/07/02 07:14:32 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Call From xx.xx.xx.xx/xx.xx.xx.xx to > xx.xx.xx.xx:8032 failed on connection exception: java.net.ConnectException: > Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.submitApplication over null. Retrying > after sleeping for 3ms. > 18/07/02 07:14:32 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception: > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request, only one resource request with * is allowed > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:502) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:389) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:320) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:645) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > on [rm2], so propagating back to caller. 
> 18/07/02 07:14:32 INFO mapreduce.JobSubmitter: Cleaning up the staging area > /user/hrt_qa/.staging/job_1530515284077_0007 > 18/07/02 07:14:32 ERROR streaming.StreamJob: Error Launching job : > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request, only one resource request with * is allowed > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:502) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:389) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:320) > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:645) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > Streaming Command Failed!{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7556) Fair scheduler configuration should allow resource types in the minResources and maxResources properties
[ https://issues.apache.org/jira/browse/YARN-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541999#comment-16541999 ] Wangda Tan commented on YARN-7556: -- Thanks [~snemeth] > Fair scheduler configuration should allow resource types in the minResources > and maxResources properties > > > Key: YARN-7556 > URL: https://issues.apache.org/jira/browse/YARN-7556 > Project: Hadoop YARN > Issue Type: Sub-task > Components: fairscheduler >Affects Versions: 3.0.0-beta1 >Reporter: Daniel Templeton >Assignee: Szilard Nemeth >Priority: Critical > Fix For: 3.2.0 > > Attachments: YARN-7556.001.patch, YARN-7556.002.patch, > YARN-7556.003.patch, YARN-7556.004.patch, YARN-7556.005.patch, > YARN-7556.006.patch, YARN-7556.007.patch, YARN-7556.008.patch, > YARN-7556.009.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541998#comment-16541998 ] Wangda Tan commented on YARN-7481: -- [~qinc...@microsoft.com], is there any detailed plan for how to better integrate this with resource types (see my previous comment)? > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPUs as a > countable resource. > However, GPU placement is also very important to deep learning jobs for better > efficiency. > For example, a 2-GPU job running on GPUs {0,1} could be faster than one running on GPUs > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add support to Hadoop 2.7.2 to enable GPU locality scheduling, which > supports fine-grained GPU placement. > A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node): a '1' in the > corresponding bit position means the GPU is available, and '0' otherwise. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
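Since the description is about picking GPU bits that share a PCI-E switch, a small worked example of the bitmap manipulation is below; the two switch masks model a hypothetical 8-GPU box (GPUs 0-3 on one switch, 4-7 on the other) and are not taken from the attached patches.
{code:java}
public final class GpuBitmapAllocator {
  // Hypothetical topology: GPUs 0-3 share one PCI-E switch, GPUs 4-7 share another.
  private static final long[] SWITCH_MASKS = {0x0FL, 0xF0L};

  private GpuBitmapAllocator() {
  }

  /**
   * Picks {@code count} GPUs from {@code freeBitmap} ('1' = available), preferring GPUs
   * that sit under the same PCI-E switch. Returns a bitmap of the chosen GPUs, or 0 if
   * the request cannot be satisfied.
   */
  public static long assign(long freeBitmap, int count) {
    // First pass: try to satisfy the request inside a single switch.
    for (long mask : SWITCH_MASKS) {
      long candidate = pickLowestBits(freeBitmap & mask, count);
      if (candidate != 0) {
        return candidate;
      }
    }
    // Fallback: take any free GPUs, accepting cross-switch placement.
    return pickLowestBits(freeBitmap, count);
  }

  private static long pickLowestBits(long bits, int count) {
    if (Long.bitCount(bits) < count) {
      return 0L;
    }
    long picked = 0L;
    for (int i = 0; i < count; i++) {
      long lowest = Long.lowestOneBit(bits);
      picked |= lowest;
      bits &= ~lowest;
    }
    return picked;
  }

  public static void main(String[] args) {
    // GPUs 0 and 7 are busy; request 2 GPUs.
    long free = 0xFFL & ~(1L | (1L << 7));
    System.out.printf("assigned bitmap = 0x%02X%n", assign(free, 2)); // expect 0x06 (GPUs 1 and 2)
  }
}
{code}
Running the main method assigns GPUs 1 and 2 (bitmap 0x06) when GPUs 0 and 7 are busy, matching the "same switch is faster than {0, 7}" example in the description.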
[jira] [Commented] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541973#comment-16541973 ] Wangda Tan commented on YARN-8511: -- Thanks [~cheersyang] for the explanation. I completely missed YARN-4148, so #1 no longer looks like a problem. #2 still exists; I think we can handle it separately. Regarding the implementation, I see that the only purpose of the changes added to RMNode and its subclasses is to access the tags manager. Instead of doing that, would it be better to pass RMContext to SchedulerNode, so SchedulerNode can directly access RMContext like SchedulerAppAttempt does? (A minimal sketch of this wiring follows this message.) > When AM releases a container, RM removes allocation tags before it is > released by NM > > > Key: YARN-8511 > URL: https://issues.apache.org/jira/browse/YARN-8511 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Attachments: YARN-8511.001.patch, YARN-8511.002.patch > > > User leverages PC with allocation tags to avoid port conflicts between apps, > we found sometimes they still get port conflicts. This is a similar issue > like YARN-4148. Because RM immediately removes allocation tags once > AM#allocate asks to release a container, however container on NM has some > delay until it actually gets killed and released the port. We should let RM > remove allocation tags AFTER NM confirms the containers are released. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
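A minimal sketch of the wiring suggested above, using simplified stand-in types rather than the real YARN classes (RMContextStub, SchedulerNodeStub and getAllocationTagsManager here are stubs for illustration): the node holds a reference to the context and reads the tags manager from it, instead of threading tags-manager accessors through RMNode and its subclasses.
{code}
// Simplified stand-ins for RMContext / SchedulerNode to illustrate the wiring only.
class AllocationTagsManager { /* ... */ }

class RMContextStub {
    private final AllocationTagsManager tagsManager = new AllocationTagsManager();
    AllocationTagsManager getAllocationTagsManager() { return tagsManager; }
}

class SchedulerNodeStub {
    private final RMContextStub rmContext;

    // Pass the context in once, the way SchedulerAppAttempt-style code does,
    // instead of adding tags-manager accessors to RMNode and its subclasses.
    SchedulerNodeStub(RMContextStub rmContext) { this.rmContext = rmContext; }

    void onContainerReleased() {
        AllocationTagsManager tm = rmContext.getAllocationTagsManager();
        // ... remove the container's allocation tags via tm ...
    }
}
{code}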
[jira] [Commented] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541156#comment-16541156 ] Wangda Tan commented on YARN-8480: -- [~cheersyang], what [~templedf] / [~snemeth] proposed is to make the resource behave like a label. For example, a node can report that it has resource: memory=2048,vcore=3,has_java=true, and an AM can request a resource with ...has_java=true. Allocating a container on the node will not update the node's available resource; as long as the node has enough memory and vcores, it can allocate more than one container with the has_java=true resource request. This is an alternative way to represent node labels by using resource types. (A sketch of these semantics follows this message.) > Add boolean option for resources > > > Key: YARN-8480 > URL: https://issues.apache.org/jira/browse/YARN-8480 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Daniel Templeton >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8480.001.patch, YARN-8480.002.patch > > > Make it possible to define a resource with a boolean value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
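A tiny standalone sketch of the semantics described above (illustrative classes only, not the YARN Resource API): countable resources are deducted on allocation, while a boolean "label-like" entry is only matched, so any number of containers requesting has_java=true can land on the node as long as memory and vcores hold out.
{code}
import java.util.Set;

// Illustrative model only: a node with countable resources plus boolean attributes.
class NodeModel {
    long memoryMb = 2048;
    int vcores = 3;
    Set<String> booleanAttrs = Set.of("has_java");   // reported as has_java=true

    // Allocation deducts countable resources but never "consumes" a boolean attribute.
    boolean tryAllocate(long reqMemoryMb, int reqVcores, Set<String> requiredAttrs) {
        if (!booleanAttrs.containsAll(requiredAttrs)) {
            return false;                 // label-style match, FCFS, no accounting
        }
        if (memoryMb < reqMemoryMb || vcores < reqVcores) {
            return false;
        }
        memoryMb -= reqMemoryMb;          // only countable resources are subtracted
        vcores -= reqVcores;
        return true;
    }
}

class BooleanResourceDemo {
    public static void main(String[] args) {
        NodeModel node = new NodeModel();
        // Two containers asking for has_java=true both fit; the attribute is not used up.
        System.out.println(node.tryAllocate(512, 1, Set.of("has_java")));  // true
        System.out.println(node.tryAllocate(512, 1, Set.of("has_java")));  // true
    }
}
{code}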
[jira] [Commented] (YARN-8505) AMLimit and userAMLimit check should be skipped for unmanaged AM
[ https://issues.apache.org/jira/browse/YARN-8505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541155#comment-16541155 ] Wangda Tan commented on YARN-8505: -- [~bibinchundatt] / [~Tao Yang] / [~cheersyang], I would prefer to add a new limit on the #maximum-concurrently-activated-apps within a queue and skip updating the AMLimit when an unmanaged AM is launched. (A rough sketch follows this message.) > AMLimit and userAMLimit check should be skipped for unmanaged AM > > > Key: YARN-8505 > URL: https://issues.apache.org/jira/browse/YARN-8505 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0, 2.9.2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8505.001.patch > > > AMLimit and userAMLimit check in LeafQueue#activateApplications should be > skipped for unmanaged AM whose resource is not taken from YARN cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
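A hedged sketch of the preference expressed above (simplified names, not the actual LeafQueue#activateApplications code; the maximum-concurrently-activated-apps field is a hypothetical new per-queue limit): unmanaged AMs skip the AM-resource accounting but still count toward the concurrency cap.
{code}
// Simplified model of application activation; not the real LeafQueue implementation.
class QueueActivationSketch {
    long amResourceLimitMb = 4096;
    long usedAmResourceMb = 0;
    int maxConcurrentlyActivatedApps = 100;   // hypothetical new per-queue limit
    int activatedApps = 0;

    boolean tryActivate(boolean unmanagedAm, long amResourceMb) {
        if (activatedApps >= maxConcurrentlyActivatedApps) {
            return false;                     // applies to managed and unmanaged AMs alike
        }
        if (!unmanagedAm) {
            // Only managed AMs consume AM-limit headroom; an unmanaged AM takes no
            // resources from the YARN cluster, so the AM-limit check is skipped.
            if (usedAmResourceMb + amResourceMb > amResourceLimitMb) {
                return false;
            }
            usedAmResourceMb += amResourceMb;
        }
        activatedApps++;
        return true;
    }
}
{code}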
[jira] [Commented] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541153#comment-16541153 ] Wangda Tan commented on YARN-8511: -- [~cheersyang], thanks for reporting and working on this issue. This is a valid issue, and we have seen it in other places as well. For example, with exclusively used resource types like GPU, we could allocate a container to a node before the previous container has completed; memory has the same issue. I'm not sure the patch works as is, since {{SchedulerNode#releaseContainer}} could be invoked in scenarios such as an AM releasing a container through the allocate call, or an app attempt finishing. The scheduler could still place a new container on a node before the old one is terminated by the NM. Instead, I think we should have a hook to handle such events inside {{AbstractYarnScheduler#nodeUpdate}} (a sketch of that idea follows this message). However, we still have two issues: 1) If we deduct the resource only after the actual container finishes, it is possible that the scheduler application attempt has already finished. In that case, the scheduler is not able to deduct resources (the scheduler relies on SchedulerApplicationAttempt to locate the RMContainer). I'm not sure whether this impacts allocation tags or not. 2) It is also possible that the NM spends too much time terminating containers; in our docker-in-docker setup, we observed the OS taking several minutes to terminate a container, and the NM could report the container as DONE before it is actually terminated (another bug here). YARN-8508 is caused by that issue. > When AM releases a container, RM removes allocation tags before it is > released by NM > > > Key: YARN-8511 > URL: https://issues.apache.org/jira/browse/YARN-8511 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.1.0 >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Attachments: YARN-8511.001.patch, YARN-8511.002.patch > > > User leverages PC with allocation tags to avoid port conflicts between apps, > we found sometimes they still get port conflicts. This is a similar issue > like YARN-4148. Because RM immediately removes allocation tags once > AM#allocate asks to release a container, however container on NM has some > delay until it actually gets killed and released the port. We should let RM > remove allocation tags AFTER NM confirms the containers are released. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
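A standalone sketch of the idea above, with stand-in types (not YARN's SchedulerNode or AbstractYarnScheduler): an exclusively assigned resource, such as a GPU or an allocation tag, stays marked as in use when the AM releases the container and is only returned to the free pool from the node-heartbeat path, once the NM reports the container as finished.
{code}
import java.util.HashSet;
import java.util.Set;

// Stand-in for per-node bookkeeping; illustrates the release ordering only.
class ExclusiveResourceTracker {
    private final Set<String> inUseByContainer = new HashSet<>();

    void onAllocate(String containerId) {
        inUseByContainer.add(containerId);
    }

    // Called when the AM releases the container via allocate(): do NOT free yet,
    // the container may still hold the GPU / port on the NM.
    void onAmRelease(String containerId) {
        // intentionally a no-op
    }

    // Called from the node-update (heartbeat) path once the NM reports COMPLETE.
    void onNodeReportedComplete(String containerId) {
        inUseByContainer.remove(containerId);   // now the resource is really free
    }

    int inUseCount() {
        return inUseByContainer.size();
    }
}
{code}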
[jira] [Commented] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540897#comment-16540897 ] Wangda Tan commented on YARN-8480: -- btw, [~templedf], I know you ran into some trouble supporting the node partition concept in FS (YARN-2497). Node attributes should be much easier to support because there is no resource sharing required under a node attribute; everything is FCFS. Basically, you only need to add an if check before deciding to allocate container X on node Y (see the sketch after this message). > Add boolean option for resources > > > Key: YARN-8480 > URL: https://issues.apache.org/jira/browse/YARN-8480 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Daniel Templeton >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8480.001.patch, YARN-8480.002.patch > > > Make it possible to define a resource with a boolean value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
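A minimal sketch of that if check, with simplified stand-ins (not the FairScheduler or YARN-3409 code; attribute names and types here are illustrative): before placing container X on node Y, test whether the node's attributes satisfy the request, and leave the rest of the allocation path unchanged since attributes carry no shared resource accounting.
{code}
import java.util.Map;

// Simplified stand-ins; real node-attribute support lives under YARN-3409.
class NodeAttributeCheckSketch {
    // Node Y advertises attributes; the request for container X carries required ones.
    static boolean attributesSatisfied(Map<String, String> nodeAttrs,
                                       Map<String, String> requiredAttrs) {
        for (Map.Entry<String, String> e : requiredAttrs.entrySet()) {
            if (!e.getValue().equals(nodeAttrs.get(e.getKey()))) {
                return false;   // the only extra gate before normal FCFS allocation
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> node = Map.of("os", "linux", "java", "8");
        System.out.println(attributesSatisfied(node, Map.of("java", "8")));   // true
        System.out.println(attributesSatisfied(node, Map.of("java", "11")));  // false
    }
}
{code}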
[jira] [Commented] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540895#comment-16540895 ] Wangda Tan commented on YARN-8480: -- [~templedf], if this only changed the Fair Scheduler, I would be fine with that. However, this touches Resource/ResourceInformation/RMAdminCLI/ResourceCalculator/CapacityScheduler implementation, etc. If you take a closer look at the YARN-3409 APIs, it should not be hard at all, and it should definitely be cheaper than adding a new resource type. > Add boolean option for resources > > > Key: YARN-8480 > URL: https://issues.apache.org/jira/browse/YARN-8480 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Daniel Templeton >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8480.001.patch, YARN-8480.002.patch > > > Make it possible to define a resource with a boolean value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7974) Allow updating application tracking url after registration
[ https://issues.apache.org/jira/browse/YARN-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540879#comment-16540879 ] Wangda Tan commented on YARN-7974: -- [~oliverhuh...@gmail.com], [~jhung], thanks for updating the patch; in general the patch looks good. Several minor comments (small sketches follow this message): 1) {code} public abstract void updateTrackingUrl(String trackingUrl); {code} This should have a default (maybe empty) implementation. Given that AMRMClient/Async are public/stable APIs, we don't want the build to break if any app extends these classes. 2) updateTrackingUrl should be marked as public/unstable. 3) Should we explicitly compare the content of the new tracking url with the old tracking url? Right now we only check != null, which may not be enough. {code} 1824 // Update tracking url if changed and save it to state store 1825 String newTrackingUrl = statusUpdateEvent.getTrackingUrl(); 1826 if (newTrackingUrl != null) { {code} > Allow updating application tracking url after registration > -- > > Key: YARN-7974 > URL: https://issues.apache.org/jira/browse/YARN-7974 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Attachments: YARN-7974.001.patch, YARN-7974.002.patch, > YARN-7974.003.patch, YARN-7974.004.patch, YARN-7974.005.patch > > > Normally an application's tracking url is set on AM registration. We have a > use case for updating the tracking url after registration (e.g. the UI is > hosted on one of the containers). > Approach is for AM to update tracking url on heartbeat to RM, and add related > API in AMRMClient. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
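Two small sketches of review comments 1) and 3) above (simplified classes, not the actual AMRMClient or RM code): a default no-op body keeps existing subclasses source-compatible, and comparing the new URL against the stored one avoids persisting an unchanged value.
{code}
import java.util.Objects;

// Comment 1: give the new method a default (empty) body so existing subclasses
// of a public/stable API still compile. Simplified stand-in for AMRMClient.
abstract class AmRmClientSketch {
    public void updateTrackingUrl(String trackingUrl) {
        // default: no-op, so apps extending the old class do not break the build
    }
}

// Comment 3: only persist the tracking URL when its content actually changed.
class TrackingUrlUpdateSketch {
    private String currentTrackingUrl;

    boolean shouldPersist(String newTrackingUrl) {
        return newTrackingUrl != null
            && !Objects.equals(newTrackingUrl, currentTrackingUrl);
    }

    void onStatusUpdate(String newTrackingUrl) {
        if (shouldPersist(newTrackingUrl)) {
            currentTrackingUrl = newTrackingUrl;
            // ... save to the RM state store ...
        }
    }
}
{code}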
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-7481: - Fix Version/s: (was: 2.7.2) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540810#comment-16540810 ] Wangda Tan commented on YARN-8480: -- [~snemeth]/[~templedf], to me we should move this to node attributes: we already have most things ready under the YARN-3409 branch, and I don't think we should duplicate the same work in this JIRA. Things added to the Resource class do not necessarily have to be countable (definition of countable according to the design doc of YARN-3926): {code} When we speak of countable resources, we refer to resourcetypes where the allocation and release of resources is a simple subtraction and addition operation. {code} I'm supportive of adding new resource types like set, range, or hierarchical (like the disk resource isolation mentioned by [~cheersyang]). The pure label resource type should go to node attributes / node partitions. > Add boolean option for resources > > > Key: YARN-8480 > URL: https://issues.apache.org/jira/browse/YARN-8480 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Daniel Templeton >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8480.001.patch, YARN-8480.002.patch > > > Make it possible to define a resource with a boolean value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540501#comment-16540501 ] Wangda Tan commented on YARN-7481: -- [~qinc...@microsoft.com], I see you have kept updating patches over the last several months. Given that the proposed approach conflicts with the community's existing solution, are there any plans to merge this with the community solution? > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539027#comment-16539027 ] Wangda Tan commented on YARN-8512: -- Patch LGTM as well, thanks [~rohithsharma] for the fix. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied
[ https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537854#comment-16537854 ] Wangda Tan commented on YARN-8509: -- And downgraded priority to major, removed 3.1.1 from target version. > Fix UserLimit calculation for preemption to balance scenario after queue > satisfied > > > Key: YARN-8509 > URL: https://issues.apache.org/jira/browse/YARN-8509 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > > In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total > pending resource based on user-limit percent and user-limit factor which will > cap pending resource for each user to the minimum of user-limit pending and > actual pending. This will prevent queue from taking more pending resource and -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
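A rough sketch of the capping described in the YARN-8509 summary above (illustrative fields only; the real LeafQueue#getTotalPendingResourcesConsideringUserLimit operates on Resource objects per partition): each user's contribution to the queue's total pending is capped at the smaller of that user's user-limit-derived pending and the user's actual pending.
{code}
import java.util.Map;

// Illustrative only; models the per-user min() capping discussed in YARN-8509.
class UserLimitPendingSketch {
    // userLimitPending: pending allowed by user-limit percent / factor, per user.
    // actualPending: what the user's apps actually have pending, per user.
    static long totalPendingConsideringUserLimit(Map<String, Long> userLimitPending,
                                                 Map<String, Long> actualPending) {
        long total = 0;
        for (Map.Entry<String, Long> e : actualPending.entrySet()) {
            long cap = userLimitPending.getOrDefault(e.getKey(), 0L);
            total += Math.min(cap, e.getValue());   // per-user cap described above
        }
        return total;
    }
}
{code}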
[jira] [Commented] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied
[ https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537853#comment-16537853 ] Wangda Tan commented on YARN-8509: -- [~Zian Chen], fix version is only set once the patch got committed, "target version" should be set for new issues. > Fix UserLimit calculation for preemption to balance scenario after queue > satisfied > > > Key: YARN-8509 > URL: https://issues.apache.org/jira/browse/YARN-8509 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Critical > > In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total > pending resource based on user-limit percent and user-limit factor which will > cap pending resource for each user to the minimum of user-limit pending and > actual pending. This will prevent queue from taking more pending resource and -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied
[ https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8509: - Priority: Major (was: Critical) > Fix UserLimit calculation for preemption to balance scenario after queue > satisfied > > > Key: YARN-8509 > URL: https://issues.apache.org/jira/browse/YARN-8509 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > > In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total > pending resource based on user-limit percent and user-limit factor which will > cap pending resource for each user to the minimum of user-limit pending and > actual pending. This will prevent queue from taking more pending resource and -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied
[ https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8509: - Target Version/s: 3.2.0, 3.1.2 (was: 3.2.0, 3.1.1) > Fix UserLimit calculation for preemption to balance scenario after queue > satisfied > > > Key: YARN-8509 > URL: https://issues.apache.org/jira/browse/YARN-8509 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > > In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total > pending resource based on user-limit percent and user-limit factor which will > cap pending resource for each user to the minimum of user-limit pending and > actual pending. This will prevent queue from taking more pending resource and -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied
[ https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8509: - Target Version/s: 3.2.0, 3.1.1 > Fix UserLimit calculation for preemption to balance scenario after queue > satisfied > > > Key: YARN-8509 > URL: https://issues.apache.org/jira/browse/YARN-8509 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Critical > > In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total > pending resource based on user-limit percent and user-limit factor which will > cap pending resource for each user to the minimum of user-limit pending and > actual pending. This will prevent queue from taking more pending resource and -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied
[ https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8509: - Fix Version/s: (was: 3.1.1) (was: 3.2.0) > Fix UserLimit calculation for preemption to balance scenario after queue > satisfied > > > Key: YARN-8509 > URL: https://issues.apache.org/jira/browse/YARN-8509 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Critical > > In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total > pending resource based on user-limit percent and user-limit factor which will > cap pending resource for each user to the minimum of user-limit pending and > actual pending. This will prevent queue from taking more pending resource and -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537244#comment-16537244 ] Wangda Tan commented on YARN-8506: -- Rebased to latest trunk. (002) > Make GetApplicationsRequestPBImpl thread safe > - > > Key: YARN-8506 > URL: https://issues.apache.org/jira/browse/YARN-8506 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-8506.001.patch, YARN-8506.002.patch > > > When GetApplicationRequestPBImpl is used in multi-thread environment, > exceptions like below will occur because we don't protect write ops. > {code} > java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at java.util.ArrayList.addAll(ArrayList.java:613) > at > com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132) > at > com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69) > {code} > We need to make GetApplicationRequestPBImpl thread safe. We saw the issue > happens frequently when RequestHedgingRMFailoverProxyProvider is being used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8506: - Attachment: YARN-8506.002.patch > Make GetApplicationsRequestPBImpl thread safe > - > > Key: YARN-8506 > URL: https://issues.apache.org/jira/browse/YARN-8506 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-8506.001.patch, YARN-8506.002.patch > > > When GetApplicationRequestPBImpl is used in multi-thread environment, > exceptions like below will occur because we don't protect write ops. > {code} > java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at java.util.ArrayList.addAll(ArrayList.java:613) > at > com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132) > at > com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69) > {code} > We need to make GetApplicationRequestPBImpl thread safe. We saw the issue > happens frequently when RequestHedgingRMFailoverProxyProvider is being used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8506: - Priority: Critical (was: Blocker) > Make GetApplicationsRequestPBImpl thread safe > - > > Key: YARN-8506 > URL: https://issues.apache.org/jira/browse/YARN-8506 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-8506.001.patch > > > When GetApplicationRequestPBImpl is used in multi-thread environment, > exceptions like below will occur because we don't protect write ops. > {code} > java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at java.util.ArrayList.addAll(ArrayList.java:613) > at > com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132) > at > com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69) > {code} > We need to make GetApplicationRequestPBImpl thread safe. We saw the issue > happens frequently when RequestHedgingRMFailoverProxyProvider is being used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8506: - Attachment: YARN-8506.001.patch > Make GetApplicationsRequestPBImpl thread safe > - > > Key: YARN-8506 > URL: https://issues.apache.org/jira/browse/YARN-8506 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-8506.001.patch > > > When GetApplicationRequestPBImpl is used in multi-thread environment, > exceptions like below will occur because we don't protect write ops. > {code} > java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at java.util.ArrayList.addAll(ArrayList.java:613) > at > com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132) > at > com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69) > {code} > We need to make GetApplicationRequestPBImpl thread safe. We saw the issue > happens frequently when RequestHedgingRMFailoverProxyProvider is being used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8506) Make GetApplicationsRequestPBImpl thread safe
Wangda Tan created YARN-8506: Summary: Make GetApplicationsRequestPBImpl thread safe Key: YARN-8506 URL: https://issues.apache.org/jira/browse/YARN-8506 Project: Hadoop YARN Issue Type: Task Reporter: Wangda Tan Assignee: Wangda Tan When GetApplicationRequestPBImpl is used in multi-thread environment, exceptions like below will occur because we don't protect write ops. {code} java.lang.ArrayIndexOutOfBoundsException at java.lang.System.arraycopy(Native Method) at java.util.ArrayList.addAll(ArrayList.java:613) at com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:132) at com.google.protobuf.LazyStringArrayList.addAll(LazyStringArrayList.java:123) at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:327) at org.apache.hadoop.yarn.proto.YarnServiceProtos$GetApplicationsRequestProto$Builder.addAllApplicationTags(YarnServiceProtos.java:24450) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToBuilder(GetApplicationsRequestPBImpl.java:100) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.mergeLocalToProto(GetApplicationsRequestPBImpl.java:78) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl.getProto(GetApplicationsRequestPBImpl.java:69) {code} We need to make GetApplicationRequestPBImpl thread safe. We saw the issue happens frequently when RequestHedgingRMFailoverProxyProvider is being used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
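A minimal standalone sketch of the fix direction for YARN-8506 (not the attached patch): a PBImpl-style class that merges local fields into a shared builder on read is made safe for concurrent callers by synchronizing the methods that touch the builder. The "builder" is modeled here with a plain list so the example stays self-contained.
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Model of the PBImpl pattern: local fields are merged into a shared builder on read.
// Without synchronization, two threads calling getProto() can race inside the merge
// (the kind of corruption behind the ArrayIndexOutOfBoundsException above).
class GetApplicationsRequestModel {
    private final Set<String> applicationTags = new HashSet<>();
    private final List<String> builderTags = new ArrayList<>();   // stands in for the proto builder

    public synchronized void setApplicationTags(Set<String> tags) {
        applicationTags.clear();
        applicationTags.addAll(tags);
    }

    // All paths that write into the "builder" go through synchronized methods.
    public synchronized List<String> getProto() {
        mergeLocalToBuilder();
        return new ArrayList<>(builderTags);
    }

    private void mergeLocalToBuilder() {
        builderTags.clear();
        builderTags.addAll(applicationTags);   // unsafe if two threads race here
    }
}
{code}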