[jira] [Comment Edited] (YARN-8579) New AM attempt could not retrieve previous attempt component data
[ https://issues.apache.org/jira/browse/YARN-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560218#comment-16560218 ] Chandni Singh edited comment on YARN-8579 at 7/27/18 8:01 PM: -- [~gsaha] Thanks for debugging the issue. patch 2 looks good to me. Just a nitpick. Since we use slf4j, we can use it instead of string concatenation in the log stmt {code:java} LOG.info("Containers recovered after AM registered: " + containers); {code} to {code:java} LOG.info("Containers recovered after AM registered: {} ", containers); {code} was (Author: csingh): [~gsaha] Thanks for debugging the issue. patch 2 looks good to me. Just a nitpick. Since we use slf4j, we can use it instead of string concatenation in the log stmt {code:java} LOG.info("Containers recovered after AM registered: ", containers); {code} to {code:java} LOG.info("Containers recovered after AM registered: {} ", containers); {code} > New AM attempt could not retrieve previous attempt component data > - > > Key: YARN-8579 > URL: https://issues.apache.org/jira/browse/YARN-8579 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Yesha Vora >Assignee: Gour Saha >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8579.001.patch, YARN-8579.002.patch > > > Steps: > 1) Launch httpd-docker > 2) Wait for app to be in STABLE state > 3) Run validation for app (It takes around 3 mins) > 4) Stop all Zks > 5) Wait 60 sec > 6) Kill AM > 7) wait for 30 sec > 8) Start all ZKs > 9) Wait for application to finish > 10) Validate expected containers of the app > Expected behavior: > New attempt of AM should start and docker containers launched by 1st attempt > should be recovered by new attempt. > Actual behavior: > New AM attempt starts. It can not recover 1st attempt docker containers. It > can not read component details from ZK. > Thus, it starts new attempt for all containers. 
> {code} > 2018-07-19 22:42:47,595 [main] INFO service.ServiceScheduler - Registering > appattempt_1531977563978_0015_02, fault-test-zkrm-httpd-docker into > registry > 2018-07-19 22:42:47,611 [main] INFO service.ServiceScheduler - Received 1 > containers from previous attempt. > 2018-07-19 22:42:47,642 [main] INFO service.ServiceScheduler - Could not > read component paths: > `/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components': > No such file or directory: KeeperErrorCode = NoNode for > /registry/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components > 2018-07-19 22:42:47,643 [main] INFO service.ServiceScheduler - Handling > container_e08_1531977563978_0015_01_03 from previous attempt > 2018-07-19 22:42:47,643 [main] INFO service.ServiceScheduler - Record not > found in registry for container container_e08_1531977563978_0015_01_03 > from previous attempt, releasing > 2018-07-19 22:42:47,649 [AMRM Callback Handler Thread] INFO > impl.TimelineV2ClientImpl - Updated timeline service address to xxx:33019 > 2018-07-19 22:42:47,651 [main] INFO service.ServiceScheduler - Triggering > initial evaluation of component httpd > 2018-07-19 22:42:47,652 [main] INFO component.Component - [INIT COMPONENT > httpd]: 2 instances. > 2018-07-19 22:42:47,652 [main] INFO component.Component - [COMPONENT httpd] > Requesting for 2 container(s){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8579) New AM attempt could not retrieve previous attempt component data
[ https://issues.apache.org/jira/browse/YARN-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560218#comment-16560218 ] Chandni Singh commented on YARN-8579: - [~gsaha] Thanks for debugging the issue. patch 2 looks good to me. Just a nitpick. Since we use slf4j, we can use it instead of string concatenation in the log stmt {code:java} LOG.info("Containers recovered after AM registered: ", containers); {code} to {code:java} LOG.info("Containers recovered after AM registered: {} ", containers); {code} > New AM attempt could not retrieve previous attempt component data > - > > Key: YARN-8579 > URL: https://issues.apache.org/jira/browse/YARN-8579 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Yesha Vora >Assignee: Gour Saha >Priority: Critical > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8579.001.patch, YARN-8579.002.patch > > > Steps: > 1) Launch httpd-docker > 2) Wait for app to be in STABLE state > 3) Run validation for app (It takes around 3 mins) > 4) Stop all Zks > 5) Wait 60 sec > 6) Kill AM > 7) wait for 30 sec > 8) Start all ZKs > 9) Wait for application to finish > 10) Validate expected containers of the app > Expected behavior: > New attempt of AM should start and docker containers launched by 1st attempt > should be recovered by new attempt. > Actual behavior: > New AM attempt starts. It can not recover 1st attempt docker containers. It > can not read component details from ZK. > Thus, it starts new attempt for all containers. > {code} > 2018-07-19 22:42:47,595 [main] INFO service.ServiceScheduler - Registering > appattempt_1531977563978_0015_02, fault-test-zkrm-httpd-docker into > registry > 2018-07-19 22:42:47,611 [main] INFO service.ServiceScheduler - Received 1 > containers from previous attempt. 
> 2018-07-19 22:42:47,642 [main] INFO service.ServiceScheduler - Could not > read component paths: > `/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components': > No such file or directory: KeeperErrorCode = NoNode for > /registry/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components > 2018-07-19 22:42:47,643 [main] INFO service.ServiceScheduler - Handling > container_e08_1531977563978_0015_01_03 from previous attempt > 2018-07-19 22:42:47,643 [main] INFO service.ServiceScheduler - Record not > found in registry for container container_e08_1531977563978_0015_01_03 > from previous attempt, releasing > 2018-07-19 22:42:47,649 [AMRM Callback Handler Thread] INFO > impl.TimelineV2ClientImpl - Updated timeline service address to xxx:33019 > 2018-07-19 22:42:47,651 [main] INFO service.ServiceScheduler - Triggering > initial evaluation of component httpd > 2018-07-19 22:42:47,652 [main] INFO component.Component - [INIT COMPONENT > httpd]: 2 instances. > 2018-07-19 22:42:47,652 [main] INFO component.Component - [COMPONENT httpd] > Requesting for 2 container(s){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
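For context on the nitpick above: slf4j's {} placeholders defer message construction until the logger has confirmed the level is enabled, whereas string concatenation always pays the cost up front. The substitution behavior can be sketched with a toy formatter; the class and method below are illustrative stand-ins, not the real org.slf4j API.

```java
// Toy model of slf4j-style "{}" placeholder substitution. The real
// slf4j Logger performs this substitution internally, and only after
// checking that the target log level is enabled -- which is the
// performance argument for LOG.info("...: {}", containers) over
// LOG.info("...: " + containers).
public class PlaceholderFormat {
    /** Replace each "{}" in the template with the next argument, in order. */
    static String format(String template, Object... args) {
        StringBuilder sb = new StringBuilder();
        int argIndex = 0;
        int i = 0;
        while (i < template.length()) {
            if (argIndex < args.length
                    && i + 1 < template.length()
                    && template.charAt(i) == '{'
                    && template.charAt(i + 1) == '}') {
                sb.append(args[argIndex++]);
                i += 2; // consume the "{}" pair
            } else {
                sb.append(template.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format(
            "Containers recovered after AM registered: {}", "[c1, c2]"));
    }
}
```

With the real slf4j Logger, the parameterized form additionally skips the substitution entirely when INFO is disabled, so the argument is never stringified.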
[jira] [Commented] (YARN-8584) Several typos in Log Aggregation related classes
[ https://issues.apache.org/jira/browse/YARN-8584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559325#comment-16559325 ] Chandni Singh commented on YARN-8584: - [~snemeth] Thanks! LGTM > Several typos in Log Aggregation related classes > > > Key: YARN-8584 > URL: https://issues.apache.org/jira/browse/YARN-8584 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Attachments: YARN-8584.001.patch, YARN-8584.002.patch > > > There are typos in comments, log messages, method names, field names, etc.
[jira] [Commented] (YARN-8509) Fix UserLimit calculation for preemption to balance scenario after queue satisfied
[ https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559043#comment-16559043 ] Chandni Singh commented on YARN-8509: - A couple of nits and questions: 1. This is a javadoc block, which should be above the method. I understand that this test was moved out of another test class, but this is a good opportunity to fix it. {code:java}
/**
 * Test case: Submit three applications (app1/app2/app3) to different
 * queues, queue structure:
 *
 *          Root
 *         /  |  \  \
 *        a   b   c   d
 *       30  30  30  10
 */
{code} 2. Why is the log level explicitly set to debug in the code? {code} Logger.getRootLogger().setLevel(Level.DEBUG); {code} 3. Can you explain the comment? {code}
// We should release pending resource be capped at user limit, think about
// a user ask for 1maps. but cluster can run a max of 1000. In this
// case, as soon as each map finish, other one pending will get scheduled
// When not deduct reserved, total-pending = 3G (u1) + 20G (u2) = 23G
// deduct reserved, total-pending = 0G (u1) + 20G (u2) = 20G
{code} > Fix UserLimit calculation for preemption to balance scenario after queue > satisfied > > > Key: YARN-8509 > URL: https://issues.apache.org/jira/browse/YARN-8509 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Attachments: YARN-8509.001.patch, YARN-8509.002.patch > > > In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total > pending resource based on user-limit percent and user-limit factor, which > caps pending resource for each user at the minimum of user-limit pending and > actual pending. This prevents the queue from taking more pending resource to > achieve queue balance after all queues are satisfied with their ideal > allocation. > > We need to change the logic to let queue pending go beyond the user limit.
[jira] [Commented] (YARN-8584) Several typos in Log Aggregation related classes
[ https://issues.apache.org/jira/browse/YARN-8584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559023#comment-16559023 ] Chandni Singh commented on YARN-8584: - Looks good. We can also change the log statements to utilize slf4j instead of concatenating strings. For example {code:java} LOG.warn("rollingMonitorInterval should be more than or equal to " + MIN_LOG_ROLLING_INTERVAL + " seconds. Using " + MIN_LOG_ROLLING_INTERVAL + " seconds instead.");{code} to {code:java} LOG.warn("rollingMonitorInterval should be more than or equal to {} seconds. Using {} seconds instead.", MIN_LOG_ROLLING_INTERVAL, MIN_LOG_ROLLING_INTERVAL);{code} > Several typos in Log Aggregation related classes > > > Key: YARN-8584 > URL: https://issues.apache.org/jira/browse/YARN-8584 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Attachments: YARN-8584.001.patch > > > There are typos in comments, log messages, method names, field names, etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8508) GPU does not get released even though the container is killed
[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558838#comment-16558838 ] Chandni Singh commented on YARN-8508: - [~shaneku...@gmail.com] [~eyang] could you please review patch 2? > GPU does not get released even though the container is killed > -- > > Key: YARN-8508 > URL: https://issues.apache.org/jira/browse/YARN-8508 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sumana Sathish >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8505.001.patch, YARN-8505.002.patch > > > GPU failed to release even though the container using it is being killed > {Code} > 2018-07-06 05:22:26,201 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,250 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,251 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1530854311763_0006 transitioned from RUNNING to > FINISHING_CONTAINERS_WAIT > 2018-07-06 05:22:26,251 INFO launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,358 INFO launcher.ContainerLaunch > (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for > container_e20_1530854311763_0006_01_02. Waited for 5000 ms. > 2018-07-06 05:22:31,358 WARN launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid > file created container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,359 INFO launcher.ContainerLaunch > (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, > but docker container request detected. 
Attempting to reap container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,494 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/launch_container.sh > 2018-07-06 05:22:31,500 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/container_tokens > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,512 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:31,513 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:38,955 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0007_01_02 transitioned from NEW to SCHEDULED > {Code} > New container requesting for GPU fails to launch > {code} > 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - > ResourceHandlerChain.preStart() failed! 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: > Failed to find enough GPUs, > requestor=container_e20_1530854311763_0007_01_02, #RequestedGPUs=2, > #availableGpus=1 > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75) > at >
[jira] [Updated] (YARN-8508) GPU does not get released even though the container is killed
[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8508: Attachment: YARN-8505.002.patch > GPU does not get released even though the container is killed > -- > > Key: YARN-8508 > URL: https://issues.apache.org/jira/browse/YARN-8508 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sumana Sathish >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8505.001.patch, YARN-8505.002.patch > > > GPU failed to release even though the container using it is being killed > {Code} > 2018-07-06 05:22:26,201 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,250 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,251 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1530854311763_0006 transitioned from RUNNING to > FINISHING_CONTAINERS_WAIT > 2018-07-06 05:22:26,251 INFO launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,358 INFO launcher.ContainerLaunch > (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for > container_e20_1530854311763_0006_01_02. Waited for 5000 ms. > 2018-07-06 05:22:31,358 WARN launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid > file created container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,359 INFO launcher.ContainerLaunch > (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, > but docker container request detected. 
Attempting to reap container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,494 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/launch_container.sh > 2018-07-06 05:22:31,500 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/container_tokens > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,512 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:31,513 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:38,955 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0007_01_02 transitioned from NEW to SCHEDULED > {Code} > New container requesting for GPU fails to launch > {code} > 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - > ResourceHandlerChain.preStart() failed! 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: > Failed to find enough GPUs, > requestor=container_e20_1530854311763_0007_01_02, #RequestedGPUs=2, > #availableGpus=1 > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75) > at >
[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed
[ https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558835#comment-16558835 ] Chandni Singh commented on YARN-8545: - [~billie.rinaldi] [~eyang] Do you have any comments on patch 1? > YARN native service should return container if launch failed > > > Key: YARN-8545 > URL: https://issues.apache.org/jira/browse/YARN-8545 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Chandni Singh >Priority: Critical > Attachments: YARN-8545.001.patch > > > In some cases, container launch may fail but container will not be properly > returned to RM. > This could happen when AM trying to prepare container launch context but > failed w/o sending container launch context to NM (Once container launch > context is sent to NM, NM will report failed container to RM). > Exception like: > {code:java} > java.io.FileNotFoundException: File does not exist: > hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591) > at > org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388) > at > org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253) > at > org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152) > at > org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745){code} > And even after container launch context prepare failed, AM still trying to > monitor container's readiness: > {code:java} > 2018-07-17 18:42:57,518 [pool-7-thread-1] INFO monitor.ServiceMonitor - > Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 > 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP > presence", exception="java.io.IOException: primary-worker-0: IP is not > available yet" > ...{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
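The failure mode described in YARN-8545 above, an exception while building the container launch context before anything reaches the NM, can be modeled with a toy launcher to show why the AM must both release the container back to the RM and drop it from its live-instance map. All class, field, and method names below are illustrative stand-ins, not the actual yarn-service AM classes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the fix: if the launch context cannot be built, the NM
// never sees the container, so it will never report failure to the RM.
// Unless the AM releases the container itself and stops tracking it,
// the container leaks and the monitor keeps probing a dead instance.
public class ToyLauncher {
    // containerId -> component instance name
    final Map<String, String> liveInstances = new HashMap<>();
    int releasedToRm = 0; // stands in for amRMClient.releaseAssignedContainer(...)

    void launch(String containerId, String instance, Runnable buildLaunchContext) {
        liveInstances.put(containerId, instance);
        try {
            // May throw, e.g. FileNotFoundException for a missing
            // run-PRIMARY_WORKER.sh, before the NM is ever contacted.
            buildLaunchContext.run();
        } catch (RuntimeException e) {
            // The fix: return the container and forget the instance,
            // instead of leaving it for the readiness monitor.
            liveInstances.remove(containerId);
            releasedToRm++;
        }
    }
}
```

The two bullet points in the patch-1 summary (releasing failed containers, removing them from live instances) correspond to the two statements in the catch block.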
[jira] [Updated] (YARN-8508) GPU does not get released even though the container is killed
[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8508: Attachment: YARN-8505.001.patch > GPU does not get released even though the container is killed > -- > > Key: YARN-8508 > URL: https://issues.apache.org/jira/browse/YARN-8508 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sumana Sathish >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8505.001.patch > > > GPU failed to release even though the container using it is being killed > {Code} > 2018-07-06 05:22:26,201 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,250 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from RUNNING to > KILLING > 2018-07-06 05:22:26,251 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1530854311763_0006 transitioned from RUNNING to > FINISHING_CONTAINERS_WAIT > 2018-07-06 05:22:26,251 INFO launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,358 INFO launcher.ContainerLaunch > (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for > container_e20_1530854311763_0006_01_02. Waited for 5000 ms. > 2018-07-06 05:22:31,358 WARN launcher.ContainerLaunch > (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid > file created container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,359 INFO launcher.ContainerLaunch > (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, > but docker container request detected. 
Attempting to reap container > container_e20_1530854311763_0006_01_02 > 2018-07-06 05:22:31,494 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/launch_container.sh > 2018-07-06 05:22:31,500 INFO nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/container_tokens > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,510 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from KILLING to > CONTAINER_CLEANEDUP_AFTER_KILL > 2018-07-06 05:22:31,512 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:31,513 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0006_01_02 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE > 2018-07-06 05:22:38,955 INFO container.ContainerImpl > (ContainerImpl.java:handle(2093)) - Container > container_e20_1530854311763_0007_01_02 transitioned from NEW to SCHEDULED > {Code} > New container requesting for GPU fails to launch > {code} > 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor > (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - > ResourceHandlerChain.preStart() failed! 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: > Failed to find enough GPUs, > requestor=container_e20_1530854311763_0007_01_02, #RequestedGPUs=2, > #availableGpus=1 > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75) > at >
[jira] [Commented] (YARN-8508) GPU does not get released even though the container is killed
[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558571#comment-16558571 ] Chandni Singh commented on YARN-8508: - This happens with a container that gets cleaned up before its pid file is created. To solve it, we need to release the resources at the end of \{{LinuxContainerExecutor.reapContainer()}} just like we do in \{{LinuxContainerExecutor.launchContainer()}}, {\{LinuxContainerExecutor.reLaunchContainer()}}, and \{{LinuxContainerExecutor.reacquireContainer}}. Please see my explanation below: Refer \{{container_e21_1532545600682_0001_01_02}} in yarn8505.nodemanager.log - 002 is launched but its pid file is not created {code} 2018-07-25 19:08:54,409 DEBUG util.ProcessIdFileReader (ProcessIdFileReader.java:getProcessId(53)) - Accessing pid from pid file /.../application_1532545600682_0001/container_e21_1532545600682_0001_01_02/container_e21_1532545600682_0001_01_02.pid 2018-07-25 19:08:54,409 DEBUG util.ProcessIdFileReader (ProcessIdFileReader.java:getProcessId(103)) - Got pid null from path /.../application_1532545600682_0001/container_e21_1532545600682_0001_01_02/container_e21_1532545600682_0001_01_02.pid {code} - Since application is killed, 002 is killed by ResourceManager {code} 2018-07-25 19:08:54,643 DEBUG container.ContainerImpl (ContainerImpl.java:handle(2080)) - Processing container_e21_1532545600682_0001_01_02 of type CONTAINER_KILLED_ON_REQUEST {code} - The above triggers \{{ContainerLaunch.cleanupContainer()}} for 002. 
This happens before the pid file is created {code} 2018-07-25 19:08:54,409 WARN launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid file created container_e21_1532545600682_0001_01_02 {code} - \{{cleanupContainer}} invokes \{{reapDockerContainerNoPid(user)}} {code} 2018-07-25 19:08:54,410 INFO launcher.ContainerLaunch (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, but docker container request detected. Attempting to reap container container_e21_1532545600682_0001_01_02 {code} - \{{reapDockerContainerNoPid(user)}} calls \{{exec.reapContainer(...)}} {code} 2018-07-25 19:08:54,412 DEBUG docker.DockerCommandExecutor (DockerCommandExecutor.java:executeDockerCommand(89)) - Running docker command: inspect docker-command=inspect format=\{{.State.Status}} name=container_e21_1532545600682_0001_01_02 2018-07-25 19:08:54,412 DEBUG privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:getPrivilegedOperationExecutionCommand(119)) - Privileged Execution Command Array: [/.../hadoop-yarn/bin/container-executor, --inspect-docker-container, --format=\{{.State.Status}}, container_e21_1532545600682_0001_01_02] 2018-07-25 19:08:54,530 DEBUG docker.DockerCommandExecutor (DockerCommandExecutor.java:getContainerStatus(160)) - Container Status: nonexistent ContainerId: container_e21_1532545600682_0001_01_02 2018-07-25 19:08:54,530 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:reapDockerContainerNoPid(948)) - Sent signal to docker container container_e21_1532545600682_0001_01_02 as user hrt_qa, result=success {code} - The problem is that the \{{reapContainer}} in \{{LinuxContainerExecutor}} doesn't release the resources assigned to the container. The below code snippet that performs these tasks after the container completes doesn't happen at this point. 
{code}
resourcesHandler.postExecute(containerId);
try {
  if (resourceHandlerChain != null) {
    LOG.info("{} POST Complete", containerId);
    resourceHandlerChain.postComplete(containerId);
  }
} catch (ResourceHandlerException e) {
  LOG.warn("ResourceHandlerChain.postComplete failed for "
      + "containerId: " + containerId + ". Exception: " + e);
}
{code} - The container launch fails after 4 minutes, and only then are the resources released. {code}
2018-07-25 19:12:09,999 WARN nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleExitCode(593)) - Exit code from container container_e21_1532545600682_0001_01_02 is : 27
2018-07-25 19:12:10,000 WARN nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleExitCode(599)) - Exception from container-launch with container ID: container_e21_1532545600682_0001_01_02 and exit code: 27
2018-07-25 19:12:10,000 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Container id: container_e21_1532545600682_0001_01_02
2018-07-25 19:12:10,003 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Docker inspect command: /usr/bin/docker inspect --format {{.State.Pid}} container_e21_1532545600682_0001_01_02
2018-07-25 19:12:10,003 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Failed to write pid to file /cgroup/cpu/.../container_e21_1532545600682_0001_01_02/tasks - No such
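The proposed fix, performing in reapContainer() the same resource release that the normal completion paths (launchContainer, relaunchContainer, reacquireContainer) already perform, can be illustrated with a toy allocator. The class, fields, and methods below are hypothetical stand-ins for the NM's resource handler chain, not the Hadoop code.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the bug: GPUs assigned to a container must be released
// on every exit path. If the reap-before-pid-file path skips the
// post-complete step, a later container fails with
// "Failed to find enough GPUs ... #RequestedGPUs=2, #availableGpus=1".
public class ToyExecutor {
    private final Map<String, Integer> assignedGpus = new HashMap<>();
    private int availableGpus;

    ToyExecutor(int totalGpus) { this.availableGpus = totalGpus; }

    /** Assign GPUs at launch; fails if not enough are free. */
    boolean launchContainer(String containerId, int gpus) {
        if (gpus > availableGpus) {
            return false; // GpuResourceAllocator would throw here
        }
        availableGpus -= gpus;
        assignedGpus.put(containerId, gpus);
        return true;
    }

    /** The fix: reaping releases the container's resources too. */
    void reapContainer(String containerId) {
        Integer gpus = assignedGpus.remove(containerId);
        if (gpus != null) {
            // stands in for resourceHandlerChain.postComplete(containerId)
            availableGpus += gpus;
        }
    }

    int availableGpus() { return availableGpus; }
}
```

Without the release in reapContainer, the model reproduces the reported symptom: the killed container's GPUs stay assigned until the stale bookkeeping is cleared minutes later, and the next GPU request is rejected.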
[jira] [Comment Edited] (YARN-8545) YARN native service should return container if launch failed
[ https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553426#comment-16553426 ] Chandni Singh edited comment on YARN-8545 at 7/23/18 9:36 PM: -- In Patch 1:
- releasing containers that failed
- removing failed containers from live instances

> YARN native service should return container if launch failed
>
> Key: YARN-8545
> URL: https://issues.apache.org/jira/browse/YARN-8545
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Wangda Tan
> Assignee: Chandni Singh
> Priority: Critical
> Attachments: YARN-8545.001.patch
>
> In some cases, container launch may fail but the container will not be properly returned to the RM.
> This can happen when the AM fails while preparing the container launch context, before sending it to the NM (once the launch context is sent to the NM, the NM will report the failed container to the RM).
> Exception like:
> {code:java}
> java.io.FileNotFoundException: File does not exist: hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
> at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
> at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
> at org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
> at org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
> at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
> at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){code}
> And even after the container launch context preparation failed, the AM still tries to monitor the container's readiness:
> {code:java}
> 2018-07-17 18:42:57,518 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: primary-worker-0: IP is not available yet"
> ...{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
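The fix direction discussed in this issue can be sketched as follows. All names here ({{released}}, {{liveInstances}}, {{buildContainerLaunchContext}}) are hypothetical stand-ins for the service AM's internals, not the actual API: when building the launch context throws, the AM should release the container and stop tracking it instead of monitoring readiness.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: a launch-context failure releases the container back to the RM
// and removes it from live-instance tracking.
class LaunchFailureSketch {
    final List<String> released = new ArrayList<>();   // containers returned to the RM
    final Set<String> liveInstances = new HashSet<>(); // containers being monitored

    void launch(String containerId) {
        liveInstances.add(containerId);
        try {
            buildContainerLaunchContext(containerId);
            // on success, the launch context would be sent to the NM here
        } catch (Exception e) {
            // the NM never saw this container, so the AM must release it
            // itself and stop readiness monitoring
            released.add(containerId);
            liveInstances.remove(containerId);
        }
    }

    void buildContainerLaunchContext(String containerId) throws Exception {
        // simulates the FileNotFoundException from the report
        throw new FileNotFoundException("launch script does not exist");
    }
}
```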
[jira] [Updated] (YARN-8545) YARN native service should return container if launch failed
[ https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8545: Attachment: YARN-8545.001.patch
[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed
[ https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553422#comment-16553422 ] Chandni Singh commented on YARN-8545: - [~gsaha] [~billie.rinaldi] could you please review the patch?
[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8301: Attachment: YARN-8301.007.patch
> Yarn Service Upgrade: Add documentation
>
> Key: YARN-8301
> URL: https://issues.apache.org/jira/browse/YARN-8301
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Chandni Singh
> Assignee: Chandni Singh
> Priority: Major
> Attachments: YARN-8301.001.patch, YARN-8301.002.patch, YARN-8301.003.patch, YARN-8301.004.patch, YARN-8301.005.patch, YARN-8301.006.patch, YARN-8301.007.patch
>
> Add documentation for yarn service upgrade.
[jira] [Assigned] (YARN-8545) YARN native service should return container if launch failed
[ https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh reassigned YARN-8545: --- Assignee: Chandni Singh
[jira] [Assigned] (YARN-8508) GPU does not get released even though the container is killed
[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh reassigned YARN-8508: --- Assignee: Chandni Singh (was: Wangda Tan)
> GPU does not get released even though the container is killed
>
> Key: YARN-8508
> URL: https://issues.apache.org/jira/browse/YARN-8508
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Sumana Sathish
> Assignee: Chandni Singh
> Priority: Major
>
> The GPU fails to be released even though the container using it is killed:
> {code}
> 2018-07-06 05:22:26,201 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_01 transitioned from RUNNING to KILLING
> 2018-07-06 05:22:26,250 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_02 transitioned from RUNNING to KILLING
> 2018-07-06 05:22:26,251 INFO application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_1530854311763_0006 transitioned from RUNNING to FINISHING_CONTAINERS_WAIT
> 2018-07-06 05:22:26,251 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container container_e20_1530854311763_0006_01_02
> 2018-07-06 05:22:31,358 INFO launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for container_e20_1530854311763_0006_01_02. Waited for 5000 ms.
> 2018-07-06 05:22:31,358 WARN launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid file created container_e20_1530854311763_0006_01_02
> 2018-07-06 05:22:31,359 INFO launcher.ContainerLaunch (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, but docker container request detected. Attempting to reap container container_e20_1530854311763_0006_01_02
> 2018-07-06 05:22:31,494 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/launch_container.sh
> 2018-07-06 05:22:31,500 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_02/container_tokens
> 2018-07-06 05:22:31,510 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_01 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
> 2018-07-06 05:22:31,510 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_02 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
> 2018-07-06 05:22:31,512 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_01 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2018-07-06 05:22:31,513 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_02 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2018-07-06 05:22:38,955 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0007_01_02 transitioned from NEW to SCHEDULED
> {code}
> A new container requesting a GPU then fails to launch:
> {code}
> 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - ResourceHandlerChain.preStart() failed!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Failed to find enough GPUs, requestor=container_e20_1530854311763_0007_01_02, #RequestedGPUs=2, #availableGpus=1
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75)
> at
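The leak pattern in this report can be illustrated with a small self-contained model. {{GpuAllocatorSketch}} and its method names are illustrative stand-ins, not the real {{GpuResourceAllocator}} API: devices assigned at container start must be returned on every completion path, including kills, or later requests fail with the "Failed to find enough GPUs" error shown above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy GPU allocator: maps each device index to the container holding it
// (null means free). If release is skipped on a kill path, devices leak.
class GpuAllocatorSketch {
    private final Map<Integer, String> deviceToContainer = new HashMap<>();

    GpuAllocatorSketch(int numDevices) {
        for (int i = 0; i < numDevices; i++) deviceToContainer.put(i, null);
    }

    int availableGpus() {
        int n = 0;
        for (String owner : deviceToContainer.values()) if (owner == null) n++;
        return n;
    }

    List<Integer> assignGpus(String containerId, int requested) {
        if (availableGpus() < requested) {
            throw new IllegalStateException("Failed to find enough GPUs, requestor="
                + containerId + ", #RequestedGPUs=" + requested
                + ", #availableGpus=" + availableGpus());
        }
        List<Integer> assigned = new ArrayList<>();
        for (Map.Entry<Integer, String> e : deviceToContainer.entrySet()) {
            if (assigned.size() == requested) break;
            if (e.getValue() == null) { e.setValue(containerId); assigned.add(e.getKey()); }
        }
        return assigned;
    }

    // The step that must run on every container completion, including kills.
    void releaseGpus(String containerId) {
        for (Map.Entry<Integer, String> e : deviceToContainer.entrySet()) {
            if (containerId.equals(e.getValue())) e.setValue(null);
        }
    }
}
```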
[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8301: Attachment: YARN-8301.006.patch
[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8301: Attachment: YARN-8301.005.patch
[jira] [Updated] (YARN-8542) Yarn Service: Add component name to container json
[ https://issues.apache.org/jira/browse/YARN-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8542: Description: GET app/v1/services/{{service-name}}/component-instances returns a list of containers with YARN-8299.
{code:java}
[
  {
    "id": "container_1531508836237_0001_01_03",
    "ip": "192.168.2.51",
    "hostname": "HW12119.local",
    "state": "READY",
    "launch_time": 1531509014497,
    "bare_host": "192.168.2.51",
    "component_instance_name": "sleeper-1"
  },
  {
    "id": "container_1531508836237_0001_01_02",
    "ip": "192.168.2.51",
    "hostname": "HW12119.local",
    "state": "READY",
    "launch_time": 1531509013492,
    "bare_host": "192.168.2.51",
    "component_instance_name": "sleeper-0"
  }
]{code}
{{component_name}} is not part of the container json, so it is hard to tell which component an instance belongs to. To fix this, the format of the returned containers will change to:
{code:java}
[
  {
    "name": "ping",
    "containers": [
      {
        "bare_host": "eyang-4.openstacklocal",
        "component_instance_name": "ping-0",
        "hostname": "ping-0.qqq.hbase.ycluster",
        "id": "container_1531765479645_0002_01_02",
        "ip": "172.26.111.21",
        "launch_time": 1531767377301,
        "state": "READY"
      },
      {
        "bare_host": "eyang-4.openstacklocal",
        "component_instance_name": "ping-1",
        "hostname": "ping-1.qqq.hbase.ycluster",
        "id": "container_1531765479645_0002_01_07",
        "ip": "172.26.111.21",
        "launch_time": 1531767410395,
        "state": "RUNNING_BUT_UNREADY"
      }
    ]
  },
  {
    "name": "sleep",
    "containers": [
      {
        "bare_host": "eyang-5.openstacklocal",
        "component_instance_name": "sleep-0",
        "hostname": "sleep-0.qqq.hbase.ycluster",
        "id": "container_1531765479645_0002_01_04",
        "ip": "172.26.111.20",
        "launch_time": 1531767377710,
        "state": "READY"
      },
      {
        "bare_host": "eyang-4.openstacklocal",
        "component_instance_name": "sleep-1",
        "hostname": "sleep-1.qqq.hbase.ycluster",
        "id": "container_1531765479645_0002_01_05",
        "ip": "172.26.111.21",
        "launch_time": 1531767378303,
        "state": "READY"
      }
    ]
  }
]{code}
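The proposed regrouping can be sketched with plain Java collections. This assumes, as the examples in this issue suggest, that the component name is the instance name minus its trailing "-<index>" suffix; {{ComponentGroupingSketch}} is an illustrative helper, not the actual API server code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Groups a flat list of (containerId, componentInstanceName) pairs by
// component, mirroring the proposed {"name": ..., "containers": [...]} layout.
class ComponentGroupingSketch {
    static Map<String, List<String>> groupByComponent(List<String[]> containers) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] c : containers) {
            String containerId = c[0];
            String instanceName = c[1]; // e.g. "sleep-1"
            // assumption: component name = instance name without "-<index>"
            String component = instanceName.substring(0, instanceName.lastIndexOf('-'));
            grouped.computeIfAbsent(component, k -> new ArrayList<>()).add(containerId);
        }
        return grouped;
    }
}
```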
[jira] [Commented] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548405#comment-16548405 ] Chandni Singh commented on YARN-8301: - Addressed [~gsaha] comments in patch 4. I didn't find many trailing whitespaces. Let me know if you still see them.
[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8301: Attachment: YARN-8301.004.patch
[jira] [Commented] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548384#comment-16548384 ] Chandni Singh commented on YARN-8301: - {quote} In line 148 do we need the line "name": "sleeper-service" in the JSON spec for version 1.0.1 of the service. {quote} No, will remove it.
[jira] [Updated] (YARN-8542) Yarn Service: Add component name to container json
[ https://issues.apache.org/jira/browse/YARN-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8542: Description: GET app/v1/services/{{service-name}}/component-instances returns a list of containers with YARN-8299. {{component_name}} is not part of the container json, so it is hard to tell which component an instance belongs to.
[jira] [Commented] (YARN-8542) Yarn Service: Add component name to container json
[ https://issues.apache.org/jira/browse/YARN-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548379#comment-16548379 ] Chandni Singh commented on YARN-8542: - [~gsaha] Ok. That sounds reasonable. Will change it to the format you have proposed.
[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8301: Attachment: YARN-8301.003.patch
[jira] [Commented] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548358#comment-16548358 ] Chandni Singh commented on YARN-8301: - Addressed offline comments in patch 3
[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8301: Attachment: YARN-8301.002.patch
[jira] [Updated] (YARN-8542) Yarn Service: Add component name to container json
[ https://issues.apache.org/jira/browse/YARN-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8542:
    Description:

GET app/v1/services/{\{service-name}}/component-instances returns a list of containers with YARN-8299.
{code:java}
[
  {
    "id": "container_1531508836237_0001_01_03",
    "ip": "192.168.2.51",
    "hostname": "HW12119.local",
    "state": "READY",
    "launch_time": 1531509014497,
    "bare_host": "192.168.2.51",
    "component_instance_name": "sleeper-1"
  },
  {
    "id": "container_1531508836237_0001_01_02",
    "ip": "192.168.2.51",
    "hostname": "HW12119.local",
    "state": "READY",
    "launch_time": 1531509013492,
    "bare_host": "192.168.2.51",
    "component_instance_name": "sleeper-0"
  }
]{code}
{{component_name}} is not part of the container json, so it is hard to tell which component an instance belongs to.

    was:
In YARN-8299, a CLI to query container status was implemented to display containers in a flat list. It might be helpful to display the component structure hierarchy like this:
{code}
[
  {
    "name": "ping",
    "containers": [
      {
        "bare_host": "eyang-4.openstacklocal",
        "component_instance_name": "ping-0",
        "hostname": "ping-0.qqq.hbase.ycluster",
        "id": "container_1531765479645_0002_01_02",
        "ip": "172.26.111.21",
        "launch_time": 1531767377301,
        "state": "READY"
      },
      {
        "bare_host": "eyang-4.openstacklocal",
        "component_instance_name": "ping-1",
        "hostname": "ping-1.qqq.hbase.ycluster",
        "id": "container_1531765479645_0002_01_07",
        "ip": "172.26.111.21",
        "launch_time": 1531767410395,
        "state": "RUNNING_BUT_UNREADY"
      }
    ]
  },
  {
    "name": "sleep",
    "containers": [
      {
        "bare_host": "eyang-5.openstacklocal",
        "component_instance_name": "sleep-0",
        "hostname": "sleep-0.qqq.hbase.ycluster",
        "id": "container_1531765479645_0002_01_04",
        "ip": "172.26.111.20",
        "launch_time": 1531767377710,
        "state": "READY"
      },
      {
        "bare_host": "eyang-4.openstacklocal",
        "component_instance_name": "sleep-1",
        "hostname": "sleep-1.qqq.hbase.ycluster",
        "id": "container_1531765479645_0002_01_05",
        "ip": "172.26.111.21",
        "launch_time": 1531767378303,
        "state": "READY"
      }
    ]
  }
]
{code}
[jira] [Commented] (YARN-8542) Yarn Service: Add component name to container json
[ https://issues.apache.org/jira/browse/YARN-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548248#comment-16548248 ]

Chandni Singh commented on YARN-8542:
    [~gsaha] I am not in favor of the format below:
{code:java}
{
  "name": "sleep",
  "containers": [
    {
      "bare_host": "eyang-5.openstacklocal",
      "component_instance_name": "sleep-0",
      "hostname": "sleep-0.qqq.hbase.ycluster",
      "id": "container_1531765479645_0002_01_04",
      "ip": "172.26.111.20",
      "launch_time": 1531767377710,
      "state": "READY"
    }
  ]
}{code}
It doesn't follow the convention: the request is for containers, so it should return a list of containers. I prefer adding {{component_name}} to the container json. It is also easier for users to further filter a flat list than a nested json.
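To illustrate the point that a flat list with {{component_name}} is easy to post-process, here is a minimal client-side sketch (not part of any patch; the {{component_name}} field is the proposed addition and the sample values are illustrative, taken from the examples above) that regroups the flat container json into the nested per-component form:

```python
from collections import defaultdict
import json

# Flat container list as returned by the GET API, with the proposed
# "component_name" field added (the addition discussed in this issue).
# Values are illustrative.
flat = [
    {"id": "container_1531508836237_0001_01_03", "state": "READY",
     "component_instance_name": "sleeper-1", "component_name": "sleeper"},
    {"id": "container_1531508836237_0001_01_02", "state": "READY",
     "component_instance_name": "sleeper-0", "component_name": "sleeper"},
]

# Group the flat list into the nested per-component structure client-side.
by_component = defaultdict(list)
for container in flat:
    by_component[container["component_name"]].append(container)

nested = [{"name": name, "containers": members}
          for name, members in by_component.items()]
print(json.dumps(nested, indent=2))
```

So a user who wants the hierarchical view can derive it from the flat response, while the reverse (flattening a nested response) would force every consumer to walk the nesting.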
[jira] [Commented] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545691#comment-16545691 ]

Chandni Singh commented on YARN-8299:
    [~eyang] Created YARN-8542 for the improvement you suggested.

> Yarn Service Upgrade: Add GET APIs that returns instances matching query params
> ---
>
> Key: YARN-8299
> URL: https://issues.apache.org/jira/browse/YARN-8299
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Chandni Singh
> Assignee: Chandni Singh
> Priority: Major
> Attachments: YARN-8299.001.patch, YARN-8299.002.patch, YARN-8299.003.patch, YARN-8299.004.patch, YARN-8299.005.patch
>
> We need APIs that return containers that match the query params. These are needed so that we can find out which containers have been upgraded.
[jira] [Created] (YARN-8542) Yarn Service: Add component name to container json
Chandni Singh created YARN-8542:
    Summary: Yarn Service: Add component name to container json
    Key: YARN-8542
    URL: https://issues.apache.org/jira/browse/YARN-8542
    Project: Hadoop YARN
    Issue Type: Improvement
    Reporter: Chandni Singh
    Assignee: Chandni Singh
[jira] [Commented] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543883#comment-16543883 ]

Chandni Singh commented on YARN-8299:
    Last Jenkins run is against patch 3, which has a broken test. Triggered it to run against patch 5.
[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8299:
    Attachment: YARN-8299.005.patch
[jira] [Commented] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543831#comment-16543831 ]

Chandni Singh commented on YARN-8301:
    [~gsaha] [~eyang] Added a brief document on upgrade. Please review. Thanks.
[jira] [Updated] (YARN-8301) Yarn Service Upgrade: Add documentation
[ https://issues.apache.org/jira/browse/YARN-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8301:
    Attachment: YARN-8301.001.patch
[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8299:
    Attachment: YARN-8299.004.patch
[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8299:
    Attachment: YARN-8299.003.patch
[jira] [Commented] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543691#comment-16543691 ]

Chandni Singh commented on YARN-8299:
    Confirming: there are three filter options:
    # component names: {{-components}}
    # version: {{-version}}
    # component instance states: {{-states}}
[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8299:
    Attachment: YARN-8299.002.patch
[jira] [Commented] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543510#comment-16543510 ]

Chandni Singh commented on YARN-8299:
    {quote}
    If user input a state that is not in the defined list, it throws ERROR 500 error. It would be nice to report ERROR 400 BAD REQUEST, and display possible states. Since user can only input one state, would it make sense to change -states to -state?
    {quote}
    [~eyang] The user can input multiple states:
{code:java}
yarn container -list test1 -states UPGRADING,NEEDS_UPGRADE | python -m json.tool{code}
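The multi-state semantics being discussed can be pictured with a small client-side sketch (illustrative data only, not the actual CLI code): a comma-separated {{-states}} value selects containers whose state matches any of the listed states.

```python
def filter_by_states(containers, states_arg):
    """Keep containers whose state is in the comma-separated states_arg."""
    wanted = {s.strip() for s in states_arg.split(",")}
    return [c for c in containers if c.get("state") in wanted]

containers = [
    {"id": "c1", "state": "UPGRADING"},
    {"id": "c2", "state": "READY"},
    {"id": "c3", "state": "NEEDS_UPGRADE"},
]

# Mirrors: yarn container -list test1 -states UPGRADING,NEEDS_UPGRADE
print(filter_by_states(containers, "UPGRADING,NEEDS_UPGRADE"))
```

A single-state query is just the degenerate case of the same match, which is why keeping the plural {{-states}} flag is reasonable.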
[jira] [Commented] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543442#comment-16543442 ]

Chandni Singh commented on YARN-8299:
    TestAMRMClient passes locally for me. I am able to compile and run the tests of the hadoop-yarn-client, hadoop-yarn-services-core, and hadoop-yarn-services-api modules without any issues.
[jira] [Comment Edited] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542253#comment-16542253 ]

Chandni Singh edited comment on YARN-8299 at 7/12/18 10:17 PM:
    [~eyang] [~gsaha] could you please review? Command line to list instances:
{code:java}
yarn container -list test1 -states READY -version 1.0.0 | python -m json.tool{code}

    was (Author: csingh):
    [~eyang] [~gsaha] could you please review? Command line to list instances: yarn container -list test1 -states READY
[jira] [Comment Edited] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542253#comment-16542253 ]

Chandni Singh edited comment on YARN-8299 at 7/12/18 10:16 PM:
    [~eyang] [~gsaha] could you please review? Command line to list instances: yarn container -list test1 -states READY

    was (Author: csingh):
    [~eyang] [~gsaha] could you please review? Command line to yarn container -list test1 -states READY
[jira] [Commented] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542253#comment-16542253 ]

Chandni Singh commented on YARN-8299:
    [~eyang] [~gsaha] could you please review? Command line to yarn container -list test1 -states READY
[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8299:
    Attachment: YARN-8299.001.patch
[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8299:
    Summary: Yarn Service Upgrade: Add GET APIs that returns instances matching query params (was: Yarn Service Upgrade: Add GET APIs that returns components/instances matching query params)
[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params
[ https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8299:
    Description: We need APIs that returns containers that match the query params. These are needed so that we can find out what containers have been upgraded. (was: We need APIs that returns containers/components that match the query params. These are needed so that we can find out what containers/components have been upgraded.)
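On the REST side, these filters amount to query parameters on the component-instances endpoint. As a sketch only: the parameter names below ({{components}}, {{version}}, {{states}}) are assumed from the CLI flags, and the host/port are placeholders, neither is confirmed against the patch.

```python
from urllib.parse import urlencode

# Placeholder endpoint; "rm-host:8088" is illustrative.
base = "http://rm-host:8088/app/v1/services/test1/component-instances"

# Hypothetical query parameters mirroring the CLI flags -components,
# -version and -states; multiple states are comma-separated.
params = {
    "components": "sleeper",
    "version": "1.0.0",
    "states": "UPGRADING,NEEDS_UPGRADE",
}
url = base + "?" + urlencode(params)
print(url)
```

The returned body would then be the flat container list shown earlier, already narrowed server-side.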
[jira] [Commented] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526578#comment-16526578 ]

Chandni Singh commented on YARN-8409:
    Thanks [~eyang] for reviewing and merging the patch.

> ActiveStandbyElectorBasedElectorService is failing with NPE
> ---
>
> Key: YARN-8409
> URL: https://issues.apache.org/jira/browse/YARN-8409
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.1.1
> Reporter: Yesha Vora
> Assignee: Chandni Singh
> Priority: Major
> Fix For: 3.2.0, 3.1.1
> Attachments: YARN-8409.002.patch
>
> In an RM-HA env, kill the ZK leader and then perform an RM failover.
> Sometimes the active RM gets an NPE and fails to come up successfully:
> {code:java}
> 2018-06-08 10:31:03,007 INFO client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:run(289)) - Client will use GSSAPI as SASL mechanism.
> 2018-06-08 10:31:03,008 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server xxx/xxx:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
> 2018-06-08 10:31:03,009 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1146)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
> 2018-06-08 10:31:03,344 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService failed in state INITED
> java.lang.NullPointerException
> at org.apache.hadoop.ha.ActiveStandbyElector$3.run(ActiveStandbyElector.java:1033)
> at org.apache.hadoop.ha.ActiveStandbyElector$3.run(ActiveStandbyElector.java:1030)
> at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1095)
> at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1087)
> at org.apache.hadoop.ha.ActiveStandbyElector.createWithRetries(ActiveStandbyElector.java:1030)
> at org.apache.hadoop.ha.ActiveStandbyElector.ensureParentZNode(ActiveStandbyElector.java:347)
> at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.serviceInit(ActiveStandbyElectorBasedElectorService.java:110)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:336)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1479)
> 2018-06-08 10:31:03,345 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(409)) - Yielding from election{code}
[jira] [Commented] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525688#comment-16525688 ]

Chandni Singh commented on YARN-8409:
    [~eyang] could you please review patch 2?
[jira] [Updated] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8409:
    Attachment: (was: YARN-8409.001.patch)
[jira] [Updated] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8409:
    Attachment: YARN-8409.002.patch
[jira] [Updated] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8409:
    Attachment: YARN-8409.001.patch
[jira] [Comment Edited] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524290#comment-16524290 ]

Chandni Singh edited comment on YARN-8409 at 6/27/18 12:51 AM:
--------------------------------------------------------------

This happens when the RM is started immediately after killing the ZooKeeper leader. The {{zkClient}} reference in {{ActiveStandbyElector}} is null, which causes the NPE. Below is the chain of calls:
# The {{ActiveStandbyElector}} constructor, at line 274, invokes {{reEstablishSession()}}.
# {{reEstablishSession}} tries to create a ZooKeeper connection at line 825.
# {{createConnection}} calls {{connectToZookeeper}} at line 850 to initialize {{zkClient}}.
# However, {{connectToZookeeper}} throws an IOException because of a session timeout.
# {{zkClient}} never gets initialized and stays {{null}}. {{ActiveStandbyElectorBasedElectorService}} currently doesn't check whether the elector is connected to ZooKeeper and executes {{elector.ensureParentZNode()}}, which then throws the NPE.

was (Author: csingh):
This happens when RM is started immediately after killing zookeeper leader. The {{zkClient}} is null.
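The chain above suggests a straightforward hardening: fail with a descriptive error when the ZooKeeper client was never established, instead of dereferencing it later. The sketch below is illustrative only — the class, field, and method names are stand-ins, not the actual Hadoop sources:

```java
// Illustrative sketch: ElectorGuardSketch, zkClient, and ensureParentZNode
// are hypothetical stand-ins for the real elector code. If the connect
// attempt fails during construction, zkClient stays null; a guard turns
// the later NPE into a clear IllegalStateException.
public class ElectorGuardSketch {
    static Object zkClient = null; // never initialized when the connect fails

    static void ensureParentZNode() {
        if (zkClient == null) {
            throw new IllegalStateException(
                "ZooKeeper client was never connected; cannot create parent znode");
        }
        // the retry/create logic would run here in the real elector
    }

    public static void main(String[] args) {
        try {
            ensureParentZNode();
        } catch (IllegalStateException e) {
            System.out.println("guarded: " + e.getMessage());
        }
    }
}
```

An equivalent fix could also live in the service's {{serviceInit}}, checking the elector's connection state before calling {{ensureParentZNode()}}.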
[jira] [Commented] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524290#comment-16524290 ]

Chandni Singh commented on YARN-8409:
-------------------------------------

This happens when RM is started immediately after killing zookeeper leader. The {{zkClient}} is null.
[jira] [Assigned] (YARN-8409) ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh reassigned YARN-8409:
-----------------------------------
    Assignee: Chandni Singh
[jira] [Comment Edited] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf on trunk
[ https://issues.apache.org/jira/browse/YARN-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522912#comment-16522912 ]

Chandni Singh edited comment on YARN-8458 at 6/26/18 6:52 PM:
-------------------------------------------------------------

* {{TestCapacitySchedulerPerf}} on branch-3.1
{code}
[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 373.312 s - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf
#ResourceTypes = 2. Avg of fastest 20: 34602.074
#ResourceTypes = 5. Avg of fastest 20: 25000.0
#ResourceTypes = 4. Avg of fastest 20: 26420.08
#ResourceTypes = 3. Avg of fastest 20: 27173.912
{code}
* {{TestCapacitySchedulerPerf}} on branch-3.0
{code}
[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 277.687 s - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf
#ResourceTypes = 2. Avg of fastest 20: 35460.992
#ResourceTypes = 5. Avg of fastest 20: 28129.395
#ResourceTypes = 4. Avg of fastest 20: 29498.525
#ResourceTypes = 3. Avg of fastest 20: 31201.248
{code}
* {{TestCapacitySchedulerPerf}} on bf2b687
{code}
#ResourceTypes = 2. Avg of fastest 20: 30211.48
{code}

> Perform SLS testing and run TestCapacitySchedulerPerf on trunk
> --------------------------------------------------------------
>
>                 Key: YARN-8458
>                 URL: https://issues.apache.org/jira/browse/YARN-8458
>             Project: Hadoop YARN
>          Issue Type: Task
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>         Attachments: sls_snapshot_cpu_snapshot_june_25.nps, sls_snapshot_memory_snapshot_june_25.nps
>
> Run SLS test and TestCapacitySchedulerPerf
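For readers comparing the numbers above: an "Avg of fastest 20" style metric averages the 20 best throughput samples to filter out warm-up noise. The sketch below shows one plausible way such a metric can be computed; the sample data is synthetic and the real computation inside TestCapacitySchedulerPerf may differ:

```java
// Sketch of an "average of the N fastest samples" metric.
// The sample values here are synthetic, not from the actual test run.
import java.util.Arrays;

public class AvgFastest {
    // Average of the n largest throughput samples (ops/sec).
    static double avgOfFastest(double[] samples, int n) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted); // ascending, so the fastest n are at the end
        double sum = 0;
        for (int i = sorted.length - n; i < sorted.length; i++) {
            sum += sorted[i];
        }
        return sum / n;
    }

    public static void main(String[] args) {
        double[] samples = new double[100];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = 20000 + i * 200; // synthetic throughput samples
        }
        // fastest 20 are 36000..39800, whose average is 37900.0
        System.out.println("Avg of fastest 20: " + avgOfFastest(samples, 20));
    }
}
```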
[jira] [Comment Edited] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf on trunk
[ https://issues.apache.org/jira/browse/YARN-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522912#comment-16522912 ]

Chandni Singh edited comment on YARN-8458 at 6/25/18 11:33 PM.
[jira] [Comment Edited] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf on trunk
[ https://issues.apache.org/jira/browse/YARN-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522912#comment-16522912 ]

Chandni Singh edited comment on YARN-8458 at 6/25/18 11:19 PM.
[jira] [Updated] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf on trunk
[ https://issues.apache.org/jira/browse/YARN-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8458:
    Summary: Perform SLS testing and run TestCapacitySchedulerPerf on trunk  (was: Perform SLS testing and run TestCapacitySchedulerPerf on branch-3.1)
[jira] [Commented] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522912#comment-16522912 ]

Chandni Singh commented on YARN-8458.
[jira] [Commented] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522911#comment-16522911 ]

Chandni Singh commented on YARN-8458:
-------------------------------------

SLS result:
Total has 441027 container allocated, 1470.09 containers allocated per second
Total has 441480 proposal accepted, 1562 rejected
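A quick arithmetic cross-check of the SLS summary above (this is not SLS code, just the reported totals): 441027 allocations at 1470.09 containers/sec implies a simulated run of roughly 300 seconds.

```java
// Sanity check on the reported SLS totals: allocations / rate = run length.
public class SlsCheck {
    public static void main(String[] args) {
        long allocated = 441_027;      // total containers allocated
        double perSecond = 1470.09;    // allocation rate (containers/sec)
        double seconds = allocated / perSecond;
        System.out.printf("run length ~ %.0f seconds%n", seconds); // ~300
    }
}
```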
[jira] [Updated] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh updated YARN-8458:
    Attachment: sls_snapshot_memory_snapshot_june_25.nps
                sls_snapshot_cpu_snapshot_june_25.nps
[jira] [Updated] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8458: Description: Run SLS test and TestCapacitySchedulerPerf (was: Run ) > Perform SLS testing and run TestCapacitySchedulerPerf on branch-3.1 > --- > > Key: YARN-8458 > URL: https://issues.apache.org/jira/browse/YARN-8458 > Project: Hadoop YARN > Issue Type: Task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > > Run SLS test and TestCapacitySchedulerPerf
[jira] [Updated] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf on branch-3.1
[ https://issues.apache.org/jira/browse/YARN-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8458: Summary: Perform SLS testing and run TestCapacitySchedulerPerf on branch-3.1 (was: Perform SLS testing and run TestCapacitySchedulerPerf) > Perform SLS testing and run TestCapacitySchedulerPerf on branch-3.1 > --- > > Key: YARN-8458 > URL: https://issues.apache.org/jira/browse/YARN-8458 > Project: Hadoop YARN > Issue Type: Task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > > Run
[jira] [Updated] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf
[ https://issues.apache.org/jira/browse/YARN-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8458: Description: Run > Perform SLS testing and run TestCapacitySchedulerPerf > - > > Key: YARN-8458 > URL: https://issues.apache.org/jira/browse/YARN-8458 > Project: Hadoop YARN > Issue Type: Task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > > Run
[jira] [Created] (YARN-8458) Perform SLS testing and run TestCapacitySchedulerPerf
Chandni Singh created YARN-8458: --- Summary: Perform SLS testing and run TestCapacitySchedulerPerf Key: YARN-8458 URL: https://issues.apache.org/jira/browse/YARN-8458 Project: Hadoop YARN Issue Type: Task Reporter: Chandni Singh Assignee: Chandni Singh
[jira] [Commented] (YARN-8445) YARN native service doesn't allow service name equals to component name
[ https://issues.apache.org/jira/browse/YARN-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518811#comment-16518811 ] Chandni Singh commented on YARN-8445: - [~eyang] could you please review? > YARN native service doesn't allow service name equals to component name > --- > > Key: YARN-8445 > URL: https://issues.apache.org/jira/browse/YARN-8445 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Fix For: 3.1.1 > > Attachments: YARN-8445.001.patch > > > Now YARN service doesn't allow specifying service name equals to component > name. > And it causes AM launch fails with msg like: > {code} > org.apache.hadoop.metrics2.MetricsException: Metrics source tf-zeppelin > already exists! > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.yarn.service.ServiceMetrics.register(ServiceMetrics.java:75) > at > org.apache.hadoop.yarn.service.component.Component.(Component.java:193) > at > org.apache.hadoop.yarn.service.ServiceScheduler.createAllComponents(ServiceScheduler.java:552) > at > org.apache.hadoop.yarn.service.ServiceScheduler.buildInstance(ServiceScheduler.java:251) > at > org.apache.hadoop.yarn.service.ServiceScheduler.serviceInit(ServiceScheduler.java:283) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.service.ServiceMaster.serviceInit(ServiceMaster.java:142) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:338) > 2018-06-18 06:50:39,473 [main] INFO 
service.ServiceScheduler - Stopping > service scheduler > {code} > It's better to add this check in the validation phase instead of failing the AM.
[jira] [Updated] (YARN-8445) YARN native service doesn't allow service name equals to component name
[ https://issues.apache.org/jira/browse/YARN-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8445: Attachment: YARN-8445.001.patch > YARN native service doesn't allow service name equals to component name > --- > > Key: YARN-8445 > URL: https://issues.apache.org/jira/browse/YARN-8445 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Fix For: 3.1.1 > > Attachments: YARN-8445.001.patch > > > Now YARN service doesn't allow specifying service name equals to component > name. > And it causes AM launch fails with msg like: > {code} > org.apache.hadoop.metrics2.MetricsException: Metrics source tf-zeppelin > already exists! > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.yarn.service.ServiceMetrics.register(ServiceMetrics.java:75) > at > org.apache.hadoop.yarn.service.component.Component.(Component.java:193) > at > org.apache.hadoop.yarn.service.ServiceScheduler.createAllComponents(ServiceScheduler.java:552) > at > org.apache.hadoop.yarn.service.ServiceScheduler.buildInstance(ServiceScheduler.java:251) > at > org.apache.hadoop.yarn.service.ServiceScheduler.serviceInit(ServiceScheduler.java:283) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.service.ServiceMaster.serviceInit(ServiceMaster.java:142) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:338) > 2018-06-18 06:50:39,473 [main] INFO service.ServiceScheduler - Stopping > service scheduler > 
{code} > It's better to add this check in the validation phase instead of failing the AM.
[jira] [Created] (YARN-8445) YARN native service doesn't allow service name equals to component name
Chandni Singh created YARN-8445: --- Summary: YARN native service doesn't allow service name equals to component name Key: YARN-8445 URL: https://issues.apache.org/jira/browse/YARN-8445 Project: Hadoop YARN Issue Type: Bug Reporter: Chandni Singh Assignee: Chandni Singh Fix For: 3.1.1 Currently, YARN Service does not allow a service name that equals a component name: the AM launch fails with a message like: {code} org.apache.hadoop.metrics2.MetricsException: Metrics source tf-zeppelin already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) at org.apache.hadoop.yarn.service.ServiceMetrics.register(ServiceMetrics.java:75) at org.apache.hadoop.yarn.service.component.Component.(Component.java:193) at org.apache.hadoop.yarn.service.ServiceScheduler.createAllComponents(ServiceScheduler.java:552) at org.apache.hadoop.yarn.service.ServiceScheduler.buildInstance(ServiceScheduler.java:251) at org.apache.hadoop.yarn.service.ServiceScheduler.serviceInit(ServiceScheduler.java:283) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.service.ServiceMaster.serviceInit(ServiceMaster.java:142) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:338) 2018-06-18 06:50:39,473 [main] INFO service.ServiceScheduler - Stopping service scheduler {code} It's better to add this check in the validation phase instead of failing the AM.
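[Editor's note] The fix the issue asks for is a name-collision check in the spec validation phase, before the AM is launched. A minimal sketch of such a check is below; the class and method names are illustrative, not the actual YARN Service API.

```java
import java.util.List;

// Hypothetical sketch of the validation suggested in YARN-8445: reject a
// service spec whose name collides with any of its component names, so the
// failure surfaces at submission time rather than as a MetricsException in
// the AM. Names here are illustrative, not the real ServiceClient code.
public class ServiceSpecValidator {

    /** Throws IllegalArgumentException if the service name equals a component name. */
    public static void validateNames(String serviceName, List<String> componentNames) {
        for (String component : componentNames) {
            if (component.equals(serviceName)) {
                throw new IllegalArgumentException(
                    "Component name " + component
                        + " must not be the same as the service name");
            }
        }
    }
}
```

With this kind of check, the `tf-zeppelin` spec from the stack trace above would be rejected by the client instead of registering a duplicate metrics source.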
[jira] [Updated] (YARN-8402) Yarn Service Destroy: Delete service entries from Zookeeper in the ServiceMaster instead of ServiceClient in the RM
[ https://issues.apache.org/jira/browse/YARN-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8402: Description: RM slows down considerably when multiple services are destroyed simultaneously. 1. Started approx 1000 services 2. Destroyed all the 1000 services. Observed considerable slowness in RM after this. The {{ServiceClient}} in RM uses the {{CuratorClient}} to delete zookeeper entries. The zookeeper client is the bottleneck and this could be avoided if the zookeeper entry can be deleted from the AM and then the {{ServiceClient}} can kill the app. was: The overwrite of service definition during flex is done from the ServiceClient. During auto finalization of upgrade, the current service definition gets overwritten as well by the service master. This creates a potential conflict. Need to move the overwrite of service definition during flex to the ServiceClient. Discussed on YARN-8018. > Yarn Service Destroy: Delete service entries from Zookeeper in the > ServiceMaster instead of ServiceClient in the RM > --- > > Key: YARN-8402 > URL: https://issues.apache.org/jira/browse/YARN-8402 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > > RM slows down considerably when multiple services are destroyed > simultaneously. > 1. Started approx 1000 services > 2. Destroyed all the 1000 services. > Observed considerable slowness in RM after this. > The {{ServiceClient}} in RM uses the {{CuratorClient}} to delete zookeeper > entries. > The zookeeper client is the bottleneck and this could be avoided if the > zookeeper entry can be deleted from the AM and then the {{ServiceClient}} can > kill the app.
[jira] [Updated] (YARN-8402) Yarn Service Destroy: Delete service entries from Zookeeper in the ServiceMaster instead of ServiceClient in the RM
[ https://issues.apache.org/jira/browse/YARN-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8402: Target Version/s: (was: 3.1.1) > Yarn Service Destroy: Delete service entries from Zookeeper in the > ServiceMaster instead of ServiceClient in the RM > --- > > Key: YARN-8402 > URL: https://issues.apache.org/jira/browse/YARN-8402 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > > RM slows down considerably when multiple services are destroyed > simultaneously. > 1. Started approx 1000 services > 2. Destroyed all the 1000 services. > Observed considerable slowness in RM after this. > The {{ServiceClient}} in RM uses the {{CuratorClient}} to delete zookeeper > entries. > The zookeeper client is the bottleneck and this could be avoided if the > zookeeper entry can be deleted from the AM and then the {{ServiceClient}} can > kill the app.
[jira] [Created] (YARN-8402) Yarn Service Destroy: Delete service entries from Zookeeper in the ServiceMaster instead of ServiceClient in the RM
Chandni Singh created YARN-8402: --- Summary: Yarn Service Destroy: Delete service entries from Zookeeper in the ServiceMaster instead of ServiceClient in the RM Key: YARN-8402 URL: https://issues.apache.org/jira/browse/YARN-8402 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chandni Singh Assignee: Chandni Singh The overwrite of service definition during flex is done from the ServiceClient. During auto finalization of upgrade, the current service definition gets overwritten as well by the service master. This creates a potential conflict. Need to move the overwrite of service definition during flex to the ServiceClient. Discussed on YARN-8018.
[jira] [Comment Edited] (YARN-8362) Number of remaining retries are updated twice after a container failure in NM
[ https://issues.apache.org/jira/browse/YARN-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491101#comment-16491101 ] Chandni Singh edited comment on YARN-8362 at 5/25/18 7:06 PM: -- In patch 2, I fixed the checkstyle. The test failure {{org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager.testLocalingResourceWhileContainerRunning}} is not related to this change. It fails in the existing trunk even without this change. was (Author: csingh): In patch 2, I fixed the checkstyle. The test failure {{org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager.testLocalingResourceWhileContainerRunning}} is not related to this change. > Number of remaining retries are updated twice after a container failure in NM > -- > > Key: YARN-8362 > URL: https://issues.apache.org/jira/browse/YARN-8362 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8362.001.patch, YARN-8362.002.patch > > > The {{shouldRetry(int errorCode)}} in {{ContainerImpl}} with YARN-5015 also > updated some fields in retry context- remaining retries, restart times. > This method is directly called from outside the ContainerImpl class as well- > {{ContainerLaunch.setContainerCompletedStatus}}. This causes following > problems: > # remainingRetries are updated more than once after a failure. if > {{maxRetries = 1}}, then a retry will not be triggered because of multiple > calls to {{shouldRetry(int errorCode).}} > # Writes to {{retryContext}} should be protected and called when the write > lock is held.
[jira] [Updated] (YARN-8362) Number of remaining retries are updated twice after a container failure in NM
[ https://issues.apache.org/jira/browse/YARN-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8362: Attachment: YARN-8362.002.patch > Number of remaining retries are updated twice after a container failure in NM > -- > > Key: YARN-8362 > URL: https://issues.apache.org/jira/browse/YARN-8362 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8362.001.patch, YARN-8362.002.patch > > > The {{shouldRetry(int errorCode)}} in {{ContainerImpl}} with YARN-5015 also > updated some fields in retry context- remaining retries, restart times. > This method is directly called from outside the ContainerImpl class as well- > {{ContainerLaunch.setContainerCompletedStatus}}. This causes following > problems: > # remainingRetries are updated more than once after a failure. if > {{maxRetries = 1}}, then a retry will not be triggered because of multiple > calls to {{shouldRetry(int errorCode).}} > # Writes to {{retryContext}} should be protected and called when the write > lock is held.
[jira] [Updated] (YARN-8362) Number of remaining retries are updated twice after a container failure in NM
[ https://issues.apache.org/jira/browse/YARN-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8362: Attachment: YARN-8362.001.patch > Number of remaining retries are updated twice after a container failure in NM > -- > > Key: YARN-8362 > URL: https://issues.apache.org/jira/browse/YARN-8362 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8362.001.patch > > > The {{shouldRetry(int errorCode)}} in {{ContainerImpl}} with YARN-5015 also > updated some fields in retry context- remaining retries, restart times. > This method is directly called from outside the ContainerImpl class as well- > {{ContainerLaunch.setContainerCompletedStatus}}. This causes following > problems: > # remainingRetries are updated more than once after a failure. if > {{maxRetries = 1}}, then a retry will not be triggered because of multiple > calls to {{shouldRetry(int errorCode).}} > # Writes to {{retryContext}} should be protected and called when the write > lock is held.
[jira] [Created] (YARN-8362) Number of remaining retries are updated twice after a container failure in NM
Chandni Singh created YARN-8362: --- Summary: Number of remaining retries are updated twice after a container failure in NM Key: YARN-8362 URL: https://issues.apache.org/jira/browse/YARN-8362 Project: Hadoop YARN Issue Type: Bug Reporter: Chandni Singh Assignee: Chandni Singh Fix For: 3.2.0, 3.1.1 The {{shouldRetry(int errorCode)}} in {{ContainerImpl}}, since YARN-5015, also updates some fields in the retry context: remaining retries and restart times. This method is directly called from outside the ContainerImpl class as well, from {{ContainerLaunch.setContainerCompletedStatus}}. This causes the following problems: # remainingRetries is updated more than once after a failure. If {{maxRetries = 1}}, then a retry will not be triggered because of multiple calls to {{shouldRetry(int errorCode)}}. # Writes to {{retryContext}} should be protected and only performed while the write lock is held.
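[Editor's note] The bug pattern in YARN-8362 is a "should retry" check that also mutates the retry counter, so calling it twice for one failure silently consumes both retries. A simplified, self-contained illustration of the fix direction (separating the read-only query from the state update, with both guarded by a lock) is below; this is not the actual ContainerImpl code.

```java
// Illustrative sketch of the problem described in YARN-8362. A shouldRetry()
// that also decrements remainingRetries is unsafe to call more than once per
// failure; splitting the pure query from the mutation makes repeated calls
// harmless, and synchronizing both models the write-lock requirement.
public class ContainerRetryContext {
    private int remainingRetries;

    public ContainerRetryContext(int maxRetries) {
        this.remainingRetries = maxRetries;
    }

    /** Pure query: safe to call any number of times per failure. */
    public synchronized boolean shouldRetry(int exitCode) {
        return exitCode != 0 && remainingRetries > 0;
    }

    /** State update: invoked exactly once per failure, under the lock. */
    public synchronized void recordRetryAttempt() {
        remainingRetries--;
    }

    public synchronized int getRemainingRetries() {
        return remainingRetries;
    }
}
```

In the buggy shape, a container with {{maxRetries = 1}} that fails once would see the counter decremented by both call sites and never restart; with the split above, any number of `shouldRetry` calls leaves the counter untouched.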
[jira] [Created] (YARN-8360) Yarn service conflict between restart policy and NM configuration
Chandni Singh created YARN-8360: --- Summary: Yarn service conflict between restart policy and NM configuration Key: YARN-8360 URL: https://issues.apache.org/jira/browse/YARN-8360 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Chandni Singh For the below spec, the service will not stop even after container failures, because of the NM auto-retry properties: * "yarn.service.container-failure.retry.max": 1, * "yarn.service.container-failure.validity-interval-ms": 5000 The NM will continue auto-restarting containers. {{fail_after 20}} fails after 20 seconds. Since the failure validity interval is 5 seconds, the NM will auto-restart the container. {code:java} { "name": "fail-demo2", "version": "1.0.0", "components" : [ { "name": "comp1", "number_of_containers": 1, "launch_command": "fail_after 20", "restart_policy": "NEVER", "resource": { "cpus": 1, "memory": "256" }, "configuration": { "properties": { "yarn.service.container-failure.retry.max": 1, "yarn.service.container-failure.validity-interval-ms": 5000 } } } ] } {code} If {{restart_policy}} is NEVER, then the service should stop after the container fails. Since we have introduced the service-level restart policies, I think we should make the NM auto-retry configurations part of the {{RetryPolicy}} and get rid of all {{yarn.service.container-failure.**}} properties. Otherwise it gets confusing.
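[Editor's note] The precedence YARN-8360 argues for is that a component's restart policy should override the raw NM auto-retry properties. A tiny hedged sketch of that resolution rule follows; the enum and method names are illustrative, not the actual YARN Service classes.

```java
// Hypothetical sketch of the precedence suggested in YARN-8360: when a
// component's restart_policy is NEVER, NM-level auto-retry is disabled no
// matter what yarn.service.container-failure.* says, so the two settings
// cannot conflict. Names are illustrative only.
public class RetryPolicyResolver {
    public enum RestartPolicy { ALWAYS, ON_FAILURE, NEVER }

    /** Effective NM retry count: the restart policy wins over the raw property. */
    public static int effectiveMaxRetries(RestartPolicy policy, int configuredMaxRetries) {
        if (policy == RestartPolicy.NEVER) {
            return 0; // never auto-restart, even if the property says otherwise
        }
        return configuredMaxRetries;
    }
}
```

Under this rule, the `fail-demo2` spec above would resolve to zero NM retries, and the service would stop after the container fails, as the reporter expects.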
[jira] [Updated] (YARN-8357) Yarn Service: NPE when service is saved first and then started.
[ https://issues.apache.org/jira/browse/YARN-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8357: Attachment: YARN-8357.001.patch > Yarn Service: NPE when service is saved first and then started. > --- > > Key: YARN-8357 > URL: https://issues.apache.org/jira/browse/YARN-8357 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Critical > Attachments: YARN-8357.001.patch > > > Line 972 in \{{ServiceClient}} returns a service with state \{{null}} which > is why there is a NPE. > {code:java} > 2018-05-24 04:39:22,911 INFO client.ServiceClient > (ServiceClient.java:getStatus(1203)) - Service test1 does not have an > application ID > 2018-05-24 04:39:22,911 ERROR webapp.ApiServer > (ApiServer.java:updateService(480)) - Error while performing operation for > app: test1 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.service.client.ServiceClient.actionStart(ServiceClient.java:974) > at > org.apache.hadoop.yarn.service.webapp.ApiServer$7.run(ApiServer.java:650) > at > org.apache.hadoop.yarn.service.webapp.ApiServer$7.run(ApiServer.java:644) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1687) > at > org.apache.hadoop.yarn.service.webapp.ApiServer.startService(ApiServer.java:644) > at > org.apache.hadoop.yarn.service.webapp.ApiServer.updateService(ApiServer.java:449) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > 
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733) > {code}
[jira] [Created] (YARN-8357) Yarn Service: NPE when service is saved first and then started.
Chandni Singh created YARN-8357: --- Summary: Yarn Service: NPE when service is saved first and then started. Key: YARN-8357 URL: https://issues.apache.org/jira/browse/YARN-8357 Project: Hadoop YARN Issue Type: Bug Reporter: Chandni Singh Assignee: Chandni Singh Line 972 in {{ServiceClient}} returns a service with state {{null}}, which is why there is an NPE. {code:java} 2018-05-24 04:39:22,911 INFO client.ServiceClient (ServiceClient.java:getStatus(1203)) - Service test1 does not have an application ID 2018-05-24 04:39:22,911 ERROR webapp.ApiServer (ApiServer.java:updateService(480)) - Error while performing operation for app: test1 java.lang.NullPointerException at org.apache.hadoop.yarn.service.client.ServiceClient.actionStart(ServiceClient.java:974) at org.apache.hadoop.yarn.service.webapp.ApiServer$7.run(ApiServer.java:650) at org.apache.hadoop.yarn.service.webapp.ApiServer$7.run(ApiServer.java:644) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1687) at org.apache.hadoop.yarn.service.webapp.ApiServer.startService(ApiServer.java:644) at org.apache.hadoop.yarn.service.webapp.ApiServer.updateService(ApiServer.java:449) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733) {code}
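[Editor's note] The NPE in YARN-8357 comes from dereferencing a state that is {{null}} for a saved-but-never-started service. A minimal sketch of the kind of null-state guard that avoids it is below; the class, enum, and method names are hypothetical, not the actual {{ServiceClient.actionStart}} code.

```java
// Illustrative guard for the failure mode in YARN-8357: a service that was
// only saved has no recorded state yet, so the start path must treat a null
// state as a legal starting point instead of dereferencing it.
public class ServiceStarter {
    public enum ServiceState { ACCEPTED, STARTED, STOPPED }

    /**
     * Decides whether a start request is valid. A null persisted state means
     * the service was saved but never started, which is startable, not an error.
     */
    public static boolean canStart(ServiceState persisted) {
        return persisted == null || persisted == ServiceState.STOPPED;
    }
}
```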
[jira] [Updated] (YARN-7530) hadoop-yarn-services-api should be part of hadoop-yarn-services
[ https://issues.apache.org/jira/browse/YARN-7530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-7530: Attachment: YARN-7530-branch-3.1.001.patch > hadoop-yarn-services-api should be part of hadoop-yarn-services > --- > > Key: YARN-7530 > URL: https://issues.apache.org/jira/browse/YARN-7530 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Chandni Singh >Priority: Blocker > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-7530-branch-3.1.001.patch, YARN-7530.001.patch, > YARN-7530.002.patch > > > Hadoop-yarn-services-api is currently a parallel project to > hadoop-yarn-services project. It would be better if hadoop-yarn-services-api > is part of hadoop-yarn-services for correctness.
[jira] [Comment Edited] (YARN-8341) Yarn Service: Integration tests
[ https://issues.apache.org/jira/browse/YARN-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484564#comment-16484564 ] Chandni Singh edited comment on YARN-8341 at 5/22/18 9:36 PM: -- The mvn additions are in the yarn-services-api pom. In order to run the Integration tests # cd hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-api # mvn failsafe:integration-test -Drm.host=localhost -Duser.name=root was (Author: csingh): The mvn additions are in yarn-services-api pom. In order to run the Integration tests # cd hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-api # mvn failsafe:integration-test -Drm.host=ctr-e138-1518143905142-80042-01-02.hwx.site -Duser.name=root > Yarn Service: Integration tests > > > Key: YARN-8341 > URL: https://issues.apache.org/jira/browse/YARN-8341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8341.wip.patch > > > In order to test the rest api end-to-end, we can add Integration tests for > Yarn service api. > The integration tests > * belong to junit category {{IntegrationTest}}. > * will only be run when triggered by executing {{mvn > failsafe:integration-test}} > * the surefire plugin for regular tests excludes {{IntegrationTest}} > * RM host, user name, and any additional properties which are needed to > execute the tests against a cluster can be passed as System properties. > For example, {{mvn failsafe:integration-test -Drm.host=localhost -Duser.name=root}} > We can add more integration tests which can check scalability and performance. > Having these tests here benefits everyone in the community because anyone can > run these tests against their cluster. > Attaching a work in progress patch. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8341) Yarn Service: Integration tests
[ https://issues.apache.org/jira/browse/YARN-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8341: Attachment: (was: YARN-8341.wip.patch) > Yarn Service: Integration tests > > > Key: YARN-8341 > URL: https://issues.apache.org/jira/browse/YARN-8341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8341.wip.patch > > > In order to test the rest api end-to-end, we can add Integration tests for > Yarn service api. > The integration tests > * belong to junit category {{IntegrationTest}}. > * will be only run when triggered by executing {{mvn > failsafe:integration-test}} > * the surefire plugin for regular tests excludes {{IntegrationTest}} > * RM host, user name, and any additional properties which are needed to > execute the tests against a cluster can be passed as System properties. > For eg. {{mvn failsafe:integration-test -Drm.host=localhost -Duser.name=root}} > We can add more integration tests which can check scalability and performance. > Have these tests here benefits everyone in the community because anyone can > run these tests against there cluster. > Attaching a work in progress patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8341) Yarn Service: Integration tests
[ https://issues.apache.org/jira/browse/YARN-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8341: Attachment: YARN-8341.wip.patch > Yarn Service: Integration tests > > > Key: YARN-8341 > URL: https://issues.apache.org/jira/browse/YARN-8341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8341.wip.patch, YARN-8341.wip.patch > > > In order to test the rest api end-to-end, we can add Integration tests for > Yarn service api. > The integration tests > * belong to junit category {{IntegrationTest}}. > * will be only run when triggered by executing {{mvn > failsafe:integration-test}} > * the surefire plugin for regular tests excludes {{IntegrationTest}} > * RM host, user name, and any additional properties which are needed to > execute the tests against a cluster can be passed as System properties. > For eg. {{mvn failsafe:integration-test -Drm.host=localhost -Duser.name=root}} > We can add more integration tests which can check scalability and performance. > Have these tests here benefits everyone in the community because anyone can > run these tests against there cluster. > Attaching a work in progress patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8341) Yarn Service: Integration tests
[ https://issues.apache.org/jira/browse/YARN-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484564#comment-16484564 ] Chandni Singh commented on YARN-8341: - The mvn additions are in yarn-services-api pom. In order to run the Integration tests # cd hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-api # mvn failsafe:integration-test -Drm.host=ctr-e138-1518143905142-80042-01-02.hwx.site -Duser.name=root > Yarn Service: Integration tests > > > Key: YARN-8341 > URL: https://issues.apache.org/jira/browse/YARN-8341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8341.wip.patch > > > In order to test the rest api end-to-end, we can add Integration tests for > Yarn service api. > The integration tests > * belong to junit category {{IntegrationTest}}. > * will be only run when triggered by executing {{mvn > failsafe:integration-test}} > * the surefire plugin for regular tests excludes {{IntegrationTest}} > * RM host, user name, and any additional properties which are needed to > execute the tests against a cluster can be passed as System properties. > For eg. {{mvn failsafe:integration-test -Drm.host=localhost -Duser.name=root}} > We can add more integration tests which can check scalability and performance. > Have these tests here benefits everyone in the community because anyone can > run these tests against there cluster. > Attaching a work in progress patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8341) Yarn Service: Integration tests
[ https://issues.apache.org/jira/browse/YARN-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8341: Attachment: YARN-8341.wip.patch > Yarn Service: Integration tests > > > Key: YARN-8341 > URL: https://issues.apache.org/jira/browse/YARN-8341 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8341.wip.patch > > > In order to test the rest api end-to-end, we can add Integration tests for > Yarn service api. > The integration tests > * belong to junit category {{IntegrationTest}}. > * will be only run when triggered by executing {{mvn > failsafe:integration-test}} > * the surefire plugin for regular tests excludes {{IntegrationTest}} > * RM host, user name, and any additional properties which are needed to > execute the tests against a cluster can be passed as System properties. > For eg. {{mvn failsafe:integration-test -Drm.host=localhost -Duser.name=root}} > We can add more integration tests which can check scalability and performance. > Have these tests here benefits everyone in the community because anyone can > run these tests against there cluster. > Attaching a work in progress patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8341) Yarn Service: Integration tests
Chandni Singh created YARN-8341: --- Summary: Yarn Service: Integration tests Key: YARN-8341 URL: https://issues.apache.org/jira/browse/YARN-8341 Project: Hadoop YARN Issue Type: Improvement Reporter: Chandni Singh Assignee: Chandni Singh In order to test the rest api end-to-end, we can add Integration tests for Yarn service api. The integration tests * belong to junit category {{IntegrationTest}}. * will be only run when triggered by executing {{mvn failsafe:integration-test}} * the surefire plugin for regular tests excludes {{IntegrationTest}} * RM host, user name, and any additional properties which are needed to execute the tests against a cluster can be passed as System properties. For eg. {{mvn failsafe:integration-test -Drm.host=localhost -Duser.name=root}} We can add more integration tests which can check scalability and performance. Have these tests here benefits everyone in the community because anyone can run these tests against there cluster. Attaching a work in progress patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7530) hadoop-yarn-services-api should be part of hadoop-yarn-services
[ https://issues.apache.org/jira/browse/YARN-7530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481385#comment-16481385 ] Chandni Singh commented on YARN-7530: - Thanks [~eyang] for reviewing and committing the patch. [~gsaha] thanks for reviewing. > hadoop-yarn-services-api should be part of hadoop-yarn-services > --- > > Key: YARN-7530 > URL: https://issues.apache.org/jira/browse/YARN-7530 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Chandni Singh >Priority: Trivial > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-7530.001.patch, YARN-7530.002.patch > > > Hadoop-yarn-services-api is currently a parallel project to > hadoop-yarn-services project. It would be better if hadoop-yarn-services-api > is part of hadoop-yarn-services for correctness. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7530) hadoop-yarn-services-api should be part of hadoop-yarn-services
[ https://issues.apache.org/jira/browse/YARN-7530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480977#comment-16480977 ] Chandni Singh commented on YARN-7530: - [~gsaha] All the unit tests for Service api and service core pass on my machine. Jenkins failed to verify the patch but I was able to run the below command successfully for the hadoop project. {code} mvn clean install -Pdist -Dtar -Dmaven.javadoc.skip=true -DskipShade -Danimal.sniffer.skip=true {code} I'll check if something is missed. > hadoop-yarn-services-api should be part of hadoop-yarn-services > --- > > Key: YARN-7530 > URL: https://issues.apache.org/jira/browse/YARN-7530 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Chandni Singh >Priority: Trivial > Fix For: yarn-native-services > > Attachments: YARN-7530.001.patch, YARN-7530.002.patch > > > Hadoop-yarn-services-api is currently a parallel project to > hadoop-yarn-services project. It would be better if hadoop-yarn-services-api > is part of hadoop-yarn-services for correctness. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7530) hadoop-yarn-services-api should be part of hadoop-yarn-services
[ https://issues.apache.org/jira/browse/YARN-7530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-7530: Attachment: YARN-7530.002.patch > hadoop-yarn-services-api should be part of hadoop-yarn-services > --- > > Key: YARN-7530 > URL: https://issues.apache.org/jira/browse/YARN-7530 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Chandni Singh >Priority: Trivial > Fix For: yarn-native-services > > Attachments: YARN-7530.001.patch, YARN-7530.002.patch > > > Hadoop-yarn-services-api is currently a parallel project to > hadoop-yarn-services project. It would be better if hadoop-yarn-services-api > is part of hadoop-yarn-services for correctness. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
[ https://issues.apache.org/jira/browse/YARN-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479791#comment-16479791 ] Chandni Singh commented on YARN-8141: - Thanks [~eyang] for reviewing and merging. Thanks [~shaneku...@gmail.com], [~billie.rinaldi], and [~leftnoteasy] for the reviews. > YARN Native Service: Respect > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec > -- > > Key: YARN-8141 > URL: https://issues.apache.org/jira/browse/YARN-8141 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Wangda Tan >Assignee: Chandni Singh >Priority: Critical > Labels: Docker > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8141.001.patch, YARN-8141.002.patch, > YARN-8141.003.patch, YARN-8141.004.patch, YARN-8141.005.patch, > YARN-8141.006.patch > > > Existing YARN native service overwrites > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless if user > specified this in service spec or not. It is important to allow user to mount > local folders like /etc/passwd, etc. > Following logic overwrites the > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment: > {code:java} > StringBuilder sb = new StringBuilder(); > for (Entrymount : mountPaths.entrySet()) { > if (sb.length() > 0) { > sb.append(","); > } > sb.append(mount.getKey()); > sb.append(":"); > sb.append(mount.getValue()); > } > env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS", > sb.toString());{code} > Inside AbstractLauncher.java -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
[ https://issues.apache.org/jira/browse/YARN-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8141: Attachment: YARN-8141.006.patch > YARN Native Service: Respect > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec > -- > > Key: YARN-8141 > URL: https://issues.apache.org/jira/browse/YARN-8141 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Wangda Tan >Assignee: Chandni Singh >Priority: Critical > Labels: Docker > Attachments: YARN-8141.001.patch, YARN-8141.002.patch, > YARN-8141.003.patch, YARN-8141.004.patch, YARN-8141.005.patch, > YARN-8141.006.patch > > > Existing YARN native service overwrites > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless if user > specified this in service spec or not. It is important to allow user to mount > local folders like /etc/passwd, etc. > Following logic overwrites the > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment: > {code:java} > StringBuilder sb = new StringBuilder(); > for (Entrymount : mountPaths.entrySet()) { > if (sb.length() > 0) { > sb.append(","); > } > sb.append(mount.getKey()); > sb.append(":"); > sb.append(mount.getValue()); > } > env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS", > sb.toString());{code} > Inside AbstractLauncher.java -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
[ https://issues.apache.org/jira/browse/YARN-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8141: Attachment: YARN-8141.005.patch > YARN Native Service: Respect > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec > -- > > Key: YARN-8141 > URL: https://issues.apache.org/jira/browse/YARN-8141 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Wangda Tan >Assignee: Chandni Singh >Priority: Critical > Labels: Docker > Attachments: YARN-8141.001.patch, YARN-8141.002.patch, > YARN-8141.003.patch, YARN-8141.004.patch, YARN-8141.005.patch > > > Existing YARN native service overwrites > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless if user > specified this in service spec or not. It is important to allow user to mount > local folders like /etc/passwd, etc. > Following logic overwrites the > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment: > {code:java} > StringBuilder sb = new StringBuilder(); > for (Entrymount : mountPaths.entrySet()) { > if (sb.length() > 0) { > sb.append(","); > } > sb.append(mount.getKey()); > sb.append(":"); > sb.append(mount.getValue()); > } > env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS", > sb.toString());{code} > Inside AbstractLauncher.java -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8081) Yarn Service Upgrade: Add support to upgrade a component
[ https://issues.apache.org/jira/browse/YARN-8081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476929#comment-16476929 ] Chandni Singh commented on YARN-8081: - Thanks [~eyang] for reviewing and merging. > Yarn Service Upgrade: Add support to upgrade a component > > > Key: YARN-8081 > URL: https://issues.apache.org/jira/browse/YARN-8081 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8081.001.patch, YARN-8081.002.patch, > YARN-8081.003.patch > > > Yarn service upgrade should provide an API to upgrade the component. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
[ https://issues.apache.org/jira/browse/YARN-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476578#comment-16476578 ] Chandni Singh commented on YARN-8141: - Failure of org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager.testContainerUpgradeRollbackDueToFailure looks unrelated. The test passes on my machine. > YARN Native Service: Respect > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec > -- > > Key: YARN-8141 > URL: https://issues.apache.org/jira/browse/YARN-8141 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Wangda Tan >Assignee: Chandni Singh >Priority: Critical > Attachments: YARN-8141.001.patch, YARN-8141.002.patch, > YARN-8141.003.patch, YARN-8141.004.patch > > > Existing YARN native service overwrites > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless if user > specified this in service spec or not. It is important to allow user to mount > local folders like /etc/passwd, etc. > Following logic overwrites the > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment: > {code:java} > StringBuilder sb = new StringBuilder(); > for (Entrymount : mountPaths.entrySet()) { > if (sb.length() > 0) { > sb.append(","); > } > sb.append(mount.getKey()); > sb.append(":"); > sb.append(mount.getValue()); > } > env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS", > sb.toString());{code} > Inside AbstractLauncher.java -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8301) Yarn Service Upgrade: Add documentation
Chandni Singh created YARN-8301: --- Summary: Yarn Service Upgrade: Add documentation Key: YARN-8301 URL: https://issues.apache.org/jira/browse/YARN-8301 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chandni Singh Assignee: Chandni Singh Add documentation for yarn service upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8300) Fix NPE in DefaultUpgradeComponentsFinder
[ https://issues.apache.org/jira/browse/YARN-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476502#comment-16476502 ] Chandni Singh commented on YARN-8300: - Thanks [~suma.shivaprasad] for catching. Looks good to me. > Fix NPE in DefaultUpgradeComponentsFinder > -- > > Key: YARN-8300 > URL: https://issues.apache.org/jira/browse/YARN-8300 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Suma Shivaprasad >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-8300.1.patch > > > In current upgrades for Yarn native services, we do not support > addition/deletion of compoents during upgrade. On trying to upgrade with the > same number of components in target spec as the current service spec but with > the one of the components having a new target spec and name, see the > following NPE in service AM logs > {noformat} > 2018-05-15 00:10:41,489 [IPC Server handler 0 on 37488] ERROR > service.ClientAMService - Error while trying to upgrade service {} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.lambda$findTargetComponentSpecs$0(UpgradeComponentsFinder.java:103) > at java.util.ArrayList.forEach(ArrayList.java:1257) > at > org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.findTargetComponentSpecs(UpgradeComponentsFinder.java:100) > at > org.apache.hadoop.yarn.service.ServiceManager.processUpgradeRequest(ServiceManager.java:259) > at > org.apache.hadoop.yarn.service.ClientAMService.upgrade(ClientAMService.java:163) > at > org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.upgradeService(ClientAMProtocolPBServiceImpl.java:81) > at > org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:5972) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at 
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns components/instances matching query params
Chandni Singh created YARN-8299: --- Summary: Yarn Service Upgrade: Add GET APIs that returns components/instances matching query params Key: YARN-8299 URL: https://issues.apache.org/jira/browse/YARN-8299 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chandni Singh Assignee: Chandni Singh We need APIs that returns containers/components that match the query params. These are needed so that we can find out what containers/components have been upgraded. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8298) Yarn Service Upgrade: Support fast component upgrades which accepts component spec
[ https://issues.apache.org/jira/browse/YARN-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8298: Summary: Yarn Service Upgrade: Support fast component upgrades which accepts component spec (was: Yarn Service Upgrade: Support fast component upgrades that accept component spec) > Yarn Service Upgrade: Support fast component upgrades which accepts component > spec > -- > > Key: YARN-8298 > URL: https://issues.apache.org/jira/browse/YARN-8298 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > > Currently service upgrade involves 2 steps > * initiate upgrade by providing new spec > * trigger upgrade of each instance/component > > We need to add the ability to upgrade a component in shot which accepts the > spec of the component. However there are couple of limitations when upgrading > in this way: > # Aborting the upgrade will not be supported > # Upgrade finalization will be done automatically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8298) Yarn Service Upgrade: Support fast component upgrades that accept component spec
Chandni Singh created YARN-8298: --- Summary: Yarn Service Upgrade: Support fast component upgrades that accept component spec Key: YARN-8298 URL: https://issues.apache.org/jira/browse/YARN-8298 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chandni Singh Assignee: Chandni Singh Currently service upgrade involves 2 steps * initiate upgrade by providing new spec * trigger upgrade of each instance/component We need to add the ability to upgrade a component in shot which accepts the spec of the component. However there are couple of limitations when upgrading in this way: # Aborting the upgrade will not be supported # Upgrade finalization will be done automatically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
[ https://issues.apache.org/jira/browse/YARN-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8141: Attachment: YARN-8141.004.patch > YARN Native Service: Respect > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec > -- > > Key: YARN-8141 > URL: https://issues.apache.org/jira/browse/YARN-8141 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Wangda Tan >Assignee: Chandni Singh >Priority: Critical > Attachments: YARN-8141.001.patch, YARN-8141.002.patch, > YARN-8141.003.patch, YARN-8141.004.patch > > > Existing YARN native service overwrites > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless if user > specified this in service spec or not. It is important to allow user to mount > local folders like /etc/passwd, etc. > Following logic overwrites the > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment: > {code:java} > StringBuilder sb = new StringBuilder(); > for (Entrymount : mountPaths.entrySet()) { > if (sb.length() > 0) { > sb.append(","); > } > sb.append(mount.getKey()); > sb.append(":"); > sb.append(mount.getValue()); > } > env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS", > sb.toString());{code} > Inside AbstractLauncher.java -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org