[ https://issues.apache.org/jira/browse/YARN-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553540#comment-16553540 ]
Jason Lowe commented on YARN-8330:
----------------------------------

One of the envisioned use-cases of ATSv2 was to record every container allocation for an application in order to do analysis like what is done in YARN-415. That requires recording every container that enters the ALLOCATED state.

bq. If information is collected for end user consumption to understand their application usage, then by collecting RUNNING state might be sufficient.

That does not cover the case where an application receives containers but for some reason holds onto the allocations for a while before launching them. Tez, for example, has some corner cases in its scheduler where it can hold onto container allocations for prolonged periods before launching them while it reuses other, already active containers. Holding onto unlaunched container allocations is part of its footprint on the cluster. Only showing containers that made it to the RUNNING state tells only part of the application's usage story.

Fixing the yarn container -list command problem does not mean the fix has to be solely in ATSv2. If the data for a container in ATS is sufficient to distinguish allocated containers from running containers, then this can, and arguably should, be filtered for the yarn container -list use-case. If the data recorded for a container can't distinguish this, then we should look into fixing that so it can. But I don't think we should pre-filter ALLOCATED containers on the RM publishing side and preclude the proper app footprint analysis use-case entirely.
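To be concrete about the direction suggested above, here is a minimal sketch of reader-side filtering. The types and method below (ContainerEntry, State, forCliListing) are made-up stand-ins for whatever record the ATSv2 reader actually hands back, not real YARN classes; the point is only that the RM keeps publishing ALLOCATED containers while the yarn container -list path drops them before display.

{code}
import java.util.List;
import java.util.stream.Collectors;

public class ContainerListFilter {

  /** Illustrative state enum only; not the actual YARN ContainerState. */
  enum State { ALLOCATED, RUNNING, COMPLETE }

  /** Illustrative stand-in for whatever entity the ATSv2 reader returns. */
  static class ContainerEntry {
    final String id;
    final State state;

    ContainerEntry(String id, State state) {
      this.id = id;
      this.state = state;
    }
  }

  /**
   * Keep the RM publishing every ALLOCATED container so footprint analysis
   * still works, and filter on the reader/CLI side so "yarn container -list"
   * only shows containers that were actually launched.
   */
  static List<ContainerEntry> forCliListing(List<ContainerEntry> fromReader) {
    return fromReader.stream()
        .filter(c -> c.state != State.ALLOCATED)
        .collect(Collectors.toList());
  }
}
{code}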
> An extra container got launched by RM for yarn-service
> -------------------------------------------------------
>
>                 Key: YARN-8330
>                 URL: https://issues.apache.org/jira/browse/YARN-8330
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn-native-services
>            Reporter: Yesha Vora
>            Assignee: Suma Shivaprasad
>            Priority: Critical
>         Attachments: YARN-8330.1.patch, YARN-8330.2.patch, YARN-8330.3.patch
>
> Steps:
> launch Hbase tarball app
> list containers for hbase tarball app
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn container -list appattempt_1525463491331_0006_000001
> WARNING: YARN_LOG_DIR has been replaced by HADOOP_LOG_DIR. Using value of YARN_LOG_DIR.
> WARNING: YARN_LOGFILE has been replaced by HADOOP_LOGFILE. Using value of YARN_LOGFILE.
> WARNING: YARN_PID_DIR has been replaced by HADOOP_PID_DIR. Using value of YARN_PID_DIR.
> WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
> 18/05/04 22:36:11 INFO client.AHSProxy: Connecting to Application History server at xxx/xxx:10200
> 18/05/04 22:36:11 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> Total number of containers :5
> Container-Id                                 Start Time                      Finish Time  State    Host       Node Http Address  LOG-URL
> container_e06_1525463491331_0006_01_000002   Fri May 04 22:34:26 +0000 2018  N/A          RUNNING  xxx:25454  http://xxx:8042    http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_000002/hrt_qa
> 2018-05-04 22:36:11,216|INFO|MainThread|machine.py:167 - run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_000003   Fri May 04 22:34:26 +0000 2018  N/A  RUNNING  xxx:25454  http://xxx:8042  http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_000003/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_000001   Fri May 04 22:34:15 +0000 2018  N/A  RUNNING  xxx:25454  http://xxx:8042  http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_000001/hrt_qa
> 2018-05-04 22:36:11,217|INFO|MainThread|machine.py:167 - run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_000005   Fri May 04 22:34:56 +0000 2018  N/A  RUNNING  xxx:25454  http://xxx:8042  http://xxx:8042/node/containerlogs/container_e06_1525463491331_0006_01_000005/hrt_qa
> 2018-05-04 22:36:11,218|INFO|MainThread|machine.py:167 - run()||GUID=0169fa41-d1c5-4b43-85bf-c3e9f2682398|container_e06_1525463491331_0006_01_000004   Fri May 04 22:34:56 +0000 2018  N/A  null     xxx:25454  http://xxx:8042  http://xxx:8188/applicationhistory/logs/xxx:25454/container_e06_1525463491331_0006_01_000004/container_e06_1525463491331_0006_01_000004/hrt_qa
> {code}
> Total expected containers = 4 (3 component containers + 1 AM). Instead, RM is listing 5 containers.
> container_e06_1525463491331_0006_01_000004 is in null state.
> Yarn service utilized containers 02, 03, and 05 for components. There is no log available in the NM & AM related to container 04. Only one line is printed in the RM log:
> {code}
> 2018-05-04 22:34:56,618 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(489)) - container_e06_1525463491331_0006_01_000004 Container Transitioned from NEW to RESERVED
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org