[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791525#comment-13791525 ]

Hudson commented on YARN-1284:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1574 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1574/])
Amending yarn CHANGES.txt moving YARN-1284 to 2.2.1 (tucu: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530716)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt

> LCE: Race condition leaves dangling cgroups entries for killed containers
> --------------------------------------------------------------------------
>
>                 Key: YARN-1284
>                 URL: https://issues.apache.org/jira/browse/YARN-1284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.2.0
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Blocker
>             Fix For: 2.2.1
>
>         Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch
>
>
> When LCE & cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there appears to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup.
> The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
> {code}
> 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
> 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
> 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
> 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
> {code}
> CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers.
> Still, waiting an extra 500 ms seems too expensive.
> We should look at a more time-efficient way of doing this, maybe spinning until deleteCgroup() succeeds, with a minimal sleep and a timeout.
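The committed fix follows the spin-with-timeout idea sketched in the description. A minimal illustration of that approach, assuming hypothetical constant names and a hard-coded retry interval (the real patch appears to read its timeout from configuration, given that the commit touches YarnConfiguration.java); this is a sketch, not the committed code:

{code}
import java.io.File;

public class CgroupDeleter {
  // Hypothetical values for illustration; the actual patch makes the
  // timeout configurable.
  private static final long DELETE_CGROUP_TIMEOUT_MS = 1000;
  private static final long DELETE_CGROUP_RETRY_MS = 20;

  // Spin until the cgroup directory can be removed or the timeout expires.
  // rmdir() on a cgroup fails while the kernel still considers tasks
  // attached, so retrying with a short sleep bounds the wait instead of
  // always paying a fixed 500 ms.
  public boolean deleteCgroup(String cgroupPath) {
    File cgroup = new File(cgroupPath);
    long start = System.currentTimeMillis();
    boolean deleted = cgroup.delete();
    while (!deleted
        && (System.currentTimeMillis() - start) < DELETE_CGROUP_TIMEOUT_MS) {
      try {
        Thread.sleep(DELETE_CGROUP_RETRY_MS);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;
      }
      deleted = cgroup.delete();
    }
    return deleted;
  }
}
{code}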
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791485#comment-13791485 ]

Hudson commented on YARN-1284:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #1548 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1548/])
Amending yarn CHANGES.txt moving YARN-1284 to 2.2.1 (tucu: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530716)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791384#comment-13791384 ]

Hudson commented on YARN-1284:
------------------------------

SUCCESS: Integrated in Hadoop-Yarn-trunk #358 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/358/])
Amending yarn CHANGES.txt moving YARN-1284 to 2.2.1 (tucu: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530716)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790628#comment-13790628 ]

Hudson commented on YARN-1284:
------------------------------

SUCCESS: Integrated in Hadoop-trunk-Commit #4574 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4574/])
Amending yarn CHANGES.txt moving YARN-1284 to 2.2.1 (tucu: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530716)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790347#comment-13790347 ]

Hudson commented on YARN-1284:
------------------------------

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1547 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1547/])
Add missing file TestCgroupsLCEResourcesHandler for YARN-1284. (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530493)
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
YARN-1284. LCE: Race condition leaves dangling cgroups entries for killed containers. (Alejandro Abdelnur via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530492)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790320#comment-13790320 ]

Hudson commented on YARN-1284:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1573 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1573/])
Add missing file TestCgroupsLCEResourcesHandler for YARN-1284. (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530493)
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
YARN-1284. LCE: Race condition leaves dangling cgroups entries for killed containers. (Alejandro Abdelnur via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530492)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790243#comment-13790243 ]

Hudson commented on YARN-1284:
------------------------------

SUCCESS: Integrated in Hadoop-Yarn-trunk #357 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/357/])
Add missing file TestCgroupsLCEResourcesHandler for YARN-1284. (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530493)
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
YARN-1284. LCE: Race condition leaves dangling cgroups entries for killed containers. (Alejandro Abdelnur via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530492)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790036#comment-13790036 ]

Hudson commented on YARN-1284:
------------------------------

SUCCESS: Integrated in Hadoop-trunk-Commit #4568 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4568/])
Add missing file TestCgroupsLCEResourcesHandler for YARN-1284. (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530493)
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
YARN-1284. LCE: Race condition leaves dangling cgroups entries for killed containers. (Alejandro Abdelnur via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530492)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790020#comment-13790020 ]

Sandy Ryza commented on YARN-1284:
----------------------------------

+1
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789995#comment-13789995 ]

Hadoop QA commented on YARN-1284:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12607500/YARN-1284.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2150//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2150//console

This message is automatically generated.
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789978#comment-13789978 ]

Alejandro Abdelnur commented on YARN-1284:
------------------------------------------

For the record, I've spent a couple of hours trying an alternate approach suggested by [~rvs] while chatting offline about this. His suggestion was to initialize a trash cgroup next to the containers' cgroups and, when a container is cleaned up, transition its /tasks to the trash/tasks, doing the equivalent of a {{cat /tasks >> trash/tasks}}.

Tried doing that, but it seems some of the Java IO native calls make a system call which is not supported by the cgroups filesystem implementation, and I was getting the following stack trace:

{code}
java.io.IOException: Argument list too long
java.io.IOException: Argument list too long
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:318)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:80)
	...
{code}

Given this, besides the fact that I didn't get it to work properly, I would not be comfortable doing this, as it may behave differently on different Linux versions.
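For what it's worth, the "Argument list too long" (E2BIG) error is consistent with the cgroup tasks pseudo-file accepting only one PID per write(2) call, which a bulk {{IOUtils.copyBytes}} copy violates. A hypothetical sketch of what a working variant of the trash-cgroup idea would likely have to do, writing one PID per open/write/close; the class and method names are invented, and this illustrates the abandoned alternative, not the committed fix:

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class CgroupTaskMover {
  // Move every task out of a container's cgroup into a "trash" cgroup,
  // one PID per write: the cgroup tasks pseudo-file rejects writes
  // containing more than one PID, so a bulk copy of the whole file fails.
  static void drainTasks(String fromCgroup, String toCgroup)
      throws IOException {
    try (BufferedReader reader =
        new BufferedReader(new FileReader(fromCgroup + "/tasks"))) {
      String pid;
      while ((pid = reader.readLine()) != null) {
        // Open, write a single PID, and close for each task.
        try (Writer writer = new FileWriter(toCgroup + "/tasks")) {
          writer.write(pid);
        }
      }
    }
  }
}
{code}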
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789977#comment-13789977 ]

Hadoop QA commented on YARN-1284:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12607499/YARN-1284.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:red}-1 javac{color}. The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2149//console

This message is automatically generated.
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789864#comment-13789864 ]

Sandy Ryza commented on YARN-1284:
----------------------------------

Oh, and also:

{code}
+if (!new File(cgroupPath).delete()) {
+  LOG.warn("Unable to delete cgroup at: " + cgroupPath + ", tried to delete for "
+      + deleteCgroupTimeout + "ms");
+}
{code}

If the file was already deleted, delete() will return false and we'll log the warning even though nothing went wrong. Instead, we should just check "if (!deleted)".
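A sketch of the suggested correction, reusing the boolean computed by the retry loop instead of calling delete() a second time; LOG, clock, cgroupPath, and deleteCgroupTimeout are assumed to be the surrounding handler's members, and the 20 ms sleep is an illustrative value:

{code}
boolean deleted = false;
long start = clock.getTime();
do {
  deleted = new File(cgroupPath).delete();
  if (!deleted) {
    try {
      Thread.sleep(20);  // brief pause before retrying
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      break;
    }
  }
} while (!deleted && (clock.getTime() - start) < deleteCgroupTimeout);

if (!deleted) {
  // Warns only when the loop actually gave up, not when an earlier
  // delete() in the loop already succeeded.
  LOG.warn("Unable to delete cgroup at: " + cgroupPath
      + ", tried to delete for " + deleteCgroupTimeout + "ms");
}
{code}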
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789861#comment-13789861 ]

Sandy Ryza commented on YARN-1284:
----------------------------------

A few nits. Otherwise LGTM.

{code}
+  //package private for testing purposes
+  private long deleteCgroupTimeout;
+  Clock clock;
{code}

Comment should go before the second variable. Also there should be a space after the "//".

{code}
+  //visible for testing
{code}

Should the VisibleForTesting annotation be used? This is in two places.

{code}
+    LOG.debug("deleteCgroup: " + cgroupPath);
{code}

Should be surrounded by if (LOG.isDebugEnabled()).

{code}
+    //file exists
{code}

Space after "//"?
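For reference, the two idioms being requested look roughly like this; the field names are taken from the quoted patch, LOG and Clock are assumed to be the handler's existing members, and the debug guard is the standard commons-logging pattern:

{code}
import com.google.common.annotations.VisibleForTesting;

// Guava's annotation documents test-only visibility explicitly,
// instead of a bare "// visible for testing" comment.
@VisibleForTesting
long deleteCgroupTimeout;
@VisibleForTesting
Clock clock;

void logDeleteCgroup(String cgroupPath) {
  // The guard skips building the message string when DEBUG is disabled.
  if (LOG.isDebugEnabled()) {
    LOG.debug("deleteCgroup: " + cgroupPath);
  }
}
{code}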
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789769#comment-13789769 ]

Alejandro Abdelnur commented on YARN-1284:
------------------------------------------

Tested in a cluster using cgroups; works as expected, both the delete and the timeouts.
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789429#comment-13789429 ]

Hadoop QA commented on YARN-1284:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12607388/YARN-1284.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2146//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2146//console

This message is automatically generated.
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789355#comment-13789355 ]

Hadoop QA commented on YARN-1284:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12607374/YARN-1284.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2145//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2145//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2145//console

This message is automatically generated.
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789262#comment-13789262 ]

Hadoop QA commented on YARN-1284:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12607362/YARN-1284.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2144//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2144//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2144//console

This message is automatically generated.