[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791525#comment-13791525
 ] 

Hudson commented on YARN-1284:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1574 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1574/])
Amending yarn CHANGES.txt moving YARN-1284 to 2.2.1 (tucu: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530716)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt


> LCE: Race condition leaves dangling cgroups entries for killed containers
> -
>
> Key: YARN-1284
> URL: https://issues.apache.org/jira/browse/YARN-1284
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.2.0
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
>Priority: Blocker
> Fix For: 2.2.1
>
> Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, 
> YARN-1284.patch, YARN-1284.patch
>
>
> When LCE & cgroups are enabled and a container is killed (in this case by 
> its owning AM, an MRAM), there appears to be a race condition at the OS level 
> between delivering the SIGTERM/SIGKILL and the OS completing all the necessary 
> cleanup. 
> The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, 
> immediately attempts to clean up the cgroups entry for the container, but this 
> fails with an error like:
> {code}
> 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
> 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
> 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
> 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
> {code}
> CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM 
> containers to avoid this problem. It seems this should be done for all 
> containers.
> Still, waiting an extra 500 ms seems too expensive.
> We should look at a more time-efficient way of doing this, perhaps retrying 
> deleteCgroup() in a loop with a minimal sleep and a timeout.
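
A minimal sketch of the retry-with-timeout idea described above (illustration 
only, not the committed patch; the {{deleteCgroupTimeout}} name is borrowed 
from the review comments later in this thread, everything else is assumed):

{code}
// Illustrative sketch: retry the cgroup directory delete with a small sleep
// until a timeout expires, instead of waiting a fixed 500 ms up front.
import java.io.File;

public class CgroupDeleteSketch {
  public static boolean deleteCgroup(String cgroupPath, long deleteCgroupTimeout)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + deleteCgroupTimeout;
    boolean deleted = new File(cgroupPath).delete();
    while (!deleted && System.currentTimeMillis() < deadline) {
      Thread.sleep(20); // minimal sleep between attempts
      deleted = new File(cgroupPath).delete();
    }
    return deleted; // caller logs a warning if this is still false
  }
}
{code}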



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791485#comment-13791485
 ] 

Hudson commented on YARN-1284:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1548 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1548/])
Amending yarn CHANGES.txt moving YARN-1284 to 2.2.1 (tucu: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530716)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt




[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791384#comment-13791384
 ] 

Hudson commented on YARN-1284:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #358 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/358/])
Amending yarn CHANGES.txt moving YARN-1284 to 2.2.1 (tucu: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530716)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt




[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790628#comment-13790628
 ] 

Hudson commented on YARN-1284:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4574 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4574/])
Amending yarn CHANGES.txt moving YARN-1284 to 2.2.1 (tucu: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530716)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt




[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790347#comment-13790347
 ] 

Hudson commented on YARN-1284:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1547 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1547/])
Add missing file TestCgroupsLCEResourcesHandler for YARN-1284. (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530493)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
YARN-1284. LCE: Race condition leaves dangling cgroups entries for killed 
containers. (Alejandro Abdelnur via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530492)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java




[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790320#comment-13790320
 ] 

Hudson commented on YARN-1284:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1573 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1573/])
Add missing file TestCgroupsLCEResourcesHandler for YARN-1284. (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530493)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
YARN-1284. LCE: Race condition leaves dangling cgroups entries for killed 
containers. (Alejandro Abdelnur via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530492)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java




[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790243#comment-13790243
 ] 

Hudson commented on YARN-1284:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #357 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/357/])
Add missing file TestCgroupsLCEResourcesHandler for YARN-1284. (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530493)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
YARN-1284. LCE: Race condition leaves dangling cgroups entries for killed 
containers. (Alejandro Abdelnur via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530492)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java




[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790036#comment-13790036
 ] 

Hudson commented on YARN-1284:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4568 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4568/])
Add missing file TestCgroupsLCEResourcesHandler for YARN-1284. (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530493)
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/util/TestCgroupsLCEResourcesHandler.java
YARN-1284. LCE: Race condition leaves dangling cgroups entries for killed 
containers. (Alejandro Abdelnur via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530492)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java




[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790020#comment-13790020
 ] 

Sandy Ryza commented on YARN-1284:
--

+1



[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789995#comment-13789995
 ] 

Hadoop QA commented on YARN-1284:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12607500/YARN-1284.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2150//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2150//console

This message is automatically generated.



[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789978#comment-13789978
 ] 

Alejandro Abdelnur commented on YARN-1284:
--

For the record, I've spent a couple of hours trying an alternate approach 
suggested by [~rvs] while chatting offline about this. His suggestion was to 
initialize a trash cgroup next to the container cgroups and, when a container 
is cleaned up, move the entries of its cgroup's tasks file into the trash 
cgroup, doing the equivalent of appending the container cgroup's {{tasks}} file 
to {{trash/tasks}}. I tried doing that, but it seems some of the Java IO native 
calls issue a system call that is not supported by the cgroups filesystem 
implementation, and I was getting the following stack trace:

{code}
java.io.IOException: Argument list too long
java.io.IOException: Argument list too long
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:318)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:80)
...
{code}

Given this, besides the fact that I didn't get it to work properly, I would not 
be comfortable with this approach, as it may behave differently across Linux 
versions.
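
For context, a rough sketch of what the trash-cgroup idea amounts to 
(hypothetical code and paths; this approach was abandoned as explained above). 
The cgroup pseudo-filesystem rejects large writes to its control files, which 
likely explains the {{Argument list too long}} (E2BIG) error, so a workaround 
would have to write one PID per {{write()}} rather than appending the whole 
file at once:

{code}
// Hypothetical illustration of the abandoned trash-cgroup approach, not the
// committed fix. Each PID from the container cgroup's tasks file is written
// to trash/tasks with its own small write() call.
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class TrashCgroupSketch {
  static void moveTasksToTrash(Path containerTasks, Path trashTasks) throws IOException {
    for (String pid : Files.readAllLines(containerTasks, StandardCharsets.UTF_8)) {
      String trimmed = pid.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      // one PID per write(); cgroup control files are not regular files
      try (OutputStream out = Files.newOutputStream(trashTasks, StandardOpenOption.WRITE)) {
        out.write(trimmed.getBytes(StandardCharsets.UTF_8));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // example paths, assuming the mount layout from the log in the description
    moveTasksToTrash(
        Paths.get("/run/cgroups/cpu/hadoop-yarn/container_example/tasks"),
        Paths.get("/run/cgroups/cpu/hadoop-yarn/trash/tasks"));
  }
}
{code}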



[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789977#comment-13789977
 ] 

Hadoop QA commented on YARN-1284:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12607499/YARN-1284.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2149//console

This message is automatically generated.



[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789864#comment-13789864
 ] 

Sandy Ryza commented on YARN-1284:
--

Oh, and also:
{code}
+if (! new File(cgroupPath).delete()) {
+  LOG.warn("Unable to delete cgroup at: " + cgroupPath +", tried to delete for " +
+  deleteCgroupTimeout + "ms");
 }
{code}
If the file was already deleted, delete() will return false and we'll log the 
warning even though nothing went wrong.  Instead, we should just check "if 
(!deleted)".



[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789861#comment-13789861
 ] 

Sandy Ryza commented on YARN-1284:
--

A few nits.  Otherwise LGTM.

{code}
+  //package private for testing purposes
+  private long deleteCgroupTimeout;
+  Clock clock;
{code}
Comment should go before the second variable. Also there should be a space 
after the "//".

{code}
+  //visible for testing
{code}
Should the VisibleForTesting annotation be used? This is in two places.

{code}
+LOG.debug("deleteCgroup: " + cgroupPath);
{code}
Should be surrounded by if (LOG.isDebugEnabled())
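
That is, something along these lines:

{code}
if (LOG.isDebugEnabled()) {
  LOG.debug("deleteCgroup: " + cgroupPath);
}
{code}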

{code}
+//file exists
{code}
Space after "//"?



[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789769#comment-13789769
 ] 

Alejandro Abdelnur commented on YARN-1284:
--

Tested in a cluster using cgroups; it works as expected, both the delete and 
the timeouts.



[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789429#comment-13789429
 ] 

Hadoop QA commented on YARN-1284:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12607388/YARN-1284.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2146//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2146//console

This message is automatically generated.



[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789355#comment-13789355
 ] 

Hadoop QA commented on YARN-1284:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12607374/YARN-1284.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2145//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/2145//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2145//console

This message is automatically generated.



[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789262#comment-13789262
 ] 

Hadoop QA commented on YARN-1284:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12607362/YARN-1284.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2144//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/2144//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2144//console

This message is automatically generated.
