[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-09 Thread Alejandro Abdelnur (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Abdelnur updated YARN-1284:
-

Fix Version/s: (was: 2.3.0)
   2.2.1

Committed to branch-2.2.

> LCE: Race condition leaves dangling cgroups entries for killed containers
> -
>
> Key: YARN-1284
> URL: https://issues.apache.org/jira/browse/YARN-1284
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.2.0
>Reporter: Alejandro Abdelnur
>Assignee: Alejandro Abdelnur
>Priority: Blocker
> Fix For: 2.2.1
>
> Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, 
> YARN-1284.patch, YARN-1284.patch
>
>
> When LCE & cgroups are enabled and a container is killed (in this case by 
> its owning AM, an MRAM), there seems to be a race condition at the OS level 
> between delivering the SIGTERM/SIGKILL and the OS completing all the 
> necessary cleanup. 
> The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, 
> immediately attempts to clean up the cgroups entry for the container, but 
> this fails with an error like:
> {code}
> 2013-10-07 15:21:24,359 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
> from container container_1381179532433_0016_01_11 is : 143
> 2013-10-07 15:21:24,359 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Processing container_1381179532433_0016_01_11 of type 
> UPDATE_DIAGNOSTICS_MSG
> 2013-10-07 15:21:24,359 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: 
> deleteCgroup: 
> /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
> 2013-10-07 15:21:24,359 WARN 
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: 
> Unable to delete cgroup at: 
> /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
> {code}
> CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM 
> containers to avoid this problem. It seems this should be done for all 
> containers.
> Still, always waiting an extra 500 ms seems too expensive.
> We should look at a more time-efficient way of doing this, perhaps spinning 
> until deleteCgroup() succeeds, with a minimal sleep between attempts and an 
> overall timeout.
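
A minimal sketch of the spin-until-deleted approach described above, using a 
while loop with a fixed retry interval and an overall timeout. The class, 
constants, and method names here are illustrative only and are not the actual 
CgroupsLCEResourcesHandler code:

{code}
// Illustrative sketch only -- not the actual CgroupsLCEResourcesHandler code.
// Retries the cgroup directory delete with a short sleep until it succeeds or
// an overall time budget expires, instead of giving up on the first attempt.
import java.io.File;

public class CgroupDeleteSketch {
  private static final long DELETE_TIMEOUT_MS = 500; // overall budget (assumed)
  private static final long RETRY_INTERVAL_MS = 20;  // sleep between attempts

  // Returns true if the container cgroup directory was removed in time.
  static boolean deleteCgroupWithRetry(String cgroupPath) {
    File cgroup = new File(cgroupPath);
    long deadline = System.currentTimeMillis() + DELETE_TIMEOUT_MS;
    boolean deleted = cgroup.delete(); // rmdir; fails while tasks still exist
    while (!deleted && System.currentTimeMillis() < deadline) {
      try {
        Thread.sleep(RETRY_INTERVAL_MS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
      deleted = cgroup.delete();
    }
    return deleted;
  }
}
{code}

The loop exits as soon as a delete attempt succeeds, so the common case pays 
only one or two 20 ms sleeps rather than a fixed 500 ms wait.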





[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Alejandro Abdelnur (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Abdelnur updated YARN-1284:
-

Attachment: YARN-1284.patch

Updating the patch with one last change (which was not in my git cache): the 
default timeout is now 1000ms (up from 500ms). While testing this on a 4-node 
cluster running pi 500 500, there was one occurrence of a leftover container 
cgroup due to a timeout. This was on a cluster running in VMs, which would 
explain blowing the 500ms timeout, but I'd still rather bump it up, given that 
the wait breaks as soon as the cgroup is deleted and the attempts happen every 
20ms.
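
The comment above refers to a configurable timeout; a rough Java illustration 
of how such a knob might be read, and how the 20 ms retry cadence relates to 
it, follows. The property name and defaults below are assumptions for 
illustration, not necessarily what the committed patch uses:

{code}
// Illustrative only: reading an assumed delete-timeout property and relating
// it to the 20ms retry interval. Property name and defaults are assumptions.
import org.apache.hadoop.conf.Configuration;

public class CgroupDeleteTimeoutConfigSketch {
  // Assumed property name and default, for illustration only.
  static final String DELETE_TIMEOUT_KEY =
      "yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms";
  static final long DEFAULT_DELETE_TIMEOUT_MS = 1000; // bumped up from 500ms
  static final long RETRY_INTERVAL_MS = 20;           // sleep between attempts

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    long timeoutMs = conf.getLong(DELETE_TIMEOUT_KEY, DEFAULT_DELETE_TIMEOUT_MS);
    // A 1000ms budget with a 20ms interval allows up to ~50 attempts, but the
    // wait ends as soon as a delete attempt succeeds, so the larger default
    // only costs time in the rare case where the cgroup is slow to empty.
    long maxAttempts = timeoutMs / RETRY_INTERVAL_MS;
    System.out.println("timeout=" + timeoutMs + "ms, ~" + maxAttempts + " attempts");
  }
}
{code}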



[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Alejandro Abdelnur (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Abdelnur updated YARN-1284:
-

Attachment: YARN-1284.patch

Addressing Sandy's comments. Reworked the while-loop logic using a do-while 
block; it seems a bit cleaner that way.
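
For reference, the do-while variant folds the first delete attempt into the 
loop body instead of making it before the loop, as in the earlier while-loop 
sketch. This is again only a sketch under the same assumptions, not the patch 
itself:

{code}
// Sketch only: do-while shape -- attempt the delete at least once, then keep
// retrying until it succeeds or the time budget is used up.
import java.io.File;

public class DoWhileDeleteSketch {
  static boolean deleteWithRetry(File cgroupDir, long timeoutMs, long intervalMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    boolean deleted;
    do {
      deleted = cgroupDir.delete();            // rmdir of the container cgroup
      if (!deleted && System.currentTimeMillis() < deadline) {
        Thread.sleep(intervalMs);              // e.g. 20ms between attempts
      }
    } while (!deleted && System.currentTimeMillis() < deadline);
    return deleted;
  }
}
{code}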



[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Alejandro Abdelnur (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Abdelnur updated YARN-1284:
-

Attachment: YARN-1284.patch



[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Alejandro Abdelnur (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Abdelnur updated YARN-1284:
-

Attachment: YARN-1284.patch



[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Alejandro Abdelnur (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Abdelnur updated YARN-1284:
-

Target Version/s: 2.2.1



[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers

2013-10-08 Thread Alejandro Abdelnur (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Abdelnur updated YARN-1284:
-

Attachment: YARN-1284.patch

The patch changes the deleteCgroup() method to retry the delete in a loop 
(every 20ms) until it succeeds or times out (500ms). Also, this is now done 
for all containers, not only for AM containers. It also introduces a 
configuration knob for the timeout.

Other changes, such as the method signatures and the initConfig() method, are 
there to enable unit testing of the new logic.
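
A hedged sketch of how that unit testing could look, with the delete operation 
injected so a test can simulate a cgroup that only becomes removable after a 
few attempts. The interface, helper, and test names are hypothetical and are 
not taken from the patch:

{code}
// Hypothetical test sketch: the delete is injected so the retry logic can be
// exercised without touching the real cgroups filesystem.
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.util.concurrent.atomic.AtomicInteger;
import org.junit.Test;

public class TestCgroupDeleteRetrySketch {

  /** Stand-in for the filesystem delete; real code would call File#delete(). */
  interface DeleteOp {
    boolean attempt();
  }

  /** Same retry-with-timeout idea as the sketches above, made injectable. */
  static boolean retryDelete(DeleteOp op, long timeoutMs, long intervalMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    boolean deleted;
    do {
      deleted = op.attempt();
      if (!deleted) {
        Thread.sleep(intervalMs);
      }
    } while (!deleted && System.currentTimeMillis() < deadline);
    return deleted;
  }

  @Test
  public void deletesOnceCgroupBecomesRemovable() throws Exception {
    final AtomicInteger attempts = new AtomicInteger();
    // Simulate a cgroup whose tasks exit after the third delete attempt.
    DeleteOp op = new DeleteOp() {
      @Override
      public boolean attempt() {
        return attempts.incrementAndGet() >= 3;
      }
    };
    assertTrue(retryDelete(op, 500, 5));
    assertTrue(attempts.get() >= 3);
  }

  @Test
  public void givesUpAfterTimeout() throws Exception {
    DeleteOp op = new DeleteOp() {
      @Override
      public boolean attempt() {
        return false; // cgroup never becomes removable
      }
    };
    assertFalse(retryDelete(op, 50, 5));
  }
}
{code}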
