Hey Björn, I think I figured out the issue: some of the cgroups values are still hardcoded in Myriad. I'll file a JIRA ticket; hopefully we can get a fix into 0.2.0. I'll also respond to this thread once a pull request is submitted, in case you'd like to test it.
Darin

Hi all,

I have trouble starting the NM on the slave nodes. Apparently, it either does not find its configuration or something is wrong with the configuration.

With cgroups enabled, the NM does not start. The logs contain the exception below, indicating that there is something wrong in the configuration. However, yarn.nodemanager.linux-container-executor.group is set (to "yarn"). The value used to be "${yarn.nodemanager.linux-container-executor.group}", as indicated by the installation documentation, but I'm uncertain whether this recursion is the correct approach.

==================================================
16/03/14 09:32:45 FATAL nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:213)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:474)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:521)
Caused by: java.io.IOException: Linux container executor not configured properly (error=24)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:193)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:211)
        ... 3 more
Caused by: ExitCodeException exitCode=24: Can't get configured value for yarn.nodemanager.linux-container-executor.group.
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
        at org.apache.hadoop.util.Shell.run(Shell.java:460)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:187)
        ... 4 more
==================================================

I have given it another try with cgroups disabled (in myriad-config-default.yml). I seem to get a little further, but I am still stuck at running YARN jobs:

==================================================
16/03/14 10:56:34 INFO container.Container: Container container_1457949199710_0001_01_000001 transitioned from LOCALIZED to RUNNING
16/03/14 10:56:34 INFO nodemanager.DefaultContainerExecutor: launchContainer: [bash, /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/bjoernh/appcache/application_1457949199710_0001/container_1457949199710_0001_01_000001/default_container_executor.sh]
16/03/14 10:56:34 WARN nodemanager.DefaultContainerExecutor: Exit code from container container_1457949199710_0001_01_000001 is : 1
16/03/14 10:56:34 WARN nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1457949199710_0001_01_000001 and exit code: 1
ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
        at org.apache.hadoop.util.Shell.run(Shell.java:460)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
16/03/14 10:56:34 INFO nodemanager.ContainerExecutor: Exception from container-launch.
16/03/14 10:56:34 INFO nodemanager.ContainerExecutor: Container id: container_1457949199710_0001_01_000001
16/03/14 10:56:34 INFO nodemanager.ContainerExecutor: Exit code: 1
==================================================

Unfortunately, the directory /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/bjoernh/appcache/ is empty; the log indicates that it is deleted after the failed attempt.

Again, any hint would be useful, also regarding the activation of cgroups.

Best regards,
Björn

--
Dipl.-Inform. Björn Hagemeier
Federated Systems and Data
Juelich Supercomputing Centre
Institute for Advanced Simulation
Email: b.hageme...@fz-juelich.de
WWW  : http://www.fz-juelich.de/jsc
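
A note on the exitCode=24 error quoted above: the message is emitted by the native container-executor binary, which, as far as I know, reads its settings from a plain key=value file (container-executor.cfg) rather than from yarn-site.xml, so a "${...}" reference would not be expanded there. The sketch below checks whether the group key in such a file carries a literal value. The function names and the sample contents are hypothetical, chosen only for illustration; this is not part of Myriad or Hadoop.

```python
import re
import sys
from typing import Dict

GROUP_KEY = "yarn.nodemanager.linux-container-executor.group"
# Matches an unexpanded ${...} reference left in a value.
PLACEHOLDER = re.compile(r"\$\{[^}]*\}")


def parse_cfg(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    entries: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def check_group(text: str, key: str = GROUP_KEY) -> str:
    """Classify the key's value: 'ok', 'missing', 'empty' or 'unexpanded placeholder'."""
    value = parse_cfg(text).get(key)
    if value is None:
        return "missing"
    if value == "":
        return "empty"
    if PLACEHOLDER.search(value):
        return "unexpanded placeholder"
    return "ok"


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Path supplied by the user; its location depends on the installation.
        with open(sys.argv[1], encoding="utf-8") as handle:
            contents = handle.read()
    else:
        # Hypothetical sample mirroring the value mentioned in the message above.
        contents = GROUP_KEY + "=${" + GROUP_KEY + "}\n"
    print(GROUP_KEY, "->", check_group(contents))
```

Run it with the path to the container-executor.cfg used on the slave node as the only argument; without an argument it evaluates the built-in sample. Any result other than "ok" would be consistent with the "Can't get configured value" message, although it does not prove that this is the cause here.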