Re: NM does not start with cgroups enabled

Björn Hagemeier Sat, 19 Mar 2016 12:21:57 -0700

Hi Darin,

container-executor.cfg on the master node:
==================================================
yarn.nodemanager.linux-container-executor.group=yarn #configured value of yarn.nodemanager.linux-container-executor.group
banned.users=#comma separated list of users who can not run applications
min.user.id=1000#Prevent other super-users
allowed.system.users=bjoernh,yarn##comma separated list of system users who CAN run applications
==================================================


On the slave nodes:
==================================================
#configured value of yarn.nodemanager.linux-container-executor.group
yarn.nodemanager.linux-container-executor.group=yarn
#comma separated list of users who can not run applications
banned.users=hfds,yarn,mapred,bin
#Prevent other super-users
min.user.id=99
#comma separated list of system users who CAN run applications
allowed.system.users=
==================================================

The difference arises because the NM installation is not defined in
Puppet on the master node. I have already tried different values for
allowed.system.users, but without success so far.
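
To compare the two files more systematically, a small script along the following lines might help. It is only a sketch: the names parse_cfg and diff_cfg are made up for illustration, and it assumes the file consists of plain key=value lines in which only lines starting with '#' are comments. Whether the real container-executor binary tolerates a comment after a value is not verified here, so the script merely flags such lines.

```python
# Sketch only: parse_cfg and diff_cfg are hypothetical helper names.
# Assumption: only lines that START with '#' are comments; a '#' that
# follows a value is flagged, not stripped.

MASTER = """\
yarn.nodemanager.linux-container-executor.group=yarn #configured value of yarn.nodemanager.linux-container-executor.group
banned.users=#comma separated list of users who can not run applications
min.user.id=1000#Prevent other super-users
allowed.system.users=bjoernh,yarn##comma separated list of system users who CAN run applications
"""

SLAVE = """\
#configured value of yarn.nodemanager.linux-container-executor.group
yarn.nodemanager.linux-container-executor.group=yarn
#comma separated list of users who can not run applications
banned.users=hfds,yarn,mapred,bin
#Prevent other super-users
min.user.id=99
#comma separated list of system users who CAN run applications
allowed.system.users=
"""


def parse_cfg(text):
    """Return (settings, warnings) for key=value text."""
    settings, warnings = {}, []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and whole-line comments.
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            warnings.append(f"line {lineno}: no '=' found: {line!r}")
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        # A '#' inside the value usually means a trailing comment.
        if "#" in value:
            warnings.append(f"line {lineno}: '#' in value of {key}: {value!r}")
        settings[key] = value
    return settings, warnings


def diff_cfg(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    master, master_warnings = parse_cfg(MASTER)
    slave, slave_warnings = parse_cfg(SLAVE)
    for warning in master_warnings + slave_warnings:
        print("WARNING", warning)
    for key, (m, s) in diff_cfg(master, slave).items():
        print(f"{key}: master={m!r} slave={s!r}")
```

In practice the two strings would be read from the actual files on each node instead of being pasted into the script.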

Is it correct that the container-executor.cfg is only relevant on the NM
nodes?


Best regards and thanks for your efforts,
Björn

On 16.03.2016 at 07:10, Darin Johnson wrote:
> What does your container-executor.cfg look like?  Seems like
> yarn.nodemanager.linux-container-executor.group isn't set, or possibly
> banned.users= hasn't been set (some distros).
> 
> On Tue, Mar 15, 2016 at 12:52 PM, Darin Johnson <dbjohnson1...@gmail.com>
> wrote:
> 
>> Bjorn,
>>
>> Your isolation configuration is correct; I was going from memory.  I'll
>> take a look at your configs a little later on my test environment and see
>> what I can come up with.
>>
>> Darin
>>
>> On Tue, Mar 15, 2016 at 12:07 PM, Björn Hagemeier <b.hageme...@fz-juelich.de> wrote:
>>
>>> Dear Darin,
>>>
>>> thanks for your response.
>>>
>>> The precise content of /etc/mesos-slave/isolation is:
>>>
>>> ==================================================
>>> cgroups/cpu,cgroups/mem
>>> ==================================================
>>>
>>> I took this from some documentation, possibly that of the Puppet
>>> module I'm using [1]. Should the values be different? Your string
>>> looks a bit different: "cpu/cgroups,memory/cgroups".
>>>
>>> Please find my yarn-site.xml and myriad-config-default.yml attached. I
>>> don't think they contain any sensitive information.
>>>
>>>
>>> Best regards,
>>> Björn
>>>
>>> [1] https://github.com/deric/puppet-mesos
>>>
>>> On 15.03.2016 at 16:46, Darin Johnson wrote:
>>>> Hey Bjorn,
>>>>
>>>> Can you copy and paste the relevant parts of the Myriad config and
>>>> yarn-site.xml? Also, can you ensure you are running the mesos-slave with
>>>> --isolation="cpu/cgroups,memory/cgroups"?
>>>>
>>>> I'll try to recreate the problem and/or tell you what's missing in the
>>>> config.
>>>>
>>>> Darin
>>>>
>>>> On Mon, Mar 14, 2016 at 6:19 AM, Björn Hagemeier <b.hageme...@fz-juelich.de> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have trouble starting the NM on the slave nodes. Apparently, it does
>>>>> not find its configuration, or something is wrong with the configuration.
>>>>>
>>>>> With cgroups enabled, the NM does not start. The logs contain the error
>>>>> below, indicating that something is wrong in the configuration. However,
>>>>> yarn.nodemanager.linux-container-executor.group is set (to "yarn"). The
>>>>> value used to be "${yarn.nodemanager.linux-container-executor.group}", as
>>>>> indicated by the installation documentation, but I'm uncertain
>>>>> whether this recursion is the correct approach.
>>>>>
>>>>>
>>>>> ==================================================
>>>>> 16/03/14 09:32:45 FATAL nodemanager.NodeManager: Error starting NodeManager
>>>>> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
>>>>>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:213)
>>>>>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>>>>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:474)
>>>>>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:521)
>>>>> Caused by: java.io.IOException: Linux container executor not configured properly (error=24)
>>>>>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:193)
>>>>>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:211)
>>>>>         ... 3 more
>>>>> Caused by: ExitCodeException exitCode=24: Can't get configured value for yarn.nodemanager.linux-container-executor.group.
>>>>>
>>>>>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
>>>>>         at org.apache.hadoop.util.Shell.run(Shell.java:460)
>>>>>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
>>>>>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:187)
>>>>>         ... 4 more
>>>>> ==================================================
>>>>>
>>>>>
>>>>> I have given it another try with cgroups disabled (in
>>>>> myriad-config-default.yml). I get a little further, but I am still
>>>>> stuck at running YARN jobs:
>>>>>
>>>>> ==================================================
>>>>> 16/03/14 10:56:34 INFO container.Container: Container container_1457949199710_0001_01_000001 transitioned from LOCALIZED to RUNNING
>>>>> 16/03/14 10:56:34 INFO nodemanager.DefaultContainerExecutor: launchContainer: [bash, /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/bjoernh/appcache/application_1457949199710_0001/container_1457949199710_0001_01_000001/default_container_executor.sh]
>>>>> 16/03/14 10:56:34 WARN nodemanager.DefaultContainerExecutor: Exit code from container container_1457949199710_0001_01_000001 is : 1
>>>>> 16/03/14 10:56:34 WARN nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1457949199710_0001_01_000001 and exit code: 1
>>>>> ExitCodeException exitCode=1:
>>>>>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
>>>>>         at org.apache.hadoop.util.Shell.run(Shell.java:460)
>>>>>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
>>>>>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
>>>>>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>>>>>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>> 16/03/14 10:56:34 INFO nodemanager.ContainerExecutor: Exception from container-launch.
>>>>> 16/03/14 10:56:34 INFO nodemanager.ContainerExecutor: Container id: container_1457949199710_0001_01_000001
>>>>> 16/03/14 10:56:34 INFO nodemanager.ContainerExecutor: Exit code: 1
>>>>> ==================================================
>>>>>
>>>>> Unfortunately, the directory
>>>>> /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/bjoernh/appcache/
>>>>> is empty; the log indicates that it is being deleted after the failed
>>>>> attempt.
>>>>>
>>>>> Again, any hint would be useful, also regarding the activation of
>>>>> cgroups.
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Björn
>>>>>
>>>>> --
>>>>> Dipl.-Inform. Björn Hagemeier
>>>>> Federated Systems and Data
>>>>> Juelich Supercomputing Centre
>>>>> Institute for Advanced Simulation
>>>>>
>>>>> Phone: +49 2461 61 1584
>>>>> Fax  : +49 2461 61 6656
>>>>> Email: b.hageme...@fz-juelich.de
>>>>> Skype: bhagemeier
>>>>> WWW  : http://www.fz-juelich.de/jsc
>>>>>
>>>>> JSC is the coordinator of the
>>>>> John von Neumann Institute for Computing
>>>>> and member of the
>>>>> Gauss Centre for Supercomputing
>>>>>
>>>>>
>>>>> -------------------------------------------------------------------------------------
>>>>> -------------------------------------------------------------------------------------
>>>>> Forschungszentrum Juelich GmbH
>>>>> 52425 Juelich
>>>>> Sitz der Gesellschaft: Juelich
>>>>> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
>>>>> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
>>>>> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
>>>>> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
>>>>> Prof. Dr. Sebastian M. Schmidt
>>>>>
>>>>> -------------------------------------------------------------------------------------
>>>>> -------------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 



<<attachment: b_hagemeier.vcf>>
