[jira] [Created] (MYRIAD-198) Remove optionals when sane defaults are available

2016-05-05 Thread DarinJ (JIRA)
DarinJ created MYRIAD-198:
-

 Summary: Remove optionals when sane defaults are available
 Key: MYRIAD-198
 URL: https://issues.apache.org/jira/browse/MYRIAD-198
 Project: Myriad
  Issue Type: Bug
  Components: Executor, Scheduler
Affects Versions: Myriad 0.2.0
Reporter: DarinJ
Priority: Minor


Currently we overuse Optionals in the config and then call an or method in 
various factories later.  In many cases, having the configuration return a 
default when the parameter is not specified would create cleaner code.  For 
instance:
{quote}
Optional<Boolean> getCgroups() {
  return Optional.fromNullable(cgroups);
}
{quote}
vs
{quote}
Boolean getCgroups() {
  return cgroups != null ? cgroups : false;
}
{quote}
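
To illustrate the difference at the call site, here is a minimal sketch. It uses java.util.Optional as a stand-in for the Guava Optional in the snippets above (so orElse plays the role of or). The class names OptionalStyleConfig, DefaultStyleConfig, and CgroupsDefaultDemo are hypothetical and are not Myriad's actual configuration classes.

```java
import java.util.Optional;

// Hypothetical config: every caller must supply the fallback itself.
class OptionalStyleConfig {
  private final Boolean cgroups; // null when the setting is absent

  OptionalStyleConfig(Boolean cgroups) {
    this.cgroups = cgroups;
  }

  Optional<Boolean> getCgroups() {
    return Optional.ofNullable(cgroups);
  }
}

// Hypothetical config: the default lives in one place, inside the getter.
class DefaultStyleConfig {
  private final Boolean cgroups; // null when the setting is absent

  DefaultStyleConfig(Boolean cgroups) {
    this.cgroups = cgroups;
  }

  boolean getCgroups() {
    return cgroups != null ? cgroups : false;
  }
}

class CgroupsDefaultDemo {
  public static void main(String[] args) {
    // Optional style: each factory repeats the fallback value.
    boolean viaOptional = new OptionalStyleConfig(null).getCgroups().orElse(false);
    // Default style: the call site needs no fallback.
    boolean viaDefault = new DefaultStyleConfig(null).getCgroups();
    System.out.println(viaOptional + " " + viaDefault);
  }
}
```

With the second style, a change to the default only touches the getter rather than every factory that reads the value.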



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MYRIAD-199) Refactor NodeCapacityManager to allow additional strategies for scale-up.

2016-05-05 Thread DarinJ (JIRA)
DarinJ created MYRIAD-199:
-

 Summary: Refactor NodeCapacityManager to allow additional 
strategies for scale-up.
 Key: MYRIAD-199
 URL: https://issues.apache.org/jira/browse/MYRIAD-199
 Project: Myriad
  Issue Type: Bug
  Components: Scheduler
Reporter: DarinJ


Currently the NodeCapacityManager schedules tasks with one strategy.  There are 
several other strategies (such as M/M/C queues) which would be ideal for 
different workloads.  We should refactor NodeCapacityManager so we can provide 
other strategies and allow them to be configured at runtime.  
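
One way such a refactoring might look is a strategy interface whose implementation is selected by name from configuration. The sketch below is purely illustrative: ScaleUpStrategy, FixedStepStrategy, BacklogProportionalStrategy, ScaleUpStrategies, and the strategy names are hypothetical and do not reflect the actual NodeCapacityManager API. It also assumes containersPerNode is positive.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical strategy interface: how many nodes to add for the pending work.
interface ScaleUpStrategy {
  int nodesToAdd(int pendingContainers, int containersPerNode);
}

// Adds a single node whenever there is any pending work.
class FixedStepStrategy implements ScaleUpStrategy {
  public int nodesToAdd(int pendingContainers, int containersPerNode) {
    return pendingContainers > 0 ? 1 : 0;
  }
}

// Adds enough nodes to cover the whole backlog (ceiling division).
class BacklogProportionalStrategy implements ScaleUpStrategy {
  public int nodesToAdd(int pendingContainers, int containersPerNode) {
    return (pendingContainers + containersPerNode - 1) / containersPerNode;
  }
}

class ScaleUpStrategies {
  private static final Map<String, ScaleUpStrategy> REGISTRY = new HashMap<>();

  static {
    REGISTRY.put("fixed-step", new FixedStepStrategy());
    REGISTRY.put("backlog-proportional", new BacklogProportionalStrategy());
  }

  // Looks up a strategy by the name given in configuration.
  static ScaleUpStrategy forName(String name) {
    ScaleUpStrategy strategy = REGISTRY.get(name);
    if (strategy == null) {
      throw new IllegalArgumentException("Unknown scale-up strategy: " + name);
    }
    return strategy;
  }
}
```

A caller would then write something like ScaleUpStrategies.forName(configuredName).nodesToAdd(pending, perNode), and a queueing-based strategy could be registered alongside the two shown here without touching the caller.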





[jira] [Created] (MYRIAD-200) Increased Unit Test and Integration Testing

2016-05-05 Thread DarinJ (JIRA)
DarinJ created MYRIAD-200:
-

 Summary: Increased Unit Test and Integration Testing
 Key: MYRIAD-200
 URL: https://issues.apache.org/jira/browse/MYRIAD-200
 Project: Myriad
  Issue Type: Bug
  Components: Executor, Scheduler
Reporter: DarinJ


Currently unit test coverage is weak in places.  A good integration test 
framework would also be helpful (potentially [minimesos|http://minimesos.org]?).





Re: cgroups suggestions

2016-05-05 Thread Darin Johnson
It turns out everything works exactly as expected if you set the permissions
of $CGROUP_ROOT/mesos/$TASKID/ appropriately, so that the yarn user can write
to the hierarchy.
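
A quick way to check this precondition from Java is sketched below. The class CgroupPathCheck and its method names are hypothetical helpers, not part of Myriad or YARN; the caller is assumed to pass in whatever $CGROUP_ROOT and $TASKID are on the node, and the check has to run as the yarn user to be meaningful.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for checking the task's cgroup directory.
class CgroupPathCheck {
  // Builds $CGROUP_ROOT/mesos/$TASKID from the two values supplied by the caller.
  static Path taskCgroup(String cgroupRoot, String taskId) {
    return Paths.get(cgroupRoot, "mesos", taskId);
  }

  // True when the directory exists and the current user can write to it.
  static boolean writableByCurrentUser(Path dir) {
    return Files.isDirectory(dir) && Files.isWritable(dir);
  }
}
```

If writableByCurrentUser returns false for the task's directory, the permissions on that directory still need to be adjusted before the node manager can create its own entries under it.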

I spent a while running through the container-executor code.  When it
mounts a cgroup subsystem, it changes the ownership of the hierarchy to the
yarn user.  The original cgroups code in Myriad attempted to do something
similar by chmoding the directory, but it assumed the yarn user would be a
member of group root.  Also, when that code was written the chmod ran as
root; currently it is ineffective, as the standard framework user does not
necessarily have permission to modify $CGROUP_ROOT/mesos/$TASKID.  However,
we have a mechanism for using a frameworksuperuser, which can do this (my
current hack).

The current code also sets
yarn.nodemanager.linux-container-executor.cgroups.mount-path=/sys/fs/cgroup
and yarn.nodemanager.linux-container-executor.cgroups.mount=true; the
documentation then requires edits to yarn-site.xml to get these passed
through.

Now that I've got things working, I'll start cleaning up the original code
to provide a cleaner setup and adjust the documentation as necessary.  I
should have a PR soon.


[jira] [Commented] (MYRIAD-192) Better Support Cgroups

2016-05-05 Thread DarinJ (JIRA)

[ 
https://issues.apache.org/jira/browse/MYRIAD-192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273612#comment-15273612
 ] 

DarinJ commented on MYRIAD-192:
---

https://github.com/apache/incubator-myriad/pull/69

> Better Support Cgroups
> --
>
> Key: MYRIAD-192
> URL: https://issues.apache.org/jira/browse/MYRIAD-192
> Project: Myriad
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: Myriad 0.1.0
>Reporter: DarinJ
> Fix For: Myriad 0.2.0, Myriad 0.1.1
>
>
> Currently many of the options for cgroups are hard coded into Myriad.  These 
> should be configurable.  In addition we should no longer chown the sandbox 
> directory to yarn in `DownloadNMExecutorCLGenImpl.java`.





Re: NM does not start with cgroups enabled

2016-05-05 Thread Darin Johnson
Bjorn, I don't know if you're still experimenting with Myriad, but I
believe I've got a fix for your issue.  I'm going to try to get it in our
next release, so if you have any feedback it would be great.  I verified it
on a couple small systems.

https://github.com/apache/incubator-myriad/pull/69

On Wed, Mar 23, 2016 at 8:17 AM, Darin Johnson wrote:

> Hey Bjorn, sorry for the delay.  Looking at the difference between the
> exceptions and my own experience, I believe you left some cgroup configs in
> the yarn-site.xml of the node manager.
> On Mar 18, 2016 2:58 AM, "Björn Hagemeier" wrote:
>
>> Hi Darin,
>>
>> thanks a lot for this. But what about the other case below, when cgroups
>> is disabled?
>>
>>
>> Björn
>>
>> Am 18.03.2016 um 00:25 schrieb Darin Johnson:
>> > Hey Bjorn,
>> >
>> > I think I figured out the issue.  Some of the values for cgroups are still
>> > hardcoded in myriad.  I'll add a JIRA ticket; hopefully we can get an update
>> > for 0.2.0.  I'll also respond to this thread after a pull request is
>> > submitted in case you'd like to test it.
>> >
>> > Darin
>> > Hi all,
>> >
>> > I have trouble starting the NM on the slave nodes. Apparently, it does
>> > not find its configuration, or something is wrong with the configuration.
>> >
>> > With cgroups enabled, the NM does not start; the logs contain the error
>> > below, indicating that there is something wrong in the configuration.
>> > However, yarn.nodemanager.linux-container-executor.group is set (to
>> > "yarn"). The value used to be
>> > "${yarn.nodemanager.linux-container-executor.group}" as indicated by the
>> > installation documentation, however I'm uncertain whether this recursion
>> > is the correct approach.
>> >
>> >
>> > ==
>> > 16/03/14 09:32:45 FATAL nodemanager.NodeManager: Error starting NodeManager
>> > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
>> >     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:213)
>> >     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>> >     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:474)
>> >     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:521)
>> > Caused by: java.io.IOException: Linux container executor not configured properly (error=24)
>> >     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:193)
>> >     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:211)
>> >     ... 3 more
>> > Caused by: ExitCodeException exitCode=24: Can't get configured value for yarn.nodemanager.linux-container-executor.group.
>> >
>> >     at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
>> >     at org.apache.hadoop.util.Shell.run(Shell.java:460)
>> >     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
>> >     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:187)
>> >     ... 4 more
>> > ==
>> >
>> >
>> > I have given it another try with cgroups disabled (in
>> > myriad-config-default.yml). I seem to get a little further, but I am still
>> > stuck at running Yarn jobs:
>> >
>> > ==
>> > 16/03/14 10:56:34 INFO container.Container: Container container_1457949199710_0001_01_01 transitioned from LOCALIZED to RUNNING
>> > 16/03/14 10:56:34 INFO nodemanager.DefaultContainerExecutor: launchContainer: [bash, /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/bjoernh/appcache/application_1457949199710_0001/container_1457949199710_0001_01_01/default_container_executor.sh]
>> > 16/03/14 10:56:34 WARN nodemanager.DefaultContainerExecutor: Exit code from container container_1457949199710_0001_01_01 is : 1
>> > 16/03/14 10:56:34 WARN nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1457949199710_0001_01_01 and exit code: 1
>> > ExitCodeException exitCode=1:
>> >     at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
>> >     at org.apache.hadoop.util.Shell.run(Shell.java:460)
>> >     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
>> >     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
>> >     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>> >     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)