[jira] [Commented] (AURORA-1871) Client should reject tasks with duplicate process names

2017-03-24 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940592#comment-15940592
 ] 

Joshua Cohen commented on AURORA-1871:
--

I don't think we need to do this verification in Pystachio, nor do we need to 
write a parser for the DSL (which Pystachio already is). Instead, we can just 
verify in the client after parsing the config, but before sending the job to 
the scheduler. E.g. in 
[context.py#get_job_config](https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/client/cli/context.py#L151-L172)
 we could add some additional validation logic.

One thing to keep in mind, however: we'd need to be careful that commands on 
existing jobs that may have invalid config are still possible. We should only 
reject jobs on admission.
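
A minimal sketch of the admission-time check (illustrative only; the real 
implementation would extract process names via the parsed config's actual 
accessors, which are not shown here):

{code}
from collections import Counter

def assert_unique_process_names(process_names):
  # Fail fast if any process name appears more than once.
  duplicates = sorted(
      name for name, count in Counter(process_names).items() if count > 1)
  if duplicates:
    raise ValueError('Duplicate process names: %s' % ', '.join(duplicates))
{code}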

> Client should reject tasks with duplicate process names
> ---
>
> Key: AURORA-1871
> URL: https://issues.apache.org/jira/browse/AURORA-1871
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Joshua Cohen
>
> If a user creates a job that contains tasks with the same process name, that 
> info is happily passed on to thermos, which will happily run one of those 
> processes but may display a different one in the UI. In general the 
> behavior in this case is non-deterministic and can lead to hard-to-track-down 
> bugs.
> We should just short-circuit and fail in the client if we detect multiple 
> processes with the same name.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (AURORA-1871) Client should reject tasks with duplicate process names

2017-03-24 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940592#comment-15940592
 ] 

Joshua Cohen edited comment on AURORA-1871 at 3/24/17 3:46 PM:
---

I don't think we need to do this verification in Pystachio, nor do we need to 
write a parser for the DSL (which Pystachio already is). Instead, we can just 
verify in the client after parsing the config, but before sending the job to 
the scheduler. E.g. in 
[context.py#get_job_config|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/client/cli/context.py#L151-L172]
 we could add some additional validation logic.

One thing to keep in mind, however: we'd need to be careful that commands on 
existing jobs that may have invalid config are still possible. We should only 
reject jobs on admission.


was (Author: joshua.cohen):
I don't think we need to do this verification in Pystachio, nor do we need to 
write a parser for the DSL (which Pystachio already is). Instead, we can just 
verify in the client after parsing the config, but before sending the job to 
the scheduler. E.g. in 
[context.py#get_job_config](https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/client/cli/context.py#L151-L172)
 we could add some additional validation logic.

One thing to keep in mind, however: we'd need to be careful that commands on 
existing jobs that may have invalid config are still possible. We should only 
reject jobs on admission.

> Client should reject tasks with duplicate process names
> ---
>
> Key: AURORA-1871
> URL: https://issues.apache.org/jira/browse/AURORA-1871
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Joshua Cohen
>
> If a user creates a job that contains tasks with the same process name, that 
> info is happily passed on to thermos, which will happily run one of those 
> processes but may display a different one in the UI. In general the 
> behavior in this case is non-deterministic and can lead to hard-to-track-down 
> bugs.
> We should just short-circuit and fail in the client if we detect multiple 
> processes with the same name.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1889) Thermos Observer UI does not account for disk usage outside of sandbox

2017-02-08 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1889:
-
Labels: newbie  (was: )

> Thermos Observer UI does not account for disk usage outside of sandbox
> --
>
> Key: AURORA-1889
> URL: https://issues.apache.org/jira/browse/AURORA-1889
> Project: Aurora
>  Issue Type: Task
>  Components: Observer
>Reporter: Joshua Cohen
>Priority: Minor
>  Labels: newbie
>
> When, e.g., /tmp isolation is enabled in Mesos, a tmp directory is created as 
> a peer of the thermos sandbox directory. Utilization in this directory is not 
> accounted for by the Observer.
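
As an illustration of the gap, a naive sketch that accounts for peer 
directories by measuring the sandbox's parent rather than the sandbox itself 
(this is not the Observer's actual code; the layout is as described above):

{code}
import os

def disk_usage(path):
  # Sum file sizes under path, tolerating files that vanish mid-walk.
  total = 0
  for root, _, files in os.walk(path):
    for name in files:
      try:
        total += os.lstat(os.path.join(root, name)).st_size
      except OSError:
        pass
  return total

def container_disk_usage(sandbox_path):
  # Measuring the parent captures the sandbox and its peers (e.g. tmp).
  return disk_usage(os.path.dirname(sandbox_path))
{code}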



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1889) Thermos Observer UI does not account for disk usage outside of sandbox

2017-02-08 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1889:


 Summary: Thermos Observer UI does not account for disk usage 
outside of sandbox
 Key: AURORA-1889
 URL: https://issues.apache.org/jira/browse/AURORA-1889
 Project: Aurora
  Issue Type: Task
  Components: Observer
Reporter: Joshua Cohen
Priority: Minor


When, e.g., /tmp isolation is enabled in Mesos, a tmp directory is created as a 
peer of the thermos sandbox directory. Utilization in this directory is not 
accounted for by the Observer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1883) Support access control on arbitrary constraints

2017-01-26 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1883:


 Summary: Support access control on arbitrary constraints
 Key: AURORA-1883
 URL: https://issues.apache.org/jira/browse/AURORA-1883
 Project: Aurora
  Issue Type: Task
  Components: Scheduler
Reporter: Joshua Cohen
Priority: Minor


We currently have support for enforcing role-based access control for dedicated 
constraints. I'd propose that broader support for access control on constraints 
would be useful. In my specific case, given the heterogeneous nature of hardware 
in our Mesos clusters, I'd like to allow users to constrain tasks to run on 
specific hardware platforms. However, if this were broadly available, there 
would be nothing to stop all users from trying to run exclusively on the newest 
hardware platforms, causing contention and potentially an inability to schedule. 
I'd like to see us add support for rejecting tasks with certain constraints 
unless the authenticated user belongs to a role that has been granted access to 
that constraint.

This is aspirational; we can work around it for the time being with a 
dedicated cluster, but that comes with operational overhead (e.g. ensuring the 
makeup of that cluster matches the makeup of the shared cluster), hence this 
ticket as a longer-term, generalized mechanism to solve the problem.
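
To make the proposal concrete, a rough sketch of admission-time enforcement 
(the constraint names, roles, and lookup table are all hypothetical):

{code}
# Hypothetical policy: constraints listed here may only be used by the
# roles granted access; anything unlisted remains open to everyone.
RESTRICTED_CONSTRAINTS = {
  'platform': frozenset(['infra', 'db']),
}

def check_constraint_access(constraints, role):
  for name in constraints:
    allowed = RESTRICTED_CONSTRAINTS.get(name)
    if allowed is not None and role not in allowed:
      raise PermissionError(
          'Role %s is not permitted to use constraint %s' % (role, name))
{code}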



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1781) Sandbox taskfs setup fails (groupadd error)

2017-01-18 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828452#comment-15828452
 ] 

Joshua Cohen commented on AURORA-1781:
--

I don't have any insight into the root cause here. Without being able to 
reproduce, it's hard to diagnose.

That said, given that there's a workaround in using the {{--no-create-user}} 
flag to the executor, I don't think this should block the 0.17.0 release.

> Sandbox taskfs setup fails (groupadd error)
> ---
>
> Key: AURORA-1781
> URL: https://issues.apache.org/jira/browse/AURORA-1781
> Project: Aurora
>  Issue Type: Bug
>Affects Versions: 0.16.0
>Reporter: Justin Venus
> Fix For: 0.17.0
>
>
> I hit what smells like a permission issue w/ `/etc/group` when trying to use 
> a docker-image (unified containerizer setup) with mesos-1.0.0 and 
> aurora-0.16.0-rc2.  I cannot reproduce the issue w/ mesos-0.28.2 and aurora-0.15.0.
> {code}
> Failed to initialize sandbox: Failed to create group in sandbox for task 
> image: Command '['groupadd', '-R', 
> '/var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs',
>  '-g', '99', 'nobody']' returned non-zero exit status 10
> {code}
> {code}
> [root@mesos-master01of2 taskfs]# pwd
> /var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs
> [root@mesos-master01of2 taskfs]# groupadd -R $PWD -g 99 nobody
> groupadd: cannot lock /etc/group; try again later.
> {code}
> Maybe related to AURORA-1761
> I'm running CoreOS with the mesos-agent (and thermos) inside docker.  Here is 
> the gist of how it's started.
> {code}
> /usr/bin/sh -c "exec /usr/bin/docker run \
> --name=mesos_slave \
> --net=host \
> --pid=host \
> --privileged \
> -v /sys:/sys \
> -v /usr/bin/docker:/usr/bin/docker:ro \
> -v /var/lib/docker:/var/lib/docker \
> -v /var/run/docker.sock:/root/docker.sock \
> -v /run/systemd/system:/run/systemd/system \
> -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
> -v /sys/fs/cgroup:/sys/fs/cgroup \
> -v /var/lib/mesos:/var/lib/mesos \
> -e MESOS_CONTAINERIZERS=docker,mesos \
> -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins \
> -e MESOS_WORK_DIR=/var/lib/mesos \
> -e MESOS_LOGGING_LEVEL=INFO \
> -e AMAZON_REGION=us-office-2 \
> -e AVAILABILITY_ZONE=us-office-2b \
> -e MESOS_ATTRIBUTES=\"platform:linux;host:$(hostname);rack:us-office-2b\" 
> \
> -e MESOS_CLUSTER=ZeroZero \
> -e MESOS_DOCKER_SOCKET=/root/docker.sock \
> -e 
> MESOS_MASTER=zk://10.150.150.224:2181,10.150.150.225:2181,10.150.150.226:2181/mesos
>  \
> -e MESOS_LOG_DIR=/var/log/mesos \
> -e 
> MESOS_ISOLATION=\"filesystem/linux,cgroups/cpu,cgroups/mem,docker/runtime\" \
> -e MESOS_IMAGE_PROVIDERS=docker \
> -e MESOS_IMAGE_PROVISIONER_BACKEND=copy \
> -e MESOS_DOCKER_REGISTRY=http://docker-registry:31000 \
> -e MESOS_DOCKER_STORE_DIR=/var/lib/mesos/docker \
> --entrypoint=/usr/sbin/mesos-slave \
> docker-registry.thebrighttag.com:31000/mesos:latest \
> --no-systemd_enable_support \
> || rm -f /var/lib/mesos/meta/slaves/latest"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1878) Increased executor logs can lead to tasks running out of disk space

2017-01-11 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819359#comment-15819359
 ] 

Joshua Cohen commented on AURORA-1878:
--

https://reviews.apache.org/r/55434/

> Increased executor logs can lead to tasks running out of disk space
> 
>
> Key: AURORA-1878
> URL: https://issues.apache.org/jira/browse/AURORA-1878
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
>
> After the health check for updates patch, this log statement is being emitted 
> once every 500ms: 
> https://github.com/apache/aurora/commit/2992c8b4#diff-6d60c873330419a828fb992f46d53372R121
> This is due to this 
> [code|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/status_checker.py#L120-L124]:
> {code}
> if status_result is not None:
>   log.info('%s reported %s' % (status_checker.__class__.__name__, 
> status_result))
> {code}
> Previously, {{status_result}} would be {{None}} unless the status checker had 
> a terminal event. Now, {{status_result}} will always be set, but we only 
> consider the {{status_result}} to be terminal if the {{status}} is not 
> {{TASK_STARTING}} or {{TASK_RUNNING}}. So, for the healthy case, we log that 
> the task is {{TASK_RUNNING}} every 500ms.
> !https://frinkiac.com/meme/S10E02/818984.jpg?b64lines=IFRISVMgV0lMTCBTT1VORCBFVkVSWQogVEhSRUUgU0VDT05EUyBVTkxFU1MKIFNPTUVUSElORyBJU04nVCBPS0FZIQ==!
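
One possible shape of the fix (a sketch of the idea, not the patch in the 
review above): remember the last reported result and only log when it changes.

{code}
# Sketch only: de-duplicate status logging so a steady TASK_RUNNING does
# not emit a line every 500ms. '_last_logged' is a hypothetical field.
if status_result is not None and status_result != self._last_logged:
  log.info('%s reported %s' % (status_checker.__class__.__name__,
                               status_result))
  self._last_logged = status_result
{code}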



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1878) Increased executor logs can lead to tasks running out of disk space

2017-01-11 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen reassigned AURORA-1878:


Assignee: Joshua Cohen

> Increased executor logs can lead to tasks running out of disk space
> 
>
> Key: AURORA-1878
> URL: https://issues.apache.org/jira/browse/AURORA-1878
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
>
> After the health check for updates patch, this log statement is being emitted 
> once every 500ms: 
> https://github.com/apache/aurora/commit/2992c8b4#diff-6d60c873330419a828fb992f46d53372R121
> This is due to this 
> [code|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/status_checker.py#L120-L124]:
> {code}
> if status_result is not None:
>   log.info('%s reported %s' % (status_checker.__class__.__name__, 
> status_result))
> {code}
> Previously, {{status_result}} would be {{None}} unless the status checker had 
> a terminal event. Now, {{status_result}} will always be set, but we only 
> consider the {{status_result}} to be terminal if the {{status}} is not 
> {{TASK_STARTING}} or {{TASK_RUNNING}}. So, for the healthy case, we log that 
> the task is {{TASK_RUNNING}} every 500ms.
> !https://frinkiac.com/meme/S10E02/818984.jpg?b64lines=IFRISVMgV0lMTCBTT1VORCBFVkVSWQogVEhSRUUgU0VDT05EUyBVTkxFU1MKIFNPTUVUSElORyBJU04nVCBPS0FZIQ==!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1878) Increased executor logs can lead to tasks running out of disk space

2017-01-11 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1878:


 Summary: Increased executor logs can lead to tasks running out of 
disk space
 Key: AURORA-1878
 URL: https://issues.apache.org/jira/browse/AURORA-1878
 Project: Aurora
  Issue Type: Task
  Components: Executor
Reporter: Joshua Cohen


After the health check for updates patch, this log statement is being emitted 
once every 500ms: 
https://github.com/apache/aurora/commit/2992c8b4#diff-6d60c873330419a828fb992f46d53372R121

This is due to this 
[code|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/status_checker.py#L120-L124]:

{code}
if status_result is not None:
  log.info('%s reported %s' % (status_checker.__class__.__name__, 
status_result))
{code}

Previously, {{status_result}} would be {{None}} unless the status checker had a 
terminal event. Now, {{status_result}} will always be set, but we only consider 
the {{status_result}} to be terminal if the {{status}} is not {{TASK_STARTING}} 
or {{TASK_RUNNING}}. So, for the healthy case, we log that the task is 
{{TASK_RUNNING}} every 500ms.

!https://frinkiac.com/meme/S10E02/818984.jpg?b64lines=IFRISVMgV0lMTCBTT1VORCBFVkVSWQogVEhSRUUgU0VDT05EUyBVTkxFU1MKIFNPTUVUSElORyBJU04nVCBPS0FZIQ==!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1877) Adopt CONTAINER_ATTACH and CONTAINER_EXEC for aurora task ssh.

2017-01-10 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1877:


 Summary: Adopt CONTAINER_ATTACH and CONTAINER_EXEC for aurora task 
ssh.
 Key: AURORA-1877
 URL: https://issues.apache.org/jira/browse/AURORA-1877
 Project: Aurora
  Issue Type: Task
Reporter: Joshua Cohen


Mesos epic is here: https://issues.apache.org/jira/browse/MESOS-6460

These APIs will allow {{aurora task ssh}} to enter a container's namespaces 
(both mount and network), greatly improving users' ability to troubleshoot 
their own tasks when these isolation mechanisms are enabled.
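
For reference, a rough sketch of the kind of agent API call involved 
(endpoint and payload shape are my assumptions from the MESOS-6460 work; the 
real call also negotiates streaming content types, which is elided here):

{code}
import requests

def launch_debug_session(agent_addr, container_uuid, command):
  # Ask the agent to run `command` inside the task's container namespaces.
  call = {
    'type': 'LAUNCH_NESTED_CONTAINER_SESSION',
    'launch_nested_container_session': {
      'container_id': {
        'parent': {'value': container_uuid},
        'value': 'debug-session',  # illustrative child container id
      },
      'command': {'shell': True, 'value': command},
    },
  }
  return requests.post(
      'http://%s/api/v1' % agent_addr, json=call, stream=True)
{code}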

I pinged in Mesos slack about the availability of these APIs and it seems like 
they'll land with Mesos 1.2.0, though perhaps without documentation:

{noformat}
Joshua Cohen [3:05 PM] 
@klueska what’s the status on this? 
https://issues.apache.org/jira/browse/MESOS-6460 is it still slated for 1.2.0?

Kevin Klues [3:08 PM] 
Yes. All of the API support has landed and will be included in the release. The 
CLI hasn’t been updated to consume these APIs yet though. I don’t think that 
will make it into 1.2 unfortunately. The DC/OS CLI can be used as an 
alternative though https://github.com/dcos/dcos-cli. You can configure it to 
work with standalone mesos by setting `core.mesos_master_url` in its config.

[3:09]  
@jcohen ^^

Joshua Cohen [3:10 PM] 
are the APIs documented?

[3:10]  
we’d like to build support for this in to aurora’s task ssh functionality.

Kevin Klues [3:14 PM] 
I’ve started on the documentation but haven’t finished it yet (and may not 
before the release). It will be coming soon though. If you want to get started 
right away, the best source of truth about how to consume the APIS is probably 
in the PRs against the DC/OS CLI: 
https://github.com/dcos/dcos-cli/pull/838/files and 
https://github.com/dcos/dcos-cli/pull/856/files

Joshua Cohen [3:15 PM] 
thanks @klueska!

Kevin Klues [3:17 PM] 
@jcohen Actually, its probably more easily consumable from the commits 
themselves: 
https://github.com/dcos/dcos-cli/commit/e52bc6f058bc161fb2067a179cea44b2b78d3b37
 and 
https://github.com/dcos/dcos-cli/commit/e6bcdc7efb9e630782e40354fc31c9568d98a473
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1873) CuratorServiceGroupMonitor.LOG should use its own logger name

2017-01-10 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816055#comment-15816055
 ] 

Joshua Cohen commented on AURORA-1873:
--

[~bqluan] I've added you as a contributor. You should be able to assign tickets 
to yourself now!

> CuratorServiceGroupMonitor.LOG should use its own logger name
> -
>
> Key: AURORA-1873
> URL: https://issues.apache.org/jira/browse/AURORA-1873
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Bing-Qian Luan
>Priority: Trivial
>  Labels: newbie
>
> CuratorServiceGroupMonitor.LOG should use CuratorServiceGroupMonitor.class as 
> its logger name (instead of SchedulerMain.class).
> Code:
> {code}
> private static final Logger LOG = LoggerFactory.getLogger(SchedulerMain.class);
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1871) Client should reject tasks with duplicate process names

2016-12-27 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1871:


 Summary: Client should reject tasks with duplicate process names
 Key: AURORA-1871
 URL: https://issues.apache.org/jira/browse/AURORA-1871
 Project: Aurora
  Issue Type: Task
  Components: Client
Reporter: Joshua Cohen


If a user creates a job that contains tasks with the same process name, that 
info is happily passed on to thermos, which will happily run one of those 
processes but may display a different one in the UI. In general the behavior 
in this case is non-deterministic and can lead to hard-to-track-down bugs.

We should just short-circuit and fail in the client if we detect multiple 
processes with the same name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1850) Raw StatusResult passed to the scheduler when tasks are healthy

2016-12-07 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1850:
-
Attachment: 
9ab25047-4418-4121-906c-380ded9e1962__Screen_Shot_2016-12-06_at_4.56.43_PM.png

> Raw StatusResult passed to the scheduler when tasks are healthy
> ---
>
> Key: AURORA-1850
> URL: https://issues.apache.org/jira/browse/AURORA-1850
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Joshua Cohen
>Priority: Minor
> Attachments: 
> 9ab25047-4418-4121-906c-380ded9e1962__Screen_Shot_2016-12-06_at_4.56.43_PM.png
>
>
> As part of the recent health check changes, we now pass a message to the 
> scheduler along with the RUNNING transition when the task is healthy. 
> Unfortunately it looks like this message is a stringified `StatusResult`, 
> rather than the message from the `StatusResult` (see attached screenshot for 
> details).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1827) Fix SLA percentile calculation

2016-11-23 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1827:
-
Labels: newbie sla  (was: sla)

> Fix SLA percentile calculation 
> ---
>
> Key: AURORA-1827
> URL: https://issues.apache.org/jira/browse/AURORA-1827
> Project: Aurora
>  Issue Type: Story
>Reporter: Reza Motamedi
>Priority: Trivial
>  Labels: newbie, sla
>
> The calculation of mttX (median-time-to-X) depends on the computation of 
> percentile values. The current implementation does not behave nicely with a 
> small sample size. For instance, for the sample set {50, 150}, the 
> 50th percentile is reported as 50, although 100 seems a more appropriate 
> return value.
> One solution is to modify `SlaUtil` to perform interpolation when the 
> sample size is small or when the index corresponding to a percentile value is 
> not an integer. 
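
For illustration, a percentile that linearly interpolates between the closest 
ranks returns 100 for that sample set (a sketch of the approach, not 
`SlaUtil`'s actual code):

{code}
def percentile(samples, p):
  # Linear interpolation between closest ranks: percentile([50, 150], 50)
  # evaluates to 100.0 rather than 50.
  xs = sorted(samples)
  if not xs:
    return None
  rank = (p / 100.0) * (len(xs) - 1)
  lower = int(rank)
  upper = min(lower + 1, len(xs) - 1)
  return xs[lower] + (xs[upper] - xs[lower]) * (rank - lower)
{code}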



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1780) Offers with unknown resources types to Aurora crash the scheduler

2016-11-04 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636746#comment-15636746
 ] 

Joshua Cohen commented on AURORA-1780:
--

+1 sounds like the most reasonable course of action.

> Offers with unknown resources types to Aurora crash the scheduler
> -
>
> Key: AURORA-1780
> URL: https://issues.apache.org/jira/browse/AURORA-1780
> Project: Aurora
>  Issue Type: Bug
> Environment: vagrant
>Reporter: Renan DelValle
>Assignee: Renan DelValle
>
> Taking offers from Agents which have resources that are not known to Aurora 
> causes the Scheduler to crash.
> Steps to reproduce:
> {code}
> vagrant up
> sudo service mesos-slave stop
> echo 
> "cpus(aurora-role):0.5;cpus(*):3.5;mem(aurora-role):1024;disk:2;gpus(*):4;test:200"
>  | sudo tee /etc/mesos-slave/resources
> sudo rm -f /var/lib/mesos/meta/slaves/latest
> sudo service mesos-slave start
> {code}
> Wait a few moments for the offer to be made to Aurora
> {code}
> I0922 02:41:57.839 [Thread-19, MesosSchedulerImpl:142] Received notification 
> of lost agent: value: "cadaf569-171d-42fc-a417-fbd608ea5bab-S0"
> I0922 02:42:30.585597  2999 log.cpp:577] Attempting to append 109 bytes to 
> the log
> I0922 02:42:30.585654  2999 coordinator.cpp:348] Coordinator attempting to 
> write APPEND action at position 4
> I0922 02:42:30.585747  2999 replica.cpp:537] Replica received write request 
> for position 4 from (10)@192.168.33.7:8083
> I0922 02:42:30.586858  2999 leveldb.cpp:341] Persisting action (125 bytes) to 
> leveldb took 1.086601ms
> I0922 02:42:30.586897  2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587020  2999 replica.cpp:691] Replica received learned notice 
> for position 4 from @0.0.0.0:0
> I0922 02:42:30.587785  2999 leveldb.cpp:341] Persisting action (127 bytes) to 
> leveldb took 746999ns
> I0922 02:42:30.587805  2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587811  2999 replica.cpp:697] Replica learned APPEND action at 
> position 4
> I0922 02:42:30.601 [SchedulerImpl-0, OfferManager$OfferManagerImpl:185] 
> Returning offers for cadaf569-171d-42fc-a417-fbd608ea5bab-S1 for compaction.
> Sep 22, 2016 2:42:38 AM 
> com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> SEVERE: Service SlotSizeCounterService [FAILED] has failed in the RUNNING 
> state.
> java.lang.NullPointerException: Unknown Mesos resource: name: "test"
> type: SCALAR
> scalar {
>   value: 200.0
> }
> role: "*"
>   at java.util.Objects.requireNonNull(Objects.java:228)
>   at 
> org.apache.aurora.scheduler.resources.ResourceType.fromResource(ResourceType.java:355)
>   at 
> org.apache.aurora.scheduler.resources.ResourceManager.lambda$static$0(ResourceManager.java:52)
>   at com.google.common.collect.Iterators$7.computeNext(Iterators.java:675)
>   at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>   at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>   at java.util.Iterator.forEachRemaining(Iterator.java:115)
>   at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>   at 
> org.apache.aurora.scheduler.resources.ResourceManager.bagFromResources(ResourceManager.java:274)
>   at 
> org.apache.aurora.scheduler.resources.ResourceManager.bagFromMesosResources(ResourceManager.java:239)
>   at 
> org.apache.aurora.scheduler.stats.AsyncStatsModule$OfferAdapter.get(AsyncStatsModule.java:153)
>   at 
> org.apache.aurora.scheduler.stats.SlotSizeCounter.run(SlotSizeCounter.java:168)
>   at 
> org.apache.aurora.scheduler.stats.AsyncStatsModule$SlotSizeCounterService.runOneIteration(AsyncStatsModule.java:130)
>   at 
> com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:189)
>   at com.google.common.util.concurrent.Callables$3.run(Callables.java:100)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> 

[jira] [Updated] (AURORA-1804) interaction of job history pruning configuration parameters should be documented

2016-10-26 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1804:
-
Component/s: Documentation

> interaction of job history pruning configuration parameters should be 
> documented
> 
>
> Key: AURORA-1804
> URL: https://issues.apache.org/jira/browse/AURORA-1804
> Project: Aurora
>  Issue Type: Task
>  Components: Documentation
>Reporter: Jacob Scott
>
> Pruning looks to be controlled by
> history_max_per_job_threshold
> history_min_retention_threshold
> history_prune_threshold
> How these interact -- e.g., which has priority, or how the pruning algorithm 
> works in general -- should be documented.
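
For context, these are scheduler command-line flags; the values below are 
examples only, not recommendations:

{code}
-history_prune_threshold=2days
-history_max_per_job_threshold=100
-history_min_retention_threshold=1hours
{code}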



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1803) Include reference to relevant configuration options in client commands documentation

2016-10-26 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1803:
-
Component/s: Documentation

> Include reference to relevant configuration options in client commands 
> documentation
> 
>
> Key: AURORA-1803
> URL: https://issues.apache.org/jira/browse/AURORA-1803
> Project: Aurora
>  Issue Type: Task
>  Components: Documentation
>Reporter: Jacob Scott
>
> The job history available when calling 
> {code}
> aurora job status
> {code}
> is controlled by scheduler configuration 
> (https://github.com/apache/aurora/blob/master/docs/reference/scheduler-configuration.md)
> The documentation 
> (http://aurora.apache.org/documentation/latest/reference/client-commands/#getting-job-status)
>   for job status should cover this behavior and mention/link to these 
> configuration options



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1798) resolv.conf is not copied when using the Mesos containerizer with a Docker image

2016-10-20 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592520#comment-15592520
 ] 

Joshua Cohen commented on AURORA-1798:
--

I unset the fix version; we set these explicitly when we cut a new release.

> resolv.conf is not copied when using the Mesos containerizer with a Docker 
> image
> 
>
> Key: AURORA-1798
> URL: https://issues.apache.org/jira/browse/AURORA-1798
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
>
> When Thermos launches a task using a Docker image it mounts the image as a 
> volume and manually chroots into it. One consequence of this is the logic 
> inside of the {{network/cni}} isolator that copies {{resolv.conf}} from the 
> host into the new rootfs is bypassed. The Thermos executor should manually 
> copy this file into the rootfs until Mesos pod support is implemented.
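
A minimal sketch of that manual copy (paths and error handling are 
illustrative only):

{code}
import os
import shutil

def copy_resolv_conf(taskfs_root):
  # Mirror the host's DNS config into the task's rootfs before chrooting.
  destination = os.path.join(taskfs_root, 'etc', 'resolv.conf')
  if not os.path.isdir(os.path.dirname(destination)):
    os.makedirs(os.path.dirname(destination))
  shutil.copy('/etc/resolv.conf', destination)
{code}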



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1797) Add full support for ACI containers

2016-10-18 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585602#comment-15585602
 ] 

Joshua Cohen commented on AURORA-1797:
--

What happened here is that Mesos did not have support for fetching AppC images 
from a registry when the support was first added to Aurora. Now that Mesos 
supports AppC simple discovery, we should update Aurora as well.

> Add full support for ACI containers
> ---
>
> Key: AURORA-1797
> URL: https://issues.apache.org/jira/browse/AURORA-1797
> Project: Aurora
>  Issue Type: Story
>Reporter: Thomas Bach
>
> For {{AppcImage}} to work properly the Mesos fetcher needs the {{os}} and 
> {{arch}} labels in the image description. The relevant code for this can be 
> found here: 
> https://github.com/apache/mesos/blob/171d214afa92cce56d8e1c350d6b2968887e6f15/src/slave/containerizer/mesos/provisioner/appc/fetcher.cpp#L61
> At the moment {{AppcImage}} only supports {{name}} and {{image_id}} as 
> attributes. These are sufficient for the tests in 
> https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L427
>  to pass properly. Here the fetcher is never invoked because the directory 
> structure is laid out in such a way that Mesos finds the image in its cache.
> At the moment it is possible to work around this issue by giving Mesos the 
> additional information via the {{default_container_info}} argument.
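
For the workaround, the JSON passed via the agent's {{--default_container_info}} 
flag would carry those labels roughly like this (field names follow my reading 
of the Mesos ContainerInfo/Image protobufs; treat the exact shape as an 
assumption):

{code}
{
  "type": "MESOS",
  "mesos": {
    "image": {
      "type": "APPC",
      "appc": {
        "name": "example.com/my-app",
        "labels": {
          "labels": [
            {"key": "os", "value": "linux"},
            {"key": "arch", "value": "amd64"}
          ]
        }
      }
    }
  }
}
{code}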



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1797) Add full support for ACI containers

2016-10-18 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1797:
-
Description: 
For {{AppcImage}} to work properly the Mesos fetcher needs the {{os}} and 
{{arch}} labels in the image description. The relevant code for this can be 
found here: 
https://github.com/apache/mesos/blob/171d214afa92cce56d8e1c350d6b2968887e6f15/src/slave/containerizer/mesos/provisioner/appc/fetcher.cpp#L61

At the moment {{AppcImage}} only supports {{name}} and {{image_id}} as 
attributes. These are sufficient for the tests in 
https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L427
 to pass properly. Here the fetcher is never invoked because the directory 
structure is laid out in such a way that Mesos finds the image in its cache.

At the moment it is possible to work around this issue by giving Mesos the 
additional information via the {{default_container_info}} argument.

  was:
For {{AppcImage}} to work properly the Mesos fetcher needs the {{os}} and 
{{arch}} labels in the image description. The relevant code for this can be 
found here: 
https://github.com/apache/mesos/blob/171d214afa92cce56d8e1c350d6b2968887e6f15/src/slave/containerizer/mesos/provisioner/appc/fetcher.cpp#L61

At the moment {{AppcImage}} only supports {{name}} and {{image_id}} as 
attributes. These are sufficient for the tests in 
https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L427
 to pass properly. Here the fetcher is never invoked because the directory 
structure is laid out in such a way that Mesos finds the image in its cache.

At the moment it is possible to work around this issue by giving Mesos the 
additional information via the {{default_container_info argument}}.


> Add full support for ACI containers
> ---
>
> Key: AURORA-1797
> URL: https://issues.apache.org/jira/browse/AURORA-1797
> Project: Aurora
>  Issue Type: Story
>Reporter: Thomas Bach
>
> For {{AppcImage}} to work properly the Mesos fetcher needs the {{os}} and 
> {{arch}} labels in the image description. The relevant code for this can be 
> found here: 
> https://github.com/apache/mesos/blob/171d214afa92cce56d8e1c350d6b2968887e6f15/src/slave/containerizer/mesos/provisioner/appc/fetcher.cpp#L61
> At the moment {{AppcImage}} only supports {{name}} and {{image_id}} as 
> attributes. These are sufficient for the tests in 
> https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L427
>  to pass properly. Here the fetcher is never invoked because the directory 
> structure is laid out in such a way that Mesos finds the image in its cache.
> At the moment it is possible to work around this issue by giving Mesos the 
> additional information via the {{default_container_info}} argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-1737) Descheduling a cron job checks role access before job key existence

2016-10-12 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen resolved AURORA-1737.
--
Resolution: Cannot Reproduce

I can no longer reproduce this problem. Not sure what changed to fix it, but 
resolving as cannot reproduce. If it comes back we can re-open.

> Descheduling a cron job checks role access before job key existence
> ---
>
> Key: AURORA-1737
> URL: https://issues.apache.org/jira/browse/AURORA-1737
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Joshua Cohen
>Assignee: Jing Chen
>Priority: Minor
>
> Trying to deschedule a cron job for a non-existent role returns a permission 
> error rather than a no-such-job error. This leads to confusion for users in 
> the event of a typo in the role.
> Given that jobs are world-readable, we should check for a valid job key 
> before applying permissions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-1788) vagrant up does not properly configure network adapters

2016-10-06 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen resolved AURORA-1788.
--
Resolution: Fixed

Thanks for the patch!

> vagrant up does not properly configure network adapters
> ---
>
> Key: AURORA-1788
> URL: https://issues.apache.org/jira/browse/AURORA-1788
> Project: Aurora
>  Issue Type: Bug
>Reporter: Andrew Jorgensen
>Assignee: Andrew Jorgensen
>
> I am not sure of the specifics of why this happens, but on vagrant 1.8.6 the 
> network interface does not come up correctly and the private_network is 
> attached to the eth0 NAT interface rather than the host-only interface. I 
> tried a number of different parameters, but none of them were able to 
> configure the network appropriately. This change manually configures the 
> static IP so that it is connected to the correct adapter. Without this change 
> I could not access the Aurora web interface when running vagrant up.
> I've created a patch here: https://reviews.apache.org/r/52609/
> This is what the configuration looks like when run off master:
> {code}
> ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group 
> default
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:b3:1b:30 brd ff:ff:ff:ff:ff:ff
> inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
>valid_lft forever preferred_lft forever
> inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:feb3:1b30/64 scope link
>valid_lft forever preferred_lft forever
> 3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state 
> DOWN group default
> link/ether 08:00:27:7c:4e:72 brd ff:ff:ff:ff:ff:ff
> 4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state 
> DOWN group default
> link/ether 02:42:f6:de:a3:ca brd ff:ff:ff:ff:ff:ff
> inet 172.17.0.1/16 scope global docker0
>valid_lft forever preferred_lft forever
> {code}
> here is what it is supposed to look like:
> {code}
> ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group 
> default
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:b3:1b:30 brd ff:ff:ff:ff:ff:ff
> inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:feb3:1b30/64 scope link
>valid_lft forever preferred_lft forever
> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:7c:4e:72 brd ff:ff:ff:ff:ff:ff
> inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:fe7c:4e72/64 scope link
>valid_lft forever preferred_lft forever
> 4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state 
> DOWN group default
> link/ether 02:42:f6:de:a3:ca brd ff:ff:ff:ff:ff:ff
> inet 172.17.0.1/16 scope global docker0
>valid_lft forever preferred_lft forever
> {code}
> Steps to reproduce:
> 1. Update to vagrant 1.8.6 (unsure if previous versions are affected as well)
> 2. Run `vagrant up`
> 3. Try to visit http://192.168.33.7:8081
> Expected outcome:
> I expect that following the steps in 
> http://aurora.apache.org/documentation/latest/getting-started/vagrant/ I 
> would be able to visit the web interface for aurora.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1788) vagrant up does not properly configure network adapters

2016-10-06 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1788:
-
Assignee: Andrew Jorgensen

> vagrant up does not properly configure network adapters
> ---
>
> Key: AURORA-1788
> URL: https://issues.apache.org/jira/browse/AURORA-1788
> Project: Aurora
>  Issue Type: Bug
>Reporter: Andrew Jorgensen
>Assignee: Andrew Jorgensen
>
> I am not sure of the specifics of why this happens, but on vagrant 1.8.6 the 
> network interface does not come up correctly and the private_network is 
> attached to the eth0 NAT interface rather than the host-only interface. I 
> tried a number of different parameters, but none of them were able to 
> configure the network appropriately. This change manually configures the 
> static IP so that it is connected to the correct adapter. Without this change 
> I could not access the Aurora web interface when running vagrant up.
> I've created a patch here: https://reviews.apache.org/r/52609/
> This is what the configuration looks like when run off master:
> {code}
> ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group 
> default
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:b3:1b:30 brd ff:ff:ff:ff:ff:ff
> inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
>valid_lft forever preferred_lft forever
> inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:feb3:1b30/64 scope link
>valid_lft forever preferred_lft forever
> 3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state 
> DOWN group default
> link/ether 08:00:27:7c:4e:72 brd ff:ff:ff:ff:ff:ff
> 4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state 
> DOWN group default
> link/ether 02:42:f6:de:a3:ca brd ff:ff:ff:ff:ff:ff
> inet 172.17.0.1/16 scope global docker0
>valid_lft forever preferred_lft forever
> {code}
> here is what it is supposed to look like:
> {code}
> ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group 
> default
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:b3:1b:30 brd ff:ff:ff:ff:ff:ff
> inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:feb3:1b30/64 scope link
>valid_lft forever preferred_lft forever
> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:7c:4e:72 brd ff:ff:ff:ff:ff:ff
> inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:fe7c:4e72/64 scope link
>valid_lft forever preferred_lft forever
> 4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state 
> DOWN group default
> link/ether 02:42:f6:de:a3:ca brd ff:ff:ff:ff:ff:ff
> inet 172.17.0.1/16 scope global docker0
>valid_lft forever preferred_lft forever
> {code}
> Steps to reproduce:
> 1. Update to vagrant 1.8.6 (unsure if previous versions are affected as well)
> 2. Run `vagrant up`
> 3. Try to visit http://192.168.33.7:8081
> Expected outcome:
> I expect that following the steps in 
> http://aurora.apache.org/documentation/latest/getting-started/vagrant/ I 
> would be able to visit the web interface for aurora.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1782) Thermos Executor is not handling apostrophe gracefully

2016-10-05 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549019#comment-15549019
 ] 

Joshua Cohen commented on AURORA-1782:
--

I *am* able to reproduce with the cmdline you provided above. Will investigate 
a fix.
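
One plausible direction (an assumption on my part, not the eventual patch): 
shell-escape the user's cmdline before it is interpolated into the 
{{/bin/bash -c}} wrapper, so embedded apostrophes survive:

{code}
import pipes  # shlex.quote in Python 3

cmdline = "python -c 'import sys\nsys.exit(0)'"
# Quoting the whole command keeps the inner apostrophes intact when the
# executor builds its shell invocation.
wrapped = '/bin/bash -c %s' % pipes.quote(cmdline)
{code}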

> Thermos Executor is not handling apostrophe gracefully
> --
>
> Key: AURORA-1782
> URL: https://issues.apache.org/jira/browse/AURORA-1782
> Project: Aurora
>  Issue Type: Bug
>Affects Versions: 0.16.0
>Reporter: Justin Venus
> Attachments: process with single quotes.png
>
>
> This is a regression from the behavior in 0.15.0.
> This is enough to cause an execution to fail.
> {code}
> python -c 'import sys
> if __name__ == "__main__":
>   sys.exit(0)'
> {code}
> This is the error seen when running an inline script from a Process().
> {code}
> Failed to parse the flags: Failed to load flag 'command': Failed to load 
> value '{"shell":true,"value":"/bin/bash -c 'python -c import': syntax error 
> at line 1 near: 
> {code}
> Ideally I could escape the apostrophe and have it work for Processes that run 
> 'chrooted/pivoted' by thermos or bare.  My workaround has been to write a 
> templating hack for aurora's DSL.
> {code}
> # Template.render(name='foo', template='SOME_STRING_TEMPLATE')
> # Returns a Process().  I can't just base64 the entire template (that would 
> be too easy), 
> # b/c then I can't use {{thermos.ports[http]}}.  So I go through and replace 
> apostrophe
> # with a pattern that is unlikely to be legitimately used and render a 
> 'cmdline' that will 
> # undo the hack.  Of course using the octal or "'" directly proved 
> troublesome, so I had to 
> # base64 encode the simple sed statement and decode the script on the remote 
> side.  Like
> # I said this is a complete hack.
> class Template(object):
>   class T(Struct):
> payload = Required(String)
>   @classmethod
>   def render(cls, name=None, template=None, **kwargs):
> assert name is not None
> assert template is not None
> return cls(name, template, 'template', **kwargs).process
>   def cmdline(self):
> return ''.join([
> '(mkdir .decoder && ',
> '(test -x ./.decoder/decoder || ',
> '((echo {{template_decoder}} | base64 -d > ./.decoder/decoder) && ',
> 'chmod +x ./.decoder/decoder)) || ',
> "(while [ ! -x ./.decoder/decoder ]; do echo resolving-decoder; sleep 
> 1; done)) && ",
> '((echo "{{template_payload}}" | ./.decoder/decoder) > 
> {{template_filename}})'
> ])
>   def decoder_script(self):
> return r"""#!/bin/bash
> sed "s/__APOSTROPHE__/'/g"
> """
>   def postInit(self):
> self.process.in_scope = self.in_scope  # need to override imethod
>   def __init__(self, name, template, prefix, filename=None, auto_init=True, 
> **kwargs):
> self.resolved = False
> self.process = Process(
> name="%s:%s" % (prefix, name),
> cmdline=self.cmdline(), **kwargs).bind(
> self.__class__.T(payload=template.replace("'", '__APOSTROPHE__')),
> template_filename=name if filename is None else filename,
> template_decoder=base64.b64encode(self.decoder_script()))
> if auto_init:
>   self.postInit()
>   def in_scope(self, *args, **kwargs):
> """ensure name/commandline is resolved before proceeding"""
> if self.resolved:
>   return self.process
> self.process = Process.in_scope(self.process, *args, **kwargs)
> if '{{' not in str(self.process.name()):
>   scopes = self.process.scopes()
>   for scope in scopes:
> if not hasattr(scope, 'bind'):
>   continue
> scope = scope.bind(**kwargs)
> if scope.check() and isinstance(scope, self.__class__.T):
>   self.resolved = True
>   payload = str(scope.payload())
>   print(" INFO] Memoizing {}".format(self.process.name()))
>   self.process = self.process.bind(template_payload=payload)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1782) Thermos Executor is not handling apostrophe gracefully

2016-10-05 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1782:
-
Attachment: process with single quotes.png

> Thermos Executor is not handling apostrophe gracefully
> --
>
> Key: AURORA-1782
> URL: https://issues.apache.org/jira/browse/AURORA-1782
> Project: Aurora
>  Issue Type: Bug
>Affects Versions: 0.16.0
>Reporter: Justin Venus
> Attachments: process with single quotes.png
>
>
> This is a regression from the behavior in 0.15.0.
> This is enough to cause an execution to fail.
> {code}
> python -c 'import sys
> if __name__ == "__main__":
>   sys.exit(0)'
> {code}
> This is the error seen when running an inline script from a Process().
> {code}
> Failed to parse the flags: Failed to load flag 'command': Failed to load 
> value '{"shell":true,"value":"/bin/bash -c 'python -c import': syntax error 
> at line 1 near: 
> {code}
> Ideally I could escape the apostrophe and have it work for Processes that run 
> 'chrooted/pivoted' by thermos or bare.  My workaround has been to write a 
> templating hack for aurora's DSL.
> {code}
> # Template.render(name='foo', template='SOME_STRING_TEMPLATE')
> # Returns a Process().  I can't just base64 the entire template (that would 
> be too easy), 
> # b/c then I can't use {{thermos.ports[http]}}.  So I go through and replace 
> apostrophe
> # with a pattern that is unlikely to be legitimately used and render a 
> 'cmdline' that will 
> # undo the hack.  Of course using the octal or "'" directly proved 
> troublesome, so I had to 
> # base64 encode the simple sed statement and decode the script on the remote 
> side.  Like
> # I said this is a complete hack.
> class Template(object):
>   class T(Struct):
> payload = Required(String)
>   @classmethod
>   def render(cls, name=None, template=None, **kwargs):
> assert name is not None
> assert template is not None
> return cls(name, template, 'template', **kwargs).process
>   def cmdline(self):
> return ''.join([
> '(mkdir .decoder && ',
> '(test -x ./.decoder/decoder || ',
> '((echo {{template_decoder}} | base64 -d > ./.decoder/decoder) && ',
> 'chmod +x ./.decoder/decoder)) || ',
> "(while [ ! -x ./.decoder/decoder ]; do echo resolving-decoder; sleep 
> 1; done)) && ",
> '((echo "{{template_payload}}" | ./.decoder/decoder) > 
> {{template_filename}})'
> ])
>   def decoder_script(self):
> return r"""#!/bin/bash
> sed "s/__APOSTROPHE__/'/g"
> """
>   def postInit(self):
> self.process.in_scope = self.in_scope  # need to override imethod
>   def __init__(self, name, template, prefix, filename=None, auto_init=True, 
> **kwargs):
> self.resolved = False
> self.process = Process(
> name="%s:%s" % (prefix, name),
> cmdline=self.cmdline(), **kwargs).bind(
> self.__class__.T(payload=template.replace("'", '__APOSTROPHE__')),
> template_filename=name if filename is None else filename,
> template_decoder=base64.b64encode(self.decoder_script()))
> if auto_init:
>   self.postInit()
>   def in_scope(self, *args, **kwargs):
> """ensure name/commandline is resolved before proceeding"""
> if self.resolved:
>   return self.process
> self.process = Process.in_scope(self.process, *args, **kwargs)
> if '{{' not in str(self.process.name()):
>   scopes = self.process.scopes()
>   for scope in scopes:
> if not hasattr(scope, 'bind'):
>   continue
> scope = scope.bind(**kwargs)
> if scope.check() and isinstance(scope, self.__class__.T):
>   self.resolved = True
>   payload = str(scope.payload())
>   print(" INFO] Memoizing {}".format(self.process.name()))
>   self.process = self.process.bind(template_payload=payload)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1782) Thermos Executor is not handling apostrophe gracefully

2016-10-05 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549008#comment-15549008
 ] 

Joshua Cohen edited comment on AURORA-1782 at 10/5/16 3:02 PM:
---

I just tried to replicate this locally and I was unable to do so. I made the 
following change to the {{setup_env}} process in the [job used by our end to end 
tests|https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora#L45-L50]:

{noformat}
diff --git a/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora 
b/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora
index c71fb81..3b75948 100644
--- a/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora
+++ b/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora
@@ -45,7 +45,7 @@ stage_server = Process(
 setup_env = Process(
   name = 'setup_env',
  cmdline='''cat <<EOF > .thermos_profile
-export IT_WORKED=hello
+export IT_WORKED='hello'
 EOF'''
 )
{noformat}

And I was able to launch the job successfully (see attached screenshot). I'll 
need to dig more to see what's causing your issue.




was (Author: joshua.cohen):
I just tried to replicate this locally and I was unable to do so. I made the 
following change to the {{setup_env}} process in the [job used by our end to end 
tests](https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora#L45-L50):

{noformat}
diff --git a/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora 
b/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora
index c71fb81..3b75948 100644
--- a/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora
+++ b/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora
@@ -45,7 +45,7 @@ stage_server = Process(
 setup_env = Process(
   name = 'setup_env',
  cmdline='''cat <<EOF > .thermos_profile
-export IT_WORKED=hello
+export IT_WORKED='hello'
 EOF'''
 )
{noformat}

And I was able to launch the job successfully (see attached screenshot). I'll 
need to dig more to see what's causing your issue.



> Thermos Executor is not handling apostrophe gracefully
> --
>
> Key: AURORA-1782
> URL: https://issues.apache.org/jira/browse/AURORA-1782
> Project: Aurora
>  Issue Type: Bug
>Affects Versions: 0.16.0
>Reporter: Justin Venus
> Attachments: process with single quotes.png
>
>
> This is a regression from the behavior in 0.15.0.
> This is enough to cause an execution to fail.
> {code}
> python -c 'import sys
> if __name__ == "__main__":
>   sys.exit(0)'
> {code}
> This is the error seen when running an inline script from a Process().
> {code}
> Failed to parse the flags: Failed to load flag 'command': Failed to load 
> value '{"shell":true,"value":"/bin/bash -c 'python -c import': syntax error 
> at line 1 near: 
> {code}
> Ideally I could escape the apostrophe and have it work for Processes that run 
> 'chrooted/pivoted' by thermos or bare.  My workaround has been to write a 
> templating hack for aurora's DSL.
> {code}
> # Template.render(name='foo', template='SOME_STRING_TEMPLATE')
> # Returns a Process().  I can't just base64 the entire template (that would 
> be too easy), 
> # b/c then I can't use {{thermos.ports[http]}}.  So I go through and replace 
> apostrophe
> # with a pattern that is unlikely to be legitimately used and render a 
> 'cmdline' that will 
> # undo the hack.  Of course using the octal or "'" directly proved 
> troublesome, so I had to 
> # base64 encode the simple sed statement and decode the script on the remote 
> side.  Like
> # I said this is a complete hack.
> class Template(object):
>   class T(Struct):
> payload = Required(String)
>   @classmethod
>   def render(cls, name=None, template=None, **kwargs):
> assert name is not None
> assert template is not None
> return cls(name, template, 'template', **kwargs).process
>   def cmdline(self):
> return ''.join([
> '(mkdir .decoder && ',
> '(test -x ./.decoder/decoder || ',
> '((echo {{template_decoder}} | base64 -d > ./.decoder/decoder) && ',
> 'chmod +x ./.decoder/decoder)) || ',
> "(while [ ! -x ./.decoder/decoder ]; do echo resolving-decoder; sleep 
> 1; done)) && ",
> '((echo "{{template_payload}}" | ./.decoder/decoder) > 
> {{template_filename}})'
> ])
>   def decoder_script(self):
> return r"""#!/bin/bash
> sed "s/__APOSTROPHE__/'/g"
> """
>   def postInit(self):
> self.process.in_scope = self.in_scope  # need to override imethod
>   def __init__(self, name, template, prefix, filename=None, auto_init=True, 
> **kwargs):
> self.resolved = False
> self.process = Process(
> name="%s:%s" % (prefix, name),
> cmdline=self.cmdline(), **kwargs).bind(
> 

[jira] [Commented] (AURORA-1782) Thermos Executor is not handling apostrophe gracefully

2016-10-05 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548939#comment-15548939
 ] 

Joshua Cohen commented on AURORA-1782:
--

Yes, I agree that this is definitely a major problem. There should be no 
changes required to your {{Process}} definitions to run them within a 
filesystem image. I'm unsure at this time what the fix is in the executor, but 
I'm fairly confident it can be fixed.
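
To make the failure mode concrete, here's a minimal sketch (it assumes nothing 
about the executor internals; {{shlex.quote}}, or {{pipes.quote}} on Python 2, 
is just one standard escaping mechanism):

{code}
# Naive single-quote wrapping breaks as soon as the cmdline itself
# contains an apostrophe; shlex.quote escapes it safely.
import shlex

cmdline = """python -c 'import sys
if __name__ == "__main__":
  sys.exit(0)'"""

broken = "/bin/bash -c '%s'" % cmdline           # inner quotes end the string early
safe = "/bin/bash -c %s" % shlex.quote(cmdline)  # correctly escaped
print(safe)
{code}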

> Thermos Executor is not handling apostrophe gracefully
> --
>
> Key: AURORA-1782
> URL: https://issues.apache.org/jira/browse/AURORA-1782
> Project: Aurora
>  Issue Type: Bug
>Affects Versions: 0.16.0
>Reporter: Justin Venus
>
> This is a regression from the behavior in 0.15.0.
> This is enough to cause an execution to fail.
> {code}
> python -c 'import sys
> if __name__ == "__main__":
>   sys.exit(0)'
> {code}
> This is the error seen when running an inline script from a Process().
> {code}
> Failed to parse the flags: Failed to load flag 'command': Failed to load 
> value '{"shell":true,"value":"/bin/bash -c 'python -c import': syntax error 
> at line 1 near: 
> {code}
> Ideally I could escape the apostrophe and have it work for Processes that run 
> 'chrooted/pivoted' by thermos or bare.  My workaround has been to write a 
> templating hack for Aurora's DSL.
> {code}
> # Template.render(name='foo', template='SOME_STRING_TEMPLATE')
> # Returns a Process().  I can't just base64 the entire template (that would 
> be too easy), 
> # b/c then I can't use {{thermos.ports[http]}}.  So I go through and replace 
> apostrophe
> # with a pattern that is unlikely to be legitimately used and render a 
> 'cmdline' that will 
> # undo the hack.  Of course using the octal or "'" directly proved 
> troublesome, so I had to 
> # base64 encode the simple sed statement and decode the script on the remote 
> side.  Like
> # I said this is a complete hack.
> class Template(object):
>   class T(Struct):
> payload = Required(String)
>   @classmethod
>   def render(cls, name=None, template=None, **kwargs):
> assert name is not None
> assert template is not None
> return cls(name, template, 'template', **kwargs).process
>   def cmdline(self):
> return ''.join([
> '(mkdir .decoder && ',
> '(test -x ./.decoder/decoder || ',
> '((echo {{template_decoder}} | base64 -d > ./.decoder/decoder) && ',
> 'chmod +x ./.decoder/decoder)) || ',
> "(while [ ! -x ./.decoder/decoder ]; do echo resolving-decoder; sleep 
> 1; done)) && ",
> '((echo "{{template_payload}}" | ./.decoder/decoder) > 
> {{template_filename}})'
> ])
>   def decoder_script(self):
> return r"""#!/bin/bash
> sed "s/__APOSTROPHE__/'/g"
> """
>   def postInit(self):
> self.process.in_scope = self.in_scope  # need to override imethod
>   def __init__(self, name, template, prefix, filename=None, auto_init=True, 
> **kwargs):
> self.resolved = False
> self.process = Process(
> name="%s:%s" % (prefix, name),
> cmdline=self.cmdline(), **kwargs).bind(
> self.__class__.T(payload=template.replace("'", '__APOSTROPHE__')),
> template_filename=name if filename is None else filename,
> template_decoder=base64.b64encode(self.decoder_script()))
> if auto_init:
>   self.postInit()
>   def in_scope(self, *args, **kwargs):
> """ensure name/commandline is resolved before proceeding"""
> if self.resolved:
>   return self.process
> self.process = Process.in_scope(self.process, *args, **kwargs)
> if '{{' not in str(self.process.name()):
>   scopes = self.process.scopes()
>   for scope in scopes:
> if not hasattr(scope, 'bind'):
>   continue
> scope = scope.bind(**kwargs)
> if scope.check() and isinstance(scope, self.__class__.T):
>   self.resolved = True
>   payload = str(scope.payload())
>   print(" INFO] Memoizing {}".format(self.process.name()))
>   self.process = self.process.bind(template_payload=payload)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1787) `-global_container_mounts` does not appear to work with the unified containerizer

2016-10-05 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548828#comment-15548828
 ] 

Joshua Cohen commented on AURORA-1787:
--

This is a limitation of the Mesos containerizer, afaik: it requires that the 
paths specified exist before mounting. I mentioned this as part of the 
description of MESOS-5229, but it would probably be worthwhile to call this out 
as its own ticket.

You should be able to work around this by making the container path in 
{{-global_container_mounts}} absolute, rather than relative. That will cause 
Mesos to bind-mount the path on top of itself into the container's namespace, 
and then the executor will happily mount that same path into the task's 
filesystem.
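
For example, instead of mounting to the relative {{rsyslog.d.container}} path 
used in the reproduction below, something like this should work (mount spec 
illustrative):

{noformat}
-global_container_mounts=/etc/rsyslog.d:/etc/rsyslog.d:ro
{noformat}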

> `-global_container_mounts` does not appear to work with the unified 
> containerizer
> -
>
> Key: AURORA-1787
> URL: https://issues.apache.org/jira/browse/AURORA-1787
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Priority: Critical
>
> Perhaps I misunderstand how this feature is supposed to be used, but apply 
> the following patch to master:
> {noformat}
> From 1ebb5f4c5815c647e31f3253d5e5c316a0d5edd2 Mon Sep 17 00:00:00 2001
> From: Zameer Manji 
> Date: Tue, 4 Oct 2016 20:45:41 -0700
> Subject: [PATCH] Reproduce the issue.
> ---
>  examples/vagrant/upstart/aurora-scheduler.conf |  2 +-
>  src/test/sh/org/apache/aurora/e2e/run-server.sh|  4 
>  .../sh/org/apache/aurora/e2e/test_end_to_end.sh| 26 
> +++---
>  3 files changed, 18 insertions(+), 14 deletions(-)
> diff --git a/examples/vagrant/upstart/aurora-scheduler.conf 
> b/examples/vagrant/upstart/aurora-scheduler.conf
> index 91b27d7..851b5a1 100644
> --- a/examples/vagrant/upstart/aurora-scheduler.conf
> +++ b/examples/vagrant/upstart/aurora-scheduler.conf
> @@ -40,7 +40,7 @@ exec bin/aurora-scheduler \
>-native_log_file_path=/var/db/aurora \
>-backup_dir=/var/lib/aurora/backups \
>-thermos_executor_path=$DIST_DIR/thermos_executor.pex \
> -  
> -global_container_mounts=/home/vagrant/aurora/examples/vagrant/config:/home/vagrant/aurora/examples/vagrant/config:ro
>  \
> +  -global_container_mounts=/etc/rsyslog.d:rsyslog.d.container:ro \
>-thermos_executor_flags="--announcer-ensemble localhost:2181 
> --announcer-zookeeper-auth-config 
> /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json 
> --mesos-containerizer-path=/usr/libexec/mesos/mesos-containerizer" \
>-allowed_container_types=MESOS,DOCKER \
>-http_authentication_mechanism=BASIC \
> diff --git a/src/test/sh/org/apache/aurora/e2e/run-server.sh 
> b/src/test/sh/org/apache/aurora/e2e/run-server.sh
> index 1fe0909..a0ee76f 100755
> --- a/src/test/sh/org/apache/aurora/e2e/run-server.sh
> +++ b/src/test/sh/org/apache/aurora/e2e/run-server.sh
> @@ -1,6 +1,10 @@
>  #!/bin/bash
>  
>  echo "Starting up server..."
> +if [ ! -d "./rsyslog.d.container" ]; then
> +  echo "Mountpoint Doesn't Exist";
> +  exit 1;
> +fi
>  while true
>  do
>echo -e "HTTP/1.1 200 OK\r\n\r\nHello from a filesystem image." | nc -l 
> "$1"
> diff --git a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
> b/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> index c93be9b..094d776 100755
> --- a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> +++ b/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> @@ -514,27 +514,27 @@ trap collect_result EXIT
>  aurorabuild all
>  setup_ssh
>  
> -test_version
> -test_http_example "${TEST_JOB_ARGS[@]}"
> -test_health_check
> +# test_version
> +# test_http_example "${TEST_JOB_ARGS[@]}"
> +# test_health_check
>  
> -test_http_example_basic "${TEST_JOB_REVOCABLE_ARGS[@]}"
> +# test_http_example_basic "${TEST_JOB_REVOCABLE_ARGS[@]}"
>  
> -test_http_example_basic "${TEST_JOB_GPU_ARGS[@]}"
> +# test_http_example_basic "${TEST_JOB_GPU_ARGS[@]}"
>  
>  # build the test docker image
> -sudo docker build -t http_example -f "${TEST_ROOT}/Dockerfile.python" 
> ${TEST_ROOT}
> -test_http_example "${TEST_JOB_DOCKER_ARGS[@]}"
> +# sudo docker build -t http_example -f "${TEST_ROOT}/Dockerfile.python" 
> ${TEST_ROOT}
> +# test_http_example "${TEST_JOB_DOCKER_ARGS[@]}"
>  
>  setup_image_stores
>  test_appc_unified
> -test_docker_unified
> +# test_docker_unified
>  
> -test_admin "${TEST_ADMIN_ARGS[@]}"
> -test_basic_auth_unauthenticated  "${TEST_JOB_ARGS[@]}"
> +# test_admin "${TEST_ADMIN_ARGS[@]}"
> +# test_basic_auth_unauthenticated  "${TEST_JOB_ARGS[@]}"
>  
> -test_ephemeral_daemon_with_final 
> "${TEST_JOB_EPHEMERAL_DAEMON_WITH_FINAL_ARGS[@]}"
> +# test_ephemeral_daemon_with_final 
> "${TEST_JOB_EPHEMERAL_DAEMON_WITH_FINAL_ARGS[@]}"
>  
> -/vagrant/src/test/sh/org/apache/aurora/e2e/test_kerberos_end_to_end.sh
> 

[jira] [Commented] (AURORA-1781) Sandbox taskfs setup fails (groupadd error)

2016-10-05 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548798#comment-15548798
 ] 

Joshua Cohen commented on AURORA-1781:
--

[~jvenus] Trying to find a common thread, do you have SELinux enabled on the 
hosts where you originally saw this problem?

> Sandbox taskfs setup fails (groupadd error)
> ---
>
> Key: AURORA-1781
> URL: https://issues.apache.org/jira/browse/AURORA-1781
> Project: Aurora
>  Issue Type: Bug
>Affects Versions: 0.16.0
>Reporter: Justin Venus
>
> I hit what smells like a permission issue w/ `/etc/group` when trying to use 
> a docker image (unified containerizer setup) with mesos-1.0.0 and 
> aurora-0.16.0-rc2.  I cannot reproduce the issue w/ mesos-0.28.2 and 
> aurora-0.15.0.
> {code}
> Failed to initialize sandbox: Failed to create group in sandbox for task 
> image: Command '['groupadd', '-R', 
> '/var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs',
>  '-g', '99', 'nobody']' returned non-zero exit status 10
> {code}
> {code}
> [root@mesos-master01of2 taskfs]# pwd
> /var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs
> [root@mesos-master01of2 taskfs]# groupadd -R $PWD -g 99 nobody
> groupadd: cannot lock /etc/group; try again later.
> {code}
> Maybe related to AURORA-1761
> I'm running CoreOS with the mesos-agent (and thermos) inside docker.  Here is 
> the gist of how it's started.
> {code}
> /usr/bin/sh -c "exec /usr/bin/docker run \
> --name=mesos_slave \
> --net=host \
> --pid=host \
> --privileged \
> -v /sys:/sys \
> -v /usr/bin/docker:/usr/bin/docker:ro \
> -v /var/lib/docker:/var/lib/docker \
> -v /var/run/docker.sock:/root/docker.sock \
> -v /run/systemd/system:/run/systemd/system \
> -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
> -v /sys/fs/cgroup:/sys/fs/cgroup \
> -v /var/lib/mesos:/var/lib/mesos \
> -e MESOS_CONTAINERIZERS=docker,mesos \
> -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins \
> -e MESOS_WORK_DIR=/var/lib/mesos \
> -e MESOS_LOGGING_LEVEL=INFO \
> -e AMAZON_REGION=us-office-2 \
> -e AVAILABILITY_ZONE=us-office-2b \
> -e MESOS_ATTRIBUTES=\"platform:linux;host:$(hostname);rack:us-office-2b\" 
> \
> -e MESOS_CLUSTER=ZeroZero \
> -e MESOS_DOCKER_SOCKET=/root/docker.sock \
> -e 
> MESOS_MASTER=zk://10.150.150.224:2181,10.150.150.225:2181,10.150.150.226:2181/mesos
>  \
> -e MESOS_LOG_DIR=/var/log/mesos \
> -e 
> MESOS_ISOLATION=\"filesystem/linux,cgroups/cpu,cgroups/mem,docker/runtime\" \
> -e MESOS_IMAGE_PROVIDERS=docker \
> -e MESOS_IMAGE_PROVISIONER_BACKEND=copy \
> -e MESOS_DOCKER_REGISTRY=http://docker-registry:31000 \
> -e MESOS_DOCKER_STORE_DIR=/var/lib/mesos/docker \
> --entrypoint=/usr/sbin/mesos-slave \
> docker-registry.thebrighttag.com:31000/mesos:latest \
> --no-systemd_enable_support \
> || rm -f /var/lib/mesos/meta/slaves/latest"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create

2016-09-28 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15531472#comment-15531472
 ] 

Joshua Cohen commented on AURORA-1014:
--

We could potentially implement this for the Docker containerizer now, and 
implement it for the Mesos containerizer when MESOS-3505 is resolved?
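
As a sketch of what the resolution step could look like (this uses the Docker 
Registry v2 manifests endpoint; all names are illustrative and the 
binding-helper plumbing in the client is omitted):

{code}
# Resolve a mutable tag to an immutable digest; the registry returns the
# digest in the Docker-Content-Digest response header.
import requests

def resolve_digest(registry, repo, tag):
    resp = requests.get(
        'https://%s/v2/%s/manifests/%s' % (registry, repo, tag),
        headers={'Accept': 'application/vnd.docker.distribution.manifest.v2+json'})
    resp.raise_for_status()
    return resp.headers['Docker-Content-Digest']

# e.g. resolve_digest('registry.example.com', 'myteam/app', 'latest')
{code}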

> Client binding_helper to resolve docker label to a stable ID at create
> --
>
> Key: AURORA-1014
> URL: https://issues.apache.org/jira/browse/AURORA-1014
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Packaging
>Reporter: Kevin Sweeney
>Assignee: Santhosh Kumar Shanmugham
>
> Follow-up from discussion on IRC:
> Some docker labels are mutable, meaning the image a task runs in could change 
> from restart to restart even if the rest of the task config doesn't change. 
> This breaks assumptions that make rolling updates the safe and preferred way 
> to deploy a new Aurora job
> Add a binding helper that resolves a docker label to an immutable image 
> identifier at create time and make it the default for the Docker helper 
> introduced in https://reviews.apache.org/r/28920/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1110) Running task ssh without an instance should pick a random instance

2016-09-23 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518052#comment-15518052
 ] 

Joshua Cohen commented on AURORA-1110:
--

I know that was my suggestion when I filed the ticket, however in retrospect 
I'm not sure it's the best approach. It's entirely possible for a task to not 
have an instance 0 (e.g. if it was explicitly killed). I'd recommend finding 
the available instances and picking one (either randomly or the lowest or 
whatever).
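
Something along these lines (illustrative only, not tied to the client's 
actual data structures):

{code}
# Pick an instance from the set of instances that actually exist,
# instead of assuming instance 0 is present.
import random

def pick_instance(active_instance_ids):
    if not active_instance_ids:
        raise ValueError('job has no active instances')
    return random.choice(sorted(active_instance_ids))
{code}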

> Running task ssh without an instance should pick a random instance
> --
>
> Key: AURORA-1110
> URL: https://issues.apache.org/jira/browse/AURORA-1110
> Project: Aurora
>  Issue Type: Story
>  Components: Client
>Reporter: Joshua Cohen
>Assignee: Jing Chen
>Priority: Trivial
>  Labels: newbie
>
> I always forget to add an instance to the end of the job key when ssh'ing. It 
> might be nice if running {{aurora task ssh ...}} without specifying an 
> instance either picked a random instance or just defaulted to instance 0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1779) BatchWorker fails with PersistenceException

2016-09-22 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1779:
-
Fix Version/s: 0.16.0

> BatchWorker fails with PersistenceException
> ---
>
> Key: AURORA-1779
> URL: https://issues.apache.org/jira/browse/AURORA-1779
> Project: Aurora
>  Issue Type: Bug
>Reporter: Stephan Erb
>Assignee: Maxim Khutornenko
>Priority: Blocker
> Fix For: 0.16.0
>
> Attachments: aurora-scheduler.log
>
>
> Steps to reproduce (if you are lucky):
> {code}
> vagrant destroy
> vagrant up
> vagrant ssh
> cd /vagrant
> ./build-support/release/verify-release-candidate 0.16.0-rc1
> {code}
> For me, this resulted in the following initial stack trace (and a couple 
> different follow ups due to a corrupted DB state):
> {code}
> I0921 19:45:38.730 [TaskEventBatchWorker, JobUpdateControllerImpl:359] 
> Forwarding task change for vagrant/test/http_example/0 
> I0921 19:45:38.744 [qtp1742582311-39, Slf4jRequestLog:60] 127.0.0.1 - - 
> [21/Sep/2016:19:45:38 +] "POST //aurora.local/api HTTP/1.1" 200 108  
> I0921 19:45:38.771 [TaskEventBatchWorker, JobUpdateControllerImpl:587] 
> IJobUpdateKey{job=IJobKey{role=vagrant, environment=test, name=http_example}, 
> id=97ff31cc-c17f-45ef-a1d9-c16cb07388c7} evaluation result: 
> EvaluationResult{status=WORKING, 
> sideEffects={0=SideEffect{action=Optional.of(ADD_TASK), statusChanges=[]}}} 
> I0921 19:45:38.786 [TaskEventBatchWorker, JobUpdateControllerImpl:654] 
> Executing side-effects for update of IJobUpdateKey{job=IJobKey{role=vagrant, 
> environment=test, name=http_example}, 
> id=97ff31cc-c17f-45ef-a1d9-c16cb07388c7}: 
> {0=SideEffect{action=Optional.of(ADD_TASK), statusChanges=[]}} 
> I0921 19:45:38.789 [TaskEventBatchWorker, InstanceActionHandler$AddTask:95] 
> Adding instance IInstanceKey{jobKey=IJobKey{role=vagrant, environment=test, 
> name=http_example}, instanceId=0} while ROLLING_FORWARD 
> I0921 19:45:38.805 [TaskEventBatchWorker, StateMachine$Builder:389] 
> vagrant-test-http_example-0-100b76ee-ed70-4e15-8e52-046f992d0b4a state 
> machine transition INIT -> PENDING 
> I0921 19:45:38.806 [TaskEventBatchWorker, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> vagrant-test-http_example-0-100b76ee-ed70-4e15-8e52-046f992d0b4a 
> I0921 19:45:38.808 [ShutdownHook, SchedulerMain:101] Stopping scheduler 
> services. 
> I0921 19:45:38.822 [TimeSeriesRepositoryImpl STOPPING, 
> TimeSeriesRepositoryImpl:168] Variable sampler shut down 
> I0921 19:45:38.839 [TearDownShutdownRegistry STOPPING, 
> ShutdownRegistry$ShutdownRegistryImpl:77] Executing 4 shutdown commands. 
> E0921 19:45:38.845 [TaskEventBatchWorker, BatchWorker:217] 
> TaskEventBatchWorker: Failed to process batch item. Error: {} 
> org.apache.ibatis.exceptions.PersistenceException: 
> ### Error querying database.  Cause: org.h2.jdbc.JdbcSQLException: Database 
> is already closed (to disable automatic closing at VM shutdown, add 
> ";DB_CLOSE_ON_EXIT=FALSE" to the db URL) [90121-190]
> ### The error may exist in 
> org/apache/aurora/scheduler/storage/db/TaskMapper.xml
> ### The error may involve 
> org.apache.aurora.scheduler.storage.db.TaskMapper.selectById
> ### The error occurred while executing a query
> ### SQL: SELECT   t.id AS row_id,   t.task_config_row_id AS 
> task_config_row_id,   t.task_id AS task_id,   t.instance_id AS 
> instance_id,   t.status AS status,   t.failure_count AS 
> failure_count,   t.ancestor_task_id AS ancestor_id,   j.role AS 
> c_j_role,   j.environment AS c_j_environment,   j.name AS c_j_name,   
> h.slave_id AS slave_id,   h.host AS slave_host,   tp.id as tp_id, 
>   tp.name as tp_name,   tp.port as tp_port,   te.id as te_id, 
>   te.timestamp_ms as te_timestamp,   te.status as te_status,   
> te.message as te_message,   te.scheduler_host as te_scheduler FROM 
> tasks AS t INNER JOIN task_configs as c ON c.id = t.task_config_row_id
>  INNER JOIN job_keys AS j ON j.id = c.job_key_id LEFT OUTER JOIN 
> task_ports as tp ON tp.task_row_id = t.id LEFT OUTER JOIN task_events as 
> te ON te.task_row_id = t.id LEFT OUTER JOIN host_attributes AS h ON h.id 
> = t.slave_row_id WHERE   t.task_id = ?
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1777) aurora_admin client unable to drain hosts

2016-09-20 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507209#comment-15507209
 ] 

Joshua Cohen commented on AURORA-1777:
--

https://reviews.apache.org/r/52087/
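
For context, a rough before/after sketch of the shape of the fix (the actual 
patch is in the review above; {{user_agent}} here is a placeholder threaded in 
by the caller):

{noformat}
# before (the verbosity bool lands in the user_agent parameter):
self._client = AuroraClientAPI(cluster, verbosity == 'verbose')

# after (sketch; pass a real user agent string through):
self._client = AuroraClientAPI(cluster, user_agent, verbosity == 'verbose')
{noformat}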

> aurora_admin client unable to drain hosts
> -
>
> Key: AURORA-1777
> URL: https://issues.apache.org/jira/browse/AURORA-1777
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Joshua Cohen
>
> Running the following command:
> {noformat}
> aurora-admin host_drain --hosts=<hosts> <cluster>
> {noformat}
> Results in the following error message:
> {noformat}
> WARN] Connection error with scheduler: Unknown error talking to 
> http://<host>/api: Header value False must be of type str or bytes, not 
> <class 'bool'>, reconnecting...
> {noformat}
> Diving deeper shows that we are setting the value of 'User-Agent' in the 
> transport to 'False'.
> The root cause of this can be found in {{host_maintenance.py}} where we 
> create the client like so:
> {noformat}
>   def __init__(self, cluster, verbosity, wait_event=None):
> self._client = AuroraClientAPI(cluster, verbosity == 'verbose')
> self._wait_event = wait_event or Event()
> {noformat}
> However the constructor for {{AuroraClientAPI}} is:
> {noformat}
>   def __init__(
>   self,
>   cluster,
>   user_agent,
>   verbose=False,
>   bypass_leader_redirect=False):
> if not isinstance(cluster, Cluster):
>   raise TypeError('AuroraClientAPI expects instance of Cluster for 
> "cluster", got %s' %
>   type(cluster))
> self._scheduler_proxy = SchedulerProxy(
> cluster,
> verbose=verbose,
> user_agent=user_agent,
> bypass_leader_redirect=bypass_leader_redirect)
> self._cluster = cluster
> {noformat}
> Notice the second argument is {{user_agent}}.
> This bug started to become a problem because we upgraded requests and it 
> includes 
> https://github.com/kennethreitz/requests/commit/be31a90906deb5553c2e703fb05cf6964ee23ed5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1777) aurora_admin client unable to drain hosts

2016-09-20 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1777:
-
Sprint: Twitter Aurora Q2'16 Sprint 21

> aurora_admin client unable to drain hosts
> -
>
> Key: AURORA-1777
> URL: https://issues.apache.org/jira/browse/AURORA-1777
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Joshua Cohen
>
> Running the following command:
> {noformat}
> aurora-admin host_drain --hosts=<hosts> <cluster>
> {noformat}
> Results in the following error message:
> {noformat}
> WARN] Connection error with scheduler: Unknown error talking to 
> http://<host>/api: Header value False must be of type str or bytes, not 
> <class 'bool'>, reconnecting...
> {noformat}
> Diving deeper shows that we are setting the value of 'User-Agent' in the 
> transport to 'False'.
> The root cause of this can be found in {{host_maintenance.py}} where we 
> create the client like so:
> {noformat}
>   def __init__(self, cluster, verbosity, wait_event=None):
> self._client = AuroraClientAPI(cluster, verbosity == 'verbose')
> self._wait_event = wait_event or Event()
> {noformat}
> However the constructor for {{AuroraClientAPI}} is:
> {noformat}
>   def __init__(
>   self,
>   cluster,
>   user_agent,
>   verbose=False,
>   bypass_leader_redirect=False):
> if not isinstance(cluster, Cluster):
>   raise TypeError('AuroraClientAPI expects instance of Cluster for 
> "cluster", got %s' %
>   type(cluster))
> self._scheduler_proxy = SchedulerProxy(
> cluster,
> verbose=verbose,
> user_agent=user_agent,
> bypass_leader_redirect=bypass_leader_redirect)
> self._cluster = cluster
> {noformat}
> Notice the second argument is {{user_agent}}.
> This bug started to become a problem because we upgraded requests and it 
> includes 
> https://github.com/kennethreitz/requests/commit/be31a90906deb5553c2e703fb05cf6964ee23ed5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create

2016-09-19 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1014:
-
Assignee: Santhosh Kumar Shanmugham  (was: brian wickman)

> Client binding_helper to resolve docker label to a stable ID at create
> --
>
> Key: AURORA-1014
> URL: https://issues.apache.org/jira/browse/AURORA-1014
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Packaging
>Reporter: Kevin Sweeney
>Assignee: Santhosh Kumar Shanmugham
>
> Follow-up from discussion on IRC:
> Some docker labels are mutable, meaning the image a task runs in could change 
> from restart to restart even if the rest of the task config doesn't change. 
> This breaks assumptions that make rolling updates the safe and preferred way 
> to deploy a new Aurora job
> Add a binding helper that resolves a docker label to an immutable image 
> identifier at create time and make it the default for the Docker helper 
> introduced in https://reviews.apache.org/r/28920/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1680) Aurora 0.16.0 deprecations

2016-09-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1680:
-
Epic Name: Aurora 0.16.0 deprecations

> Aurora 0.16.0 deprecations
> --
>
> Key: AURORA-1680
> URL: https://issues.apache.org/jira/browse/AURORA-1680
> Project: Aurora
>  Issue Type: Epic
>Reporter: Maxim Khutornenko
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1688) Change framework_name default value from 'TwitterScheduler' to 'aurora'

2016-09-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1688:
-
Sprint: Twitter Aurora Q2'16 Sprint 21

> Change framework_name default value from 'TwitterScheduler' to 'aurora'
> ---
>
> Key: AURORA-1688
> URL: https://issues.apache.org/jira/browse/AURORA-1688
> Project: Aurora
>  Issue Type: Task
>Reporter: Stephan Erb
>Assignee: Santhosh Kumar Shanmugham
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1688) Change framework_name default value from 'TwitterScheduler' to 'aurora'

2016-09-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1688:
-
Issue Type: Task  (was: Bug)

> Change framework_name default value from 'TwitterScheduler' to 'aurora'
> ---
>
> Key: AURORA-1688
> URL: https://issues.apache.org/jira/browse/AURORA-1688
> Project: Aurora
>  Issue Type: Task
>Reporter: Stephan Erb
>Assignee: Santhosh Kumar Shanmugham
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1688) Change framework_name default value from 'TwitterScheduler' to 'aurora'

2016-09-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1688:
-
Issue Type: Bug  (was: Sub-task)
Parent: (was: AURORA-1680)

> Change framework_name default value from 'TwitterScheduler' to 'aurora'
> ---
>
> Key: AURORA-1688
> URL: https://issues.apache.org/jira/browse/AURORA-1688
> Project: Aurora
>  Issue Type: Bug
>Reporter: Stephan Erb
>Assignee: Santhosh Kumar Shanmugham
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1767) Shell health checker is not namespace and taskfs aware

2016-09-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1767:
-
Sprint: Twitter Aurora Q2'16 Sprint 21

> Shell health checker is not namespace and taskfs aware
> --
>
> Key: AURORA-1767
> URL: https://issues.apache.org/jira/browse/AURORA-1767
> Project: Aurora
>  Issue Type: Story
>  Components: Thermos
>Reporter: Stephan Erb
>Assignee: Joshua Cohen
>
> We launch the shell health checker within the context of the host filesystem. 
> As this is a user-defined command, it probably makes sense to launch it 
> within the container instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1681) Remove deprecated --restart-threshold option from 'aurora job restart'

2016-09-15 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494179#comment-15494179
 ] 

Joshua Cohen commented on AURORA-1681:
--

https://reviews.apache.org/r/51924/

> Remove deprecated --restart-threshold option from 'aurora job restart'
> --
>
> Key: AURORA-1681
> URL: https://issues.apache.org/jira/browse/AURORA-1681
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Maxim Khutornenko
>Assignee: Joshua Cohen
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1707) Remove deprecated resource fields in TaskConfig and ResourceAggregate

2016-09-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1707:
-

Moving this to the 0.17.0 release.

> Remove deprecated resource fields in TaskConfig and ResourceAggregate
> -
>
> Key: AURORA-1707
> URL: https://issues.apache.org/jira/browse/AURORA-1707
> Project: Aurora
>  Issue Type: Task
>Reporter: Maxim Khutornenko
>
> Remove individual resource fields in TaskConfig and ResourceAggregate 
> replaced by the new {{Resource}} struct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1681) Remove deprecated --restart-threshold option from 'aurora job restart'

2016-09-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1681:
-
Sprint: Twitter Aurora Q2'16 Sprint 21

> Remove deprecated --restart-threshold option from 'aurora job restart'
> --
>
> Key: AURORA-1681
> URL: https://issues.apache.org/jira/browse/AURORA-1681
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Maxim Khutornenko
>Assignee: Joshua Cohen
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1708) Remove deprecated production flag in TaskConfig

2016-09-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1708:
-

Moving this to 0.17.0

> Remove deprecated production flag in TaskConfig
> ---
>
> Key: AURORA-1708
> URL: https://issues.apache.org/jira/browse/AURORA-1708
> Project: Aurora
>  Issue Type: Task
>Reporter: Mehrdad Nurolahzade
>
> Remove {{production}} field in {{TaskConfig}} struct.
> Task production vs. non-production behavior should now be specified using the 
> {{tier}} field.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1774) Better document the effect of setting max_failures to 0 on a Task

2016-09-13 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1774:


 Summary: Better document the effect of setting max_failures to 0 
on a Task
 Key: AURORA-1774
 URL: https://issues.apache.org/jira/browse/AURORA-1774
 Project: Aurora
  Issue Type: Task
  Components: Documentation
Reporter: Joshua Cohen
Priority: Minor


Setting {{max_failures=0}} on a {{Task}} object will cause the task to exit 
successfully [regardless of whether any of the processes 
failed|https://github.com/apache/aurora/blob/master/src/main/python/apache/thermos/core/runner.py#L788-L797].
 We should update the documentation in 
[docs/configuration/reference.md|https://github.com/apache/aurora/blob/master/docs/reference/configuration.md#max_failures-1]
 to make this clear.
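
For example (abridged config, names illustrative, resources etc. omitted), the 
following Task finishes successfully even though its only process fails:

{code}
task = Task(
  name = 'example',
  processes = [Process(name = 'always_fails', cmdline = 'exit 1')],
  max_failures = 0  # task is treated as SUCCESS despite the failed process
)
{code}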



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1768) Command `aurora task ssh` is not namespace and taskfs aware

2016-09-12 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484248#comment-15484248
 ] 

Joshua Cohen commented on AURORA-1768:
--

It may be enough to just enter the executor's namespace. The task's filesystem 
mount is visible there. If we want to enter the process namespace we can get 
the pid from the checkpoint, but it's more complicated in that we need to 
somehow figure out *which* process's namespaces to enter.

> Command `aurora task ssh` is not namespace and taskfs aware 
> 
>
> Key: AURORA-1768
> URL: https://issues.apache.org/jira/browse/AURORA-1768
> Project: Aurora
>  Issue Type: Story
>  Components: Thermos
>Reporter: Stephan Erb
>
> In order to guarantee isolation among tasks and to simplify debugging in 
> production environments, we should make sure commands executed via `aurora 
> ssh` have been isolated in the same way as the tasks themselves. This implies 
> that we have to use the same container filesystem and enter the same 
> namespaces.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1768) Command `aurora task ssh` is not namespace and taskfs aware

2016-09-09 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477671#comment-15477671
 ] 

Joshua Cohen commented on AURORA-1768:
--

This would likely involve {{aurora task ssh}} invoking some helper binary to 
enter the container's namespace upon connection (similar to how it currently 
just 
[cd's|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/client/api/command_runner.py#L63-L72]
 into the task's sandbox).

Ideally this helper would just be {{nsenter}}, but I don't think nsenter is 
guaranteed to be available on all distros (e.g. it needs to be built from 
source for Ubuntu 14.04 for use in our vagrant image). We could instead create 
our own thin pex that relies on 
[python-nsenter|https://github.com/zalando/python-nsenter] to enter the 
necessary namespaces and then 
[embed|https://github.com/apache/aurora/blob/master/build-support/embed_runner_in_executor.py]
 that in the executor (and later 
[extract|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py#L192-L204]
 it).

This raises the second question: how do we determine which namespace to 
actually enter? I'm unsure of this exactly, but I believe it's available via 
procfs at {{/proc/<pid>/ns/mnt}} (or net, etc.).
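
A rough sketch of such a helper, assuming python-nsenter's documented 
{{Namespace}} context manager (pid discovery from the checkpoint and error 
handling are elided):

{code}
# Enter the target pid's mount and network namespaces, then run the
# user's command there.
import subprocess
from nsenter import Namespace

def run_in_task_namespaces(pid, command):
    with Namespace(pid, 'mnt'), Namespace(pid, 'net'):
        return subprocess.call(command, shell=True)
{code}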

> Command `aurora task ssh` is not namespace and taskfs aware 
> 
>
> Key: AURORA-1768
> URL: https://issues.apache.org/jira/browse/AURORA-1768
> Project: Aurora
>  Issue Type: Story
>  Components: Thermos
>Reporter: Stephan Erb
>
> In order to guarantee isolation among tasks and to simplify debugging in 
> production environments, we should make sure commands executed via `aurora 
> ssh` have been isolated in the same way as the tasks themselves. This implies 
> that we have to use the same container filesystem and enter the same 
> namespaces.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1767) Shell health checker is not namespace and taskfs aware

2016-09-09 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen reassigned AURORA-1767:


Assignee: Joshua Cohen

> Shell health checker is not namespace and taskfs aware
> --
>
> Key: AURORA-1767
> URL: https://issues.apache.org/jira/browse/AURORA-1767
> Project: Aurora
>  Issue Type: Story
>  Components: Thermos
>Reporter: Stephan Erb
>Assignee: Joshua Cohen
>
> We launch the shell health checker within the context of the host filesystem. 
> As this is a user-defined command, it probably makes sense to launch it 
> within the container instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468200#comment-15468200
 ] 

Joshua Cohen commented on AURORA-1763:
--

Yes, for the reasons Jie mentions, setting rootfs is not an option for Thermos.

Another option would be to configure each Mesos agent host with a 
{{/usr/local/nvidia}} directory and then configure the 
{{--global_container_mounts}} flag on the Scheduler to point to that path. 
Thermos will then mount that into each task.
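
For example (mount spec illustrative):

{noformat}
-global_container_mounts=/usr/local/nvidia:/usr/local/nvidia:ro
{noformat}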

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468146#comment-15468146
 ] 

Joshua Cohen commented on AURORA-1763:
--

Where are they mounted from?

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs 
> systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
> 97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - 

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468115#comment-15468115
 ] 

Joshua Cohen commented on AURORA-1763:
--

[~jpinkul] I think in the meantime the only solution is to explicitly include 
the GPU drivers in your Docker image if you'd like to use the unified 
containerizer? Is that correct [~jieyu]?

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc 

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15467903#comment-15467903
 ] 

Joshua Cohen commented on AURORA-1763:
--

Thanks for the context. I'm wondering if this is what's causing it:

{code:title=https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L288-L290}
  if (!containerConfig.has_rootfs()) {
    return None();
  }
{code}

Aurora does not configure tasks with a {{ContainerInfo}} that has an {{Image}} 
set. Instead, it configures the task's filesystem as a {{Volume}} with an 
{{Image}} set. The executor then uses {{mesos-containerizer launch ...}} to 
pivot/chroot into that filesystem. To me it looks like the code above relies 
on the container itself having a rootfs, which won't currently be the case, 
given the way we isolate task filesystems.

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 

[jira] [Updated] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1763:
-
Description: 
When launching a GPU job that uses a Docker image and the unified containerizer 
the Nvidia drivers are not correctly mounted. As an experiment I launched a 
task using both mesos-execute and Aurora using the same Docker image and ran 
nvidia-smi. During the experiment I noticed that the /usr/local/nvidia folder 
was not being mounted properly. To confirm this was the issue I tar'ed the 
drivers up (/run/mesos/isolators/gpu/nvidia_352.39) and manually added it to 
the Docker image. When this was done the task was able to launch correctly.

Here is the resulting mountinfo for the mesos-execute task. Notice how 
/usr/local/nvidia is mounted from the /mesos directory.

{noformat}140 102 8:17 
/mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
141 140 8:17 
/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
rw,mode=600,ptmxmode=666
148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}

Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
missing.

{noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
rw,errors=remount-ro,data=ordered
73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
rw,size=10240k,nr_inodes=16521649,mode=755
74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
rw,gid=5,mode=620,ptmxmode=000
75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
rw,size=26438160k,mode=755
79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
rw,size=5120k
80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
securityfs securityfs rw
84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
ro,mode=755
85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - 
cgroup cgroup 
rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - 
cgroup cgroup rw,cpuset
87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
master:14 - cgroup cgroup rw,cpu,cpuacct
88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - 
cgroup cgroup rw,devices
89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - 
cgroup cgroup rw,freezer
90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
master:17 - cgroup cgroup rw,net_cls,net_prio
91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
cgroup cgroup rw,blkio
92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
master:19 - cgroup cgroup rw,perf_event
93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore 
pstore rw
94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 
rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc 
binfmt_misc rw
98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
99 98 8:17 
/mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d
 
/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs
 rw,relatime master:24 - ext4 /dev/sdb1 

[jira] [Comment Edited] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15467796#comment-15467796
 ] 

Joshua Cohen edited comment on AURORA-1763 at 9/6/16 4:17 PM:
--

[~jieyu] Is this something that should be handled by {{mesos-containerizer 
launch ...}}? I'm not sure how mesos decides when to mount 
{{/usr/local/nvidia}} or how Aurora could make that decision in the executor.


was (Author: joshua.cohen):
[~jieyu] Is this something that should be handled by {{mesos-containerizer 
launch ...}}? I'm not sure how mesos decides when to mount 
{{/usr/local/nvidia}} or how Aurora could make that decision in the Executor.

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer, the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora with the same Docker 
> image and ran nvidia-smi in each. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue, I tarred up the drivers (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added them to the Docker image. Once this was done, the task 
> launched correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> 140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> 72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 

[jira] [Commented] (AURORA-1690) Allow for isolating the executor's filesystem from the task's

2016-08-29 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447050#comment-15447050
 ] 

Joshua Cohen commented on AURORA-1690:
--

https://reviews.apache.org/r/47853/
https://reviews.apache.org/r/51298/

> Allow for isolating the executor's filesystem from the task's
> -
>
> Key: AURORA-1690
> URL: https://issues.apache.org/jira/browse/AURORA-1690
> Project: Aurora
>  Issue Type: Task
>  Components: Executor, Scheduler
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
>
> Per 
> https://github.com/apache/mesos/blob/master/docs/container-image.md#executor-dependencies-in-a-container-image
>  we should be able to specify an image to be mounted containing the 
> executor's filesystem. Amongst other things, this will allow us to remove the 
> requirement that task images contain a python 2.7 runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1755) Mounts created by executor when using filesystem isolation are leaking to the host filesystem's mtab

2016-08-29 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen reassigned AURORA-1755:


Assignee: Joshua Cohen

> Mounts created by executor when using filesystem isolation are leaking to the 
> host filesystem's mtab
> 
>
> Key: AURORA-1755
> URL: https://issues.apache.org/jira/browse/AURORA-1755
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
>
> {noformat}
> $ cat /etc/mtab |grep /var/lib/mesos |wc -l
> 432
> {noformat}
> In theory this should not be happening, because the executor should be 
> running in its own mount namespace. In practice... something is awry. We 
> should talk to the Mesos folks to see what's going on, but we have a few easy 
> solutions regardless:
> * add the -n flag to the mount command so that it does not create the mtab 
> entry.
> * run the mount commands through mesos-containerizer launch's --pre-exec, 
> which will create the mount in the isolated filesystem's namespace.
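To illustrate the first option above, a minimal sketch (the helper and its call 
site are hypothetical; the only substantive change is the -n flag, which tells 
mount(8) not to write an /etc/mtab entry):
{noformat}
import subprocess

def mount_without_mtab(source, target, fstype=None, options=None):
  # Hypothetical executor helper: identical to a normal mount invocation,
  # except -n keeps mount(8) from recording the entry in /etc/mtab.
  cmd = ['mount', '-n']
  if fstype:
    cmd += ['-t', fstype]
  if options:
    cmd += ['-o', options]
  subprocess.check_call(cmd + [source, target])

# Example: a bind mount that leaves the host's mtab untouched.
# mount_without_mtab('/path/to/taskfs', '/path/to/mountpoint', options='bind')
{noformat}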



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1755) Mounts created by executor when using filesystem isolation are leaking to the host filesystem's mtab

2016-08-27 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15441996#comment-15441996
 ] 

Joshua Cohen commented on AURORA-1755:
--

Note that {{/proc/mounts}} does not contain all the leaked entries that 
{{/etc/mtab}} does.

> Mounts created by executor when using filesystem isolation are leaking to the 
> host filesystem's mtab
> 
>
> Key: AURORA-1755
> URL: https://issues.apache.org/jira/browse/AURORA-1755
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Joshua Cohen
>
> {noformat}
> $ cat /etc/mtab |grep /var/lib/mesos |wc -l
> 432
> {noformat}
> In theory this should not be happening, because the executor should be 
> running in its own mount namespace. In practice... something is awry. We 
> should talk to the Mesos folks to see what's going on, but we have a few easy 
> solutions regardless:
> * add the -n flag to the mount command so that it does not create the mtab 
> entry.
> * run the mount commands through mesos-containerizer launch's --pre-exec, 
> which will create the mount in the isolated filesystem's namespace.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1750) Expose Aurora task metadata to thermos task

2016-08-22 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431391#comment-15431391
 ] 

Joshua Cohen commented on AURORA-1750:
--

The executor should already have access to the task metadata. {{launchTask}} 
receives a {{TaskInfo}} proto whose data is a serialized Aurora 
{{AssignedTask}} thrift struct. From there it's just a question of reading the 
metadata from the assigned task's {{TaskConfig}} property.
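A rough sketch of that read path, assuming the Aurora-generated thrift modules 
are on the path (the helper name is illustrative):
{noformat}
from thrift import TSerialization

from gen.apache.aurora.api.ttypes import AssignedTask

def metadata_from_task_info(task_info):
  # task_info.data carries a serialized AssignedTask; the metadata lives on
  # the embedded TaskConfig as a set of Metadata(key, value) structs.
  assigned_task = TSerialization.deserialize(AssignedTask(), task_info.data)
  return {m.key: m.value for m in (assigned_task.task.metadata or [])}
{noformat}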

> Expose Aurora task metadata to thermos task
> ---
>
> Key: AURORA-1750
> URL: https://issues.apache.org/jira/browse/AURORA-1750
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Priority: Minor
>
> Much like how we expose mesos hostname, aurora instance number, etc to 
> thermos I think we should be able to expose Aurora task metadata to thermos 
> tasks.
> I don't foresee complexity or harm in this, and it allows users to plumb more 
> information into the task. For example, one could encode 'package version', 
> 'build pipeline', or 'audit pipeline' metadata into the task. The task could 
> then expose this to others or act differently if required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1711) Allow client to store metadata on Update entity

2016-08-15 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421467#comment-15421467
 ] 

Joshua Cohen commented on AURORA-1711:
--

Ok, that sounds reasonable enough to me. Especially if we extend the 
{{StartJobUpdateResult}} to make it possible to avoid the query entirely.

> Allow client to store metadata on Update entity
> ---
>
> Key: AURORA-1711
> URL: https://issues.apache.org/jira/browse/AURORA-1711
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora 
> API and sometimes the request to the scheduler times out or fails, even 
> though the update is written to storage and started. 
> I'd like to be able to store some unique identifier on the update so that we 
> can reconcile this state later. We can make this generic by allowing clients 
> to store arbitrary metadata on an update (similar to how they do it with job 
> configuration). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1711) Allow client to store metadata on Update entity

2016-08-15 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421363#comment-15421363
 ] 

Joshua Cohen commented on AURORA-1711:
--

My concern is that if we want to support this in the client (and I feel 
strongly that we should), then we're always going to make this query when 
starting a job update (or at the very least, we'll always make this query when 
retrying a failed {{startJobUpdate}} call). Given that, it seems beneficial to 
me to incorporate this directly into the scheduler. Imagine the scenario where 
the scheduler is failing the {{startJobUpdate}} request for reasons other than 
a network partition, e.g. because it's overloaded. If the client always 
responds to that failure by first querying to see if the update it's trying to 
launch is in progress and *then* retries the {{startJobUpdate}} call, that's 
only going to exacerbate the underlying load on the scheduler. Whereas if the 
scheduler performs this check implicitly when the client-update-id is set on 
the {{JobUpdateRequest}}, the process can be better optimized and the overhead 
of an unnecessary RPC avoided.
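For the sake of discussion, a sketch of the client-side flow being argued 
against (all names are illustrative, and the client-update-id argument is the 
proposed field, not an existing API):
{noformat}
class SchedulerTimeout(Exception):
  pass

def start_job_update_with_reconciliation(api, request, client_update_id):
  # Client-side approach: every retry is preceded by a query to see whether
  # the previous attempt actually started the update, costing an extra RPC
  # per attempt. The scheduler-side alternative would perform this check
  # implicitly when the client-update-id is set on the JobUpdateRequest.
  for _ in range(3):
    active = api.get_active_update(request.job_key)  # extra RPC per attempt
    if active is not None and active.client_update_id == client_update_id:
      return active  # the timed-out attempt actually succeeded
    try:
      return api.start_job_update(request, client_update_id)
    except SchedulerTimeout:
      continue  # e.g. a proxy timeout; reconcile and retry
  raise RuntimeError('unable to start or reconcile the update')
{noformat}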

> Allow client to store metadata on Update entity
> ---
>
> Key: AURORA-1711
> URL: https://issues.apache.org/jira/browse/AURORA-1711
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora 
> API and sometimes the request to the scheduler times out or fails, even 
> though the update is written to storage and started. 
> I'd like to be able to store some unique identifier on the update so that we 
> can reconcile this state later. We can make this generic by allowing clients 
> to store arbitrary metadata on an update (similar to how they do it with job 
> configuration). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1711) Allow client to store metadata on Update entity

2016-08-15 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421267#comment-15421267
 ] 

Joshua Cohen commented on AURORA-1711:
--

Does forcing the client to make the query become unnecessarily expensive? The 
aurora command line client can also benefit from this logic in the scenario 
where you're accessing the API through a proxy (e.g. the {{startUpdateRequest}} 
call is sent through the proxy to the scheduler, the update begins, but for 
whatever reason the proxy times out the request to the client, at which point 
the client automatically retries the request and gets an error about the update 
already being in progress). In that scenario, the client will essentially 
*always* make the query to see if the update is active, so why not just have 
the scheduler implicitly make the check when processing a {{startJobUpdate}} 
request that includes the metadata? I think it would also make sense, in this 
case, to move away from a generic "metadata" field and towards an explicit 
"client update identifier" field, so that the scheduler is not enforcing its 
own meaning on what from the outside appears to be data whose purpose should be 
unknown to the scheduler.

> Allow client to store metadata on Update entity
> ---
>
> Key: AURORA-1711
> URL: https://issues.apache.org/jira/browse/AURORA-1711
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora 
> API and sometimes the request to the scheduler times out or fails, even 
> though the update is written to storage and started. 
> I'd like to be able to store some unique identifier on the update so that we 
> can reconcile this state later. We can make this generic by allowing clients 
> to store arbitrary metadata on an update (similar to how they do it with job 
> configuration). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1711) Allow client to store metadata on Update entity

2016-08-15 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421267#comment-15421267
 ] 

Joshua Cohen edited comment on AURORA-1711 at 8/15/16 4:59 PM:
---

Does forcing the client to make the query become unnecessarily expensive? The 
aurora command line client can also benefit from this logic in the scenario 
where you're accessing the API through a proxy (e.g. the {{startJobUpdate}} 
call is sent through the proxy to the scheduler, the update begins, but for 
whatever reason the proxy times out the request to the client, at which point 
the client automatically retries the request and gets an error about the update 
already being in progress). In that scenario, the client will essentially 
*always* make the query to see if the update is active, so why not just have 
the scheduler implicitly make the check when processing a {{startJobUpdate}} 
request that includes the metadata? I think it would also make sense, in this 
case, to move away from a generic "metadata" field and towards an explicit 
"client update identifier" field, so that the scheduler is not enforcing its 
own meaning on what from the outside appears to be data whose purpose should be 
unknown to the scheduler.


was (Author: joshua.cohen):
Does forcing the client to make the query become unnecessarily expensive? The 
aurora command line client can also benefit from this logic in the scenario 
where you're accessing the API through a proxy (e.g. the {{startUpdateRequest}} 
call is sent through the proxy to the scheduler, the update begins, but for 
whatever reason the proxy times out the request to the client, at which point 
the client automatically retries the request and gets an error about the update 
already being in progress). In that scenario, the client will essentially 
*always* make the query to see if the update is active, so why not just have 
the scheduler implicitly make the check when processing a {{startJobUpdate}} 
request that includes the metadata? I think it would also make sense, in this 
case, to move away from a generic "metadata" field and towards an explicit 
"client update identifier" field, so that the scheduler is not enforcing its 
own meaning on what from the outside appears to be data whose purpose should be 
unknown to the scheduler.

> Allow client to store metadata on Update entity
> ---
>
> Key: AURORA-1711
> URL: https://issues.apache.org/jira/browse/AURORA-1711
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora 
> API and sometimes the request to the scheduler times out or fails, even 
> though the update is written to storage and started. 
> I'd like to be able to store some unique identifier on the update so that we 
> can reconcile this state later. We can make this generic by allowing clients 
> to store arbitrary metadata on an update (similar to how they do it with job 
> configuration). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1740) Upgrade to Mesos 1.0.0

2016-07-25 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1740:


 Summary: Upgrade to Mesos 1.0.0
 Key: AURORA-1740
 URL: https://issues.apache.org/jira/browse/AURORA-1740
 Project: Aurora
  Issue Type: Task
Reporter: Joshua Cohen
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1737) Descheduling a cron job checks role access before job key existence

2016-07-15 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1737:


 Summary: Descheduling a cron job checks role access before job key 
existence
 Key: AURORA-1737
 URL: https://issues.apache.org/jira/browse/AURORA-1737
 Project: Aurora
  Issue Type: Bug
  Components: Scheduler
Reporter: Joshua Cohen
Priority: Minor


Trying to deschedule a cron job for a non-existent role returns a permission 
error rather than a no-such-job error. This leads to confusion for users in the 
event of a typo in the role.

Given that jobs are world-readable, we should check for a valid job key before 
applying permissions.
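A minimal sketch of the intended ordering (the store and ACL interfaces here 
are hypothetical, and the scheduler's real implementation is Java):
{noformat}
class NoSuchJobError(Exception):
  pass

def deschedule_cron(job_key, principal, cron_store, acl):
  # Existence first: cron jobs are world-readable, so confirming that a job
  # key does not exist leaks nothing, and a typo'ed role now yields a
  # no-such-job error instead of a misleading permission error.
  if not cron_store.has_job(job_key):
    raise NoSuchJobError('no cron job scheduled for %s' % (job_key,))
  acl.check_write(principal, job_key.role)  # may raise an auth error
  cron_store.deschedule(job_key)
{noformat}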



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1686) Add an http endpoint in the scheduler to provide visibility into the available tiers

2016-06-27 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1686:
-
Sprint: Twitter Aurora Q2'16 Sprint 20

> Add an http endpoint in the scheduler to provide visibility into the 
> available tiers
> 
>
> Key: AURORA-1686
> URL: https://issues.apache.org/jira/browse/AURORA-1686
> Project: Aurora
>  Issue Type: Task
>Reporter: Amol S Deshmukh
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>
> Provide a way for cluster-operators and users to view the available tiers and 
> the individual tier configuration via the scheduler UI.
> This could be implemented as a new http resource under / (root path) in the 
> scheduler UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1690) Allow for isolating the executor's filesystem from the task's

2016-06-27 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1690:
-
Sprint: Twitter Aurora Q2'16 Sprint 20

> Allow for isolating the executor's filesystem from the task's
> -
>
> Key: AURORA-1690
> URL: https://issues.apache.org/jira/browse/AURORA-1690
> Project: Aurora
>  Issue Type: Task
>  Components: Executor, Scheduler
>Reporter: Joshua Cohen
>
> Per 
> https://github.com/apache/mesos/blob/master/docs/container-image.md#executor-dependencies-in-a-container-image
>  we should be able to specify an image to be mounted containing the 
> executor's filesystem. Amongst other things, this will allow us to remove the 
> requirement that task images contain a python 2.7 runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1709) After rb/46835 job key's role is required to exist as a user

2016-06-09 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15322757#comment-15322757
 ] 

Joshua Cohen commented on AURORA-1709:
--

https://reviews.apache.org/r/48492/

> After rb/46835 job key's role is required to exist as a user
> 
>
> Key: AURORA-1709
> URL: https://issues.apache.org/jira/browse/AURORA-1709
> Project: Aurora
>  Issue Type: Bug
>Reporter: Dmitriy Shirchenko
>Assignee: Joshua Cohen
>
> sandbox.py had the change, e.g.
> ```
>   return FileSystemImageSandbox(self.SANDBOX_NAME, 
> self._get_sandbox_user(assigned_task))
> ```
> which is a problem for us since 
> ```
> pwent = pwd.getpwnam(self._user)
> grent = grp.getgrgid(pwent.pw_gid)
> ```
> throw an exception if the user does not exist. This is a change in the 
> behavior of how Aurora launched containers before the diff. Before, the job 
> key's role could be mangled liberally and arbitrarily without this 
> restriction.
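For reference, a sketch of the failure mode and one possible guard (the 
fallback policy shown is a placeholder, not an agreed fix):
{noformat}
import grp
import pwd

def sandbox_owner_ids(user):
  # pwd.getpwnam raises KeyError when the user does not exist on the agent,
  # which is what now breaks jobs whose role is not a system user.
  try:
    pwent = pwd.getpwnam(user)
    grent = grp.getgrgid(pwent.pw_gid)
  except KeyError:
    return None  # caller could fall back to the pre-rb/46835 behavior
  return pwent.pw_uid, grent.gr_gid
{noformat}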



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1690) Allow for isolating the executor's filesystem from the task's

2016-05-06 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1690:


 Summary: Allow for isolating the executor's filesystem from the 
task's
 Key: AURORA-1690
 URL: https://issues.apache.org/jira/browse/AURORA-1690
 Project: Aurora
  Issue Type: Task
  Components: Executor, Scheduler
Reporter: Joshua Cohen


Per 
https://github.com/apache/mesos/blob/master/docs/container-image.md#executor-dependencies-in-a-container-image
 we should be able to specify an image to be mounted containing the executor's 
filesystem. Amongst other things, this will allow us to remove the requirement 
that task images contain a python 2.7 runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1640) Write enduser documentation for the Unified Containerizer support

2016-05-04 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271678#comment-15271678
 ] 

Joshua Cohen commented on AURORA-1640:
--

More docs are needed beyond what's in that review, but it's a beginning.

> Write enduser documentation for the Unified Containerizer support
> -
>
> Key: AURORA-1640
> URL: https://issues.apache.org/jira/browse/AURORA-1640
> Project: Aurora
>  Issue Type: Story
>  Components: Documentation
>Reporter: Stephan Erb
>Assignee: Joshua Cohen
>
> We have to document the Unified Containerizer feature so that it is easy for 
> users and operators to adopt it. 
> Ideally, we cover:
> * how to configure the Aurora scheduler
> * links to the relevant Mesos documentation
> * an example showing a working Aurora spec that can be run within our vagrant 
> environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1637) Update Executor to support launching tasks with images.

2016-05-04 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271673#comment-15271673
 ] 

Joshua Cohen commented on AURORA-1637:
--

Partially supported here: https://reviews.apache.org/r/46835/; more work will 
be done to isolate the executor's filesystem from the task's filesystem.

> Update Executor to support launching tasks with images.
> ---
>
> Key: AURORA-1637
> URL: https://issues.apache.org/jira/browse/AURORA-1637
> Project: Aurora
>  Issue Type: Task
>Reporter: Joshua Cohen
>
> We should also investigate whether it's possible to launch tasks configured 
> with images but no processes, without an executor, relying instead on the 
> image's entrypoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1640) Write enduser documentation for the Unified Containerizer support

2016-05-04 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen reassigned AURORA-1640:


Assignee: Joshua Cohen

> Write enduser documentation for the Unified Containerizer support
> -
>
> Key: AURORA-1640
> URL: https://issues.apache.org/jira/browse/AURORA-1640
> Project: Aurora
>  Issue Type: Story
>  Components: Documentation
>Reporter: Stephan Erb
>Assignee: Joshua Cohen
>
> We have to document the Unified Containerizer feature so that it is easy for 
> users and operators to adopt it. 
> Ideally, we cover:
> * how to configure the Aurora scheduler
> * links to the relevant Mesos documentation
> * an example showing a working Aurora spec that can be run within our vagrant 
> environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1672) Duplicate task events displayed in UI

2016-04-20 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen reassigned AURORA-1672:


Assignee: Joshua Cohen

> Duplicate task events displayed in UI
> -
>
> Key: AURORA-1672
> URL: https://issues.apache.org/jira/browse/AURORA-1672
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
>Priority: Blocker
> Attachments: Example of duplicate task events.png
>
>
> While working on the unified containerizer support, I noticed that I was 
> seeing duplicate task events in the scheduler UI. On digging further, these 
> events were present in the DB as well. Initially I thought it was related to 
> the storage changes for task images, but I was able to reproduce on a sha 
> before those changes were introduced.
> I ran a git bisect to see what commit introduced this bug and it pointed to: 
> https://github.com/apache/aurora/commit/915459dac76ed0732addce87420a4ba51d916de8
> It's not immediately obvious to me how that change is causing this, 
> especially since setting {{-populate_discovery_info=false}} in 
> {{aurora-scheduler.conf}} does *not* fix the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1672) Duplicate task events displayed in UI

2016-04-20 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250863#comment-15250863
 ] 

Joshua Cohen commented on AURORA-1672:
--

The root cause of this is the introduction of a new process to the http_example 
job that references a non-existent port: 
https://github.com/apache/aurora/blob/915459dac76ed0732addce87420a4ba51d916de8/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora#L21

If I either remove the {{tcp port: {{thermos.ports[tcp]}};}} or add tcp to the 
portmap, this issue goes away.
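Roughly, in the .aurora config (this is an assumed shape of the fix, not the 
exact committed change):
{noformat}
# The process that references a port with no allocation or mapping:
tcp_process = Process(
  name = 'tcp',
  cmdline = 'echo tcp port: {{thermos.ports[tcp]}}'
)

# One fix: alias tcp to an allocated port via the Announcer portmap, so
# {{thermos.ports[tcp]}} resolves without needing its own allocation.
# (The other fix is simply deleting tcp_process from the job.)
announce = Announcer(
  primary_port = 'http',
  portmap = {'aurora': 'http', 'tcp': 'http'}
)
{noformat}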

> Duplicate task events displayed in UI
> -
>
> Key: AURORA-1672
> URL: https://issues.apache.org/jira/browse/AURORA-1672
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Joshua Cohen
>Priority: Blocker
> Attachments: Example of duplicate task events.png
>
>
> While working on the unified containerizer support, I noticed that I was 
> seeing duplicate task events in the scheduler UI. On digging further, these 
> events were present in the DB as well. Initially I thought it was related to 
> the storage changes for task images, but I was able to reproduce on a sha 
> before those changes were introduced.
> I ran a git bisect to see what commit introduced this bug and it pointed to: 
> https://github.com/apache/aurora/commit/915459dac76ed0732addce87420a4ba51d916de8
> It's not immediately obvious to me how that change is causing this, 
> especially since setting {{-populate_discovery_info=false}} in 
> {{aurora-scheduler.conf}} does *not* fix the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1672) Duplicate task events displayed in UI

2016-04-20 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1672:
-
Attachment: Example of duplicate task events.png

> Duplicate task events displayed in UI
> -
>
> Key: AURORA-1672
> URL: https://issues.apache.org/jira/browse/AURORA-1672
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Joshua Cohen
>Priority: Blocker
> Attachments: Example of duplicate task events.png
>
>
> While working on the unified containerizer support, I noticed that I was 
> seeing duplicate task events in the scheduler UI. On digging further, these 
> events were present in the DB as well. Initially I thought it was related to 
> the storage changes for task images, but I was able to reproduce on a sha 
> before those changes were introduced.
> I ran a git bisect to see what commit introduced this bug and it pointed to: 
> https://github.com/apache/aurora/commit/915459dac76ed0732addce87420a4ba51d916de8
> It's not immediately obvious to me how that change is causing this, 
> especially since setting {{-populate_discovery_info=false}} in 
> {{aurora-scheduler.conf}} does *not* fix the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1672) Duplicate task events displayed in UI

2016-04-20 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1672:


 Summary: Duplicate task events displayed in UI
 Key: AURORA-1672
 URL: https://issues.apache.org/jira/browse/AURORA-1672
 Project: Aurora
  Issue Type: Task
  Components: Scheduler
Reporter: Joshua Cohen
Priority: Blocker


While working on the unified containerizer support, I noticed that I was seeing 
duplicate task events in the scheduler UI. On digging further, these events 
were present in the DB as well. Initially I thought it was related to the 
storage changes for task images, but I was able to reproduce on a sha before 
those changes were introduced.

I ran a git bisect to see what commit introduced this bug and it pointed to: 
https://github.com/apache/aurora/commit/915459dac76ed0732addce87420a4ba51d916de8

It's not immediately obvious to me how that change is causing this, especially 
since setting {{-populate_discovery_info=false}} in {{aurora-scheduler.conf}} 
does *not* fix the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1667) Sort out vagrant box versioning

2016-04-18 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245874#comment-15245874
 ] 

Joshua Cohen commented on AURORA-1667:
--

https://reviews.apache.org/r/46335

> Sort out vagrant box versioning
> ---
>
> Key: AURORA-1667
> URL: https://issues.apache.org/jira/browse/AURORA-1667
> Project: Aurora
>  Issue Type: Task
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
>
> I tried to commit the update to Mesos 0.27.2 today but ran into an issue 
> provisioning the Vagrant environment. This turned out to be caused by my not 
> having run `vagrant box update`. We should look into a solution that doesn't 
> require this manual step.
> We should also look into pinning the box version referenced by vagrant so we 
> can make these changes atomically in the future.
> Pinning the box version may also remove the need to run `vagrant box update`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-1667) Sort out vagrant box versioning

2016-04-18 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen resolved AURORA-1667.
--
Resolution: Fixed

> Sort out vagrant box versioning
> ---
>
> Key: AURORA-1667
> URL: https://issues.apache.org/jira/browse/AURORA-1667
> Project: Aurora
>  Issue Type: Task
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
>
> I tried to commit the update to Mesos 0.27.2 today but ran into an issue 
> provisioning the Vagrant environment. This turned out to be caused by my not 
> having run `vagrant box update`. We should look into a solution that doesn't 
> require this manual step.
> We should also look into pinning the box version referenced by vagrant so we 
> can make these changes atomically in the future.
> Pinning the box version may also remove the need to run `vagrant box update`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1667) Sort out vagrant box versioning

2016-04-18 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen reassigned AURORA-1667:


Assignee: Joshua Cohen

> Sort out vagrant box versioning
> ---
>
> Key: AURORA-1667
> URL: https://issues.apache.org/jira/browse/AURORA-1667
> Project: Aurora
>  Issue Type: Task
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
>
> I tried to commit the update to Mesos 0.27.2 today but ran into an issue 
> provisioning the Vagrant environment. This turned out to be caused by my not 
> having run `vagrant box update`. We should look into a solution that doesn't 
> require this manual step.
> We should also look into pinning the box version referenced by vagrant so we 
> can make these changes atomically in the future.
> Pinning the box version may also remove the need to run `vagrant box update`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1667) Sort out vagrant box versioning

2016-04-15 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1667:


 Summary: Sort out vagrant box versioning
 Key: AURORA-1667
 URL: https://issues.apache.org/jira/browse/AURORA-1667
 Project: Aurora
  Issue Type: Task
Reporter: Joshua Cohen


I tried to commit the update to Mesos 0.27.2 today but ran into an issue 
provisioning the Vagrant environment. This turned out to be caused by my not 
having run `vagrant box update`. We should look into a solution that doesn't 
require this manual step.

We should also look into pinning the box version referenced by vagrant so we 
can make these changes atomically in the future.

Pinning the box version may also remove the need to run `vagrant box update`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-995) Scheduler should expose its current version

2016-04-12 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-995:

Assignee: Florian Pfeiffer  (was: Joshua Cohen)

> Scheduler should expose its current version
> ---
>
> Key: AURORA-995
> URL: https://issues.apache.org/jira/browse/AURORA-995
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Joshua Cohen
>Assignee: Florian Pfeiffer
>Priority: Minor
>
> It would be very useful if we were able to query a running scheduler to see 
> what version is deployed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-995) Scheduler should expose its current version

2016-04-12 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen resolved AURORA-995.
-
Resolution: Fixed

Works for me. Thanks Florian!

> Scheduler should expose its current version
> ---
>
> Key: AURORA-995
> URL: https://issues.apache.org/jira/browse/AURORA-995
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Joshua Cohen
>Assignee: Florian Pfeiffer
>Priority: Minor
>
> It would be very useful if we were able to query a running scheduler to see 
> what version is deployed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1635) Update Scheduler storage to support storing images

2016-03-22 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206834#comment-15206834
 ] 

Joshua Cohen commented on AURORA-1635:
--

https://reviews.apache.org/r/45112/

> Update Scheduler storage to support storing images
> --
>
> Key: AURORA-1635
> URL: https://issues.apache.org/jira/browse/AURORA-1635
> Project: Aurora
>  Issue Type: Task
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
>
> As part of the work to support the Mesos unified containerizer, we'll need 
> to store images configured on tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-1619) Generate documentation for command line args for all Aurora binaries

2016-03-19 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen resolved AURORA-1619.
--
Resolution: Fixed

[~StephanErb] took care of this here: https://reviews.apache.org/r/44770/

> Generate documentation for command line args for all Aurora binaries
> 
>
> Key: AURORA-1619
> URL: https://issues.apache.org/jira/browse/AURORA-1619
> Project: Aurora
>  Issue Type: Task
>  Components: Documentation
>Reporter: Joshua Cohen
>Assignee: Stephan Erb
>Priority: Minor
>
> We've recently had a fair number of questions in IRC about configuring 
> Aurora and its related components. Since all configuration is done via 
> command line args, it would be nice if we had some documentation of these 
> args, rather than asking people to build the binaries and run their 
> respective help commands.
> Bonus points if we generated this automatically directly from the binaries' 
> respective help outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-869) Document all available scheduler command line options

2016-03-19 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen resolved AURORA-869.
-
Resolution: Fixed
  Assignee: Stephan Erb

[~StephanErb] took care of this here: https://reviews.apache.org/r/44770/

> Document all available scheduler command line options
> -
>
> Key: AURORA-869
> URL: https://issues.apache.org/jira/browse/AURORA-869
> Project: Aurora
>  Issue Type: Task
>  Components: Documentation
>Reporter: Maxim Khutornenko
>Assignee: Stephan Erb
>Priority: Minor
>
> We lack a page fully documenting _all_ available scheduler command line 
> options, including those hidden in twitter.common libs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1619) Generate documentation for command line args for all Aurora binaries

2016-03-19 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1619:
-
Assignee: Stephan Erb

> Generate documentation for command line args for all Aurora binaries
> 
>
> Key: AURORA-1619
> URL: https://issues.apache.org/jira/browse/AURORA-1619
> Project: Aurora
>  Issue Type: Task
>  Components: Documentation
>Reporter: Joshua Cohen
>Assignee: Stephan Erb
>Priority: Minor
>
> We've recently had a fair number of questions in IRC about configuring 
> Aurora and its related components. Since all configuration is done via 
> command line args, it would be nice if we had some documentation of these 
> args, rather than asking people to build the binaries and run their 
> respective help commands.
> Bonus points if we generated this automatically directly from the binaries' 
> respective help outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1639) Update client to allow configuring tasks with images.

2016-03-15 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1639:


 Summary: Update client to allow configuring tasks with images.
 Key: AURORA-1639
 URL: https://issues.apache.org/jira/browse/AURORA-1639
 Project: Aurora
  Issue Type: Task
Reporter: Joshua Cohen






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1635) Update Scheduler storage to support storing images

2016-03-15 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1635:


 Summary: Update Scheduler storage to support storing images
 Key: AURORA-1635
 URL: https://issues.apache.org/jira/browse/AURORA-1635
 Project: Aurora
  Issue Type: Task
Reporter: Joshua Cohen


As part of the work to support the Mesos unified containerizer, we'll need to 
store images configured on tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1634) Support launching tasks using the Mesos unified containerizer

2016-03-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1634:
-
Component/s: Scheduler

> Support launching tasks using the Mesos unified containerizer
> -
>
> Key: AURORA-1634
> URL: https://issues.apache.org/jira/browse/AURORA-1634
> Project: Aurora
>  Issue Type: Epic
>  Components: Client, Executor, Scheduler
>Reporter: Joshua Cohen
>
> https://docs.google.com/document/d/111T09NBF2zjjl7HE95xglsDpRdKoZqhCRM5hHmOfTLA/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1634) Support launching tasks using the Mesos unified containerizer

2016-03-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1634:
-
Component/s: Executor
 Client

> Support launching tasks using the Mesos unified containerizer
> -
>
> Key: AURORA-1634
> URL: https://issues.apache.org/jira/browse/AURORA-1634
> Project: Aurora
>  Issue Type: Epic
>  Components: Client, Executor, Scheduler
>Reporter: Joshua Cohen
>
> https://docs.google.com/document/d/111T09NBF2zjjl7HE95xglsDpRdKoZqhCRM5hHmOfTLA/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1632) Investigate executor fixes when Mesos 0.30.0 stops passing along environment variables

2016-03-10 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190116#comment-15190116
 ] 

Joshua Cohen commented on AURORA-1632:
--

I talked to Jie in IRC and he confirmed that PATH will not be passed along to 
executors. So we're going to have to figure out the answer to #2 (i.e., what 
value should we set for $PATH?).

> Investigate executor fixes when Mesos 0.30.0 stops passing along environment 
> variables
> --
>
> Key: AURORA-1632
> URL: https://issues.apache.org/jira/browse/AURORA-1632
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Joshua Cohen
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> In the 0.30.0 release, the Mesos Agent will no longer implicitly pass along 
> its environment variables (see: 
> http://mail-archives.apache.org/mod_mbox/mesos-dev/201603.mbox/%3CCAK7AWaGB24ALh8eb%2BvKMFgc4%2BjmhxZ6ry79HBcKN%2BBt04Sx43A%40mail.gmail.com%3E).
> I tested in vagrant by explicitly setting the 
> {{--executor_environment_variables}} flag on the agent to {{'{}'}} and 
> verified that this does impact us. Initially we get a permission denied error 
> when trying to fork the runner:
> {noformat}
> I0310 16:36:21.048671 18103 thermos_task_runner.py:275] Forking off runner 
> with cmdline:  
> /var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/thermos_runner.pex
>  --setuid=vagrant 
> --task_id=vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf 
> --log_to_disk=DEBUG --hostname=192.168.33.7 
> --thermos_json=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/task.json
>  
> --sandbox=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/sandbox
>  
> --log_dir=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c
>  
> --checkpoint_root=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/checkpoints
>  --process_logger_destination=file --port=aurora:31248 --port=http:31248
> F0310 16:36:21.057298 18103 aurora_executor.py:80] Task initialization 
> failed: [Errno 13] Permission denied
> {noformat}
> This error can be addressed with the patch from this pull request: 
> https://github.com/apache/aurora/pull/21. However, even after applying this 
> patch, processes fail to fork (see attached screenshot).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1632) Investigate executor fixes when Mesos 0.30.0 stops passing along environment variables

2016-03-10 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190089#comment-15190089
 ] 

Joshua Cohen commented on AURORA-1632:
--

On a related note, further digging reveals that the mention of 
{{LD_LIBRARY_PATH}} in the linked pull request was a red herring. It's {{PATH}} 
that must be set in the environment in order for {{sys.executable}} to work.

> Investigate executor fixes when Mesos 0.30.0 stops passing along environment 
> variables
> --
>
> Key: AURORA-1632
> URL: https://issues.apache.org/jira/browse/AURORA-1632
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Joshua Cohen
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> In the 0.30.0 release, the Mesos Agent will no longer implicitly pass along 
> its environment variables (see: 
> http://mail-archives.apache.org/mod_mbox/mesos-dev/201603.mbox/%3CCAK7AWaGB24ALh8eb%2BvKMFgc4%2BjmhxZ6ry79HBcKN%2BBt04Sx43A%40mail.gmail.com%3E).
> I tested in vagrant by explicitly setting the 
> {{--executor_environment_variables}} flag on the agent to {{'{}'}} and 
> verified that this does impact us. Initially we get a permission denied error 
> when trying to fork the runner:
> {noformat}
> I0310 16:36:21.048671 18103 thermos_task_runner.py:275] Forking off runner 
> with cmdline:  
> /var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/thermos_runner.pex
>  --setuid=vagrant 
> --task_id=vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf 
> --log_to_disk=DEBUG --hostname=192.168.33.7 
> --thermos_json=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/task.json
>  
> --sandbox=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/sandbox
>  
> --log_dir=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c
>  
> --checkpoint_root=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/checkpoints
>  --process_logger_destination=file --port=aurora:31248 --port=http:31248
> F0310 16:36:21.057298 18103 aurora_executor.py:80] Task initialization 
> failed: [Errno 13] Permission denied
> {noformat}
> This error can be addressed with the patch from this pull request: 
> https://github.com/apache/aurora/pull/21. However, even after applying this 
> patch, processes fail to fork (see attached screenshot).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1632) Investigate executor fixes when Mesos 0.30.0 stops passing along environment variables

2016-03-10 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190008#comment-15190008
 ] 

Joshua Cohen commented on AURORA-1632:
--

*ping* [~jieyu] who can hopefully answer question #1 above.

> Investigate executor fixes when Mesos 0.30.0 stops passing along environment 
> variables
> --
>
> Key: AURORA-1632
> URL: https://issues.apache.org/jira/browse/AURORA-1632
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Joshua Cohen
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> In the 0.30.0 release, the Mesos Agent will no longer implicitly pass along 
> its environment variables (see: 
> http://mail-archives.apache.org/mod_mbox/mesos-dev/201603.mbox/%3CCAK7AWaGB24ALh8eb%2BvKMFgc4%2BjmhxZ6ry79HBcKN%2BBt04Sx43A%40mail.gmail.com%3E).
> I tested in vagrant by explicitly setting the 
> {{--executor_environment_variables}} flag on the agent to {{'{}'}} and 
> verified that this does impact us. Initially we get a permission denied error 
> when trying to fork the runner:
> {noformat}
> I0310 16:36:21.048671 18103 thermos_task_runner.py:275] Forking off runner 
> with cmdline:  
> /var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/thermos_runner.pex
>  --setuid=vagrant 
> --task_id=vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf 
> --log_to_disk=DEBUG --hostname=192.168.33.7 
> --thermos_json=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/task.json
>  
> --sandbox=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/sandbox
>  
> --log_dir=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c
>  
> --checkpoint_root=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/checkpoints
>  --process_logger_destination=file --port=aurora:31248 --port=http:31248
> F0310 16:36:21.057298 18103 aurora_executor.py:80] Task initialization 
> failed: [Errno 13] Permission denied
> {noformat}
> This error can be addressed with the patch from this pull request: 
> https://github.com/apache/aurora/pull/21. However, even after applying this 
> patch processes fail to fork (see attached screenshot).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1632) Investigate executor fixes when Mesos 0.30.0 stops passing along environment variables

2016-03-10 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190001#comment-15190001
 ] 

Joshua Cohen commented on AURORA-1632:
--

So I dug a little bit deeper into this problem and it seems that the cause was 
this: 
https://github.com/apache/aurora/blob/master/src/main/python/apache/thermos/core/process.py#L394-L399

We try to set PATH in the environment of the forked process, but PATH is no 
longer set in our own environment, so we end up (silently) raising a KeyError 
when we try to access it.
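
To illustrate the failure mode, a minimal sketch (not the actual thermos code; 
the function names and env-dict shape here are assumptions):

{code}
import os

def build_child_env_buggy():
    # Indexing os.environ directly raises KeyError when PATH is not set in
    # the parent environment (as happens once the agent stops passing its
    # environment variables along).
    return {'PATH': os.environ['PATH']}

def build_child_env_safe():
    # One defensive option: fall back to os.defpath when PATH is absent.
    return {'PATH': os.environ.get('PATH', os.defpath)}
{code}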

This raises two questions for me:

# Is PATH one of the environment variables that will *still* be passed to the 
executor after this change? (See Jie's response here indicating that some vars 
will still be passed: 
http://mail-archives.apache.org/mod_mbox/mesos-dev/201603.mbox/%3CCAJvN1BOq4aKNGZ5WEQLKp+kgCMaTwWQp8tdn37=16e-_+a+...@mail.gmail.com%3E)
# If the above is not true (i.e. $PATH will not be set in the executor's 
environment), do tasks today expect Aurora to set PATH in their environment? 
Presumably they do, or at the very least we cannot assume they do not. Given 
this, what value should we set PATH to?

> Investigate executor fixes when Mesos 0.30.0 stops passing along environment 
> variables
> --
>
> Key: AURORA-1632
> URL: https://issues.apache.org/jira/browse/AURORA-1632
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Joshua Cohen
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> In the 0.30.0 release, the Mesos Agent will no longer implicitly pass along 
> its environment variables (see: 
> http://mail-archives.apache.org/mod_mbox/mesos-dev/201603.mbox/%3CCAK7AWaGB24ALh8eb%2BvKMFgc4%2BjmhxZ6ry79HBcKN%2BBt04Sx43A%40mail.gmail.com%3E).
> I tested in vagrant by explicitly setting the 
> {{--executor_environment_variables}} flag on the agent to {{'{}'}} and 
> verified that this does impact us. Initially we get a permission denied error 
> when trying to fork the runner:
> {noformat}
> I0310 16:36:21.048671 18103 thermos_task_runner.py:275] Forking off runner 
> with cmdline:  
> /var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/thermos_runner.pex
>  --setuid=vagrant 
> --task_id=vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf 
> --log_to_disk=DEBUG --hostname=192.168.33.7 
> --thermos_json=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/task.json
>  
> --sandbox=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/sandbox
>  
> --log_dir=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c
>  
> --checkpoint_root=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/checkpoints
>  --process_logger_destination=file --port=aurora:31248 --port=http:31248
> F0310 16:36:21.057298 18103 aurora_executor.py:80] Task initialization 
> failed: [Errno 13] Permission denied
> {noformat}
> This error can be addressed with the patch from this pull request: 
> https://github.com/apache/aurora/pull/21. However, even after applying this 
> patch processes fail to fork (see attached screenshot).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1632) Investigate executor fixes when Mesos 0.30.0 stops passing along environment variables

2016-03-10 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1632:
-
Attachment: screenshot-1.png

> Investigate executor fixes when Mesos 0.30.0 stops passing along environment 
> variables
> --
>
> Key: AURORA-1632
> URL: https://issues.apache.org/jira/browse/AURORA-1632
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Joshua Cohen
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> In the 0.30.0 release, the Mesos Agent will no longer implicitly pass along 
> its environment variables (see: 
> http://mail-archives.apache.org/mod_mbox/mesos-dev/201603.mbox/%3CCAK7AWaGB24ALh8eb%2BvKMFgc4%2BjmhxZ6ry79HBcKN%2BBt04Sx43A%40mail.gmail.com%3E).
> I tested in vagrant by explicitly setting the 
> {{--executor_environment_variables}} flag on the agent to {{'{}'}} and 
> verified that this does impact us. Initially we get a permission denied error 
> when trying to fork the runner:
> {noformat}
> I0310 16:36:21.048671 18103 thermos_task_runner.py:275] Forking off runner 
> with cmdline:  
> /var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/thermos_runner.pex
>  --setuid=vagrant 
> --task_id=vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf 
> --log_to_disk=DEBUG --hostname=192.168.33.7 
> --thermos_json=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/task.json
>  
> --sandbox=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/sandbox
>  
> --log_dir=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c
>  
> --checkpoint_root=/var/lib/mesos/slaves/aa9f2963-947d-4582-8cec-e694d9d06e79-S0/frameworks/0f9b27e9-6b03-4b5e-9e2f-91eae9ba5c99-0003/executors/thermos-vagrant-test-http_example-0-a905b6d0-79d7-4fff-9cb2-5f5b4a6709cf/runs/56f62331-3ad4-463a-b392-3b80cc664b3c/checkpoints
>  --process_logger_destination=file --port=aurora:31248 --port=http:31248
> F0310 16:36:21.057298 18103 aurora_executor.py:80] Task initialization 
> failed: [Errno 13] Permission denied
> {noformat}
> This error can be addressed with the patch from this pull request: 
> https://github.com/apache/aurora/pull/21. However, even after applying this 
> patch processes fail to fork (see attached screenshot).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1621) Support redirect to leading scheduler host

2016-02-22 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157379#comment-15157379
 ] 

Joshua Cohen commented on AURORA-1621:
--

This seems like it's potentially a dupe of 
https://issues.apache.org/jira/browse/AURORA-1493. [~ashwinm_uber], do you 
agree? If so, can we close this ticket in favor of that one?

> Support redirect to leading scheduler host
> --
>
> Key: AURORA-1621
> URL: https://issues.apache.org/jira/browse/AURORA-1621
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Ashwin Murthy
>Assignee: Ashwin Murthy
>Priority: Minor
>  Labels: features, uber
>
> Support a health endpoint that returns 200 if the host is the leading 
> scheduler and 500 otherwise. This enables redirecting requests to the 
> leading scheduler.
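
For illustration, a minimal Python sketch of such an endpoint (the handler, 
the leadership check, and the port are assumptions; the actual scheduler 
implementation would be Java):

{code}
from http.server import BaseHTTPRequestHandler, HTTPServer

def is_leading_scheduler():
    # Assumption: a real implementation would consult the existing leader
    # election state (e.g. via ZooKeeper); hardcoded for this sketch.
    return True

class LeaderHealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 only on the leader, 500 otherwise, so a load balancer or
        # redirector can locate the leading instance.
        self.send_response(200 if is_leading_scheduler() else 500)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('', 8081), LeaderHealthHandler).serve_forever()
{code}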



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1619) Generate documentation for command line args for all Aurora binaries

2016-02-12 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1619:


 Summary: Generate documentation for command line args for all 
Aurora binaries
 Key: AURORA-1619
 URL: https://issues.apache.org/jira/browse/AURORA-1619
 Project: Aurora
  Issue Type: Task
  Components: Documentation
Reporter: Joshua Cohen
Priority: Minor


We've recently had a fair number of questions in IRC about configuring Aurora 
and its related components. Since all configuration is done via command line 
args, it would be nice if we had some documentation of these args, rather than 
asking people to build the binaries and run their respective help commands.

Bonus points if we generated this automatically, directly from the binaries' 
respective help outputs.
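
As a rough sketch of the automated approach (the binary names and output 
format below are assumptions for illustration, not an actual tool):

{code}
import subprocess

# Assumed binary names; the real list would cover every Aurora component.
BINARIES = ['aurora', 'aurora-scheduler', 'thermos_observer']

def help_text(binary):
    # Some CLIs print help to stderr and/or exit non-zero, so capture both
    # streams and don't check the return code.
    result = subprocess.run([binary, '--help'], capture_output=True, text=True)
    return result.stdout or result.stderr

if __name__ == '__main__':
    with open('cli-args.md', 'w') as out:
        for binary in BINARIES:
            out.write('## %s\n\n```\n%s\n```\n\n' % (binary, help_text(binary)))
{code}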



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1614) Failed sandbox initialization can cause tasks to go LOST

2016-02-11 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1614:


 Summary: Failed sandbox initialization can cause tasks to go LOST
 Key: AURORA-1614
 URL: https://issues.apache.org/jira/browse/AURORA-1614
 Project: Aurora
  Issue Type: Bug
  Components: Executor
Reporter: Joshua Cohen
Priority: Minor


When we initialize the sandbox, we only catch sandbox-specific error types, 
meaning that if an unexpected error is raised, the executor just hangs until 
the timeout is exceeded, at which point the task goes LOST.

We should instead broadly catch exceptions raised during sandbox initialization 
and quickly fail tasks.
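
A minimal sketch of the intended handling ({{SandboxError}} and the 
{{fail_task}} callback are assumed names, not the actual executor API):

{code}
class SandboxError(Exception):
    """Stand-in for the sandbox-specific error type (assumed name)."""

def initialize_sandbox(create_sandbox, fail_task):
    try:
        create_sandbox()
    except SandboxError as e:
        fail_task('Failed to initialize sandbox: %s' % e)
    except Exception as e:
        # The broad catch is deliberate: any unexpected error should fail
        # the task quickly instead of letting the executor hang until the
        # timeout is exceeded and the task goes LOST.
        fail_task('Unexpected error initializing sandbox: %s' % e)
{code}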

Additionally, the {{DockerDirectorySandbox}} was not properly catching errors 
raised when creating/symlinking, which led to the above problem in the event of 
a misconfiguration. In practice this issue shouldn't have occurred in normal 
usage, but it made development slow until I tracked down what was causing the 
tasks to just hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1614) Failed sandbox initialization can cause tasks to go LOST

2016-02-11 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15143024#comment-15143024
 ] 

Joshua Cohen commented on AURORA-1614:
--

https://reviews.apache.org/r/43486/

> Failed sandbox initialization can cause tasks to go LOST
> 
>
> Key: AURORA-1614
> URL: https://issues.apache.org/jira/browse/AURORA-1614
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
>Priority: Minor
>
> When we initialize the sandbox, we only catch sandbox-specific error types, 
> meaning that if an unexpected error is raised, the executor just hangs until 
> the timeout is exceeded, at which point the task goes LOST.
> We should instead broadly catch exceptions raised during sandbox 
> initialization and quickly fail tasks.
> Additionally, the {{DockerDirectorySandbox}} was not properly catching errors 
> raised when creating/symlinking, which led to the above problem in the event 
> of a misconfiguration. In practice this issue shouldn't have occurred in 
> normal usage, but it made development slow until I tracked down what was 
> causing the tasks to just hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1608) Convert end to end tests to use Docker rather than Vagrant

2016-02-02 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1608:


 Summary: Convert end to end tests to use Docker rather than Vagrant
 Key: AURORA-1608
 URL: https://issues.apache.org/jira/browse/AURORA-1608
 Project: Aurora
  Issue Type: Task
  Components: Testing
Reporter: Joshua Cohen
Priority: Minor


We'd like to run the e2e tests in CI, which is not currently possible with 
Vagrant. We can, however, run Docker images in CI, so converting the tests to 
Docker seems like a net win.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1609) Create automated rollback testing

2016-02-02 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1609:
-
Description: As revealed by AURORA-1603, we're not always as rigorous as we 
should be about ensuring compatibility between commits. It would be great to 
run automated tests that ensure that rollbacks between commits (and between 
releases!) are possible. It should be possible to update the end to end tests 
to finish off by reverting to the previous commit, rebuilding and restarting 
the scheduler, then ensuring everything starts up cleanly. Once AURORA-1608 is 
done, we can automate this by having e2e tests run as part of our CI job.  (was: As 
revealed by AURORA-1603, we're not always as rigorous as we should be about 
ensuring compatibility between commits. It would be great to run automated 
tests that ensure that rollbacks between commits (and between releases!) are 
possible. Once we have AURORA-1608 done, it should be possible to update the 
end to end tests to finish off by reverting to the previous commit, rebuilding 
and restarting the scheduler, then ensuring everything starts up cleanly.)
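
Roughly, the extra end-to-end step could look something like this sketch (the 
build task, service name, and health endpoint are assumptions about the 
environment, not actual test code):

{code}
import subprocess

def run(cmd):
    subprocess.check_call(cmd, shell=True)

def verify_rollback():
    run('git checkout HEAD~1')            # revert to the previous commit
    run('./gradlew assemble')             # rebuild the scheduler (assumed task)
    run('sudo restart aurora-scheduler')  # restart (assumed service name)
    # Assert the scheduler comes back up cleanly; the endpoint is assumed.
    run('curl -sf http://localhost:8081/health')

if __name__ == '__main__':
    verify_rollback()
{code}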

> Create automated rollback testing
> -
>
> Key: AURORA-1609
> URL: https://issues.apache.org/jira/browse/AURORA-1609
> Project: Aurora
>  Issue Type: Task
>  Components: Testing
>Reporter: Joshua Cohen
>
> As revealed by AURORA-1603, we're not always as rigorous as we should be 
> about ensuring compatibility between commits. It would be great to run 
> automated tests that ensure that rollbacks between commits (and between 
> releases!) are possible. It should be possible to update the end to end tests 
> to finish off by reverting to the previous commit, rebuilding and restarting 
> the scheduler, then ensuring everything starts up cleanly. Once AURORA-1608 
> is done, we can automate this by having e2e tests run as part of our CI job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

