[jira] [Created] (AURORA-1642) Thermos runner finalization is broken

2016-03-15 Thread Maxim Khutornenko (JIRA)
Maxim Khutornenko created AURORA-1642:
-

 Summary: Thermos runner finalization is broken
 Key: AURORA-1642
 URL: https://issues.apache.org/jira/browse/AURORA-1642
 Project: Aurora
  Issue Type: Bug
  Components: Executor
Reporter: Maxim Khutornenko


We have noticed that Thermos runner finalization no longer works for tasks 
with blocking threads after commit 
[024bac9dcb8f37e4b31210e3a0a7aea2345a16ab|https://reviews.apache.org/r/40922/].

I was able to reproduce it in Vagrant by extending the sleep timeout of the 
{{hello}} task and running {{aurora job killall}} immediately after launching 
it:
{noformat}
while true; do
  echo hello world
  sleep 600
done
{noformat}
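For anyone who wants to see how a blocking process reacts to a graceful kill outside Aurora, the following is a minimal Python 3 sketch of the generic terminate-then-escalate pattern: send SIGTERM, wait for a grace period, then fall back to SIGKILL. It is not the Thermos runner's implementation; the function name {{terminate_with_grace}} and its parameters are hypothetical and exist only for illustration.

```python
import signal
import subprocess
import time


def terminate_with_grace(proc, grace_secs, poll_secs=0.1):
    """Send SIGTERM to proc, wait up to grace_secs, then escalate to SIGKILL.

    Hypothetical helper for illustration only (not Thermos code).
    Returns "terminated" if the process exited within the grace period,
    or "killed" if it had to be force-killed.
    """
    proc.send_signal(signal.SIGTERM)
    deadline = time.monotonic() + grace_secs
    while time.monotonic() < deadline:
        # poll() returns None while the child is still running.
        if proc.poll() is not None:
            return "terminated"
        time.sleep(poll_secs)
    # Grace period expired: force the process down and reap it.
    proc.kill()
    proc.wait()
    return "killed"
```

Called on a {{subprocess.Popen}} handle with a short grace period, the helper returns {{"killed"}} for a child that ignores SIGTERM and {{"terminated"}} for one that exits promptly.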
In the Vagrant reproduction, the finalizer never gets a chance to run, and once 
the 1-minute preemption wait expires the task is forcibly killed:
{noformat}
D0316 04:00:35.237905 19362 runner.py:951] Runner issued kill: force:False, 
preemption_wait:1 mins
D0316 04:00:35.238183 19362 runner.py:567] Flipping recovery mode off.
D0316 04:00:35.238308 19362 ckpt.py:348] Flipping task state from ACTIVE to 
ACTIVE
D0316 04:00:35.238437 19362 runner.py:242] _on_task_transition: 
TaskStatus(state=0, runner_uid=0, runner_pid=19362, timestamp_ms=1458100835238)
D0316 04:00:35.239079 19362 runner.py:180] Task on_active(TaskStatus(state=0, 
runner_uid=0, runner_pid=19362, timestamp_ms=1458100835238))
D0316 04:00:35.241660 19362 ckpt.py:348] Flipping task state from ACTIVE to 
CLEANING
D0316 04:00:35.241765 19362 runner.py:242] _on_task_transition: 
TaskStatus(state=5, runner_uid=0, runner_pid=19362, timestamp_ms=1458100835241)
D0316 04:00:35.249836 19362 runner.py:188] Task on_cleaning(TaskStatus(state=5, 
runner_uid=0, runner_pid=19362, timestamp_ms=1458100835241))
D0316 04:00:35.249953 19362 helper.py:217] 
TaskRunnerHelper.terminate_process(hello)
D0316 04:00:35.256520 19362 helper.py:220]=> SIGTERM pid 19368
D0316 04:00:35.256705 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 59.9812531471
D0316 04:00:35.262578 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:00:36.263881 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:00:36.264199 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 58.9737620354
D0316 04:00:36.264734 19362 runner.py:929] Run loop: Work to be done within 1.0s
--
D0316 04:01:31.397888 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:01:31.398144 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 3.83981513977
D0316 04:01:31.398538 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:01:32.400230 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:01:32.401125 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 2.8368370533
D0316 04:01:32.401596 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:01:33.404506 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:01:33.404815 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 1.83315014839
D0316 04:01:33.405534 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:01:34.406909 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:01:34.407223 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 0.830743074417
D0316 04:01:34.407908 19362 runner.py:929] Run loop: Work to be done within 0.8s
D0316 04:01:35.415529 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:01:35.415683 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 0
D0316 04:01:35.415740 19362 runner.py:926] Run loop: No more work to be done in 
state CLEANING
D0316 04:01:35.415888 19362 runner.py:903] Forced terminal state: KILLED
D0316 04:01:35.415936 19362 ckpt.py:348] Flipping task state from CLEANING to 
KILLED
D0316 04:01:35.415980 19362 runner.py:242] _on_task_transition: 
TaskStatus(state=3, runner_uid=0, runner_pid=19362, timestamp_ms=1458100895415)
D0316 04:01:35.416937 19362 runner.py:201] Task on_killed(TaskStatus(state=3, 
runner_uid=0, runner_pid=19362, timestamp_ms=1458100895415))
D0316 04:01:35.417393 19362 runner.py:684] _set_process_status(hello <= KILLED, 
seq=3[auto])
D0316 04:01:35.417458 19362 ckpt.py:379] Running state machine for 
process=hello/seq=3
D0316 04:01:35.417460 19362 runner.py:238] _on_process_transition: 
ProcessStatus(seq=3, process=u'hello', start_time=None, coordinator_pid=None, 
pid=None, return_code=-1, state=4, stop_time=1458100895.417381, fork_time=None)
D0316 04:01:35.417853 19362 runner.py:156] Process on_killed 
ProcessStatus(seq=3, process=u'hello', start_time=None, coordinator_pid=None, 
pid=None, return_code=-1, state=4, stop_time=1458100895.417381, fork_time=None)
D0316 04:01:35.417921 19362 he

[jira] [Commented] (AURORA-1641) Shell health checker is running as root

2016-03-15 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196531#comment-15196531
 ] 

Zameer Manji commented on AURORA-1641:
--

An alternative would be to do something like what this Stack Overflow answer describes: 
http://stackoverflow.com/a/6037494/2874

> Shell health checker is running as root
> ---
>
> Key: AURORA-1641
> URL: https://issues.apache.org/jira/browse/AURORA-1641
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor, Security
>Reporter: Stephan Erb
>Priority: Blocker
>
> As the operator of an Aurora cluster, I have to guarantee that users can run 
> commands only with the privileges of their {{role}}. The new health checker 
> feature is risky in that regard, as it runs all health check commands with 
> the privileges of the Thermos runner. In most common deployments this is root.
> The Thermos runner supports various means for setting the uid/user/role that 
> is used to run user processes. The same configuration should also apply to 
> the user-defined health checking command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1641) Shell health checker is running as root

2016-03-15 Thread Dmitriy Shirchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196520#comment-15196520
 ] 

Dmitriy Shirchenko commented on AURORA-1641:


I would love to help, and I feel responsible, but I'm going on vacation on 
Sunday for a week, so I don't have time right now :/.

In the meantime, can someone give a rough outline of the required work?
One proposal I saw was from [~zmanji], who mentioned that we may need to make 
the health check runner look more like: 
https://github.com/apache/aurora/blame/d752d466c550118f052d23519d071eb41b2e5bf6/src/main/python/apache/thermos/core/process.py#L327
 







[jira] [Commented] (AURORA-1641) Shell health checker is running as root

2016-03-15 Thread Bill Farner (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196459#comment-15196459
 ] 

Bill Farner commented on AURORA-1641:
-

[~shirchen] do you have bandwidth to tackle this?






[jira] [Updated] (AURORA-1641) Shell health checker is running as root

2016-03-15 Thread Bill Farner (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Farner updated AURORA-1641:

Issue Type: Bug  (was: Story)






[jira] [Created] (AURORA-1641) Shell health checker is running as root

2016-03-15 Thread Stephan Erb (JIRA)
Stephan Erb created AURORA-1641:
---

 Summary: Shell health checker is running as root
 Key: AURORA-1641
 URL: https://issues.apache.org/jira/browse/AURORA-1641
 Project: Aurora
  Issue Type: Story
  Components: Executor, Security
Reporter: Stephan Erb
Priority: Blocker


As the operator of an Aurora cluster, I have to guarantee that users can run 
commands only with the privileges of their {{role}}. The new health checker 
feature is risky in that regard, as it runs all health check commands with the 
privileges of the Thermos runner. In most common deployments this is root.

The Thermos runner supports various means for setting the uid/user/role that is 
used to run user processes. The same configuration should also apply to the 
user-defined health checking command.
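
As a rough illustration of the requested behaviour, the sketch below runs a shell health check command under a given uid/gid instead of the caller's own. It is a minimal Python 3 example under stated assumptions, not the executor's actual code: the names {{demote}} and {{run_health_check}} are hypothetical, and how the uid/gid would be derived from the task's {{role}} is left out.

```python
import os
import subprocess


def demote(uid, gid):
    """Return a callable that drops to gid/uid; intended for preexec_fn.

    Hypothetical helper for illustration only.
    """
    def apply():
        # The group must be changed first: once the uid has been dropped,
        # the process may no longer be allowed to change its gid.
        # Supplementary groups are not touched in this sketch.
        os.setgid(gid)
        os.setuid(uid)
    return apply


def run_health_check(command, uid, gid, timeout_secs=5.0):
    """Run a shell command as uid/gid; return True if it exits with status 0."""
    result = subprocess.run(
        command,
        shell=True,
        preexec_fn=demote(uid, gid),  # runs in the child, before exec
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout_secs,  # raises subprocess.TimeoutExpired on overrun
    )
    return result.returncode == 0
```

Note that {{preexec_fn}} is not safe to use in multi-threaded programs; on Python 3.9 and later, the {{user}} and {{group}} arguments of {{subprocess.run}} are an alternative way to achieve the same effect.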






[jira] [Created] (AURORA-1640) Write enduser documentation for the Unified Containerizer support

2016-03-15 Thread Stephan Erb (JIRA)
Stephan Erb created AURORA-1640:
---

 Summary: Write enduser documentation for the Unified Containerizer 
support
 Key: AURORA-1640
 URL: https://issues.apache.org/jira/browse/AURORA-1640
 Project: Aurora
  Issue Type: Story
  Components: Documentation
Reporter: Stephan Erb


We have to document the Unified Containerizer feature so that it is easy for 
users and operators to adopt it. 

Ideally, we cover:

* how to configure the Aurora scheduler
* links to the relevant Mesos documentation
* an example showing a working Aurora spec that can be run within our vagrant 
environment





[jira] [Created] (AURORA-1639) Update client to allow configuring tasks with images.

2016-03-15 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1639:


 Summary: Update client to allow configuring tasks with images.
 Key: AURORA-1639
 URL: https://issues.apache.org/jira/browse/AURORA-1639
 Project: Aurora
  Issue Type: Task
Reporter: Joshua Cohen








[jira] [Created] (AURORA-1638) Update MesosTaskFactory to send tasks with images configured to use the unified containerizer

2016-03-15 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1638:


 Summary: Update MesosTaskFactory to send tasks with images 
configured to use the unified containerizer
 Key: AURORA-1638
 URL: https://issues.apache.org/jira/browse/AURORA-1638
 Project: Aurora
  Issue Type: Task
Reporter: Joshua Cohen








[jira] [Created] (AURORA-1637) Update Executor to support launching tasks with images.

2016-03-15 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1637:


 Summary: Update Executor to support launching tasks with images.
 Key: AURORA-1637
 URL: https://issues.apache.org/jira/browse/AURORA-1637
 Project: Aurora
  Issue Type: Task
Reporter: Joshua Cohen


We should also investigate whether it's possible to launch tasks that are 
configured with images but no processes without an executor, relying instead 
on the image's entrypoint.





[jira] [Created] (AURORA-1636) Update Scheduler to accept tasks with images

2016-03-15 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1636:


 Summary: Update Scheduler to accept tasks with images
 Key: AURORA-1636
 URL: https://issues.apache.org/jira/browse/AURORA-1636
 Project: Aurora
  Issue Type: Task
Reporter: Joshua Cohen


This will entail updating the thrift definitions and plumbing those changes 
where necessary.





[jira] [Created] (AURORA-1635) Update Scheduler storage to support storing images

2016-03-15 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1635:


 Summary: Update Scheduler storage to support storing images
 Key: AURORA-1635
 URL: https://issues.apache.org/jira/browse/AURORA-1635
 Project: Aurora
  Issue Type: Task
Reporter: Joshua Cohen


As part of the work to support the Mesos unified containerizer, we'll need to 
store images configured on tasks.





[jira] [Created] (AURORA-1634) Support launching tasks using the Mesos unified containerizer

2016-03-15 Thread Joshua Cohen (JIRA)
Joshua Cohen created AURORA-1634:


 Summary: Support launching tasks using the Mesos unified 
containerizer
 Key: AURORA-1634
 URL: https://issues.apache.org/jira/browse/AURORA-1634
 Project: Aurora
  Issue Type: Epic
Reporter: Joshua Cohen


https://docs.google.com/document/d/111T09NBF2zjjl7HE95xglsDpRdKoZqhCRM5hHmOfTLA/





[jira] [Updated] (AURORA-1634) Support launching tasks using the Mesos unified containerizer

2016-03-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1634:
-
Component/s: Scheduler

> Support launching tasks using the Mesos unified containerizer
> -
>
> Key: AURORA-1634
> URL: https://issues.apache.org/jira/browse/AURORA-1634
> Project: Aurora
>  Issue Type: Epic
>  Components: Client, Executor, Scheduler
>Reporter: Joshua Cohen
>
> https://docs.google.com/document/d/111T09NBF2zjjl7HE95xglsDpRdKoZqhCRM5hHmOfTLA/





[jira] [Updated] (AURORA-1634) Support launching tasks using the Mesos unified containerizer

2016-03-15 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1634:
-
Component/s: Executor
 Client



