[jira] [Created] (AURORA-1642) Thermos runner finalization is broken
Maxim Khutornenko created AURORA-1642:
--------------------------------------

Summary: Thermos runner finalization is broken
Key: AURORA-1642
URL: https://issues.apache.org/jira/browse/AURORA-1642
Project: Aurora
Issue Type: Bug
Components: Executor
Reporter: Maxim Khutornenko

We have noticed that Thermos runner finalization no longer works for tasks with blocking threads after this commit [024bac9dcb8f37e4b31210e3a0a7aea2345a16ab|https://reviews.apache.org/r/40922/].

I was able to reproduce it in Vagrant by extending the sleep timeout of the {{hello}} task and running {{aurora job killall}} immediately after launching it:
{noformat}
while true; do
  echo hello world
  sleep 600
done
{noformat}

The finalizer never has a chance to run, and after 1 minute the task is forcefully aborted:
{noformat}
D0316 04:00:35.237905 19362 runner.py:951] Runner issued kill: force:False, preemption_wait:1 mins
D0316 04:00:35.238183 19362 runner.py:567] Flipping recovery mode off.
D0316 04:00:35.238308 19362 ckpt.py:348] Flipping task state from ACTIVE to ACTIVE
D0316 04:00:35.238437 19362 runner.py:242] _on_task_transition: TaskStatus(state=0, runner_uid=0, runner_pid=19362, timestamp_ms=1458100835238)
D0316 04:00:35.239079 19362 runner.py:180] Task on_active(TaskStatus(state=0, runner_uid=0, runner_pid=19362, timestamp_ms=1458100835238))
D0316 04:00:35.241660 19362 ckpt.py:348] Flipping task state from ACTIVE to CLEANING
D0316 04:00:35.241765 19362 runner.py:242] _on_task_transition: TaskStatus(state=5, runner_uid=0, runner_pid=19362, timestamp_ms=1458100835241)
D0316 04:00:35.249836 19362 runner.py:188] Task on_cleaning(TaskStatus(state=5, runner_uid=0, runner_pid=19362, timestamp_ms=1458100835241))
D0316 04:00:35.249953 19362 helper.py:217] TaskRunnerHelper.terminate_process(hello)
D0316 04:00:35.256520 19362 helper.py:220] => SIGTERM pid 19368
D0316 04:00:35.256705 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 59.9812531471
D0316 04:00:35.262578 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:00:36.263881 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
D0316 04:00:36.264199 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 58.9737620354
D0316 04:00:36.264734 19362 runner.py:929] Run loop: Work to be done within 1.0s
--
D0316 04:01:31.397888 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
D0316 04:01:31.398144 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 3.83981513977
D0316 04:01:31.398538 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:01:32.400230 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
D0316 04:01:32.401125 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 2.8368370533
D0316 04:01:32.401596 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:01:33.404506 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
D0316 04:01:33.404815 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 1.83315014839
D0316 04:01:33.405534 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:01:34.406909 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
D0316 04:01:34.407223 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 0.830743074417
D0316 04:01:34.407908 19362 runner.py:929] Run loop: Work to be done within 0.8s
D0316 04:01:35.415529 19362 runner.py:939] Run loop: No updates collected, touching checkpoint.
D0316 04:01:35.415683 19362 runner.py:327] TaskRunnerStage[CLEANING]: Finalization remaining: 0
D0316 04:01:35.415740 19362 runner.py:926] Run loop: No more work to be done in state CLEANING
D0316 04:01:35.415888 19362 runner.py:903] Forced terminal state: KILLED
D0316 04:01:35.415936 19362 ckpt.py:348] Flipping task state from CLEANING to KILLED
D0316 04:01:35.415980 19362 runner.py:242] _on_task_transition: TaskStatus(state=3, runner_uid=0, runner_pid=19362, timestamp_ms=1458100895415)
D0316 04:01:35.416937 19362 runner.py:201] Task on_killed(TaskStatus(state=3, runner_uid=0, runner_pid=19362, timestamp_ms=1458100895415))
D0316 04:01:35.417393 19362 runner.py:684] _set_process_status(hello <= KILLED, seq=3[auto])
D0316 04:01:35.417458 19362 ckpt.py:379] Running state machine for process=hello/seq=3
D0316 04:01:35.417460 19362 runner.py:238] _on_process_transition: ProcessStatus(seq=3, process=u'hello', start_time=None, coordinator_pid=None, pid=None, return_code=-1, state=4, stop_time=1458100895.417381, fork_time=None)
D0316 04:01:35.417853 19362 runner.py:156] Process on_killed ProcessStatus(seq=3, process=u'hello', start_time=None, coordinator_pid=None, pid=None, return_code=-1, state=4, stop_time=1458100895.417381, fork_time=None)
D0316 04:01:35.417921 19362 he
{noformat}
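
The log above follows a common escalation pattern: SIGTERM is sent, the runner counts down a finalization window, and the task is then forced into KILLED. Below is a minimal Python sketch of that generic pattern. It is not the Thermos implementation; `terminate_with_deadline`, `wait_secs`, and `poll_interval` are hypothetical names chosen for illustration.

```python
import signal
import subprocess
import sys
import time


def terminate_with_deadline(proc, wait_secs, poll_interval=0.1):
    """Send SIGTERM, wait up to wait_secs for the process to exit, then SIGKILL.

    Hypothetical helper for illustration only (not part of Thermos).
    Returns "terminated" if the process exited within the window,
    or "killed" if it had to be force-killed.
    """
    proc.send_signal(signal.SIGTERM)
    deadline = time.monotonic() + wait_secs
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return "terminated"
        time.sleep(poll_interval)
    # One last check so an exit during the final sleep is not misreported.
    if proc.poll() is not None:
        return "terminated"
    proc.kill()
    proc.wait()
    return "killed"


if __name__ == "__main__":
    # A child that ignores SIGTERM, standing in for a task that never finalizes.
    child_code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(600)\n"
    )
    child = subprocess.Popen(
        [sys.executable, "-c", child_code], stdout=subprocess.PIPE
    )
    child.stdout.readline()  # block until the child has installed its handler
    print(terminate_with_deadline(child, wait_secs=1.0))
```

In the demo the child ignores SIGTERM, so the helper falls through to the forced kill once the window expires, which mirrors the "Forced terminal state: KILLED" line in the log.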
[jira] [Commented] (AURORA-1641) Shell health checker is running as root
[ https://issues.apache.org/jira/browse/AURORA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196531#comment-15196531 ]

Zameer Manji commented on AURORA-1641:
--------------------------------------

An alternative would be to do something like in this StackOverflow answer: http://stackoverflow.com/a/6037494/2874

> Shell health checker is running as root
> ---------------------------------------
>
> Key: AURORA-1641
> URL: https://issues.apache.org/jira/browse/AURORA-1641
> Project: Aurora
> Issue Type: Bug
> Components: Executor, Security
> Reporter: Stephan Erb
> Priority: Blocker
>
> As the operator of an Aurora cluster, I have to guarantee that users can run
> commands only with the privileges of their {{role}}. The new health checker
> feature is risky in that regard, as it runs all health check commands with
> the privileges of the Thermos runner. In most common deployments this is root.
> The Thermos runner supports various means for setting the uid/user/role that
> is used to run user processes. The same configuration should also apply to
> the user-defined health checking command.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1641) Shell health checker is running as root
[ https://issues.apache.org/jira/browse/AURORA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196520#comment-15196520 ]

Dmitriy Shirchenko commented on AURORA-1641:
--------------------------------------------

I would love to help and feel responsible, but I'm going on vacation on Sunday for a week, so I don't have time right now :/. In the meantime, can someone give a rough outline of the required work? One proposal I saw was by [~zmanji], who mentioned that we may need to make the health check runner look more like: https://github.com/apache/aurora/blame/d752d466c550118f052d23519d071eb41b2e5bf6/src/main/python/apache/thermos/core/process.py#L327
[jira] [Commented] (AURORA-1641) Shell health checker is running as root
[ https://issues.apache.org/jira/browse/AURORA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196459#comment-15196459 ]

Bill Farner commented on AURORA-1641:
-------------------------------------

[~shirchen] do you have bandwidth to tackle this?
[jira] [Updated] (AURORA-1641) Shell health checker is running as root
[ https://issues.apache.org/jira/browse/AURORA-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Farner updated AURORA-1641:
-------------------------------

Issue Type: Bug (was: Story)
[jira] [Created] (AURORA-1641) Shell health checker is running as root
Stephan Erb created AURORA-1641:
--------------------------------

Summary: Shell health checker is running as root
Key: AURORA-1641
URL: https://issues.apache.org/jira/browse/AURORA-1641
Project: Aurora
Issue Type: Story
Components: Executor, Security
Reporter: Stephan Erb
Priority: Blocker

As the operator of an Aurora cluster, I have to guarantee that users can run commands only with the privileges of their {{role}}. The new health checker feature is risky in that regard, as it runs all health check commands with the privileges of the Thermos runner. In most common deployments this is root.

The Thermos runner supports various means for setting the uid/user/role that is used to run user processes. The same configuration should also apply to the user-defined health checking command.
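
One way to meet the requirement described above, sketched here under stated assumptions: run the check in a child process that switches to the role's uid/gid before the command executes. `demote` and `run_health_check` are hypothetical names, not Aurora or Thermos APIs, and the mapping from a role name to numeric ids is left out. The sketch relies only on the standard `preexec_fn` hook of Python's `subprocess` module.

```python
import os
import subprocess


def demote(uid, gid):
    """Build a preexec_fn that switches the child to the given uid/gid.

    Hypothetical helper for illustration; not an Aurora/Thermos function.
    """
    def apply():
        # Resetting supplementary groups needs privileges, so only attempt
        # it when the parent (for example a runner started as root) has them.
        if os.geteuid() == 0:
            os.setgroups([gid])
        # Drop the group before the user: after setuid() to a non-root
        # user the process may no longer be allowed to call setgid().
        os.setgid(gid)
        os.setuid(uid)
    return apply


def run_health_check(command, uid, gid, timeout_secs=5.0):
    """Run a shell command as uid/gid; return True only if it exits with 0."""
    try:
        result = subprocess.run(
            ["/bin/sh", "-c", command],
            preexec_fn=demote(uid, gid),
            timeout=timeout_secs,
        )
    except subprocess.TimeoutExpired:
        # A check that does not finish in time is treated as unhealthy.
        return False
    return result.returncode == 0
```

A caller running as root would pass the uid and gid resolved for the job's role, for example via `pwd.getpwnam(role)`; an unprivileged caller can only "switch" to its own ids.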
[jira] [Created] (AURORA-1640) Write enduser documentation for the Unified Containerizer support
Stephan Erb created AURORA-1640:
--------------------------------

Summary: Write enduser documentation for the Unified Containerizer support
Key: AURORA-1640
URL: https://issues.apache.org/jira/browse/AURORA-1640
Project: Aurora
Issue Type: Story
Components: Documentation
Reporter: Stephan Erb

We have to document the Unified Containerizer feature so that it is easy for users and operators to adopt it. Ideally, we cover:
* how to configure the Aurora scheduler
* links to the relevant Mesos documentation
* an example showing a working Aurora spec that can be run within our Vagrant environment
[jira] [Created] (AURORA-1639) Update client to allow configuring tasks with images.
Joshua Cohen created AURORA-1639:
---------------------------------

Summary: Update client to allow configuring tasks with images.
Key: AURORA-1639
URL: https://issues.apache.org/jira/browse/AURORA-1639
Project: Aurora
Issue Type: Task
Reporter: Joshua Cohen
[jira] [Created] (AURORA-1638) Update MesosTaskFactory to send tasks with images configured to use the unified containerizer
Joshua Cohen created AURORA-1638:
---------------------------------

Summary: Update MesosTaskFactory to send tasks with images configured to use the unified containerizer
Key: AURORA-1638
URL: https://issues.apache.org/jira/browse/AURORA-1638
Project: Aurora
Issue Type: Task
Reporter: Joshua Cohen
[jira] [Created] (AURORA-1637) Update Executor to support launching tasks with images.
Joshua Cohen created AURORA-1637:
---------------------------------

Summary: Update Executor to support launching tasks with images.
Key: AURORA-1637
URL: https://issues.apache.org/jira/browse/AURORA-1637
Project: Aurora
Issue Type: Task
Reporter: Joshua Cohen

We should also investigate whether it's possible to launch tasks that are configured with an image but no processes without an executor, relying on the image's entrypoint instead.
[jira] [Created] (AURORA-1636) Update Scheduler to accept tasks with images
Joshua Cohen created AURORA-1636:
---------------------------------

Summary: Update Scheduler to accept tasks with images
Key: AURORA-1636
URL: https://issues.apache.org/jira/browse/AURORA-1636
Project: Aurora
Issue Type: Task
Reporter: Joshua Cohen

This will entail updating the Thrift definitions and plumbing those changes through where necessary.
[jira] [Created] (AURORA-1635) Update Scheduler storage to support storing images
Joshua Cohen created AURORA-1635:
---------------------------------

Summary: Update Scheduler storage to support storing images
Key: AURORA-1635
URL: https://issues.apache.org/jira/browse/AURORA-1635
Project: Aurora
Issue Type: Task
Reporter: Joshua Cohen

As part of the work to support the Mesos unified containerizer, we'll need to store images configured on tasks.
[jira] [Created] (AURORA-1634) Support launching tasks using the Mesos unified containerizer
Joshua Cohen created AURORA-1634:
---------------------------------

Summary: Support launching tasks using the Mesos unified containerizer
Key: AURORA-1634
URL: https://issues.apache.org/jira/browse/AURORA-1634
Project: Aurora
Issue Type: Epic
Reporter: Joshua Cohen

https://docs.google.com/document/d/111T09NBF2zjjl7HE95xglsDpRdKoZqhCRM5hHmOfTLA/
[jira] [Updated] (AURORA-1634) Support launching tasks using the Mesos unified containerizer
[ https://issues.apache.org/jira/browse/AURORA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joshua Cohen updated AURORA-1634:
--------------------------------

Component/s: Scheduler
[jira] [Updated] (AURORA-1634) Support launching tasks using the Mesos unified containerizer
[ https://issues.apache.org/jira/browse/AURORA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joshua Cohen updated AURORA-1634:
--------------------------------

Component/s: Executor, Client