[jira] [Resolved] (AURORA-1798) resolv.conf is not copied when using the Mesos containerizer with a Docker image

2016-10-20 Thread Justin Pinkul (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Pinkul resolved AURORA-1798.
---
Resolution: Fixed

> resolv.conf is not copied when using the Mesos containerizer with a Docker 
> image
> 
>
> Key: AURORA-1798
> URL: https://issues.apache.org/jira/browse/AURORA-1798
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
> Fix For: 0.17.0
>
>
> When Thermos launches a task using a Docker image, it mounts the image as a 
> volume and manually chroots into it. One consequence of this is that the 
> logic inside the {{network/cni}} isolator that copies {{resolv.conf}} from 
> the host into the new rootfs is bypassed. The Thermos executor should 
> manually copy this file into the rootfs until Mesos pod support is 
> implemented.
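
A minimal sketch of the workaround described above, assuming the executor 
knows the absolute path of the task's rootfs (the helper name and call site 
are illustrative, not the actual Thermos change):

{code:title=copy_resolv_conf.py (illustrative)}
import os
import shutil


def copy_resolv_conf(task_rootfs):
  """Copy the host's resolv.conf into the task rootfs so DNS resolution
  works inside the chroot. Hypothetical helper, not the shipped fix."""
  source = '/etc/resolv.conf'
  target = os.path.join(task_rootfs, 'etc', 'resolv.conf')
  if not os.path.isfile(source):
    return  # Host has no resolv.conf; nothing to copy.
  target_dir = os.path.dirname(target)
  if not os.path.isdir(target_dir):
    os.makedirs(target_dir)
  shutil.copy2(source, target)
{code}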



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1798) resolv.conf is not copied when using the Mesos containerizer with a Docker image

2016-10-20 Thread Justin Pinkul (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Pinkul updated AURORA-1798:
--
Fix Version/s: 0.17.0

> resolv.conf is not copied when using the Mesos containerizer with a Docker 
> image
> 
>
> Key: AURORA-1798
> URL: https://issues.apache.org/jira/browse/AURORA-1798
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
> Fix For: 0.17.0
>
>
> When Thermos launches a task using a Docker image, it mounts the image as a 
> volume and manually chroots into it. One consequence of this is that the 
> logic inside the {{network/cni}} isolator that copies {{resolv.conf}} from 
> the host into the new rootfs is bypassed. The Thermos executor should 
> manually copy this file into the rootfs until Mesos pod support is 
> implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1798) resolv.conf is not copied when using the Mesos containerizer with a Docker image

2016-10-18 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587070#comment-15587070
 ] 

Justin Pinkul commented on AURORA-1798:
---

Review: https://reviews.apache.org/r/53003/

> resolv.conf is not copied when using the Mesos containerizer with a Docker 
> image
> 
>
> Key: AURORA-1798
> URL: https://issues.apache.org/jira/browse/AURORA-1798
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
>
> When Thermos launches a task using a Docker image, it mounts the image as a 
> volume and manually chroots into it. One consequence of this is that the 
> logic inside the {{network/cni}} isolator that copies {{resolv.conf}} from 
> the host into the new rootfs is bypassed. The Thermos executor should 
> manually copy this file into the rootfs until Mesos pod support is 
> implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1798) resolv.conf is not copied when using the Mesos containerizer with a Docker image

2016-10-18 Thread Justin Pinkul (JIRA)
Justin Pinkul created AURORA-1798:
-

 Summary: resolv.conf is not copied when using the Mesos 
containerizer with a Docker image
 Key: AURORA-1798
 URL: https://issues.apache.org/jira/browse/AURORA-1798
 Project: Aurora
  Issue Type: Bug
  Components: Executor
Affects Versions: 0.16.0
Reporter: Justin Pinkul
Assignee: Justin Pinkul


When Thermos launches a task using a Docker image, it mounts the image as a 
volume and manually chroots into it. One consequence of this is that the logic 
inside the {{network/cni}} isolator that copies {{resolv.conf}} from the host 
into the new rootfs is bypassed. The Thermos executor should manually copy 
this file into the rootfs until Mesos pod support is implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1789) namespaces/pid isolator causes lost process

2016-10-12 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569686#comment-15569686
 ] 

Justin Pinkul commented on AURORA-1789:
---

This review catches this error and throws a useful message. 
https://reviews.apache.org/r/52804/

> namespaces/pid isolator causes lost process
> ---
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Zameer Manji
>
> When using the Mesos containerizer with namespaces/pid isolator and a Docker 
> image the Thermos executor is unable to launch processes. The executor tries 
> to fork the process then is unable to locate the process after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1789) namespaces/pid isolator causes lost process

2016-10-11 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567120#comment-15567120
 ] 

Justin Pinkul commented on AURORA-1789:
---

I figured it out: the problem was that {{--mesos_containerizer_path}} was not 
set correctly. It would be nice if there were a check for this with a helpful 
error message.
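
A check along these lines would surface the misconfiguration immediately; 
this is only an illustrative sketch (the flag name comes from the comment 
above, the helper and error wording are hypothetical, not the code from the 
review):

{code:title=validate_containerizer_path.py (illustrative)}
import os


def validate_mesos_containerizer_path(path):
  """Fail fast with a clear error when --mesos_containerizer_path is wrong,
  instead of silently losing every forked process."""
  if not path or not os.path.isfile(path):
    raise ValueError(
        'mesos_containerizer_path %r does not point to an existing file; '
        'forked processes will be reported as LOST.' % path)
  if not os.access(path, os.X_OK):
    raise ValueError('mesos_containerizer_path %r is not executable.' % path)
{code}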

> namespaces/pid isolator causes lost process
> ---
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Zameer Manji
>
> When using the Mesos containerizer with namespaces/pid isolator and a Docker 
> image the Thermos executor is unable to launch processes. The executor tries 
> to fork the process then is unable to locate the process after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1789) namespaces/pid isolator causes lost process

2016-10-07 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1649#comment-1649
 ] 

Justin Pinkul commented on AURORA-1789:
---

The PID isolator might be a red herring; I'm seeing the same issue with it 
turned off. I'm going to continue comparing this to a working cluster to see 
what other differences there might be.

> namespaces/pid isolator causes lost process
> ---
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Zameer Manji
>
> When using the Mesos containerizer with namespaces/pid isolator and a Docker 
> image the Thermos executor is unable to launch processes. The executor tries 
> to fork the process then is unable to locate the process after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1789) namespaces/pid isolator causes lost process

2016-10-06 Thread Justin Pinkul (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Pinkul updated AURORA-1789:
--
Assignee: Zameer Manji  (was: Justin Pinkul)

> namespaces/pid isolator causes lost process
> ---
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Zameer Manji
>
> When using the Mesos containerizer with namespaces/pid isolator and a Docker 
> image the Thermos executor is unable to launch processes. The executor tries 
> to fork the process then is unable to locate the process after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1789) namespaces/pid isolator causes lost process

2016-10-06 Thread Justin Pinkul (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Pinkul reassigned AURORA-1789:
-

Assignee: Justin Pinkul

> namespaces/pid isolator causes lost process
> ---
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
>
> When using the Mesos containerizer with namespaces/pid isolator and a Docker 
> image the Thermos executor is unable to launch processes. The executor tries 
> to fork the process then is unable to locate the process after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1789) namespaces/pid isolator causes lost process

2016-10-06 Thread Justin Pinkul (JIRA)
Justin Pinkul created AURORA-1789:
-

 Summary: namespaces/pid isolator causes lost process
 Key: AURORA-1789
 URL: https://issues.apache.org/jira/browse/AURORA-1789
 Project: Aurora
  Issue Type: Bug
  Components: Executor
Affects Versions: 0.16.0
Reporter: Justin Pinkul


When using the Mesos containerizer with namespaces/pid isolator and a Docker 
image the Thermos executor is unable to launch processes. The executor tries to 
fork the process then is unable to locate the process after the fork.

{code:title=thermos_runner.INFO}
I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
fork_time=1475789782.842882)
I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
1144] completed.
I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
abnormal termination
I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
fork_time=1475789842.935872)
I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
1157] completed.
I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
abnormal termination
I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
fork_time=1475789903.029694)
I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
1170] completed.
I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
abnormal termination
I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
fork_time=1475789963.123206)
I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
1183] completed.
I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
abnormal termination
I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
fork_time=1475790023.21709)
I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
1196] completed.
I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
abnormal termination
I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
fork_time=1475790083.311512)
I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
1209] completed.
I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
abnormal termination
{code}
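
The one-minute gap between each {{Forking Process(...)}} line and the 
corresponding {{Detected a LOST task}} line suggests a timeout-driven check: 
the coordinator is forked, no child pid is ever checkpointed, and after the 
timeout the process is declared lost. A simplified illustration of that 
pattern, not the actual Thermos runner logic:

{code:title=lost_process_timeout.py (illustrative)}
import time


def wait_for_forked_pid(read_checkpointed_pid, timeout_secs=60, poll_secs=1):
  """Poll a checkpoint source for the forked child's pid and give up after
  a timeout. Illustrative only; the real runner tracks this through
  ProcessStatus records, as shown in the log above."""
  deadline = time.time() + timeout_secs
  while time.time() < deadline:
    pid = read_checkpointed_pid()  # Returns None until the fork is recorded.
    if pid is not None:
      return pid
    time.sleep(poll_secs)
  return None  # Caller treats None as a LOST process and retries the fork.
{code}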



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468217#comment-15468217
 ] 

Justin Pinkul commented on AURORA-1763:
---

It looks like this might vary based on the Linux distribution. On Debian they 
are under {{/run/mesos/isolators/gpu/xxx}}.

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs 
> systemd-1 rw,fd=22,

[jira] [Comment Edited] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468170#comment-15468170
 ] 

Justin Pinkul edited comment on AURORA-1763 at 9/6/16 6:42 PM:
---

Including the GPU drivers in the Docker image could work as a temporary 
workaround but falls apart incredibly fast. The problem here is that the 
driver is specific to the GPU and, I think, even the kernel version. This 
means that as soon as we have more than one type of GPU this approach totally 
falls apart. In our particular use case we expect to hit this scenario in 
around a month.

Binding the GPU volume into the task's rootfs will work. However, as the 
executor is implemented right now, determining the location to mount from 
would be very fragile. Since the executor is not launched with a different 
rootfs, Mesos does not mount the drivers for the executor. This means that the 
executor would have to find the drivers itself, in this case under 
{{/run/mesos/isolators/gpu/nvidia_352.39}}. This sounds very fragile, 
especially since the path includes the driver version.

I think a better approach would be to set the rootfs for the executor so that 
Mesos mounts the drivers to {{/usr/local/nvidia}}. Then when the executor 
starts the task it can pass this mount on without duplicating the logic Mesos 
uses to find the drivers.


was (Author: jpinkul):
Including the GPU drivers in the Docker image could work as a temporary work 
around but falls apart incredibly fast. The problem here is that the driver is 
specific to the GPU and, I think, even the kernel version. This means that as 
soon as we have more than one type of GPU this totally falls part. In our 
particular use case we expect to hit this scenario in around a month.

Binding the GPU volume into the task's rootfs will work. However as the 
executor is implemented right now determining the location to mount from will 
be very fragile. Since the executor is not launched with a different rootfs 
Mesos does not mount the driver's for the executor. This means that the 
executor would have to find the drivers itself, in this case from 
{{/run/mesos/isolators/gpu/nvidia_352.39}}. This sounds very fragile, 
especially since the path includes the driver version.

I think a better approach would be to set the rootfs for the executor so Mesos 
mounts the drivers to {{/usr/loca/nvidia}}. Then when the executor starts the 
task it can pass this mount on without duplicating the logic Mesos uses to find 
the drivers.
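
A rough sketch of the pass-through idea described above: once Mesos has 
mounted the driver volume into the executor's filesystem at 
{{/usr/local/nvidia}}, the executor could bind-mount that same directory into 
the task's rootfs instead of re-deriving the versioned path under 
{{/run/mesos/isolators/gpu}}. The helper below is illustrative only and 
simply mirrors the mount command the GPU isolator already uses; it is not the 
actual executor code.

{code:title=bind_nvidia_volume.py (illustrative)}
import os
import subprocess


def bind_nvidia_volume(task_rootfs, driver_dir='/usr/local/nvidia'):
  """Re-bind the driver volume Mesos mounted for the executor into the
  task rootfs. Illustrative sketch, not the real implementation."""
  if not os.path.isdir(driver_dir):
    return  # No GPU drivers were mounted for this executor.
  target = os.path.join(task_rootfs, driver_dir.lstrip('/'))
  if not os.path.isdir(target):
    os.makedirs(target)
  subprocess.check_call(
      ['mount', '--no-mtab', '--rbind', '--read-only', driver_dir, target])
{code}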

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,si

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468170#comment-15468170
 ] 

Justin Pinkul commented on AURORA-1763:
---

Including the GPU drivers in the Docker image could work as a temporary 
workaround but falls apart incredibly fast. The problem here is that the 
driver is specific to the GPU and, I think, even the kernel version. This 
means that as soon as we have more than one type of GPU this approach totally 
falls apart. In our particular use case we expect to hit this scenario in 
around a month.

Binding the GPU volume into the task's rootfs will work. However, as the 
executor is implemented right now, determining the location to mount from 
would be very fragile. Since the executor is not launched with a different 
rootfs, Mesos does not mount the drivers for the executor. This means that the 
executor would have to find the drivers itself, in this case under 
{{/run/mesos/isolators/gpu/nvidia_352.39}}. This sounds very fragile, 
especially since the path includes the driver version.

I think a better approach would be to set the rootfs for the executor so that 
Mesos mounts the drivers to {{/usr/local/nvidia}}. Then when the executor 
starts the task it can pass this mount on without duplicating the logic Mesos 
uses to find the drivers.

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/f

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467868#comment-15467868
 ] 

Justin Pinkul commented on AURORA-1763:
---

It looks like these lines of code add the mount command to the 
ContainerLaunchInfo inside the GPU isolator:

{code:title=mesos/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp:323|borderStyle=solid}
launchInfo.add_pre_exec_commands()->set_value(
  "mount --no-mtab --rbind --read-only " +
  volume.HOST_PATH() + " " + target);
{code}

I'm not sure where this mount is being lost.
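
One way to narrow down where the mount is lost is to compare 
{{/proc/<pid>/mountinfo}} for the executor and for the task, as in the 
listings above. A small debugging helper, not part of Aurora or Mesos:

{code:title=check_nvidia_mount.py (debugging aid)}
def has_nvidia_mount(pid):
  """Return True if /usr/local/nvidia shows up in the given process's
  mount table (debugging aid only)."""
  with open('/proc/%d/mountinfo' % pid) as mounts:
    return any('/usr/local/nvidia' in line for line in mounts)


# Example: compare the executor's view with the coordinator's view.
# print(has_nvidia_mount(executor_pid), has_nvidia_mount(coordinator_pid))
{code}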

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> 140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> 72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw

[jira] [Updated] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Pinkul updated AURORA-1763:
--
Description: 
When launching a GPU job that uses a Docker image and the unified containerizer 
the Nvidia drivers are not correctly mounted. As an experiment I launched a 
task using both mesos-execute and Aurora using the same Docker image and ran 
`nvidia-smi`. During the experiment I noticed that the /usr/local/nvidia folder 
was not being mounted properly. To confirm this was the issue I tar'ed the 
drivers up (/run/mesos/isolators/gpu/nvidia_352.39) and manually added it to 
the Docker image. When this was done the task was able to launch correctly.

Here is the resulting mountinfo for the mesos-execute task. Notice how 
/usr/local/nvidia is mounted from the /mesos directory.

140 102 8:17 
/mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
141 140 8:17 
/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
rw,mode=600,ptmxmode=666
148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw

Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
missing.

72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
rw,errors=remount-ro,data=ordered
73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
rw,size=10240k,nr_inodes=16521649,mode=755
74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
rw,gid=5,mode=620,ptmxmode=000
75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
rw,size=26438160k,mode=755
79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
rw,size=5120k
80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
securityfs securityfs rw
84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
ro,mode=755
85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - 
cgroup cgroup 
rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - 
cgroup cgroup rw,cpuset
87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
master:14 - cgroup cgroup rw,cpu,cpuacct
88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - 
cgroup cgroup rw,devices
89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - 
cgroup cgroup rw,freezer
90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
master:17 - cgroup cgroup rw,net_cls,net_prio
91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
cgroup cgroup rw,blkio
92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
master:19 - cgroup cgroup rw,perf_event
93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore 
pstore rw
94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 
rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc 
binfmt_misc rw
98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
99 98 8:17 
/mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d
 
/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs
 rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
100 99

[jira] [Updated] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Pinkul updated AURORA-1763:
--
Description: 
When launching a GPU job that uses a Docker image and the unified containerizer 
the Nvidia drivers are not correctly mounted. As an experiment I launched a 
task using both mesos-execute and Aurora using the same Docker image and ran 
nvidia-smi. During the experiment I noticed that the /usr/local/nvidia folder 
was not being mounted properly. To confirm this was the issue I tar'ed the 
drivers up (/run/mesos/isolators/gpu/nvidia_352.39) and manually added it to 
the Docker image. When this was done the task was able to launch correctly.

Here is the resulting mountinfo for the mesos-execute task. Notice how 
/usr/local/nvidia is mounted from the /mesos directory.

140 102 8:17 
/mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
141 140 8:17 
/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
rw,mode=600,ptmxmode=666
148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw

Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
missing.

72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
rw,errors=remount-ro,data=ordered
73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
rw,size=10240k,nr_inodes=16521649,mode=755
74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
rw,gid=5,mode=620,ptmxmode=000
75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
rw,size=26438160k,mode=755
79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
rw,size=5120k
80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
securityfs securityfs rw
84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
ro,mode=755
85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - 
cgroup cgroup 
rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - 
cgroup cgroup rw,cpuset
87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
master:14 - cgroup cgroup rw,cpu,cpuacct
88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - 
cgroup cgroup rw,devices
89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - 
cgroup cgroup rw,freezer
90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
master:17 - cgroup cgroup rw,net_cls,net_prio
91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
cgroup cgroup rw,blkio
92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
master:19 - cgroup cgroup rw,perf_event
93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore 
pstore rw
94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 
rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc 
binfmt_misc rw
98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
99 98 8:17 
/mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d
 
/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs
 rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
100 99 8

[jira] [Created] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)
Justin Pinkul created AURORA-1763:
-

 Summary: GPU drivers are missing when using a Docker image
 Key: AURORA-1763
 URL: https://issues.apache.org/jira/browse/AURORA-1763
 Project: Aurora
  Issue Type: Bug
  Components: Executor
Affects Versions: 0.16.0
Reporter: Justin Pinkul


When launching a GPU job that uses a Docker image with the unified 
containerizer, the Nvidia drivers are not mounted correctly. As an experiment I 
launched a task with the same Docker image through both mesos-execute and 
Aurora and ran nvidia-smi. The /usr/local/nvidia folder was not being mounted 
in the Aurora-launched task. To confirm this was the issue I tarred up the 
drivers (/run/mesos/isolators/gpu/nvidia_352.39) and manually added them to the 
Docker image; with that change the task launched correctly.
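As a rough, hypothetical illustration of that workaround in Python (the helper 
below is not part of Aurora or Thermos; it simply copies the host driver tree 
into a task root filesystem, assuming the taskfs path is known):

{code:python}
# Illustrative only: make the host Nvidia driver tree visible inside a task
# rootfs before the task starts. Paths are taken from this report's
# environment; the helper name is hypothetical.

import os
import shutil

HOST_DRIVER_DIR = "/run/mesos/isolators/gpu/nvidia_352.39"  # host-side drivers
TASK_MOUNT_POINT = "usr/local/nvidia"                       # path inside taskfs


def copy_gpu_drivers(taskfs_root):
    """Copy the host driver directory into a task rootfs (sketch, not a fix)."""
    destination = os.path.join(taskfs_root, TASK_MOUNT_POINT)
    if os.path.isdir(HOST_DRIVER_DIR) and not os.path.exists(destination):
        shutil.copytree(HOST_DRIVER_DIR, destination, symlinks=True)
{code}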

Here is the resulting mountinfo for the mesos-execute task. Notice how 
/usr/local/nvidia is mounted from the /mesos directory.

140 102 8:17 
/mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
141 140 8:17 
/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
rw,mode=600,ptmxmode=666
148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw

Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
missing.

72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
rw,errors=remount-ro,data=ordered
73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
rw,size=10240k,nr_inodes=16521649,mode=755
74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
rw,gid=5,mode=620,ptmxmode=000
75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
rw,size=26438160k,mode=755
79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
rw,size=5120k
80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
securityfs securityfs rw
84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
ro,mode=755
85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - 
cgroup cgroup 
rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - 
cgroup cgroup rw,cpuset
87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
master:14 - cgroup cgroup rw,cpu,cpuacct
88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - 
cgroup cgroup rw,devices
89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - 
cgroup cgroup rw,freezer
90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
master:17 - cgroup cgroup rw,net_cls,net_prio
91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
cgroup cgroup rw,blkio
92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
master:19 - cgroup cgroup rw,perf_event
93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore 
pstore rw
94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 
rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc 
binfmt_misc rw
98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
99 98 8:17 
/mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d
 
/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-/executors/thermos-ro

[jira] [Commented] (AURORA-1602) Add aurora_admin command to trigger reconciliation

2016-08-31 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453730#comment-15453730
 ] 

Justin Pinkul commented on AURORA-1602:
---

We recently hit a failure scenario in Mesos where this tool would have helped. 
Our cluster experienced a full power outage and the Aurora scheduler recovered 
faster than the Mesos agents. When Aurora sent its request for explicit task 
reconciliation, the Mesos master had to drop it because agents were still in a 
transitory state. That state only lasted a few more minutes, but Aurora did not 
attempt reconciliation again for another reconciliation_explicit_interval (in 
our case the default of 60 minutes). As a result it took the Aurora scheduler 
an extra hour to reschedule the jobs that were lost in the outage. A tool that 
could trigger reconciliation on demand would have let us recover the cluster 
faster.
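
For context, explicit reconciliation is just a request for the master to 
re-send the authoritative status of specific tasks. A minimal sketch using the 
legacy Mesos Python scheduler bindings (this assumes an already-registered 
MesosSchedulerDriver named driver and is not how Aurora issues reconciliation 
internally):

{code:python}
# Sketch of requesting explicit task reconciliation from the Mesos master.
# Passing an empty list instead would request implicit reconciliation
# (status updates for every task the master knows about).

from mesos.interface import mesos_pb2


def reconcile_explicitly(driver, task_ids):
    """Ask the master for the current status of the given task IDs."""
    statuses = []
    for task_id in task_ids:
        status = mesos_pb2.TaskStatus()
        status.task_id.value = task_id
        # The state here is only a hint; the master replies with the truth.
        status.state = mesos_pb2.TASK_RUNNING
        statuses.append(status)
    driver.reconcileTasks(statuses)
{code}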

> Add aurora_admin command to trigger reconciliation 
> ---
>
> Key: AURORA-1602
> URL: https://issues.apache.org/jira/browse/AURORA-1602
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Zameer Manji
>
> Currently reconciliation runs on a fixed schedule. Adding an admin RPC to 
> trigger it is useful for operators who want to speed up cluster recovery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)