[ https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Massenzio updated MESOS-2215:
-----------------------------------
    Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 10 - 5/30  (was: Mesosphere 
Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

> The Docker containerizer attempts to recover any task when checkpointing is 
> enabled, not just docker tasks.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-2215
>                 URL: https://issues.apache.org/jira/browse/MESOS-2215
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 0.21.0
>            Reporter: Steve Niemitz
>            Assignee: Timothy Chen
>
> Once the slave restarts and recovers its tasks, I see this error in the log 
> every second or so for every recovered task. Note, these were NOT docker 
> tasks:
> W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage 
> for  container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor 
> thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
>  of framework 20150109-161713-715350282-5050-290797-0000: Failed to 'docker 
> inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited 
> with status 1 stderr = Error: No such image or container: 
> mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
> However, the tasks themselves are still healthy and running.
> The slave was launched with --containerizers=mesos,docker
> -----
> More info: it looks like the docker containerizer is a little too ambitious 
> about recovering containers; again, this was not a docker task:
> I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container 
> '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 
> 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
>  of framework 20150109-161713-715350282-5050-290797-0000
> Looking into the source, the problem seems to be that the 
> ComposingContainerizer runs recover in parallel, but neither the docker 
> containerizer nor the mesos containerizer checks whether it should recover a 
> given task (i.e., whether it was the one that launched it). Perhaps the 
> launching containerizer needs to be written into the checkpoint somewhere?
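> Below is a minimal sketch (not existing Mesos code) of the kind of guard the 
> docker containerizer's recover path could use. It assumes a hypothetical 
> checkpointed "containerizer type" marker and a helper named 
> wasLaunchedByDocker(); containers that were not launched through Docker are 
> simply skipped, so 'docker inspect' is never run against a 
> mesos-containerizer container:
> {code}
> // Sketch only: ContainerState, containerizerType and wasLaunchedByDocker()
> // are assumptions for illustration, not the actual Mesos API.
> #include <iostream>
> #include <string>
> #include <vector>
> 
> struct ContainerState {
>   std::string containerId;
>   std::string executorId;
>   std::string containerizerType;  // hypothetical checkpointed marker
> };
> 
> // Only recover containers that this containerizer launched.
> static bool wasLaunchedByDocker(const ContainerState& state) {
>   return state.containerizerType == "docker";
> }
> 
> void recoverDockerContainers(const std::vector<ContainerState>& states) {
>   for (const ContainerState& state : states) {
>     if (!wasLaunchedByDocker(state)) {
>       // Launched by the mesos containerizer: skip it, so we never call
>       // 'docker inspect' on a container Docker knows nothing about
>       // (the source of the repeated warnings above).
>       continue;
>     }
>     std::cout << "Recovering Docker container '" << state.containerId
>               << "' for executor '" << state.executorId << "'" << std::endl;
>     // ... reattach to the running Docker container here ...
>   }
> }
> {code}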



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
