[ https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marco Massenzio updated MESOS-2215:
-----------------------------------
    Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 10 - 5/30  (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

> The Docker containerizer attempts to recover any task when checkpointing is
> enabled, not just Docker tasks.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-2215
>                 URL: https://issues.apache.org/jira/browse/MESOS-2215
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 0.21.0
>            Reporter: Steve Niemitz
>            Assignee: Timothy Chen
>
> Once the slave restarts and recovers its tasks, I see the error below in the
> log for every recovered task, roughly once per second. Note that these were
> NOT Docker tasks:
>
> W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage
> for container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor
> thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
> of framework 20150109-161713-715350282-5050-290797-0000: Failed to 'docker
> inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited
> with status 1 stderr = Error: No such image or container:
> mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
>
> The tasks themselves, however, are still healthy and running.
> The slave was launched with --containerizers=mesos,docker
>
> -----
>
> More info: it looks like the Docker containerizer is a little too ambitious
> about recovering containers; again, this was not a Docker task:
>
> I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container
> '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor
> 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
> of framework 20150109-161713-715350282-5050-290797-0000
>
> Looking at the source, the problem appears to be that the
> ComposingContainerizer runs recover() on its containerizers in parallel, but
> neither the Docker containerizer nor the Mesos containerizer checks whether
> it was the one that launched a given task before attempting to recover it.
> Perhaps the launching containerizer needs to be written into the checkpoint
> somewhere?
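> The checkpoint-based ownership check suggested above might look something
> like the sketch below. This is a minimal, self-contained illustration of the
> idea, not actual Mesos code: the checkpoint layout, the "containerizer" file
> name, and both helper functions are hypothetical.
>
> {code:cpp}
> // Sketch: checkpoint which containerizer launched a container, and have
> // each containerizer skip containers it does not own during recovery.
> // All names here are hypothetical, not the real Mesos containerizer API.
> #include <filesystem>
> #include <fstream>
> #include <iostream>
> #include <string>
>
> // At launch time, record the launching containerizer's name in the
> // container's checkpoint directory.
> void checkpointContainerizer(const std::string& checkpointDir,
>                              const std::string& containerizer) {
>   std::filesystem::create_directories(checkpointDir);
>   std::ofstream(checkpointDir + "/containerizer") << containerizer;
> }
>
> // At recovery time, only claim containers whose checkpointed name matches.
> bool shouldRecover(const std::string& checkpointDir,
>                    const std::string& containerizer) {
>   std::ifstream in(checkpointDir + "/containerizer");
>   if (!in) {
>     // No checkpointed name (container predates this scheme): fall back to
>     // today's behavior and let every containerizer attempt recovery.
>     return true;
>   }
>   std::string owner;
>   std::getline(in, owner);
>   return owner == containerizer;
> }
>
> int main() {
>   // One container launched by each containerizer.
>   checkpointContainerizer("/tmp/ckpt-mesos-task", "mesos");
>   checkpointContainerizer("/tmp/ckpt-docker-task", "docker");
>
>   // During recovery, the Docker containerizer now skips the mesos task,
>   // so it never runs 'docker inspect' on a container it did not create.
>   for (const std::string& dir :
>        {"/tmp/ckpt-mesos-task", "/tmp/ckpt-docker-task"}) {
>     std::cout << dir << " -> docker recovers? " << std::boolalpha
>               << shouldRecover(dir, "docker") << std::endl;
>   }
> }
> {code}
>
> With a check like this wired into each containerizer's recover path, the
> periodic 'docker inspect' failures above should disappear, since the Docker
> containerizer would no longer poll resource usage for containers that the
> Mesos containerizer launched.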