[ https://issues.apache.org/jira/browse/MESOS-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marco Massenzio updated MESOS-2601:
-----------------------------------
    Comment: was deleted

(was: [~air] today mentioned that fixing this one is critical for the DCOS release: can [~tnachen] provide us with an update on progress? I see that [his fix|https://reviews.apache.org/r/33257/] was committed; is this going to be part of 0.22 or 0.22.1?)

> Tasks are not removed after recovery from slave and mesos containerizer
> -----------------------------------------------------------------------
>
>                 Key: MESOS-2601
>                 URL: https://issues.apache.org/jira/browse/MESOS-2601
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, slave
>    Affects Versions: 0.22.1
>            Reporter: Timothy Chen
>            Assignee: Timothy Chen
>
> We've seen in our test cluster that tasks launched with the Mesos containerizer
> are recovered after a slave restart, but the actual command process is no longer
> running and the checkpointed executor is never marked as completed.
> The Mesos containerizer recovers, and none of the isolators can recover the
> task, but the containerizer itself is somehow never removed and the monitor
> keeps calling usage on it.
> Relevant log lines from the beginning of slave recovery:
> I0408 18:06:33.261379 32504 slave.cpp:577] Successfully attached file '/hdd/mesos/slave/slaves/20150401-160104-251662508-5050-2197-S1/frameworks/20141222-194154-218108076-5050-4125-0004/executors/ct:1427921848104:0:EM DataDog Uploader:/runs/990741ed-909e-49cc-83f8-be63298872da'
> ...
> I0408 18:06:36.583277 32511 containerizer.cpp:350] Recovering container '990741ed-909e-49cc-83f8-be63298872da' for executor 'ct:1427921848104:0:EM DataDog Uploader:' of framework 20141222-194154-218108076-5050-4125-0004
> ...
> I0408 18:06:37.017122 32511 linux_launcher.cpp:162] Couldn't find freezer cgroup for container 990741ed-909e-49cc-83f8-be63298872da, assuming already destroyed
> W0408 18:06:37.074916 32496 cpushare.cpp:199] Couldn't find cgroup for container 990741ed-909e-49cc-83f8-be63298872da
> I0408 18:06:37.075173 32486 mem.cpp:158] Couldn't find cgroup for container 990741ed-909e-49cc-83f8-be63298872da
> E0408 18:06:37.092279 32496 containerizer.cpp:1136] Error in a resource limitation for container 990741ed-909e-49cc-83f8-be63298872da: Unknown container
> I0408 18:06:37.092643 32496 containerizer.cpp:906] Destroying container '990741ed-909e-49cc-83f8-be63298872da'
> W0408 18:06:37.229626 32501 containerizer.cpp:807] Ignoring update for currently being destroyed container: 990741ed-909e-49cc-83f8-be63298872da
> W0408 18:06:38.129873 32484 containerizer.cpp:844] Skipping resource statistic for container 990741ed-909e-49cc-83f8-be63298872da because: Unknown container
> W0408 18:06:38.129909 32484 containerizer.cpp:844] Skipping resource statistic for container 990741ed-909e-49cc-83f8-be63298872da because: Unknown container

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
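To make the failure mode described above concrete, here is a minimal, self-contained C++ sketch. It is not Mesos code and not the fix from r/33257; all names (ToyContainerizer, ToyMonitor, watch, poll) are hypothetical. It only illustrates the shape of the bug report: once a container is destroyed and forgotten by the containerizer, a monitor that was never told to stop watching it keeps calling usage() and gets "Unknown container" back on every poll, matching the repeated warnings in the log excerpt.

{code:cpp}
// Toy illustration only: hypothetical types, not the Mesos containerizer API.
#include <iostream>
#include <set>
#include <string>

struct Usage {
  double cpuSecs = 0.0;  // placeholder statistic
};

class ToyContainerizer {
public:
  void recover(const std::string& id) { containers_.insert(id); }

  // destroy() removes the container from the containerizer's bookkeeping...
  void destroy(const std::string& id) { containers_.erase(id); }

  // ...so any later usage() call for that ID can only report an error.
  bool usage(const std::string& id, Usage* out) {
    if (containers_.count(id) == 0) {
      std::cerr << "Skipping resource statistic for container " << id
                << " because: Unknown container\n";
      return false;
    }
    *out = Usage{};
    return true;
  }

private:
  std::set<std::string> containers_;
};

class ToyMonitor {
public:
  void watch(const std::string& id) { watched_.insert(id); }

  // In the buggy scenario nothing ever tells the monitor to stop watching,
  // so every poll keeps asking about the destroyed container.
  void poll(ToyContainerizer& containerizer) {
    for (const std::string& id : watched_) {
      Usage usage;
      containerizer.usage(id, &usage);
    }
  }

private:
  std::set<std::string> watched_;
};

int main() {
  const std::string id = "990741ed-909e-49cc-83f8-be63298872da";

  ToyContainerizer containerizer;
  ToyMonitor monitor;

  containerizer.recover(id);    // slave restart: container is recovered
  monitor.watch(id);            // monitor starts polling it
  containerizer.destroy(id);    // isolators find nothing, container destroyed

  monitor.poll(containerizer);  // each poll now logs the warning above
  monitor.poll(containerizer);
  return 0;
}
{code}

The report suggests the missing step is on the destroy/cleanup path: when the container cannot be recovered and is destroyed, it should also be dropped from whatever keeps polling it (and the checkpointed executor marked as completed), so the "Unknown container" warnings do not repeat indefinitely.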