Hi all,

We are using Mesos 0.23.1 in combination with Aurora 0.10.0. So far we
have been using the JSON format for Mesos' credential files. However,
because of MESOS-3695 we decided to switch to the plain text format
before updating to 0.24.1. Our understanding is that this should be a
no-op. However, on our cluster it caused multiple tasks to fail on each
slave.

I have attached two excerpts from the Mesos slave log: one where I
grepped for the executor ID of one of the failed tasks, and one where I
grepped for the ID of the corresponding container. What you can see is
that recovery of the container is started and, immediately afterwards,
the executor is killed.
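
To put a number on "immediately afterwards", here is a minimal Python sketch that measures the time between two events in the slave log. It assumes the usual glog line prefix (severity letter, MMDD, HH:MM:SS.microseconds, thread ID, file:line]). The helper names `parse_timestamp` and `gap_between` are made up for illustration, the year is a placeholder because glog does not print one, and the two sample lines are the ones from the excerpt below with the executor ID cut off.

```python
import re
from datetime import datetime

# Assumed glog prefix: <severity><MMDD> <HH:MM:SS.micros> <thread id> <file:line>] <message>
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<tid>\d+) "
    r"(?P<src>[^\]]+)\] (?P<msg>.*)$"
)


def parse_timestamp(line, year=2016):
    """Return the timestamp of a glog line, or None if the line does not match.

    glog omits the year, so `year` is only a placeholder; it does not
    affect differences between lines from the same day.
    """
    m = GLOG_RE.match(line)
    if m is None:
        return None
    stamp = f"{year}-{m['month']}-{m['day']} {m['time']}"
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


def gap_between(lines, first_marker, second_marker):
    """Seconds between the first line containing first_marker and the
    next line after it containing second_marker, or None if either is missing."""
    first = second = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue
        if first is None and first_marker in line:
            first = ts
        elif first is not None and second is None and second_marker in line:
            second = ts
    if first is None or second is None:
        return None
    return (second - first).total_seconds()


# Two lines from the attached excerpt, truncated after the message start.
SAMPLE = [
    "I0114 14:09:51.297840 23049 slave.cpp:3967] Sending reconnect request to executor ...",
    "I0114 14:09:53.298254 23050 slave.cpp:2638] Killing un-reregistered executor ...",
]

if __name__ == "__main__":
    gap = gap_between(SAMPLE, "Sending reconnect request",
                      "Killing un-reregistered executor")
    print(f"reconnect request -> kill: {gap:.6f} s")
```

For the two lines above the gap comes out at almost exactly two seconds. A gap that round may point to a timeout on the slave side rather than a crash of the executor, but that is only a guess that would need to be checked against the slave code.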

Our change procedure was:
* Place the new plain-text credential file
* Restart the slave with `--credential` pointing to the new file
* Remove the old JSON credential file
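
For reference, the conversion in the first step can be sketched as follows. This is a sketch under assumptions, not our actual tooling: it assumes the JSON file is a single object with `principal` and `secret` keys and that the plain-text format is one line with both values separated by whitespace. The function name `json_credential_to_text` and the sample values are hypothetical.

```python
import json


def json_credential_to_text(json_text):
    """Convert a JSON credential to the plain-text form.

    Assumed input:  {"principal": "...", "secret": "..."}
    Assumed output: "<principal> <secret>" on a single line.
    """
    cred = json.loads(json_text)
    principal = cred["principal"]
    secret = cred["secret"]
    for name, value in (("principal", principal), ("secret", secret)):
        # Whitespace inside a value would make the one-line format ambiguous.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{name} is empty or contains whitespace")
    return f"{principal} {secret}"


if __name__ == "__main__":
    # Hypothetical sample values, not real credentials.
    sample = '{"principal": "slave-principal", "secret": "s3cret"}'
    print(json_credential_to_text(sample))
```

The function deliberately returns the line without a trailing newline, so that whoever writes the file decides how it is terminated.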

We are running the Mesos slave using supervisord and use the following
isolators: cgroups/cpu, cgroups/mem, filesystem/shared, namespaces/pid,
and posix/disk. In addition we use `--enforce_container_disk_quota`.
Regarding recovery we use the options `--recover="reconnect"` and
`--strict="false"`.

The Thermos log does not provide any hints as to what happened. It looks
like Thermos was SIGKILLed.

Have any of you run into this problem before? Do you have an idea what
could cause this behaviour? Do you have any suggestions as to what
information we could look for to better understand what happens?

Kind Regards,
Matthias

-- 
Dr. Matthias Bach
Senior Software Engineer
*Blue Yonder GmbH*
Ohiostraße 8
D-76149 Karlsruhe

Tel +49 (0)721 383 117 6244
Fax +49 (0)721 383 117 69

matthias.b...@blue-yonder.com
www.blue-yonder.com
Registergericht Mannheim, HRB 704547
USt-IdNr. DE DE 277 091 535
Geschäftsführer: Jochen Bossert, Uwe Weiss (CEO)

I0114 14:09:51.213526 23008 containerizer.cpp:371] Recovering container 'e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' for executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:51.230132 23056 mem.cpp:602] Started listening for OOM events for container e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:51.230499 23056 mem.cpp:718] Started listening on low memory pressure events for container e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:51.230828 23056 mem.cpp:718] Started listening on medium memory pressure events for container e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:51.231233 23056 mem.cpp:718] Started listening on critical memory pressure events for container e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:53.584983 23014 containerizer.cpp:1001] Destroying container 'e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a'
I0114 14:09:53.585428 23014 linux_launcher.cpp:358] Using pid namespace to destroy container e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:53.800837 23014 containerizer.cpp:1188] Executor for container 'e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' has exited
I0114 14:09:53.802088 23012 cgroups.cpp:2382] Freezing cgroup /sys/fs/cgroup/freezer/mesos/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:53.803673 22996 cgroups.cpp:1415] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a after 1.552896ms
I0114 14:09:53.804822 23008 cgroups.cpp:2399] Thawing cgroup /sys/fs/cgroup/freezer/mesos/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:53.806593 23012 cgroups.cpp:1444] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a after 1.753856ms
W0114 14:09:54.639930 23014 containerizer.cpp:885] Ignoring update for unknown container: e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a
I0114 14:09:54.708149 23002 gc.cpp:56] Scheduling '/var/lib/mesos/slaves/20151021-121051-84017418-5050-52142-S3/frameworks/20150930-134812-84017418-5050-29407-0001/executors/thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3/runs/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' for gc 6.99999180405037days in the future
I0114 14:09:54.708226 22996 gc.cpp:56] Scheduling '/var/lib/mesos/meta/slaves/20151021-121051-84017418-5050-52142-S3/frameworks/20150930-134812-84017418-5050-29407-0001/executors/thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3/runs/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' for gc 6.99999180321778days in the future
 

I0114 14:09:51.090075 22993 slave.cpp:4842] Recovering executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001                                                                                                                                                                                                        
I0114 14:09:51.168311 22999 status_update_manager.cpp:210] Recovering executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:51.213526 23008 containerizer.cpp:371] Recovering container 'e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' for executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:51.297840 23049 slave.cpp:3967] Sending reconnect request to executor thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3 of framework 20150930-134812-84017418-5050-29407-0001 at executor(1)@NET.10:57730
I0114 14:09:53.298254 23050 slave.cpp:2638] Killing un-reregistered executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:54.638893 22993 slave.cpp:3349] Executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001 has terminated with unknown status
I0114 14:09:54.639679 22993 slave.cpp:2671] Handling status update TASK_FAILED (UUID: 27eb5247-af17-4b34-9cae-5a4fa254adbe) for task 1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3 of framework 20150930-134812-84017418-5050-29407-0001 from @0.0.0.0:0
I0114 14:09:54.647096 22989 status_update_manager.cpp:322] Received status update TASK_FAILED (UUID: 27eb5247-af17-4b34-9cae-5a4fa254adbe) for task 1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3 of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:54.647122 22989 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_FAILED (UUID: 27eb5247-af17-4b34-9cae-5a4fa254adbe) for task 1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3 of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:54.647640 23015 slave.cpp:2926] Forwarding the update TASK_FAILED (UUID: 27eb5247-af17-4b34-9cae-5a4fa254adbe) for task 1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3 of framework 20150930-134812-84017418-5050-29407-0001 to master@NET.7:5050
I0114 14:09:54.707476 23054 status_update_manager.cpp:394] Received status update acknowledgement (UUID: 27eb5247-af17-4b34-9cae-5a4fa254adbe) for task 1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3 of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:54.707496 23054 status_update_manager.cpp:826] Checkpointing ACK for status update TASK_FAILED (UUID: 27eb5247-af17-4b34-9cae-5a4fa254adbe) for task 1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3 of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:54.708057 23054 slave.cpp:3460] Cleaning up executor 'thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' of framework 20150930-134812-84017418-5050-29407-0001
I0114 14:09:54.708149 23002 gc.cpp:56] Scheduling '/var/lib/mesos/slaves/20151021-121051-84017418-5050-52142-S3/frameworks/20150930-134812-84017418-5050-29407-0001/executors/thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3/runs/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' for gc 6.99999180405037days in the future
I0114 14:09:54.708195 22996 gc.cpp:56] Scheduling '/var/lib/mesos/slaves/20151021-121051-84017418-5050-52142-S3/frameworks/20150930-134812-84017418-5050-29407-0001/executors/thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' for gc 6.99999180352889days in the future
I0114 14:09:54.708226 22996 gc.cpp:56] Scheduling '/var/lib/mesos/meta/slaves/20151021-121051-84017418-5050-52142-S3/frameworks/20150930-134812-84017418-5050-29407-0001/executors/thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3/runs/e7f628f5-cd1b-4cf0-b8d4-dd457b46bd9a' for gc 6.99999180321778days in the future
I0114 14:09:54.708247 22996 gc.cpp:56] Scheduling '/var/lib/mesos/meta/slaves/20151021-121051-84017418-5050-52142-S3/frameworks/20150930-134812-84017418-5050-29407-0001/executors/thermos-1452181970177-USER-prod-JOB_NAME-0-99a16851-42d6-4a52-b768-359b4f499ff3' for gc 6.99999180305482days in the future
 
