[jira] [Comment Edited] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave
[ https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071032#comment-14071032 ]

Steven Schlansker edited comment on MESOS-1193 at 7/22/14 10:45 PM:

That sounds entirely plausible. We are running Docker containers. Given the somewhat unstable state of Docker support in Mesos, we are using our own Docker launching scripts. I had just updated a base image, so all the slaves were busy executing a 'docker pull' to grab the new images. Given that the task is a shell script that executes this pull, it may well be past what Mesos thinks of as the "launch" phase, but it was definitely during a lengthy initialization step.

It's worth noting that almost all of our jobs are Marathon tasks. I believe the log messages about Chronos are unrelated; we only have one or two things launching with it, and I don't think any were around the time of the crash.

was (Author: stevenschlansker):
That sounds entirely plausible. We are running Docker containers. Given the somewhat unstable state of Docker support in Mesos, we are using our own Docker launching scripts. I had just updated a base image, so all the slaves were busy executing a 'docker pull' to grab the new images. Given that the task is a shell script that executes this pull, it may well be past what Mesos thinks of as the "launch" phase, but it was definitely during a lengthy initialization step.

It's worth noting that almost all of our jobs are Marathon tasks. I believe (?) the log messages about Chronos are unrelated; we only have one or two things launching with it, and I don't think any were around the time of the crash.

> Check failed: promises.contains(containerId) crashes slave
> -----------------------------------------------------------
>
>             Key: MESOS-1193
>             URL: https://issues.apache.org/jira/browse/MESOS-1193
>         Project: Mesos
>      Issue Type: Bug
>      Components: containerization
> Affects Versions: 0.18.0
>        Reporter: Tobi Knaup
>
> This was observed with four slaves on one machine, one framework (Marathon) and around 100 tasks per slave.
> I0404 17:58:58.298075 3939 mesos_containerizer.cpp:891] Executor for container '6d4de71c-a491-4544-afe0-afcbfa37094a' has exited
> I0404 17:58:58.298395 3938 slave.cpp:2047] Executor 'web_467-1396634277535' of framework 201404041625-3823062160-55371-22555- has terminated with signal Killed
> E0404 17:58:58.298475 3929 slave.cpp:2320] Failed to unmonitor container for executor web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-: Not monitored
> I0404 17:58:58.299075 3938 slave.cpp:1643] Handling status update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- from @0.0.0.0:0
> I0404 17:58:58.299232 3932 status_update_manager.cpp:315] Received status update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-
> I0404 17:58:58.299360 3932 status_update_manager.cpp:368] Forwarding status update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task web_467-1396634277535 of framework 201404041625-3823062160-55371-22555- to master@144.76.223.227:5050
> I0404 17:58:58.306967 3932 status_update_manager.cpp:393] Received status update acknowledgement (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-
> I0404 17:58:58.307049 3932 slave.cpp:2186] Cleaning up executor 'web_467-1396634277535' of framework 201404041625-3823062160-55371-22555-
> I0404 17:58:58.307122 3932 gc.cpp:56] Scheduling '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535/runs/6d4de71c-a491-4544-afe0-afcbfa37094a' for gc 6.9644578667days in the future
> I0404 17:58:58.307157 3932 gc.cpp:56] Scheduling '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-/executors/web_467-1396634277535' for gc 6.9644553185days in the future
> F0404 17:58:58.597434 3938 mesos_containerizer.cpp:682] Check failed: promises.contains(containerId)
> *** Check failure stack trace: ***
>     @     0x7f5209da6e5d  google::LogMessage::Fail()
>     @     0x7f5209da8c9d  google::LogMessage::SendToLog()
>     @     0x7f5209da6a4c  google::LogMessage::Flush()
>     @     0x7f5209da9599  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f5209ad9f88  mesos::internal::slave::MesosContainerizerProcess::exec()
>     @     0x7f5209af3b56  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS
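For context on the failure mode discussed in this thread: the fatal line comes from a CHECK on a map of per-container promises inside the containerizer, hit from exec() according to the stack trace above. The sketch below is a simplified, hypothetical model of that kind of race (the names ToyContainerizer, launch, destroy, and exec are illustrative, not the actual MesosContainerizerProcess code): if the container is removed from the map while a long fetch such as a 'docker pull' is still running, the queued exec step later trips the containment check and aborts the whole process.

{code}
// Simplified, hypothetical model of the suspected race; NOT the actual Mesos code.
#include <cassert>
#include <iostream>
#include <string>
#include <unordered_map>

struct ToyContainerizer {
  // One entry per running container, keyed by container ID.
  std::unordered_map<std::string, bool> promises;

  void launch(const std::string& containerId) {
    promises[containerId] = true;   // Registered when the container is started.
  }

  void destroy(const std::string& containerId) {
    promises.erase(containerId);    // Cleanup removes the entry.
  }

  void exec(const std::string& containerId) {
    // Plays the role of "CHECK(promises.contains(containerId))": if destroy()
    // already ran (e.g. while a long fetch was in progress), this aborts the
    // process, just like the slave crash in the log above.
    assert(promises.count(containerId) > 0);
    std::cout << "exec'ing executor for " << containerId << std::endl;
  }
};

int main() {
  ToyContainerizer c;
  c.launch("6d4de71c");   // Container starts; a slow fetch/pull begins.
  c.destroy("6d4de71c");  // Container is torn down before the fetch finishes.
  c.exec("6d4de71c");     // The queued exec step still runs -> abort.
}
{code}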
[jira] [Comment Edited] (MESOS-1193) Check failed: promises.contains(containerId) crashes slave
[ https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070592#comment-14070592 ]

Steven Schlansker edited comment on MESOS-1193 at 7/22/14 6:08 PM:
-------------------------------------------------------------------

This same issue just took down our entire cluster this morning. Not cool!

I wish I had more debugging information, but here are the last 10 log lines:

{code}
I0722 17:41:26.189750 1376 slave.cpp:2552] Cleaning up executor 'chronos.43852277-11c7-11e4-b541-1e5db1e5be60' of framework 201403072353-2969781770-5050-852-
I0722 17:41:26.189893 1381 gc.cpp:56] Scheduling '/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60/runs/941f20d9-ba74-4531-b64e-6c2b05b0277f' for gc 6.9780272days in the future
I0722 17:41:26.189980 1381 gc.cpp:56] Scheduling '/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60' for gc 6.9780201482days in the future
I0722 17:41:26.737553 1380 slave.cpp:933] Got assigned task chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 201403072353-2969781770-5050-852-
I0722 17:41:26.737844 1380 slave.cpp:1043] Launching task chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 201403072353-2969781770-5050-852-
I0722 17:41:26.739146 1375 mesos_containerizer.cpp:537] Starting container '17c2236a-1242-4c36-a6ed-54a31f687e8b' for executor 'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' of framework '201403072353-2969781770-5050-852-'
I0722 17:41:26.739151 1380 slave.cpp:1153] Queuing task 'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' for executor chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 of framework '201403072353-2969781770-5050-852-
I0722 17:41:26.748342 1375 launcher.cpp:117] Forked child with pid '12376' for container '17c2236a-1242-4c36-a6ed-54a31f687e8b'
I0722 17:41:26.752080 1380 mesos_containerizer.cpp:647] Fetching URIs for container '17c2236a-1242-4c36-a6ed-54a31f687e8b' using command '/usr/local/libexec/mesos/mesos-fetcher'
F0722 17:41:52.215634 1377 mesos_containerizer.cpp:862] Check failed: promises.contains(containerId)
{code}

{code}
I0722 17:56:35.702491 13428 main.cpp:126] Build: 2014-06-09 21:08:25 by root
I0722 17:56:35.702517 13428 main.cpp:128] Version: 0.19.0
I0722 17:56:35.702535 13428 main.cpp:131] Git tag: 0.19.0
I0722 17:56:35.702553 13428 main.cpp:135] Git SHA: 51e047524cf744ee257870eb479345646c0428ff
I0722 17:56:35.702590 13428 mesos_containerizer.cpp:124] Using isolation: posix/cpu,posix/mem
I0722 17:56:35.702942 13428 main.cpp:149] Starting Mesos slave
I0722 17:56:35.703721 13428 slave.cpp:143] Slave started on 1)@10.70.6.32:5051
I0722 17:56:35.704082 13428 slave.cpp:255] Slave resources: cpus(*):8; mem(*):29077; disk(*):70336; ports(*):[31000-32000]
I0722 17:56:35.705883 13428 slave.cpp:283] Slave hostname: 10.70.6.32
I0722 17:56:35.705915 13428 slave.cpp:284] Slave checkpoint: true
{code}

was (Author: stevenschlansker):
This same issue just took down our entire cluster this morning. Not cool!
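The sequence above (launch at 17:41:26, fetch via mesos-fetcher, fatal CHECK at 17:41:52) is consistent with the container being torn down during the roughly 26-second fetch, after which the queued exec step finds no promise for it. A guarded lookup that fails the one launch instead of aborting the whole slave is one obvious direction; the sketch below only illustrates that idea under the same toy model as before (the tryExec helper and its return type are hypothetical, not the actual Mesos change made for this ticket).

{code}
// Hypothetical guarded variant of the same lookup: report a per-container
// failure instead of CHECK-crashing the whole slave process.
#include <iostream>
#include <string>
#include <unordered_map>

enum class ExecResult { Started, UnknownContainer };

struct ToyContainerizer {
  std::unordered_map<std::string, bool> promises;

  ExecResult tryExec(const std::string& containerId) {
    auto it = promises.find(containerId);
    if (it == promises.end()) {
      // Container was already destroyed (e.g. the task was killed during a
      // long fetch); surface an error instead of aborting.
      return ExecResult::UnknownContainer;
    }
    std::cout << "exec'ing executor for " << containerId << std::endl;
    return ExecResult::Started;
  }
};

int main() {
  ToyContainerizer c;
  if (c.tryExec("17c2236a") == ExecResult::UnknownContainer) {
    std::cerr << "container already gone; failing the launch, not the slave"
              << std::endl;
  }
}
{code}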