[ https://issues.apache.org/jira/browse/MESOS-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291648#comment-15291648 ]
Joseph Wu commented on MESOS-5395: ---------------------------------- Nothing in the mesos logs indicates that your task is *not* starting: >From the stdout file, the task you're looking at is {code} project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e {code} The agent logs say that the task started successfully. These timestamps lines up very closely with the task's stderr. {code} I0518 14:55:19.393923 947 slave.cpp:1361] Got assigned task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e for framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 I0518 14:55:19.394619 947 gc.cpp:83] Unscheduling '/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000' from gc I0518 14:55:19.394680 947 gc.cpp:83] Unscheduling '/var/mesos/meta/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000' from gc I0518 14:55:19.394760 947 slave.cpp:1480] Launching task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e for framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 I0518 14:55:19.395539 947 paths.cpp:528] Trying to chown '/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c' to user 'root' I0518 14:55:19.399237 947 slave.cpp:5367] Launching executor project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c' I0518 14:55:19.399588 947 slave.cpp:1698] Queuing task 'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' for executor 'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 I0518 14:55:19.402344 948 docker.cpp:1036] Starting container 'd3996d05-26f6-4e6c-a89f-8ee9c617182c' for task 'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' (and executor 'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e') of framework '17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000' ... I0518 14:55:26.880151 952 docker.cpp:623] Checkpointing pid 6331 to '/var/mesos/meta/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c/pids/forked.pid' I0518 14:55:26.907119 952 slave.cpp:2643] Got registration for executor 'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 from executor(1)@10.254.234.236:42289 I0518 14:55:26.907639 952 docker.cpp:1316] Ignoring updating container 'd3996d05-26f6-4e6c-a89f-8ee9c617182c' with resources passed to update is identical to existing resources I0518 14:55:26.907726 952 slave.cpp:1863] Sending queued task 'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' to executor 'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 at executor(1)@10.254.234.236:42289 I0518 14:55:27.622561 952 slave.cpp:3002] Handling status update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 from executor(1)@10.254.234.236:42289 I0518 14:55:27.622762 953 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 I0518 14:55:27.622974 953 status_update_manager.cpp:824] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 I0518 14:55:27.679003 953 slave.cpp:3400] Forwarding the update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 to master@10.254.226.211:5050 I0518 14:55:27.679095 953 slave.cpp:3310] Sending acknowledgement for status update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 to executor(1)@10.254.234.236:42289 I0518 14:55:27.691797 950 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 I0518 14:55:27.691839 950 status_update_manager.cpp:824] Checkpointing ACK for status update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 {code} Right above this, is presumably marathon's previous attempt at starting your task. {code} I0518 11:56:24.553864 947 docker.cpp:1036] Starting container 'a2227cc9-79aa-417c-8189-a260e8b57b2b' for task 'project-hub_project-hub-frontend.663dbd31-1cd6-11e6-bb25-d00d2cce797e' (and executor 'project-hub_project-hub-frontend.663dbd31-1cd6-11e6-bb25-d00d2cce797e') of framework '17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000' I0518 12:01:24.554524 948 slave.cpp:4322] Terminating executor ''project-hub_project-hub-frontend.663dbd31-1cd6-11e6-bb25-d00d2cce797e' of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000' because it did not register within 5mins I0518 12:01:24.554687 948 docker.cpp:1696] Destroying container 'a2227cc9-79aa-417c-8189-a260e8b57b2b' I0518 12:01:24.554694 948 docker.cpp:1739] Destroying Container 'a2227cc9-79aa-417c-8189-a260e8b57b2b' in PULLING state {code} By the looks of it, your docker image is either very large (i.e. it cannot be reliably pulled within 5 minutes, or the agent's {{--executor_registration_timeout}} flag); or that agent was partitioned from the docker registry you are using. If your image(s) are very large, consider increasing the value of the {{--executor_registration_timeout}} flag. > Task getting stuck in staging state if launch it on a rebooted slave. > --------------------------------------------------------------------- > > Key: MESOS-5395 > URL: https://issues.apache.org/jira/browse/MESOS-5395 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.28.0 > Environment: mesos/marathon cluster, 3 maters/4 slaves > Mesos: 0.28.0 , Marathon 0.15.2 > Reporter: Mengkui gong > Attachments: mesos-log.zip > > > if rebooting a slave, after that, using Marathon to launch a task, the task > can start on other slaves without problem. But if launch it on the rebooted > slave, the task will be stuck. From Mesos UI shows it in staging state from > active tasks list. From Marathon UI shows it in deploying state. It can > keeping in stuck state for more than 2 hours. After that time, Marathon will > automatically launch the task on this rebooted slave or other slave as > normal. So the rebooted slave be recovered as well after that time. > From Mesos log, I can see "telling slave to kill task" all the time. > I0517 15:25:27.207237 20568 master.cpp:3826] Telling slave > 282745ab-423a-4350-a449-3e8cdfccfb93-S1 at slave(1)@10.254.234.236:5050 > (mesos-slave-3) to kill task > project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e of > framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 (marathon) at > scheduler-fe615b72-ab92-49ca-89e6-e74e600c7e15@10.254.228.3:56757. > From rebooted slave log, I can see: > May 17 15:28:37 euca-10-254-234-236 mesos-slave[829]: I0517 15:28:37.206831 > 916 slave.cpp:1891] Asked to kill task > project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e of > framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 > May 17 15:28:37 euca-10-254-234-236 mesos-slave[829]: W0517 15:28:37.206866 > 916 slave.cpp:2018] Ignoring kill task > project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e because > the executor > 'project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e' of > framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 is terminating/terminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)