[jira] [Commented] (MESOS-5395) Task getting stuck in staging state if launch it on a rebooted slave.

Joseph Wu (JIRA) Thu, 19 May 2016 10:54:07 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291648#comment-15291648
 ]


Joseph Wu commented on MESOS-5395:
----------------------------------

Nothing in the Mesos logs indicates that your task is *not* starting:

From the stdout file, the task you're looking at is
{code}
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e
{code}

The agent logs say that the task started successfully.  These timestamps line 
up very closely with the task's stderr.
{code}
I0518 14:55:19.393923   947 slave.cpp:1361] Got assigned task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e for 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:19.394619   947 gc.cpp:83] Unscheduling 
'/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000'
 from gc
I0518 14:55:19.394680   947 gc.cpp:83] Unscheduling 
'/var/mesos/meta/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000'
 from gc
I0518 14:55:19.394760   947 slave.cpp:1480] Launching task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e for 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:19.395539   947 paths.cpp:528] Trying to chown 
'/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c'
 to user 'root'
I0518 14:55:19.399237   947 slave.cpp:5367] Launching executor 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 with resources cpus(*):0.1; 
mem(*):32 in work directory 
'/var/mesos/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c'
I0518 14:55:19.399588   947 slave.cpp:1698] Queuing task 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' for 
executor 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:19.402344   948 docker.cpp:1036] Starting container 
'd3996d05-26f6-4e6c-a89f-8ee9c617182c' for task 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' (and 
executor 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e') of 
framework '17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000'
...
I0518 14:55:26.880151   952 docker.cpp:623] Checkpointing pid 6331 to 
'/var/mesos/meta/slaves/282745ab-423a-4350-a449-3e8cdfccfb93-S2/frameworks/17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000/executors/project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e/runs/d3996d05-26f6-4e6c-a89f-8ee9c617182c/pids/forked.pid'
I0518 14:55:26.907119   952 slave.cpp:2643] Got registration for executor 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 from 
executor(1)@10.254.234.236:42289
I0518 14:55:26.907639   952 docker.cpp:1316] Ignoring updating container 
'd3996d05-26f6-4e6c-a89f-8ee9c617182c' with resources passed to update is 
identical to existing resources
I0518 14:55:26.907726   952 slave.cpp:1863] Sending queued task 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' to 
executor 
'project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e' of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 at 
executor(1)@10.254.234.236:42289
I0518 14:55:27.622561   952 slave.cpp:3002] Handling status update TASK_RUNNING 
(UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 from 
executor(1)@10.254.234.236:42289
I0518 14:55:27.622762   953 status_update_manager.cpp:320] Received status 
update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:27.622974   953 status_update_manager.cpp:824] Checkpointing UPDATE 
for status update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for 
task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:27.679003   953 slave.cpp:3400] Forwarding the update TASK_RUNNING 
(UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 to 
master@10.254.226.211:5050
I0518 14:55:27.679095   953 slave.cpp:3310] Sending acknowledgement for status 
update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 to 
executor(1)@10.254.234.236:42289
I0518 14:55:27.691797   950 status_update_manager.cpp:392] Received status 
update acknowledgement (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for task 
project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
I0518 14:55:27.691839   950 status_update_manager.cpp:824] Checkpointing ACK 
for status update TASK_RUNNING (UUID: 26e73671-099c-49f1-a031-57aa9a8cec41) for 
task project-hub_project-hub-frontend.64b60262-1cef-11e6-bb25-d00d2cce797e of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
{code}

Right above this is presumably Marathon's previous attempt at starting your 
task.
{code}
I0518 11:56:24.553864   947 docker.cpp:1036] Starting container 
'a2227cc9-79aa-417c-8189-a260e8b57b2b' for task 
'project-hub_project-hub-frontend.663dbd31-1cd6-11e6-bb25-d00d2cce797e' (and 
executor 
'project-hub_project-hub-frontend.663dbd31-1cd6-11e6-bb25-d00d2cce797e') of 
framework '17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000'
I0518 12:01:24.554524   948 slave.cpp:4322] Terminating executor 
''project-hub_project-hub-frontend.663dbd31-1cd6-11e6-bb25-d00d2cce797e' of 
framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000' because it did not 
register within 5mins
I0518 12:01:24.554687   948 docker.cpp:1696] Destroying container 
'a2227cc9-79aa-417c-8189-a260e8b57b2b'
I0518 12:01:24.554694   948 docker.cpp:1739] Destroying Container 
'a2227cc9-79aa-417c-8189-a260e8b57b2b' in PULLING state
{code}

By the looks of it, either your Docker image is very large (i.e. it cannot be 
reliably pulled within 5 minutes, or whatever the agent's 
{{--executor_registration_timeout}} flag is set to), or that agent was 
partitioned from the Docker registry you are using.
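
As a quick check of that reading, here is a small Python sketch (not part of Mesos) that parses the timestamps of the two relevant agent log lines and computes the gap between "Starting container" and "Terminating executor". The helper name {{glog_time}} and the year 2016 are assumptions made for illustration, and the log lines are abbreviated copies of the ones quoted above, rejoined after mail wrapping.

```python
import re
from datetime import datetime, timedelta

# Abbreviated copies of two agent log lines quoted above ("..." marks text cut for brevity).
LOG = """\
I0518 11:56:24.553864   947 docker.cpp:1036] Starting container 'a2227cc9-79aa-417c-8189-a260e8b57b2b' for task ...
I0518 12:01:24.554524   948 slave.cpp:4322] Terminating executor ... because it did not register within 5mins
"""

# Log prefix: severity letter, MMDD, HH:MM:SS.microseconds, thread id, file:line]
PREFIX = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ \S+\]")


def glog_time(line, year=2016):
    """Return the timestamp of one log line (the year is not logged, so it is assumed)."""
    match = PREFIX.match(line)
    if match is None:
        raise ValueError("not a recognised log line: %r" % line)
    stamp = "%d%s %s" % (year, match.group(1), match.group(2))
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


lines = LOG.splitlines()
started = glog_time(next(l for l in lines if "Starting container" in l))
terminated = glog_time(next(l for l in lines if "Terminating executor" in l))
elapsed = terminated - started

print("container start -> executor termination:", elapsed)
print("reached the 5 minute timeout:", elapsed >= timedelta(minutes=5))
```

The gap comes out at almost exactly five minutes, which matches the "did not register within 5mins" message and the container still being in PULLING state when it was destroyed.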

If your image(s) are very large, consider increasing the value of the 
{{--executor_registration_timeout}} flag.

> Task getting stuck in staging state if launch it on a rebooted slave.
> ---------------------------------------------------------------------
>
>                 Key: MESOS-5395
>                 URL: https://issues.apache.org/jira/browse/MESOS-5395
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.28.0
>         Environment: mesos/marathon cluster,  3 masters/4 slaves
> Mesos: 0.28.0 ,  Marathon 0.15.2
>            Reporter: Mengkui gong
>         Attachments: mesos-log.zip
>
>
> After rebooting a slave, a task launched through Marathon starts on the other 
> slaves without problems.  But if it is launched on the rebooted slave, the 
> task gets stuck.  The Mesos UI shows it in the staging state in the active 
> tasks list, and the Marathon UI shows it in the deploying state.  It can stay 
> stuck for more than 2 hours.  After that time, Marathon automatically 
> launches the task on the rebooted slave or another slave as normal, so the 
> rebooted slave has recovered by then as well.
> In the Mesos log, I can see "telling slave to kill task" all the time.
> I0517 15:25:27.207237 20568 master.cpp:3826] Telling slave 
> 282745ab-423a-4350-a449-3e8cdfccfb93-S1 at slave(1)@10.254.234.236:5050 
> (mesos-slave-3) to kill task 
> project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e of 
> framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 (marathon) at 
> scheduler-fe615b72-ab92-49ca-89e6-e74e600c7e15@10.254.228.3:56757.
> In the rebooted slave's log, I can see:
> May 17 15:28:37 euca-10-254-234-236 mesos-slave[829]: I0517 15:28:37.206831   
> 916 slave.cpp:1891] Asked to kill task 
> project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e of 
> framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
> May 17 15:28:37 euca-10-254-234-236 mesos-slave[829]: W0517 15:28:37.206866   
> 916 slave.cpp:2018] Ignoring kill task 
> project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e because 
> the executor 
> 'project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e' of 
> framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 is terminating/terminated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
