[ 
https://issues.apache.org/jira/browse/MESOS-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494339#comment-14494339
 ] 

Elizabeth Lingg commented on MESOS-2605:
----------------------------------------

As an additional note, we observe this behavior on our CoreOS cluster, where 
Mesos slaves are restarted and machines are rebooted. The issue appears 
intermittently after slave restarts — sometimes it occurs, sometimes it does 
not.

> The slave sometimes does not send active executors during reregistration
> ------------------------------------------------------------------------
>
>                 Key: MESOS-2605
>                 URL: https://issues.apache.org/jira/browse/MESOS-2605
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.22.0
>            Reporter: Elizabeth Lingg
>            Assignee: Michael Park
>              Labels: mesosphere
>
> The slave sometimes does not send its active executors during 
> reregistration. Framework checkpointing is enabled, and the executor 
> successfully reregisters with the slave. However, the tasks running in that 
> executor are marked TASK_LOST (reported as abnormal executor termination) 
> because the Mesos master removes the executor as unknown. See the example 
> below for task task.journalnode.journalnode.NodeExecutor.1428609184051.
> Slave logs for the task:
> {code}
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.778790 25126 status_update_manager.cpp:317] Received status update 
> TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.779013 25126 status_update_manager.hpp:346] Checkpointing UPDATE for 
> status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for 
> task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.781788 25123 slave.cpp:2753] Forwarding the update TASK_RUNNING 
> (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008 to master@10.142.250.253:5050
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.781889 25123 slave.cpp:2686] Sending acknowledgement for status 
> update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008 to executor(1)@10.168.119.78:47638
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.784503 25124 status_update_manager.cpp:389] Received status update 
> acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.784567 25124 status_update_manager.hpp:346] Checkpointing ACK for 
> status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for 
> task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008
> {code}
> Master Logs:
> {code}
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409 
> 20:19:43.008666  1067 master.cpp:4015] Executor 
> executor.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave 
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
> 20:19:43.008652  1074 hierarchical.hpp:648] Recovered cpus(*):0.1; 
> mem(*):1536 (total allocatable: cpus(*):3.5; mem(*):21113; disk(*):142210; 
> ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180, 8082-9041, 
> 9043-9159, 9161-9999, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave 
> 20150407-233647-2059219722-5050-1659-S5 from framework 
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
> 20:19:43.008712  1067 master.cpp:4714] Removing executor 
> 'executor.journalnode.NodeExecutor.1428609184051' with resources cpus(*):0.1; 
> mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008 on slave 
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
> 20:19:43.010372  1067 master.cpp:3295] Status update TASK_LOST (UUID: 
> e5532567-e5b2-4fca-87aa-f3f98e371640) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008 from slave 
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
> 20:19:43.013746  1067 master.cpp:3295] Status update TASK_LOST (UUID: 
> e5532567-e5b2-4fca-87aa-f3f98e371640) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008 from slave 
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
> 20:19:43.013767  1067 master.cpp:3336] Forwarding status update TASK_LOST 
> (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008
> {code}
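To illustrate the failure mode described above, here is a hedged sketch (in Python, not the actual Mesos C++ source — the function and data shapes are assumptions for illustration) of the master-side reconciliation: when a slave reregisters, any executor the master knows about but the slave did not report is treated as unknown, removed, and its tasks are transitioned to TASK_LOST.

```python
# Hedged model of the master-side reconciliation on slave reregistration.
# Not actual Mesos code; names and data structures are illustrative.

def reconcile_executors(master_executors, reported_executor_ids):
    """Return (kept, removed) after a slave reregisters.

    master_executors: dict mapping executor id -> list of task ids the
        master believes are running in that executor.
    reported_executor_ids: set of executor ids the slave included in its
        reregistration message.
    """
    kept = {}
    removed = {}
    for executor_id, tasks in master_executors.items():
        if executor_id in reported_executor_ids:
            kept[executor_id] = tasks
        else:
            # Executor "possibly unknown to the slave": the master removes
            # it and marks its tasks lost, matching the master log lines
            # above ("Removing executor ...", "Status update TASK_LOST ...").
            removed[executor_id] = [(task, "TASK_LOST") for task in tasks]
    return kept, removed

# The bug: the slave omits an active executor from its reregistration
# message, so its still-running tasks are lost even though the executor
# is alive.
kept, removed = reconcile_executors(
    {"executor.journalnode.NodeExecutor.1428609184051":
        ["task.journalnode.journalnode.NodeExecutor.1428609184051"]},
    reported_executor_ids=set())  # slave reported no executors
```

In the healthy path, the slave's reregistration message lists the checkpointed executor, so it lands in `kept` and no TASK_LOST is generated.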



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
