We're encountering issues where our scheduler will never receive the TASK_FINISHED message, causing the scheduler driver to never be stopped.
Logs from scheduler: 2014-10-01 05:48:14,549 INFO [JobManager] >>> Running job DTdJgUhTEeSnPCIACwSpXg 2014-10-01 06:06:06,536 INFO [JobScheduler] >>> Job DTdJgUhTEeSnPCIACwSpXg: Registered as 20140927-080358-3165886730-5050-928-0069 to master '20140927-080358-3165886730-5050-928' 2014-10-01 06:06:06,538 INFO [JobScheduler] >>> Job DTdJgUhTEeSnPCIACwSpXg: Found matching offer, declining all other offers 2014-10-01 06:06:08,706 INFO [JobScheduler] >>> Job DTdJgUhTEeSnPCIACwSpXg: Got status update task_id { value: "DTdJgUhTEeSnPCIACwSpXg" } state: TASK_RUNNING slave_id { value: "20140818-235718-3165886730-5050-901-4" } timestamp: 1.41214356869682E9 Most of these logs are generated by our side, but they indicate that the driver is run, the framework successfully registered with the master, the framework successfully accepted offers, and the framework was successfully transmitted a TASK_RUNNING message. Logs from the master: /var/log/mesos/mesos-master.INFO:I1001 06:06:06.540736 951 master.hpp:655] Adding task DTdJgUhTEeSnPCIACwSpXg with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 (ip-10-51-165-231.ec2.internal) /var/log/mesos/mesos-master.INFO:I1001 06:06:06.541306 951 master.cpp:3111] Launching task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@ 10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) /var/log/mesos/mesos-master.INFO:I1001 06:06:08.699162 951 master.cpp:2628] Status update TASK_RUNNING (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 from slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) /var/log/mesos/mesos-master.INFO:I1001 06:16:15.295013 945 master.cpp:2628] Status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 from slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) /var/log/mesos/mesos-master.INFO:I1001 06:16:15.295222 945 master.hpp:673] Removing task DTdJgUhTEeSnPCIACwSpXg with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 (ip-10-51-165-231.ec2.internal) /var/log/mesos/mesos-master.INFO:W1001 06:16:25.295990 951 master.cpp:2621] Could not lookup task for status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 from slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) The "Could not lookup task" messages have been recurring ever since. Logs from the slave: /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543417 1207 slave.cpp:933] Got assigned task DTdJgUhTEeSnPCIACwSpXg for framework 20140927-080358-3165886730-5050-928-0069 /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543709 1207 slave.cpp:1043] Launching task DTdJgUhTEeSnPCIACwSpXg for framework 20140927-080358-3165886730-5050-928-0069 /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.544926 1207 slave.cpp:1153] Queuing task 'DTdJgUhTEeSnPCIACwSpXg' for executor default of framework '20140927-080358-3165886730-5050-928-0069 /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.644618 1205 slave.cpp:1783] Flushing queued task DTdJgUhTEeSnPCIACwSpXg for executor 'default' of framework 20140927-080358-3165886730-5050-928-0069 /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698570 1204 slave.cpp:2018] Handling status update TASK_RUNNING (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@ 10.51.165.231:42450 /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698796 1204 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698974 1204 status_update_manager.cpp:373] Forwarding status update TASK_RUNNING (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 to master@10.153.179.188:5050 /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.699262 1204 slave.cpp:2145] Sending acknowledgement for status update TASK_RUNNING (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 to executor(1)@ 10.51.165.231:42450 /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.711634 1208 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294246 1207 slave.cpp:2018] Handling status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@ 10.51.165.231:42450 /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294914 1205 status_update_manager.cpp:320] Received status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295012 1205 status_update_manager.cpp:373] Forwarding status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 to master@10.153.179.188:5050 /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295220 1205 slave.cpp:2145] Sending acknowledgement for status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 to executor(1)@ 10.51.165.231:42450 /var/log/mesos/mesos-slave.INFO:W1001 06:16:25.295886 1202 status_update_manager.cpp:477] Resending status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 /var/log/mesos/mesos-slave.INFO:I1001 06:16:25.296064 1202 status_update_manager.cpp:373] Forwarding status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 to master@10.153.179.188:5050 As with the master logs, the "Resending"/"Forwarding" log lines have been recurring ever since. Finally, logs from the executor indicate that the TASK_FINISHED message had been sent. One oddity here is that some log lines that should be printed by the executor after the TASK_FINISHED message is sent never appear -- I'm guessing this is some kind of race condition, which explains why this situation only happens occasionally. That said, there are no logs about the executor's killTask, shutdown, or error methods being invoked, so I'm not sure what is racing here. Any thoughts on what is going on here/how to debug would be appreciated. Thanks! =)