We're encountering issues where our scheduler will never receive the
TASK_FINISHED message, causing the scheduler driver to never be stopped.

Logs from scheduler:

2014-10-01 05:48:14,549 INFO [JobManager] >>> Running job
DTdJgUhTEeSnPCIACwSpXg
2014-10-01 06:06:06,536 INFO [JobScheduler] >>> Job DTdJgUhTEeSnPCIACwSpXg:
Registered as 20140927-080358-3165886730-5050-928-0069 to master
'20140927-080358-3165886730-5050-928'
2014-10-01 06:06:06,538 INFO [JobScheduler] >>> Job DTdJgUhTEeSnPCIACwSpXg:
Found matching offer, declining all other offers
2014-10-01 06:06:08,706 INFO [JobScheduler] >>> Job DTdJgUhTEeSnPCIACwSpXg:
Got status update task_id {
  value: "DTdJgUhTEeSnPCIACwSpXg"
}
state: TASK_RUNNING
slave_id {
  value: "20140818-235718-3165886730-5050-901-4"
}
timestamp: 1.41214356869682E9

Most of these logs are generated by our side, but they indicate that the
driver is run, the framework successfully registered with the master, the
framework successfully accepted offers, and the framework was successfully
transmitted a TASK_RUNNING message.

Logs from the master:

/var/log/mesos/mesos-master.INFO:I1001 06:06:06.540736   951
master.hpp:655] Adding task DTdJgUhTEeSnPCIACwSpXg with resources
cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4
(ip-10-51-165-231.ec2.internal)
/var/log/mesos/mesos-master.INFO:I1001 06:06:06.541306   951
master.cpp:3111] Launching task DTdJgUhTEeSnPCIACwSpXg of framework
20140927-080358-3165886730-5050-928-0069 with resources cpus(*):1;
mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@
10.51.165.231:5051 (ip-10-51-165-231.ec2.internal)
/var/log/mesos/mesos-master.INFO:I1001 06:06:08.699162   951
master.cpp:2628] Status update TASK_RUNNING (UUID:
0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
framework 20140927-080358-3165886730-5050-928-0069 from slave
20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
(ip-10-51-165-231.ec2.internal)
/var/log/mesos/mesos-master.INFO:I1001 06:16:15.295013   945
master.cpp:2628] Status update TASK_FINISHED (UUID:
587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
framework 20140927-080358-3165886730-5050-928-0069 from slave
20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
(ip-10-51-165-231.ec2.internal)
/var/log/mesos/mesos-master.INFO:I1001 06:16:15.295222   945
master.hpp:673] Removing task DTdJgUhTEeSnPCIACwSpXg with resources
cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4
(ip-10-51-165-231.ec2.internal)
/var/log/mesos/mesos-master.INFO:W1001 06:16:25.295990   951
master.cpp:2621] Could not lookup task for status update TASK_FINISHED
(UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
DTdJgUhTEeSnPCIACwSpXg of framework
20140927-080358-3165886730-5050-928-0069 from slave
20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
(ip-10-51-165-231.ec2.internal)

The "Could not lookup task" messages have been recurring ever since.

Logs from the slave:

/var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543417  1207 slave.cpp:933]
Got assigned task DTdJgUhTEeSnPCIACwSpXg for framework
20140927-080358-3165886730-5050-928-0069
/var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543709  1207 slave.cpp:1043]
Launching task DTdJgUhTEeSnPCIACwSpXg for framework
20140927-080358-3165886730-5050-928-0069
/var/log/mesos/mesos-slave.INFO:I1001 06:06:06.544926  1207 slave.cpp:1153]
Queuing task 'DTdJgUhTEeSnPCIACwSpXg' for executor default of framework
'20140927-080358-3165886730-5050-928-0069
/var/log/mesos/mesos-slave.INFO:I1001 06:06:08.644618  1205 slave.cpp:1783]
Flushing queued task DTdJgUhTEeSnPCIACwSpXg for executor 'default' of
framework 20140927-080358-3165886730-5050-928-0069
/var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698570  1204 slave.cpp:2018]
Handling status update TASK_RUNNING (UUID:
0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@
10.51.165.231:42450
/var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698796  1204
status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID:
0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
framework 20140927-080358-3165886730-5050-928-0069
/var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698974  1204
status_update_manager.cpp:373] Forwarding status update TASK_RUNNING (UUID:
0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
framework 20140927-080358-3165886730-5050-928-0069 to
master@10.153.179.188:5050
/var/log/mesos/mesos-slave.INFO:I1001 06:06:08.699262  1204 slave.cpp:2145]
Sending acknowledgement for status update TASK_RUNNING (UUID:
0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
framework 20140927-080358-3165886730-5050-928-0069 to executor(1)@
10.51.165.231:42450
/var/log/mesos/mesos-slave.INFO:I1001 06:06:08.711634  1208
status_update_manager.cpp:398] Received status update acknowledgement
(UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task
DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069
/var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294246  1207 slave.cpp:2018]
Handling status update TASK_FINISHED (UUID:
587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@
10.51.165.231:42450
/var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294914  1205
status_update_manager.cpp:320] Received status update TASK_FINISHED (UUID:
587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
framework 20140927-080358-3165886730-5050-928-0069
/var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295012  1205
status_update_manager.cpp:373] Forwarding status update TASK_FINISHED
(UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
DTdJgUhTEeSnPCIACwSpXg of framework
20140927-080358-3165886730-5050-928-0069 to master@10.153.179.188:5050
/var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295220  1205 slave.cpp:2145]
Sending acknowledgement for status update TASK_FINISHED (UUID:
587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
framework 20140927-080358-3165886730-5050-928-0069 to executor(1)@
10.51.165.231:42450
/var/log/mesos/mesos-slave.INFO:W1001 06:16:25.295886  1202
status_update_manager.cpp:477] Resending status update TASK_FINISHED (UUID:
587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
framework 20140927-080358-3165886730-5050-928-0069
/var/log/mesos/mesos-slave.INFO:I1001 06:16:25.296064  1202
status_update_manager.cpp:373] Forwarding status update TASK_FINISHED
(UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
DTdJgUhTEeSnPCIACwSpXg of framework
20140927-080358-3165886730-5050-928-0069 to master@10.153.179.188:5050

As with the master logs, the "Resending"/"Forwarding" log lines have been
recurring ever since.

Finally, logs from the executor indicate that the TASK_FINISHED message had
been sent. One oddity here is that some log lines that should be printed by
the executor after the TASK_FINISHED message is sent never appear -- I'm
guessing this is some kind of race condition, which explains why this
situation only happens occasionally. That said, there are no logs about the
executor's killTask, shutdown, or error methods being invoked, so I'm not
sure what is racing here.

Any thoughts on what is going on here/how to debug would be appreciated.
Thanks! =)

Reply via email to