[ https://issues.apache.org/jira/browse/MESOS-7752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072608#comment-16072608 ]
Till Toenshoff commented on MESOS-7752:
---------------------------------------

It does indeed sound like a race in command-executor shutdown vs. terminal task status update.

> Command executor still active after terminal task state update.
> ---------------------------------------------------------------
>
>                 Key: MESOS-7752
>                 URL: https://issues.apache.org/jira/browse/MESOS-7752
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: A. Dukhovniy
>
> Here is a rather simple scenario to reproduce this error:
> * Framework starts a task with taskId = _task1_
> * Framework kills _task1_ *successfully* and *acknowledges* TASK_KILLED
> * Framework starts another task with the same taskId _task1_ but receives
> "_TASK_FAILED (Attempted to run multiple tasks using a "command" executor)_"
> *Note*: this test is racy so this scenario fails occasionally.
> *Here is a full log* that shows the life cycle of the task id
> _app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c_:
> {code:java}
> # Starting...
> WARN [10:51:14 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
> 10:51:14.476085 14666 master.cpp:3352] Authorizing framework principal
> 'principal' to launch task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> WARN [10:51:14 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
> 10:51:14.510136 14666 master.cpp:4426] Launching task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 (marathon) at
> scheduler-6dbbac16-7355-4a33-aee6-b9697c83e51c@127.0.1.1:61567 with
> resources...
> WARN [10:51:14 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
> 10:51:14.513908 14697 slave.cpp:2118] Queued task
> 'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
> for executor
> 'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
> 10:51:15.011696 14671 master.cpp:6222] Forwarding status update TASK_RUNNING
> (UUID: ed2d0475-9d83-4e09-9f54-5b4d323e4558) for task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
> 10:51:15.036391 14671 master.cpp:5092] Processing ACKNOWLEDGE call
> ed2d0475-9d83-4e09-9f54-5b4d323e4558 for task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 (marathon) at
> scheduler-6dbbac16-7355-4a33-aee6-b9697c83e51c@127.0.1.1:61567 on agent
> 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-S0
> {code}
> {code:java}
> # Killing...
> DEBUG[10:51:15 ResidentTaskIntegrationTest-LocalMarathon-32800] WARN
> [10:51:15 KillAction$] Killing known task
> [app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c]
> of instance instance
> [app-restart-resident-app-with-five-instances.marathon-8882bd16-5fdd-11e7-a00e-0242aceef95c]
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
> 10:51:15.196702 14697 slave.cpp:3816] Handling status update TASK_KILLED
> (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 from
> executor(1)@172.16.10.121:35184
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
> 10:51:15.197676 14697 slave.cpp:4166] Sending acknowledgement for status
> update TASK_KILLED (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 to
> executor(1)@172.16.10.121:35184
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
> 10:51:15.198299 14671 master.cpp:6154] Status update TASK_KILLED (UUID:
> f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 from agent
> 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-S0 at slave(1)@172.16.10.121:32788
> (172.16.10.121)
> DEBUG[10:51:15 ResidentTaskIntegrationTest-LocalMarathon-32800] INFO
> [10:51:15 MarathonScheduler] Received status update for task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c:
> TASK_KILLED (Command terminated with signal Terminated)
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
> 10:51:15.216081 14671 master.cpp:5092] Processing ACKNOWLEDGE call
> f7e9d0bc-726c-43aa-9ddc-3b082a68642e for task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 (marathon) at
> scheduler-6dbbac16-7355-4a33-aee6-b9697c83e51c@127.0.1.1:61567 on agent
> 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-S0
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
> 10:51:15.216107 14671 master.cpp:8396] Removing task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> with resources...
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
> 10:51:15.216667 14697 status_update_manager.cpp:395] Received status update
> acknowledgement (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
> 10:51:15.216722 14697 status_update_manager.cpp:832] Checkpointing ACK for
> status update TASK_KILLED (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for
> task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
> {code}
> {code:java}
> # and starting again:
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703
> 10:51:15.247561 14671 master.cpp:3352] Authorizing framework principal
> 'principal' to launch task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
> 10:51:15.252348 14697 slave.cpp:1625] Got assigned task
> 'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
> for framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
> 10:51:15.252707 14697 slave.cpp:1785] Launching task
> 'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
> for framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703
> 10:51:15.253159 14697 slave.cpp:2140] Queued task
> 'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
> for executor
> 'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
> of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-0000 at
> executor(1)@172.16.10.121:35184
> DEBUG[10:51:15 ResidentTaskIntegrationTest-LocalMarathon-32800] INFO
> [10:51:15 MarathonScheduler] Received status update for task
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c:
> TASK_FAILED (Attempted to run multiple tasks using a "command" executor)
> {code}
> *TL;DR*: the framework receives and acknowledges the _TASK_KILLED_ status but
> fails to restart the task because _"Attempted to run multiple tasks using a
> "command" executor"_
> Though reusing task IDs is discouraged
> {code:java}
> /**
>  * A framework-generated ID to distinguish a task. The ID must remain
>  * unique while the task is active. A framework can reuse an ID _only_
>  * if the previous task with the same ID has reached a terminal state
>  * (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.). However,
>  * reusing task IDs is strongly discouraged (MESOS-2198).
>  */{code}
> it is acceptable after receiving a terminal task status, which happened above.
> *Possible cause*:
> I assume that occasionally the executor is not yet cleaned up and is reused
> during task restart. This, however, fails here:
> https://github.com/apache/mesos/blob/35dd2b600b8af0204d03c4ee5348a1a6672b136c/src/launcher/executor.cpp#L512
> /cc [~tillt]
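The suspected race can be sketched as a toy model. This is only an illustration of the ordering described in the report, not Mesos code: `CommandExecutor`, `Agent`, and `restart_with_same_id` are hypothetical names, and the model assumes (as the logs suggest, since the second task is queued for the same `executor(1)@172.16.10.121:35184`) that the agent hands a relaunched task to the old executor whenever that executor has not yet exited.

```python
from dataclasses import dataclass, field


@dataclass
class CommandExecutor:
    """Hypothetical stand-in for a command executor: it accepts one task only."""
    launched: bool = False
    exited: bool = False

    def launch(self, task_id: str) -> str:
        if self.launched:
            # Mirrors the failure message quoted in the report.
            return ('TASK_FAILED (Attempted to run multiple tasks '
                    'using a "command" executor)')
        self.launched = True
        return "TASK_RUNNING"

    def kill(self) -> str:
        # The terminal update is sent here; the executor exits at some later point.
        return "TASK_KILLED"


@dataclass
class Agent:
    """Hypothetical agent that keys executors by task ID, as in the logs."""
    executors: dict = field(default_factory=dict)

    def run_task(self, task_id: str) -> str:
        executor = self.executors.get(task_id)
        if executor is None or executor.exited:
            # No live executor for this ID: start a fresh one.
            executor = CommandExecutor()
            self.executors[task_id] = executor
        # Otherwise the task is queued for the executor that is still registered.
        return executor.launch(task_id)

    def kill_task(self, task_id: str) -> str:
        return self.executors[task_id].kill()

    def executor_exited(self, task_id: str) -> None:
        self.executors[task_id].exited = True


def restart_with_same_id(executor_exits_before_relaunch: bool) -> list:
    """Start, kill, and restart a task with the same ID; return the status updates."""
    agent = Agent()
    task_id = "task1"
    updates = [agent.run_task(task_id), agent.kill_task(task_id)]
    if executor_exits_before_relaunch:
        agent.executor_exited(task_id)
    updates.append(agent.run_task(task_id))
    return updates
```

Calling `restart_with_same_id(True)` models the ordering in which the old executor is gone before the relaunch, and `restart_with_same_id(False)` models the ordering in which the relaunch arrives first. Only the second ordering ends in the `TASK_FAILED` update seen in the log, which would be consistent with the failure appearing only occasionally.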