[ https://issues.apache.org/jira/browse/AURORA-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383006#comment-16383006 ]
Renan DelValle commented on AURORA-1966: ---------------------------------------- [~davmclau] should we close this since the patch was committed? > TASK_UNKNOWN to PARTITIONED mapping puts Scheduler to kill non-exist Task > indefinitely > -------------------------------------------------------------------------------------- > > Key: AURORA-1966 > URL: https://issues.apache.org/jira/browse/AURORA-1966 > Project: Aurora > Issue Type: Bug > Reporter: Santhosh Kumar Shanmugham > Assignee: David McLaughlin > Priority: Major > > When a Task launch fails, it is moved from ASSIGNED to LOST, which performs a > RESCHEDULE and KILL. Unfortunately the KILL of a non-existent task to the > Mesos master results in a TASK_UNKNOWN status update, which gets mapped to > PARTITIONED. While the transition from LOST to PARTITIONED is not allowed, > some callbacks get executed despite the fact, resulting in a KILL and > RESCHEDULE action. This new KILL triggers another TASK_UNKNOWN and hence > PARTITIONED status update for the same task, putting the Scheduler to > indefinitely attempt KILLing the non-existent task. Attempting a client job > killall results in the same state for the scheduler. > Since the scheduler uses the LOST state for black-holing task the > {{TaskStateMachine}} needs to take those into account. > I was able to reproduce this in the Vagrant image by faking a launch failure. > {code:java} > I0124 05:48:23.198 [qtp1791010542-40, StateMachine] > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > state machine transition INIT -> PENDING I0124 05:48:23.213508 9748 > log.cpp:560] Attempting to append 1679 bytes to the log I0124 05:48:23.214570 > 9748 coordinator.cpp:348] Coordinator attempting to write APPEND action at > position 24778 I0124 05:48:23.214834 9748 replica.cpp:540] Replica received > write request for position 24778 from __req_res__(4)@192.168.33.7:8083 I0124 > 05:48:23.221982 9748 leveldb.cpp:341] Persisting action (1700 bytes) to > leveldb took 6.772102ms I0124 05:48:23.222174 9748 replica.cpp:711] Persisted > action APPEND at position 24778 I0124 05:48:23.222901 9748 replica.cpp:694] > Replica received learned notice for position 24778 from > log-network(1)@192.168.33.7:8083 I0124 05:48:23.226833 9748 leveldb.cpp:341] > Persisting action (1702 bytes) to leveldb took 3.227779ms I0124 > 05:48:23.227008 9748 replica.cpp:711] Persisted action APPEND at position > 24778 I0124 05:48:23.262 [qtp1791010542-40, RequestLog] 127.0.0.1 - - > [24/Jan/2018:05:48:23 +0000] "POST //aurora.local/api HTTP/1.1" 200 78 I0124 > 05:48:23.267 [qtp1791010542-40, LoggingInterceptor] > getTasksWithoutConfigs(TaskQuery(role:null, environment:null, jobName:null, > taskIds:null, statuses:null, instanceIds:null, slaveHosts:null, > jobKeys:[JobKey(role:vagrant, environment:test, > name:fail-partition_aware_disabled)], offset:0, limit:0)) I0124 05:48:23.285 > [qtp1791010542-40, RequestLog] 127.0.0.1 - - [24/Jan/2018:05:48:23 +0000] > "POST //aurora.local/api HTTP/1.1" 200 794 I0124 05:48:23.349 > [TaskGroupBatchWorker, StateMachine] Callback transition PENDING to ASSIGNED, > allow: true I0124 05:48:23.353 [TaskGroupBatchWorker, StateMachine] > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > state machine transition PENDING -> ASSIGNED I0124 05:48:23.356 > [TaskGroupBatchWorker, TaskAssignerImpl] Offer on agent 192.168.33.7 (id > fe8bc641-aa02-4363-a990-318d20de1bac-S0) is being assigned task for > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606. > W0124 05:48:23.445 [TaskGroupBatchWorker, TaskAssignerImpl] Failed to launch > task. org.apache.aurora.scheduler.offers.OfferManager$LaunchException: Failed > to launch task. at > org.apache.aurora.scheduler.offers.OfferManagerImpl.launchTask(OfferManagerImpl.java:212) > at > org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) > at > org.apache.aurora.scheduler.scheduling.TaskAssignerImpl.launchUsingOffer(TaskAssignerImpl.java:126) > at > org.apache.aurora.scheduler.scheduling.TaskAssignerImpl.maybeAssign(TaskAssignerImpl.java:262) > at > org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) > at > org.apache.aurora.scheduler.scheduling.TaskSchedulerImpl.scheduleTasks(TaskSchedulerImpl.java:154) > at > org.apache.aurora.scheduler.scheduling.TaskSchedulerImpl.schedule(TaskSchedulerImpl.java:108) > at > org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) > at > org.apache.aurora.scheduler.scheduling.TaskGroups$1.lambda$run$0(TaskGroups.java:174) > at org.apache.aurora.scheduler.BatchWorker$Work.apply(BatchWorker.java:117) > at > org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210) > at > org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144) > at > org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139) > at > org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199) > at > org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89) > at > org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) > at > org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198) > at > org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221) > at > org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132) > at > org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207) at > org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199) at > com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66) > at com.google.common.util.concurrent.Callables$4.run(Callables.java:122) at > java.lang.Thread.run(Thread.java:748) Caused by: > java.lang.IllegalStateException: mocking launch failure at > org.apache.aurora.scheduler.offers.OfferManagerImpl.launchTask(OfferManagerImpl.java:201) > ... 23 common frames omitted I0124 05:48:23.446 [TaskGroupBatchWorker, > StateMachine] Callback transition ASSIGNED to LOST, allow: true I0124 > 05:48:23.447 [TaskGroupBatchWorker, StateMachine] > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > state machine transition ASSIGNED -> LOST I0124 05:48:23.449 > [TaskGroupBatchWorker, StateManagerImpl] Task being rescheduled: > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > I0124 05:48:23.454 [TaskGroupBatchWorker, StateMachine] Callback transition > INIT to PENDING, allow: true I0124 05:48:23.456 [TaskGroupBatchWorker, > StateMachine] > vagrant-test-fail-partition_aware_disabled-0-a882eee4-b214-49e1-be9b-1b60d119ed60 > state machine transition INIT -> PENDING I0124 05:48:23.473 > [TaskGroupBatchWorker, StateManagerImpl] Task being killed: > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > I0124 05:48:23.476 [TaskGroupBatchWorker, VersionedSchedulerDriverService] > Killing task > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > I0124 05:48:23.496659 9751 scheduler.cpp:243] Sending KILL call to > http://192.168.33.7:5050/master/api/v1/scheduler I0124 05:48:23.499577 9749 > scheduler.cpp:676] Enqueuing event UPDATE received from > http://192.168.33.7:5050/master/api/v1/scheduler W0124 05:48:23.505 > [AsyncProcessor-4, TaskVars] Failed to find rack attribute associated with > host 192.168.33.7 W0124 05:48:23.515 [AsyncProcessor-4, Stats] Re-using > already registered variable for key > task_delivery_delay_SOURCE_MASTER_timeouts_per_sec W0124 05:48:23.515 > [AsyncProcessor-4, Stats] Re-using already registered variable for key > task_delivery_delay_SOURCE_MASTER_requests_per_sec I0124 05:48:23.524458 9754 > log.cpp:560] Attempting to append 3625 bytes to the log I0124 05:48:23.524636 > 9754 coordinator.cpp:348] Coordinator attempting to write APPEND action at > position 24779 I0124 05:48:23.525111 9754 replica.cpp:540] Replica received > write request for position 24779 from __req_res__(5)@192.168.33.7:8083 I0124 > 05:48:23.526832 9754 leveldb.cpp:341] Persisting action (3646 bytes) to > leveldb took 1.612374ms I0124 05:48:23.526927 9754 replica.cpp:711] Persisted > action APPEND at position 24779 I0124 05:48:23.527189 9754 replica.cpp:694] > Replica received learned notice for position 24779 from > log-network(1)@192.168.33.7:8083 I0124 05:48:23.528949 9754 leveldb.cpp:341] > Persisting action (3648 bytes) to leveldb took 1.543755ms I0124 > 05:48:23.529088 9754 replica.cpp:711] Persisted action APPEND at position > 24779 I0124 05:48:23.538 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] > Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > moved to status:PARTITIONED for mesos state: TASK_UNKNOWN I0124 05:48:23.540 > [TaskStatusHandlerImpl, StateMachine] Callback transition LOST to > PARTITIONED, allow: false I0124 05:48:23.540 [TaskStatusHandlerImpl, > StateMachine] > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > state machine transition LOST -> PARTITIONED (not allowed) I0124 > 05:48:23.540 [TaskStatusHandlerImpl, TaskStateMachine] LOST to PARTITIONED => > KILL E0124 05:48:23.540 [TaskStatusHandlerImpl, TaskStateMachine] Illegal > state transition attempted: LOST -> PARTITIONED (not allowed) I0124 > 05:48:23.541 [TaskStatusHandlerImpl, StateManagerImpl] Task being killed: > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > I0124 05:48:23.542 [TaskStatusHandlerImpl, VersionedSchedulerDriverService] > Killing task > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > I0124 05:48:23.543357 9749 scheduler.cpp:243] Sending KILL call to > http://192.168.33.7:5050/master/api/v1/scheduler W0124 05:48:23.546 > [TaskGroupBatchWorker, TaskSchedulerImpl] Failed to look up tasks > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > I0124 05:48:23.587267 9751 scheduler.cpp:676] Enqueuing event UPDATE > received from http://192.168.33.7:5050/master/api/v1/scheduler I0124 > 05:48:23.593 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] > Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > moved to status:PARTITIONED for mesos state: TASK_UNKNOWN I0124 05:48:23.595 > [TaskStatusHandlerImpl, StateMachine] Callback transition LOST to > PARTITIONED, allow: false I0124 05:48:23.595 [TaskStatusHandlerImpl, > StateMachine] > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > state machine transition LOST -> PARTITIONED (not allowed) I0124 > 05:48:23.595 [TaskStatusHandlerImpl, TaskStateMachine] LOST to PARTITIONED => > KILL E0124 05:48:23.595 [TaskStatusHandlerImpl, TaskStateMachine] Illegal > state transition attempted: LOST -> PARTITIONED (not allowed) I0124 > 05:48:23.595 [TaskStatusHandlerImpl, StateManagerImpl] Task being killed: > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > I0124 05:48:23.595 [TaskStatusHandlerImpl, VersionedSchedulerDriverService] > Killing task > vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > I0124 05:48:23.596256 9747 scheduler.cpp:243] Sending KILL call to > http://192.168.33.7:5050/master/api/v1/scheduler I0124 05:48:23.634858 9751 > scheduler.cpp:676] Enqueuing event UPDATE received from > http://192.168.33.7:5050/master/api/v1/scheduler I0124 05:48:23.636 > [TaskStatusHandlerImpl, TaskStatusHandlerImpl] > Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 > moved to status:PARTITIONED for mesos state: TASK_UNKNOWN > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)