[ 
https://issues.apache.org/jira/browse/AURORA-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383006#comment-16383006
 ] 

Renan DelValle commented on AURORA-1966:
----------------------------------------

[~davmclau] should we close this since the patch was committed?

> TASK_UNKNOWN to PARTITIONED mapping causes the Scheduler to kill a 
> non-existent task indefinitely
> --------------------------------------------------------------------------------------
>
>                 Key: AURORA-1966
>                 URL: https://issues.apache.org/jira/browse/AURORA-1966
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Santhosh Kumar Shanmugham
>            Assignee: David McLaughlin
>            Priority: Major
>
> When a task launch fails, the task is moved from ASSIGNED to LOST, which 
> triggers a RESCHEDULE and a KILL. Unfortunately, the KILL of a non-existent 
> task sent to the Mesos master results in a TASK_UNKNOWN status update, 
> which gets mapped to PARTITIONED. Although the transition from LOST to 
> PARTITIONED is not allowed, some callbacks are executed anyway, resulting 
> in another KILL and RESCHEDULE action. This new KILL triggers another 
> TASK_UNKNOWN, and hence another PARTITIONED status update for the same 
> task, leaving the Scheduler attempting to KILL the non-existent task 
> indefinitely. Running a client {{job killall}} puts the scheduler in the 
> same state.
> Since the scheduler uses the LOST state for black-holing tasks, the 
> {{TaskStateMachine}} needs to take those transitions into account.
> I was able to reproduce this in the Vagrant image by faking a launch failure.
> {code:java}
> I0124 05:48:23.198 [qtp1791010542-40, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition INIT -> PENDING
> I0124 05:48:23.213508 9748 log.cpp:560] Attempting to append 1679 bytes to the log
> I0124 05:48:23.214570 9748 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 24778
> I0124 05:48:23.214834 9748 replica.cpp:540] Replica received write request for position 24778 from __req_res__(4)@192.168.33.7:8083
> I0124 05:48:23.221982 9748 leveldb.cpp:341] Persisting action (1700 bytes) to leveldb took 6.772102ms
> I0124 05:48:23.222174 9748 replica.cpp:711] Persisted action APPEND at position 24778
> I0124 05:48:23.222901 9748 replica.cpp:694] Replica received learned notice for position 24778 from log-network(1)@192.168.33.7:8083
> I0124 05:48:23.226833 9748 leveldb.cpp:341] Persisting action (1702 bytes) to leveldb took 3.227779ms
> I0124 05:48:23.227008 9748 replica.cpp:711] Persisted action APPEND at position 24778
> I0124 05:48:23.262 [qtp1791010542-40, RequestLog] 127.0.0.1 - - [24/Jan/2018:05:48:23 +0000] "POST //aurora.local/api HTTP/1.1" 200 78
> I0124 05:48:23.267 [qtp1791010542-40, LoggingInterceptor] getTasksWithoutConfigs(TaskQuery(role:null, environment:null, jobName:null, taskIds:null, statuses:null, instanceIds:null, slaveHosts:null, jobKeys:[JobKey(role:vagrant, environment:test, name:fail-partition_aware_disabled)], offset:0, limit:0))
> I0124 05:48:23.285 [qtp1791010542-40, RequestLog] 127.0.0.1 - - [24/Jan/2018:05:48:23 +0000] "POST //aurora.local/api HTTP/1.1" 200 794
> I0124 05:48:23.349 [TaskGroupBatchWorker, StateMachine] Callback transition PENDING to ASSIGNED, allow: true
> I0124 05:48:23.353 [TaskGroupBatchWorker, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition PENDING -> ASSIGNED
> I0124 05:48:23.356 [TaskGroupBatchWorker, TaskAssignerImpl] Offer on agent 192.168.33.7 (id fe8bc641-aa02-4363-a990-318d20de1bac-S0) is being assigned task for vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606.
> W0124 05:48:23.445 [TaskGroupBatchWorker, TaskAssignerImpl] Failed to launch task.
> org.apache.aurora.scheduler.offers.OfferManager$LaunchException: Failed to launch task.
>     at org.apache.aurora.scheduler.offers.OfferManagerImpl.launchTask(OfferManagerImpl.java:212)
>     at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
>     at org.apache.aurora.scheduler.scheduling.TaskAssignerImpl.launchUsingOffer(TaskAssignerImpl.java:126)
>     at org.apache.aurora.scheduler.scheduling.TaskAssignerImpl.maybeAssign(TaskAssignerImpl.java:262)
>     at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
>     at org.apache.aurora.scheduler.scheduling.TaskSchedulerImpl.scheduleTasks(TaskSchedulerImpl.java:154)
>     at org.apache.aurora.scheduler.scheduling.TaskSchedulerImpl.schedule(TaskSchedulerImpl.java:108)
>     at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
>     at org.apache.aurora.scheduler.scheduling.TaskGroups$1.lambda$run$0(TaskGroups.java:174)
>     at org.apache.aurora.scheduler.BatchWorker$Work.apply(BatchWorker.java:117)
>     at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
>     at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
>     at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
>     at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
>     at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
>     at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
>     at org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
>     at org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
>     at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
>     at org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
>     at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
>     at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
>     at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: mocking launch failure
>     at org.apache.aurora.scheduler.offers.OfferManagerImpl.launchTask(OfferManagerImpl.java:201)
>     ... 23 common frames omitted
> I0124 05:48:23.446 [TaskGroupBatchWorker, StateMachine] Callback transition ASSIGNED to LOST, allow: true
> I0124 05:48:23.447 [TaskGroupBatchWorker, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition ASSIGNED -> LOST
> I0124 05:48:23.449 [TaskGroupBatchWorker, StateManagerImpl] Task being rescheduled: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
> I0124 05:48:23.454 [TaskGroupBatchWorker, StateMachine] Callback transition INIT to PENDING, allow: true
> I0124 05:48:23.456 [TaskGroupBatchWorker, StateMachine] vagrant-test-fail-partition_aware_disabled-0-a882eee4-b214-49e1-be9b-1b60d119ed60 state machine transition INIT -> PENDING
> I0124 05:48:23.473 [TaskGroupBatchWorker, StateManagerImpl] Task being killed: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
> I0124 05:48:23.476 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Killing task vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
> I0124 05:48:23.496659 9751 scheduler.cpp:243] Sending KILL call to http://192.168.33.7:5050/master/api/v1/scheduler
> I0124 05:48:23.499577 9749 scheduler.cpp:676] Enqueuing event UPDATE received from http://192.168.33.7:5050/master/api/v1/scheduler
> W0124 05:48:23.505 [AsyncProcessor-4, TaskVars] Failed to find rack attribute associated with host 192.168.33.7
> W0124 05:48:23.515 [AsyncProcessor-4, Stats] Re-using already registered variable for key task_delivery_delay_SOURCE_MASTER_timeouts_per_sec
> W0124 05:48:23.515 [AsyncProcessor-4, Stats] Re-using already registered variable for key task_delivery_delay_SOURCE_MASTER_requests_per_sec
> I0124 05:48:23.524458 9754 log.cpp:560] Attempting to append 3625 bytes to the log
> I0124 05:48:23.524636 9754 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 24779
> I0124 05:48:23.525111 9754 replica.cpp:540] Replica received write request for position 24779 from __req_res__(5)@192.168.33.7:8083
> I0124 05:48:23.526832 9754 leveldb.cpp:341] Persisting action (3646 bytes) to leveldb took 1.612374ms
> I0124 05:48:23.526927 9754 replica.cpp:711] Persisted action APPEND at position 24779
> I0124 05:48:23.527189 9754 replica.cpp:694] Replica received learned notice for position 24779 from log-network(1)@192.168.33.7:8083
> I0124 05:48:23.528949 9754 leveldb.cpp:341] Persisting action (3648 bytes) to leveldb took 1.543755ms
> I0124 05:48:23.529088 9754 replica.cpp:711] Persisted action APPEND at position 24779
> I0124 05:48:23.538 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 moved to status:PARTITIONED for mesos state: TASK_UNKNOWN
> I0124 05:48:23.540 [TaskStatusHandlerImpl, StateMachine] Callback transition LOST to PARTITIONED, allow: false
> I0124 05:48:23.540 [TaskStatusHandlerImpl, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition LOST -> PARTITIONED (not allowed)
> I0124 05:48:23.540 [TaskStatusHandlerImpl, TaskStateMachine] LOST to PARTITIONED => KILL
> E0124 05:48:23.540 [TaskStatusHandlerImpl, TaskStateMachine] Illegal state transition attempted: LOST -> PARTITIONED (not allowed)
> I0124 05:48:23.541 [TaskStatusHandlerImpl, StateManagerImpl] Task being killed: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
> I0124 05:48:23.542 [TaskStatusHandlerImpl, VersionedSchedulerDriverService] Killing task vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
> I0124 05:48:23.543357 9749 scheduler.cpp:243] Sending KILL call to http://192.168.33.7:5050/master/api/v1/scheduler
> W0124 05:48:23.546 [TaskGroupBatchWorker, TaskSchedulerImpl] Failed to look up tasks vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
> I0124 05:48:23.587267 9751 scheduler.cpp:676] Enqueuing event UPDATE received from http://192.168.33.7:5050/master/api/v1/scheduler
> I0124 05:48:23.593 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 moved to status:PARTITIONED for mesos state: TASK_UNKNOWN
> I0124 05:48:23.595 [TaskStatusHandlerImpl, StateMachine] Callback transition LOST to PARTITIONED, allow: false
> I0124 05:48:23.595 [TaskStatusHandlerImpl, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition LOST -> PARTITIONED (not allowed)
> I0124 05:48:23.595 [TaskStatusHandlerImpl, TaskStateMachine] LOST to PARTITIONED => KILL
> E0124 05:48:23.595 [TaskStatusHandlerImpl, TaskStateMachine] Illegal state transition attempted: LOST -> PARTITIONED (not allowed)
> I0124 05:48:23.595 [TaskStatusHandlerImpl, StateManagerImpl] Task being killed: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
> I0124 05:48:23.595 [TaskStatusHandlerImpl, VersionedSchedulerDriverService] Killing task vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
> I0124 05:48:23.596256 9747 scheduler.cpp:243] Sending KILL call to http://192.168.33.7:5050/master/api/v1/scheduler
> I0124 05:48:23.634858 9751 scheduler.cpp:676] Enqueuing event UPDATE received from http://192.168.33.7:5050/master/api/v1/scheduler
> I0124 05:48:23.636 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 moved to status:PARTITIONED for mesos state: TASK_UNKNOWN
> {code}
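> The failure mode in the log (a side-effect callback firing on a rejected 
> transition) can be reduced to a minimal sketch. All names here 
> ({{KillLoopSketch}}, {{countBuggyKills}}, {{countFixedKills}}) are 
> hypothetical and heavily simplified; this is not Aurora's actual 
> {{StateMachine}} code, only an illustration of the bug and the shape of a fix:

```java
// Hypothetical reduction of AURORA-1966: in the buggy version, the KILL
// side effect runs even though the LOST -> PARTITIONED edge is rejected,
// so every TASK_UNKNOWN status update issues another KILL for a task that
// no longer exists, which in turn yields another TASK_UNKNOWN.
import java.util.ArrayList;
import java.util.List;

public class KillLoopSketch {

  // Buggy behavior: the callback fires regardless of whether the
  // transition was allowed. Returns the number of KILLs issued for a
  // given number of incoming TASK_UNKNOWN (-> PARTITIONED) updates.
  static int countBuggyKills(int unknownUpdates) {
    List<String> actions = new ArrayList<>();
    for (int i = 0; i < unknownUpdates; i++) {
      boolean allowed = false;     // LOST -> PARTITIONED is not a legal edge
      actions.add("KILL");         // bug: side effect runs despite rejection
    }
    return actions.size();
  }

  // Sketch of the fix: run side-effect callbacks only when the state
  // machine actually accepts the transition, so a black-holed (LOST)
  // task absorbs spurious updates without re-issuing KILLs.
  static int countFixedKills(int unknownUpdates) {
    List<String> actions = new ArrayList<>();
    for (int i = 0; i < unknownUpdates; i++) {
      boolean allowed = false;
      if (allowed) {
        actions.add("KILL");
      }
    }
    return actions.size();
  }

  public static void main(String[] args) {
    System.out.println("buggy KILLs issued: " + countBuggyKills(3));
    System.out.println("fixed KILLs issued: " + countFixedKills(3));
  }
}
```

> In the buggy variant every spurious update produces a KILL, and each KILL of 
> the unknown task produces the next spurious update, hence the unbounded loop 
> seen above; guarding the callbacks on the transition verdict breaks the cycle.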



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
