[GitHub] [druid] rohangarg commented on a diff in pull request #12901: Fix race in TaskQueue.notifyStatus.

GitBox Sun, 14 Aug 2022 13:22:22 -0700


rohangarg commented on code in PR #12901:
URL: https://github.com/apache/druid/pull/12901#discussion_r945334933



##########
indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java:
##########
@@ -82,14 +84,21 @@
  */
 public class TaskQueue
 {
-  private final long MANAGEMENT_WAIT_TIMEOUT_NANOS = 
TimeUnit.SECONDS.toNanos(60);
-  private final long MIN_WAIT_TIME_MS = 100;
+  private static final long MANAGEMENT_WAIT_TIMEOUT_NANOS = 
TimeUnit.SECONDS.toNanos(60);
+  private static final long MIN_WAIT_TIME_MS = 100;
 
+  // Task ID -> Task that is currently running

Review Comment:
   I think this map can contains tasks in pending state too. Also, this could 
be out-of-sync with task-storage as per documentation (upon manual changes in 
task-storage). Maybe, this can be called in-memory view of known tasks?



##########
indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskQueue.java:
##########
@@ -568,62 +591,65 @@ private void notifyStatus(final Task task, final 
TaskStatus taskStatus, String r
         taskStatus.getId()
     );
 
-    // Inform taskRunner that this task can be shut down
-    TaskLocation taskLocation = TaskLocation.unknown();
-    try {
-      taskLocation = taskRunner.getTaskLocation(task.getId());
-      taskRunner.shutdown(task.getId(), reasonFormat, args);
+    if (!taskStatus.isComplete()) {
+      // Nothing to do for incomplete statuses.
+      return;
     }
-    catch (Exception e) {
-      log.warn(e, "TaskRunner failed to cleanup task after completion: %s", 
task.getId());
-    }
-
-    int removed = 0;
-
-    ///////// critical section
 
+    // Critical section: add this task to recentlyCompletedTasks, so it isn't 
managed while being cleaned up.
     giant.lock();
     try {
-      // Remove from running tasks
-      for (int i = tasks.size() - 1; i >= 0; i--) {
-        if (tasks.get(i).getId().equals(task.getId())) {
-          removed++;
-          removeTaskInternal(tasks.get(i));
-          break;
-        }
-      }
-
-      // Remove from futures list
-      taskFutures.remove(task.getId());
+      recentlyCompletedTasks.add(task.getId());
     }
     finally {
       giant.unlock();
     }
 
-    ///////// end critical
+    final TaskLocation taskLocation = taskRunner.getTaskLocation(task.getId());
 
-    if (removed == 0) {
-      log.warn("Unknown task completed: %s", task.getId());
+    // Save status to metadata store first, so if we crash while doing the 
rest of the shutdown, our successor

Review Comment:
   doubt : does this order change also solve the race condition problem itself? 
As per my current understanding, re-launching can happen only if 
`syncFromStorage` thinks that there is an active task in metadata storage and 
the in-memory view doesn't know about it. That leads to creation of in-memory 
task which then management thread launches.
   
   With this change, the `syncFromStorage` can have three views : 
   1. `tasks` map contains the task and task-storage has it as active task
   2. `tasks` map contains the task and task-storage doesn't have it as active 
task in which case both `syncFromStorage` and `notifyStatus` will clean it up
   3. `tasks` map doesn't contain the task and task-storage doesn't have it too
   
   I think all 3 cases should be ok, but I maybe missing something here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org

[GitHub] [druid] rohangarg commented on a diff in pull request #12901: Fix race in TaskQueue.notifyStatus.

Reply via email to