gianm commented on PR #18766:
URL: https://github.com/apache/druid/pull/18766#issuecomment-3722064540

   I believe the Docker test failure is due to a new race in 
`IngestionSmokeTest#test_streamLogs_ofCancelledTask`. This code now fails 
because the `streamOptional` is empty:
   
   ```java
       eventCollector.latchableEmitter().waitForEvent(
           event -> event.hasMetricName("task/run/time")
                         .hasDimension(DruidMetrics.TASK_ID, taskId)
                         .hasDimension(DruidMetrics.TASK_STATUS, "FAILED")
       );
   
       final Optional<InputStream> streamOptional =
           overlord.bindings()
                   .getInstance(TaskLogStreamer.class)
                   .streamTaskLog(taskId, 0);
   
       Assertions.assertTrue(streamOptional.isPresent());
   ```
   
   It happens because, with this patch, the `task/run/time` metric is emitted 
in `notifyStatus`, after the metadata store is updated, the task runner is 
notified to stop the task, and locks are released. For a canceled task, this 
happens before it actually stops running. Formerly, the metric in this scenario 
would be emitted when the task runner reports the task as finished and fires 
its callback. So, now there's a window between when `task/run/time` is emitted 
and when the log is written to S3.
   
   I believe this little window where logs are not available shortly after 
tasks are canceled has always existed. I have seen it myself in production: 
sometimes you get a 404 on the task log for a recently-canceled task. But now 
it's visible to the test due to the timing of metric emission.
   
   I do believe that emitting metrics in `notifyStatus`, as this patch does, is 
better than what we were doing before. `notifyStatus` is the method that 
updates the metadata store, which is canonical, so tying metrics to that makes 
the metrics more accurate. (That was the bug this patch is fixing.)
   
   @kfaraz I'm wondering if you have a better idea to deal with the timing 
issue, beyond adding a sleep here. That's the best thing I can think of right 
now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to