kfaraz opened a new pull request, #19034:
URL: https://github.com/apache/druid/pull/19034

   ### Description
   
   During aggressive auto-scaling, tasks frequently fail with the error 
"Inconsistency between stored metadata and target state". 
The issue is typically self-healing, since the supervisor re-launches the failed 
tasks with updated offsets, but it adds operational overhead and often 
causes ingestion lag.
   
   ```java
   java.util.concurrent.ExecutionException: org.apache.druid.java.util.common.ISE:
     Failed to publish segments because of
   [Inconsistency between stored metadata state[KafkaDataSourceMetadata{}] and target state[KafkaDataSourceMetadata{}].
   ```
   
   The root cause behind this failure seems to be the following race condition:
   
   - Scaling event is triggered.
   - `changeTaskCount()` is called.
   - `checkTaskDuration()` tries to checkpoint the actively reading tasks and 
moves them to pending completion
   - `checkTaskDuration()` also updates the `partitionOffsets` with the latest 
result of the checkpointing
   - ⚠️ `clearAllocationInfo()` clears `partitionOffsets`
   - New task group B is created and is assigned a partition P1, which an old 
task group A (still pending completion) was also reading from.
   - ⚠️ (race) Task group B is initialized with the offsets present in the metadata 
store. These do not reflect the latest checkpoint, since task group A is 
yet to publish.
   - Task group A publishes offsets and updates the metadata store.
   - ❌ Task group B tries to publish and fails since the committed offsets have 
now diverged.
   
   The bug does not occur if task group A is able to finish publishing the 
offsets before task group B has been created.
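
   The race can be modeled with a small, self-contained sketch. This is a simplified illustration and not Druid code: `OffsetRaceSketch`, `MetadataStore`, `publish()` and `simulate()` are hypothetical names, and the offsets (100, 150) are made up. It assumes only that a publish succeeds when the task's start offset matches the offset currently committed in the metadata store.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the race. None of these names are Druid APIs. */
final class OffsetRaceSketch
{
  /** Stand-in for the metadata store: one committed offset per partition. */
  static final class MetadataStore
  {
    private final Map<String, Long> committed = new HashMap<>();

    MetadataStore(String partition, long offset)
    {
      committed.put(partition, offset);
    }

    long committedOffset(String partition)
    {
      return committed.get(partition);
    }

    /** Assumption: a publish succeeds only if it starts at the committed offset. */
    boolean publish(String partition, long startOffset, long endOffset)
    {
      if (committedOffset(partition) != startOffset) {
        // Corresponds to the "Inconsistency between stored metadata state
        // and target state" failure.
        return false;
      }
      committed.put(partition, endOffset);
      return true;
    }
  }

  /** Returns {did A publish, did B publish}. */
  static boolean[] simulate(boolean aPublishesBeforeBStarts)
  {
    MetadataStore store = new MetadataStore("P1", 100L);

    // Task group A started at 100 and was checkpointed at 150 (pending completion).
    long aStart = 100L;
    long aEnd = 150L;

    boolean aPublished = false;
    if (aPublishesBeforeBStarts) {
      aPublished = store.publish("P1", aStart, aEnd);
    }

    // The in-memory offsets were cleared, so B falls back to the metadata store.
    long bStart = store.committedOffset("P1");
    long bEnd = bStart + 50;

    if (!aPublishesBeforeBStarts) {
      aPublished = store.publish("P1", aStart, aEnd);
    }
    boolean bPublished = store.publish("P1", bStart, bEnd);
    return new boolean[]{aPublished, bPublished};
  }

  public static void main(String[] args)
  {
    boolean[] racy = simulate(false);
    System.out.println("B created before A publishes: A=" + racy[0] + ", B=" + racy[1]);
    boolean[] ordered = simulate(true);
    System.out.println("A publishes before B created: A=" + ordered[0] + ", B=" + ordered[1]);
  }
}
```

   In the model, task group B's publish is rejected only when B reads its start offset before A has published, which matches the ordering described above.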
   
   ### Changes
   
   - Do not clear `partitionOffsets` before auto-scaling so that subsequent 
tasks know where the previous tasks had left off.
   - Simplify the condition in `IndexerSQLMetadataStorageCoordinator`
   - Add some comments and javadocs
   - TESTS PENDING
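
   The effect of the first change can be sketched as follows. This is an illustrative model under stated assumptions, not the actual supervisor code: `StartOffsetSketch` and `resolveStartOffset()` are hypothetical names, and plain maps stand in for the in-memory `partitionOffsets` and for the offsets committed in the metadata store.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of how a new task group could pick its start offset. */
final class StartOffsetSketch
{
  /**
   * Prefers the retained in-memory offset (the latest checkpoint) and falls
   * back to the committed offset only when no in-memory entry exists.
   * Defaulting to 0 for an unknown partition is an assumption of this sketch.
   */
  static long resolveStartOffset(
      String partition,
      Map<String, Long> partitionOffsets,
      Map<String, Long> committedOffsets
  )
  {
    Long inMemory = partitionOffsets.get(partition);
    if (inMemory != null) {
      return inMemory;
    }
    return committedOffsets.getOrDefault(partition, 0L);
  }

  public static void main(String[] args)
  {
    Map<String, Long> committed = new HashMap<>();
    committed.put("P1", 100L); // task group A has not published yet

    Map<String, Long> retained = new HashMap<>();
    retained.put("P1", 150L); // latest checkpoint of task group A

    // Offsets retained across scaling: B starts where A left off.
    System.out.println(resolveStartOffset("P1", retained, committed));

    // Offsets cleared before scaling: B starts from the stale committed offset.
    System.out.println(resolveStartOffset("P1", new HashMap<>(), committed));
  }
}
```

   With the offsets retained, the new task group starts at A's checkpoint, so its publish can line up with the metadata store once A has published.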
   
   ### Note
   
   This bug may still occur if Overlord leadership changes right before the 
scaling event.
   There is currently no way to handle that case, since `partitionOffsets` is an 
in-memory data structure and is not meant to be persisted.
   
   <hr>
   
   This PR has:
   
   - [ ] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever it would not be obvious to an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]