Pankaj260100 opened a new issue, #15054: URL: https://github.com/apache/druid/issues/15054
### Affected Version

25.0.0

### Description

- Getting the error `Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].` after submitting the supervisor config (I didn't update the topic name; we only updated the intermediate handoff period).
- On debugging, I found that a few tasks failed to publish segments because the existing metadata state didn't match the new start state.
- On further debugging, I was able to pinpoint the issue and replicate it.
- I submitted the supervisor config just after a task rollover happened for a few tasks; the task duration is 1 hour in my case. Here is the scenario: Task A is running on one of the indexers from 8 am to 9 am. After 9 am, Task A starts publishing its segments, and a new Task B starts ingesting from the same partitions Task A was ingesting. Task B runs on a different indexer, and just a minute later (at 9:01 am) I submitted the supervisor config, so Task B also started publishing the data it had consumed in that last minute. Task B failed to publish because the end offsets stored in the metadata store didn't match the start offsets of Task B's segments: Task A hadn't updated the end offsets yet (its publish was still in progress). The system recovers on its own after some time, but ordering was not maintained here, and that's why this issue happened.
- Task A was eventually able to publish its segments successfully, but the lag went very high because Task B failed, and Task C (which started after I submitted the supervisor config) also failed. After some time, Task D (which began when Tasks B and C failed) also fails, because Task D's start offsets are picked up from the metadata store, and by the time Task D publishes, Task A has published its segments and updated the end offsets. Task D then hits the same metadata-state mismatch.
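The failure mode above comes down to the compare-and-swap (CAS) check that the transactional segment publish performs against the metadata store. Below is a minimal Python sketch of that check, under loud assumptions: all names (`MetadataStore`, `publish`) are hypothetical, and the real logic lives in Druid's Java metadata-storage coordinator; this only illustrates why an out-of-order publish aborts.

```python
# Hypothetical sketch of the CAS check during segment publish.
# Druid's real implementation is in Java; names here are invented.

class MetadataStore:
    def __init__(self):
        # Committed end offsets per Kafka partition, e.g. {0: 100}.
        self.committed = {0: 100}

    def publish(self, start_offsets, end_offsets):
        # The transaction commits only if the task's start offsets exactly
        # match the offsets currently committed in the metadata store.
        if start_offsets != self.committed:
            raise RuntimeError("Aborting transaction!")
        self.committed = dict(end_offsets)

store = MetadataStore()

# Task A read partition 0 from offset 100 to 200 but has NOT published yet.
# Task B continued from offset 200 to 210 and publishes FIRST (out of order):
try:
    store.publish(start_offsets={0: 200}, end_offsets={0: 210})
except RuntimeError as e:
    print(f"Task B: {e}")  # store still says 100, not 200 -> abort

# Task A's publish succeeds, because its start matches the committed state:
store.publish(start_offsets={0: 100}, end_offsets={0: 200})

# A retry of Task B's publish would now match and succeed:
store.publish(start_offsets={0: 200}, end_offsets={0: 210})
```

In this toy model, the only thing the store can tell Task B is "start state doesn't match"; it cannot tell whether the mismatch is transient or permanent, which is the crux of the issue.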
Then, finally, Task E ingests from the proper start offsets and recovers from this issue, but by then the lag is very high.

Slack thread: https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1695710662075909

As @abhishekagarwal87 suggested, we can use a retry here so that Task B does not fail outright: it retries, and after Task A has updated the end offsets in the metadata store, Task B also successfully publishes its segments and updates the metadata store. The problem is that a metadata mismatch can also happen in the two scenarios below:

1. Someone updated the topic name without changing the supervisor/datasource name. There is no point in retrying in this case.
2. Multiple replicas of an ingestion task are running: only one replica successfully updates the metadata, and the other replicas fail. There is no point in retrying in this case either.

It will be hard to differentiate these cases from the issue I faced, since in all of them the metadata mismatch looks the same: the end offsets in the metadata store don't match the new start state.

Any other solutions or suggestions, guys?
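The suggested retry could be sketched roughly as below. This is a hypothetical, self-contained Python sketch (not Druid's API); `publish_with_retry`, `publish_fn`, and the backoff parameters are all invented names. It also makes the limitation concrete: a bounded retry helps when the mismatch is transient, but it can only give up after exhausting attempts in the topic-rename and replica cases, since the error looks identical.

```python
import time

def publish_with_retry(publish_fn, max_attempts=5, base_backoff_s=1.0):
    """Hypothetical retry wrapper around a transactional segment publish.

    publish_fn raises RuntimeError on a metadata-state mismatch. Retrying
    helps only when the mismatch is transient (an earlier task has not yet
    committed its end offsets); it cannot distinguish that from a permanent
    mismatch (topic renamed, or a replica already committed), which is
    exactly the difficulty described above.
    """
    for attempt in range(max_attempts):
        try:
            publish_fn()
            return True
        except RuntimeError:
            if attempt < max_attempts - 1:
                # Exponential backoff before the next attempt.
                time.sleep(base_backoff_s * (2 ** attempt))
    return False

# Example: the first two attempts hit the mismatch; the third succeeds,
# as when Task A finally commits its end offsets.
attempts = {"n": 0}

def fake_publish():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("Aborting transaction!")

print(publish_with_retry(fake_publish, base_backoff_s=0.0))  # True
```

A possible refinement would be to bound the retry window by the expected publish time of the preceding task (roughly the task duration plus handoff), so permanent mismatches fail reasonably fast, but that is a design question rather than something this sketch settles.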
