Pankaj260100 opened a new issue, #15054: URL: https://github.com/apache/druid/issues/15054
### Affected Version

25.0.0

### Description

- Getting the error `Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].` after submitting the supervisor config (I didn't update the topic name; we only updated the intermediate handoff period).
- On debugging, I found that a few tasks failed to publish segments because the existing metadata state didn't match the new start state.
- On further debugging, I was able to pinpoint the issue and replicate it.
- I submitted the supervisor config just after a task rollover happened for a few tasks; the task duration is 1 hour in my case. Here is the scenario: Task A is running on one of the indexers from 8 am to 9 am. After 9 am, Task A starts publishing its segments, and a new Task B starts ingesting from the same partitions Task A was ingesting. Task B runs on a different indexer, and just a minute later (at 9:01 am) I submitted the supervisor config, so Task B also started publishing the data it had consumed in that last minute. Task B failed to publish because the end offsets stored in the metadata store didn't match the start offsets of Task B's segments: Task A hadn't updated the end offsets yet (its publish was still in progress). The system recovers on its own after some time, but ordering was not maintained here, and that's why this issue happened.
- Task A was eventually able to publish its segments successfully, but the lag went very high because Task B failed, and Task C (which started after I submitted the supervisor config) also failed. After some time, Task D (which began when Tasks B and C failed) also fails, because Task D's start offsets are picked up from the metadata store, and by the time Task D publishes, Task A has published its segments and updated the end offsets. Task D then hits the same metadata-state mismatch.
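The failure mode above comes down to the compare-and-swap (CAS) check that the transactional segment publish performs against the metadata store. Below is a minimal Python sketch of that check, under loud assumptions: all names (`MetadataStore`, `publish`) are hypothetical, and the real logic lives in Druid's Java metadata-storage coordinator; this only illustrates why an out-of-order publish aborts.

```python
# Hypothetical sketch of the CAS check during segment publish.
# Druid's real implementation is in Java; names here are invented.

class MetadataStore:
    def __init__(self):
        # Committed end offsets per Kafka partition, e.g. {0: 100}.
        self.committed = {0: 100}

    def publish(self, start_offsets, end_offsets):
        # The transaction commits only if the task's start offsets exactly
        # match the offsets currently committed in the metadata store.
        if start_offsets != self.committed:
            raise RuntimeError("Aborting transaction!")
        self.committed = dict(end_offsets)

store = MetadataStore()

# Task A read partition 0 from offset 100 to 200 but has NOT published yet.
# Task B continued from offset 200 to 210 and publishes FIRST (out of order):
try:
    store.publish(start_offsets={0: 200}, end_offsets={0: 210})
except RuntimeError as e:
    print(f"Task B: {e}")  # store still says 100, not 200 -> abort

# Task A's publish succeeds, because its start matches the committed state:
store.publish(start_offsets={0: 100}, end_offsets={0: 200})

# A retry of Task B's publish would now match and succeed:
store.publish(start_offsets={0: 200}, end_offsets={0: 210})
```

In this toy model, the only thing the store can tell Task B is "start state doesn't match"; it cannot tell whether the mismatch is transient or permanent, which is the crux of the issue.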
Then, finally, Task E ingests from the proper start offsets and recovers from this issue, but by then the lag is very high.

Slack thread: https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1695710662075909

As @abhishekagarwal87 suggested, we can use a retry here so that Task B does not fail outright: it retries, and after Task A has updated the end offsets in the metadata store, Task B also successfully publishes its segments and updates the metadata store. The problem is that a metadata mismatch can also happen in the two scenarios below:

1. Someone updated the topic name without changing the supervisor/datasource name. There is no point in retrying in this case.
2. Multiple replicas of an ingestion task are running: only one replica successfully updates the metadata, and the other replicas fail. There is no point in retrying in this case either.

It will be hard to differentiate these cases from the issue I faced, since in all of them the metadata mismatch looks the same: the end offsets in the metadata store don't match the new start state.

Any other solutions or suggestions, guys?
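The suggested retry could be sketched roughly as below. This is a hypothetical, self-contained Python sketch (not Druid's API); `publish_with_retry`, `publish_fn`, and the backoff parameters are all invented names. It also makes the limitation concrete: a bounded retry helps when the mismatch is transient, but it can only give up after exhausting attempts in the topic-rename and replica cases, since the error looks identical.

```python
import time

def publish_with_retry(publish_fn, max_attempts=5, base_backoff_s=1.0):
    """Hypothetical retry wrapper around a transactional segment publish.

    publish_fn raises RuntimeError on a metadata-state mismatch. Retrying
    helps only when the mismatch is transient (an earlier task has not yet
    committed its end offsets); it cannot distinguish that from a permanent
    mismatch (topic renamed, or a replica already committed), which is
    exactly the difficulty described above.
    """
    for attempt in range(max_attempts):
        try:
            publish_fn()
            return True
        except RuntimeError:
            if attempt < max_attempts - 1:
                # Exponential backoff before the next attempt.
                time.sleep(base_backoff_s * (2 ** attempt))
    return False

# Example: the first two attempts hit the mismatch; the third succeeds,
# as when Task A finally commits its end offsets.
attempts = {"n": 0}

def fake_publish():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("Aborting transaction!")

print(publish_with_retry(fake_publish, base_backoff_s=0.0))  # True
```

A possible refinement would be to bound the retry window by the expected publish time of the preceding task (roughly the task duration plus handoff), so permanent mismatches fail reasonably fast, but that is a design question rather than something this sketch settles.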
