keith-turner commented on PR #5570:
URL: https://github.com/apache/accumulo/pull/5570#issuecomment-2901573601
Adding a more concrete description of this bug and the fix.
Before this fix, the following could happen in a tablet server process.
1. THREAD_1 is working on loading a tablet that has an existing external
compaction in the metadata table
2. THREAD_1 adds the external compaction id to
CompactionManager.runningExternalCompactions
3. THREAD_2 is running CompactionManager.mainLoop or
CompactionManager.commitExternalCompaction or
CompactionManager.externalCompactionFailed
4. THREAD_2 sees an external compaction id in
CompactionManager.runningExternalCompactions that no online tablet in the
tserver knows about
5. THREAD_2 removes the external compaction id from
CompactionManager.runningExternalCompactions
6. THREAD_1 adds the tablet it is working on to the set of online tablets.
This is the set that THREAD_2 did not see the tablet in.
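
The interleaving above can be replayed deterministically in a small
single-threaded sketch. This is not the Accumulo code: the class RaceSketch,
its two collections, the method names, and the ids "ECID-1" and "tablet-A"
are hypothetical stand-ins chosen for illustration. Only the order of the
operations follows the steps listed above.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the tserver state involved in the race.
class RaceSketch {
    // external compaction id -> tablet extent (stand-in for
    // CompactionManager.runningExternalCompactions)
    final Map<String, String> runningExternalCompactions = new ConcurrentHashMap<>();
    // stand-in for the tserver's set of online tablets
    final Set<String> onlineTablets = ConcurrentHashMap.newKeySet();

    // Steps 4-5: drop every compaction id whose tablet is not online.
    void removeUnknownCompactions() {
        runningExternalCompactions.values()
            .removeIf(extent -> !onlineTablets.contains(extent));
    }

    // Replays steps 2-6 in the problematic order and reports whether the
    // compaction id is still tracked afterwards.
    static boolean replayRace() {
        RaceSketch tserver = new RaceSketch();
        // Step 2 (THREAD_1): register the existing external compaction.
        tserver.runningExternalCompactions.put("ECID-1", "tablet-A");
        // Steps 3-5 (THREAD_2): cleanup runs before the tablet is online.
        tserver.removeUnknownCompactions();
        // Step 6 (THREAD_1): the tablet comes online, too late.
        tserver.onlineTablets.add("tablet-A");
        return tserver.runningExternalCompactions.containsKey("ECID-1");
    }

    public static void main(String[] args) {
        System.out.println("compaction still tracked: " + replayRace());
    }
}
```

In this model replayRace() returns false: the tablet is online, but the
tserver no longer has an entry for its external compaction.
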
When the above sequence of events happens, the tablet server will always
ignore RPCs from the coordinator to commit or fail the compaction. Until the
tablet server is restarted and the tablet lands on a new tserver where the
race condition does not happen, the external compaction can never commit and
its files stay reserved.
This fix does two things to avoid the race condition. First,
CompactionManager.mainLoop was modified to consider tablets that are opening
as well as tablets that are online. Tablets in the opening state will add
existing external compactions to CompactionManager.runningExternalCompactions.
Second, the two RPC handling methods in CompactionManager that were removing
entries from CompactionManager.runningExternalCompactions were modified to do
this only when the compaction id is in both sets.
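
As a rough illustration of the first change only, the sketch below extends a
toy model of the tserver with a set of opening tablets and has the cleanup
consult both sets. The class FixSketch, its fields, and the ids are
hypothetical and are not taken from the PR. The order used when a tablet
moves from opening to online (add to online first, then remove from opening)
is an assumption made here so that the tablet is never absent from both sets.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the cleanup after the fix; not the Accumulo code.
class FixSketch {
    final Map<String, String> runningExternalCompactions = new ConcurrentHashMap<>();
    final Set<String> openingTablets = ConcurrentHashMap.newKeySet();
    final Set<String> onlineTablets = ConcurrentHashMap.newKeySet();

    // Drop a compaction id only if its tablet is neither opening nor online.
    void removeUnknownCompactions() {
        runningExternalCompactions.values().removeIf(
            extent -> !openingTablets.contains(extent)
                && !onlineTablets.contains(extent));
    }

    // Replays the same interleaving as the bug, plus one truly stale entry.
    static Set<String> replayWithFix() {
        FixSketch tserver = new FixSketch();
        // A compaction whose tablet is no longer hosted on this tserver.
        tserver.runningExternalCompactions.put("ECID-2", "tablet-gone");
        // THREAD_1: the tablet is marked as opening before it registers
        // its existing external compaction.
        tserver.openingTablets.add("tablet-A");
        tserver.runningExternalCompactions.put("ECID-1", "tablet-A");
        // THREAD_2: cleanup now sees the opening tablet and keeps ECID-1.
        tserver.removeUnknownCompactions();
        // THREAD_1: move the tablet from opening to online.
        tserver.onlineTablets.add("tablet-A");
        tserver.openingTablets.remove("tablet-A");
        return new TreeSet<>(tserver.runningExternalCompactions.keySet());
    }

    public static void main(String[] args) {
        System.out.println(replayWithFix()); // prints [ECID-1]
    }
}
```

The stale entry ECID-2 is still cleaned up, while ECID-1 survives because its
tablet was visible as opening when the cleanup ran.
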