zclllyybb commented on issue #64682: URL: https://github.com/apache/doris/issues/64682#issuecomment-4766250213
Breakwater-GitHub-Analysis-Slot: slot_a0ba233667ad This content is generated by AI for reference only. Initial assessment: this looks like a valid latent bug in `MetaLockUtils.commitLockTables` on current `master`. I checked upstream `master` at `90db340515e88e177bc4e6151ba0b9a84cf04d8c`. In `fe/fe-core/src/main/java/org/apache/doris/common/util/MetaLockUtils.java`, the failure rollback path for `commitLockTables()` iterates `j = i - 1 ... 0`, but still calls `tableList.get(i).commitUnlock()`. That diverges from the nearby table-lock helpers, which unlock the already acquired entries with `get(j)` and then propagate the failure. The consequence is not only stylistic. The main caller I found is the cloud-mode branch of `BrokerLoadJob`, which calls `MetaLockUtils.commitLockTables(tableList)` before entering the later transaction `try/finally` that unlocks all tables. Therefore, if commit-lock acquisition fails partway through the list, `commitLockTables()` itself is responsible for releasing the earlier locks. With the current code, tables `0..i-1` are not released on that path. One nuance: the issue says the original exception is swallowed and the loop keeps going. The method indeed does not explicitly rethrow `e`; however, with the current `Table.commitUnlock()` implementation, `commitUnlock()` delegates to `MonitoredReentrantLock.unlock()`, which delegates to `ReentrantLock.unlock()`. If the failed table's lock is not owned by the current thread, the wrong rollback call can raise an unlock-side runtime exception instead. Either way, the already acquired commit locks remain unreleased, and the original acquisition failure is not handled correctly. PR #64676 appears to address the two important parts of the fix: - change the rollback target from `tableList.get(i)` to `tableList.get(j)`; - rethrow the acquisition failure after rollback; - add a focused FE unit test covering rollback and failure propagation. Suggested maintainer next step: review PR #64676 as the direct fix. I do not think extra runtime logs are required to validate the bug because the affected branch is visible by code inspection. If maintainers want to assess whether this has already occurred in a running deployment, the useful evidence would be FE logs around cloud-mode broker-load transaction commit failures plus a thread dump or lock-owner evidence showing a table commit lock held after a failed commit-lock acquisition. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
