ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4399200980
Update: with the current PR, `TestBulkLoadReplicationHFileRefs` gets hung when the entire suite is run, although individual test runs are passing. So I suspect there is still some missing scenario/cleanup gap in the shipper restart lifecycle logic. I analysed the thread dump and the hang is happening during `RemovePeerProcedure`, specifically inside: `ReplicationPeerManager.removeAllQueuesAndHFileRefs()` → `TableReplicationQueueStorage.removeAllQueues()` The procedure appears to be waiting indefinitely while scanning/removing replication queues. One important observation from the dump is that there are no active `ReplicationSourceShipper` or `ReplicationSourceWALReader` threads at the time of the hang, which suggests this is likely not an active thread deadlock, but rather inconsistent/orphaned replication queue state left behind during the custom restart flow. Current suspicion is that restarting only the shipper thread is insufficient and we likely need coordinated restart/cleanup of the entire WAL group pipeline (reader + shipper + buffered batches + queue bookkeeping) before recreating the worker. I am currently validating this by adding a coordinated `restartWalGroup(...)` flow which: * stops reader + shipper together * clears pending WAL entry batches * releases buffer accounting correctly * removes old worker atomically * recreates a fresh reader/shipper pipeline Will update once I validate whether this resolves the suite-level hang. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
