ankitsol commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-4399200980

   Update: with the current PR, `TestBulkLoadReplicationHFileRefs` gets hung 
when the entire suite is run, although individual test runs are passing. So I 
suspect there is still some missing scenario/cleanup gap in the shipper restart 
lifecycle logic.
   
   I analysed the thread dump and the hang is happening during 
`RemovePeerProcedure`, specifically inside:
   
   `ReplicationPeerManager.removeAllQueuesAndHFileRefs()`
   → `TableReplicationQueueStorage.removeAllQueues()`
   
   The procedure appears to be waiting indefinitely while scanning/removing 
replication queues.
   
   One important observation from the dump is that there are no active 
`ReplicationSourceShipper` or `ReplicationSourceWALReader` threads at the time 
of the hang, which suggests this is likely not an active thread deadlock, but 
rather inconsistent/orphaned replication queue state left behind during the 
custom restart flow.
   
   Current suspicion is that restarting only the shipper thread is insufficient 
and we likely need coordinated restart/cleanup of the entire WAL group pipeline 
(reader + shipper + buffered batches + queue bookkeeping) before recreating the 
worker.
   
   I am currently validating this by adding a coordinated 
`restartWalGroup(...)` flow which:
   
   * stops reader + shipper together
   * clears pending WAL entry batches
   * releases buffer accounting correctly
   * removes old worker atomically
   * recreates a fresh reader/shipper pipeline
   
   Will update once I validate whether this resolves the suite-level hang.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to