Sid Khillon created HBASE-29987:
-----------------------------------

             Summary: Replication position corruption when WAL file switch 
detected in ReplicationSourceWALReader run loop
                 Key: HBASE-29987
                 URL: https://issues.apache.org/jira/browse/HBASE-29987
             Project: HBase
          Issue Type: Bug
          Components: Replication, wal, Zookeeper
            Reporter: Sid Khillon


When {{ReplicationSourceWALReader.run()}} detects a WAL file switch via the 
{{switched()}} check at line 160, it enqueues an EOF batch but does not update 
{{currentPosition}}. If the outer loop subsequently restarts (e.g., due to 
{{WALEntryFilterRetryableException}}), the new {{WALEntryStream}} is 
created with the stale position from the old WAL file, which is then applied 
to the new WAL file. The reader then enters an infinite retry loop, 
attempting to seek to an invalid position, and replication stalls permanently.
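
The failure mode can be illustrated with a toy model. This is only a sketch 
and not HBase code: the class {{StalePositionSketch}}, the file names, the 
lengths, and the offsets are all hypothetical. It also assumes that the stale 
offset lies beyond the end of the new WAL file, which is one way a position 
can be invalid. Starting from offset 0 is shown only as a contrast, not as 
the fix chosen for this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these names are HBase APIs.
final class StalePositionSketch {
    // Hypothetical WAL files, reduced to a name and a length in bytes.
    static final Map<String, Long> WAL_LENGTHS = new HashMap<>();
    static {
        WAL_LENGTHS.put("wal.1", 5000L);
        WAL_LENGTHS.put("wal.2", 1200L);
    }

    // A seek succeeds only if the offset lies within the named file.
    static boolean canSeek(String wal, long position) {
        Long length = WAL_LENGTHS.get(wal);
        return length != null && position >= 0 && position <= length;
    }

    // Counts failed seek attempts. The cap exists only so the sketch
    // terminates; the reader described in the report retries without limit.
    static int failedAttempts(String wal, long position, int maxAttempts) {
        int failures = 0;
        for (int i = 0; i < maxAttempts; i++) {
            if (canSeek(wal, position)) {
                break;
            }
            failures++;
        }
        return failures;
    }

    public static void main(String[] args) {
        long stalePosition = 4800L; // offset last reached in wal.1
        System.out.println("stale offset failures: "
            + failedAttempts("wal.2", stalePosition, 10));
        System.out.println("offset 0 failures: "
            + failedAttempts("wal.2", 0L, 10));
    }
}
```

In this model every attempt with the stale offset fails, because nothing 
between attempts changes the offset or the file.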

 

The {{switched()}} path at line 160 fires when {{readWALEntries()}} returns a 
batch without seeing EOF, either because batch capacity was reached or 
because an error (e.g., NameNode timeout) caused {{hasNext()}} inside 
{{readWALEntries()}} to return {{RETRY}}, breaking the loop early. The next 
{{hasNext()}} at line 153 then detects EOF, dequeues the old file, and returns 
{{RETRY_IMMEDIATELY}}. The {{switched()}} check fires because 
{{currentPath}} (captured before {{hasNext()}}) was the old file, but 
the stream's path is now null after the dequeue. {{currentPosition}} is not 
updated, so it still holds the offset from the old file.
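
The sequence above can be traced in a simplified model. Again, this is a 
sketch and not the HBase implementation: {{SwitchedCheckSketch}}, its nested 
{{Stream}} class, the {{resetOnSwitch}} flag, and the offsets are hypothetical 
stand-ins that borrow names from this report. Resetting the position to 0 on a 
switch is an assumption used to show the difference, not a confirmed fix.

```java
import java.util.ArrayDeque;
import java.util.Objects;

// Toy model only: names mirror the report, but this is not HBase code.
final class SwitchedCheckSketch {
    enum HasNext { YES, RETRY, RETRY_IMMEDIATELY }

    // Hypothetical stand-in for WALEntryStream backed by a queue of files.
    static final class Stream {
        final ArrayDeque<String> queue = new ArrayDeque<>();
        String currentPath;
        boolean atEof;

        HasNext hasNext() {
            if (atEof) {
                queue.poll();        // dequeue the finished file
                currentPath = null;  // no file is open until the next call
                atEof = false;
                return HasNext.RETRY_IMMEDIATELY;
            }
            return HasNext.YES;
        }
    }

    // True when the path captured before hasNext() no longer matches.
    static boolean switched(Stream stream, String pathBefore) {
        return !Objects.equals(pathBefore, stream.currentPath);
    }

    // Runs one pass of the modeled loop and returns the resulting position.
    static long positionAfterSwitch(boolean resetOnSwitch) {
        Stream stream = new Stream();
        stream.queue.add("wal.1");
        stream.queue.add("wal.2");
        stream.currentPath = "wal.1";
        long currentPosition = 4800L; // offset reached in wal.1 earlier
        stream.atEof = true;          // earlier batch ended without seeing EOF

        String pathBefore = stream.currentPath; // captured before hasNext()
        stream.hasNext();                       // detects EOF, dequeues wal.1
        if (switched(stream, pathBefore)) {
            // The EOF batch would be enqueued here.
            if (resetOnSwitch) {
                currentPosition = 0L; // assumed remedy, for contrast only
            }
        }
        return currentPosition;
    }
}
```

With {{resetOnSwitch}} set to false the model keeps the old file's offset 
after the switch, which is the state the report describes.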



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
