[ https://issues.apache.org/jira/browse/HBASE-27871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wellington Chevreuil updated HBASE-27871:
-----------------------------------------
    Affects Version/s: 2.4.16

> Meta replication stuck forever if wal it's still reading gets rolled and deleted
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-27871
>                 URL: https://issues.apache.org/jira/browse/HBASE-27871
>             Project: HBase
>          Issue Type: Bug
>          Components: meta replicas
>    Affects Versions: 2.6.0, 2.4.16, 2.4.17, 2.5.4
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 2.5.6
>
>
> This affects branch-2 based releases only (in master, HBASE-26416 refactored
> region replication so that it no longer relies on the replication framework).
>
> Per the original [meta region replicas
> design|https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit],
> we use most of the replication framework for communicating changes in the
> primary replica back to the secondary ones, but we skip storing the queue
> state in ZK. In the event of a region replication crash, we should let the
> related replication source thread be interrupted, so that
> RegionReplicaReplicationEndpoint sets up a new source from scratch and
> makes sure the secondary replicas get updated.
>
> We have run into a situation in one of our customers' clusters where the
> region replica source faced a long lag (probably because the RSes hosting the
> secondary replicas were busy and slow to process the region replication
> entries), so the current wal got rolled and eventually deleted whilst
> the replication source reader was still referring to it. In such cases,
> ReplicationSourceReader only sees the IOException and keeps retrying the read
> indefinitely, but since the file is now gone, it gets stuck there forever.
> In the particular case of FileNotFoundException (FNFE), which I believe would
> only happen for region replication, we should just raise an exception and let
> RegionReplicaReplicationEndpoint handle it to reset the region replication
> source.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
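
For illustration only, the behaviour proposed in the description (retry on a generic IOException, but fail fast on FileNotFoundException so the source can be reset) might be sketched as below. This is not the HBase code or the actual patch: WalBatchReader, SourceResetRequiredException and readWithRetries are hypothetical names, and the retry count is bounded here only to keep the sketch runnable, whereas the reader described above retries indefinitely.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class WalReadRetrySketch {

    /** Hypothetical stand-in for one attempt to read a batch of WAL entries. */
    public interface WalBatchReader {
        int readBatch() throws IOException;
    }

    /** Hypothetical signal that the source cannot recover and must be rebuilt. */
    public static class SourceResetRequiredException extends RuntimeException {
        public SourceResetRequiredException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Retries transient IOExceptions up to maxAttempts times, but treats a
     * missing file as unrecoverable: retrying cannot bring a deleted WAL back.
     */
    public static int readWithRetries(WalBatchReader reader, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return reader.readBatch();
            } catch (FileNotFoundException e) {
                // FileNotFoundException is a subclass of IOException, so it
                // must be caught first. Escalate instead of retrying.
                throw new SourceResetRequiredException("WAL file is gone; reset the source", e);
            } catch (IOException e) {
                // Possibly transient: remember the error and try again.
                last = e;
                Thread.sleep(sleepMillis);
            }
        }
        throw last;
    }
}
```

In this sketch the caller that owns the source would catch SourceResetRequiredException and build a fresh source, which is the role the description assigns to RegionReplicaReplicationEndpoint.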