Nick Dimiduk created HBASE-27707:
------------------------------------

             Summary: Region replica replication sometimes orphans WAL queue entries during recovery
                 Key: HBASE-27707
                 URL: https://issues.apache.org/jira/browse/HBASE-27707
             Project: HBase
          Issue Type: Bug
          Components: read replicas, Replication
    Affects Versions: 2.5.0
            Reporter: Nick Dimiduk

Running with timeline-consistent read replicas and {{hbase.region.replica.replication.enabled=true}}, we're seeing some region servers hold WAL queue entries that never clear. This appears to correlate with SCP (ServerCrashProcedure) and the recovery of replication queues. As a result, WALs build up and consume dangerous amounts of space on HDFS.

Remediation requires disabling and removing the {{region_replica_replication}} peer, which forces an impacted region server to abort with the message "Failed to operate on replication queue". We then delete the ZK entry, which unlocks the WALs so the cleaner chore can sweep them.