Rushabh Shah created HBASE-26435:
------------------------------------

             Summary: [branch-1] The log rolling request may be canceled immediately in LogRoller due to a race
                 Key: HBASE-26435
                 URL: https://issues.apache.org/jira/browse/HBASE-26435
             Project: HBase
          Issue Type: Sub-task
          Components: wal
    Affects Versions: 1.6.0
            Reporter: Rushabh Shah
             Fix For: 1.7.2
Saw this issue in our internal 1.6 branch. The WAL was rolled, but the new WAL file was not writable, and the following error was logged:

{noformat}
2021-11-03 19:20:19,503 WARN  [.168:60020.logRoller] hdfs.DFSClient - Error while syncing
java.io.IOException: Could not get block locations. Source file "/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389" - Aborting...
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
        at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
2021-11-03 19:20:19,507 WARN  [.168:60020.logRoller] wal.FSHLog - pre-sync failed but an optimization so keep going
java.io.IOException: Could not get block locations. Source file "/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389" - Aborting...
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
        at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
{noformat}

Since the new WAL file was not writable, appends to it started failing immediately after it was rolled:

{noformat}
2021-11-03 19:20:19,677 INFO  [.168:60020.logRoller] wal.FSHLog - Rolled WAL /hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635965392022 with entries=253234, filesize=425.67 MB; new WAL /hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389
2021-11-03 19:20:19,690 WARN  [020.append-pool17-t1] wal.FSHLog - Append sequenceId=1962661783, requesting roll of WAL
java.io.IOException: Could not get block locations. Source file "/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389" - Aborting...
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
        at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
2021-11-03 19:20:19,690 INFO  [.168:60020.logRoller] wal.FSHLog - Archiving hdfs://prod-EMPTY-hbase2a/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635960792837 to hdfs://prod-EMPTY-hbase2a/hbase/oldWALs/hbase2a-dnds1-232-ukb.ops.sfdc.net%2C60020%2C1635567166484.1635960792837
{noformat}

The LogRoller thread resets the rollLog flag only after the rollWriter call is complete. FSHLog#rollWriter does many things, such as replacing the writer and archiving old logs, so it can take a while. If an append thread fails to write to the new file and requests a roll while the LogRoller thread is still inside the previous rollWriter call, that request is lost: LogRoller resets rollLog to false when rollWriter returns, wiping out the flag the append thread just set.

Relevant code: https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java#L183-L203

We need to reset the rollLog flag before we start rolling the WAL. This is fixed in branch-2 and master via HBASE-22684, but we didn't fix it in branch-1. Also, branch-2 has the multi-WAL implementation, so that fix cannot be applied cleanly to branch-1.
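For illustration, here is a minimal sketch of the race and of the flag-reset reordering. This is a simplified stand-in, not the actual branch-1 LogRoller code; the class and the requestRoll/rollWriter bodies are illustrative only:

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified stand-in for LogRoller (illustrative, not the branch-1 class).
public class LogRollerSketch {
  private final AtomicBoolean rollLog = new AtomicBoolean(false);

  // Append threads call this when a sync/append on the WAL fails.
  public void requestRoll() {
    rollLog.set(true);
    synchronized (rollLog) {
      rollLog.notifyAll(); // wake the roller thread
    }
  }

  // Current branch-1 ordering: the flag is cleared only AFTER rollWriter()
  // returns, so a request raised while rollWriter() is running is discarded.
  void iterationBuggy() throws Exception {
    if (rollLog.get()) {
      rollWriter();        // replaces writer, archives old logs; can be slow
      rollLog.set(false);  // RACE: also wipes out any request made meanwhile
    }
  }

  // Reordered as in HBASE-22684 (branch-2/master): clear the flag BEFORE
  // rolling, so a request raised during rollWriter() survives to the next
  // loop iteration and triggers another roll.
  void iterationFixed() throws Exception {
    if (rollLog.compareAndSet(true, false)) {
      rollWriter();
    }
  }

  private void rollWriter() throws Exception {
    // Stand-in for FSHLog#rollWriter.
    Thread.sleep(100);
  }
}
{code}

With the reordered version, an append failure during rollWriter() leaves rollLog set, so the roller performs another roll on its next iteration instead of silently dropping the request.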