Rushabh Shah created HBASE-26435:
------------------------------------
Summary: [branch-1] The log rolling request may be canceled
immediately in LogRoller due to a race
Key: HBASE-26435
URL: https://issues.apache.org/jira/browse/HBASE-26435
Project: HBase
Issue Type: Sub-task
Components: wal
Affects Versions: 1.6.0
Reporter: Rushabh Shah
Fix For: 1.7.2
Saw this issue in our internal 1.6 branch.
The WAL was rolled, but the new WAL file was not writable, and the following
error was logged:
{noformat}
2021-11-03 19:20:19,503 WARN [.168:60020.logRoller] hdfs.DFSClient - Error while syncing
java.io.IOException: Could not get block locations. Source file "/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389" - Aborting...
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
    at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
2021-11-03 19:20:19,507 WARN [.168:60020.logRoller] wal.FSHLog - pre-sync failed but an optimization so keep going
java.io.IOException: Could not get block locations. Source file "/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389" - Aborting...
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
    at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
{noformat}
Since the new WAL file was not writable, appends to that file started failing
immediately after it was rolled.
{noformat}
2021-11-03 19:20:19,677 INFO [.168:60020.logRoller] wal.FSHLog - Rolled WAL /hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635965392022 with entries=253234, filesize=425.67 MB; new WAL /hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389
2021-11-03 19:20:19,690 WARN [020.append-pool17-t1] wal.FSHLog - Append sequenceId=1962661783, requesting roll of WAL
java.io.IOException: Could not get block locations. Source file "/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635967219389" - Aborting...
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
    at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
2021-11-03 19:20:19,690 INFO [.168:60020.logRoller] wal.FSHLog - Archiving hdfs://prod-EMPTY-hbase2a/hbase/WALs/<rs-name>,60020,1635567166484/<rs-name>%2C60020%2C1635567166484.1635960792837 to hdfs://prod-EMPTY-hbase2a/hbase/oldWALs/hbase2a-dnds1-232-ukb.ops.sfdc.net%2C60020%2C1635567166484.1635960792837
{noformat}
We always reset the rollLog flag within the LogRoller thread only after the
rollWal call is complete.
FSHLog#rollWriter does many things, such as replacing the writer and archiving
old logs. If the append thread fails to write to the new file while the
LogRoller thread is still cleaning up old logs, the resulting roll request is
lost: the append thread sets the rollLog flag while the previous rollWriter
call is still in progress, and LogRoller then resets the flag to false once
that call returns.
Relevant code:
https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java#L183-L203
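To make the window concrete, here is a minimal, self-contained toy model of the race (hypothetical class and field names such as LostRollRequestDemo, rollRequested and rollWal; this is only an illustration of the pattern, not the actual LogRoller code):
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Toy model of the race (hypothetical names, not HBase code): the "append"
 * thread requests a roll while the "roller" thread is still inside the
 * previous roll; because the flag is only cleared after the roll finishes,
 * the new request is wiped out.
 */
public class LostRollRequestDemo {
  private static final AtomicBoolean rollRequested = new AtomicBoolean(false);

  // Stand-in for FSHLog#rollWriter: replace the writer, then archive old logs (slow).
  private static void rollWal() throws InterruptedException {
    Thread.sleep(200);
  }

  public static void main(String[] args) throws Exception {
    rollRequested.set(true);                 // initial roll request

    Thread roller = new Thread(() -> {
      try {
        if (rollRequested.get()) {
          rollWal();                         // the second request arrives during this call
          rollRequested.set(false);          // BUG: clears the request made during rollWal()
        }
      } catch (InterruptedException ignored) {
      }
    });
    roller.start();

    Thread.sleep(100);                       // while the roll is still in progress...
    rollRequested.set(true);                 // ...a failed append requests another roll
    roller.join();

    // Prints "false": the second request was swallowed, so no further roll will happen.
    System.out.println("roll still requested? " + rollRequested.get());
  }
}
{code}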
We need to reset the rollLog flag before we start rolling the WAL.
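Sketched against the same toy model, clearing the flag up front means a request that arrives mid-roll sets the flag again and is serviced on the next iteration instead of being dropped (again hypothetical names; stopped and threadWakeFrequency are assumed fields, and this is not the actual patch):
{code:java}
// Roller loop variant for the toy model above, with the flag cleared up front.
while (!stopped) {
  if (rollRequested.compareAndSet(true, false)) {  // reset BEFORE calling rollWal()
    rollWal();
    // A request made while rollWal() is running sets rollRequested back to true,
    // so it is picked up on the next iteration rather than being silently lost.
  } else {
    Thread.sleep(threadWakeFrequency);             // idle until the next request
  }
}
{code}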
This is fixed in branch-2 and master via HBASE-22684, but it was never fixed
in branch-1.
Also, branch-2 has a multi-WAL implementation, so the patch cannot be applied
cleanly to branch-1.