Hi,

We've recently upgraded our production clusters to 1.4.6. We have jobs that run 
periodically to take snapshots of some of our HBase tables, and these jobs seem 
to be running into https://issues.apache.org/jira/browse/HBASE-21069. I 
understand there was a missing null check, but I don't see any explanation in 
the JIRA of how the null occurs in the first place. For those of us running 
1.4.6, is there anything we can do to avoid hitting the bug?
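For reference, the snapshot calls themselves are just the standard client API; 
roughly, each job does something like the sketch below (the table and snapshot 
names here are placeholders, not our actual job code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class SnapshotJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // Placeholder table name; each job snapshots one of our tables.
                TableName table = TableName.valueOf("upload_metadata_v2");
                String snapshotName = "upload_metadata_v2-" + System.currentTimeMillis();
                // This is the snapshot request our jobs are making when they
                // run into HBASE-21069.
                admin.snapshot(snapshotName, table);
            }
        }
    }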

This problem is made worse because we are running a cluster in AWS EMR, which 
means our WAL is on a different filesystem (HDFS) than the HBase root directory 
(EMRFS). We are hitting some sort of issue where the master sometimes gets 
stuck while splitting a WAL from a crashed region server:

2018-11-20 12:01:58,599 ERROR [split-log-closeStream-2] wal.WALSplitter: Couldn't rename s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359708-ip-172-20-113-197.us-west-2.compute.internal%2C16020%2C1542620776146.1542673338055.temp to s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359720
java.io.IOException: Cannot get log reader
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:365)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
        at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.deleteOneWithFewerEntries(WALSplitter.java:1363)
        at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.closeWriter(WALSplitter.java:1496)
        at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1448)
        at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1445)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Wrong FS: s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359720, expected: hdfs://ip-172-20-113-83.us-west-2.compute.internal:8020
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:669)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:214)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:329)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:325)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:337)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:790)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
        ... 12 more
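
For context, the split between the two filesystems comes from how the EMR 
cluster lays things out in hbase-site.xml; roughly something like the excerpt 
below (the bucket and namenode are taken from the log above, the exact WAL 
directory value is illustrative):

    <property>
      <name>hbase.rootdir</name>
      <value>s3://cmx-emr-hbase-us-west-2-oregon/hbase</value>
    </property>
    <property>
      <name>hbase.wal.dir</name>
      <value>hdfs://ip-172-20-113-83.us-west-2.compute.internal:8020/user/hbase/WAL</value>
    </property>

As far as I can tell from the trace, the WALSplitter output sink is opening a 
recovered.edits path that lives under hbase.rootdir (S3) with the HDFS 
FileSystem instance, which is what produces the "Wrong FS" error.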

It seems like https://issues.apache.org/jira/browse/HBASE-20723 did not cover 
all cases. My understanding is that in 1.4.8 the recovered edits are colocated 
with the WAL (https://issues.apache.org/jira/browse/HBASE-20734), so this will 
no longer be an issue, but AWS has yet to release an EMR version with 1.4.8, so 
this is causing us pain right now when we hit this situation (it doesn't seem 
to happen every time a region server crashes - it has only happened twice so 
far).

Unfortunately, because we are running an AWS EMR cluster, we can't really just 
patch the region servers ourselves. We have the option of upgrading to 1.4.7 to 
get the fix for HBASE-21069, but that will take us a little time to test, 
release, and schedule downtime for our application, so any mitigating steps we 
could take in the meantime would be appreciated.

Thanks,

--Jacob LeBlanc
