Hi,
We've recently upgraded our production clusters to 1.4.6. We have jobs that
periodically take snapshots of some of our HBase tables, and these jobs seem to
be running into https://issues.apache.org/jira/browse/HBASE-21069. I understand
there was a missing null check, but the bug report doesn't really explain how
the null occurs in the first place. For those of us running 1.4.6, is there
anything we can do to avoid hitting the bug?
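For context, the snapshot jobs are essentially doing the equivalent of the
following (a simplified sketch, not our actual job; the snapshot naming and
scheduling are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class PeriodicSnapshotJob {
    public static void main(String[] args) throws Exception {
        // Standard client configuration picked up from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Placeholder table/snapshot names; the real jobs loop over several tables.
            TableName table = TableName.valueOf("upload_metadata_v2");
            String snapshotName = table.getQualifierAsString() + "-" + System.currentTimeMillis();
            admin.snapshot(snapshotName, table);
        }
    }
}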
This problem is made worse because we are running the cluster on AWS EMR, which
means our WALs live on a different filesystem (HDFS) than the hbase root
directory (EMRFS). We are also hitting some sort of issue where the master
sometimes gets stuck while splitting a WAL from the crashed region server (the
relevant parts of our configuration are sketched after the stack trace below):
2018-11-20 12:01:58,599 ERROR [split-log-closeStream-2] wal.WALSplitter: Couldn't rename s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359708-ip-172-20-113-197.us-west-2.compute.internal%2C16020%2C1542620776146.1542673338055.temp to s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359720
java.io.IOException: Cannot get log reader
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:365)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
        at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.deleteOneWithFewerEntries(WALSplitter.java:1363)
        at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.closeWriter(WALSplitter.java:1496)
        at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1448)
        at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1445)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Wrong FS: s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/upload_metadata_v2/3f98fcda5f711b29af28e9613d4b833b/recovered.edits/0000000000165359720, expected: hdfs://ip-172-20-113-83.us-west-2.compute.internal:8020
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:669)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:214)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:329)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:325)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:337)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:790)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
        ... 12 more
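For reference, the split-filesystem setup comes from configuration along these
lines (EMR generates this for us; the rootdir matches the paths in the log
above, but the exact WAL directory path and NameNode hostname here are only
illustrative):

  <property>
    <name>hbase.rootdir</name>
    <value>s3://cmx-emr-hbase-us-west-2-oregon/hbase</value>
  </property>
  <property>
    <name>hbase.wal.dir</name>
    <value>hdfs://ip-172-20-113-83.us-west-2.compute.internal:8020/user/hbase/WAL</value>
  </property>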
It seems like https://issues.apache.org/jira/browse/HBASE-20723 did not cover
all use cases. My understanding is that in 1.4.8 the recovered edits are
collocated with the WAL, so this will no longer be an issue
(https://issues.apache.org/jira/browse/HBASE-20734), but AWS has yet to release
an EMR version with 1.4.8, so this is causing us pain right now whenever we hit
this situation (it doesn't seem to happen every time a region server crashes -
only twice so far).
Unfortunately, because we are running an AWS EMR cluster, we can't really just
patch the region servers ourselves. We have the option of upgrading to 1.4.7 to
get the fix for HBASE-21069, but it will take us some time to test, release,
and schedule downtime for our application, so any mitigating steps we could
take in the meantime would be appreciated.
Thanks,
--Jacob LeBlanc