Hi,

We periodically execute Spark jobs to run ETL from some of our HBase tables to 
another data repository. The Spark jobs read data by taking a snapshot and then 
using the TableSnapshotInputFormat class. Lately we've been seeing failures
because when a job tries to read the data, it attempts to delete files under
the recovered.edits directory for some regions, and the user we run the jobs
as doesn't have permission to do that. A pastebin of the error and stack trace
from one of our job logs is here:
https://pastebin.com/MAhVc9JB
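
For reference, the read path in our jobs looks roughly like this (the snapshot
name, restore directory, and app name below are placeholders rather than our
actual values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SnapshotEtl {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Point the input format at an existing snapshot. The restore dir is a
        // scratch location on the same filesystem as the HBase root dir; the
        // snapshot name and path here are just placeholders.
        Job job = Job.getInstance(conf);
        TableSnapshotInputFormat.setInput(job, "my_snapshot",
                new Path("/user/etl/snapshot-restore"));

        JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("snapshot-etl"));
        JavaPairRDD<ImmutableBytesWritable, Result> rdd = sc.newAPIHadoopRDD(
                job.getConfiguration(),
                TableSnapshotInputFormat.class,
                ImmutableBytesWritable.class,
                Result.class);

        // Downstream ETL happens here; a count is enough to trigger the read.
        System.out.println("rows read from snapshot: " + rdd.count());
        sc.stop();
    }
}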

This started happening after upgrading to EMR 5.22, where the recovered.edits
directory is colocated with the WALs in HDFS; previously it lived in S3-backed
EMRFS.

I have two questions regarding this:


1)      First off, why are these files under the recovered.edits directory? The
timestamps of the files coincide with a hiccup we had with our cluster, where I
had to use "hbase hbck -fixAssignments" to fix regions that were stuck in
transition. That command seemed to work just fine: all regions were assigned
and there have been no inconsistencies since. Does this mean the WALs were not
replayed correctly? Does "hbase hbck -fixAssignments" not recover regions
properly?

2)      Why is our job trying to delete these files? I don't know enough to say
for sure, but it seems like using TableSnapshotInputFormat to read snapshot
data should not need to recover or delete edits.

I've fixed the problems by running "assign '<region>'" in the hbase shell for
every region that had files under the recovered.edits directory, and those
files seemed to be cleaned up when the assignment completed. But I'd like to
understand this better, especially if something is interfering with replaying
edits from the WALs (and making sure our ETL jobs don't start failing again
would be nice).
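
For the record, the same fix could probably be scripted instead of done by hand
in the shell; here's a rough sketch using the Admin API (the table name and
table directory are placeholders, and I'm assuming an HBase 1.x-style client
API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ReassignRegionsWithLeftoverEdits {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);

        // Placeholders: the table and its directory under the HBase root dir.
        TableName table = TableName.valueOf("my_table");
        Path tableDir = new Path("/user/hbase/data/default/my_table");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            for (HRegionInfo hri : admin.getTableRegions(table)) {
                Path editsDir = new Path(
                        new Path(tableDir, hri.getEncodedName()), "recovered.edits");
                // Only re-assign regions that still have leftover recovered.edits.
                if (fs.exists(editsDir)) {
                    admin.assign(hri.getRegionName());
                }
            }
        }
    }
}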

Thanks!

--Jacob LeBlanc
