Data loss after split in 0.98.12

James Estes Tue, 28 Jul 2015 10:09:06 -0700

All,

We've been running HBase 0.98.12 with Hadoop 2.6.0* for about 3 months
now in 4 clusters with essentially no issues. Recently, however, we've
started seeing some problems. I'm not sure they're related to that
combination, and they may be fixed in 1.1.1 (which we are in the process
of rolling out), but I wanted to post them here in case anyone can help us
understand what is going on, or wants to dig in to see whether this issue
could be affecting others.


The most critical issue is data loss. This happened only once, and it is
the first time I've ever personally seen HBase lose data. From what we can
tell, a region was compacting, and a split started for the same region
while the compaction was in progress (it had finished 3 of 5 column
families). The split began waiting for the compaction, and the compaction
was cancelled (presumably because of the split). The split then proceeded
and initialized the daughter regions, and then the region server crashed.
The crash was not related to the split: the server had been crashing daily
for another reason, related to scanning a large row as part of a custom
daily backup, which happened to be running at the same time as this split.

When the region came back up, data was missing from the region that had
been compacting and splitting (per some monitoring tests we have that scan
for known static data sets). There are logs indicating that the daughter
regions have no store files, so I suspect that the daughter regions
replaced the parent region before the store files were fully associated
with the daughter regions. It could also be that the large row caused the
split to fail, but then I'd hope the split would be aborted and the parent
region restored.
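
A check of the kind mentioned above (scanning for a known static data set)
can be sketched roughly as follows. This is only an illustration, not the
actual monitoring code: the function name, the row keys, and the in-memory
list standing in for a table scan are all hypothetical, and a real check
would read the rows through an HBase client.

```python
from typing import Dict, Iterable, List, Tuple


def find_missing_rows(expected: Dict[str, bytes],
                      scanned: Iterable[Tuple[str, bytes]]) -> List[str]:
    """Return expected row keys that are absent or hold a different value.

    `expected` maps row key -> known static value.
    `scanned` yields (row_key, value) pairs, standing in for a table scan.
    """
    seen = dict(scanned)
    return sorted(key for key, value in expected.items()
                  if seen.get(key) != value)


# Hypothetical static data set, and a scan result with one row missing.
expected = {"row-001": b"a", "row-002": b"b", "row-003": b"c"}
scan_result = [("row-001", b"a"), ("row-003", b"c")]

missing = find_missing_rows(expected, scan_result)
if missing:
    print("ALERT: missing or changed rows:", missing)
```

Comparing values as well as keys means the check also flags rows that are
present but no longer hold the expected data.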

We quickly ran a restore from our backups (custom backup/restore: we have
the cells and restore them). I can provide logs from the region servers as
well as the Master and data nodes, but I've annotated the log of the region
server that was handling the compaction and split, along with the region
transitions for the parent and daughter regions from the master log:
https://gist.github.com/housejester/41a935db881d42f137b4

I'll be posting the other issues we have seen as separate threads
(including the crashes) so that the threads can be a bit more focused.

Thanks,
James

* Note: I'm aware that this pairing is not tested, and that even using
Hadoop 2.6.0 as a default caused some concern
(https://issues.apache.org/jira/browse/HBASE-13339). None of the issues
mentioned there applied to us, and our own testing didn't turn up any
problems, so we went forward with this setup.
