All,

We've been running HBase 0.98.12 with Hadoop 2.6.0* for about 3 months now, with really no issues across 4 clusters. Recently, however, we've started seeing some problems. I'm not sure they're related to this combination, and they may be fixed in 1.1.1 (which we are in the process of rolling out), but I wanted to post them here in case anyone can help us understand what is going on, or wants to dig in to see whether this could be affecting others.

The most critical issue is data loss. This happened only once, and it is the first time I've ever personally seen HBase lose data.

From what we can tell, a region was compacting, and a split started for the same region while the compaction was still in progress (it had finished 3 of 5 column families). The split started waiting for the compaction, and the compaction was cancelled (presumably because of the split). The split then proceeded and initialized the daughter regions. Then the region server crashed. The crash is not related to the split: the server had been crashing daily for another reason, related to scanning a large row as part of a custom daily backup, which happened to be running at the same time as this split.

When the region came back up, data was missing from the region that had been compacting and splitting (per some monitoring tests we have that scan for known static data sets). There are logs indicating that the daughter regions have no store files, so I suspect that the daughter regions replaced the parent region before the store files were fully associated with the daughter regions. It could also be that the large row is failing the split, but then I'd hope the split would be aborted and the parent region restored.

We quickly ran a restore from our backups (custom backup/restore: we have the cells and restore them).

I can provide logs from the region servers as well as the Master and data nodes. For now, I've annotated the log of the region server that was handling the compaction and split, along with the region transitions for the parent and daughter regions from the master log:

https://gist.github.com/housejester/41a935db881d42f137b4

I'll be posting the other issues we have seen (including the crashes) as separate threads so that each thread can be a bit more focused.

Thanks,
James

* Note: I'm aware that this pairing is not tested, and that even using Hadoop 2.6.0 as a default caused some concern:
https://issues.apache.org/jira/browse/HBASE-13339
None of the issues mentioned applied to us, and our own testing didn't turn up any issues, so we went forward with this setup.
