In our HBase clusters, a split sometimes fails because a file to be split does not exist in the parent region. In 0.94.2, this causes the regionserver to shut down, because the split transaction has already reached the PONR (point-of-no-return) state. In 0.94.20 and 0.98, the split fails but can be rolled back, because the split transaction has only reached the offlined_parent state.
In 0.94.2, the error looks like this:

2014-09-23 22:27:55,710 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region xxxxx in META
2014-09-23 22:27:55,820 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of xxxxx
Caused by: java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: xxxxx
Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: xxxxx
Caused by: java.io.FileNotFoundException: File does not exist: xxxxx
2014-09-23 22:27:55,823 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server xxx,60020,1411383568857: Abort; we got an error after point-of-no-return

The reason for the missing files is a little complex. The whole procedure involves two failed splits and one compaction:

1) There are too many files in the region, so a compaction is requested, but it does not execute right away because there are many CompactionRequests (compactionRunners) in the compaction queue. The CompactionRequest holds a reference to the Store object, and also holds the list of that store's storefiles to compact.

2) The region becomes big enough that a split is requested. The region is taken offline during the split and the store is closed, but the split fails while splitting the files (for some reason, such as busy I/O causing a timeout):

2014-09-23 18:26:02,738 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of xxxxx; Took too long to split the files and create the references, aborting split

3) The split rolls back successfully and the region is online again. During the rollback, a new Store object is created, but the Store held by the request in the compaction queue is not removed, so there are two (or maybe more) Store objects for the same store in the regionserver.

4) The compaction requested earlier now runs on the old Store object: some storefiles are compacted and removed, and new, bigger storefiles are created. But the Store that was reinitialized during the split rollback doesn't know about the change to the storefiles.

5) A split of the region runs again and fails again, because the storefiles it expects in the parent region no longer exist (they were removed by the compaction). The split transaction also doesn't know about the new file created by the compaction.

In 0.94.2, this error is not detected until the daughter regions are opened, which is too late: the split fails after PONR, and this causes the regionserver to shut down. In 0.94.20 and 0.98, splitStoreFiles looks into each storefile in the parent region and so finds the error before PONR, so the failed split can be rolled back. Code in HRegionFileSystem.splitStoreFile:

...
byte[] lastKey = f.createReader().getLastKey();

So this situation is a fatal error in early 0.94 versions, and the same bug is still present in later 0.94 and higher versions. It is also the reason why the storefile reader is sometimes null (it was closed by the first failed split).
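
The sequence above can be illustrated with a small stand-alone simulation. This is only a sketch of the stale-Store problem, not HBase code: FakeFs, FakeStore, compact, missingFiles and simulate are hypothetical names invented for illustration, and the "filesystem" is just an in-memory set of file names. It shows that a Store object created during rollback keeps a file list that the older Store object (still referenced by the queued compaction) later invalidates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical simulation only; none of these types exist in HBase.
final class StaleStoreSketch {

    /** Stand-in for HDFS: the set of store file names that currently exist. */
    static final class FakeFs {
        final Set<String> files = new TreeSet<>();
    }

    /** Stand-in for a Store: snapshots its store file list when it is opened. */
    static final class FakeStore {
        private final FakeFs fs;
        final List<String> storeFiles;

        FakeStore(FakeFs fs) {
            this.fs = fs;
            this.storeFiles = new ArrayList<>(fs.files);
        }

        /** Merges all cached files into one; only this object's list is updated. */
        void compact(String mergedName) {
            fs.files.removeAll(storeFiles);
            fs.files.add(mergedName);
            storeFiles.clear();
            storeFiles.add(mergedName);
        }

        /** Files this object believes it has, but which are gone from the filesystem. */
        List<String> missingFiles() {
            List<String> missing = new ArrayList<>();
            for (String f : storeFiles) {
                if (!fs.files.contains(f)) {
                    missing.add(f);
                }
            }
            return missing;
        }
    }

    /** Replays steps 1 to 5 and returns the files the second split cannot find. */
    static List<String> simulate() {
        FakeFs fs = new FakeFs();
        fs.files.add("hfile-1");
        fs.files.add("hfile-2");
        fs.files.add("hfile-3");

        // Step 1: the queued compaction request holds this Store object.
        FakeStore storeHeldByCompaction = new FakeStore(fs);

        // Steps 2 and 3: the first split fails; rollback creates a new Store object.
        FakeStore storeAfterRollback = new FakeStore(fs);

        // Step 4: the queued compaction finally runs, on the OLD Store object.
        storeHeldByCompaction.compact("hfile-merged");

        // Step 5: the second split uses the new Store's stale file list.
        return storeAfterRollback.missingFiles();
    }

    public static void main(String[] args) {
        System.out.println("Files missing at second split: " + simulate());
    }
}
```

In this sketch, a check like missingFiles() run before the point of no return plays the role that opening each storefile in splitStoreFile plays in 0.94.20 and 0.98: it turns a fatal abort into a failure that can be rolled back, but it does not remove the underlying cause, which is the stale Store object.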