In our HBase clusters, a split sometimes fails because a file to be split
does not exist in the parent region. In 0.94.2, this causes the
regionserver to shut down, because the split transaction has already
reached the PONR (point of no return) state. In 0.94.20 and 0.98, the split
fails but can be rolled back, because the split transaction has only
reached the offlined_parent state.
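
The difference between the two outcomes can be pictured with a toy model.
This is only a sketch, not HBase code: the class SplitJournalDemo, the
enums and the method onFailure are hypothetical names, and the only
assumption is that a failed split can be rolled back as long as its journal
has not yet recorded PONR.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitJournalDemo {
    // Hypothetical phases; only the two states named in this report are modeled.
    enum Phase { OFFLINED_PARENT, PONR }

    enum Outcome { ROLLED_BACK, ABORT_SERVER }

    // Decide what a failure means, given the phases the split has passed.
    static Outcome onFailure(List<Phase> journal) {
        // Once PONR is journaled, rollback is no longer possible.
        return journal.contains(Phase.PONR)
                ? Outcome.ABORT_SERVER
                : Outcome.ROLLED_BACK;
    }

    public static void main(String[] args) {
        List<Phase> journal = new ArrayList<>();
        journal.add(Phase.OFFLINED_PARENT);
        // Missing file noticed here: the 0.94.20 / 0.98 behavior.
        System.out.println("before PONR: " + onFailure(journal));

        journal.add(Phase.PONR);
        // Missing file noticed here: the 0.94.2 behavior.
        System.out.println("after PONR:  " + onFailure(journal));
    }
}
```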

In 0.94.2, the error looks like this:
2014-09-23 22:27:55,710 INFO org.apache.hadoop.hbase.catalog.MetaEditor:
Offlined parent region xxxxx in META
2014-09-23 22:27:55,820 INFO
org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup
of failed split of xxxxx
Caused by: java.io.IOException: java.io.IOException:
java.io.FileNotFoundException: File does not exist: xxxxx
Caused by: java.io.IOException: java.io.FileNotFoundException: File does
not exist: xxxxx
Caused by: java.io.FileNotFoundException: File does not exist: xxxxx
2014-09-23 22:27:55,823 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
xxx,60020,1411383568857: Abort; we got an error after point-of-no-return

The reason for the missing files is a little complex. The whole procedure
involves two failed splits and one compaction:
1) There are too many files in the region and a compaction is requested,
but it is not executed because there are many
CompactionRequests(compactionRunners) in the compaction queue. The
CompactionRequest holds a reference to the Store object, and also holds the
list of that store's storefiles to compact.

2) The region grows big enough and a split is requested. The region is
offline during the split, and the store is closed. But the split fails
while splitting the files (for some reason, such as busy I/O causing a
timeout):
2014-09-23 18:26:02,738 INFO
org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup
of failed split of xxxxx; Took too long to split the files and create the
references, aborting split

3) The split rolls back successfully, and the region is online again.
During the rollback, a new Store object is created, but the old Store
referenced by the queued compaction request is not removed, so there are
two (or maybe more) Store objects for the same store in the regionserver.

4) The compaction requested earlier on the region's store now runs. Some
storefiles are compacted and removed, and new, bigger storefiles are
created. But the store that was reinitialized during the split rollback
does not know about the change to the storefiles.

5) The split on the region runs again and fails again, because the
storefiles in the parent region no longer exist (removed by the
compaction). Also, the split transaction does not know about the new file
created by the compaction. In 0.94.2, this error is not detected until the
daughter regions are opened, but by then it is too late: the split fails
after the PONR state, and this causes the regionserver to shut down. In
0.94.20 and 0.98, splitStoreFiles looks into each storefile in the parent
region and so detects the error before PONR, so the failed split can be
rolled back.
     code in HRegionFileSystem.splitStoreFile:
     ...
     byte[] lastKey = f.createReader().getLastKey();
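
The five steps above can be replayed in a small self-contained model. This
is a sketch under stated assumptions, not HBase code: ToyStore,
ToyCompactionRequest, missingForSplit and the file names f1..f4 are all
hypothetical, and a plain set of names stands in for the region directory
on HDFS. It only shows how a store object created during rollback ends up
with a stale file list once the queued compaction runs against the old
object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class StaleStoreDemo {
    // Toy store: an in-memory view of the files it believes it owns.
    static class ToyStore {
        final List<String> storeFiles;
        ToyStore(Set<String> onDisk) {
            // Snapshot of the directory, taken once when the store is opened.
            storeFiles = new ArrayList<>(new TreeSet<>(onDisk));
        }
    }

    // Toy request: pins the store object and the file list it was built with.
    static class ToyCompactionRequest {
        final ToyStore store;
        final List<String> filesToCompact;
        ToyCompactionRequest(ToyStore store, List<String> filesToCompact) {
            this.store = store;
            this.filesToCompact = filesToCompact;
        }
        void run(Set<String> fs, String output) {
            // Inputs are deleted and one bigger file is written.
            fs.removeAll(filesToCompact);
            fs.add(output);
            // Only the pinned (old) store object learns about the change.
            store.storeFiles.removeAll(filesToCompact);
            store.storeFiles.add(output);
        }
    }

    // Files the store would try to split that are no longer on disk.
    static List<String> missingForSplit(ToyStore store, Set<String> fs) {
        List<String> missing = new ArrayList<>();
        for (String f : store.storeFiles) {
            if (!fs.contains(f)) {
                missing.add(f);
            }
        }
        return missing;
    }

    static List<String> simulate() {
        Set<String> fs = new TreeSet<>(Arrays.asList("f1", "f2", "f3"));
        ToyStore oldStore = new ToyStore(fs);
        // Step 1: compaction is queued and holds the old store object.
        ToyCompactionRequest queued =
                new ToyCompactionRequest(oldStore, Arrays.asList("f1", "f2"));
        // Steps 2 and 3: first split fails; rollback opens a new store object.
        ToyStore newStore = new ToyStore(fs);
        // Step 4: the queued compaction finally runs against the old object.
        queued.run(fs, "f4");
        // Step 5: the second split uses the new store's stale file list.
        return missingForSplit(newStore, fs);
    }

    public static void main(String[] args) {
        System.out.println("Files missing at second split: " + simulate());
    }
}
```

In this model the second split looks for f1 and f2, which the compaction
has already removed, and it never sees f4.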

So this situation is a fatal error in earlier 0.94 versions, and it is
still a bug shared by later 0.94 and higher versions. It is also the reason
why the storefile reader is sometimes null (closed by the first failed
split).
