Hi all,

We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
2.0.0-cdh4.6.0).
For all of our tables, we set the replication factor to 1 (dfs.replication
= 1 in hbase-site.xml). We set it to 1 because we wanted to minimize HDFS
usage (we now realize we should have set it to at least 2, because failure
is the norm in distributed systems).

Due to the amount of data, at some point we ran low on disk space in HDFS
and one of our DataNodes (DNs) went down. We have since recovered the DN,
but we still have the following problems in HBase and HDFS.

*Issue#1*. Some HBase regions are permanently in transition. '*hbase hbck
-repair*' gets stuck because it is waiting for the region transitions to
finish. Some output:

*hbase(main):003:0> status 'detailed'*
*12 regionsInTransition*
*
plr_id_insta_media_live,\x02:;6;7;398962:3:399a49:653:64,1421565172917.1528f288473632aca2636443574a6ba1.
state=OPENING, ts=1424227696897, server=null*
*
plr_sg_insta_media_live,\x0098;522:997;8798665a64;67879,1410768824800.2c79bbc5c0dc2d2b39c04c8abc0a90ff.
state=OFFLINE, ts=1424227714203, server=null*
*
plr_sg_insta_media_live,\x00465892:9935773828;a4459;649,1410767723471.55097cfc60bc9f50303dadb02abcd64b.
state=OPENING, ts=1424227701234, server=null*
*
plr_sg_insta_media_live,\x00474973488232837733a38744,1410767723471.740d6655afb74a2ff421c6ef16037f57.
state=OPENING, ts=1424227708053, server=null*
*
plr_id_insta_media_live,\x02::449::4;:466;3988a6432677;3,1419435100617.7caf3d749dce37037eec9ccc29d272a1.
state=OPENING, ts=1424227701484, server=null*
*
plr_sg_insta_media_live,\x05779793546323;::4:4a3:8227928,1418845792479.81c4da129ae5b7b204d5373d9e0fea3d.
state=OPENING, ts=1424227705353, server=null*
*
plr_sg_insta_media_live,\x009;5:686348963:33:5a5634887,1410769837567.8a9ded24960a7787ca016e2073b24151.
state=OPENING, ts=1424227706293, server=null*
*
plr_sg_insta_media_live,\x0375;6;7377578;84226a7663792,1418980694076.a1e1c98f646ee899010f19a9c693c67c.
state=OPENING, ts=1424227680569, server=null*
*
plr_sg_insta_media_live,\x018;3826368274679364a3;;73457;,1421425643816.b04ffda1b2024bac09c9e6246fb7b183.
state=OPENING, ts=1424227680538, server=null*
*
plr_sg_insta_media_live,\x0154752;22:43377542:a:86:239,1410771044924.c57d6b4d23f21d3e914a91721a99ce12.
state=OPENING, ts=1424227710847, server=null*
*
plr_sg_insta_media_live,\x0069;7;9384697:;8685a885485:,1410767928822.c7b5e53cdd9e1007117bcaa199b30d1c.
state=OPENING, ts=1424227700962, server=null*
*
plr_sg_insta_media_live,\x04994537646:78233569a3467:987;7,1410787903804.cd49ec64a0a417aa11949c2bc2d3df6e.
state=OPENING, ts=1424227691774, server=null*
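
As far as we understand, the usual way to clear stuck assignments would be
something like the commands below, but we assume they cannot succeed here
while the underlying HFile blocks are missing:

    # full report of region states, then reassign regions that hbck
    # reports as stuck in transition / not deployed
    hbase hbck -details
    hbase hbck -fixAssignments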


*Issue#2*. The next step we took was to check the HDFS file status using
'*hdfs fsck /*'. It shows that the filesystem '/' is corrupt, with these
statistics:
* Total size:    15494284950796 B (Total open files size: 17179869184 B)*
* Total dirs:    9198*
* Total files:   124685 (Files currently being written: 21)*
* Total blocks (validated):      219620 (avg. block size 70550427 B) (Total
open file blocks (not validated): 144)*
*  *********************************
*  CORRUPT FILES:        42*
*  MISSING BLOCKS:       142*
*  MISSING SIZE:         14899184084 B*
*  CORRUPT BLOCKS:       142*
*  *********************************
* Corrupt blocks:                142*
* Number of data-nodes:          14*
* Number of racks:               1*
*FSCK ended at Tue Feb 17 17:25:18 SGT 2015 in 3026 milliseconds*


*The filesystem under path '/' is CORRUPT*

So it seems that HDFS lost some of its blocks due to the DN failure, and
since dfs.replication is 1, it cannot recover the missing blocks.
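
To see exactly which files are affected (and whether they are all HBase
HFiles), we believe fsck can list them, e.g.:

    # list the files that have missing/corrupt blocks
    hdfs fsck / -list-corruptfileblocks

    # more detail, restricted to the HBase root dir
    hdfs fsck /hbase -files -blocks -locations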

*Issue#3*. Although '*hbase hbck -repair*' gets stuck, we are able to run
'*hbase hbck -fixHdfsHoles*'. We noticed the following error messages (I
copied some of them to represent each type of error message that we see):
- *ERROR: Region { meta =>
plr_id_insta_media_live,\x02:;6;7;398962:3:399a49:653:64,1421565172917.1528f288473632aca2636443574a6ba1.,
hdfs => hdfs://nameservice1/hbase/plr_id_insta_media_live/1528f2884*
*73632aca2636443574a6ba1, deployed =>  } not deployed on any region server.*
- *ERROR: Region { meta => null, hdfs =>
hdfs://nameservice1/hbase/plr_sg_insta_media_live/8473d25be5980c169bff13cf90229939,
deployed =>  } on HDFS, but not listed in META or deployed on any region
server*
*- ERROR: Region { meta =>
plr_sg_insta_media_live,\x0293:729769;975376;2a33995622;3,1421985489851.8819ebd296f075513056be4bbd30ee9c.,
hdfs => null, deployed =>  } found in META, but not in HDFS or deployed on
any region server.*
-ERROR: There is a hole in the region chain between
\x099599464:7:5;3595;8a:57868;95 and \x099;56535:4632439643a82826562:.  You
need to create a new .regioninfo and region dir in hdfs to plug the hole.
-ERROR: Last region should end with an empty key. You need to create a new
region and regioninfo in HDFS to plug the hole.
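
From reading the hbck usage text, our understanding (please correct us if
we are wrong) is that these errors map roughly to the following repair
flags, which we believe is close to what -repairHoles does:

    # 'not deployed', 'on HDFS but not in META', 'in META but not in HDFS'
    #   -> -fixAssignments and -fixMeta
    # 'hole in the region chain' / 'last region should end with an empty key'
    #   -> -fixHdfsHoles (fabricates an empty region dir + .regioninfo)
    hbase hbck -fixAssignments -fixMeta -fixHdfsHoles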

Now, to fix these issues, we plan to perform the following action items (a
rough command sketch follows the list):

   1. Move or delete the corrupted files in HDFS
   2. Repair HBase by deleting the references to the corrupted files/blocks
   from the HBase meta table (it's okay to lose some of the data)
   3. Or create empty HFiles as shown in
   http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/31308
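
Concretely, the procedure we have in mind for items 1 and 2 is roughly the
following (assuming we accept the loss of the affected rows):

    # 1. enumerate the files with missing blocks, then move them to
    #    /lost+found on HDFS (or delete them outright with -delete)
    hdfs fsck / -list-corruptfileblocks
    hdfs fsck / -move

    # 2. let hbck rebuild the region dirs / META entries as sketched
    #    under Issue#3, then re-check both HDFS and HBase
    hbase hbck -fixAssignments -fixMeta -fixHdfsHoles
    hdfs fsck /
    hbase hbck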


And our questions are:

   1. Is it safe to move or delete the corrupted files in HDFS? Can we make
   HBase ignore those files and delete the corresponding HBase files?
   2. Any comments on our action items?


Best regards,

Arinto
www.otnira.com
