Hi all,

We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop 2.0.0-cdh4.6.0). For all of our tables, we set the replication factor to 1 (dfs.replication = 1 in hbase-site.xml). We set it to 1 because we wanted to minimize HDFS usage (we now realize we should have set it to at least 2, because "failure is the norm" in distributed systems).
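For reference, this is the setting in question in hbase-site.xml, shown here with the value we now think it should have had. (As far as we understand, changing it only applies to files HBase writes afterwards; existing files would need 'hdfs dfs -setrep'.)

```xml
<!-- hbase-site.xml: replication factor HBase asks HDFS to use for the
     HFiles and WALs it writes. We had this at 1; with 2 or more, HDFS
     could have re-replicated the blocks lost when the DataNode died. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```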
Due to the amount of data, at some point we ran low on disk space in HDFS, and one of our DataNodes went down. Although we have since recovered the DN, we still have the following problems in HBase and HDFS.

*Issue #1*. Some HBase regions are permanently in transition, and '*hbase hbck -repair*' is stuck because it waits for the transitions to finish. Some output from *hbase(main):003:0> status 'detailed'*:

12 regionsInTransition
  plr_id_insta_media_live,\x02:;6;7;398962:3:399a49:653:64,1421565172917.1528f288473632aca2636443574a6ba1. state=OPENING, ts=1424227696897, server=null
  plr_sg_insta_media_live,\x0098;522:997;8798665a64;67879,1410768824800.2c79bbc5c0dc2d2b39c04c8abc0a90ff. state=OFFLINE, ts=1424227714203, server=null
  plr_sg_insta_media_live,\x00465892:9935773828;a4459;649,1410767723471.55097cfc60bc9f50303dadb02abcd64b. state=OPENING, ts=1424227701234, server=null
  plr_sg_insta_media_live,\x00474973488232837733a38744,1410767723471.740d6655afb74a2ff421c6ef16037f57. state=OPENING, ts=1424227708053, server=null
  plr_id_insta_media_live,\x02::449::4;:466;3988a6432677;3,1419435100617.7caf3d749dce37037eec9ccc29d272a1. state=OPENING, ts=1424227701484, server=null
  plr_sg_insta_media_live,\x05779793546323;::4:4a3:8227928,1418845792479.81c4da129ae5b7b204d5373d9e0fea3d. state=OPENING, ts=1424227705353, server=null
  plr_sg_insta_media_live,\x009;5:686348963:33:5a5634887,1410769837567.8a9ded24960a7787ca016e2073b24151. state=OPENING, ts=1424227706293, server=null
  plr_sg_insta_media_live,\x0375;6;7377578;84226a7663792,1418980694076.a1e1c98f646ee899010f19a9c693c67c. state=OPENING, ts=1424227680569, server=null
  plr_sg_insta_media_live,\x018;3826368274679364a3;;73457;,1421425643816.b04ffda1b2024bac09c9e6246fb7b183. state=OPENING, ts=1424227680538, server=null
  plr_sg_insta_media_live,\x0154752;22:43377542:a:86:239,1410771044924.c57d6b4d23f21d3e914a91721a99ce12. state=OPENING, ts=1424227710847, server=null
  plr_sg_insta_media_live,\x0069;7;9384697:;8685a885485:,1410767928822.c7b5e53cdd9e1007117bcaa199b30d1c. state=OPENING, ts=1424227700962, server=null
  plr_sg_insta_media_live,\x04994537646:78233569a3467:987;7,1410787903804.cd49ec64a0a417aa11949c2bc2d3df6e. state=OPENING, ts=1424227691774, server=null

*Issue #2*. The next step we took was to check the HDFS file status with '*hdfs fsck /*'. It shows that the filesystem under '/' is corrupt, with these statistics:

 Total size:    15494284950796 B (Total open files size: 17179869184 B)
 Total dirs:    9198
 Total files:   124685 (Files currently being written: 21)
 Total blocks (validated): 219620 (avg. block size 70550427 B) (Total open file blocks (not validated): 144)
  ********************************
  CORRUPT FILES:        42
  MISSING BLOCKS:       142
  MISSING SIZE:         14899184084 B
  CORRUPT BLOCKS:       142
  ********************************
 Corrupt blocks:        142
 Number of data-nodes:  14
 Number of racks:       1
FSCK ended at Tue Feb 17 17:25:18 SGT 2015 in 3026 milliseconds
The filesystem under path '/' is CORRUPT

So it seems that HDFS lost some of its blocks due to the DN failure, and since dfs.replication is 1, it could not recover the missing blocks.

*Issue #3*. Although '*hbase hbck -repair*' is stuck, we are able to run '*hbase hbck -fixHdfsHoles*'. We noticed the following error messages (I copied a few of them to represent each type of error that we see):
- ERROR: Region { meta => plr_id_insta_media_live,\x02:;6;7;398962:3:399a49:653:64,1421565172917.1528f288473632aca2636443574a6ba1., hdfs => hdfs://nameservice1/hbase/plr_id_insta_media_live/1528f288473632aca2636443574a6ba1, deployed => } not deployed on any region server.
- ERROR: Region { meta => null, hdfs => hdfs://nameservice1/hbase/plr_sg_insta_media_live/8473d25be5980c169bff13cf90229939, deployed => } on HDFS, but not listed in META or deployed on any region server
- ERROR: Region { meta => plr_sg_insta_media_live,\x0293:729769;975376;2a33995622;3,1421985489851.8819ebd296f075513056be4bbd30ee9c., hdfs => null, deployed => } found in META, but not in HDFS or deployed on any region server.
- ERROR: There is a hole in the region chain between \x099599464:7:5;3595;8a:57868;95 and \x099;56535:4632439643a82826562:. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
- ERROR: Last region should end with an empty key. You need to create a new region and regioninfo in HDFS to plug the hole.

To fix this, we plan to perform the following action items:
1. Move or delete the corrupted files in HDFS.
2. Repair HBase by deleting the references to the corrupted files/blocks from the HBase meta tables (it's okay to lose some of the data).
3. Or create empty HFiles, as shown in http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/31308

And our questions are:
1. Is it safe to move or delete the corrupted files in HDFS? Can we make HBase ignore those files and delete the corresponding HBase files?
2. Any comments on our action items?

Best regards,
Arinto
www.otnira.com
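P.S. In case it helps the discussion, here is the rough sequence we are considering, sketched as shell commands. The fsck flags and hbck options are from the stock CDH4 tooling as we understand them; 'REGION_NAME' is a placeholder, not one of our real regions, so please correct us if any step is wrong.

```
# 1. List exactly which files have missing/corrupt blocks, so we know
#    what data we would lose before touching anything.
hdfs fsck / -list-corruptfileblocks

# 2. Either sideline the corrupt files into /lost+found (-move) or drop
#    them outright (-delete). We would start with -move so nothing is
#    irreversibly gone; both only act on files fsck reports as corrupt.
hdfs fsck /hbase -move
# hdfs fsck /hbase -delete

# 3. With the corrupt HFiles out of the way, retry the stuck region
#    assignments. A single region can also be kicked from the hbase shell:
#      hbase> unassign 'REGION_NAME', true
#      hbase> assign 'REGION_NAME'
#    or in bulk:
hbase hbck -fixAssignments

# 4. Finally, reconcile META with HDFS and plug the holes in the region
#    chain. -repairHoles is the shortcut for -fixAssignments -fixMeta
#    -fixHdfsHoles -fixHdfsOrphans (it does not touch overlaps).
hbase hbck -repairHoles
```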