Re: HBase Region always in transition + corrupt HDFS
Hi all,

While digging through the /hbase folder in HDFS, I found a WAL folder that is still marked as splitting; its last modified time is 2 weeks ago (I changed the node name in this example):

*drwxr-xr-x - hbase hbase 0 2015-02-11 18:16 /hbase/.logs/,60020,1417575124373-splitting*

Is this case common? After reading about log splitting, my understanding is that this folder should disappear after a successful recovery. CMIIW.

Best regards,

Arinto
www.otnira.com

On Tue, Feb 24, 2015 at 11:45 AM, Arinto Murdopo wrote:
> On Tue, Feb 24, 2015 at 9:46 AM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
>>> I don't have the list of the corrupted files yet. I notice that when I try to Get some of the files, my HBase client code throws these exceptions:
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
>>> Mon Feb 23 17:49:32 SGT 2015,
>>> org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
>>> org.apache.hadoop.hbase.NotServingRegionException:
>>> org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
>>> plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.
>>
>> FSCK should give you the list of corrupt files. Can you extract it from there?
>
> Yup, I managed to extract them. We have corrupt files as well as missing files. Luckily there's no .regioninfo file corrupted or missing. I'll read more about HFile before updating this thread more. :)
>
>>> Can I use these exceptions to determine the corrupted files?
>>> The files are media data (images or videos) obtained from the internet.
>>
>> This exception gives you all the hints for a directory, most probably under /hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e
>>
>> Files under this directory might be corrupted, but you need to find which files. If it's an HFile it's easy. If it's the .regioninfo it's a bit more tricky.
>
> Arinto
> www.otnira.com
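A quick way to spot leftover `-splitting` WAL directories like the one above is to filter an `hdfs dfs -ls /hbase/.logs` listing on the directory-name suffix. This is a minimal sketch: the listing below is a captured sample (the second path is a hypothetical healthy node), and on a live cluster you would pipe the real `hdfs dfs -ls` output in instead.

```shell
#!/bin/sh
# Sample `hdfs dfs -ls /hbase/.logs` output; on a real cluster replace this
# here-string with the actual command's output.
listing='drwxr-xr-x   - hbase hbase          0 2015-02-11 18:16 /hbase/.logs/,60020,1417575124373-splitting
drwxr-xr-x   - hbase hbase          0 2015-02-24 09:00 /hbase/.logs/node2,60020,1424700000000'

# Keep only the path column (last field) of entries whose name ends in "-splitting".
stale=$(echo "$listing" | awk '$NF ~ /-splitting$/ { print $NF }')
echo "$stale"
```

A directory that stays in this list for two weeks suggests log splitting never completed for that server; sidelining or re-triggering recovery would be the next step, but verify the master is not still working on it first.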
Re: HBase Region always in transition + corrupt HDFS
On Tue, Feb 24, 2015 at 9:46 AM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
>> I don't have the list of the corrupted files yet. I notice that when I try to Get some of the files, my HBase client code throws these exceptions:
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
>> Mon Feb 23 17:49:32 SGT 2015,
>> org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
>> plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.
>
> FSCK should give you the list of corrupt files. Can you extract it from there?

Yup, I managed to extract them. We have corrupt files as well as missing files. Luckily there's no .regioninfo file corrupted or missing. I'll read more about HFile before updating this thread more. :)

>> Can I use these exceptions to determine the corrupted files?
>> The files are media data (images or videos) obtained from the internet.
>
> This exception gives you all the hints for a directory, most probably under /hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e
>
> Files under this directory might be corrupted, but you need to find which files. If it's an HFile it's easy. If it's the .regioninfo it's a bit more tricky.

Arinto
www.otnira.com
Re: HBase Region always in transition + corrupt HDFS
Arinto:
Probably you should take a look at HBASE-12949.

Cheers

On Mon, Feb 23, 2015 at 5:25 PM, Arinto Murdopo wrote:
> @JM:
> You mentioned deleting "the files"; are you referring to HDFS files or files on HBase?
>
> Our cluster has 15 nodes. We used 14 of them as DNs. Actually we tried to enable the remaining one as a DN (so that we have 15 DNs), but then we disabled it (so now we have 14 again). Probably our crawlers wrote some data into the additional DN without any replication. Maybe I could try to enable the DN again.
>
> I don't have the list of the corrupted files yet. I notice that when I try to Get some of the files, my HBase client code throws these exceptions:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
> Mon Feb 23 17:49:32 SGT 2015,
> org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
> plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.
>
> Can I use these exceptions to determine the corrupted files?
> The files are media data (images or videos) obtained from the internet.
>
> @Michael Segel: Yup, 3 is the default and recommended value. We were overwhelmed with the amount of data, so we foolishly reduced our replication factor. We have learnt the lesson the hard way :).
>
> Fortunately it's okay to lose the data, i.e. we can easily recover it from our other data.
>
> Arinto
> www.otnira.com
>
> On Tue, Feb 24, 2015 at 8:06 AM, Michael Segel wrote:
>> I’m sorry, but I implied checking the checksums of the blocks. Didn’t think I needed to spell it out. Next time I’ll be a bit more precise.
>>
>>> On Feb 23, 2015, at 2:34 PM, Nick Dimiduk wrote:
>>>
>>> HBase/HDFS are maintaining block checksums, so presumably a corrupted block would fail checksum validation. Increasing the number of replicas increases the odds that you'll still have a valid block. I'm not an HDFS expert, but I would be very surprised if HDFS is validating a "questionable block" via byte-wise comparison over the network amongst the replica peers.
>>>
>>> On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel wrote:
>>>
>>>> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo wrote:
>>>>
>>>> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop 2.0.0-cdh4.6.0). For all of our tables, we set the replication factor to 1 (dfs.replication = 1 in hbase-site.xml). We set it to 1 because we want to minimize the HDFS usage (now we realize we should set this value to at least 2, because "failure is a norm" in distributed systems).
>>>>
>>>> Sorry, but you really want this to be a replication value of at least 3 and not 2.
>>>>
>>>> Suppose you have corruption but not a lost block. Which copy of the two is right?
>>>> With 3, you can compare the three and hopefully 2 of the 3 will match.
Re: HBase Region always in transition + corrupt HDFS
2015-02-23 20:25 GMT-05:00 Arinto Murdopo:
> @JM:
> You mentioned deleting "the files"; are you referring to HDFS files or files on HBase?

Your HBase files are stored in HDFS, so I think we are referring to the same thing. Look into /hbase in your HDFS to find the HBase files.

> Our cluster has 15 nodes. We used 14 of them as DNs. Actually we tried to enable the remaining one as a DN (so that we have 15 DNs), but then we disabled it (so now we have 14 again). Probably our crawlers wrote some data into the additional DN without any replication. Maybe I could try to enable the DN again.

That's a very valid option. If you still have the DN directories, just enable it back to see if you can recover the blocks...

> I don't have the list of the corrupted files yet. I notice that when I try to Get some of the files, my HBase client code throws these exceptions:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
> Mon Feb 23 17:49:32 SGT 2015,
> org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
> plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.

FSCK should give you the list of corrupt files. Can you extract it from there?

> Can I use these exceptions to determine the corrupted files?
> The files are media data (images or videos) obtained from the internet.

This exception gives you all the hints for a directory, most probably under /hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e

Files under this directory might be corrupted, but you need to find which files. If it's an HFile it's easy. If it's the .regioninfo it's a bit more tricky.

JM

> Arinto
> www.otnira.com
>
> On Tue, Feb 24, 2015 at 8:06 AM, Michael Segel wrote:
>> I’m sorry, but I implied checking the checksums of the blocks. Didn’t think I needed to spell it out. Next time I’ll be a bit more precise.
>>
>>> On Feb 23, 2015, at 2:34 PM, Nick Dimiduk wrote:
>>>
>>> HBase/HDFS are maintaining block checksums, so presumably a corrupted block would fail checksum validation. Increasing the number of replicas increases the odds that you'll still have a valid block. I'm not an HDFS expert, but I would be very surprised if HDFS is validating a "questionable block" via byte-wise comparison over the network amongst the replica peers.
>>>
>>> On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel wrote:
>>>
>>>> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo wrote:
>>>>
>>>> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop 2.0.0-cdh4.6.0). For all of our tables, we set the replication factor to 1 (dfs.replication = 1 in hbase-site.xml). We set it to 1 because we want to minimize the HDFS usage (now we realize we should set this value to at least 2, because "failure is a norm" in distributed systems).
>>>>
>>>> Sorry, but you really want this to be a replication value of at least 3 and not 2.
>>>>
>>>> Suppose you have corruption but not a lost block. Which copy of the two is right?
>>>> With 3, you can compare the three and hopefully 2 of the 3 will match.
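Extracting the corrupt file list out of fsck output, as suggested above, can be scripted. This is a minimal sketch over a made-up fsck excerpt (the file names `f1`/`f2`/`f3` and table `ok_table` are hypothetical); on a live cluster you would feed in saved `hdfs fsck /` output, or ask the NameNode directly with `hdfs fsck / -list-corruptfileblocks`.

```shell
#!/bin/sh
# Made-up sample of `hdfs fsck /` per-file output lines.
fsck_out='/hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e/media/f1: CORRUPT blockpool BP-1 block blk_1
/hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e/media/f2: MISSING 1 blocks of total size 1024 B
/hbase/ok_table/abc/media/f3: OK'

# Keep the path (text before ": ") of every line flagged CORRUPT or MISSING.
corrupt=$(echo "$fsck_out" | awk -F': ' '/: (CORRUPT|MISSING)/ { print $1 }' | sort -u)
echo "$corrupt"
```

Paths that land under a single region directory, as in this sample, point back to the region named in the NotServingRegionException.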
Re: HBase Region always in transition + corrupt HDFS
@JM:
You mentioned deleting "the files"; are you referring to HDFS files or files on HBase?

Our cluster has 15 nodes. We used 14 of them as DNs. Actually we tried to enable the remaining one as a DN (so that we have 15 DNs), but then we disabled it (so now we have 14 again). Probably our crawlers wrote some data into the additional DN without any replication. Maybe I could try to enable the DN again.

I don't have the list of the corrupted files yet. I notice that when I try to Get some of the files, my HBase client code throws these exceptions:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
Mon Feb 23 17:49:32 SGT 2015,
org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.

Can I use these exceptions to determine the corrupted files?
The files are media data (images or videos) obtained from the internet.

@Michael Segel: Yup, 3 is the default and recommended value. We were overwhelmed with the amount of data, so we foolishly reduced our replication factor. We have learnt the lesson the hard way :).

Fortunately it's okay to lose the data, i.e. we can easily recover it from our other data.

Arinto
www.otnira.com

On Tue, Feb 24, 2015 at 8:06 AM, Michael Segel wrote:
> I’m sorry, but I implied checking the checksums of the blocks. Didn’t think I needed to spell it out. Next time I’ll be a bit more precise.
>
>> On Feb 23, 2015, at 2:34 PM, Nick Dimiduk wrote:
>>
>> HBase/HDFS are maintaining block checksums, so presumably a corrupted block would fail checksum validation. Increasing the number of replicas increases the odds that you'll still have a valid block. I'm not an HDFS expert, but I would be very surprised if HDFS is validating a "questionable block" via byte-wise comparison over the network amongst the replica peers.
>>
>> On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel wrote:
>>
>>> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo wrote:
>>>
>>> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop 2.0.0-cdh4.6.0). For all of our tables, we set the replication factor to 1 (dfs.replication = 1 in hbase-site.xml). We set it to 1 because we want to minimize the HDFS usage (now we realize we should set this value to at least 2, because "failure is a norm" in distributed systems).
>>>
>>> Sorry, but you really want this to be a replication value of at least 3 and not 2.
>>>
>>> Suppose you have corruption but not a lost block. Which copy of the two is right?
>>> With 3, you can compare the three and hopefully 2 of the 3 will match.
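The exceptions quoted in the message above can indeed point at the files: an HBase region name has the shape `table,startkey,timestamp.encodedname.`, and in the 0.94 layout the region's data lives under `/hbase/<table>/<encoded name>`. A minimal sketch of extracting that path with plain shell parameter expansion (using the region name from the exception):

```shell
#!/bin/sh
# Region name exactly as printed in the NotServingRegionException.
region='plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.'

table=${region%%,*}        # text before the first comma is the table name
encoded=${region%.}        # drop the trailing dot...
encoded=${encoded##*.}     # ...then keep what follows the last remaining dot
region_dir="/hbase/${table}/${encoded}"
echo "$region_dir"
```

Cross-referencing this directory against the fsck corrupt-file list tells you whether the unreachable region is backed by damaged HFiles.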
Re: HBase Region always in transition + corrupt HDFS
I’m sorry, but I implied checking the checksums of the blocks. Didn’t think I needed to spell it out. Next time I’ll be a bit more precise.

> On Feb 23, 2015, at 2:34 PM, Nick Dimiduk wrote:
>
> HBase/HDFS are maintaining block checksums, so presumably a corrupted block would fail checksum validation. Increasing the number of replicas increases the odds that you'll still have a valid block. I'm not an HDFS expert, but I would be very surprised if HDFS is validating a "questionable block" via byte-wise comparison over the network amongst the replica peers.
>
> On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel wrote:
>
>> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo wrote:
>>
>> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop 2.0.0-cdh4.6.0). For all of our tables, we set the replication factor to 1 (dfs.replication = 1 in hbase-site.xml). We set it to 1 because we want to minimize the HDFS usage (now we realize we should set this value to at least 2, because "failure is a norm" in distributed systems).
>>
>> Sorry, but you really want this to be a replication value of at least 3 and not 2.
>>
>> Suppose you have corruption but not a lost block. Which copy of the two is right?
>> With 3, you can compare the three and hopefully 2 of the 3 will match.
Re: HBase Region always in transition + corrupt HDFS
HBase/HDFS are maintaining block checksums, so presumably a corrupted block would fail checksum validation. Increasing the number of replicas increases the odds that you'll still have a valid block. I'm not an HDFS expert, but I would be very surprised if HDFS is validating a "questionable block" via byte-wise comparison over the network amongst the replica peers.

On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel wrote:
>
> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo wrote:
>
> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop 2.0.0-cdh4.6.0). For all of our tables, we set the replication factor to 1 (dfs.replication = 1 in hbase-site.xml). We set it to 1 because we want to minimize the HDFS usage (now we realize we should set this value to at least 2, because "failure is a norm" in distributed systems).
>
> Sorry, but you really want this to be a replication value of at least 3 and not 2.
>
> Suppose you have corruption but not a lost block. Which copy of the two is right?
> With 3, you can compare the three and hopefully 2 of the 3 will match.
Re: HBase Region always in transition + corrupt HDFS
> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo wrote:
>
> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop 2.0.0-cdh4.6.0).
> For all of our tables, we set the replication factor to 1 (dfs.replication = 1 in hbase-site.xml). We set it to 1 because we want to minimize the HDFS usage (now we realize we should set this value to at least 2, because "failure is a norm" in distributed systems).

Sorry, but you really want this to be a replication value of at least 3 and not 2.

Suppose you have corruption but not a lost block. Which copy of the two is right?
With 3, you can compare the three and hopefully 2 of the 3 will match.
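The "2 of the 3 will match" argument is just a majority vote over per-replica checksums. A toy illustration (the checksum values are made-up stand-ins; real HDFS does this internally via block checksum verification, not a script):

```shell
#!/bin/sh
# Pretend checksums of the same block read from three replicas;
# replica 3 is the corrupted one in this example.
r1="d41d8cd9"; r2="d41d8cd9"; r3="ffffffff"

# Count identical values and accept the one at least two replicas agree on.
majority=$(printf '%s\n%s\n%s\n' "$r1" "$r2" "$r3" \
  | sort | uniq -c | sort -rn \
  | awk 'NR==1 && $1>=2 { print $2 }')
echo "${majority:-no majority}"
```

With only two replicas and one corruption there is no majority, which is exactly why replication 2 cannot tell you which copy is right.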
Re: HBase Region always in transition + corrupt HDFS
You have no other choice than removing those files... You will lose the related data, but it should be fine if they are only HFiles.

Do you have the list of corrupted files? What kind of files are they?

Also, have you lost a node or a disk? How have you lost about 150 blocks?

JM

2015-02-23 2:47 GMT-05:00 Arinto Murdopo:
> Hi all,
>
> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop 2.0.0-cdh4.6.0).
> For all of our tables, we set the replication factor to 1 (dfs.replication = 1 in hbase-site.xml). We set it to 1 because we want to minimize the HDFS usage (now we realize we should set this value to at least 2, because "failure is a norm" in distributed systems).
>
> Due to the amount of data, at some point we had low disk space in HDFS and one of our DNs was down. Now we have these problems in HBase and HDFS although we have recovered our DN.
>
> *Issue #1*. Some HBase regions are always in transition. '*hbase hbck -repair*' is stuck because it's waiting for the region transitions to finish. Some output:
>
> *hbase(main):003:0> status 'detailed'*
> *12 regionsInTransition*
> *plr_id_insta_media_live,\x02:;6;7;398962:3:399a49:653:64,1421565172917.1528f288473632aca2636443574a6ba1. state=OPENING, ts=1424227696897, server=null*
> *plr_sg_insta_media_live,\x0098;522:997;8798665a64;67879,1410768824800.2c79bbc5c0dc2d2b39c04c8abc0a90ff. state=OFFLINE, ts=1424227714203, server=null*
> *plr_sg_insta_media_live,\x00465892:9935773828;a4459;649,1410767723471.55097cfc60bc9f50303dadb02abcd64b. state=OPENING, ts=1424227701234, server=null*
> *plr_sg_insta_media_live,\x00474973488232837733a38744,1410767723471.740d6655afb74a2ff421c6ef16037f57. state=OPENING, ts=1424227708053, server=null*
> *plr_id_insta_media_live,\x02::449::4;:466;3988a6432677;3,1419435100617.7caf3d749dce37037eec9ccc29d272a1. state=OPENING, ts=1424227701484, server=null*
> *plr_sg_insta_media_live,\x05779793546323;::4:4a3:8227928,1418845792479.81c4da129ae5b7b204d5373d9e0fea3d. state=OPENING, ts=1424227705353, server=null*
> *plr_sg_insta_media_live,\x009;5:686348963:33:5a5634887,1410769837567.8a9ded24960a7787ca016e2073b24151. state=OPENING, ts=1424227706293, server=null*
> *plr_sg_insta_media_live,\x0375;6;7377578;84226a7663792,1418980694076.a1e1c98f646ee899010f19a9c693c67c. state=OPENING, ts=1424227680569, server=null*
> *plr_sg_insta_media_live,\x018;3826368274679364a3;;73457;,1421425643816.b04ffda1b2024bac09c9e6246fb7b183. state=OPENING, ts=1424227680538, server=null*
> *plr_sg_insta_media_live,\x0154752;22:43377542:a:86:239,1410771044924.c57d6b4d23f21d3e914a91721a99ce12. state=OPENING, ts=1424227710847, server=null*
> *plr_sg_insta_media_live,\x0069;7;9384697:;8685a885485:,1410767928822.c7b5e53cdd9e1007117bcaa199b30d1c. state=OPENING, ts=1424227700962, server=null*
> *plr_sg_insta_media_live,\x04994537646:78233569a3467:987;7,1410787903804.cd49ec64a0a417aa11949c2bc2d3df6e. state=OPENING, ts=1424227691774, server=null*
>
> *Issue #2*. The next step we took was to check the HDFS file status using '*hdfs fsck /*'. It shows that the filesystem '/' is corrupt, with these statistics:
>
> * Total size: 15494284950796 B (Total open files size: 17179869184 B)*
> * Total dirs: 9198*
> * Total files: 124685 (Files currently being written: 21)*
> * Total blocks (validated): 219620 (avg. block size 70550427 B) (Total open file blocks (not validated): 144)*
> *  CORRUPT FILES: 42*
> *  MISSING BLOCKS: 142*
> *  MISSING SIZE: 14899184084 B*
> *  CORRUPT BLOCKS: 142*
> * Corrupt blocks: 142*
> * Number of data-nodes: 14*
> * Number of racks: 1*
> *FSCK ended at Tue Feb 17 17:25:18 SGT 2015 in 3026 milliseconds*
>
> *The filesystem under path '/' is CORRUPT*
>
> So it seems that HDFS lost some of its blocks due to DN failures, and since the dfs.replication factor is 1, it could not recover the missing blocks.
>
> *Issue #3*. Although '*hbase hbck -repair*' is stuck, we are able to run '*hbase hbck -fixHdfsHoles*'. We notice the following error messages (I copied some of them to represent each type of error message that we have):
>
> - *ERROR: Region { meta => plr_id_insta_media_live,\x02:;6;7;398962:3:399a49:653:64,1421565172917.1528f288473632aca2636443574a6ba1., hdfs => hdfs://nameservice1/hbase/plr_id_insta_media_live/1528f288473632aca2636443574a6ba1, deployed => } not deployed on any region server.*
> - *ERROR: Region { meta => null, hdfs => hdfs://nameservice1/hbase/plr_sg_insta_media_live/8473d25be5980c169bff13cf90229939, deployed => } on HDFS, but not listed in META or deployed on any region server*
> - *ERROR: Region { meta => plr_sg_insta_media_live,\x0293:729769;975376;2a33995622;3,1421985489851.8819ebd296f075513056be4bbd30ee9c., hdfs => null, deployed => } found
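The fsck summary in the original post can also be parsed mechanically to track whether the damage is shrinking across recovery attempts. A minimal sketch over the summary quoted above (reproduced here as a sample string; on a live cluster you would pipe in fresh `hdfs fsck /` output):

```shell
#!/bin/sh
# Sample lines from the fsck summary in the post above.
summary=' CORRUPT FILES:        42
 MISSING BLOCKS:       142
 CORRUPT BLOCKS:       142'

# Pull out the corrupt-block count (field after the colon, spaces stripped).
corrupt_blocks=$(echo "$summary" | awk -F':' '/CORRUPT BLOCKS/ { gsub(/ /, "", $2); print $2 }')
echo "$corrupt_blocks"

# Once the cluster is stable, raising replication is one command
# (live-cluster command, shown for reference only):
#   hdfs dfs -setrep -w 3 /hbase
```

If the count does not drop after re-enabling the spare DN, the remaining blocks existed only on lost storage and the files backing them will have to be removed, as JM suggests.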