Re: HBase Region always in transition + corrupt HDFS

2015-02-24 Thread Arinto Murdopo
Hi all,

While digging through the /hbase folder in HDFS, I found a WAL folder that is
still marked as splitting and whose last modified time is 2 weeks ago (I
changed the node name in this example):
*drwxr-xr-x   - hbase hbase  0 2015-02-11 18:16
/hbase/.logs/,60020,1417575124373-splitting*

Is this common? After reading about log splitting, my understanding is that
this folder should disappear after a successful recovery. CMIIW.
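
For reference, a quick way to check how stale such a directory is and whether
it still holds any WAL files (plain HDFS CLI; the path below is just the
redacted example from above):

  # list any leftover -splitting WAL directories under the HBase root
  hdfs dfs -ls /hbase/.logs | grep splitting

  # check whether the example directory still contains WAL files, and their size
  hdfs dfs -ls /hbase/.logs/,60020,1417575124373-splitting
  hdfs dfs -du -s /hbase/.logs/,60020,1417575124373-splitting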

Best regards,

Arinto


Arinto
www.otnira.com

On Tue, Feb 24, 2015 at 11:45 AM, Arinto Murdopo  wrote:

>
> On Tue, Feb 24, 2015 at 9:46 AM, Jean-Marc Spaggiari <
> jean-m...@spaggiari.org> wrote:
>
> > I don't have the list of the corrupted files yet. I notice that when I
>> try
>> > to Get some of the files, my HBase client code throws these exceptions:
>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
>> > attempts=2, exceptions:
>> > Mon Feb 23 17:49:32 SGT 2015,
>> > org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
>> > org.apache.hadoop.hbase.NotServingRegionException:
>> > org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
>> >
>> >
>> plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.
>> >
>>
>> FSCK should give you the list of corrupt files. Can you extract it from
>> there?
>>
>
> Yup, I managed to extract them. We have corrupt files as well as missing
> files. Luckily there's no .regioninfo file corrupted or missing. I'll read
> more about HFiles before updating this thread further. :)
>
>
>> >
>> > Can I use these exceptions to determine the corrupted files?
>> > The files are media data (images or videos) obtained from the internet.
>> >
>>
>> This exception gives you all the hints for a directory most probably under
>> /hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e
>>
>> Files under this directory might be corrupted but you need to find which
>> files. If it's an HFile, it's easy. If it's the .regioninfo, it's a bit
>> more tricky.
>
>
>
>
> Arinto
> www.otnira.com
>


Re: HBase Region always in transition + corrupt HDFS

2015-02-23 Thread Arinto Murdopo
On Tue, Feb 24, 2015 at 9:46 AM, Jean-Marc Spaggiari <
jean-m...@spaggiari.org> wrote:

> I don't have the list of the corrupted files yet. I notice that when I try
> > to Get some of the files, my HBase client code throws these exceptions:
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> > attempts=2, exceptions:
> > Mon Feb 23 17:49:32 SGT 2015,
> > org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
> > org.apache.hadoop.hbase.NotServingRegionException:
> > org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
> >
> >
> plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.
> >
>
> FSCK should give you the list of corrupt files. Can you extract it from
> there?
>

Yup, I managed to extract them. We have corrupt files as well as missing
files. Luckily there's no .regioninfo file corrupted or missing. I'll read
more about HFiles before updating this thread further. :)


> >
> > Can I use these exceptions to determine the corrupted files?
> > The files are media data (images or videos) obtained from the internet.
> >
>
> This exception gives you all the hints for a directory most probably under
> /hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e
>
> Files under this directory might be corrupted but you need to find which
> files. If it's an HFile, it's easy. If it's the .regioninfo, it's a bit
> more tricky.




Arinto
www.otnira.com


Re: HBase Region always in transition + corrupt HDFS

2015-02-23 Thread Ted Yu
Arinto:
Probably you should take a look at HBASE-12949.

Cheers

On Mon, Feb 23, 2015 at 5:25 PM, Arinto Murdopo  wrote:

> @JM:
> You mentioned deleting "the files"; are you referring to HDFS files or
> HBase files?
>
> Our cluster has 15 nodes. We use 14 of them as DNs. Actually we tried to
> enable the remaining one as a DN (so that we would have 15 DNs), but then
> we disabled it (so now we have 14 again). Probably our crawlers wrote some
> data onto that additional DN without any replication. Maybe I could try to
> enable that DN again.
>
> I don't have the list of the corrupted files yet. I notice that when I try
> to Get some of the files, my HBase client code throws these exceptions:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=2, exceptions:
> Mon Feb 23 17:49:32 SGT 2015,
> org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
>
> plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.
>
> Can I use these exceptions to determine the corrupted files?
> The files are media data (images or videos) obtained from the internet.
>
> @Michael Segel: Yup, 3 is the default and recommended value. We were
> overwhelmed with the amount of data, so we foolishly reduced our
> replication factor. We have learnt the lesson the hard way :).
>
> Fortunately it's okay to lose the data, i.e. we can easily recover it
> from our other data.
>
>
>
> Arinto
> www.otnira.com
>
> On Tue, Feb 24, 2015 at 8:06 AM, Michael Segel  wrote:
>
> > I’m sorry, but I implied checking the checksums of the blocks.
> > Didn’t think I needed to spell it out.  Next time I’ll be a bit more
> > precise.
> >
> > > On Feb 23, 2015, at 2:34 PM, Nick Dimiduk  wrote:
> > >
> > > HBase/HDFS are maintaining block checksums, so presumably a corrupted
> > block
> > > would fail checksum validation. Increasing the number of replicas
> > increases
> > > the odds that you'll still have a valid block. I'm not an HDFS expert,
> > but
> > > I would be very surprised if HDFS is validating a "questionable block"
> > via
> > > byte-wise comparison over the network amongst the replica peers.
> > >
> > > On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel 
> > wrote:
> > >
> > >>
> > >> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo  wrote:
> > >>
> > >> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
> > >> 2.0.0-cdh4.6.0).
> > >> For all of our tables, we set the replication factor to 1
> > (dfs.replication
> > >> = 1 in hbase-site.xml). We set to 1 because we want to minimize the
> HDFS
> > >> usage (now we realize we should set this value to at least 2, because
> > >> "failure is a norm" in distributed systems).
> > >>
> > >>
> > >>
> > >> Sorry, but you really want this to be a replication value of at least
> 3
> > >> and not 2.
> > >>
> > >> Suppose you have corruption but not a lost block. Which copy of the
> two
> > is
> > >> right?
> > >> With 3, you can compare the three and hopefully 2 of the 3 will match.
> > >>
> > >>
> >
> >
>


Re: HBase Region always in transition + corrupt HDFS

2015-02-23 Thread Jean-Marc Spaggiari
2015-02-23 20:25 GMT-05:00 Arinto Murdopo :

> @JM:
> You mentioned deleting "the files"; are you referring to HDFS files or
> HBase files?
>

Your HBase files are stored in HDFS, so I think we are referring to the same
thing. Look into /hbase in your HDFS to find the HBase files.



>
> Our cluster has 15 nodes. We use 14 of them as DNs. Actually we tried to
> enable the remaining one as a DN (so that we would have 15 DNs), but then
> we disabled it (so now we have 14 again). Probably our crawlers wrote some
> data onto that additional DN without any replication. Maybe I could try to
> enable that DN again.
>

That's a very valid option. If you still have the DN's data directories, just
enable it again and see if you can recover the blocks...
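
If it helps, one way to verify that afterwards with the stock HDFS tooling
(just a sketch):

  # confirm the re-enabled DataNode registers as a live node again
  hdfs dfsadmin -report

  # re-run fsck and check whether the missing/corrupt block counts dropped
  hdfs fsck / | tail -n 20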



> I don't have the list of the corrupted files yet. I notice that when I try
> to Get some of the files, my HBase client code throws these exceptions:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=2, exceptions:
> Mon Feb 23 17:49:32 SGT 2015,
> org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
>
> plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.
>

FSCK should give you the list of corrupt files. Can you extract it from
there?
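
For example (a rough sketch; exact fsck flags can vary a bit between Hadoop
versions):

  # list only the files that currently have corrupt blocks
  hdfs fsck / -list-corruptfileblocks

  # or dump per-file block details for the HBase root and grep the bad ones
  hdfs fsck /hbase -files -blocks | grep -i corrupt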



>
> Can I use these exceptions to determine the corrupted files?
> The files are media data (images or videos) obtained from the internet.
>

This exception gives you all the hints for a directory most probably under
/hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e

Files under this directory might be corrupted but you need to find which
files. If it's an HFile, it's easy. If it's the .regioninfo, it's a bit more
tricky.
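
A rough sketch of how that narrowing down could look. The region path is the
one from the exception above, the HFile pretty-printer class name is the
0.94-era one (its flags may differ in other versions), and <family>/<hfile>
are placeholders:

  # see which files under the region directory have corrupt or missing blocks
  hdfs fsck /hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e -files -blocks

  # sanity-check an individual HFile (prints its metadata if it is still readable)
  hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -f \
    /hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e/<family>/<hfile>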

JM



> Arinto
> www.otnira.com
>
> On Tue, Feb 24, 2015 at 8:06 AM, Michael Segel  wrote:
>
> > I’m sorry, but I implied checking the checksums of the blocks.
> > Didn’t think I needed to spell it out.  Next time I’ll be a bit more
> > precise.
> >
> > > On Feb 23, 2015, at 2:34 PM, Nick Dimiduk  wrote:
> > >
> > > HBase/HDFS are maintaining block checksums, so presumably a corrupted
> > block
> > > would fail checksum validation. Increasing the number of replicas
> > increases
> > > the odds that you'll still have a valid block. I'm not an HDFS expert,
> > but
> > > I would be very surprised if HDFS is validating a "questionable block"
> > via
> > > byte-wise comparison over the network amongst the replica peers.
> > >
> > > On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel 
> > wrote:
> > >
> > >>
> > >> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo  wrote:
> > >>
> > >> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
> > >> 2.0.0-cdh4.6.0).
> > >> For all of our tables, we set the replication factor to 1
> > (dfs.replication
> > >> = 1 in hbase-site.xml). We set to 1 because we want to minimize the
> HDFS
> > >> usage (now we realize we should set this value to at least 2, because
> > >> "failure is a norm" in distributed systems).
> > >>
> > >>
> > >>
> > >> Sorry, but you really want this to be a replication value of at least
> 3
> > >> and not 2.
> > >>
> > >> Suppose you have corruption but not a lost block. Which copy of the
> two
> > is
> > >> right?
> > >> With 3, you can compare the three and hopefully 2 of the 3 will match.
> > >>
> > >>
> >
> >
>


Re: HBase Region always in transition + corrupt HDFS

2015-02-23 Thread Arinto Murdopo
@JM:
You mentioned deleting "the files"; are you referring to HDFS files or HBase
files?

Our cluster has 15 nodes. We use 14 of them as DNs. Actually we tried to
enable the remaining one as a DN (so that we would have 15 DNs), but then we
disabled it (so now we have 14 again). Probably our crawlers wrote some data
onto that additional DN without any replication. Maybe I could try to enable
that DN again.

I don't have the list of the corrupted files yet. I notice that when I try
to Get some of the files, my HBase client code throws these exceptions:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
attempts=2, exceptions:
Mon Feb 23 17:49:32 SGT 2015,
org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.

Can I use these exceptions to determine the corrupted files?
The files are media data (images or videos) obtained from the internet.

@Michael Segel: Yup, 3 is the default and recommended value. We were
overwhelmed with the amount of data, so we foolishly reduced our
replication factor. We have learnt the lesson the hard way :).

Fortunately it's okay to lose the data, i.e. we can easily recover it from
our other data.



Arinto
www.otnira.com

On Tue, Feb 24, 2015 at 8:06 AM, Michael Segel  wrote:

> I’m sorry, but I implied checking the checksums of the blocks.
> Didn’t think I needed to spell it out.  Next time I’ll be a bit more
> precise.
>
> > On Feb 23, 2015, at 2:34 PM, Nick Dimiduk  wrote:
> >
> > HBase/HDFS are maintaining block checksums, so presumably a corrupted
> block
> > would fail checksum validation. Increasing the number of replicas
> increases
> > the odds that you'll still have a valid block. I'm not an HDFS expert,
> but
> > I would be very surprised if HDFS is validating a "questionable block"
> via
> > byte-wise comparison over the network amongst the replica peers.
> >
> > On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel 
> wrote:
> >
> >>
> >> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo  wrote:
> >>
> >> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
> >> 2.0.0-cdh4.6.0).
> >> For all of our tables, we set the replication factor to 1
> (dfs.replication
> >> = 1 in hbase-site.xml). We set to 1 because we want to minimize the HDFS
> >> usage (now we realize we should set this value to at least 2, because
> >> "failure is a norm" in distributed systems).
> >>
> >>
> >>
> >> Sorry, but you really want this to be a replication value of at least 3
> >> and not 2.
> >>
> >> Suppose you have corruption but not a lost block. Which copy of the two
> is
> >> right?
> >> With 3, you can compare the three and hopefully 2 of the 3 will match.
> >>
> >>
>
>


Re: HBase Region always in transition + corrupt HDFS

2015-02-23 Thread Michael Segel
I’m sorry, but I implied checking the checksums of the blocks. 
Didn’t think I needed to spell it out.  Next time I’ll be a bit more precise. 

> On Feb 23, 2015, at 2:34 PM, Nick Dimiduk  wrote:
> 
> HBase/HDFS are maintaining block checksums, so presumably a corrupted block
> would fail checksum validation. Increasing the number of replicas increases
> the odds that you'll still have a valid block. I'm not an HDFS expert, but
> I would be very surprised if HDFS is validating a "questionable block" via
> byte-wise comparison over the network amongst the replica peers.
> 
> On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel  wrote:
> 
>> 
>> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo  wrote:
>> 
>> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
>> 2.0.0-cdh4.6.0).
>> For all of our tables, we set the replication factor to 1 (dfs.replication
>> = 1 in hbase-site.xml). We set to 1 because we want to minimize the HDFS
>> usage (now we realize we should set this value to at least 2, because
>> "failure is a norm" in distributed systems).
>> 
>> 
>> 
>> Sorry, but you really want this to be a replication value of at least 3
>> and not 2.
>> 
>> Suppose you have corruption but not a lost block. Which copy of the two is
>> right?
>> With 3, you can compare the three and hopefully 2 of the 3 will match.
>> 
>> 





Re: HBase Region always in transition + corrupt HDFS

2015-02-23 Thread Nick Dimiduk
HBase/HDFS are maintaining block checksums, so presumably a corrupted block
would fail checksum validation. Increasing the number of replicas increases
the odds that you'll still have a valid block. I'm not an HDFS expert, but
I would be very surprised if HDFS is validating a "questionable block" via
byte-wise comparison over the network amongst the replica peers.
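
A couple of rough examples of how that surfaces with the standard tools
(<region-dir>, <family> and <hfile> are placeholders):

  # reading a file through the HDFS client verifies checksums; a corrupt block
  # shows up as a ChecksumException/BlockMissingException rather than silent bad data
  hdfs dfs -cat /hbase/plr_sg_insta_media_live/<region-dir>/<family>/<hfile> > /dev/null

  # fsck reports blocks that DataNodes have flagged as corrupt
  hdfs fsck /hbase -files -blocks | grep -i corrupt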

On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel  wrote:

>
> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo  wrote:
>
> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
> 2.0.0-cdh4.6.0).
> For all of our tables, we set the replication factor to 1 (dfs.replication
> = 1 in hbase-site.xml). We set to 1 because we want to minimize the HDFS
> usage (now we realize we should set this value to at least 2, because
> "failure is a norm" in distributed systems).
>
>
>
> Sorry, but you really want this to be a replication value of at least 3
> and not 2.
>
> Suppose you have corruption but not a lost block. Which copy of the two is
> right?
> With 3, you can compare the three and hopefully 2 of the 3 will match.
>
>


Re: HBase Region always in transition + corrupt HDFS

2015-02-23 Thread Michael Segel

> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo  wrote:
> 
> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
> 2.0.0-cdh4.6.0).
> For all of our tables, we set the replication factor to 1 (dfs.replication
> = 1 in hbase-site.xml). We set to 1 because we want to minimize the HDFS
> usage (now we realize we should set this value to at least 2, because
> "failure is a norm" in distributed systems).


Sorry, but you really want this to be a replication value of at least 3 and not 
2. 

Suppose you have corruption but not a lost block. Which copy of the two is 
right?
With 3, you can compare the three and hopefully 2 of the 3 will match. 
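
For what it's worth, replication is a per-file attribute, so changing
dfs.replication only affects newly written files; raising it on data that
already exists is a separate step. A sketch with the standard HDFS shell (it
can take a while on a big /hbase tree):

  # raise the replication factor of everything under the HBase root to 3
  hdfs dfs -setrep -R 3 /hbase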





Re: HBase Region always in transition + corrupt HDFS

2015-02-23 Thread Jean-Marc Spaggiari
You have no other choice than removing those files... you will lose the
related data, but it should be fine if they are only HFiles. Do you have the
list of corrupted files? What kind of files are they?
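
If it does come to removing them, fsck itself can do it (a sketch; -move
quarantines the affected files into /lost+found, which is the safer first
step, while -delete removes them outright):

  # quarantine files that have corrupt or missing blocks
  hdfs fsck / -move

  # or, once you are sure you can live without them, delete them
  hdfs fsck / -delete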

Also, have you lost a node or a disk? How have you lost about 150 blocks?

JM

2015-02-23 2:47 GMT-05:00 Arinto Murdopo :

> Hi all,
>
> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
> 2.0.0-cdh4.6.0).
> For all of our tables, we set the replication factor to 1 (dfs.replication
> = 1 in hbase-site.xml). We set to 1 because we want to minimize the HDFS
> usage (now we realize we should set this value to at least 2, because
> "failure is a norm" in distributed systems).
>
> Due to the amount of data, at some point, we have low disk space in HDFS
> and one of our DNs was down. Now we have these problems in HBase and HDFS
> although we have recovered our DN.
>
> *Issue#1*. Some of HBase region always in transition. '*hbase hbck
> -repair*'
> is stuck because it's waiting for region transition to finish. Some output
>
> *hbase(main):003:0> status 'detailed'*
> *12 regionsInTransition*
> *
>
> plr_id_insta_media_live,\x02:;6;7;398962:3:399a49:653:64,1421565172917.1528f288473632aca2636443574a6ba1.
> state=OPENING, ts=1424227696897, server=null*
> *
>
> plr_sg_insta_media_live,\x0098;522:997;8798665a64;67879,1410768824800.2c79bbc5c0dc2d2b39c04c8abc0a90ff.
> state=OFFLINE, ts=1424227714203, server=null*
> *
>
> plr_sg_insta_media_live,\x00465892:9935773828;a4459;649,1410767723471.55097cfc60bc9f50303dadb02abcd64b.
> state=OPENING, ts=1424227701234, server=null*
> *
>
> plr_sg_insta_media_live,\x00474973488232837733a38744,1410767723471.740d6655afb74a2ff421c6ef16037f57.
> state=OPENING, ts=1424227708053, server=null*
> *
>
> plr_id_insta_media_live,\x02::449::4;:466;3988a6432677;3,1419435100617.7caf3d749dce37037eec9ccc29d272a1.
> state=OPENING, ts=1424227701484, server=null*
> *
>
> plr_sg_insta_media_live,\x05779793546323;::4:4a3:8227928,1418845792479.81c4da129ae5b7b204d5373d9e0fea3d.
> state=OPENING, ts=1424227705353, server=null*
> *
>
> plr_sg_insta_media_live,\x009;5:686348963:33:5a5634887,1410769837567.8a9ded24960a7787ca016e2073b24151.
> state=OPENING, ts=1424227706293, server=null*
> *
>
> plr_sg_insta_media_live,\x0375;6;7377578;84226a7663792,1418980694076.a1e1c98f646ee899010f19a9c693c67c.
> state=OPENING, ts=1424227680569, server=null*
> *
>
> plr_sg_insta_media_live,\x018;3826368274679364a3;;73457;,1421425643816.b04ffda1b2024bac09c9e6246fb7b183.
> state=OPENING, ts=1424227680538, server=null*
> *
>
> plr_sg_insta_media_live,\x0154752;22:43377542:a:86:239,1410771044924.c57d6b4d23f21d3e914a91721a99ce12.
> state=OPENING, ts=1424227710847, server=null*
> *
>
> plr_sg_insta_media_live,\x0069;7;9384697:;8685a885485:,1410767928822.c7b5e53cdd9e1007117bcaa199b30d1c.
> state=OPENING, ts=1424227700962, server=null*
> *
>
> plr_sg_insta_media_live,\x04994537646:78233569a3467:987;7,1410787903804.cd49ec64a0a417aa11949c2bc2d3df6e.
> state=OPENING, ts=1424227691774, server=null*
>
>
> *Issue#2*. The next step that we do is to check HDFS file status using
> '*hdfs
> fsck /*'. It shows that the filesystem '/' is corrupted with these
> statistics
> * Total size:15494284950796 B (Total open files size: 17179869184 B)*
> * Total dirs:9198*
> * Total files:   124685 (Files currently being written: 21)*
> * Total blocks (validated):  219620 (avg. block size 70550427 B) (Total
> open file blocks (not validated): 144)*
> *  *
> *  CORRUPT FILES:42*
> *  MISSING BLOCKS:   142*
> *  MISSING SIZE: 14899184084 B*
> *  CORRUPT BLOCKS:   142*
> *  *
> * Corrupt blocks:142*
> * Number of data-nodes:  14*
> * Number of racks:   1*
> *FSCK ended at Tue Feb 17 17:25:18 SGT 2015 in 3026 milliseconds*
>
>
> *The filesystem under path '/' is CORRUPT*
>
> So it seems that HDFS loses some of its block due to DN failures and since
> the dfs.replication factor is 1, it could not recover the missing blocks.
>
> *Issue#3*. Although '*hbase hbck -repair*' is stuck, we are able to run
> '*hbase
> hbck -fixHdfsHoles*'. We notice this following error messages (I copied
> some of them to represent each type of error messages that we have).
> - *ERROR: Region { meta =>
>
> plr_id_insta_media_live,\x02:;6;7;398962:3:399a49:653:64,1421565172917.1528f288473632aca2636443574a6ba1.,
> hdfs => hdfs://nameservice1/hbase/plr_id_insta_media_live/1528f2884*
> *73632aca2636443574a6ba1, deployed =>  } not deployed on any region
> server.*
> - *ERROR: Region { meta => null, hdfs =>
>
> hdfs://nameservice1/hbase/plr_sg_insta_media_live/8473d25be5980c169bff13cf90229939,
> deployed =>  } on HDFS, but not listed in META or deployed on any region
> server*
> *- ERROR: Region { meta =>
>
> plr_sg_insta_media_live,\x0293:729769;975376;2a33995622;3,1421985489851.8819ebd296f075513056be4bbd30ee9c.,
> hdfs => null, deployed =>  } found