2015-02-23 20:25 GMT-05:00 Arinto Murdopo <ari...@gmail.com>:

> @JM:
> You mentioned deleting "the files"; are you referring to HDFS files or
> files in HBase?
Your HBase files are stored in HDFS, so I think we are referring to the
same thing. Look into /hbase in your HDFS to find the HBase files.

> Our cluster has 15 nodes. We use 14 of them as DNs. We actually tried to
> enable the remaining one as a DN (so that we would have 15 DNs), but then
> we disabled it (so now we have 14 again). Our crawlers probably wrote
> some data onto the additional DN without any replication. Maybe I could
> try to enable the DN again.

That's a very valid option. If you still have the DN directories, just
enable it again to see if you can recover the blocks...

> I don't have the list of the corrupted files yet. I notice that when I
> try to Get some of the files, my HBase client code throws these
> exceptions:
>
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=2, exceptions:
> Mon Feb 23 17:49:32 SGT 2015,
> org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
> plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.

FSCK should give you the list of corrupt files. Can you extract it from
there?

> Can I use these exceptions to determine the corrupted files?
> The files are media data (images or videos) obtained from the internet.

This exception gives you all the hints for a directory, most probably
/hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e. Files
under this directory might be corrupted, but you need to find which ones.
If it's an HFile, it's easy. If it's the .regioninfo file, it's a bit
more tricky.

JM

> Arinto
> www.otnira.com
>
> On Tue, Feb 24, 2015 at 8:06 AM, Michael Segel <mse...@segel.com> wrote:
>
> > I'm sorry, but I implied checking the checksums of the blocks.
> > Didn't think I needed to spell it out. Next time I'll be a bit more
> > precise.
> >
> > > On Feb 23, 2015, at 2:34 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> > >
> > > HBase/HDFS are maintaining block checksums, so presumably a corrupted
> > > block would fail checksum validation. Increasing the number of
> > > replicas increases the odds that you'll still have a valid block. I'm
> > > not an HDFS expert, but I would be very surprised if HDFS validated a
> > > "questionable block" via byte-wise comparison over the network
> > > amongst the replica peers.
> > >
> > > On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel <mse...@segel.com>
> > > wrote:
> > >
> > >> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo <ari...@gmail.com> wrote:
> > >>
> > >> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
> > >> 2.0.0-cdh4.6.0). For all of our tables, we set the replication factor
> > >> to 1 (dfs.replication = 1 in hbase-site.xml). We set it to 1 because
> > >> we want to minimize HDFS usage (now we realize we should set this
> > >> value to at least 2, because "failure is a norm" in distributed
> > >> systems).
> > >>
> > >> Sorry, but you really want the replication value to be at least 3,
> > >> not 2.
> > >>
> > >> Suppose you have corruption but not a lost block. Which copy of the
> > >> two is right? With 3, you can compare the three and hopefully 2 of
> > >> the 3 will match.
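As a concrete starting point for JM's FSCK suggestion, the commands below
should pull the corrupt-file list on a Hadoop 2.0.0-cdh4 cluster; the
/hbase paths assume the default HBase root directory and the region named
in the exception above.

    # List every file that has a corrupt or missing block
    hdfs fsck / -list-corruptfileblocks

    # Inspect the suspect region directory in detail: per-file blocks
    # and the DataNodes holding each replica
    hdfs fsck /hbase/plr_sg_insta_media_live/6c323832d2dc77c586f1cf6441c7ef6e \
        -files -blocks -locations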
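Since the RetriesExhaustedException message already carries the encoded
region name, another way to build the list Arinto is after is to probe
suspect row keys and collect the regions behind the failures. This is a
minimal sketch against the 0.94 client API; the table name is taken from
the exception above, and the row keys passed on the command line are
hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.RetriesExhaustedException;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ProbeRows {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "plr_sg_insta_media_live");
            try {
                for (String rowKey : args) {
                    try {
                        Result r = table.get(new Get(Bytes.toBytes(rowKey)));
                        System.out.println(rowKey + " -> "
                                + (r.isEmpty() ? "no data" : "OK"));
                    } catch (RetriesExhaustedException e) {
                        // The message embeds the NotServingRegionException,
                        // including the encoded region name that maps to a
                        // directory under /hbase/<table>/<encoded-region>
                        System.err.println(rowKey + " -> FAILED: "
                                + e.getMessage());
                    }
                }
            } finally {
                table.close();
            }
        }
    }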
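On the replication side, note that raising dfs.replication only affects
files HBase writes from then on; anything already under /hbase keeps its
old factor. A sketch of both steps, assuming the factor of 3 Michael
recommends:

    <!-- hbase-site.xml: store files created from now on get 3 replicas -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

Existing files can then be re-replicated from the shell (-w waits until
the target factor is reached):

    hdfs dfs -setrep -R -w 3 /hbase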