Thanks Bertrand, I checked this information earlier. There's only an XOR implementation, and missing blocks are reconstructed by creating new files.
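For anyone skimming the thread later, here is a minimal sketch (illustrative only, not Facebook's actual code) of what the XOR scheme does: the parity block is the XOR of all data blocks in a stripe, so XOR-ing the parity with the surviving blocks yields the lost one.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Three data blocks in a stripe, plus one XOR parity block.
data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_blocks(data)

# Lose block 1; reconstruct it from the parity and the survivors.
survivors = [data[0], data[2]]
recovered = xor_blocks(survivors + [parity])
assert recovered == data[1]
```

This also shows the limitation: XOR parity can only recover a single lost block per stripe, which is part of why the discussion below moves to Reed-Solomon.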
2014-07-22 3:47 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:

> And there is actually quite a lot of information about it.
>
> https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java
> http://wiki.apache.org/hadoop/HDFS-RAID
> https://issues.apache.org/jira/browse/MAPREDUCE/component/12313416/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
>
> Bertrand Dechoux
>
> On Mon, Jul 21, 2014 at 3:46 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>
>> I wrote my answer thinking about the XOR implementation. With Reed-Solomon and single replication, the cases that need to be considered are indeed smaller and simpler.
>>
>> It seems I was wrong about my last statement, though. If the machine hosting a single-replicated block is lost, it isn't likely that the block can be reconstructed from a summary of the data. But with a RAID6 strategy / RS, it is indeed possible, though of course only up to a certain number of blocks.
>>
>> There clearly is a case for such a tool on a Hadoop cluster.
>>
>> Bertrand Dechoux
>>
>> On Mon, Jul 21, 2014 at 2:35 PM, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>
>>> If a block is corrupted but hasn't been detected by HDFS, you could delete the block from the local filesystem (it's only a file), then HDFS will replicate the good remaining replica of this block.
>>>
>>> We only have one replica for each block, so if a block is corrupted, HDFS cannot re-replicate it.
>>>
>>> 2014-07-21 20:30 GMT+08:00 Zesheng Wu <wuzeshen...@gmail.com>:
>>>
>>>> Thanks Bertrand, my reply comments are inline below.
>>>>
>>>> So you know that a block is corrupted thanks to an external process, which in this case is checking the parity blocks.
>>>> If a block is corrupted but hasn't been detected by HDFS, you could delete the block from the local filesystem (it's only a file), then HDFS will replicate the good remaining replica of this block.
>>>> *[Zesheng: We will implement a periodic checking mechanism to detect corrupted blocks.]*
>>>>
>>>> For performance reasons (and that's what you want to do?), you might be able to fix the corruption without needing to retrieve the good replica. It might be possible by working directly with the local filesystem, replacing the corrupted block with the corrected block (which again are files). One issue is that the corrected block might be different from the good replica. If HDFS is able to tell (with CRC) it might be fine; otherwise you will end up with two different good replicas for the same block, and that will not be pretty...
>>>> *[Zesheng: Indeed we will use Reed-Solomon erasure codes to implement the HDFS RAID, so a corrupted block will be recovered from the remaining good data and coding blocks.]*
>>>>
>>>> If the result is to be open source, you might want to check with Facebook about their implementation and track the process within Apache JIRA. You could gain additional feedback. One downside of HDFS RAID is that the fewer replicas there are, the less efficient/fast reads of the data for processing will be. Reducing the number of replicas also diminishes the number of supported node failures. I wouldn't say it's an easy ride.
>>>> *[Zesheng: Yes, I agree with you about the read-performance downgrade, but not about the number of supported node failures; the Reed-Solomon algorithm can maintain equal or even higher tolerance of node failures.
>>>> About open source, we will consider this in the future.]*
>>>>
>>>> 2014-07-21 20:01 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>>>>
>>>>> So you know that a block is corrupted thanks to an external process, which in this case is checking the parity blocks. If a block is corrupted but hasn't been detected by HDFS, you could delete the block from the local filesystem (it's only a file), then HDFS will replicate the good remaining replica of this block.
>>>>>
>>>>> For performance reasons (and that's what you want to do?), you might be able to fix the corruption without needing to retrieve the good replica. It might be possible by working directly with the local filesystem, replacing the corrupted block with the corrected block (which again are files). One issue is that the corrected block might be different from the good replica. If HDFS is able to tell (with CRC) it might be fine; otherwise you will end up with two different good replicas for the same block, and that will not be pretty...
>>>>>
>>>>> If the result is to be open source, you might want to check with Facebook about their implementation and track the process within Apache JIRA. You could gain additional feedback. One downside of HDFS RAID is that the fewer replicas there are, the less efficient/fast reads of the data for processing will be. Reducing the number of replicas also diminishes the number of supported node failures. I wouldn't say it's an easy ride.
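To make the replication-vs-RS trade-off being debated here concrete, some back-of-the-envelope arithmetic. The RS(10, 4) parameters below are those used by Facebook's HDFS-RAID; adjust for your own layout:

```python
# With RS(k, m) -- k data blocks plus m parity blocks per stripe --
# any m lost blocks in a stripe can be recovered, while the storage
# overhead stays well below 3x replication.

def storage_overhead(k, m):
    """Total blocks stored per data block."""
    return (k + m) / k

def tolerated_failures_replication(replicas):
    # n-way replication survives the loss of n - 1 copies of a block.
    return replicas - 1

def tolerated_failures_rs(m):
    # RS with m parity blocks survives any m lost blocks per stripe.
    return m

print(storage_overhead(10, 4))            # 1.4x, vs 3.0x for triplication
print(tolerated_failures_replication(3))  # 2
print(tolerated_failures_rs(4))           # 4
```

This is the sense in which RS can tolerate equal or more failures than 3x replication, at less than half the storage cost; the caveat is that the m losses must be spread across stripes, and reconstruction reads k surviving blocks instead of one.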
>>>>> Bertrand Dechoux
>>>>>
>>>>> On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>>>
>>>>>> We want to implement a RAID on top of HDFS, something like what Facebook implemented, as described in:
>>>>>> https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/
>>>>>>
>>>>>> 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>>>>>>
>>>>>>> Do you want to implement a RAID on top of HDFS, or use HDFS on top of RAID? I am not sure I understand either of these use cases. HDFS handles replication and error detection for you. Wouldn't fine-tuning the cluster be the easier solution?
>>>>>>>
>>>>>>> Bertrand Dechoux
>>>>>>>
>>>>>>> On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for the reply, Arpit.
>>>>>>>> Yes, we need to do this regularly. The original requirement is that we want to do RAID (which is based on Reed-Solomon erasure codes) on our HDFS cluster. When a block is corrupted or missing, a degraded read needs quick recovery of the block. We are considering how to recover the corrupted/missing block quickly.
>>>>>>>>
>>>>>>>> 2014-07-19 5:18 GMT+08:00 Arpit Agarwal <aagar...@hortonworks.com>:
>>>>>>>>
>>>>>>>>> IMHO this is a spectacularly bad idea. Is it a one-off event? Why not just take the perf hit and recreate the file?
>>>>>>>>>
>>>>>>>>> If you need to do this regularly you should consider a mutable file store like HBase. If you start modifying blocks from under HDFS you open up all sorts of consistency issues.
>>>>>>>>>
>>>>>>>>> On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo <gsmst...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> That will break the consistency of the file system, but it doesn't hurt to try.
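The "it's only a file" observation in the thread can be made concrete. The helper below is a hypothetical sketch, and the on-disk layout it assumes (a `blk_<id>` file plus a `blk_<id>_<genStamp>.meta` checksum file somewhere under `dfs.datanode.data.dir`) matches Hadoop 2.x; verify against your own version before touching anything:

```python
# Locate a block replica's files on a DataNode's local disk.
# Deleting both files (with replication > 1) lets the NameNode
# re-replicate the block from a remaining good replica.
import fnmatch
import os
import tempfile

def find_replica_files(data_dir, block_id):
    """Return paths of the block file and its .meta file, if present."""
    matches = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if fnmatch.fnmatch(name, "blk_%d" % block_id) or \
               fnmatch.fnmatch(name, "blk_%d_*.meta" % block_id):
                matches.append(os.path.join(root, name))
    return sorted(matches)

# Demo against a fake DataNode directory tree.
root = tempfile.mkdtemp()
d = os.path.join(root, "current", "BP-1-127.0.0.1-1", "current",
                 "finalized", "subdir0")
os.makedirs(d)
open(os.path.join(d, "blk_1073741825"), "wb").close()
open(os.path.join(d, "blk_1073741825_1001.meta"), "wb").close()

print(find_replica_files(root, 1073741825))
```

The block id itself can be found with `hdfs fsck <path> -files -blocks -locations`.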
>>>>>>>>>> On Jul 17, 2014 8:48 PM, "Zesheng Wu" <wuzeshen...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> How about writing a new block with a new checksum file, and replacing both the old block file and the old checksum file?
>>>>>>>>>>>
>>>>>>>>>>> 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil <wellington.chevre...@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> there's no way to do that, as HDFS does not provide file update features. You'll need to write a new file with the changes.
>>>>>>>>>>>>
>>>>>>>>>>>> Notice that even if you manage to find the physical block replica files on disk, corresponding to the part of the file you want to change, you can't simply update them manually, as this would give a different checksum, making HDFS mark such blocks as corrupt.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Wellington.
>>>>>>>>>>>>
>>>>>>>>>>>> On 17 Jul 2014, at 10:50, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> > Hi guys,
>>>>>>>>>>>> >
>>>>>>>>>>>> > I recently encountered a scenario which needs to replace an existing block with a newly written block.
>>>>>>>>>>>> > The most straightforward way to do this may be the following: suppose the original file is A, and we write a new file B which is composed of the new data blocks; then we merge A and B into C, which is the file we want.
>>>>>>>>>>>> > The obvious shortcoming of this method is the waste of network bandwidth.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I'm wondering whether there is a way to replace the old block with the new block directly.
>>>>>>>>>>>> > Any thoughts?
>>>>>>>>>>>> >
>>>>>>>>>>>> > --
>>>>>>>>>>>> > Best Wishes!
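Wellington's checksum point, sketched below. HDFS stores a CRC per fixed-size chunk of block data (512 bytes by default, `dfs.bytes-per-checksum`) in the block's .meta file. This is a simplified model (real .meta files also carry a header and typically use CRC32C rather than zlib's CRC32), but it shows why an in-place edit gets the block flagged as corrupt, and why the "replace both files" proposal above must swap the block file and the checksum file together:

```python
import zlib

BYTES_PER_CHECKSUM = 512  # HDFS default chunk size for checksumming

def chunk_crcs(data):
    """Per-chunk CRC32s, as a stand-in for a .meta file's payload."""
    return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM]) & 0xFFFFFFFF
            for i in range(0, len(data), BYTES_PER_CHECKSUM)]

block = bytes(1024)        # a 1 KiB block of zeros
meta = chunk_crcs(block)   # the checksums stored alongside it

# An in-place edit without updating the .meta file: the first
# chunk's CRC no longer matches, so a reader (or the block
# scanner) reports the block corrupt.
edited = b"X" + block[1:]
assert chunk_crcs(edited)[0] != meta[0]

# Replacing the block file AND the checksum file together keeps
# the pair consistent.
new_meta = chunk_crcs(edited)
assert chunk_crcs(edited) == new_meta
```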
>>>>>>>>>>>> > Yours, Zesheng
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best Wishes!
>>>>>>>>>>>
>>>>>>>>>>> Yours, Zesheng

--
Best Wishes!

Yours, Zesheng