Mmm, it seems that the Facebook branch https://github.com/facebook/hadoop-20/
<https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java>
has implemented Reed-Solomon codes. What I was checking earlier were the
following two issues:
https://issues.apache.org/jira/browse/HDFS-503
https://issues.apache.org/jira/browse/HDFS-600
Thanks again Bertrand, I will check through the Facebook branch to find more
information.

2014-07-22 9:31 GMT+08:00 Zesheng Wu <wuzeshen...@gmail.com>:

> Thanks Bertrand, I've checked this information earlier. There's only an XOR
> implementation, and missing blocks are reconstructed by creating new files.
>
> 2014-07-22 3:47 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>
>> And there is actually quite a lot of information about it.
>>
>> https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java
>> http://wiki.apache.org/hadoop/HDFS-RAID
>> https://issues.apache.org/jira/browse/MAPREDUCE/component/12313416/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
>>
>> Bertrand Dechoux
>>
>> On Mon, Jul 21, 2014 at 3:46 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>
>>> I wrote my answer thinking about the XOR implementation. With
>>> Reed-Solomon and single replication, the cases that need to be
>>> considered are indeed smaller and simpler.
>>>
>>> It seems I was wrong about my last statement though. If the machine
>>> hosting a single-replicated block is lost, it is unlikely that the
>>> block can be reconstructed from a summary of the data. But with a
>>> RAID6 strategy / RS, it is indeed possible, though of course only up
>>> to a certain number of lost blocks.
>>>
>>> There clearly is a case for such a tool on a Hadoop cluster.
>>>
>>> Bertrand Dechoux
>>>
>>> On Mon, Jul 21, 2014 at 2:35 PM, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>
>>>> If a block is corrupted but hasn't been detected by HDFS, you could
>>>> delete the block from the local filesystem (it's only a file) and
>>>> HDFS will replicate the remaining good replica of this block.
>>>>
>>>> We only have one replica for each block, so if a block is corrupted,
>>>> HDFS cannot replicate it.
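[Editorial note: Bertrand's distinction between the XOR implementation and
Reed-Solomon can be made concrete. A single XOR parity block per stripe lets
you rebuild exactly one missing block, and no more. A minimal sketch in plain
Python; this is illustrative only, not the HDFS-RAID code:]

```python
# Minimal XOR-parity sketch: one parity block per stripe of data blocks.
# With XOR, any SINGLE missing block can be rebuilt; two losses cannot.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR of two equal-length blocks.
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(blocks):
    # Parity is the XOR of all data blocks in the stripe.
    parity = blocks[0]
    for blk in blocks[1:]:
        parity = xor_blocks(parity, blk)
    return parity

def reconstruct(surviving_blocks, parity):
    # The one missing block is the XOR of the parity and all survivors.
    missing = parity
    for blk in surviving_blocks:
        missing = xor_blocks(missing, blk)
    return missing

stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(stripe)
# Lose block 1; rebuild it from the other blocks plus parity.
rebuilt = reconstruct([stripe[0], stripe[2]], parity)
assert rebuilt == stripe[1]
```

[Recovering two lost blocks from one parity block is impossible with XOR,
which is why Reed-Solomon, with multiple parity blocks computed over a Galois
field, is needed for higher failure tolerance.]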
>>>>
>>>> 2014-07-21 20:30 GMT+08:00 Zesheng Wu <wuzeshen...@gmail.com>:
>>>>
>>>>> Thanks Bertrand, my reply comments are inline below.
>>>>>
>>>>> So you know that a block is corrupted thanks to an external process
>>>>> which in this case is checking the parity blocks. If a block is
>>>>> corrupted but hasn't been detected by HDFS, you could delete the
>>>>> block from the local filesystem (it's only a file) and HDFS will
>>>>> replicate the remaining good replica of this block.
>>>>> *[Zesheng: We will implement a periodic checking mechanism to detect
>>>>> corrupted blocks.]*
>>>>>
>>>>> For performance reasons (and that's what you want to do?), you might
>>>>> be able to fix the corruption without needing to retrieve the good
>>>>> replica. It might be possible by working directly with the local
>>>>> filesystem, replacing the corrupted block with the corrected block
>>>>> (which, again, are files). One issue is that the corrected block
>>>>> might differ from the good replica. If HDFS is able to tell (with
>>>>> CRC) it might be fine; otherwise you will end up with two different
>>>>> "good" replicas for the same block, and that will not be pretty...
>>>>> *[Zesheng: Indeed, we will use Reed-Solomon erasure codes to
>>>>> implement the HDFS RAID, so a corrupted block will be recovered from
>>>>> the remaining good data and coding blocks.]*
>>>>>
>>>>> If the result is to be open source, you might want to check with
>>>>> Facebook about their implementation and track the process within the
>>>>> Apache JIRA. You could gain additional feedback. One downside of
>>>>> HDFS RAID is that the fewer replicas there are, the less
>>>>> efficient/fast reads of the data for processing will be. Reducing
>>>>> the number of replicas also diminishes the number of supported node
>>>>> failures. I wouldn't say it's an easy ride.
>>>>> *[Zesheng: Yes, I agree with you that read performance degrades,
>>>>> but not about the number of supported node failures; the
>>>>> Reed-Solomon algorithm can maintain equal or even higher tolerance
>>>>> of node failures. About open source, we will consider this in the
>>>>> future.]*
>>>>>
>>>>> 2014-07-21 20:01 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>>>>>
>>>>>> So you know that a block is corrupted thanks to an external process
>>>>>> which in this case is checking the parity blocks. If a block is
>>>>>> corrupted but hasn't been detected by HDFS, you could delete the
>>>>>> block from the local filesystem (it's only a file) and HDFS will
>>>>>> replicate the remaining good replica of this block.
>>>>>>
>>>>>> For performance reasons (and that's what you want to do?), you
>>>>>> might be able to fix the corruption without needing to retrieve the
>>>>>> good replica. It might be possible by working directly with the
>>>>>> local filesystem, replacing the corrupted block with the corrected
>>>>>> block (which, again, are files). One issue is that the corrected
>>>>>> block might differ from the good replica. If HDFS is able to tell
>>>>>> (with CRC) it might be fine; otherwise you will end up with two
>>>>>> different "good" replicas for the same block, and that will not be
>>>>>> pretty...
>>>>>>
>>>>>> If the result is to be open source, you might want to check with
>>>>>> Facebook about their implementation and track the process within
>>>>>> the Apache JIRA. You could gain additional feedback. One downside
>>>>>> of HDFS RAID is that the fewer replicas there are, the less
>>>>>> efficient/fast reads of the data for processing will be. Reducing
>>>>>> the number of replicas also diminishes the number of supported node
>>>>>> failures. I wouldn't say it's an easy ride.
>>>>>>
>>>>>> Bertrand Dechoux
>>>>>>
>>>>>> On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>>>>
>>>>>>> We want to implement a RAID on top of HDFS, something like Facebook
>>>>>>> implemented as described in:
>>>>>>> https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/
>>>>>>>
>>>>>>> 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>>>>>>>
>>>>>>>> You want to implement a RAID on top of HDFS, or use HDFS on top
>>>>>>>> of RAID? I am not sure I understand either of these use cases.
>>>>>>>> HDFS handles replication and error detection for you. Wouldn't
>>>>>>>> fine-tuning the cluster be the easier solution?
>>>>>>>>
>>>>>>>> Bertrand Dechoux
>>>>>>>>
>>>>>>>> On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the reply, Arpit.
>>>>>>>>> Yes, we need to do this regularly. The original requirement is
>>>>>>>>> that we want to do RAID (based on Reed-Solomon erasure codes) on
>>>>>>>>> our HDFS cluster. When a block is corrupted or missing, a
>>>>>>>>> degraded read needs quick recovery of the block. We are
>>>>>>>>> considering how to recover the corrupted/missing block quickly.
>>>>>>>>>
>>>>>>>>> 2014-07-19 5:18 GMT+08:00 Arpit Agarwal <aagar...@hortonworks.com>:
>>>>>>>>>
>>>>>>>>>> IMHO this is a spectacularly bad idea. Is it a one-off event?
>>>>>>>>>> Why not just take the perf hit and recreate the file?
>>>>>>>>>>
>>>>>>>>>> If you need to do this regularly you should consider a mutable
>>>>>>>>>> file store like HBase. If you start modifying blocks from under
>>>>>>>>>> HDFS you open up all sorts of consistency issues.
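[Editorial note: the trade-off being weighed here (replication factor vs.
erasure coding) comes down to simple arithmetic. The sketch below compares
n-way replication with a (k, m) Reed-Solomon layout; RS(10, 4) is the stripe
shape described in Facebook's HDFS-RAID post, used here as an illustrative
assumption:]

```python
# Storage overhead vs. failure tolerance: n-way replication compared
# with a (k data, m parity) Reed-Solomon stripe.

def replication(n: int):
    # n full copies: n-times storage, survives the loss of n - 1 copies.
    return {"overhead": float(n), "tolerated_failures": n - 1}

def reed_solomon(k: int, m: int):
    # k data blocks + m parity blocks: any k of the k + m blocks
    # suffice to decode, so up to m losses per stripe are survivable.
    return {"overhead": (k + m) / k, "tolerated_failures": m}

assert replication(3) == {"overhead": 3.0, "tolerated_failures": 2}
assert reed_solomon(10, 4) == {"overhead": 1.4, "tolerated_failures": 4}
```

[So an RS(10, 4) stripe stores 1.4x the raw data while tolerating 4 lost
blocks, versus 3x storage and 2 lost copies for triple replication; this is
the basis of Zesheng's later point that Reed-Solomon can keep equal or higher
failure tolerance.]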
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo <gsmst...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> That will break the consistency of the file system, but it
>>>>>>>>>>> doesn't hurt to try.
>>>>>>>>>>> On Jul 17, 2014 8:48 PM, "Zesheng Wu" <wuzeshen...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> How about writing a new block with a new checksum file, and
>>>>>>>>>>>> replacing both the old block file and its checksum file?
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil <wellington.chevre...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> there's no way to do that, as HDFS does not provide file
>>>>>>>>>>>>> update features. You'll need to write a new file with the
>>>>>>>>>>>>> changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Notice that even if you manage to find the physical block
>>>>>>>>>>>>> replica files on the disk corresponding to the part of the
>>>>>>>>>>>>> file you want to change, you can't simply update them
>>>>>>>>>>>>> manually, as this would give a different checksum, making
>>>>>>>>>>>>> HDFS mark such blocks as corrupt.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Wellington.
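[Editorial note: Wellington's point about checksums can be demonstrated with
a toy model. HDFS stores a CRC checksum per fixed-size chunk of each block
(in a .meta file alongside the block file), so an in-place edit is caught on
read. A simplified sketch using CRC32; the chunk size and in-memory layout
here are illustrative, not the real on-disk format:]

```python
import zlib

CHUNK = 512  # HDFS checksums block data in fixed-size chunks (512 B default)

def chunk_checksums(block: bytes):
    # One CRC32 per chunk, loosely mirroring the per-block .meta file.
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

def verify(block: bytes, checksums) -> bool:
    # A block scanner recomputes the CRCs and compares with stored ones.
    return chunk_checksums(block) == checksums

block = bytes(1024)             # a healthy two-chunk "block"
meta = chunk_checksums(block)   # recorded when the block was written

assert verify(block, meta)      # untouched block passes verification

tampered = block[:100] + b"\x01" + block[101:]  # edit one byte in place
assert not verify(tampered, meta)  # scanner now flags the block as corrupt
```

[This is why Zesheng's proposal above has to replace the checksum file
together with the block file: a new block body with the old .meta will always
fail verification.]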
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 17 Jul 2014, at 10:50, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > Hi guys,
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I recently encountered a scenario in which I need to
>>>>>>>>>>>>> replace an existing block with a newly written block.
>>>>>>>>>>>>> > The most straightforward way to do this might be:
>>>>>>>>>>>>> > Suppose the original file is A. We write a new file B,
>>>>>>>>>>>>> which is composed of the new data blocks, then we merge A
>>>>>>>>>>>>> and B into C, which is the file we want.
>>>>>>>>>>>>> > The obvious shortcoming of this method is the waste of
>>>>>>>>>>>>> network bandwidth.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I'm wondering whether there is a way to replace the old
>>>>>>>>>>>>> block with the new block directly.
>>>>>>>>>>>>> > Any thoughts?
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > --
>>>>>>>>>>>>> > Best Wishes!
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Yours, Zesheng
--
Best Wishes!

Yours, Zesheng