And there is actually quite a lot of information about it:

https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java
http://wiki.apache.org/hadoop/HDFS-RAID
https://issues.apache.org/jira/browse/MAPREDUCE/component/12313416/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel

Bertrand Dechoux

On Mon, Jul 21, 2014 at 3:46 PM, Bertrand Dechoux <decho...@gmail.com> wrote:

I wrote my answer thinking about the XOR implementation. With Reed-Solomon
and single replication, the cases that need to be considered are indeed
smaller and simpler.

It seems I was wrong about my last statement, though. If the machine hosting
a single-replicated block is lost, one might think the block couldn't be
reconstructed from a mere summary of the data. But with a RAID6 strategy /
RS, it is indeed possible, up to a certain number of lost blocks of course.

There clearly is a case for such a tool on a Hadoop cluster.

Bertrand Dechoux

On Mon, Jul 21, 2014 at 2:35 PM, Zesheng Wu <wuzeshen...@gmail.com> wrote:

> If a block is corrupted but hasn't been detected by HDFS, you could delete
> the block from the local filesystem (it's only a file) and HDFS will then
> replicate the good remaining replica of this block.

We only have one replica of each block, so if a block is corrupted, HDFS
cannot re-replicate it.

2014-07-21 20:30 GMT+08:00 Zesheng Wu <wuzeshen...@gmail.com>:

Thanks Bertrand, my reply comments are inline below.

> So you know that a block is corrupted thanks to an external process, which
> in this case is checking the parity blocks. If a block is corrupted but
> hasn't been detected by HDFS, you could delete the block from the local
> filesystem (it's only a file) and HDFS will then replicate the good
> remaining replica of this block.

[Zesheng: We will implement a periodical checking mechanism to detect
corrupted blocks.]

> For performance reasons (and that's what you want to do?), you might be
> able to fix the corruption without needing to retrieve the good replica.
> It might be possible by working directly with the local filesystem,
> replacing the corrupted block with the corrected block (which, again, are
> files). One issue is that the corrected block might be different from the
> good replica. If HDFS is able to tell (with CRC) it might be fine; else
> you will end up with two different "good" replicas for the same block, and
> that will not be pretty...

[Zesheng: Indeed, we will use Reed-Solomon erasure codes to implement the
HDFS RAID, so a corrupted block will be recovered from the good data and
coding blocks.]

> If the result is to be open source, you might want to check with Facebook
> about their implementation and track the process within the Apache JIRA.
> You could gain additional feedback. One downside of HDFS RAID is that the
> fewer replicas there are, the less efficient/fast reads of the data for
> processing will be. Reducing the number of replicas also diminishes the
> number of node failures that can be tolerated. I wouldn't say it's an
> easy ride.

[Zesheng: Yes, I agree with you about the read performance downgrade, but
not about the number of supported node failures; the Reed-Solomon algorithm
can maintain equal or even higher tolerance of node failures. About open
source, we will consider this in the future.]
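To make the parity idea concrete, here is a minimal XOR sketch in Java
(illustrative only, not the HDFS-RAID code; Reed-Solomon generalizes the
same idea so that a stripe survives several lost blocks at once):

// Single-parity (XOR) sketch: one parity block protects a stripe of
// equally sized data blocks. Any ONE lost block, data or parity, can be
// rebuilt by XOR-ing all surviving blocks of the stripe.
public class XorParity {

    // parity[i] = block0[i] ^ block1[i] ^ ... ^ blockN[i]
    static byte[] encode(byte[][] dataBlocks) {
        byte[] parity = new byte[dataBlocks[0].length];
        for (byte[] block : dataBlocks) {
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= block[i];
            }
        }
        return parity;
    }

    // Rebuild the single lost data block from the survivors plus parity:
    // XOR cancels every block that is still present.
    static byte[] reconstruct(byte[][] survivingBlocks, byte[] parity) {
        byte[] lost = parity.clone();
        for (byte[] block : survivingBlocks) {
            for (int i = 0; i < lost.length; i++) {
                lost[i] ^= block[i];
            }
        }
        return lost;
    }
}

With Reed-Solomon RS(k, m), each stripe of k data blocks gets m coding
blocks, and any m blocks of the stripe may be lost at once, which is why
equal or better failure tolerance than plain replication is achievable at
a fraction of the storage overhead.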
2014-07-21 20:01 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:

So you know that a block is corrupted thanks to an external process, which
in this case is checking the parity blocks. If a block is corrupted but
hasn't been detected by HDFS, you could delete the block from the local
filesystem (it's only a file) and HDFS will then replicate the good
remaining replica of this block.

For performance reasons (and that's what you want to do?), you might be
able to fix the corruption without needing to retrieve the good replica.
It might be possible by working directly with the local filesystem,
replacing the corrupted block with the corrected block (which, again, are
files). One issue is that the corrected block might be different from the
good replica. If HDFS is able to tell (with CRC) it might be fine; else you
will end up with two different "good" replicas for the same block, and that
will not be pretty...

If the result is to be open source, you might want to check with Facebook
about their implementation and track the process within the Apache JIRA.
You could gain additional feedback. One downside of HDFS RAID is that the
fewer replicas there are, the less efficient/fast reads of the data for
processing will be. Reducing the number of replicas also diminishes the
number of node failures that can be tolerated. I wouldn't say it's an easy
ride.

Bertrand Dechoux

On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu <wuzeshen...@gmail.com> wrote:

We want to implement a RAID on top of HDFS, something like what Facebook
implemented, as described in:
https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/

2014-07-21 17:19 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:

Do you want to implement a RAID on top of HDFS, or use HDFS on top of RAID?
I am not sure I understand either of these use cases. HDFS handles
replication and error detection for you. Wouldn't fine-tuning the cluster
be the easier solution?

Bertrand Dechoux

On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu <wuzeshen...@gmail.com> wrote:

Thanks for the reply, Arpit.
Yes, we need to do this regularly. The original requirement is that we want
to do RAID (based on Reed-Solomon erasure codes) on our HDFS cluster. When
a block is corrupted or missing, the degraded read needs quick recovery of
the block, so we are considering how to recover a corrupted/missing block
quickly.

2014-07-19 5:18 GMT+08:00 Arpit Agarwal <aagar...@hortonworks.com>:

IMHO this is a spectacularly bad idea. Is it a one-off event? Why not just
take the perf hit and recreate the file?

If you need to do this regularly, you should consider a mutable file store
like HBase. If you start modifying blocks from under HDFS, you open up all
sorts of consistency issues.

On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo <gsmst...@gmail.com> wrote:

That will break the consistency of the file system, but it doesn't hurt to
try.

On Jul 17, 2014 8:48 PM, "Zesheng Wu" <wuzeshen...@gmail.com> wrote:

How about writing a new block with a new checksum file, and replacing both
the old block file and the old checksum file?
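The proposal above amounts to swapping a pair of files on a DataNode disk:
the block file and its ".meta" companion holding the per-chunk CRCs. A
hypothetical sketch of that swap (block ID, generation stamp, and paths are
illustrative; real replicas live under dfs.datanode.data.dir):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the proposed swap. A finalized replica is just two
// local files, e.g.
//   .../current/finalized/.../blk_1073741825            (block data)
//   .../current/finalized/.../blk_1073741825_1001.meta  (per-chunk CRCs)
// Swapping them behind HDFS's back is exactly the consistency risk raised
// in the replies: neither the NameNode nor the DataNode is told about it.
public class BlockSwapSketch {
    static void swap(Path oldBlock, Path oldMeta,
                     Path rebuiltBlock, Path rebuiltMeta) throws IOException {
        // The rebuilt .meta must contain CRCs of the rebuilt bytes, otherwise
        // the DataNode's block scanner will flag the replica as corrupt.
        Files.move(rebuiltBlock, oldBlock, StandardCopyOption.REPLACE_EXISTING);
        Files.move(rebuiltMeta, oldMeta, StandardCopyOption.REPLACE_EXISTING);
    }
}

Even then, the swap is only safe if the rebuilt block matches what the
NameNode believes about the block (same length and generation stamp), which
is why the reply below insists a plain file update is not supported.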
2014-07-17 19:34 GMT+08:00 Wellington Chevreuil <wellington.chevre...@gmail.com>:

Hi,

there's no way to do that, as HDFS does not provide file update features.
You'll need to write a new file with the changes.

Notice that even if you manage to find the physical block replica files on
disk corresponding to the part of the file you want to change, you can't
simply update them manually, as that would produce a different checksum,
making HDFS mark those blocks as corrupt.

Regards,
Wellington.

On 17 Jul 2014, at 10:50, Zesheng Wu <wuzeshen...@gmail.com> wrote:

Hi guys,

I recently encountered a scenario in which I need to replace an existing
block with a newly written block.
The most straightforward way to do this might be the following: suppose the
original file is A; we write a new file B composed of the new data blocks,
then merge A and B into C, which is the file we want.
The obvious shortcoming of this method is the waste of network bandwidth.

I'm wondering whether there is a way to replace the old block with the new
block directly. Any thoughts?

--
Best Wishes!

Yours, Zesheng
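For reference, a minimal sketch of the merge approach described above,
written against the public FileSystem API (a hypothetical helper that
simply concatenates A and B into C):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// The "straightforward way": stream A and B back through the client into a
// new file C. Every byte is read over the network and written out again
// (times the replication factor), which is the bandwidth cost lamented above.
public class MergeSketch {
    static void merge(FileSystem fs, Path a, Path b, Path c) throws IOException {
        try (FSDataOutputStream out = fs.create(c)) {
            for (Path src : new Path[] { a, b }) {
                try (FSDataInputStream in = fs.open(src)) {
                    IOUtils.copyBytes(in, out, 4096, false); // keep out open
                }
            }
        }
    }
}

Where the block boundaries line up, DistributedFileSystem#concat can
instead stitch the source files onto a target as a NameNode metadata
operation, with no byte copying, though it comes with restrictions on block
fill and replication, so it is not a general answer to the question.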