I wrote my answer with the XOR implementation in mind. With Reed-Solomon and single replication, the cases that need to be considered are indeed fewer and simpler.
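To make the comparison concrete, here is a toy sketch of single-parity (XOR) reconstruction (plain Python, not the actual HDFS RAID code), showing that any one lost block can be rebuilt from the surviving blocks plus the parity block:

    from functools import reduce

    def xor_parity(blocks):
        # Byte-wise XOR across equal-sized blocks.
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    blocks = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_parity(blocks)

    # Lose any single block: XOR of the survivors and the parity restores it.
    lost = blocks[1]
    recovered = xor_parity([blocks[0], blocks[2], parity])
    assert recovered == lost

Reed-Solomon generalizes this: with k data blocks and m coding blocks, any m lost blocks can be rebuilt, instead of just one.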
It seems I was wrong about my last statement, though. If the machine hosting a single-replicated block is lost, it isn't true that the block can't be reconstructed: with a RAID6-style strategy / Reed-Solomon, reconstruction from a summary of the data is indeed possible, though of course only up to a certain number of lost blocks. There is clearly a case for such a tool on a Hadoop cluster.

Bertrand Dechoux


On Mon, Jul 21, 2014 at 2:35 PM, Zesheng Wu <wuzeshen...@gmail.com> wrote:

> If a block is corrupted but hasn't been detected by HDFS, you could
> delete the block from the local filesystem (it's only a file) and HDFS
> will replicate the good remaining replica of this block.
>
> We have only one replica of each block; if a block is corrupted, HDFS
> cannot re-replicate it.
>
>
> 2014-07-21 20:30 GMT+08:00 Zesheng Wu <wuzeshen...@gmail.com>:
>
>> Thanks Bertrand, my reply comments are inline below.
>>
>> So you know that a block is corrupted thanks to an external process,
>> which in this case is checking the parity blocks. If a block is
>> corrupted but hasn't been detected by HDFS, you could delete the block
>> from the local filesystem (it's only a file) and HDFS will replicate
>> the good remaining replica of this block.
>> [Zesheng: We will implement a periodic checking mechanism to detect
>> corrupted blocks.]
>>
>> For performance reasons (and that's what you want to do?), you might
>> be able to fix the corruption without needing to retrieve the good
>> replica. It might be possible by working directly with the local
>> filesystem, replacing the corrupted block with the corrected block
>> (which, again, are files). One issue is that the corrected block might
>> be different from the good replica. If HDFS is able to tell (with the
>> CRC), it might be fine; otherwise you will end up with two different
>> 'good' replicas for the same block, and that will not be pretty...
>> [Zesheng: Indeed, we will use Reed-Solomon erasure codes to implement
>> the HDFS RAID, so a corrupted block will be recovered from the
>> remaining good data and coding blocks.]
>>
>> If the result is to be open source, you might want to check with
>> Facebook about their implementation and track the progress within
>> Apache JIRA. You could gain additional feedback. One downside of HDFS
>> RAID is that the fewer replicas there are, the less efficient/fast
>> reads of the data for processing will be. Reducing the number of
>> replicas also diminishes the number of supported node failures. I
>> wouldn't say it's an easy ride.
>> [Zesheng: Yes, I agree with you about the read performance downgrade,
>> but not about the number of supported node failures; the Reed-Solomon
>> algorithm can maintain equal or even higher tolerance of node
>> failures. About open source, we will consider this in the future.]
>>
>>
>> 2014-07-21 20:01 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>>
>>> So you know that a block is corrupted thanks to an external process,
>>> which in this case is checking the parity blocks. If a block is
>>> corrupted but hasn't been detected by HDFS, you could delete the
>>> block from the local filesystem (it's only a file) and HDFS will
>>> replicate the good remaining replica of this block.
>>>
>>> For performance reasons (and that's what you want to do?), you might
>>> be able to fix the corruption without needing to retrieve the good
>>> replica. It might be possible by working directly with the local
>>> filesystem, replacing the corrupted block with the corrected block
>>> (which, again, are files).
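As a minimal sketch of the "a block is only a file" point (the data directory layout and block name below are hypothetical, adjust for a real cluster):

    import os

    DATA_DIR = "/data/dfs/dn"      # hypothetical DataNode data directory
    BLOCK_ID = "blk_1073741825"    # hypothetical block ID

    # A replica is just a pair of local files: the block itself and its
    # checksum metadata (e.g. blk_1073741825 and blk_1073741825_1001.meta).
    def find_replica_files(data_dir, block_id):
        hits = []
        for root, _, files in os.walk(data_dir):
            for name in files:
                if name.startswith(block_id):
                    hits.append(os.path.join(root, name))
        return hits

    # Deleting both files makes HDFS eventually see the replica as missing
    # and re-replicate it from a good copy, if one exists.
    for path in find_replica_files(DATA_DIR, BLOCK_ID):
        print("would delete:", path)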
>>> One issue is that the corrected block might be different from the
>>> good replica. If HDFS is able to tell (with the CRC), it might be
>>> fine; otherwise you will end up with two different 'good' replicas
>>> for the same block, and that will not be pretty...
>>>
>>> If the result is to be open source, you might want to check with
>>> Facebook about their implementation and track the progress within
>>> Apache JIRA. You could gain additional feedback. One downside of HDFS
>>> RAID is that the fewer replicas there are, the less efficient/fast
>>> reads of the data for processing will be. Reducing the number of
>>> replicas also diminishes the number of supported node failures. I
>>> wouldn't say it's an easy ride.
>>>
>>> Bertrand Dechoux
>>>
>>>
>>> On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu <wuzeshen...@gmail.com>
>>> wrote:
>>>
>>>> We want to implement a RAID on top of HDFS, something like what
>>>> Facebook implemented, as described in:
>>>> https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/
>>>>
>>>>
>>>> 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>>>>
>>>>> Do you want to implement a RAID on top of HDFS, or use HDFS on top
>>>>> of RAID? I am not sure I understand either of these use cases. HDFS
>>>>> handles replication and error detection for you. Wouldn't
>>>>> fine-tuning the cluster be the easier solution?
>>>>>
>>>>> Bertrand Dechoux
>>>>>
>>>>>
>>>>> On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu <wuzeshen...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the reply, Arpit.
>>>>>> Yes, we need to do this regularly. The original requirement is
>>>>>> that we want to do RAID (based on Reed-Solomon erasure codes) on
>>>>>> our HDFS cluster. When a block is corrupted or missing, the
>>>>>> degraded read needs quick recovery of the block. We are
>>>>>> considering how to recover a corrupted/missing block quickly.
>>>>>>
>>>>>>
>>>>>> 2014-07-19 5:18 GMT+08:00 Arpit Agarwal <aagar...@hortonworks.com>:
>>>>>>
>>>>>>> IMHO this is a spectacularly bad idea. Is it a one-off event? Why
>>>>>>> not just take the perf hit and recreate the file?
>>>>>>>
>>>>>>> If you need to do this regularly, you should consider a mutable
>>>>>>> file store like HBase. If you start modifying blocks from under
>>>>>>> HDFS, you open up all sorts of consistency issues.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo <gsmst...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> That will break the consistency of the file system, but it
>>>>>>>> doesn't hurt to try.
>>>>>>>> On Jul 17, 2014 8:48 PM, "Zesheng Wu" <wuzeshen...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> How about writing a new block with a new checksum file, and
>>>>>>>>> replacing both the old block file and the checksum file?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil <
>>>>>>>>> wellington.chevre...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> there's no way to do that, as HDFS does not provide file
>>>>>>>>>> update features. You'll need to write a new file with the
>>>>>>>>>> changes.
>>>>>>>>>>
>>>>>>>>>> Notice that even if you manage to find the physical block
>>>>>>>>>> replica files on disk, corresponding to the part of the file
>>>>>>>>>> you want to change, you can't simply update them manually, as
>>>>>>>>>> this would give a different checksum, making HDFS mark such
>>>>>>>>>> blocks as corrupt.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Wellington.
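As a toy illustration of Wellington's point about checksums: HDFS keeps a checksum per 512-byte chunk of each block in the .meta file, so any hand-edited byte makes verification fail (zlib.crc32 below is only a stand-in for the real checksum implementation):

    import zlib

    BYTES_PER_CHECKSUM = 512  # HDFS default chunk size for checksums

    def chunk_checksums(data):
        # One checksum per 512-byte chunk, as the .meta file stores.
        return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
                for i in range(0, len(data), BYTES_PER_CHECKSUM)]

    block = bytes(1024)        # a 1 KB toy "block"
    stored = chunk_checksums(block)

    edited = b"X" + block[1:]  # change a single byte by hand
    # Verification now fails, so HDFS would mark the block as corrupt.
    assert chunk_checksums(edited) != stored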
>>>>>>>>>>
>>>>>>>>>> On 17 Jul 2014, at 10:50, Zesheng Wu <wuzeshen...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> > Hi guys,
>>>>>>>>>> >
>>>>>>>>>> > I recently encountered a scenario which needs to replace an
>>>>>>>>>> existing block with a newly written block.
>>>>>>>>>> > The most straightforward way to do this may be the
>>>>>>>>>> following: suppose the original file is A; we write a new file
>>>>>>>>>> B composed of the new data blocks, then we merge A and B into
>>>>>>>>>> C, which is the file we want.
>>>>>>>>>> > The obvious shortcoming of this method is the waste of
>>>>>>>>>> network bandwidth.
>>>>>>>>>> >
>>>>>>>>>> > I'm wondering whether there is a way to replace the old
>>>>>>>>>> block with the new block directly.
>>>>>>>>>> > Any thoughts?
>>>>>>>>>> >
>>>>>>>>>> > --
>>>>>>>>>> > Best Wishes!
>>>>>>>>>> >
>>>>>>>>>> > Yours, Zesheng
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Wishes!
>>>>>>>>>
>>>>>>>>> Yours, Zesheng
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Wishes!
>>>>>>
>>>>>> Yours, Zesheng
>>>>
>>>>
>>>> --
>>>> Best Wishes!
>>>>
>>>> Yours, Zesheng
>>
>>
>> --
>> Best Wishes!
>>
>> Yours, Zesheng
>
>
> --
> Best Wishes!
>
> Yours, Zesheng
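For reference, a toy version of the "merge A and B into C" approach from the original question, with plain local files standing in for HDFS streams and a hypothetical 128 MB block size. It makes the bandwidth cost visible: every byte of A is read and rewritten.

    BLOCK_SIZE = 128 * 1024 * 1024  # hypothetical HDFS block size

    def splice(a_path, new_block, block_index, c_path):
        # Rewrite file A into C, substituting one block's worth of bytes.
        # Assumes new_block is exactly BLOCK_SIZE bytes (except possibly
        # for the last block of the file).
        with open(a_path, "rb") as a, open(c_path, "wb") as c:
            c.write(a.read(block_index * BLOCK_SIZE))  # blocks before
            c.write(new_block)                         # the replacement
            a.seek((block_index + 1) * BLOCK_SIZE)
            c.write(a.read())                          # blocks after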