And there is actually quite a lot of information about it.

https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java

http://wiki.apache.org/hadoop/HDFS-RAID

https://issues.apache.org/jira/browse/MAPREDUCE/component/12313416/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel

Bertrand Dechoux


On Mon, Jul 21, 2014 at 3:46 PM, Bertrand Dechoux <decho...@gmail.com>
wrote:

> I wrote my answer thinking about the XOR implementation. With Reed-Solomon
> and single replication, the cases that need to be considered are indeed
> fewer and simpler.
>
> It seems I was wrong about my last statement, though. One would not expect
> that a block with a single replica could be reconstructed from a summary of
> the data once the machine hosting it is lost. But with a RAID6-like
> strategy / RS, it is indeed possible, though of course only up to a certain
> number of lost blocks.
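>
> To make that concrete, here is a minimal, self-contained sketch (not the
> HDFS-RAID code) showing how one lost block can be rebuilt from a simple XOR
> parity block; Reed-Solomon generalises the same idea to several parity
> blocks, and therefore to several simultaneous losses.
>
>     public class XorParityDemo {
>
>         // XOR the given blocks together, byte by byte.
>         static byte[] xor(byte[]... blocks) {
>             byte[] out = new byte[blocks[0].length];
>             for (byte[] b : blocks) {
>                 for (int i = 0; i < out.length; i++) {
>                     out[i] ^= b[i];
>                 }
>             }
>             return out;
>         }
>
>         public static void main(String[] args) {
>             byte[] b0 = "data-block-0".getBytes();
>             byte[] b1 = "data-block-1".getBytes();
>             byte[] parity = xor(b0, b1);
>
>             // Lose b1: rebuild it from the surviving block and the parity.
>             byte[] rebuilt = xor(b0, parity);
>             System.out.println(new String(rebuilt)); // prints "data-block-1"
>         }
>     }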
>
> There is clearly a case for such a tool on a Hadoop cluster.
>
> Bertrand Dechoux
>
>
> On Mon, Jul 21, 2014 at 2:35 PM, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>
>>  If a block is corrupted but hasn't been detected by HDFS, you could
>> delete the block from the local filesystem (it's only a file) and HDFS
>> will then re-replicate it from the remaining good replica.
>>
>> We only have one replica for each block, so if a block is corrupted, HDFS
>> cannot re-replicate it.
>>
>>
>> 2014-07-21 20:30 GMT+08:00 Zesheng Wu <wuzeshen...@gmail.com>:
>>
>> Thanks Bertrand, my comments are inline below.
>>>
>>> So you know that a block is corrupted thanks to an external process
>>> which in this case is checking the parity blocks. If a block is corrupted
>>> but hasn't been detected by HDFS, you could delete the block from the
>>> local filesystem (it's only a file) and HDFS will then re-replicate it
>>> from the remaining good replica.
>>>  *[Zesheng: We will implement a periodic checking mechanism to detect
>>> corrupted blocks.]*
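>>>
>>> A minimal sketch of such a checker, assuming the data lives under a
>>> hypothetical /raided directory: it simply reads every file back through
>>> the HDFS client, which verifies the stored CRCs as it reads, so a corrupt
>>> replica surfaces as an IOException (checksum error) on read.
>>>
>>>     import org.apache.hadoop.conf.Configuration;
>>>     import org.apache.hadoop.fs.FSDataInputStream;
>>>     import org.apache.hadoop.fs.FileSystem;
>>>     import org.apache.hadoop.fs.LocatedFileStatus;
>>>     import org.apache.hadoop.fs.Path;
>>>     import org.apache.hadoop.fs.RemoteIterator;
>>>
>>>     public class CorruptionScan {
>>>         public static void main(String[] args) throws Exception {
>>>             FileSystem fs = FileSystem.get(new Configuration());
>>>             byte[] buf = new byte[64 * 1024];
>>>             // Walk every file under the (hypothetical) raided directory.
>>>             RemoteIterator<LocatedFileStatus> it =
>>>                 fs.listFiles(new Path("/raided"), true);
>>>             while (it.hasNext()) {
>>>                 Path p = it.next().getPath();
>>>                 try (FSDataInputStream in = fs.open(p)) {
>>>                     // Reading through the client verifies block checksums.
>>>                     while (in.read(buf) != -1) { }
>>>                 } catch (java.io.IOException e) {
>>>                     System.err.println("Possible corruption in " + p + ": " + e);
>>>                 }
>>>             }
>>>         }
>>>     }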
>>>
>>> For performance reasons (and that's what you want to do?), you might be
>>> able to fix the corruption without needing to retrieve the good replica.
>>> It might be possible by working directly with the local filesystem and
>>> replacing the corrupted block with the corrected block (which, again, are
>>> files). One issue is that the corrected block might differ from the good
>>> replica. If HDFS is able to tell (with CRC) it might be fine; otherwise
>>> you will end up with two different good replicas for the same block, and
>>> that will not be pretty...
>>> *[Zesheng: Indeed, we will use Reed-Solomon erasure codes to implement
>>> the HDFS RAID, so the corrupted block will be recovered from the
>>> remaining good data and coding blocks.]*
>>>
>>> If the result is to be open source, you might want to check with
>>> Facebook about their implementation and track the progress within Apache
>>> JIRA. You could gain additional feedback. One downside of HDFS RAID is
>>> that the fewer replicas there are, the less efficient/fast reads of the
>>> data for processing will be. Reducing the number of replicas also reduces
>>> the number of node failures that can be tolerated. I wouldn't say it's an
>>> easy ride.
>>> *[Zesheng: Yes, I agree with you about the read performance degradation,
>>> but not about the number of tolerated node failures: the Reed-Solomon
>>> algorithm can maintain equal or even higher tolerance to node failures.
>>> About open source, we will consider this in the future.]*
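>>>
>>> As a rough illustration of that trade-off, using a (10, 4) Reed-Solomon
>>> layout purely as an example (the actual stripe parameters are a design
>>> choice, not fixed by HDFS):
>>>
>>>     public class RaidMath {
>>>         public static void main(String[] args) {
>>>             // Example only: 10 data blocks + 4 parity blocks per stripe.
>>>             int k = 10, m = 4;
>>>             System.out.println("RS storage overhead: "
>>>                 + (k + m) / (double) k + "x");          // 1.4x
>>>             System.out.println("3-way replication overhead: 3.0x");
>>>             // RS survives the loss of any m blocks in a stripe;
>>>             // 3-way replication survives the loss of any 2 replicas.
>>>             System.out.println("RS losses tolerated per stripe: " + m);
>>>             System.out.println("Replication losses tolerated per block: 2");
>>>         }
>>>     }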
>>>
>>>
>>> 2014-07-21 20:01 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>>>
>>>> So you know that a block is corrupted thanks to an external process
>>>> which in this case is checking the parity blocks. If a block is corrupted
>>>> but hasn't been detected by HDFS, you could delete the block from the
>>>> local filesystem (it's only a file) and HDFS will then re-replicate it
>>>> from the remaining good replica.
>>>>
>>>> For performance reasons (and that's what you want to do?), you might be
>>>> able to fix the corruption without needing to retrieve the good replica.
>>>> It might be possible by working directly with the local filesystem and
>>>> replacing the corrupted block with the corrected block (which, again, are
>>>> files). One issue is that the corrected block might differ from the good
>>>> replica. If HDFS is able to tell (with CRC) it might be fine; otherwise
>>>> you will end up with two different good replicas for the same block, and
>>>> that will not be pretty...
>>>>
>>>> If the result is to be open source, you might want to check with
>>>> Facebook about their implementation and track the progress within Apache
>>>> JIRA. You could gain additional feedback. One downside of HDFS RAID is
>>>> that the fewer replicas there are, the less efficient/fast reads of the
>>>> data for processing will be. Reducing the number of replicas also reduces
>>>> the number of node failures that can be tolerated. I wouldn't say it's an
>>>> easy ride.
>>>>
>>>> Bertrand Dechoux
>>>>
>>>>
>>>> On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu <wuzeshen...@gmail.com>
>>>> wrote:
>>>>
>>>>> We want to implement RAID on top of HDFS, something like what Facebook
>>>>> implemented, as described in:
>>>>> https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/
>>>>>
>>>>>
>>>>> 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>>>>>
>>>>>> Do you want to implement RAID on top of HDFS, or use HDFS on top of
>>>>>> RAID? I am not sure I understand either of these use cases. HDFS handles
>>>>>> replication and error detection for you. Wouldn't fine-tuning the
>>>>>> cluster be the easier solution?
>>>>>>
>>>>>> Bertrand Dechoux
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu <wuzeshen...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the reply, Arpit.
>>>>>>> Yes, we need to do this regularly. The original requirement is that we
>>>>>>> want to do RAID (based on Reed-Solomon erasure codes) on our HDFS
>>>>>>> cluster. When a block is corrupted or missing, the degraded read needs
>>>>>>> quick recovery of the block. We are considering how to recover the
>>>>>>> corrupted/missing block quickly.
>>>>>>>
>>>>>>>
>>>>>>> 2014-07-19 5:18 GMT+08:00 Arpit Agarwal <aagar...@hortonworks.com>:
>>>>>>>
>>>>>>>> IMHO this is a spectacularly bad idea. Is it a one-off event? Why
>>>>>>>> not just take the perf hit and recreate the file?
>>>>>>>>
>>>>>>>> If you need to do this regularly you should consider a mutable file
>>>>>>>> store like HBase. If you start modifying blocks out from under HDFS
>>>>>>>> you open up all sorts of consistency issues.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo <gsmst...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> That will break the consistency of the file system, but it doesn't
>>>>>>>>> hurt to try.
>>>>>>>>>  On Jul 17, 2014 8:48 PM, "Zesheng Wu" <wuzeshen...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> How about writing a new block with a new checksum file, and
>>>>>>>>>> replacing both the old block file and the checksum file?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil <
>>>>>>>>>> wellington.chevre...@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> there's no way to do that, as HDFS does not provide file update
>>>>>>>>>>> features. You'll need to write a new file with the changes.
>>>>>>>>>>>
>>>>>>>>>>> Notice that even if you manage to find the physical block replica
>>>>>>>>>>> files on disk corresponding to the part of the file you want to
>>>>>>>>>>> change, you can't simply update them manually, as this would produce
>>>>>>>>>>> a different checksum, making HDFS mark such blocks as corrupt.
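>>>>>>>>>>>
>>>>>>>>>>> (Roughly speaking, HDFS keeps a CRC32 checksum for every 512-byte
>>>>>>>>>>> chunk of a block in a companion .meta file. The sketch below only
>>>>>>>>>>> illustrates that per-chunk checksum idea; it does not reproduce the
>>>>>>>>>>> exact on-disk .meta format, and the block path is a placeholder.)
>>>>>>>>>>>
>>>>>>>>>>>     import java.io.FileInputStream;
>>>>>>>>>>>     import java.io.IOException;
>>>>>>>>>>>     import java.io.InputStream;
>>>>>>>>>>>     import java.util.zip.CRC32;
>>>>>>>>>>>
>>>>>>>>>>>     public class ChunkChecksums {
>>>>>>>>>>>         public static void main(String[] args) throws IOException {
>>>>>>>>>>>             int bytesPerChecksum = 512;
>>>>>>>>>>>             byte[] chunk = new byte[bytesPerChecksum];
>>>>>>>>>>>             // args[0]: e.g. a blk_* replica file on a datanode disk
>>>>>>>>>>>             try (InputStream in = new FileInputStream(args[0])) {
>>>>>>>>>>>                 int n;
>>>>>>>>>>>                 while ((n = in.read(chunk)) > 0) {
>>>>>>>>>>>                     CRC32 crc = new CRC32();
>>>>>>>>>>>                     crc.update(chunk, 0, n);
>>>>>>>>>>>                     System.out.printf("chunk crc32=%08x%n", crc.getValue());
>>>>>>>>>>>                 }
>>>>>>>>>>>             }
>>>>>>>>>>>         }
>>>>>>>>>>>     }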
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Wellington.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 17 Jul 2014, at 10:50, Zesheng Wu <wuzeshen...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> > Hi guys,
>>>>>>>>>>> >
>>>>>>>>>>> > I recently encountered a scenario which requires replacing an
>>>>>>>>>>> > existing block with a newly written block.
>>>>>>>>>>> > The most straightforward way to do this may be the following:
>>>>>>>>>>> > suppose the original file is A, and we write a new file B composed
>>>>>>>>>>> > of the new data blocks; then we merge A and B into C, which is the
>>>>>>>>>>> > file we want.
>>>>>>>>>>> > The obvious shortcoming of this method is the waste of network
>>>>>>>>>>> > bandwidth.
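>>>>>>>>>>> >
>>>>>>>>>>> > (As a rough sketch of that straightforward approach, treating the
>>>>>>>>>>> > merge as a simple concatenation and using placeholder paths /tmp/A,
>>>>>>>>>>> > /tmp/B and /tmp/C: every byte of A and B is streamed through the
>>>>>>>>>>> > client again, which is exactly the bandwidth cost mentioned above.)
>>>>>>>>>>> >
>>>>>>>>>>> >     import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>> >     import org.apache.hadoop.fs.FSDataInputStream;
>>>>>>>>>>> >     import org.apache.hadoop.fs.FSDataOutputStream;
>>>>>>>>>>> >     import org.apache.hadoop.fs.FileSystem;
>>>>>>>>>>> >     import org.apache.hadoop.fs.Path;
>>>>>>>>>>> >     import org.apache.hadoop.io.IOUtils;
>>>>>>>>>>> >
>>>>>>>>>>> >     public class RewriteFile {
>>>>>>>>>>> >         public static void main(String[] args) throws Exception {
>>>>>>>>>>> >             Configuration conf = new Configuration();
>>>>>>>>>>> >             FileSystem fs = FileSystem.get(conf);
>>>>>>>>>>> >             // Write C by re-reading A and B through the client.
>>>>>>>>>>> >             try (FSDataOutputStream out = fs.create(new Path("/tmp/C"))) {
>>>>>>>>>>> >                 for (String src : new String[] {"/tmp/A", "/tmp/B"}) {
>>>>>>>>>>> >                     try (FSDataInputStream in = fs.open(new Path(src))) {
>>>>>>>>>>> >                         IOUtils.copyBytes(in, out, conf, false);
>>>>>>>>>>> >                     }
>>>>>>>>>>> >                 }
>>>>>>>>>>> >             }
>>>>>>>>>>> >         }
>>>>>>>>>>> >     }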
>>>>>>>>>>> >
>>>>>>>>>>> > I'm wondering whether there is a way to replace the old block
>>>>>>>>>>> > with the new block directly.
>>>>>>>>>>> > Any thoughts?
>>>>>>>>>>> >
>>>>>>>>>>> > --
>>>>>>>>>>> > Best Wishes!
>>>>>>>>>>> >
>>>>>>>>>>> > Yours, Zesheng
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best Wishes!
>>>>>>>>>>
>>>>>>>>>> Yours, Zesheng
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Wishes!
>>>>>>>
>>>>>>> Yours, Zesheng
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Wishes!
>>>>>
>>>>> Yours, Zesheng
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Wishes!
>>>
>>> Yours, Zesheng
>>>
>>
>>
>>
>> --
>> Best Wishes!
>>
>> Yours, Zesheng
>>
>
>
