Mmm, it seems that the Facebook branch https://github.com/facebook/hadoop-20/
<https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java>
has implemented Reed-Solomon codes. What I was checking earlier were the
following two issues:
https://issues.apache.org/jira/browse/HDFS-503
https://issues.apache.org/jira/browse/HDFS-600
Thanks again Bertrand, I will check through the Facebook branch to find more
information.

2014-07-22 9:31 GMT+08:00 Zesheng Wu <wuzeshen...@gmail.com>:

> Thanks Bertrand, I've checked this information earlier. There's only an XOR
> implementation, and missing blocks are reconstructed by creating new files.
>
> 2014-07-22 3:47 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>
>> And there is actually quite a lot of information about it.
>>
>> https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java
>> http://wiki.apache.org/hadoop/HDFS-RAID
>> https://issues.apache.org/jira/browse/MAPREDUCE/component/12313416/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
>>
>> Bertrand Dechoux
>>
>> On Mon, Jul 21, 2014 at 3:46 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>
>>> I wrote my answer thinking about the XOR implementation. With
>>> Reed-Solomon and single replication, the cases that need to be
>>> considered are indeed smaller and simpler.
>>>
>>> It seems I was wrong about my last statement though. If the machine
>>> hosting a single-replicated block is lost, it is unlikely that the
>>> block can be reconstructed from a summary of the data. But with a
>>> RAID6 strategy / RS, it is indeed possible, though of course only up
>>> to a certain number of lost blocks.
>>>
>>> There clearly is a case for such a tool on a Hadoop cluster.
>>>
>>> Bertrand Dechoux
>>>
>>> On Mon, Jul 21, 2014 at 2:35 PM, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>
>>>> If a block is corrupted but hasn't been detected by HDFS, you could
>>>> delete the block from the local filesystem (it's only a file) and
>>>> HDFS will replicate the remaining good replica of this block.
>>>>
>>>> We only have one replica for each block, so if a block is corrupted,
>>>> HDFS cannot replicate it.
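[Editorial note: Bertrand's distinction between the XOR implementation and
Reed-Solomon can be made concrete. A single XOR parity block per stripe lets
you rebuild exactly one missing block, and no more. A minimal sketch in plain
Python; this is illustrative only, not the HDFS-RAID code:]

```python
# Minimal XOR-parity sketch: one parity block per stripe of data blocks.
# With XOR, any SINGLE missing block can be rebuilt; two losses cannot.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR of two equal-length blocks.
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(blocks):
    # Parity is the XOR of all data blocks in the stripe.
    parity = blocks[0]
    for blk in blocks[1:]:
        parity = xor_blocks(parity, blk)
    return parity

def reconstruct(surviving_blocks, parity):
    # The one missing block is the XOR of the parity and all survivors.
    missing = parity
    for blk in surviving_blocks:
        missing = xor_blocks(missing, blk)
    return missing

stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(stripe)
# Lose block 1; rebuild it from the other blocks plus parity.
rebuilt = reconstruct([stripe[0], stripe[2]], parity)
assert rebuilt == stripe[1]
```

[Recovering two lost blocks from one parity block is impossible with XOR,
which is why Reed-Solomon, with multiple parity blocks computed over a Galois
field, is needed for higher failure tolerance.]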
>>>>
>>>> 2014-07-21 20:30 GMT+08:00 Zesheng Wu <wuzeshen...@gmail.com>:
>>>>
>>>>> Thanks Bertrand, my reply comments are inline below.
>>>>>
>>>>> So you know that a block is corrupted thanks to an external process
>>>>> which in this case is checking the parity blocks. If a block is
>>>>> corrupted but hasn't been detected by HDFS, you could delete the
>>>>> block from the local filesystem (it's only a file) and HDFS will
>>>>> replicate the remaining good replica of this block.
>>>>> *[Zesheng: We will implement a periodic checking mechanism to detect
>>>>> corrupted blocks.]*
>>>>>
>>>>> For performance reasons (and that's what you want to do?), you might
>>>>> be able to fix the corruption without needing to retrieve the good
>>>>> replica. It might be possible by working directly with the local
>>>>> filesystem, replacing the corrupted block with the corrected block
>>>>> (which, again, are files). One issue is that the corrected block
>>>>> might differ from the good replica. If HDFS is able to tell (with
>>>>> CRC) it might be fine; otherwise you will end up with two different
>>>>> "good" replicas for the same block, and that will not be pretty...
>>>>> *[Zesheng: Indeed, we will use Reed-Solomon erasure codes to
>>>>> implement the HDFS RAID, so a corrupted block will be recovered from
>>>>> the remaining good data and coding blocks.]*
>>>>>
>>>>> If the result is to be open source, you might want to check with
>>>>> Facebook about their implementation and track the process within the
>>>>> Apache JIRA. You could gain additional feedback. One downside of
>>>>> HDFS RAID is that the fewer replicas there are, the less
>>>>> efficient/fast reads of the data for processing will be. Reducing
>>>>> the number of replicas also diminishes the number of supported node
>>>>> failures. I wouldn't say it's an easy ride.
>>>>> *[Zesheng: Yes, I agree with you that read performance degrades,
>>>>> but not about the number of supported node failures; the
>>>>> Reed-Solomon algorithm can maintain equal or even higher tolerance
>>>>> of node failures. About open source, we will consider this in the
>>>>> future.]*
>>>>>
>>>>> 2014-07-21 20:01 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>>>>>
>>>>>> So you know that a block is corrupted thanks to an external process
>>>>>> which in this case is checking the parity blocks. If a block is
>>>>>> corrupted but hasn't been detected by HDFS, you could delete the
>>>>>> block from the local filesystem (it's only a file) and HDFS will
>>>>>> replicate the remaining good replica of this block.
>>>>>>
>>>>>> For performance reasons (and that's what you want to do?), you
>>>>>> might be able to fix the corruption without needing to retrieve the
>>>>>> good replica. It might be possible by working directly with the
>>>>>> local filesystem, replacing the corrupted block with the corrected
>>>>>> block (which, again, are files). One issue is that the corrected
>>>>>> block might differ from the good replica. If HDFS is able to tell
>>>>>> (with CRC) it might be fine; otherwise you will end up with two
>>>>>> different "good" replicas for the same block, and that will not be
>>>>>> pretty...
>>>>>>
>>>>>> If the result is to be open source, you might want to check with
>>>>>> Facebook about their implementation and track the process within
>>>>>> the Apache JIRA. You could gain additional feedback. One downside
>>>>>> of HDFS RAID is that the fewer replicas there are, the less
>>>>>> efficient/fast reads of the data for processing will be. Reducing
>>>>>> the number of replicas also diminishes the number of supported node
>>>>>> failures. I wouldn't say it's an easy ride.
>>>>>>
>>>>>> Bertrand Dechoux
>>>>>>
>>>>>> On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>>>>
>>>>>>> We want to implement a RAID on top of HDFS, something like Facebook
>>>>>>> implemented as described in:
>>>>>>> https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/
>>>>>>>
>>>>>>> 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux <decho...@gmail.com>:
>>>>>>>
>>>>>>>> You want to implement a RAID on top of HDFS, or use HDFS on top
>>>>>>>> of RAID? I am not sure I understand either of these use cases.
>>>>>>>> HDFS handles replication and error detection for you. Wouldn't
>>>>>>>> fine-tuning the cluster be the easier solution?
>>>>>>>>
>>>>>>>> Bertrand Dechoux
>>>>>>>>
>>>>>>>> On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the reply, Arpit.
>>>>>>>>> Yes, we need to do this regularly. The original requirement is
>>>>>>>>> that we want to do RAID (based on Reed-Solomon erasure codes) on
>>>>>>>>> our HDFS cluster. When a block is corrupted or missing, a
>>>>>>>>> degraded read needs quick recovery of the block. We are
>>>>>>>>> considering how to recover the corrupted/missing block quickly.
>>>>>>>>>
>>>>>>>>> 2014-07-19 5:18 GMT+08:00 Arpit Agarwal <aagar...@hortonworks.com>:
>>>>>>>>>
>>>>>>>>>> IMHO this is a spectacularly bad idea. Is it a one-off event?
>>>>>>>>>> Why not just take the perf hit and recreate the file?
>>>>>>>>>>
>>>>>>>>>> If you need to do this regularly you should consider a mutable
>>>>>>>>>> file store like HBase. If you start modifying blocks from under
>>>>>>>>>> HDFS you open up all sorts of consistency issues.
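[Editorial note: the trade-off being weighed here (replication factor vs.
erasure coding) comes down to simple arithmetic. The sketch below compares
n-way replication with a (k, m) Reed-Solomon layout; RS(10, 4) is the stripe
shape described in Facebook's HDFS-RAID post, used here as an illustrative
assumption:]

```python
# Storage overhead vs. failure tolerance: n-way replication compared
# with a (k data, m parity) Reed-Solomon stripe.

def replication(n: int):
    # n full copies: n-times storage, survives the loss of n - 1 copies.
    return {"overhead": float(n), "tolerated_failures": n - 1}

def reed_solomon(k: int, m: int):
    # k data blocks + m parity blocks: any k of the k + m blocks
    # suffice to decode, so up to m losses per stripe are survivable.
    return {"overhead": (k + m) / k, "tolerated_failures": m}

assert replication(3) == {"overhead": 3.0, "tolerated_failures": 2}
assert reed_solomon(10, 4) == {"overhead": 1.4, "tolerated_failures": 4}
```

[So an RS(10, 4) stripe stores 1.4x the raw data while tolerating 4 lost
blocks, versus 3x storage and 2 lost copies for triple replication; this is
the basis of Zesheng's later point that Reed-Solomon can keep equal or higher
failure tolerance.]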
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo <gsmst...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> That will break the consistency of the file system, but it
>>>>>>>>>>> doesn't hurt to try.
>>>>>>>>>>> On Jul 17, 2014 8:48 PM, "Zesheng Wu" <wuzeshen...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> How about writing a new block with a new checksum file, and
>>>>>>>>>>>> replacing both the old block file and its checksum file?
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil <wellington.chevre...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> there's no way to do that, as HDFS does not provide file
>>>>>>>>>>>>> update features. You'll need to write a new file with the
>>>>>>>>>>>>> changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Notice that even if you manage to find the physical block
>>>>>>>>>>>>> replica files on the disk corresponding to the part of the
>>>>>>>>>>>>> file you want to change, you can't simply update them
>>>>>>>>>>>>> manually, as this would give a different checksum, making
>>>>>>>>>>>>> HDFS mark such blocks as corrupt.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Wellington.
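[Editorial note: Wellington's point about checksums can be demonstrated with
a toy model. HDFS stores a CRC checksum per fixed-size chunk of each block
(in a .meta file alongside the block file), so an in-place edit is caught on
read. A simplified sketch using CRC32; the chunk size and in-memory layout
here are illustrative, not the real on-disk format:]

```python
import zlib

CHUNK = 512  # HDFS checksums block data in fixed-size chunks (512 B default)

def chunk_checksums(block: bytes):
    # One CRC32 per chunk, loosely mirroring the per-block .meta file.
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

def verify(block: bytes, checksums) -> bool:
    # A block scanner recomputes the CRCs and compares with stored ones.
    return chunk_checksums(block) == checksums

block = bytes(1024)             # a healthy two-chunk "block"
meta = chunk_checksums(block)   # recorded when the block was written

assert verify(block, meta)      # untouched block passes verification

tampered = block[:100] + b"\x01" + block[101:]  # edit one byte in place
assert not verify(tampered, meta)  # scanner now flags the block as corrupt
```

[This is why Zesheng's proposal above has to replace the checksum file
together with the block file: a new block body with the old .meta will always
fail verification.]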
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 17 Jul 2014, at 10:50, Zesheng Wu <wuzeshen...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > Hi guys,
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I recently encountered a scenario in which I need to
>>>>>>>>>>>>> replace an existing block with a newly written block.
>>>>>>>>>>>>> > The most straightforward way to do this might be:
>>>>>>>>>>>>> > Suppose the original file is A. We write a new file B,
>>>>>>>>>>>>> which is composed of the new data blocks, then we merge A
>>>>>>>>>>>>> and B into C, which is the file we want.
>>>>>>>>>>>>> > The obvious shortcoming of this method is the waste of
>>>>>>>>>>>>> network bandwidth.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I'm wondering whether there is a way to replace the old
>>>>>>>>>>>>> block with the new block directly.
>>>>>>>>>>>>> > Any thoughts?
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > --
>>>>>>>>>>>>> > Best Wishes!
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Yours, Zesheng
--
Best Wishes!

Yours, Zesheng