Support multiple block placement policies
Hi there, According to the code, the current implementation of HDFS supports only one block placement policy at a time, which is BlockPlacementPolicyDefault by default. The default policy is sufficient for most circumstances, but it does not work well in some special cases. For example, on a shared cluster we may want to erasure-code all the files under certain directories, so the files under those directories need a new placement policy, while all other files keep using the default one. This means HDFS needs to support multiple placement policies. One plain thought is this: the default placement policy stays configured as the default, and HDFS additionally lets users specify a customized placement policy through extended attributes (xattrs). When HDFS chooses replica targets, it first checks for a customized placement policy; if none is specified, it falls back to the default one. Any thoughts? -- Best Wishes! Yours, Zesheng
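A minimal client-side sketch of the xattr idea. The attribute name user.block.placement.policy and the NameNode-side dispatching policy that would read it are assumptions made up for illustration; only FileSystem.setXAttr/getXAttr are existing APIs:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TagPlacementPolicy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dir = new Path("/data/erasure-coded"); // hypothetical directory

    // Tag the directory with the desired policy; "user.block.placement.policy"
    // is a made-up xattr name, not an existing HDFS attribute.
    fs.setXAttr(dir, "user.block.placement.policy",
        "erasure".getBytes(StandardCharsets.UTF_8));

    // A dispatching policy on the NameNode side would read this attribute when
    // choosing replica targets and fall back to the default policy if it is absent.
    byte[] value = fs.getXAttr(dir, "user.block.placement.policy");
    System.out.println("policy = " + new String(value, StandardCharsets.UTF_8));
  }
}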
Re: HDFS: Couldn't obtain the locations of the last block
Thanks Yi, I will look into HDFS-4516. 2014-09-10 15:03 GMT+08:00 Liu, Yi A yi.a@intel.com: Hi Zesheng, I learned from an offline email of yours that your Hadoop version is 2.0.0-alpha and that "The block is allocated successfully in the NN, but isn't created in the DN". Yes, we may have this issue in 2.0.0-alpha, and I suspect your issue is similar to HDFS-4516. Can you try Hadoop 2.4 or later? You should not be able to reproduce it on those versions. From your description, the second block is allocated successfully and the NN flushes the edit log entry to the shared journal; the shared storage may persist the entry, but the RPC acknowledging it back to the NN may time out. So the block exists in the shared edit log, but the DN never creates it. On restart, the client can fail, because in that Hadoop version the client retries only when the NN reports the last block size as non-zero, i.e. when it was synced (see HDFS-4516 for more details). Regards, Yi Liu -- Best Wishes! Yours, Zesheng
Re: HDFS: Couldn't obtain the locations of the last block
Hi Yi, I went through HDFS-4516, and it really solves our problem, thanks very much! -- Best Wishes! Yours, Zesheng
HDFS: Couldn't obtain the locations of the last block
Hi, We recently encountered a critical bug in HDFS that can prevent HBase from starting normally. The scenario is as follows:
1. rs1 writes data to HDFS file f1, and the first block is written successfully.
2. rs1 successfully allocates the second block; at this moment nn1 (the active NN) crashes due to a journal write timeout.
3. nn2 (the standby NN) does not become active because zkfc2 is in an abnormal state.
4. nn1 is restarted and becomes active.
5. While nn1 is restarting, rs1 crashes because it is writing to a NameNode (nn1) that is still in safe mode.
6. As a result, file f1 is left in an abnormal state and the HBase cluster can no longer serve requests.
We can list the file with the command-line shell:
-rw--- 3 hbase_srv supergroup 134217728 2014-09-05 11:32 /hbase/lgsrv-push/xxx
But when we try to download the file from HDFS, the DFS client complains:
14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
get: Could not obtain the last block locations.
Can anyone help with this? -- Best Wishes! Yours, Zesheng
Re: Replace a block with a new one
We want to implement RAID on top of HDFS, something like what Facebook implemented, as described in: https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/ 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux decho...@gmail.com: You want to implement RAID on top of HDFS, or use HDFS on top of RAID? I am not sure I understand either of these use cases. HDFS handles replication and error detection for you. Wouldn't fine-tuning the cluster be the easier solution? Bertrand Dechoux -- Best Wishes! Yours, Zesheng
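For context on how such a RAID layer recovers data: with a simple XOR parity scheme (the scheme used by the original HDFS-RAID contrib module; Reed-Solomon generalizes it to tolerate more simultaneous losses), a block lost from a stripe can be rebuilt from the surviving blocks plus the parity block. A minimal standalone sketch of that arithmetic, independent of any HDFS API:

public class XorParityDemo {
  // XOR the given equal-length blocks together. With parity = b0 ^ b1 ^ ... ^ bn,
  // any single missing block equals the XOR of the parity and the surviving blocks.
  static byte[] xor(byte[]... blocks) {
    byte[] out = new byte[blocks[0].length];
    for (byte[] b : blocks) {
      for (int i = 0; i < out.length; i++) {
        out[i] ^= b[i];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] b0 = {1, 2, 3, 4};
    byte[] b1 = {5, 6, 7, 8};
    byte[] b2 = {9, 10, 11, 12};

    byte[] parity = xor(b0, b1, b2);        // stored as the parity block
    byte[] rebuiltB1 = xor(b0, b2, parity); // reconstruct b1 after losing it

    System.out.println(java.util.Arrays.equals(b1, rebuiltB1)); // prints true
  }
}

Reed-Solomon replaces the single XOR parity with several coding blocks computed over a finite field, which is what allows more than one lost block per stripe to be recovered.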
Re: Replace a block with a new one
Thanks Bertrand, my replies are inline below. So you know that a block is corrupted thanks to an external process which in this case is checking the parity blocks. If a block is corrupted but hasn't been detected by HDFS, you could delete the block from the local filesystem (it's only a file) and then HDFS will replicate the good remaining replica of this block. [Zesheng: We will implement a periodic checking mechanism to detect corrupted blocks.] For performance reasons (and that's what you want to do?), you might be able to fix the corruption without needing to retrieve the good replica. It might be possible by working directly with the local filesystem, replacing the corrupted block with the corrected block (which, again, are files). One issue is that the corrected block might be different from the good replica. If HDFS is able to tell (with the CRC) it might be fine; otherwise you will end up with two different "good" replicas for the same block, and that will not be pretty... [Zesheng: Indeed, we will use Reed-Solomon erasure codes to implement HDFS RAID, so a corrupted block will be recovered from the remaining good data and coding blocks.] If the result is to be open source, you might want to check with Facebook about their implementation and track the process within the Apache JIRA. You could gain additional feedback. One downside of HDFS RAID is that the fewer replicas there are, the less efficient/fast reads of the data for processing will be. Reducing the number of replicas also diminishes the number of supported node failures. I wouldn't say it's an easy ride. [Zesheng: Yes, I agree about the read performance degradation, but not about the number of supported node failures: the Reed-Solomon algorithm can maintain equal or even higher tolerance to node failures. About open source, we will consider it in the future.]
Re: Replace a block with a new one
If a block is corrupted but hasn't been detected by HDFS, you could delete the block from the local filesystem (it's only a file) and then HDFS will replicate the good remaining replica of this block. We have only one replica for each block, so if a block is corrupted, HDFS cannot re-replicate it.
Re: Replace a block with a new one
Thanks Bertrand, I had checked this information earlier. There is only an XOR implementation, and missing blocks are reconstructed by creating new files. 2014-07-22 3:47 GMT+08:00 Bertrand Dechoux decho...@gmail.com: And there is actually quite a lot of information about it. https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java http://wiki.apache.org/hadoop/HDFS-RAID https://issues.apache.org/jira/browse/MAPREDUCE/component/12313416/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel Bertrand Dechoux On Mon, Jul 21, 2014 at 3:46 PM, Bertrand Dechoux decho...@gmail.com wrote: I wrote my answer thinking about the XOR implementation. With Reed-Solomon and single replication, the cases that need to be considered are indeed smaller and simpler. It seems I was wrong about my last statement, though. If the machine hosting a single-replicated block is lost, it isn't likely that the block can't be reconstructed from a summary of the data. But with a RAID6 strategy / RS, it is indeed possible, up to a certain number of blocks of course. There clearly is a case for such a tool on a Hadoop cluster. Bertrand Dechoux
Re: Replace a block with a new one
Mmm, it seems that the Facebook branch https://github.com/facebook/hadoop-20/ https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java has implemented Reed-Solomon codes; what I was checking earlier were the following two issues: https://issues.apache.org/jira/browse/HDFS-503 https://issues.apache.org/jira/browse/HDFS-600 Thanks again Bertrand, I will look through the Facebook branch for more information.
Re: Replace a block with a new one
Thanks for the reply, Arpit. Yes, we need to do this regularly. The original requirement is that we want to do RAID (based on Reed-Solomon erasure codes) on our HDFS cluster. When a block is corrupted or missing, the degraded read needs quick recovery of that block, so we are considering how to recover a corrupted/missing block quickly. 2014-07-19 5:18 GMT+08:00 Arpit Agarwal aagar...@hortonworks.com: IMHO this is a spectacularly bad idea. Is it a one-off event? Why not just take the perf hit and recreate the file? If you need to do this regularly you should consider a mutable file store like HBase. If you start modifying blocks from under HDFS you open up all sorts of consistency issues. On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo gsmst...@gmail.com wrote: That will break the consistency of the file system, but it doesn't hurt to try. -- Best Wishes! Yours, Zesheng
Replace a block with a new one
Hi guys, I recently encountered a scenario that requires replacing an existing block with a newly written block. The most straightforward way to do this may be the following: suppose the original file is A; we write a new file B composed of the new data blocks, and then we merge A and B into C, which is the file we want. The obvious shortcoming of this method is the waste of network bandwidth. I'm wondering whether there is a way to replace the old block with the new block directly. Any thoughts? -- Best Wishes! Yours, Zesheng
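For reference, the straightforward (bandwidth-hungry) path described above can be written against the public FileSystem API. This is only a hedged sketch with hypothetical paths, and it simply concatenates A and B; selecting exactly which byte ranges to take from each file is left out:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path a = new Path("/data/A"); // original file (hypothetical paths)
    Path b = new Path("/data/B"); // file holding the replacement blocks
    Path c = new Path("/data/C"); // merged result

    try (FSDataOutputStream out = fs.create(c, true)) {
      // Copy A and then B into C; every byte crosses the network again,
      // which is exactly the overhead the question wants to avoid.
      for (Path src : new Path[] {a, b}) {
        try (FSDataInputStream in = fs.open(src)) {
          IOUtils.copyBytes(in, out, conf, false);
        }
      }
    }
    // Optionally replace A with the merged file.
    fs.delete(a, false);
    fs.rename(c, a);
  }
}

If the files happen to satisfy its block-size restrictions, DistributedFileSystem.concat can stitch B onto A on the NameNode side without re-copying data, but it only appends whole files; it cannot swap a block in the middle of a file.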
Re: Replace a block with a new one
How about writing a new block with a new checksum file, and replacing both the old block file and the old checksum file? 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil wellington.chevre...@gmail.com: Hi, there's no way to do that, as HDFS does not provide file update features. You'll need to write a new file with the changes. Note that even if you manage to find the physical block replica files on disk corresponding to the part of the file you want to change, you can't simply update them manually, as this would give a different checksum, making HDFS mark such blocks as corrupt. Regards, Wellington. -- Best Wishes! Yours, Zesheng
Re: HDFS File Writes Reads
1. HDFS doesn't allow parallel writes to the same file.
2. HDFS uses a pipeline to write the replicas, so a write doesn't take three times longer than a traditional file write.
3. HDFS allows parallel reads.
2014-06-17 19:17 GMT+08:00 Vijaya Narayana Reddy Bhoomi Reddy vijay.bhoomire...@gmail.com: Hi, I have a basic question regarding file writes and reads in HDFS. Is the file write and read process a sequential activity, or is it executed in parallel? For example, let's assume that there is a file File1 which consists of three blocks B1, B2 and B3. 1. Will the write process write B2 only after B1 is complete and B3 only after B2 is complete, or, for a large file with many blocks, can this happen in parallel? In all the Hadoop documentation, I read this to be a sequential operation. Does that mean that for a file of 1 TB it takes three times more time than a traditional file write (due to the default replication factor of 3)? 2. Is it similar in the case of reads? Kindly someone please provide some clarity on this... Regards Vijay -- Best Wishes! Yours, Zesheng
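To illustrate point 3: one client (or several) can issue positioned reads against different regions of the same file, and hence against different blocks, at the same time. A minimal sketch, assuming a hypothetical /data/file1 and a 128 MB block size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/file1"); // hypothetical file
    long blockSize = 128L * 1024 * 1024; // assumed block size

    try (FSDataInputStream in = fs.open(file)) {
      Thread[] readers = new Thread[3];
      for (int i = 0; i < readers.length; i++) {
        final long offset = i * blockSize; // start of block i
        readers[i] = new Thread(() -> {
          byte[] buf = new byte[4096];
          try {
            // Positioned read: does not move the stream's own file pointer,
            // so several threads can read different blocks concurrently.
            in.read(offset, buf, 0, buf.length);
          } catch (Exception e) {
            e.printStackTrace();
          }
        });
        readers[i].start();
      }
      for (Thread t : readers) {
        t.join();
      }
    }
  }
}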
Re: Programmatic Kerberos login with password to a secure cluster
Perhaps you can use LDAP (or any other suitable mechanism) to authenticate users on the web server, and then let the web server act as an authenticated proxy user that performs requests on behalf of the real users. 2014-06-17 4:11 GMT+08:00 Geoff Thompson ge...@bearpeak.com: Greetings, We are developing a YARN application where the client executes on a machine that is external to a secure cluster. I have been able to successfully do a Kerberos login by manually running the kinit command on the external machine and then starting the client. However, our goal is to not require the user to run kinit. I have been able to programmatically log in using a keytab file via the loginUserFromKeytab method of class org.apache.hadoop.security.UserGroupInformation. This is very useful. However, we also want to see if we can avoid requiring a keytab file and instead allow the user to enter a password into the UI for our YARN client. Essentially I would like to write a "loginUserWithPassword" method. I can see that this would require creating a javax.security.auth.login.LoginContext with my own callback handler. Reading the UserGroupInformation source code, I see that a LoginContext needs to be built with a "HadoopConfiguration", which is a private static class inside UserGroupInformation. This class is too difficult to duplicate in my own code since it has too many dependencies on other private details in class UserGroupInformation, plus dependencies on other non-public classes in the org.apache.hadoop.security package. Does anyone know how I could do a programmatic Kerberos login with a password? Or perhaps access a HadoopConfiguration? Thanks, Geoff -- Best Wishes! Yours, Zesheng
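A minimal sketch of the proxy-user idea: the web server authenticates the end user by its own means (LDAP, a password form, and so on), logs in to Hadoop once with its own keytab, and then impersonates that user. The principal, keytab path and user name below are made up, and the cluster must whitelist the proxy through the hadoop.proxyuser.<name>.hosts and hadoop.proxyuser.<name>.groups properties in core-site.xml:

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    UserGroupInformation.setConfiguration(conf);

    // The web server's own Kerberos identity (principal and keytab path are hypothetical).
    UserGroupInformation service = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
        "websrv/host.example.com@EXAMPLE.COM", "/etc/security/keytabs/websrv.keytab");

    // Impersonate the end user who authenticated against the web UI.
    UserGroupInformation proxy = UserGroupInformation.createProxyUser("alice", service);

    proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
      // All Hadoop calls inside doAs run as "alice" (subject to the proxy-user whitelist).
      FileSystem fs = FileSystem.get(conf);
      System.out.println(fs.exists(new Path("/user/alice")));
      return null;
    });
  }
}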
Re: Upgrade to 2.4
Hi Ian, -rollingUpgrade has been available since Hadoop 2.4, so you can't use -rollingUpgrade to upgrade a 2.3 cluster to 2.4. Because HDFS 2.4 introduces a protobuf-formatted fsimage, you must stop the whole cluster and then upgrade it. I upgraded a 2.0 HDFS cluster to 2.4 several days ago; here are some details, which I hope will help:
1. Put the active NN into safemode: dfsadmin -safemode enter
2. Perform a saveNamespace operation: dfsadmin -saveNamespace
3. For each component you are using, back up configuration data, databases, and other important files
4. Shut down the Hadoop services across your entire cluster
5. Check each host to make sure that there are no hdfs/yarn processes still running
6. Back up your HDFS metadata
7. Upgrade and start the JournalNodes and ZKFCs
8. Make sure that all the JournalNodes and ZKFCs have started normally
9. Execute the upgrade operation on the active NN: namenode -upgrade (important)
10. Wait and confirm from the active NN's log that the upgrade has completed
11. Bootstrap the standby NN: namenode -bootstrapStandby
12. Start the standby NN: namenode start
13. Start each of the DNs: datanode start
14. Make sure that everything is running smoothly; this could take days or even weeks
15. Finalize the HDFS metadata upgrade: dfsadmin -finalizeUpgrade
2014-05-30 17:54 GMT+08:00 Ian Brooks i.bro...@sensewhere.com: Hi, I thought I'd give Hadoop 2.4 a try, as the improvements to rolling upgrades would be useful to us. Unless I'm missing something, I can't see how to upgrade from 2.3 to 2.4: -rollingUpgrade doesn't work, as the old namenode doesn't understand the command, and -upgrade is no longer available. I had a quick google and can't find any documentation on this upgrade; does anyone know how it's supposed to be done? -Ian -- Best Wishes! Yours, Zesheng
Re: HDFS undo Overwriting
I am afraid this cannot be undone; in HDFS, only data that was deleted by the DFS client and went into the trash can be restored. 2014-05-30 18:18 GMT+08:00 Amjad ALSHABANI ashshab...@gmail.com: Hello Everybody, I've made a mistake when writing to HDFS. I created a new database in Hive, giving a location on HDFS, but I found that it removed all the other data that already existed there. Before creation, the directory on HDFS contained:
pns@app11:~$ hadoop fs -ls /user/hive/warehouse
Found 25 items
drwxr-xr-x - user1 supergroup 0 2013-11-20 13:40 /user/hive/warehouse/dfy_ans_autres
drwxr-xr-x - user1 supergroup 0 2013-11-20 13:40 /user/hive/warehouse/dfy_ans_maillog
drwxr-xr-x - user1 supergroup 0 2013-11-20 14:28 /user/hive/warehouse/dfy_cnx
drwxr-xr-x - user2 supergroup 0 2014-05-30 06:05 /user/hive/warehouse/pns.db
drwxr-xr-x - user2 supergroup 0 2014-02-24 17:00 /user/hive/warehouse/pns_fr_integ
drwxr-xr-x - user2 supergroup 0 2014-05-06 15:33 /user/hive/warehouse/pns_logstat.db
... ... ...
hive -e CREATE DATABASE my_stats LOCATION 'hdfs://:9000/user/hive/warehouse/mystats.db'
but now I couldn't see the other directories on HDFS:
pns@app11:~/aalshabani$ hls /user/hive/warehouse
Found 1 items
drwxr-xr-x - user2 supergroup 0 2014-05-30 11:37 /user/hive/warehouse/mystats.db
Is there any way I could restore the other directories?? Best regards. -- Best Wishes! Yours, Zesheng
Re: issue about how to decommission a datanode from hadoop cluster
I think you just need to set an exclude file on the NN; that's enough. 2014-05-30 14:09 GMT+08:00 ch huang justlo...@gmail.com: Hi, mailing list: I use a CDH 4.4 YARN + HDFS cluster and I want to decommission a datanode. Should I modify hdfs-site.xml and mapred-site.xml on every node in the cluster to exclude the node, or do I just need to set hdfs-site.xml and mapred-site.xml on the NN? -- Best Wishes! Yours, Zesheng
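Concretely, the usual HDFS-side procedure is (a hedged sketch; the exclude-file path below is made up): point dfs.hosts.exclude in the NameNode's hdfs-site.xml at a local file, add the DataNode's hostname to that file, and then run hdfs dfsadmin -refreshNodes on the NameNode. The node is reported as decommissioning until its blocks have been re-replicated, after which it can be shut down.

<!-- hdfs-site.xml on the NameNode; /etc/hadoop/conf/dfs.exclude is an assumed path -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>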
Re:
Of course you can; you can think of this as an independent, runnable program. 2014/1/11 Andrea Barbato and.barb...@gmail.com Hi, I have a simple question. I have this example code:

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper(HadoopPipes::TaskContext& context) { }
  // map function: receives a line, outputs (word,1) to reducer.
  void map(HadoopPipes::MapContext& context) { ... }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer(HadoopPipes::TaskContext& context) { }
  // reduce function
  void reduce(HadoopPipes::ReduceContext& context) { ... }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}

Can I write some code lines (like the I/O operations) in the main function body? Thanks in advance. -- Best Wishes! Yours, Zesheng