Support multiple block placement policies

2014-09-15 Thread Zesheng Wu
Hi there,

According to the code, the current implementation of HDFS supports only one
specific block placement policy for the whole namespace, which is
BlockPlacementPolicyDefault by default.
The default policy is sufficient for most circumstances, but it does not
work well in some special cases.

For example, on a shared cluster we may want to erasure-code all the files
under certain specified directories, so the files under these directories
need to use a new placement policy, while all other files still use the
default placement policy. This requires HDFS to support multiple placement
policies.

One straightforward idea is to keep the default placement policy configured
as the default, and additionally let users specify a customized placement
policy through extended attributes (xattrs). When HDFS chooses replica
targets, it first checks for a customized placement policy; if none is
specified, it falls back to the default one. A rough sketch of this
fallback follows.
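A minimal sketch of that resolution logic, assuming a hypothetical xattr key
(user.block.placement.policy) and a hypothetical resolver class; neither
exists in HDFS today, and the reflective instantiation is simplified
compared to how the NameNode really constructs policies:

  import java.nio.charset.StandardCharsets;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy;

  /** Hypothetical resolver: per-directory policy from an xattr, else the default. */
  public class PlacementPolicyResolver {
    /** Hypothetical xattr key that would hold the policy class name for a directory. */
    private static final String POLICY_XATTR = "user.block.placement.policy";

    private final BlockPlacementPolicy defaultPolicy;
    private final Map<String, BlockPlacementPolicy> cache = new HashMap<>();

    public PlacementPolicyResolver(BlockPlacementPolicy defaultPolicy) {
      this.defaultPolicy = defaultPolicy;
    }

    /** Returns the customized policy named in the directory's xattrs, or the default. */
    public synchronized BlockPlacementPolicy resolve(Map<String, byte[]> dirXAttrs) {
      byte[] value = dirXAttrs.get(POLICY_XATTR);
      if (value == null) {
        return defaultPolicy;            // no xattr set: fall back to the default policy
      }
      String className = new String(value, StandardCharsets.UTF_8);
      return cache.computeIfAbsent(className, this::instantiate);
    }

    private BlockPlacementPolicy instantiate(String className) {
      try {
        // Simplified: real code would use ReflectionUtils and pass conf/cluster state.
        return (BlockPlacementPolicy) Class.forName(className)
            .getDeclaredConstructor().newInstance();
      } catch (ReflectiveOperationException e) {
        return defaultPolicy;            // unknown or broken class: fall back, don't fail
      }
    }
  }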

Any thoughts?

-- 
Best Wishes!

Yours, Zesheng


Re: HDFS: Couldn't obtain the locations of the last block

2014-09-10 Thread Zesheng Wu
Thanks Yi, I will look into HDFS-4516.


2014-09-10 15:03 GMT+08:00 Liu, Yi A yi.a@intel.com:

  Hi Zesheng,



 I learned from an offline email of yours that your Hadoop version is
 2.0.0-alpha, and you also said “The block is allocated successfully in NN,
 but isn’t created in DN”.

 Yes, we may have this issue in 2.0.0-alpha. I suspect your issue is
 similar to HDFS-4516. Can you try Hadoop 2.4 or later? You should not be
 able to reproduce it on those versions.



 From your description, the second block is allocated successfully and the
 NN flushes the edit log entry to the shared journal; the shared storage may
 persist the entry, but the RPC back to the NN can time out before success
 is reported. So the block exists in the shared edit log, but the DN never
 creates it. On restart, the client can fail, because in that Hadoop version
 the client retries only when the last block size reported by the NN is
 non-zero, i.e. when it was synced (see HDFS-4516 for more details).



 Regards,

 Yi Liu



 *From:* Zesheng Wu [mailto:wuzeshen...@gmail.com]
 *Sent:* Tuesday, September 09, 2014 6:16 PM
 *To:* user@hadoop.apache.org
 *Subject:* HDFS: Couldn't obtain the locations of the last block



 Hi,



 These days we encountered a critical bug in HDFS which can prevent HBase
 from starting normally.

 The scenario is as follows:

 1. rs1 writes data to HDFS file f1, and the first block is written
 successfully

 2. rs1 successfully applies to allocate the second block; at this moment,
 nn1 (the active NN) crashes due to a journal write timeout

 3. nn2 (the standby NN) doesn't become active because zkfc2 is in an
 abnormal state

 4. nn1 is restarted and becomes active

 5. While nn1 is restarting, rs1 crashes because it writes to nn1 while nn1
 is still in safemode

 6. As a result, the file f1 is in an abnormal state and the HBase cluster
 can't serve any more



 We can use the command-line shell to list the file, which looks like the
 following:

 -rw---   3 hbase_srv supergroup  134217728 2014-09-05 11:32 
 /hbase/lgsrv-push/xxx

  But when we try to download the file from hdfs, the dfs client complains:

 14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. 
 Datanodes might not have reported blocks completely. Will retry for 3 times

 14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. 
 Datanodes might not have reported blocks completely. Will retry for 2 times

 14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. 
 Datanodes might not have reported blocks completely. Will retry for 1 times

 get: Could not obtain the last block locations.

 Anyone can help on this?

  --
 Best Wishes!

 Yours, Zesheng




-- 
Best Wishes!

Yours, Zesheng


Re: HDFS: Couldn't obtain the locations of the last block

2014-09-10 Thread Zesheng Wu
Hi Yi,

I went through HDFS-4516, and it really solves our problem, thanks very
much!




-- 
Best Wishes!

Yours, Zesheng


HDFS: Couldn't obtain the locations of the last block

2014-09-09 Thread Zesheng Wu
Hi,

These days we encountered a critical bug in HDFS which can prevent HBase
from starting normally.
The scenario is as follows:
1. rs1 writes data to HDFS file f1, and the first block is written
successfully
2. rs1 successfully applies to allocate the second block; at this moment,
nn1 (the active NN) crashes due to a journal write timeout
3. nn2 (the standby NN) doesn't become active because zkfc2 is in an
abnormal state
4. nn1 is restarted and becomes active
5. While nn1 is restarting, rs1 crashes because it writes to nn1 while nn1
is still in safemode
6. As a result, the file f1 is in an abnormal state and the HBase cluster
can't serve any more

We can use the command-line shell to list the file, which looks like the
following:

-rw---   3 hbase_srv supergroup  134217728 2014-09-05 11:32
/hbase/lgsrv-push/xxx

But when we try to download the file from hdfs, the dfs client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not
available. Datanodes might not have reported blocks completely. Will
retry for 3 times
14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not
available. Datanodes might not have reported blocks completely. Will
retry for 2 times
14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not
available. Datanodes might not have reported blocks completely. Will
retry for 1 times
get: Could not obtain the last block locations.

Anyone can help on this?

-- 
Best Wishes!

Yours, Zesheng


Re: Replace a block with a new one

2014-07-21 Thread Zesheng Wu
We want to implement RAID on top of HDFS, something like what Facebook
implemented, as described in:
https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/


2014-07-21 17:19 GMT+08:00 Bertrand Dechoux decho...@gmail.com:

 You want to implement a RAID on top of HDFS or use HDFS on top of RAID? I
 am not sure I understand any of these use cases. HDFS handles for you
 replication and error detection. Fine tuning the cluster wouldn't be the
 easier solution?

 Bertrand Dechoux


 On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu wuzeshen...@gmail.com wrote:

 Thanks for reply, Arpit.
 Yes, we need to do this regularly. The original requirement of this is
 that we want to do RAID(which is based reed-solomon erasure codes) on our
 HDFS cluster. When a block is corrupted or missing, the downgrade read
 needs quick recovery of the block. We are considering how to recovery the
 corrupted/missing block quickly.


 2014-07-19 5:18 GMT+08:00 Arpit Agarwal aagar...@hortonworks.com:

 IMHO this is a spectacularly bad idea. Is it a one off event? Why not
 just take the perf hit and recreate the file?

 If you need to do this regularly you should consider a mutable file
 store like HBase. If you start modifying blocks from under HDFS you open up
 all sorts of consistency issues.




 On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo gsmst...@gmail.com wrote:

 That will break the consistency of the file system, but it doesn't hurt
 to try.
  On Jul 17, 2014 8:48 PM, Zesheng Wu wuzeshen...@gmail.com wrote:

 How about write a new block with new checksum file, and replace the
 old block file and checksum file both?


 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil 
 wellington.chevre...@gmail.com:

 Hi,

 there's no way to do that, as HDFS does not provide file updates
 features. You'll need to write a new file with the changes.

 Notice that even if you manage to find the physical block replica
 files on the disk, corresponding to the part of the file you want to
 change, you can't simply update it manually, as this would give a 
 different
 checksum, making HDFS mark such blocks as corrupt.

 Regards,
 Wellington.



 On 17 Jul 2014, at 10:50, Zesheng Wu wuzeshen...@gmail.com wrote:

  Hi guys,
 
  I recently encounter a scenario which needs to replace an exist
 block with a newly written block
  The most straightforward way to finish may be like this:
  Suppose the original file is A, and we write a new file B which is
 composed by the new data blocks, then we merge A and B to C which is the
 file we wanted
  The obvious shortcoming of this method is wasting of network
 bandwidth
 
  I'm wondering whether there is a way to replace the old block by
 the new block directly.
  Any thoughts?
 
  --
  Best Wishes!
 
  Yours, Zesheng




 --
 Best Wishes!

 Yours, Zesheng







 --
 Best Wishes!

 Yours, Zesheng





-- 
Best Wishes!

Yours, Zesheng


Re: Replace a block with a new one

2014-07-21 Thread Zesheng Wu
Thanks Bertrand, my comments are inline below.

So you know that a block is corrupted thanks to an external process which
in this case is checking the parity blocks. If a block is corrupted but
hasn't been detected by HDFS, you could delete the block from the local
filesystem (it's only a file) then HDFS will replicate the good remaining
replica of this block.
*[Zesheng: We will implement a periodic checking mechanism to detect
corrupted blocks.]*

For performance reason (and that's what you want to do?), you might be able
to fix the corruption without needing to retrieve the good replica. It
might be possible by working directly with the local system by replacing
the corrupted block by the corrected block (which again are files). On
issue is that the corrected block might be different than the good replica.
If HDFS is able to tell (with CRC) it might be good else you will end up
with two different good replicas for the same block and that will not be
pretty...
*[Zesheng: Indeed, we will use Reed-Solomon erasure codes to implement
HDFS RAID, so a corrupted block will be recovered from the remaining good
data and coding blocks.]*

If the result is to be open source, you might want to check with Facebook
about their implementation and track the process within Apache JIRA. You
could gain additional feedbacks. One downside of HDFS RAID is that the less
replicas there is, the less read of the data for processing will be
'efficient/fast'. Reducing the number of replicas also diminishes the
number of supported node failures. I wouldn't say it's an easy ride.
*[Zesheng: Yes, I agree with you about the read performance degradation,
but not about the number of supported node failures; the Reed-Solomon
algorithm can maintain equal or even higher tolerance to node failures.
As for open source, we will consider it in the future.]*


2014-07-21 20:01 GMT+08:00 Bertrand Dechoux decho...@gmail.com:

 So you know that a block is corrupted thanks to an external process which
 in this case is checking the parity blocks. If a block is corrupted but
 hasn't been detected by HDFS, you could delete the block from the local
 filesystem (it's only a file) then HDFS will replicate the good remaining
 replica of this block.

 For performance reason (and that's what you want to do?), you might be
 able to fix the corruption without needing to retrieve the good replica. It
 might be possible by working directly with the local system by replacing
 the corrupted block by the corrected block (which again are files). On
 issue is that the corrected block might be different than the good replica.
 If HDFS is able to tell (with CRC) it might be good else you will end up
 with two different good replicas for the same block and that will not be
 pretty...

 If the result is to be open source, you might want to check with Facebook
 about their implementation and track the process within Apache JIRA. You
 could gain additional feedbacks. One downside of HDFS RAID is that the less
 replicas there is, the less read of the data for processing will be
 'efficient/fast'. Reducing the number of replicas also diminishes the
 number of supported node failures. I wouldn't say it's an easy ride.

 Bertrand Dechoux


Re: Replace a block with a new one

2014-07-21 Thread Zesheng Wu
 If a block is corrupted but hasn't been detected by HDFS, you could delete
the block from the local filesystem (it's only a file) then HDFS will
replicate the good remaining replica of this block.

We only have one replica for each block, so if a block is corrupted, HDFS
cannot re-replicate it.



Re: Replace a block with a new one

2014-07-21 Thread Zesheng Wu
Thanks Bertrand, I checked this information earlier. There's only an XOR
implementation, and missing blocks are reconstructed by creating new files.



2014-07-22 3:47 GMT+08:00 Bertrand Dechoux decho...@gmail.com:

 And there is actually quite a lot of information about it.


 https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java

 http://wiki.apache.org/hadoop/HDFS-RAID


 https://issues.apache.org/jira/browse/MAPREDUCE/component/12313416/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel

 Bertrand Dechoux


 On Mon, Jul 21, 2014 at 3:46 PM, Bertrand Dechoux decho...@gmail.com
 wrote:

 I wrote my answer thinking about the XOR implementation. With
 reed-solomon and single replication, the cases that need to be considered
 are indeed smaller, simpler.

 It seems I was wrong about my last statement though. If the machine
 hosting a single-replicated block is lost, it isn't likely that the block
 can't be reconstructed from a summary of the data. But with a RAID6
 strategy / RS, it is indeed possible, but of course only up to a certain
 number of blocks.

 There clearly is a case for such a tool on a Hadoop cluster.

 Bertrand Dechoux



Re: Replace a block with a new one

2014-07-21 Thread Zesheng Wu
Mmm, it seems that the Facebook branch
https://github.com/facebook/hadoop-20/
https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java
has implemented Reed-Solomon codes. What I was checking earlier were the
following two issues:
https://issues.apache.org/jira/browse/HDFS-503
https://issues.apache.org/jira/browse/HDFS-600

Thanks again Bertrand, I will look through the Facebook branch for more
information.



Re: Replace a block with a new one

2014-07-20 Thread Zesheng Wu
Thanks for the reply, Arpit.
Yes, we need to do this regularly. The original requirement is that we want
to do RAID (based on Reed-Solomon erasure codes) on our HDFS cluster. When
a block is corrupted or missing, the degraded read needs quick recovery of
the block, so we are considering how to recover the corrupted/missing block
quickly.


2014-07-19 5:18 GMT+08:00 Arpit Agarwal aagar...@hortonworks.com:

 IMHO this is a spectacularly bad idea. Is it a one off event? Why not just
 take the perf hit and recreate the file?

 If you need to do this regularly you should consider a mutable file store
 like HBase. If you start modifying blocks from under HDFS you open up all
 sorts of consistency issues.




 On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo gsmst...@gmail.com wrote:

 That will break the consistency of the file system, but it doesn't hurt
 to try.
  On Jul 17, 2014 8:48 PM, Zesheng Wu wuzeshen...@gmail.com wrote:

 How about write a new block with new checksum file, and replace the old
 block file and checksum file both?


 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil 
 wellington.chevre...@gmail.com:

 Hi,

 there's no way to do that, as HDFS does not provide file updates
 features. You'll need to write a new file with the changes.

 Notice that even if you manage to find the physical block replica files
 on the disk, corresponding to the part of the file you want to change, you
 can't simply update it manually, as this would give a different checksum,
 making HDFS mark such blocks as corrupt.

 Regards,
 Wellington.



 On 17 Jul 2014, at 10:50, Zesheng Wu wuzeshen...@gmail.com wrote:

  Hi guys,
 
  I recently encounter a scenario which needs to replace an exist block
 with a newly written block
  The most straightforward way to finish may be like this:
  Suppose the original file is A, and we write a new file B which is
 composed by the new data blocks, then we merge A and B to C which is the
 file we wanted
  The obvious shortcoming of this method is wasting of network bandwidth
 
  I'm wondering whether there is a way to replace the old block by the
 new block directly.
  Any thoughts?
 
  --
  Best Wishes!
 
  Yours, Zesheng




 --
 Best Wishes!

 Yours, Zesheng







-- 
Best Wishes!

Yours, Zesheng


Replace a block with a new one

2014-07-17 Thread Zesheng Wu
Hi guys,

I recently encountered a scenario which requires replacing an existing
block with a newly written block.
The most straightforward way to do this may be the following: suppose the
original file is A; we write a new file B composed of the new data blocks,
then merge A and B into C, which is the file we want.
The obvious shortcoming of this method is the waste of network bandwidth,
as sketched below.
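For reference, a minimal sketch of that merge using the public FileSystem
API (the paths are made up); it streams every byte of A and B through the
client again, which is exactly the bandwidth cost mentioned above:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class MergeAB {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path a = new Path("/data/A");   // original file
      Path b = new Path("/data/B");   // file holding the newly written blocks
      Path c = new Path("/data/C");   // merged result
      try (FSDataOutputStream out = fs.create(c)) {
        for (Path src : new Path[] { a, b }) {
          try (FSDataInputStream in = fs.open(src)) {
            // Re-reads and re-writes all the data -- the network cost noted above.
            IOUtils.copyBytes(in, out, conf, false);  // false: keep 'out' open
          }
        }
      }
    }
  }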

I'm wondering whether there is a way to replace the old block with the new
block directly.
Any thoughts?

-- 
Best Wishes!

Yours, Zesheng


Re: Replace a block with a new one

2014-07-17 Thread Zesheng Wu
How about writing a new block with a new checksum file, and replacing both
the old block file and the checksum file?


2014-07-17 19:34 GMT+08:00 Wellington Chevreuil 
wellington.chevre...@gmail.com:

 Hi,

 there's no way to do that, as HDFS does not provide file updates features.
 You'll need to write a new file with the changes.

 Notice that even if you manage to find the physical block replica files on
 the disk, corresponding to the part of the file you want to change, you
 can't simply update it manually, as this would give a different checksum,
 making HDFS mark such blocks as corrupt.

 Regards,
 Wellington.



 On 17 Jul 2014, at 10:50, Zesheng Wu wuzeshen...@gmail.com wrote:

  Hi guys,
 
  I recently encounter a scenario which needs to replace an exist block
 with a newly written block
  The most straightforward way to finish may be like this:
  Suppose the original file is A, and we write a new file B which is
 composed by the new data blocks, then we merge A and B to C which is the
 file we wanted
  The obvious shortcoming of this method is wasting of network bandwidth
 
  I'm wondering whether there is a way to replace the old block by the new
 block directly.
  Any thoughts?
 
  --
  Best Wishes!
 
  Yours, Zesheng




-- 
Best Wishes!

Yours, Zesheng


Re: HDFS File Writes Reads

2014-06-17 Thread Zesheng Wu
1. HDFS doesn't allow parallel writes to a single file.
2. HDFS uses a pipeline to write the multiple replicas, so a write doesn't
take three times longer than a traditional file write.
3. HDFS allows parallel reads.
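For illustration, a minimal write-then-read sketch against the public
FileSystem API (path and payload are made up): the single output stream
reflects point 1, the datanodes forward each packet along the replication
pipeline rather than the client writing it three times (point 2), and
independent clients can open and read the file concurrently (point 3).

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WriteThenRead {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path file = new Path("/tmp/pipeline-demo.txt");   // illustrative path

      // Single writer: blocks are written one after another; each packet goes to
      // the first datanode, which forwards it down the pipeline to the others.
      try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
        out.writeBytes("hello hdfs\n");
      }

      // Reads are independent: many clients may open and read the file in parallel.
      byte[] buf = new byte[16];
      try (FSDataInputStream in = fs.open(file)) {
        int n = in.read(buf);
        System.out.println(new String(buf, 0, n, "UTF-8"));
      }
    }
  }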


2014-06-17 19:17 GMT+08:00 Vijaya Narayana Reddy Bhoomi Reddy 
vijay.bhoomire...@gmail.com:

 Hi,

 I have a basic question regarding file writes and reads in HDFS. Is the
 file write and read process a sequential activity or executed in parallel?

 For example, lets assume that there is a File File1 which constitutes of
 three blocks B1, B2 and B3.

 1. Will the write process write B2 only after B1 is complete and B3 only
 after B2 is complete or for a large file with many blocks, can this happen
 in parallel? In all the hadoop documentation, I read this to be a
 sequential operation. Does that mean for a file of 1TB, it takes three
 times more time than a traditional file write? (due to default replication
 factor of 3)
 2. Is it similar in the case of read as well?

 Kindly someone please provide some clarity on this...

 Regards
 Vijay




-- 
Best Wishes!

Yours, Zesheng


Re: Programmatic Kerberos login with password to a secure cluster

2014-06-16 Thread Zesheng Wu
Perhaps you can use LDAP (or any other suitable mechanism) to authenticate
users on the web server, and then let the web server act as an
authenticated proxy user that submits the real users' requests on their
behalf, as sketched below.
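A minimal sketch of that proxy-user idea, assuming the web server has its
own service credentials (loaded here from a keytab) and that the cluster's
hadoop.proxyuser.* settings allow it to impersonate end users; the
principal, keytab path, and user name are placeholders:

  import java.security.PrivilegedExceptionAction;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.security.UserGroupInformation;

  public class ProxyUserExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      UserGroupInformation.setConfiguration(conf);

      // The web server authenticates once with its own service credentials.
      UserGroupInformation.loginUserFromKeytab(
          "websrv/host.example.com@EXAMPLE.COM",   // placeholder principal
          "/etc/security/keytabs/websrv.keytab");  // placeholder keytab path

      // After verifying the end user (e.g. via LDAP), act on the cluster as that user.
      UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
          "alice", UserGroupInformation.getLoginUser());

      proxyUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus st : fs.listStatus(new Path("/user/alice"))) {
          System.out.println(st.getPath());
        }
        return null;
      });
    }
  }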


2014-06-17 4:11 GMT+08:00 Geoff Thompson ge...@bearpeak.com:

 Greetings,

 We are developing a YARN application where the client executes on a
 machine that is external to a secure cluster. I have been able to
 successfully do a Kerberos login by manually running the kinit command on
 the external machine then starting the client. However, our goal is to not
 require the user to run kinit.

 I have been able to programmatically login using a keytab file using
 method loginUserFromKeytab from class
 org.apache.hadoop.security.UserGroupInformation. This is very useful.
 However, we also want to see if we can not require the use of a keytab file
 and allow the user to enter a password into the UI for our YARN client.

 Essentially I would like to write a “loginUserWithPassword” method. I can
 see that this would require creating a
 javax.security.auth.login.LoginContext with my own callback handler.

 Reading the UserGroupInformation source code I see that a LoginContext
 needs to be built with a “HadoopConfiguration” which is a private static
 class inside UserGroupInformation. This class is too difficult to duplicate
 in my own code since it has too many dependencies on other private details
 in class UserGroupInformation plus dependencies on other non-public classes
 in the org.apache.hadoop.security package.

 Does any one know how I could do a programmatic Kerberos login with a
 password? Or perhaps access a HadoopConfiguration?

 Thanks,

 Geoff







-- 
Best Wishes!

Yours, Zesheng


Re: Upgrade to 2.4

2014-05-30 Thread Zesheng Wu
Hi Ian,
-rollingUpgrade is available since hadoop 2.4, so you can't use
-rollingUpgrade to upgrade a 2.3 cluster to 2.4.
Because HDFS 2.4 introduce protobuf formated fsimage, we must stop the
whole cluster and then upgrade the cluster.
I tried to upgrade a 2.0 HDFS cluster to 2.4 several days ago, following
are some details, wish this can give some help to you:

1. Put the active NN into safemode:  dfsadmin -safemode enter
2. Perform a saveNamespace operation: dfsadmin -saveNamespace
3. For each component you are using, back up configuration data, databases,
and other important files
4. Shutdown hadoop serverice across your entire cluster
5. Check each host to make sure that there are no processes running
hdfs/yarn
6. Back up your HDFS metadata
7. Upgrade and start the Journal Nodes and ZKFC
8. Make sure that all the journal nodes and zkfc have started normally
9. Execute this upgrade operation on active NN:  namenode -upgrade
(Important)
10. Wait and make sure from ANN's log that upgrade is completed (from log)
11. Bootstrap standby NN: namenode -bootstrapStandby
12. Start standby NN: namenode start
13. Start each of the DNs: datanode start
14. Make sure that everything is running smoothly, this could take a matter
of days, or even weeks
15. Finalize the hdfs metadata upgrade: dfsadmin -finalizeUpgrade


2014-05-30 17:54 GMT+08:00 Ian Brooks i.bro...@sensewhere.com:

 Hi,

 I thought I'd give Hadoop 2.4 a try as the improvements to rolling upgrades
 would be useful to us. Unless I'm missing something, I can't see how to
 upgrade from 2.3 to 2.4.

 -rollingUpgrade doesn't work as the old namenode doesn't understand the
 command, and -upgrade is no longer available.

 I had a quick google and can't find any documentation on this upgrade;
 does anyone know how it's supposed to be done?

 -Ian




-- 
Best Wishes!

Yours, Zesheng


Re: HDFS undo Overwriting

2014-05-30 Thread Zesheng Wu
I am afraid this cannot be undone; in HDFS, only data that is deleted by
the DFS client and goes into the trash can be restored.
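As a side note, a minimal sketch of deleting through the trash
programmatically, so data stays restorable from the user's .Trash directory
until the trash checkpoint expires; the path is illustrative and the
cluster must have fs.trash.interval set to a value greater than zero:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.Trash;

  public class SafeDelete {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path target = new Path("/user/hive/warehouse/some_dir");  // illustrative path

      // Moves the path into the caller's .Trash directory instead of deleting it,
      // so it can still be restored until the trash checkpoint is purged.
      boolean moved = Trash.moveToAppropriateTrash(fs, target, conf);
      System.out.println(moved ? "moved to trash" : "trash disabled or move failed");
    }
  }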


2014-05-30 18:18 GMT+08:00 Amjad ALSHABANI ashshab...@gmail.com:

 Hello Everybody,

 I've made a mistake when writing to HDFS. I created a new database in Hive,
 giving a location on HDFS, but I found that it removed all the other data
 that already existed there.

 =
 before creation, the directory on HDFS contains :
 pns@app11:~$ hadoop fs -ls /user/hive/warehouse
 Found 25 items
 drwxr-xr-x   - user1 supergroup  0 2013-11-20 13:40
 */user/hive/warehouse/*dfy_ans_autres
 drwxr-xr-x   - user1 supergroup  0 2013-11-20 13:40
 /user/hive/warehouse/dfy_ans_maillog
 drwxr-xr-x   - user1 supergroup  0 2013-11-20 14:28
 /user/hive/warehouse/dfy_cnx
 drwxr-xr-x   - user2   supergroup  0 2014-05-30 06:05
 /user/hive/warehouse/pns.db
 drwxr-xr-x   - user2  supergroup  0 2014-02-24 17:00
 /user/hive/warehouse/pns_fr_integ
 drwxr-xr-x   - user2  supergroup  0 2014-05-06 15:33
 /user/hive/warehouse/pns_logstat.db
 ...
 ...
 ...


 hive -e CREATE DATABASE my_stats LOCATION 'hdfs://:9000
 */user/hive/warehouse/*mystats.db'

 but now I couldn't see the other directories on HDFS:

 pns@app11:~/aalshabani$ hls /user/hive/warehouse
 Found 1 items
 drwxr-xr-x   - user2 supergroup  0 2014-05-30 11:37
 */user/hive/warehouse*/mystats.db


 Is there any way I can restore the other directories?


 Best regards.




-- 
Best Wishes!

Yours, Zesheng


Re: issue about how to decommission a datanode from hadoop cluster

2014-05-30 Thread Zesheng Wu
I think you just need to set an exclude file on the NN; that's enough.


2014-05-30 14:09 GMT+08:00 ch huang justlo...@gmail.com:

 hi, maillist:
 I use a CDH 4.4 YARN/HDFS cluster. I want to decommission a
 datanode; should I modify hdfs-site.xml and mapred-site.xml on every node
 in the cluster to exclude the node, or do I just need to set hdfs-site.xml
 and mapred-site.xml on the NN?




-- 
Best Wishes!

Yours, Zesheng


Re:

2014-01-10 Thread Zesheng Wu
Of course you can; you can think of this as an independent, runnable
program.


2014/1/11 Andrea Barbato and.barb...@gmail.com

 Hi, I have a simple question.
 I have this example code:

 class WordCountMapper : public HadoopPipes::Mapper {
 public:
   // constructor: does nothing
   WordCountMapper( HadoopPipes::TaskContext& context ) { }
   // map function: receives a line, outputs (word,1) to reducer.
   void map( HadoopPipes::MapContext& context ) { ... }
 };

 class WordCountReducer : public HadoopPipes::Reducer {
 public:
   // constructor: does nothing
   WordCountReducer( HadoopPipes::TaskContext& context ) { }
   // reduce function
   void reduce( HadoopPipes::ReduceContext& context ) { ... }
 };

 int main(int argc, char *argv[]) {
   return HadoopPipes::runTask(
       HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>() );
 }

 Can I write some code lines (like the I/O operations) in the main function
 body?
 Thanks in advance.




-- 
Best Wishes!

Yours, Zesheng