Reg HDFS checksum
Hi All,

This is a question regarding "HDFS checksum" computation. I understand that when we read a file from HDFS, the checksum is verified by default and the read fails if the file is corrupted. I also understand that the CRC is internal to Hadoop. Here are my questions:

1. How can I use the "hadoop dfs -get [-ignoreCrc] [-crc]" command?
2. I used the "get" command on a .gz file with the -crc option ("hadoop dfs -get -crc input1/test.gz /home/hadoop/test/."). Does this check for the .crc file created in Hadoop? When I tried it, I got the error "-crc option is not valid when source file system does not have crc files. Automatically turn the option off." Does this mean that Hadoop does not create a CRC for this file? Is that correct?
3. How can I enable Hadoop to create a CRC file?

Regards,
Thamizhannal P
Re: Reg HDFS checksum
Hello Thamizh,

Perhaps the discussion in the following link can shed some light on this: http://getsatisfaction.com/cloudera/topics/hadoop_fs_crc

On Fri, Apr 8, 2011 at 5:47 PM, Thamizh wrote:
> Hi All,
>
> This is a question regarding "HDFS checksum" computation.

--
Harsh J
Re: Reg HDFS checksum
Hi Harsh,

Thanks a lot for the reference. I would like to understand how Hadoop computes the CRC for a file. If you have any references, please share them; it would be a great help.

Regards,
Thamizhannal P

--- On Sat, 9/4/11, Harsh J wrote:
> Perhaps the discussion in the following link can shed some light on
> this: http://getsatisfaction.com/cloudera/topics/hadoop_fs_crc
Re: Reg HDFS checksum
Thamizh,

For a much older project I wrote a demo tool that computes the Hadoop-style checksum locally:

https://github.com/jpatanooga/IvoryMonkey

The checksum generator is a single-threaded replica of Hadoop's internal distributed hash-checksum mechanism. What it actually does is save the CRC32 of every 512 bytes (per block) and then take an MD5 hash over those. When the "getFileChecksum()" method is called, each block of a file sends its MD5 hash to a collector, where the hashes are gathered together and an MD5 hash is calculated over all of the block hashes.

My version includes code that can calculate the hash on the client side (it breaks the file up the same way HDFS does and calculates the hash the same way).

During development, we also discovered and filed:

https://issues.apache.org/jira/browse/HDFS-772

To invoke this method, use my shell wrapper:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/Shell.java

Hope this provides some reference information for you.

On Sat, Apr 9, 2011 at 10:38 AM, Thamizh wrote:
> I am looking forward to know about, how does Hadoop computes CRC for any
> file?

--
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv
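[Editor's note: the scheme Josh describes — CRC32 per 512-byte chunk, an MD5 over each block's CRCs, then a final MD5 over the per-block MD5s — can be sketched in a few lines. This is a simplified simulation only: real Hadoop's MD5MD5CRC32FileChecksum encodes extra header fields, so the digest below will not match `getFileChecksum()` byte-for-byte.]

```python
import binascii
import hashlib

def hdfs_style_checksum(data: bytes,
                        bytes_per_crc: int = 512,
                        block_size: int = 64 * 1024 * 1024) -> str:
    """Simplified MD5-of-MD5-of-CRC32 over a byte string.

    For each block: CRC32 every `bytes_per_crc` chunk, then MD5 the
    concatenated CRCs. Finally, MD5 the concatenated per-block MD5s.
    """
    block_md5s = []
    for b_off in range(0, len(data), block_size):
        block = data[b_off:b_off + block_size]
        crcs = b""
        for c_off in range(0, len(block), bytes_per_crc):
            chunk = block[c_off:c_off + bytes_per_crc]
            crc = binascii.crc32(chunk) & 0xFFFFFFFF
            crcs += crc.to_bytes(4, "big")
        # One MD5 per block, computed over that block's chunk CRCs.
        block_md5s.append(hashlib.md5(crcs).digest())
    # Final MD5 over all the per-block MD5 digests.
    return hashlib.md5(b"".join(block_md5s)).hexdigest()
```

Note that the checksum depends only on the bytes and the chunk/block geometry, which is why a client-side replica like IvoryMonkey can reproduce it as long as it splits the file the same way HDFS does.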
Re: Reg HDFS checksum
Thanks a lot, Josh.

I was given a .gz file and told it had been downloaded from HDFS. When I tried to check the integrity of that file with "gzip -t", it failed with "invalid compressed data--format violated", and "gzip -d" gave the same result.

I am a bit worried about Hadoop's CRC-checking mechanism, so I am looking to run an external CRC check against Hadoop.

Regards,
Thamizhannal P

--- On Mon, 11/4/11, Josh Patterson wrote:
> For a much older project I wrote a demo tool that computed the hadoop
> style checksum locally:
> https://github.com/jpatanooga/IvoryMonkey
Re: Reg HDFS checksum
If you take a look at:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java

you'll see a single-process version of what HDFS does under the hood, albeit there in a highly distributed fashion. What's going on is that for every 512 bytes a CRC32 is calculated and saved at each local datanode for that block. When the checksum is requested, these CRC32s are pulled together and MD5-hashed, and that hash is sent to the client process. The client process then MD5-hashes all of those hashes together to produce the final hash.

For some context: our purpose on the openPDC project was that we had some legacy software writing to HDFS through an FTP proxy bridge:

https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/HdfsBridge/

Since the openPDC data was ultra-critical, in that we could not lose *any* data, and the team wanted to use a simple FTP client library to write to HDFS (the least amount of work for them, with standard libraries), we needed a way to make sure that no corruption occurred during the "hop" through the FTP bridge (it acted as an intermediary to DFSClient; something could fail and the file might be slightly truncated, yet this would be hard to detect). In the FTP bridge we allowed a custom FTP command to call the now-exposed "hdfs-checksum" command, and the sending agent could then compute the hash locally (in the case of the openPDC it was done in C#) and make sure the file made it there intact. This system has been in production for over a year now, storing and maintaining smart-grid data, and has been highly reliable.

I say all of this to say: after having dug through HDFS's checksumming code, I am pretty confident that it's good stuff, although I don't claim to be a filesystem expert by any means. Could it be some simple error or oversight in your process, possibly?

On Tue, Apr 12, 2011 at 7:32 AM, Thamizh wrote:
> When I tried to compute integrity of that file using "gzip -t", It ended up
> with "invalid compressed data--format violated"

--
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv
Re: Reg HDFS checksum
On 12/04/2011 07:06, Josh Patterson wrote:
> I say all of this to say: After having dug through HDFS's checksumming
> code I am pretty confident that its Good Stuff [...] It may be just some
> simple error or oversight in your process, possibly?
Assuming it came down over HTTP, it's perfectly conceivable that something went wrong on the way, especially if a proxy server got involved. All HTTP checks is that the (optional) content length is consistent with what arrived; it relies on TCP checksums, which verify that the network links work, but not the other parts of the system in the way (like any proxy server).
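[Editor's note: the one check HTTP does give you, as described above, is easy to enforce in a downloader. A minimal sketch; `length_consistent` is a hypothetical helper, not from any particular HTTP library, and it assumes the caller has lower-cased the header names.]

```python
def length_consistent(headers: dict, body: bytes) -> bool:
    """If the server sent a Content-Length header, the number of
    bytes actually received must match it; with no header there is
    nothing to verify against."""
    declared = headers.get("content-length")
    if declared is None:
        return True  # HTTP makes this header optional
    return len(body) == int(declared)
```

This only catches truncation, not corruption, which is exactly why an application-level checksum like `getFileChecksum()` is worth comparing end-to-end.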