Re: Reg HDFS checksum

2011-04-09 Thread Harsh J
Hello Thamizh,

Perhaps the discussion in the following link can shed some light on
this: http://getsatisfaction.com/cloudera/topics/hadoop_fs_crc

On Fri, Apr 8, 2011 at 5:47 PM, Thamizh  wrote:
> Hi All,
>
> This is a question regarding "HDFS checksum" computation.

-- 
Harsh J


Re: Reg HDFS checksum

2011-04-09 Thread Thamizh
Hi Harsh,
Thanks a lot for the reference.
I am looking to understand how Hadoop computes the CRC for a file.
If you have any references, please share them. That would be a great help to me.

Regards,

  Thamizhannal P

--- On Sat, 9/4/11, Harsh J  wrote:

From: Harsh J 
Subject: Re: Reg HDFS checksum
To: common-user@hadoop.apache.org
Date: Saturday, 9 April, 2011, 3:20 PM

Hello Thamizh,

Perhaps the discussion in the following link can shed some light on
this: http://getsatisfaction.com/cloudera/topics/hadoop_fs_crc


-- 
Harsh J


Re: Reg HDFS checksum

2011-04-11 Thread Josh Patterson
Thamizh,
For a much older project I wrote a demo tool that computes the Hadoop-style
checksum locally:

https://github.com/jpatanooga/IvoryMonkey

The checksum generator is a single-threaded replica of Hadoop's internal
distributed hash-checksum mechanism.

What it actually does is save the CRC32 of every 512 bytes (per block) and
then compute an MD5 hash over those CRCs. When the "getFileChecksum()" method
is called, each block of the file sends its MD5 hash to a collector, where the
block hashes are gathered and a final MD5 is calculated over all of them.

My version includes code that can calculate the hash on the client side (it
breaks the data up the same way HDFS does and calculates it the same way).
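
For reference, the value that "getFileChecksum()" returns is also reachable
straight from the FileSystem client API, so you can print it without any extra
tooling. A minimal sketch (it assumes the cluster configuration is on the
classpath; the class name is just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintHdfsChecksum {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(conf);

        // For files on HDFS this is the MD5-of-MD5-of-CRC32 value described above
        FileChecksum sum = fs.getFileChecksum(path);
        if (sum == null) {
            System.out.println("No checksum available for " + path);
            return;
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sum.getBytes()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(sum.getAlgorithmName() + " " + hex + " " + path);
    }
}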

During development, we also discovered and filed:

https://issues.apache.org/jira/browse/HDFS-772

To invoke this method, use my shell wrapper:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/Shell.java

Hope this provides some reference information for you.

On Sat, Apr 9, 2011 at 10:38 AM, Thamizh  wrote:
> Hi Harsh,
> Thanks a lot for the reference.
> I am looking to understand how Hadoop computes the CRC for a file.
> If you have any references, please share them. That would be a great help to
> me.
>
> Regards,
>
>  Thamizhannal P
>



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv


Re: Reg HDFS checksum

2011-04-12 Thread Thamizh

Thanks a lot, Josh.

I have been given a .gz file and told that it was downloaded from HDFS.

When I checked the integrity of that file with "gzip -t", it ended up with
"invalid compressed data--format violated", and "gzip -d" gave the same
result.

I am a bit worried about Hadoop's CRC checking mechanism, so I am looking to
implement an external CRC checker for Hadoop.

Regards,

  Thamizhannal P

--- On Mon, 11/4/11, Josh Patterson  wrote:

From: Josh Patterson 
Subject: Re: Reg HDFS checksum
To: common-user@hadoop.apache.org
Cc: "Thamizh" 
Date: Monday, 11 April, 2011, 7:53 PM

Thamizh,
For a much older project I wrote a demo tool that computes the Hadoop-style
checksum locally:

https://github.com/jpatanooga/IvoryMonkey

The checksum generator is a single-threaded replica of Hadoop's internal
distributed hash-checksum mechanism.

What it actually does is save the CRC32 of every 512 bytes (per block) and
then compute an MD5 hash over those CRCs. When the "getFileChecksum()" method
is called, each block of the file sends its MD5 hash to a collector, where the
block hashes are gathered and a final MD5 is calculated over all of them.

My version includes code that can calculate the hash on the client side (it
breaks the data up the same way HDFS does and calculates it the same way).

During development, we also discovered and filed:

https://issues.apache.org/jira/browse/HDFS-772

To invoke this method, use my shell wrapper:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/Shell.java

Hope this provides some reference information for you.




-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv


Re: Reg HDFS checksum

2011-04-12 Thread Josh Patterson
If you take a look at:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java

you'll see a single-process version of what HDFS does under the hood (HDFS
itself does it in a highly distributed fashion). What's going on here is that
a CRC32 is calculated for every 512 bytes and saved at each local DataNode for
that block. When the checksum is requested, these CRC32s are pulled together
and MD5-hashed per block, and that hash is sent to the client process. The
client process then MD5-hashes all of these block hashes together to produce a
final hash.
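
If it helps to see the shape of that in one place, here is a rough
single-process sketch of the same idea in plain Java. It is not a drop-in
replacement for the IvoryMonkey code: the chunk and block sizes below are
assumptions and must match the cluster's bytes-per-checksum setting and the
file's block size for the values to line up.

import java.io.FileInputStream;
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class LocalMd5Md5Crc32 {
    // Assumed values; must match io.bytes.per.checksum and the file's block size
    static final int BYTES_PER_CRC = 512;
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        MessageDigest fileMd5 = MessageDigest.getInstance("MD5");
        MessageDigest blockMd5 = MessageDigest.getInstance("MD5");
        FileInputStream in = new FileInputStream(args[0]);
        byte[] chunk = new byte[BYTES_PER_CRC];
        long bytesInBlock = 0;
        int n;
        // NOTE: read() may return short counts mid-file; good enough for a sketch
        while ((n = in.read(chunk)) > 0) {
            // CRC32 of each 512-byte chunk, folded into the current block's MD5
            CRC32 crc = new CRC32();
            crc.update(chunk, 0, n);
            int c = (int) crc.getValue();
            blockMd5.update(new byte[] {
                    (byte) (c >>> 24), (byte) (c >>> 16), (byte) (c >>> 8), (byte) c });
            bytesInBlock += n;
            if (bytesInBlock >= BLOCK_SIZE) {
                // Block boundary: fold this block's MD5 into the file-level MD5
                fileMd5.update(blockMd5.digest());
                bytesInBlock = 0;
            }
        }
        in.close();
        if (bytesInBlock > 0) {
            fileMd5.update(blockMd5.digest()); // last partial block
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : fileMd5.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("md5-of-md5-of-crc32 (approximation): " + hex);
    }
}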

For some context: on the openPDC project we had some legacy software writing
to HDFS through an FTP proxy bridge:

https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/HdfsBridge/

The openPDC data was ultra-critical in that we could not lose *any* data, and
the team wanted to use a simple FTP client library to write to HDFS (the least
work for them, standard libraries), so we needed a way to make sure no
corruption occurred during the "hop" through the FTP bridge. The bridge acted
as an intermediary to DFSClient; something could fail and the file might be
slightly truncated, which is hard to detect. In the FTP bridge we allowed a
custom FTP command to call the now-exposed "hdfs-checksum" command, and the
sending agent could then compute the hash locally (in the case of the openPDC
it was done in C#) and make sure the file made it there intact. This system
has been in production for over a year now, storing and maintaining smart
grid data, and has been highly reliable.

I say all of this to say: after having dug through HDFS's checksumming code,
I am pretty confident that it's Good Stuff, although I don't claim to be a
filesystem expert by any means. Could it just be a simple error or oversight
in your process?

On Tue, Apr 12, 2011 at 7:32 AM, Thamizh  wrote:
>
> Thanks a lot, Josh.
>
> I have been given a .gz file and told that it was downloaded from HDFS.
>
> When I checked the integrity of that file with "gzip -t", it ended up with
> "invalid compressed data--format violated", and "gzip -d" gave the same
> result.
>
> I am a bit worried about Hadoop's CRC checking mechanism, so I am looking to
> implement an external CRC checker for Hadoop.
>
> Regards,
>
>  Thamizhannal P
>



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv


Re: Reg HDFS checksum

2011-04-12 Thread Steve Loughran

On 12/04/2011 07:06, Josh Patterson wrote:

> If you take a look at:
>
> https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java
>
> you'll see a single-process version of what HDFS does under the hood (HDFS
> itself does it in a highly distributed fashion). What's going on here is
> that a CRC32 is calculated for every 512 bytes and saved at each local
> DataNode for that block. When the checksum is requested, these CRC32s are
> pulled together and MD5-hashed per block, and that hash is sent to the
> client process. The client process then MD5-hashes all of these block
> hashes together to produce a final hash.
>
> For some context: on the openPDC project we had some legacy software
> writing to HDFS through an FTP proxy bridge:
>
> https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/HdfsBridge/
>
> The openPDC data was ultra-critical in that we could not lose *any* data,
> and the team wanted to use a simple FTP client library to write to HDFS
> (the least work for them, standard libraries), so we needed a way to make
> sure no corruption occurred during the "hop" through the FTP bridge. The
> bridge acted as an intermediary to DFSClient; something could fail and the
> file might be slightly truncated, which is hard to detect. In the FTP
> bridge we allowed a custom FTP command to call the now-exposed
> "hdfs-checksum" command, and the sending agent could then compute the hash
> locally (in the case of the openPDC it was done in C#) and make sure the
> file made it there intact. This system has been in production for over a
> year now, storing and maintaining smart grid data, and has been highly
> reliable.
>
> I say all of this to say: after having dug through HDFS's checksumming
> code, I am pretty confident that it's Good Stuff, although I don't claim to
> be a filesystem expert by any means. Could it just be a simple error or
> oversight in your process?


Assuming it came down over HTTP, it's perfectly conceivable that something
went wrong on the way, especially if a proxy server got involved. All HTTP
checks is that the (optional) content length is consistent with what arrived;
it relies on TCP checksums, which verify that the network links work, but not
the other parts of the system in the path (such as any proxy server).
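
A minimal sketch of what that leaves you to check yourself, assuming the file
is pulled over plain HTTP (the URL and local path arguments are placeholders):
the Content-Length comparison only catches truncation, so the sketch also
computes an MD5 of the received bytes, which can then be compared against a
checksum taken on the HDFS side.

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class CheckedDownload {
    public static void main(String[] args) throws Exception {
        URL url = new URL(args[0]);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        long expected = conn.getContentLength(); // -1 if no length was sent
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        InputStream in = new DigestInputStream(conn.getInputStream(), md5);
        FileOutputStream out = new FileOutputStream(args[1]);
        byte[] buf = new byte[8192];
        long received = 0;
        int n;
        while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
            received += n;
        }
        out.close();
        in.close();
        // The length check catches truncation; the digest catches bit-level corruption
        System.out.println("Content-Length: " + expected + ", received: " + received);
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("MD5 of downloaded bytes: " + hex);
    }
}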