On 12/04/2011 07:06, Josh Patterson wrote:
If you take a look at:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java

you'll see a single-process version of what HDFS does under the hood,
albeit in a highly distributed fashion. What's going on here is that
for every 512 bytes a CRC32 is calculated and saved at each local
datanode for that block. When the "checksum" is requested, these
CRC32s are pulled together and MD5 hashed, and that per-block hash is
sent to the client process. The client process then MD5 hashes all of
these hashes together to produce a final hash.
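
As a rough illustration of the scheme described above, here is a minimal single-process sketch in Python. It is not the code from the linked Java class and is not guaranteed to match the bytes HDFS actually returns; the function names, the big-endian packing of each CRC32, and the `block_size` parameter are assumptions made for the example.

```python
import hashlib
import zlib

BYTES_PER_CRC = 512  # chunk size mentioned in the text


def block_digest(block: bytes, bytes_per_crc: int = BYTES_PER_CRC) -> bytes:
    """MD5 over the CRC32s of one block (what a datanode would hand back)."""
    md5 = hashlib.md5()
    for off in range(0, len(block), bytes_per_crc):
        crc = zlib.crc32(block[off:off + bytes_per_crc]) & 0xFFFFFFFF
        # Assumption: each CRC32 is fed in as 4 big-endian bytes.
        md5.update(crc.to_bytes(4, "big"))
    return md5.digest()


def file_digest(data: bytes, block_size: int,
                bytes_per_crc: int = BYTES_PER_CRC) -> str:
    """MD5 over the per-block MD5s (what the client would compute)."""
    outer = hashlib.md5()
    for off in range(0, len(data), block_size):
        outer.update(block_digest(data[off:off + block_size], bytes_per_crc))
    return outer.hexdigest()
```

Note that the result depends on the block size and the bytes-per-CRC setting as well as on the file contents, so a local computation only matches the cluster's answer if it uses the same values.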

For some context: on the openPDC project we needed this because we
had some legacy software writing to HDFS through an FTP proxy bridge:

https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/HdfsBridge/

Since the openPDC data was ultra critical in that we could not lose
*any* data, and the team wanted to use a simple FTP client lib to
write to HDFS (least amount of work for them, standard libs), we
needed a way to make sure that no corruption occurred during the "hop"
through the FTP bridge (it acted as an intermediary to DFSClient, so
something could fail and leave the file slightly truncated in a way
that is hard to detect). In the FTP bridge we allowed a custom FTP
command to
call the now exposed "hdfs-checksum" command, and the sending agent
could then compute the hash locally (in the case of the openPDC it was
done in C#), and make sure the file made it there intact. This system
has been in production for over a year now storing and maintaining
smart grid data and has been highly reliable.

I say all of this to say: after having dug through HDFS's checksumming
code I am pretty confident that it's Good Stuff, although I don't
claim to be a filesystem expert by any means. It may be just some
simple error or oversight in your process, possibly?

Assuming it came down over HTTP, it's perfectly conceivable that something went wrong on the way, especially if a proxy server got involved. All HTTP checks is that the (optional) content length is consistent with what arrived. Beyond that it relies on TCP checksums, which verify that the network links work, but not the other parts of the system along the way (like any proxy server).
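
To make the difference concrete, here is a small hedged sketch of an end-to-end check on a downloaded payload. It assumes you have a digest of the original file from some trusted source; the function name `verify_download` and its parameters are hypothetical, and MD5 is used only because it is the hash already under discussion.

```python
import hashlib
from typing import Optional


def verify_download(payload: bytes, expected_md5_hex: str,
                    content_length: Optional[int] = None) -> bool:
    """Return True only if the payload passes both checks."""
    # The length check is all that HTTP itself gives you, and only
    # when the server sent a Content-Length header.
    if content_length is not None and len(payload) != content_length:
        return False
    # The digest check is end to end: it also catches same-length
    # corruption introduced anywhere between sender and receiver.
    return hashlib.md5(payload).hexdigest() == expected_md5_hex.lower()
```

A payload with a flipped byte passes the length check but fails the digest comparison, which is the case the length check alone cannot see.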
