[ https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12612541#action_12612541 ]

Jothi Padmanabhan commented on HADOOP-3514:
-------------------------------------------

Based on all the discussions above, here is a summary of the requirements:
# Have one checksum per intermediate file.
# Checksums should be verified when the intermediate file is read from disk as well as when it is read over the network.

This translates to checksum verification at the following places:
# When spill files are read to be merged into the final map output file
# When the servlet reads this map output file and streams it over the network
# When the reducer receives the data from the network. The checksum should be verified irrespective of whether the map output is being saved to memory or to disk.
# If the reducer saves the map output received from the network to disk, the subsequent merge should verify the checksum.

Here is one possible approach:

Create two new streams -- ChecksumInputStream and ChecksumOutputStream. These streams sit between the compressor stream and the actual raw file streams.

IFile.java (writer)
{code}
checksumOut = new ChecksumOutputStream(out);  // out is based on a local file
this.compressedOut = codec.createOutputStream(checksumOut, compressor);
this.out = new FSDataOutputStream(this.compressedOut, null);
{code}
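By symmetry, the reader side could be wired the same way. This is only a sketch: the dataLength parameter is an assumption of mine, since the input stream needs some way to know where the trailing checksum begins.

{code}
// in is based on a local file; dataLength excludes the trailing 4-byte checksum
checksumIn = new ChecksumInputStream(in, dataLength);
this.compressedIn = codec.createInputStream(checksumIn, decompressor);
{code}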

In ChecksumOutputStream, keep updating the checksum with every write() and, in close(), write the checksum to the end of the file. In ChecksumInputStream, keep updating the checksum with every read(). At EOF, validate the checksum against the last four bytes that were written earlier.
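As a rough sketch of what those two streams could look like (class and method names here are my assumptions, not a final implementation), using java.util.zip.CRC32 and assuming the reader is told the length of the checksummed data:

{code}
import java.io.*;
import java.util.zip.CRC32;

// Sketch only: updates a CRC32 on every write() and appends the
// 4-byte checksum in close().
class ChecksumOutputStream extends FilterOutputStream {
  private final CRC32 crc = new CRC32();

  ChecksumOutputStream(OutputStream out) { super(out); }

  public void write(int b) throws IOException {
    crc.update(b);
    out.write(b);
  }

  public void write(byte[] b, int off, int len) throws IOException {
    crc.update(b, off, len);
    out.write(b, off, len);
  }

  public void close() throws IOException {
    // write the checksum as the last four bytes of the file
    new DataOutputStream(out).writeInt((int) crc.getValue());
    out.close();
  }
}

// Sketch only: updates a CRC32 on every read() and, once dataLength
// bytes have been consumed, validates against the trailing 4 bytes.
class ChecksumInputStream extends FilterInputStream {
  private final CRC32 crc = new CRC32();
  private long remaining;  // data bytes left before the trailing checksum

  ChecksumInputStream(InputStream in, long dataLength) {
    super(in);
    this.remaining = dataLength;
  }

  public int read() throws IOException {
    if (remaining <= 0) return -1;
    int b = in.read();
    if (b >= 0) {
      crc.update(b);
      if (--remaining == 0) validate();
    }
    return b;
  }

  public int read(byte[] b, int off, int len) throws IOException {
    if (remaining <= 0) return -1;
    int n = in.read(b, off, (int) Math.min(len, remaining));
    if (n > 0) {
      crc.update(b, off, n);
      remaining -= n;
      if (remaining == 0) validate();
    }
    return n;
  }

  private void validate() throws IOException {
    int expected = new DataInputStream(in).readInt();
    if (expected != (int) crc.getValue()) {
      throw new IOException("Checksum error in intermediate file");
    }
  }
}
{code}

Note that corruption is only detected once the stream has been read to the end; for the strictly sequential merge and shuffle paths described above, that should be acceptable.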

This would work only with sequential reads and writes and would not support operations such as seeks. However, it should work fine for the reading and writing of intermediate files, where data is simply written and read sequentially.

Comments?



> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
>                 Key: HADOOP-3514
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3514
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.18.0
>            Reporter: Devaraj Das
>            Assignee: Jothi Padmanabhan
>             Fix For: 0.19.0
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc 
> into the iFile rather than having a separate file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
