[
https://issues.apache.org/jira/browse/HADOOP-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12612541#action_12612541
]
Jothi Padmanabhan commented on HADOOP-3514:
-------------------------------------------
Based on all the discussions above, here is a summary of the requirements:
# Have one checksum per file for the intermediate files.
# Checksums should be verified when the intermediate file is read from
disk as well as when it is read from the network.
This translates to checksum verification at the following places:
# When spill files are read to be merged into the final map output file
# When the servlet reads this map output file and streams it over the network
# When the reducer receives the data from the network. The checksum should be
verified irrespective of whether the map output is being saved to memory or
to disk.
# If the reducer saves the map output received from the network to disk, the
subsequent merge should verify the checksum.
Here is one possible approach:
Create two new streams -- ChecksumInputStream and ChecksumOutputStream. These
streams sit between the compressor stream and the actual raw file streams.
IFile.java (writer)
{code}
checksumOut = new ChecksumOutputStream(out); // out is based on a local file
this.compressedOut = codec.createOutputStream(checksumOut, compressor);
this.out = new FSDataOutputStream(this.compressedOut, null);
{code}
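The reader side (IFile.java reader) could be wired up symmetrically. A sketch --
the fileLength argument is an assumption, needed so the stream knows where the
trailing checksum starts:
{code}
// in is the raw file stream; fileLength is assumed to be known to the caller
checksumIn = new ChecksumInputStream(in, fileLength);
this.compressedIn = codec.createInputStream(checksumIn, decompressor);
{code}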
In ChecksumOutputStream, keep updating the checksum with every write() and,
in close(), write the checksum to the end of the file.
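A minimal sketch of such a ChecksumOutputStream, assuming a plain
java.util.zip.CRC32 and a four-byte big-endian trailer (class and method
details are illustrative, not a final implementation):
{code}
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

public class ChecksumOutputStream extends FilterOutputStream {
  private final CRC32 crc = new CRC32();

  public ChecksumOutputStream(OutputStream out) {
    super(out);
  }

  public void write(int b) throws IOException {
    crc.update(b);
    out.write(b);
  }

  public void write(byte[] b, int off, int len) throws IOException {
    crc.update(b, off, len);
    out.write(b, off, len);
  }

  public void close() throws IOException {
    // Append the CRC as the last four bytes before closing the file.
    long value = crc.getValue();
    out.write((int) (value >>> 24));
    out.write((int) (value >>> 16));
    out.write((int) (value >>> 8));
    out.write((int) value);
    out.close();
  }
}
{code}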
In ChecksumInputStream, keep updating the checksum with every read(). At EOF,
validate the computed checksum against the last four bytes of the file, which
would have been written earlier.
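Correspondingly, a sketch of ChecksumInputStream. The totalLength constructor
argument is an assumption -- the stream has to know where the data ends and the
four-byte trailer begins, and for these intermediate files the length is known
to the caller:
{code}
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class ChecksumInputStream extends FilterInputStream {
  private final CRC32 crc = new CRC32();
  private long dataLeft;           // data bytes remaining before the trailer
  private boolean verified = false;

  public ChecksumInputStream(InputStream in, long totalLength) {
    super(in);
    this.dataLeft = totalLength - 4;
  }

  public int read() throws IOException {
    if (dataLeft <= 0) {
      verify();
      return -1;
    }
    int b = in.read();
    if (b >= 0) {
      crc.update(b);
      dataLeft--;
    }
    return b;
  }

  public int read(byte[] b, int off, int len) throws IOException {
    if (dataLeft <= 0) {
      verify();
      return -1;
    }
    int n = in.read(b, off, (int) Math.min(len, dataLeft));
    if (n > 0) {
      crc.update(b, off, n);
      dataLeft -= n;
    }
    return n;
  }

  // At EOF, read the stored four-byte CRC and compare it with the
  // checksum computed over the data read so far.
  private void verify() throws IOException {
    if (verified) {
      return;
    }
    verified = true;
    int stored = new DataInputStream(in).readInt();
    if (stored != (int) crc.getValue()) {
      throw new IOException("Checksum error in intermediate file");
    }
  }
}
{code}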
This would work only with sequential reads and writes, and would not support
operations like seek. However, it should work fine for reading and writing
intermediate files, where data is simply written and read sequentially.
Comments?
> Reduce seeks during shuffle, by inline crcs
> -------------------------------------------
>
> Key: HADOOP-3514
> URL: https://issues.apache.org/jira/browse/HADOOP-3514
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.18.0
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
> Fix For: 0.19.0
>
>
> The number of seeks can be reduced by half in the iFile if we move the crc
> into the iFile rather than having a separate file.