[ http://issues.apache.org/jira/browse/HADOOP-441?page=all ]
Arun C Murthy updated HADOOP-441:
---------------------------------
Attachment: codec.patch
            reports.tgz
Here's the patch for custom codecs (sequencefile v5).
I've hit a potential red flag: 'writes' to 'block compressed'
SequenceFiles through the new custom codec framework take a ~10%-15%
performance hit (vis-a-vis version 4, i.e. SEQ4). 'Writes' to
non-compressed/record-compressed SequenceFiles seem to hold up very well
indeed. Similarly, 'reads' of all types of SequenceFiles are also quite fine.
I turned an evaluation version of jprobe's profiler on both v4 and v5 of
SequenceFile and the results are very surprising. I have attached the detailed
summaries (reports.tgz) of the command:
$ java org.apache.hadoop.io.TestSequenceFile -local -count {10000 - 10000000 i}
-rwonly file_bc.seq -compressType BLOCK
The test was run to write the exact same data (RandomDatum's generator was
seeded with '0' in all cases).
Summarising: Deflater.deflateBytes (a native JNI call, reached via
DeflaterOutputStream.write -> Deflater.deflate -> Deflater.deflateBytes) seems
to perform very differently in v5.
a) In both v4 and v5 the exact same number of calls is made to
SequenceFile.BlockCompressedWriter.writeBlock -> DeflaterOutputStream.write.
b) In v4 there seem to be slightly _more_ calls to Deflater.deflate and
hence to Deflater.deflateBytes.
c) Yet the performance of v5 (with _fewer_ calls) suffers, since
Deflater.deflateBytes takes longer to execute! And this is completely
reproducible.
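For anyone who wants to poke at this outside the profiler, here is a minimal standalone sketch that counts calls to Deflater.deflate and times a single DeflaterOutputStream.write. It is not the TestSequenceFile run above and not code from codec.patch: the class name DeflateCallProbe, the buffer sizes, and the use of java.util.Random (seeded with 0) in place of RandomDatum are all assumptions made for illustration. The output buffer size is just one variable that changes the call count; nothing above establishes it as the cause of the slowdown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical probe, not part of Hadoop or of codec.patch.
final class DeflateCallProbe {

    /** Deflater that counts invocations of deflate(byte[], int, int). */
    static final class CountingDeflater extends Deflater {
        long calls;

        @Override
        public int deflate(byte[] b, int off, int len) {
            calls++;
            return super.deflate(b, off, len);
        }
    }

    /** Outcome of one compression run. */
    static final class Result {
        final long deflateCalls;
        final long nanos;
        final byte[] compressed;

        Result(long deflateCalls, long nanos, byte[] compressed) {
            this.deflateCalls = deflateCalls;
            this.nanos = nanos;
            this.compressed = compressed;
        }
    }

    /** Compresses data with one write() call, using the given output buffer size. */
    static Result run(byte[] data, int bufferSize) {
        CountingDeflater deflater = new CountingDeflater();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long start = System.nanoTime();
        try (DeflaterOutputStream out =
                 new DeflaterOutputStream(sink, deflater, bufferSize)) {
            out.write(data, 0, data.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // A caller-supplied Deflater is not ended by close(); release it here.
            deflater.end();
        }
        long nanos = System.nanoTime() - start;
        return new Result(deflater.calls, nanos, sink.toByteArray());
    }

    /** Deterministic input: the generator is seeded with 0 so every run sees the same bytes. */
    static byte[] sampleData(int size) {
        byte[] data = new byte[size];
        new Random(0).nextBytes(data);
        return data;
    }

    public static void main(String[] args) {
        byte[] data = sampleData(1 << 20);
        for (int bufferSize : new int[] {512, 4096, 65536}) {
            Result r = run(data, bufferSize);
            System.out.printf("buffer=%d deflateCalls=%d compressed=%d bytes time=%.2f ms%n",
                    bufferSize, r.deflateCalls, r.compressed.length, r.nanos / 1e6);
        }
    }
}
```

Timings from a single pass like this are noisy (JIT warm-up, GC), so repeat the loop a few times before reading anything into the numbers.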
I talked to Owen who mentioned that he had noticed similar flaky performance
with the Deflater earlier...
Appreciate any code reviews/ideas etc.
Thoughts?
> SequenceFile should support 'custom compressors'
> ------------------------------------------------
>
> Key: HADOOP-441
> URL: http://issues.apache.org/jira/browse/HADOOP-441
> Project: Hadoop
> Issue Type: New Feature
> Components: io
> Reporter: Arun C Murthy
> Assigned To: Arun C Murthy
> Fix For: 0.6.0
>
> Attachments: codec.patch, codec.patch, codec20060831.patch,
> codec_updated_interfaces_20060830.patch, reports.tgz
>
>
> SequenceFiles should support 'custom compressors' which can be specified by
> the user on creation of the file.
> Readily available packages for gzip and zip (java.util.zip) are among obvious
> choices to support. Of course there will be hooks so that other compressors
> can be added in future as long as there is a way to construct (input/output)
> streams on top of the compressor/decompressor.
> The 'classname' of the 'custom compressor/decompressor' could be stored in
> the header of the SequenceFile which can then be used by SequenceFile.Reader
> to figure out the appropriate 'decompressor'. Thus I propose we add
> constructors to SequenceFile.Writer which take in the 'classname' of the
> compressor's input/output stream classes (e.g.
> DeflaterOutputStream/InflaterInputStream or
> GZIPOutputStream/GZIPInputStream), which acts as the hook for future
> compressors/decompressors.
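The header-based lookup proposed in the issue description can be illustrated with a small self-contained sketch. This is not the SequenceFile header layout and not the code in codec.patch: the class CodecHeaderSketch, its write/read methods, and the choice to store only the input-stream class name as a UTF string are assumptions for illustration. It relies only on the fact that the java.util.zip stream classes named above have public single-argument (OutputStream) and (InputStream) constructors.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; names and header layout are not from Hadoop.
final class CodecHeaderSketch {

    /** Writes the decompressor class name as a header, then the compressed payload. */
    static byte[] write(String outputStreamClass, String inputStreamClass, byte[] payload) {
        try {
            ByteArrayOutputStream file = new ByteArrayOutputStream();
            DataOutputStream header = new DataOutputStream(file);
            header.writeUTF(inputStreamClass);  // the reader uses this to pick the decompressor
            header.flush();

            // Build the compressing stream reflectively from its class name.
            OutputStream codec = Class.forName(outputStreamClass)
                    .asSubclass(OutputStream.class)
                    .getConstructor(OutputStream.class)
                    .newInstance(file);
            codec.write(payload);
            codec.close();  // finishes the compressed stream
            return file.toByteArray();
        } catch (ReflectiveOperationException | IOException e) {
            throw new IllegalStateException("cannot write with " + outputStreamClass, e);
        }
    }

    /** Reads the class name from the header and decompresses the rest with that class. */
    static byte[] read(byte[] fileBytes) {
        try {
            ByteArrayInputStream file = new ByteArrayInputStream(fileBytes);
            String inputStreamClass = new DataInputStream(file).readUTF();

            InputStream codec = Class.forName(inputStreamClass)
                    .asSubclass(InputStream.class)
                    .getConstructor(InputStream.class)
                    .newInstance(file);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = codec.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            codec.close();
            return out.toByteArray();
        } catch (ReflectiveOperationException | IOException e) {
            throw new IllegalStateException("cannot read file", e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "some SequenceFile block".getBytes(StandardCharsets.UTF_8);
        byte[] file = write("java.util.zip.GZIPOutputStream",
                            "java.util.zip.GZIPInputStream", payload);
        System.out.println("round trip ok: " + Arrays.equals(payload, read(file)));
    }
}
```

A real reader would likely want to restrict which class names it is willing to load from a file header, since the name comes from untrusted data.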