[jira] Updated: (HADOOP-441) SequenceFile should support 'custom compressors'

Arun C Murthy (JIRA) Tue, 05 Sep 2006 22:35:58 -0700

     [ http://issues.apache.org/jira/browse/HADOOP-441?page=all ]


Arun C Murthy updated HADOOP-441:
---------------------------------

    Attachment: codec.patch
                reports.tgz

Here's the patch for custom codecs (sequencefile v5).

I've hit a potential red flag: 'writes' to 'block compressed' 
SequenceFiles through the new custom codec framework suffer a ~10%-15% 
performance hit (vis-a-vis version 4, i.e. SEQ4). 'Writes' to 
non-compressed/record-compressed SequenceFiles seem to hold up very well 
indeed. Similarly, 'reads' of all types of SequenceFiles are also quite fine.

I turned an evaluation version of jprobe's profiler on both v4 and v5 of 
SequenceFile and the results are very surprising. I have attached the detailed 
summaries (reports.tgz) of the command:
$ java org.apache.hadoop.io.TestSequenceFile -local -count {10000 - 10000000 i} 
-rwonly file_bc.seq -compressType BLOCK

The test was run to write the exact same data (RandomDatum's generator was 
seeded with '0' in all cases).

Summarising: Deflater.deflateBytes (a native JNI call - 
DeflaterOutputStream.write -> Deflater.deflate -> Deflater.deflateBytes) seems 
to perform very differently in v5. 
a) In both v4 and v5 exactly the same number of calls are made to 
SequenceFile.BlockCompressedWriter.writeBlock -> DeflaterOutputStream.write. 
b) In v4 there seem to be slightly _more_ calls to Deflater.deflate and 
hence to Deflater.deflateBytes.
c) Yet the performance of v5 (with _fewer_ calls) suffers, since 
Deflater.deflateBytes takes longer to execute! And this is completely 
reproducible.
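
For anyone who wants to poke at the Deflater outside of SequenceFile, a 
minimal stand-alone sketch along the following lines might help. It is not 
taken from the patch or from TestSequenceFile: the class and method names 
(DeflaterCallProbe, CountingDeflater, run) are made up for illustration, and 
it assumes that DeflaterOutputStream calls Deflater.deflate(byte[], int, int) 
internally. It writes the same seeded random data through a 
DeflaterOutputStream while varying only the stream's internal buffer size, 
which is one knob that changes how many deflate calls each write produces. 
Whether that is what actually differs between v4 and v5 is not established 
here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class DeflaterCallProbe {
    /** Counts calls to the deflate overload that DeflaterOutputStream is assumed to use. */
    static class CountingDeflater extends Deflater {
        long calls = 0;

        @Override
        public int deflate(byte[] b, int off, int len) {
            calls++;
            return super.deflate(b, off, len);
        }
    }

    /** Outcome of one timed run. */
    static class Result {
        final long nanos;
        final long deflateCalls;
        final byte[] compressed;

        Result(long nanos, long deflateCalls, byte[] compressed) {
            this.nanos = nanos;
            this.deflateCalls = deflateCalls;
            this.compressed = compressed;
        }
    }

    /** Reproducible input: the same seed always yields the same bytes. */
    static byte[] seededData(int size, long seed) {
        byte[] data = new byte[size];
        new Random(seed).nextBytes(data);
        return data;
    }

    /** Writes data in writeSize chunks through a stream with the given internal buffer size. */
    static Result run(byte[] data, int writeSize, int streamBufSize) {
        CountingDeflater deflater = new CountingDeflater();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long start = System.nanoTime();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink, deflater, streamBufSize)) {
            for (int off = 0; off < data.length; off += writeSize) {
                out.write(data, off, Math.min(writeSize, data.length - off));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            deflater.end(); // release native zlib resources
        }
        long nanos = System.nanoTime() - start;
        return new Result(nanos, deflater.calls, sink.toByteArray());
    }

    /** Decompresses zlib-format bytes, used only to check the output is valid. */
    static byte[] inflate(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = seededData(1 << 20, 0L); // seed 0, as in the test runs
        for (int bufSize : new int[] {64, 512, 8192}) {
            Result r = run(data, 4096, bufSize);
            System.out.println("buf=" + bufSize + " deflateCalls=" + r.deflateCalls
                    + " micros=" + (r.nanos / 1000) + " bytesOut=" + r.compressed.length);
        }
    }
}
```

A single pass like this is noisy (JIT warm-up, GC), so the timings are only 
indicative; repeating each configuration several times and discarding the 
first few runs would give more trustworthy numbers.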

I talked to Owen who mentioned that he had noticed similar flaky performance 
with the Deflater earlier...

Appreciate any code reviews/ideas etc.

Thoughts? 

> SequenceFile should support 'custom compressors'
> ------------------------------------------------
>
>                 Key: HADOOP-441
>                 URL: http://issues.apache.org/jira/browse/HADOOP-441
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: io
>            Reporter: Arun C Murthy
>         Assigned To: Arun C Murthy
>             Fix For: 0.6.0
>
>         Attachments: codec.patch, codec.patch, codec20060831.patch, 
> codec_updated_interfaces_20060830.patch, reports.tgz
>
>
> SequenceFiles should support 'custom compressors' which can be specified by 
> the user on creation of the file. 
> Readily available packages for gzip and zip (java.util.zip) are among the 
> obvious choices to support. Of course there will be hooks so that other compressors 
> can be added in future as long as there is a way to construct (input/output) 
> streams on top of the compressor/decompressor.
> The 'classname' of the 'custom compressor/decompressor' could be stored in 
> the header of the SequenceFile which can then be used by SequenceFile.Reader 
> to figure out the appropriate 'decompressor'. Thus I propose we add 
> constructors to SequenceFile.Writer which take in the 'classname' of the 
> compressor's input/output stream classes (e.g. 
> DeflaterOutputStream/InflaterInputStream or 
> GZIPOutputStream/GZIPInputStream), which acts as the hook for future 
> compressors/decompressors.
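
The classname-based hook described above could be sketched roughly as 
follows. This is only an illustration of the idea, not the API in the 
attached patch: StreamCodecByName, wrapOutput, wrapInput and roundTrip are 
hypothetical names, the SequenceFile header is stood in for by plain String 
arguments, and the sketch assumes each stream class has a public 
single-argument constructor taking an OutputStream or InputStream (true for 
the java.util.zip classes named in the description).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StreamCodecByName {
    /** Wraps raw in the compressing stream named by className (the name a writer would store in the header). */
    static OutputStream wrapOutput(String className, OutputStream raw) {
        try {
            return Class.forName(className)
                    .asSubclass(OutputStream.class)
                    .getConstructor(OutputStream.class)
                    .newInstance(raw);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot build compressor stream " + className, e);
        }
    }

    /** Wraps raw in the decompressing stream named by className (the name a reader would find in the header). */
    static InputStream wrapInput(String className, InputStream raw) {
        try {
            return Class.forName(className)
                    .asSubclass(InputStream.class)
                    .getConstructor(InputStream.class)
                    .newInstance(raw);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot build decompressor stream " + className, e);
        }
    }

    /** Compresses payload with one named class and decompresses it with the other. */
    static byte[] roundTrip(String outClass, String inClass, byte[] payload) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (OutputStream out = wrapOutput(outClass, sink)) {
                out.write(payload);
            }
            ByteArrayOutputStream restored = new ByteArrayOutputStream();
            try (InputStream in = wrapInput(inClass, new ByteArrayInputStream(sink.toByteArray()))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    restored.write(buf, 0, n);
                }
            }
            return restored.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "some record bytes".getBytes(StandardCharsets.UTF_8);
        byte[] viaDeflate = roundTrip("java.util.zip.DeflaterOutputStream",
                "java.util.zip.InflaterInputStream", payload);
        byte[] viaGzip = roundTrip("java.util.zip.GZIPOutputStream",
                "java.util.zip.GZIPInputStream", payload);
        System.out.println("deflate round trip ok: " + Arrays.equals(payload, viaDeflate));
        System.out.println("gzip round trip ok: " + Arrays.equals(payload, viaGzip));
    }
}
```

Storing only class names keeps the file format independent of any particular 
compressor, at the cost of requiring the named classes to be on the reader's 
classpath.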

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
