[ http://issues.apache.org/jira/browse/HADOOP-441?page=all ]
Arun C Murthy updated HADOOP-441:
---------------------------------

    Attachment: codec.patch
                reports.tgz

Here's the patch for custom codecs (SequenceFile v5).

I've hit a potential red flag: 'writes' to 'block compressed' SequenceFiles through the new custom codec framework are ~10%-15% slower than in version 4 (i.e. SEQ4). 'Writes' to non-compressed/record-compressed SequenceFiles hold up very well, and 'reads' of all types of SequenceFiles are also fine.

I ran an evaluation version of JProbe's profiler on both v4 and v5 of SequenceFile, and the results are very surprising. I have attached the detailed summaries (reports.tgz) for the command:

$ java org.apache.hadoop.io.TestSequenceFile -local -count {10000 - 10000000 i} -rwonly file_bc.seq -compressType BLOCK

Each run wrote exactly the same data (RandomDatum's generator was seeded with '0' in all cases).

Summarising: Deflater.deflateBytes (a native JNI call, reached via DeflaterOutputStream.write -> Deflater.deflate -> Deflater.deflateBytes) seems to perform very differently in v5.

a) In both v4 and v5, exactly the same number of calls are made to SequenceFile.BlockCompressedWriter.writeBlock -> DeflaterOutputStream.write.
b) In v4 there are slightly _more_ calls to Deflater.deflate, and hence to Deflater.deflateBytes.
c) Yet v5 (with _fewer_ calls) is slower, because each Deflater.deflateBytes call takes longer to execute!

And this is completely reproducible.

I talked to Owen, who mentioned that he had noticed similarly flaky performance with the Deflater earlier...

I'd appreciate any code reviews, ideas, etc. Thoughts?
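One way to look at the Deflater call counts and timings outside Hadoop is a small standalone harness. The following is only a sketch under stated assumptions: DeflateProbe, CountingDeflater, and the buffer sizes are hypothetical names and values chosen for illustration and are not part of codec.patch or TestSequenceFile. It wraps java.util.zip.Deflater to count and time calls to deflate(byte[], int, int), then pushes the same seeded random bytes through DeflaterOutputStream with different stream buffer sizes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

/** Hypothetical standalone probe; not part of Hadoop or the attached patch. */
final class DeflateProbe {

    /** Deflater that counts and times calls to deflate(byte[], int, int). */
    static final class CountingDeflater extends Deflater {
        long calls;
        long nanos;

        @Override
        public int deflate(byte[] b, int off, int len) {
            long start = System.nanoTime();
            try {
                return super.deflate(b, off, len);
            } finally {
                calls++;
                nanos += System.nanoTime() - start;
            }
        }
    }

    /** Outcome of one run: call count, time spent inside deflate, compressed bytes. */
    static final class Result {
        final long calls;
        final long nanos;
        final byte[] compressed;

        Result(long calls, long nanos, byte[] compressed) {
            this.calls = calls;
            this.nanos = nanos;
            this.compressed = compressed;
        }
    }

    /** Compresses inputBytes of seeded random data using the given stream buffer size. */
    static Result run(int inputBytes, int streamBufferBytes) {
        byte[] data = new byte[inputBytes];
        new Random(0).nextBytes(data); // fixed seed so every run sees identical input

        CountingDeflater deflater = new CountingDeflater();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out =
                 new DeflaterOutputStream(sink, deflater, streamBufferBytes)) {
            out.write(data, 0, data.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            deflater.end(); // a caller-supplied Deflater is not released by close()
        }
        return new Result(deflater.calls, deflater.nanos, sink.toByteArray());
    }

    public static void main(String[] args) {
        // Buffer sizes are arbitrary illustration values, not those used by v4 or v5.
        for (int buffer : new int[] {512, 4096, 65536}) {
            Result r = run(1 << 20, buffer);
            System.out.printf("buffer=%d calls=%d deflate-ms=%.2f compressed=%d%n",
                buffer, r.calls, r.nanos / 1e6, r.compressed.length);
        }
    }
}
```

Timings from a single pass like this are noisy (JIT warm-up, GC), so the numbers are only a rough indication. Whether buffer sizing has anything to do with the v4/v5 difference is an open question that the harness does not settle.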
> SequenceFile should support 'custom compressors'
> ------------------------------------------------
>
>                 Key: HADOOP-441
>                 URL: http://issues.apache.org/jira/browse/HADOOP-441
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: io
>            Reporter: Arun C Murthy
>         Assigned To: Arun C Murthy
>             Fix For: 0.6.0
>
>         Attachments: codec.patch, codec.patch, codec20060831.patch,
> codec_updated_interfaces_20060830.patch, reports.tgz
>
>
> SequenceFiles should support 'custom compressors' which can be specified by
> the user on creation of the file.
> Readily available packages for gzip and zip (java.util.zip) are among obvious
> choices to support. Of course there will be hooks so that other compressors
> can be added in future as long as there is a way to construct (input/output)
> streams on top of the compressor/decompressor.
> The 'classname' of the 'custom compressor/decompressor' could be stored in
> the header of the SequenceFile which can then be used by SequenceFile.Reader
> to figure out the appropriate 'decompressor'. Thus I propose we add
> constructors to SequenceFile.Writer which take in the 'classname' of the
> compressor's input/output stream classes (e.g.
> DeflaterOutputStream/InflaterInputStream or
> GZIPOutputStream/GZIPInputStream), which acts as the hook for future
> compressors/decompressors.
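The class-name-in-header idea from the description can be illustrated with a small sketch. This is not the SequenceFile v5 format or the code in codec.patch: CodecHeaderSketch and its one-field header layout are hypothetical, and the sketch assumes that each stream class has a public single-argument constructor taking an OutputStream or InputStream (true for the java.util.zip classes named in the description).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical illustration; the header layout here is NOT the SequenceFile format. */
final class CodecHeaderSketch {

    /** Writes the decompressor class name as a header, then the compressed payload. */
    static byte[] write(String outputStreamClass, String inputStreamClass, byte[] payload) {
        try {
            ByteArrayOutputStream file = new ByteArrayOutputStream();
            DataOutputStream header = new DataOutputStream(file);
            header.writeUTF(inputStreamClass); // the reader uses this to pick a decompressor
            header.flush();

            // Build the compressing stream reflectively from its class name.
            Class<? extends OutputStream> cls =
                Class.forName(outputStreamClass).asSubclass(OutputStream.class);
            try (OutputStream out =
                     cls.getConstructor(OutputStream.class).newInstance(file)) {
                out.write(payload);
            }
            return file.toByteArray();
        } catch (ReflectiveOperationException | IOException e) {
            throw new IllegalStateException("cannot write with " + outputStreamClass, e);
        }
    }

    /** Reads the class name from the header and decompresses the rest with it. */
    static byte[] read(byte[] file) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
            String inputStreamClass = in.readUTF();

            Class<? extends InputStream> cls =
                Class.forName(inputStreamClass).asSubclass(InputStream.class);
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            try (InputStream decoded =
                     cls.getConstructor(InputStream.class).newInstance(in)) {
                byte[] chunk = new byte[4096];
                int n;
                while ((n = decoded.read(chunk)) != -1) {
                    result.write(chunk, 0, n);
                }
            }
            return result.toByteArray();
        } catch (ReflectiveOperationException | IOException e) {
            throw new IllegalStateException("cannot read file", e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "some record data".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] file = write("java.util.zip.GZIPOutputStream",
                            "java.util.zip.GZIPInputStream", payload);
        System.out.println(java.util.Arrays.equals(payload, read(file)));
    }
}
```

Passing class names rather than instances keeps the writer and reader independent of any particular codec, which is the hook the description asks for. A real implementation would also need to validate the class name read from the header before loading it.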