[jira] Commented: (HADOOP-54) SequenceFile should compress blocks, not individual entries

Owen O'Malley (JIRA) Mon, 24 Jul 2006 21:31:38 -0700

    [ http://issues.apache.org/jira/browse/HADOOP-54?page=comments#action_12423227 ]

Owen O'Malley commented on HADOOP-54:
-------------------------------------


My point is that the raw bytes are useless except in their original context.

Say my value is compressed as the byte stream 12, 34, 56, 78.
If I'm merging 100 files, I can't just write 12, 34, 56, 78 to the output file 
and expect it to work, because the compressed bytes depend on the state the 
compressor was in when it wrote them.
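
To make the state dependence concrete, here is a minimal Python sketch using 
the standard zlib module in raw deflate mode (the algorithm underneath gzip). 
It is not Hadoop code: the sample value and the helper names compress_stream 
and inflate_alone are made up for illustration. It compresses the same value 
twice through one shared compressor and then tries to inflate each chunk on 
its own.

```python
import zlib

def compress_stream(values):
    """Compress all values with ONE shared compressor (hypothetical helper).

    A sync flush after each value ends it on a byte boundary, so the
    compressed bytes of each value can be sliced out individually.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 = raw deflate
    return [comp.compress(v) + comp.flush(zlib.Z_SYNC_FLUSH) for v in values]

def inflate_alone(chunk):
    """Inflate a chunk with a fresh decompressor that has no history."""
    d = zlib.decompressobj(-15)
    try:
        return d.decompress(chunk)
    except zlib.error:
        return None  # the chunk refers to data this decompressor never saw

value = b"<html><body>The quick brown fox jumps over the lazy dog</body></html>"

# The same value, written first and then again second in the same stream.
first, second = compress_stream([value, value])

same_bytes = (first == second)        # identical input, different output
alone_first = inflate_alone(first)    # works: nothing came before it
alone_second = inflate_alone(second)  # fails: it points back into `first`

# With the original context restored, both values decode fine.
d = zlib.decompressobj(-15)
in_context = d.decompress(first) + d.decompress(second)
```

The second chunk is mostly a back-reference into the window left behind by 
the first, which is why copying it verbatim into another file would not work.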

So your reference tuple would need to be:

<raw bytes, compressor class, compressor state>

where the compressor state is some compressor-specific data. In the case of 
gzip, it is the last 32 KB of decompressed bytes, or something like that.
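
As a rough sketch of what such a tuple would have to carry, the Python below 
models it for raw deflate using the standard zlib module. Everything named 
here (RawValueRef, resolve, the "raw-deflate" codec string) is hypothetical 
and not part of SequenceFile. It also assumes a Python 3 version whose zlib 
module accepts a preset dictionary (zdict) for raw deflate streams, and it 
relies on a sync flush to keep each value byte-aligned.

```python
import zlib
from typing import NamedTuple

class RawValueRef(NamedTuple):
    """Hypothetical tuple: <raw bytes, compressor class, compressor state>."""
    raw_bytes: bytes  # compressed bytes exactly as they sit in the source file
    codec: str        # stands in for the compressor class
    state: bytes      # for deflate: up to the last 32 KB of preceding output

def resolve(ref):
    """Recover the value by rebuilding the decompressor state first."""
    if ref.codec != "raw-deflate":
        raise ValueError("unknown codec: " + ref.codec)
    # Preload the sliding window with the history the writer had seen.
    d = zlib.decompressobj(-15, ref.state)
    return d.decompress(ref.raw_bytes)

# Build a two-value stream with one shared compressor.
value = b"<html><body>The quick brown fox jumps over the lazy dog</body></html>"
comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 = raw deflate
first = comp.compress(value) + comp.flush(zlib.Z_SYNC_FLUSH)
second = comp.compress(value) + comp.flush(zlib.Z_SYNC_FLUSH)

# The second value is only usable together with what preceded it.
ref = RawValueRef(raw_bytes=second, codec="raw-deflate", state=value[-32768:])
recovered = resolve(ref)
```

Note that the state here is as large as the value it helps decode, which is 
the complexity-versus-gain concern in a nutshell.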

And that assumes that no one ever tries to use a compression algorithm that 
emits partial bytes, so that a value can end in the middle of a byte.

It looks to me like you'd add a lot of complexity for very little gain. You'd 
only win if you had large compressed values that you didn't really need to look 
at or use for anything. (For example, if you wanted to take a table that was 
url -> html document and generate the number of urls in each domain.) 

> SequenceFile should compress blocks, not individual entries
> -----------------------------------------------------------
>
>                 Key: HADOOP-54
>                 URL: http://issues.apache.org/jira/browse/HADOOP-54
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.2.0
>            Reporter: Doug Cutting
>         Assigned To: Arun C Murthy
>             Fix For: 0.5.0
>
>         Attachments: VIntCompressionResults.txt
>
>
> SequenceFile will optionally compress individual values.  But both 
> compression and performance would be much better if sequences of keys and 
> values are compressed together.  Sync marks should only be placed between 
> blocks.  This will require some changes to MapFile too, so that all file 
> positions stored there are the positions of blocks, not entries within 
> blocks.  Probably this can be accomplished by adding a 
> getBlockStartPosition() method to SequenceFile.Writer.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
