[ 
https://issues.apache.org/jira/browse/HDFS-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057351#comment-13057351
 ] 

Todd Lipcon commented on HDFS-2115:
-----------------------------------

I'm thinking something like the following:
- DFSClient can optionally specify a compression codec when writing a file. If 
specified, each "packet" in the write pipeline will be compressed with that 
codec.
- DataNode uses a special header in the block meta file to indicate that the 
block is compressed with the given codec.
- To facilitate random access, an index file is kept (either separately or as 
part of the block meta file) which contains pairs of (uncompressed offset, 
compressed offset). A reader can binary search this index to find the 
compression block containing a given uncompressed offset.
- DFSClient reader is modified to support decompression on the client side.
- Some handshaking will be necessary in case the sets of codecs available on 
the client and server differ.
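
To make the index idea concrete, here is a rough sketch of the lookup a reader 
might do when seeking. Nothing here is existing HDFS code: the class name 
CompressedBlockIndex, its methods, and the in-memory layout (two parallel 
sorted arrays) are all hypothetical, and it assumes each index entry marks the 
start of one independently decompressible compression block.

```java
import java.util.Arrays;

/**
 * Hypothetical in-memory form of the proposed index: entry i says that the
 * compression block starting at uncompressedOffsets[i] in the logical data
 * starts at compressedOffsets[i] in the stored (compressed) block file.
 * Both arrays are assumed to be sorted ascending.
 */
public class CompressedBlockIndex {
    private final long[] uncompressedOffsets;
    private final long[] compressedOffsets;

    public CompressedBlockIndex(long[] uncompressedOffsets, long[] compressedOffsets) {
        if (uncompressedOffsets.length != compressedOffsets.length) {
            throw new IllegalArgumentException("index arrays must have equal length");
        }
        this.uncompressedOffsets = uncompressedOffsets.clone();
        this.compressedOffsets = compressedOffsets.clone();
    }

    /** Position of the entry whose compression block contains the given offset. */
    public int entryFor(long uncompressedOffset) {
        if (uncompressedOffsets.length == 0 || uncompressedOffset < uncompressedOffsets[0]) {
            throw new IllegalArgumentException("offset precedes first index entry");
        }
        int pos = Arrays.binarySearch(uncompressedOffsets, uncompressedOffset);
        // On a miss, binarySearch returns -(insertionPoint) - 1. The containing
        // block is the entry just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Where to start reading in the compressed data. */
    public long compressedOffsetFor(long uncompressedOffset) {
        return compressedOffsets[entryFor(uncompressedOffset)];
    }

    /** How many decompressed bytes to discard after decompressing from that point. */
    public long skipWithinBlock(long uncompressedOffset) {
        return uncompressedOffset - uncompressedOffsets[entryFor(uncompressedOffset)];
    }
}
```

For example, with entries (0, 0), (65536, 20000), (131072, 41000), a seek to 
uncompressed offset 70000 would start reading at compressed offset 20000 and 
discard the first 4464 decompressed bytes. The sketch does not check whether 
the offset is past the end of the last compression block; a real reader would 
need the block length for that.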

Any thoughts on this? Not sure when I'd have time to work on it, but worth 
starting some brainstorming.

> Transparent compression in HDFS
> -------------------------------
>
>                 Key: HDFS-2115
>                 URL: https://issues.apache.org/jira/browse/HDFS-2115
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: data-node, hdfs client
>            Reporter: Todd Lipcon
>
> In practice, we find that a lot of users store text data in HDFS without 
> using any compression codec. Improving usability of compressible formats like 
> Avro/RCFile helps with this, but we could also help many users by providing 
> an option to transparently compress data as it is stored.


        
