[ https://issues.apache.org/jira/browse/HDFS-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057351#comment-13057351 ]
Todd Lipcon commented on HDFS-2115:
-----------------------------------

I'm thinking something like the following:

- DFSClient can optionally specify a compression codec when writing a file. If specified, each "packet" in the write pipeline will be compressed with that codec.
- DataNode uses a special header in the block meta file to indicate that the block is compressed with the given codec.
- To facilitate random access, an index file is kept (either separately or as part of the block meta file) which contains pairs of (uncompressed offset, compressed offset). This allows a binary search to locate the compression block containing a given offset (see the sketch below).
- DFSClient reader is modified to support decompression on the client side.
- Some handshaking will be necessary in case the sets of codecs available on the client and server differ.

Any thoughts on this? Not sure when I'd have time to work on it, but worth starting some brainstorming.

> Transparent compression in HDFS
> -------------------------------
>
>                 Key: HDFS-2115
>                 URL: https://issues.apache.org/jira/browse/HDFS-2115
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: data-node, hdfs client
>            Reporter: Todd Lipcon
>
> In practice, we find that a lot of users store text data in HDFS without using any compression codec. Improving usability of compressible formats like Avro/RCFile helps with this, but we could also help many users by providing an option to transparently compress data as it is stored.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
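As a rough illustration of the offset-index idea proposed in the comment above, here is a minimal sketch of what the client-side lookup could look like. The class and member names (CompressedBlockIndex, IndexEntry, findBlock) are hypothetical, not part of any existing HDFS API, and the sketch assumes the index always begins with an entry at uncompressed offset 0:

{code:java}
import java.util.List;

/**
 * Hypothetical in-memory view of the per-block index of
 * (uncompressed offset, compressed offset) pairs. Each entry marks the
 * start of one compression block within the HDFS block.
 */
public class CompressedBlockIndex {

  public static class IndexEntry {
    final long uncompressedOffset;
    final long compressedOffset;

    IndexEntry(long uncompressedOffset, long compressedOffset) {
      this.uncompressedOffset = uncompressedOffset;
      this.compressedOffset = compressedOffset;
    }
  }

  // Entries sorted by uncompressedOffset, first entry at offset 0.
  private final List<IndexEntry> entries;

  public CompressedBlockIndex(List<IndexEntry> entries) {
    this.entries = entries;
  }

  /**
   * Binary-search for the compression block containing the given
   * uncompressed offset, i.e. the last entry whose uncompressedOffset
   * is <= the target. The reader would seek to the returned
   * compressedOffset and start decompressing from there.
   */
  public IndexEntry findBlock(long uncompressedOffset) {
    int lo = 0, hi = entries.size() - 1;
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1;  // bias upward so lo can advance
      if (entries.get(mid).uncompressedOffset <= uncompressedOffset) {
        lo = mid;       // this block starts at or before the target
      } else {
        hi = mid - 1;   // this block starts past the target
      }
    }
    return entries.get(lo);
  }
}
{code}

A reader seeking to an arbitrary file position would consult this index once per seek, then decompress forward from the returned compressed offset until it reaches the requested uncompressed offset.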