We have discussed this in the past. I think the single biggest issue is that HDFS doesn't understand the schema of the data stored in it. So it may not know what compression scheme would be most appropriate for a given application and dataset.
While it is true that HDFS doesn't allow random writes, it does allow random reads. In fact, HDFS currently supports a very low-cost seek operation while reading an input stream. Compression would greatly increase the cost of seeking. Of course, the size of the increase would depend on the kind of compression used: "chunk-based" schemes that keep some kind of side index could seek more efficiently.

Because HDFS transparent encryption is client-side, it will not work unless compression is client-side as well: compressing encrypted data provides no space savings. But a client-side scheme loses some of the benefits of doing compression in HDFS, like the ability to cache the uncompressed data on the DataNode.

I think any project along these lines should start with a careful analysis of what the goals are and what advantages the scheme has over the current client-side compression.

best,
Colin

On Mon, Jul 4, 2016, at 07:16, Robert James wrote:
> A lot of work in Hadoop concerns splittable compression. Could this
> be solved by offering compression at the HDFS block (ie 64 MB) level,
> just like many OS filesystems do?
>
> http://stackoverflow.com/questions/6511255/why-cant-hadoop-split-up-a-large-text-file-and-then-compress-the-splits-using-g?rq=1
> discusses this and suggests the issue is separation of concerns.
> However, if the compression is done at the *HDFS block* level (with
> perhaps a single flag indicating such), this would be totally
> transparent to readers and writers. This is the exact way, for
> example, NTFS compression works; apps need no knowledge of the
> compression. HDFS, since it doesn't allow random reads and writes,
> but only streaming, is a perfect candidate for this.
>
> Thoughts?
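To make the "chunk-based scheme with a side index" idea concrete, here is a minimal sketch in Python. It is purely illustrative, not any actual HDFS or Hadoop API: the chunk size, function names, and index layout are all assumptions. Each chunk is compressed independently, and a side index of (uncompressed offset, compressed offset) pairs lets a reader seek by decompressing only the chunks that cover the requested range.

```python
import bisect
import zlib

def compress_chunked(data, chunk_size):
    """Compress data in independent fixed-size chunks.

    Returns (blob, index) where index[i] is the pair
    (uncompressed_offset, compressed_offset) of chunk i.
    """
    blob = bytearray()
    index = []
    for off in range(0, len(data), chunk_size):
        index.append((off, len(blob)))
        # Each chunk is a complete, standalone zlib stream.
        blob += zlib.compress(data[off:off + chunk_size])
    return bytes(blob), index

def read_at(blob, index, pos, n):
    """Read n uncompressed bytes starting at pos.

    Only the chunks overlapping [pos, pos + n) are decompressed,
    which is what keeps seeks cheap.
    """
    out = bytearray()
    # Binary-search the index for the chunk containing pos.
    starts = [u for u, _ in index]
    i = bisect.bisect_right(starts, pos) - 1
    while len(out) < n and 0 <= i < len(index):
        u_off, c_off = index[i]
        c_end = index[i + 1][1] if i + 1 < len(index) else len(blob)
        chunk = zlib.decompress(blob[c_off:c_end])
        start = pos + len(out) - u_off
        out += chunk[start:start + (n - len(out))]
        i += 1
    return bytes(out)
```

A real design would also have to persist the index alongside the block and pick a chunk size that balances compression ratio (larger chunks) against seek granularity (smaller chunks); a demo-sized chunk of a few bytes is used below only to exercise the chunk boundaries.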
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
> ---------------------------------------------------------------------