We have discussed this in the past.  I think the single biggest issue is
that HDFS doesn't understand the schema of the data which is stored in
it.  So it may not be aware of what compression scheme would be most
appropriate for the application and data.

While it is true that HDFS doesn't allow random writes, it does allow
random reads.  In fact, HDFS currently supports a very low-cost seek
operation while reading an input stream.  Compression would increase the
cost of seeking greatly.  Of course, the cost increase would depend on
the kind of compression used: "chunk-based" schemes that keep some kind
of side index could be more efficient at seeking.
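
To make the chunk-based idea concrete, here is a minimal Python sketch.
It is not HDFS code; the names (compress_chunked, read_at) and the 64 KB
chunk size are made up for illustration, and zlib just stands in for
whatever codec would be used.  Each chunk is compressed independently,
and a side index records where each compressed chunk starts, so a seek
only has to decompress the chunks that cover the requested range.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, not an HDFS setting


def compress_chunked(data, chunk_size=CHUNK_SIZE):
    """Compress data in independent chunks and build a side index.

    Returns (blob, index). index[i] is the offset in blob where compressed
    chunk i starts; the final entry equals len(blob).
    """
    parts = []
    index = [0]
    for start in range(0, len(data), chunk_size):
        comp = zlib.compress(data[start:start + chunk_size])
        parts.append(comp)
        index.append(index[-1] + len(comp))
    return b"".join(parts), index


def read_at(blob, index, offset, length, chunk_size=CHUNK_SIZE):
    """Read `length` uncompressed bytes starting at uncompressed `offset`."""
    out = bytearray()
    chunk = offset // chunk_size   # first chunk covering the offset
    skip = offset % chunk_size     # bytes to skip inside that chunk
    while length > 0 and chunk < len(index) - 1:
        # Only the chunks overlapping the requested range are decompressed.
        raw = zlib.decompress(blob[index[chunk]:index[chunk + 1]])
        piece = raw[skip:skip + length]
        out += piece
        length -= len(piece)
        skip = 0
        chunk += 1
    return bytes(out)
```

Even in this form a seek costs at least one chunk decompression, which
is still far more than the seek HDFS does today on uncompressed data.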

Because HDFS transparent encryption is client-side, compression will
not work with it unless compression is client-side as well.  The reason
is that the data has to be compressed before it is encrypted:
compressing encrypted data provides no space savings.  But a client-side
scheme loses some of the benefits of doing compression in HDFS, like the
ability to cache the uncompressed data in the DataNode.
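
The point about encrypted data is easy to check with a small Python
sketch.  This is only an illustration: os.urandom is used as a stand-in
for ciphertext (an assumption, on the grounds that good encryption
output looks like random bytes), and the sample log line is made up.

```python
import os
import zlib


def compression_ratio(data):
    """Return compressed size divided by original size."""
    return len(zlib.compress(data)) / len(data)


# Repetitive plaintext, similar in spirit to log data.
plaintext = b"2016-07-04 07:16:00 INFO block report received\n" * 2000

# Stand-in for encrypted data: random bytes of the same length.
ciphertext_like = os.urandom(len(plaintext))

print("plaintext ratio:       %.3f" % compression_ratio(plaintext))
print("ciphertext-like ratio: %.3f" % compression_ratio(ciphertext_like))
```

The plaintext shrinks to a small fraction of its size, while the
random bytes do not shrink at all.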

I think any project along these lines should start with a careful
analysis of what the goals are and what advantages the scheme has over
the current client-side compression.

best,
Colin


On Mon, Jul 4, 2016, at 07:16, Robert James wrote:
> A lot of work in Hadoop concerns splittable compression.  Could this
> be solved by offering compression at the HDFS block (i.e. 64 MB) level,
> just like many OS filesystems do?
> 
> http://stackoverflow.com/questions/6511255/why-cant-hadoop-split-up-a-large-text-file-and-then-compress-the-splits-using-g?rq=1
> discusses this and suggests the issue is separation of concerns.
> However, if the compression is done at the *HDFS block* level (with
> perhaps a single flag indicating such), this would be totally
> transparent to readers and writers.  This is the exact way, for
> example, NTFS compression works; apps need no knowledge of the
> compression.  HDFS, since it doesn't allow random reads and writes,
> but only streaming, is a perfect candidate for this.
> 
> Thoughts?
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
> 
