[ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816356#comment-15816356
 ] 

Ravi Prakash commented on HADOOP-13114:
---------------------------------------

Thanks Koji! I was under the impression that even binary files could be 
compressed quite well. For example, if I compress /usr/bin/xsane (a binary file):
{code}
[raviprak@ravi ~]$ ls -alh xsane.gz 
-rwxr-xr-x 1 raviprak raviprak 298K Jan 10 11:06 xsane.gz
[raviprak@ravi ~]$ ls -alh /usr/bin/xsane
-rwxr-xr-x 1 root root 744K Feb  5  2016 /usr/bin/xsane
{code}
The question is how many "binary" files we expect to be on HDFS, but answering 
that means making assumptions about Hadoop's use cases, and I'm not sure I want 
to hazard that. I'm sorry if I misunderstand you. Could you please elucidate 
your concern if it's not that?
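
As a rough illustration of how compressibility depends on the content rather than on a file being "binary", here is a minimal Python sketch. It uses Python's {{gzip}} module, not a Hadoop codec, and both sample inputs are made up for illustration:

```python
import gzip
import os


def gzip_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    if not data:
        return 1.0
    return len(gzip.compress(data)) / len(data)


# Hypothetical sample: binary but highly repetitive bytes.
structured = b"\x7fELF" + b"\x00\x01\x02\x03" * 50_000
# Hypothetical sample: random bytes, standing in for data that is
# already compressed or encrypted.
random_like = os.urandom(200_000)

print(f"structured: {gzip_ratio(structured):.3f}")
print(f"random:     {gzip_ratio(random_like):.3f}")
```

The repetitive input shrinks to a small fraction of its size, while the random input does not shrink at all.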

Thanks Nathan! I am ambivalent about this myself. Ideally we'd want to compress 
during transit (like {{rsync -z}}), but this JIRA was split out of that desire 
(from HADOOP-8065). For a variety of reasons, HADOOP-8065 has been requested by 
a lot of _our_ customers (in addition to the Hadoop users you can see in the 
voters and watchers lists). Also, a few first-time contributors went above and 
beyond on this JIRA.

bq. What happens if we run the command with compression twice? distcp a->b, 
then b->c? I'm assuming c is a compressed version of b which is a compressed 
version of a. In order to read we'd have to unwind both layers of compression. 
Seems strange and really easy to accidentally have this happen.
You are right that compressed files would be nested, one inside the other. 
Compression tools would nest the same way, wouldn't they? So I'm not sure it can 
be helped. And if I had checked the compression status, I'm sure someone would 
pipe up and say that I should have been nesting ;-) Perhaps yet another flag?
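
To make the nesting concrete, here is a minimal Python sketch of the a->b->c case and of what "checking the compression status" could look like. It uses Python's {{gzip}} module and the gzip magic bytes as a stand-in; it is not how DistCp or the Hadoop codecs behave, and the helper names are hypothetical:

```python
import gzip
from typing import Tuple

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(data: bytes) -> bool:
    """Heuristic check only: uncompressed data could start with these bytes too."""
    return data[:2] == GZIP_MAGIC


def unwrap_all(data: bytes) -> Tuple[bytes, int]:
    """Peel gzip layers until the payload no longer looks gzipped."""
    layers = 0
    while looks_gzipped(data):
        data = gzip.decompress(data)
        layers += 1
    return data, layers


a = b"some record\n" * 1000
b = gzip.compress(a)  # first copy with compression: a -> b
c = gzip.compress(b)  # second copy with compression: b -> c

payload, layers = unwrap_all(c)
print(layers, payload == a)
```

A reader of c has to decompress twice to get back to a, and a check like {{looks_gzipped}} is what a "skip if already compressed" flag would need before writing.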

bq. Obvious question is: "if it's valuable to compress, why wasn't it 
compressed in the first place?"
In my experience, sometimes the source Hadoop cluster is not under the control 
of the copier, or has a lot more capacity (so compression there is not a 
concern). Sometimes the source is written by IoT objects into a staging area, 
and rather than have a separate job that compresses data, it'd be helpful to 
combine the copy with the compression. 

bq. Just the name bothers me a bit. copy commands don't normally transform 
data, but this one would.
Having said that, I do feel this argument is particularly compelling. I am not 
sure if this would be breaking precedent, considering there is {{--append}}, 
which is not exactly a "copy" either, but I do agree with your concern.

For now I will stop work on this JIRA unless I hear from a few more diverse 
viewpoints.

> DistCp should have option to compress data on write
> ---------------------------------------------------
>
>                 Key: HADOOP-13114
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13114
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Suraj Nayak
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>         Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, 
> HADOOP-13114-trunk_2016-05-08-1.patch, HADOOP-13114-trunk_2016-05-10-1.patch, 
> HADOOP-13114-trunk_2016-05-12-1.patch, HADOOP-13114.05.patch, 
> HADOOP-13114.06.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The DistCp utility should be able to store data in a user-specified 
> compression format. This avoids the extra hop of compressing data after 
> transfer. Backup strategies that target a different cluster also benefit by 
> saving one I/O operation to and from HDFS, thus saving resources, time and effort.
> * Create an option -compressOutput defaulting to 
> {{org.apache.hadoop.io.compress.BZip2Codec}}. 
> * Users will be able to change codec with {{-D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If DistCp compression is enabled, suffix the filenames with the codec's 
> default extension to indicate that the file is compressed, so users can tell 
> which codec was used to compress the data.



