On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <[email protected]> wrote:
> Hi there,
>
> The current documentation says:
>
>     By default, data is not compressed. You can compress your data by using
>     the deflate (gzip) algorithm with the -z or --compress argument, or
>     specify any Hadoop compression codec using the --compression-codec
>     argument. This applies to both SequenceFiles or text files.
>
> But I think this is a bit misleading.
>
> Currently, if output compression is enabled in a cluster, then the Sqooped
> data is always compressed, regardless of the setting of this flag.
>
> It seems better to actually make compression controllable via --compress,
> which means changing ImportJobBase.configureOutputFormat():
>
>     if (options.shouldUseCompression()) {
>       FileOutputFormat.setCompressOutput(job, true);
>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>       SequenceFileOutputFormat.setOutputCompressionType(job,
>           CompressionType.BLOCK);
>     }
>     // new stuff
>     else {
>       FileOutputFormat.setCompressOutput(job, false);
>     }
>
> Thoughts?
This is a good point, Ken. However, IMO it is better left as is, since there may be a wider cluster management policy in effect that requires compression for all output files. One way to look at it is that for normal use there is a predefined compression scheme configured cluster-wide, and, when required, Sqoop users can opt into a different scheme.

Thanks,
Arvind

> --
> Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
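To make the trade-off concrete, here is a minimal, self-contained sketch (hypothetical class and method names, not the actual Sqoop or Hadoop classes) modeling the two behaviors under discussion: the job configuration is seeded with the cluster-wide default, and Ken's proposed else branch explicitly sets compression off when --compress was not given, overriding that default.

```java
import java.util.HashMap;
import java.util.Map;

public class CompressionConfigSketch {

    // Simulates the job configuration that configureOutputFormat() would
    // produce. Keys mirror the Hadoop properties, but this is a model only.
    static Map<String, String> configureOutput(boolean clusterDefaultCompress,
                                               boolean userRequestedCompress) {
        Map<String, String> conf = new HashMap<>();
        // The cluster-wide policy applies first.
        conf.put("mapred.output.compress",
                 String.valueOf(clusterDefaultCompress));

        if (userRequestedCompress) {
            // Equivalent of the existing branch: setCompressOutput(job, true),
            // gzip codec, block-level compression for SequenceFiles.
            conf.put("mapred.output.compress", "true");
            conf.put("mapred.output.compression.codec",
                     "org.apache.hadoop.io.compress.GzipCodec");
            conf.put("mapred.output.compression.type", "BLOCK");
        } else {
            // Ken's proposed else branch: explicitly disable compression,
            // which overrides the cluster default.
            conf.put("mapred.output.compress", "false");
        }
        return conf;
    }

    public static void main(String[] args) {
        // Cluster compresses by default, user did not pass --compress:
        // under the proposal, output is uncompressed.
        Map<String, String> conf = configureOutput(true, false);
        System.out.println(conf.get("mapred.output.compress"));
    }
}
```

Arvind's point is that the else branch is exactly what a cluster-wide policy would object to: without it, the first line (the cluster default) wins whenever the user says nothing.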
