Controlling compression during import

Ken Krugler Sun, 04 Sep 2011 15:50:23 -0700

Hi there,

The current documentation says:
> By default, data is not compressed. You can compress your data by using the 
> deflate (gzip) algorithm with the -z or --compress argument, or specify any 
> Hadoop compression codec using the --compression-codec argument. This applies 
> to both SequenceFiles or text files.
> 
But I think this is a bit misleading.


Currently, if output compression is enabled in a cluster, then the Sqooped data 
is always compressed, regardless of the setting of this flag.

It seems better to make compression actually controllable via --compress, which 
means changing ImportJobBase.configureOutputFormat() as follows:

    if (options.shouldUseCompression()) {
      FileOutputFormat.setCompressOutput(job, true);
      FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
      SequenceFileOutputFormat.setOutputCompressionType(job,
          CompressionType.BLOCK);
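    } else {
      // New: explicitly disable compression when --compress is not given,
      // so that a cluster-wide default cannot turn it back on.
      FileOutputFormat.setCompressOutput(job, false);
    }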
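
To illustrate why the flag has no effect today, here is a minimal, self-contained 
sketch. It is not Sqoop or Hadoop code: a plain Map stands in for the job 
configuration, and the class and method names (CompressionSketch, 
configureCurrent, configureProposed) are made up for this example. The only real 
name is the property key, which I believe is the one that 
FileOutputFormat.setCompressOutput() sets in Hadoop 0.20.x.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the flag handling; a Map stands in for the Hadoop job configuration. */
public class CompressionSketch {
    // Assumed to be the key that FileOutputFormat.setCompressOutput() writes (Hadoop 0.20.x).
    public static final String KEY = "mapred.output.compress";

    /** Builds a configuration as the cluster's site files would provide it. */
    public static Map<String, String> clusterConf(boolean compressByDefault) {
        Map<String, String> conf = new HashMap<>();
        conf.put(KEY, Boolean.toString(compressByDefault));
        return conf;
    }

    /** Current behavior: the setting is only touched when --compress is given. */
    public static void configureCurrent(Map<String, String> conf, boolean useCompression) {
        if (useCompression) {
            conf.put(KEY, "true");
        }
        // No else branch, so a cluster default of "true" survives untouched.
    }

    /** Proposed behavior: the setting always follows the --compress flag. */
    public static void configureProposed(Map<String, String> conf, boolean useCompression) {
        conf.put(KEY, Boolean.toString(useCompression));
    }

    /** Reads the effective setting, defaulting to uncompressed when the key is absent. */
    public static boolean isCompressed(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(KEY, "false"));
    }

    public static void main(String[] args) {
        // Cluster compresses output by default; the user did not pass --compress.
        Map<String, String> current = clusterConf(true);
        configureCurrent(current, false);
        System.out.println("current:  compressed=" + isCompressed(current));   // true

        Map<String, String> proposed = clusterConf(true);
        configureProposed(proposed, false);
        System.out.println("proposed: compressed=" + isCompressed(proposed));  // false
    }
}
```

With the cluster default set to true and no --compress flag, the current logic 
leaves compression on, while the proposed else branch turns it off.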

Thoughts?

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
