On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <[email protected]> wrote:
> Hi there,
>
> The current documentation says:
>
>     By default, data is not compressed. You can compress your data by using
>     the deflate (gzip) algorithm with the -z or --compress argument, or
>     specify any Hadoop compression codec using the --compression-codec
>     argument. This applies to both SequenceFiles or text files.
>
> But I think this is a bit misleading.
>
> Currently, if output compression is enabled in a cluster, then the Sqooped
> data is always compressed, regardless of the setting of this flag.
>
> It seems better to actually make compression controllable via --compress,
> which means changing ImportJobBase.configureOutputFormat():
>
>     if (options.shouldUseCompression()) {
>       FileOutputFormat.setCompressOutput(job, true);
>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>       SequenceFileOutputFormat.setOutputCompressionType(job,
>           CompressionType.BLOCK);
>     }
>     // new stuff
>     else {
>       FileOutputFormat.setCompressOutput(job, false);
>     }
>
> Thoughts?
This is a good point, Ken. However, IMO it is better left as is, since there may be a wider cluster management policy in effect that requires compression for all output files. One way to look at it is that for normal use there is a predefined compression scheme configured cluster-wide, and, when required, Sqoop users can opt into a different scheme.

Thanks,
Arvind

> --
> Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
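To make the trade-off concrete, here is a minimal, self-contained sketch (hypothetical class and method names, not the actual Sqoop or Hadoop classes) modeling the two behaviors under discussion: the job configuration is seeded with the cluster-wide default, and Ken's proposed else branch explicitly sets compression off when --compress was not given, overriding that default.

```java
import java.util.HashMap;
import java.util.Map;

public class CompressionConfigSketch {

    // Simulates the job configuration that configureOutputFormat() would
    // produce. Keys mirror the Hadoop properties, but this is a model only.
    static Map<String, String> configureOutput(boolean clusterDefaultCompress,
                                               boolean userRequestedCompress) {
        Map<String, String> conf = new HashMap<>();
        // The cluster-wide policy applies first.
        conf.put("mapred.output.compress",
                 String.valueOf(clusterDefaultCompress));

        if (userRequestedCompress) {
            // Equivalent of the existing branch: setCompressOutput(job, true),
            // gzip codec, block-level compression for SequenceFiles.
            conf.put("mapred.output.compress", "true");
            conf.put("mapred.output.compression.codec",
                     "org.apache.hadoop.io.compress.GzipCodec");
            conf.put("mapred.output.compression.type", "BLOCK");
        } else {
            // Ken's proposed else branch: explicitly disable compression,
            // which overrides the cluster default.
            conf.put("mapred.output.compress", "false");
        }
        return conf;
    }

    public static void main(String[] args) {
        // Cluster compresses by default, user did not pass --compress:
        // under the proposal, output is uncompressed.
        Map<String, String> conf = configureOutput(true, false);
        System.out.println(conf.get("mapred.output.compress"));
    }
}
```

Arvind's point is that the else branch is exactly what a cluster-wide policy would object to: without it, the first line (the cluster default) wins whenever the user says nothing.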
