On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:

> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <[email protected]> wrote:
>> Hi there,
>> The current documentation says:
>> 
>> By default, data is not compressed. You can compress your data by using the
>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>> Hadoop compression codec using the --compression-codec argument. This
>> applies to both SequenceFiles or text files.
>> 
>> But I think this is a bit misleading.
>> Currently if output compression is enabled in a cluster, then the Sqooped
>> data is always compressed, regardless of the setting of this flag.
>> It seems better to actually make compression controllable via --compress,
>> which means changing ImportJobBase.configureOutputFormat()
>>     if (options.shouldUseCompression()) {
>>       FileOutputFormat.setCompressOutput(job, true);
>>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>       SequenceFileOutputFormat.setOutputCompressionType(job,
>>           CompressionType.BLOCK);
>>     }
>>     // new stuff
>>     else {
>>       FileOutputFormat.setCompressOutput(job, false);
>>     }
>> Thoughts?
> 
> This is a good point Ken. However, IMO it is better left as is since
> there may be a wider cluster management policy in effect that requires
> compression for all output files. One way to look at it is that for
> normal use, there is a predefined compression scheme configured
> cluster wide, and occasionally when required, Sqoop users can use a
> different scheme where necessary.

The problem is that when you use text files as Sqoop output, these get 
compressed at the file level by (typically) deflate, gzip or lzo.

So you wind up with unsplittable files, which means that the degree of 
parallelism during the next step of processing is constrained by the number of 
mappers used during sqooping. But you typically set the number of mappers based 
on DB load & size of the data set.
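
To put rough numbers on that, here is a minimal sketch of how many input splits the next job would see. It assumes a 128 MB block size, four Sqoop mappers and 10 GB of output per mapper; the class and method names (SplitEstimate, splits) are made up for illustration. It is not Hadoop's actual FileInputFormat logic, only the approximation "one split per block for splittable files, one split per file otherwise".

```java
public class SplitEstimate {
    // Approximate split count for one file: a splittable file yields roughly
    // one split per block; an unsplittable (e.g. gzip) file always yields one.
    static long splits(long fileBytes, long blockBytes, boolean splittable) {
        if (!splittable) {
            return 1;
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024;       // assumed block size
        long file = 10L * 1024 * 1024 * 1024;  // assumed output size per mapper
        int mappers = 4;                       // assumed number of Sqoop mappers

        System.out.println("plain text splits: " + mappers * splits(file, block, true));
        System.out.println("gzip text splits:  " + mappers * splits(file, block, false));
    }
}
```

Under those assumptions the uncompressed text can be read by a few hundred map tasks, while the gzip'd text is limited to four, one per Sqoop mapper.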

And if partitioning isn't great, then you also wind up with heavily skewed 
sizes for these unsplittable files, which makes things even worse.

The current work-around is to use binary or Avro output instead of text, but 
that's an odd requirement just to avoid the above problem.

If the argument is to avoid implicitly changing the cluster's default 
compression policy, then I'd suggest supporting a -nocompression flag.
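
As a rough sketch of what I mean, the decision becomes three-way instead of two-way. Everything named here (CompressionChoice, Flag, resolve) is hypothetical and not existing Sqoop code; in the real job the result would presumably feed FileOutputFormat.setCompressOutput(), but only when the user passed one of the two flags.

```java
public class CompressionChoice {
    // Hypothetical tri-state: what the user asked for on the command line.
    enum Flag { COMPRESS, NO_COMPRESSION, UNSET }

    // Decide whether the job output ends up compressed.
    // An explicit flag wins; with no flag, the cluster default is left alone.
    static boolean resolve(Flag flag, boolean clusterDefault) {
        switch (flag) {
            case COMPRESS:
                return true;
            case NO_COMPRESSION:
                return false;
            default:
                return clusterDefault;
        }
    }

    public static void main(String[] args) {
        boolean clusterCompresses = true; // assumed cluster-wide policy
        for (Flag f : Flag.values()) {
            System.out.println(f + " -> compressed=" + resolve(f, clusterCompresses));
        }
    }
}
```

That keeps Arvind's point intact (no flag means the cluster policy applies) while still giving users a way to get splittable text output.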

Regards,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr


