Hi Ken, you make some good points; I've commented on each individually below.

re: the degree of parallelism during the next step of processing is
constrained by the number of mappers used during sqooping: does
https://issues.cloudera.org/browse/SQOOP-137 address it? If so, you
might want to add your comments there.

re: winding up with unsplittable files and heavily skewed sizes: you
can file separate JIRAs for those if desired.
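In the meantime, one workaround for text output is to import with a
splittable codec: Hadoop's BZip2Codec supports splitting compressed
files, so downstream jobs aren't limited to one mapper per file. A
sketch (the connection string, credentials, and table name here are
made up; --compress and --compression-codec are the flags from the
documentation quoted below):

```sh
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user -P \
  --table orders \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.BZip2Codec
```

bzip2 is slower than deflate/gzip, so this trades import CPU time for
splittability; whether that's acceptable depends on the workload.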

re: partitioning isn't great: for some databases such as Oracle, the
problem of heavily skewed sizes can be overcome using row-ids; you can
file a JIRA for that if you feel it's needed.
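For context, the generic lever against skew today is --split-by, which
picks the column Sqoop uses to partition the import across mappers. A
sketch (connection details and column names are hypothetical):

```sh
sqoop import \
  --connect jdbc:oracle:thin:@db.example.com:1521:ORCL \
  --username scott -P \
  --table ORDERS \
  --split-by ORDER_ID \
  --num-mappers 8
```

This only yields even splits when the split column's values are roughly
uniformly distributed; a row-id-based scheme, which partitions by
physical row ranges instead of a value range, is what such a JIRA
would cover.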

Regards, Kate

On Mon, Sep 5, 2011 at 12:32 PM, Ken Krugler
<[email protected]> wrote:
>
> On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:
>
>> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <[email protected]> 
>> wrote:
>>> Hi there,
>>> The current documentation says:
>>>
>>> By default, data is not compressed. You can compress your data by using the
>>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>>> Hadoop compression codec using the --compression-codec argument. This
>>> applies to both SequenceFiles or text files.
>>>
>>> But I think this is a bit misleading.
>>> Currently, if output compression is enabled in a cluster, then the Sqooped
>>> data is always compressed, regardless of the setting of this flag.
>>> It seems better to actually make compression controllable via --compress,
>>> which means changing ImportJobBase.configureOutputFormat()
>>>     if (options.shouldUseCompression()) {
>>>       FileOutputFormat.setCompressOutput(job, true);
>>>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>>       SequenceFileOutputFormat.setOutputCompressionType(job,
>>>           CompressionType.BLOCK);
>>>     } else {
>>>       // new: explicitly disable compression when --compress is not set
>>>       FileOutputFormat.setCompressOutput(job, false);
>>>     }
>>> Thoughts?
>>
>> This is a good point, Ken. However, IMO it is better left as is, since
>> there may be a wider cluster management policy in effect that requires
>> compression for all output files. One way to look at it is that for
>> normal use, there is a predefined compression scheme configured
>> cluster wide, and occasionally when required, Sqoop users can use a
>> different scheme where necessary.
>
> The problem is that when you use text files as Sqoop output, these get 
> compressed at the file level by (typically) deflate, gzip or lzo.
>
> So you wind up with unsplittable files, which means that the degree of 
> parallelism during the next step of processing is constrained by the number 
> of mappers used during sqooping. But you typically set the number of mappers 
> based on DB load & size of the data set.
>
> And if partitioning isn't great, then you also wind up with heavily skewed 
> sizes for these unsplittable files, which makes things even worse.
>
> The current work-around is to use binary or Avro output instead of text, but
> that's an odd requirement just to avoid the above problem.
>
> If the argument is to avoid implicitly changing the cluster's default 
> compression policy, then I'd suggest supporting a -nocompression flag.
>
> Regards,
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>