Hi Ken,

You make some good points; I've responded to each one below.

re: the degree of parallelism during the next step of processing is
constrained by the number of mappers used during sqooping: does
https://issues.cloudera.org/browse/SQOOP-137 address it? If so, you might
want to add your comments there.

re: winding up with unsplittable files and heavily skewed sizes: you can
file separate JIRAs for those if desired.

re: partitioning isn't great: for some databases such as Oracle, the
problem of heavily skewed sizes can be overcome using row-ids; you can file
a JIRA for that if you feel it's needed.

Regards,
Kate

On Mon, Sep 5, 2011 at 12:32 PM, Ken Krugler <[email protected]> wrote:
>
> On Sep 5, 2011, at 12:12pm, Arvind Prabhakar wrote:
>
>> On Sun, Sep 4, 2011 at 3:49 PM, Ken Krugler <[email protected]>
>> wrote:
>>> Hi there,
>>> The current documentation says:
>>>
>>> By default, data is not compressed. You can compress your data by using the
>>> deflate (gzip) algorithm with the -z or --compress argument, or specify any
>>> Hadoop compression codec using the --compression-codec argument. This
>>> applies to both SequenceFiles or text files.
>>>
>>> But I think this is a bit misleading.
>>> Currently if output compression is enabled in a cluster, then the Sqooped
>>> data is always compressed, regardless of the setting of this flag.
>>> It seems better to actually make compression controllable via --compress,
>>> which means changing ImportJobBase.configureOutputFormat()
>>>
>>> if (options.shouldUseCompression()) {
>>>   FileOutputFormat.setCompressOutput(job, true);
>>>   FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>>   SequenceFileOutputFormat.setOutputCompressionType(job,
>>>       CompressionType.BLOCK);
>>> }
>>> // new stuff
>>> else {
>>>   FileOutputFormat.setCompressOutput(job, false);
>>> }
>>>
>>> Thoughts?
>>
>> This is a good point Ken. However, IMO it is better left as is since
>> there may be a wider cluster management policy in effect that requires
>> compression for all output files. One way to look at it is that for
>> normal use, there is a predefined compression scheme configured
>> cluster wide, and occasionally when required, Sqoop users can use a
>> different scheme where necessary.
>
> The problem is that when you use text files as Sqoop output, these get
> compressed at the file level by (typically) deflate, gzip or lzo.
>
> So you wind up with unsplittable files, which means that the degree of
> parallelism during the next step of processing is constrained by the number
> of mappers used during sqooping. But you typically set the number of mappers
> based on DB load & size of the data set.
>
> And if partitioning isn't great, then you also wind up with heavily skewed
> sizes for these unsplittable files, which makes things even worse.
>
> The current work-around is to use binary or Avro output instead of text, but
> that's an odd requirement to be able to avoid the above problem.
>
> If the argument is to avoid implicitly changing the cluster's default
> compression policy, then I'd suggest supporting a -nocompression flag.
>
> Regards,
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
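
For illustration only, the parallelism concern in the quoted message can be sketched in plain Java. This is a rough model, not Sqoop or Hadoop code: the class name SplitEstimate, both method names, the 128 MB block size, and the four file sizes are all assumptions made up for the example. It compares how many map tasks a follow-on job could get when each output file must be read whole versus when files can be cut at block boundaries.

```java
import java.util.List;

public class SplitEstimate {
    // Assumed block size for illustration (128 MB); real clusters vary.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    /** Unsplittable files: at most one map task per file. */
    static long unsplittableSplits(List<Long> fileSizes) {
        return fileSizes.size();
    }

    /** Splittable files: roughly one map task per block, at least one per file. */
    static long splittableSplits(List<Long> fileSizes) {
        long total = 0;
        for (long size : fileSizes) {
            total += Math.max(1, (size + BLOCK_SIZE - 1) / BLOCK_SIZE);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Hypothetical output of a 4-mapper import with skewed partitioning.
        List<Long> sizes = List.of(2048 * mb, 256 * mb, 128 * mb, 64 * mb);
        System.out.println("unsplittable: " + unsplittableSplits(sizes));
        System.out.println("splittable:   " + splittableSplits(sizes));
    }
}
```

With these made-up sizes the unsplittable case is capped at the four import mappers, and the single 2 GB file has to be handled by one task, which is the skew problem described above.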
