Yeah, this is a problem. I've created https://issues.apache.org/jira/browse/SQOOP-2346 to track it. Thanks for letting the community know.
-Abe

On Tue, May 5, 2015 at 1:50 PM, Mauricio Aristizabal <[email protected]> wrote:

> We've never been able to get sqoop-merge to use compression. Even tried
> setting MR output properties. I do hope you guys can figure out the issue,
> would save us a lot of space.
>
> On Tue, May 5, 2015 at 1:45 PM, Michael Arena <[email protected]> wrote:
>
>> Sqoop 1.4.5-cdh5.2.1 on Cloudera CDH 5.2.1 cluster
>>
>> From: Abraham Elmahrek
>> Reply-To: "[email protected]"
>> Date: Tuesday, May 5, 2015 at 4:18 PM
>> To: "[email protected]"
>> Subject: Re: Sqoop Incremental job does not honor the compression
>> settings after the initial run
>>
>> This seems like a bug. Which version of Sqoop are you using?
>>
>> On Tue, May 5, 2015 at 12:50 PM, Michael Arena <[email protected]> wrote:
>>
>>> I am incrementally loading data from SQL Server to Hadoop using an
>>> Oozie Sqoop Action.
>>> Oozie runs a saved job in the Sqoop Metastore, created as follows:
>>>
>>> sqoop job \
>>> --create import__test__mydb__mytable \
>>> --meta-connect *** \
>>> -- import \
>>> --connect "jdbc:sqlserver://mydbserver:1433;databaseName=mydb;" \
>>> --username **** \
>>> --password-file **** \
>>> --num-mappers 4 \
>>> --target-dir /***/***/mytable \
>>> --fields-terminated-by '\t' --input-fields-terminated-by '\t' \
>>> --null-string '\\N' --null-non-string '\\N' \
>>> --input-null-string '\\N' --input-null-non-string '\\N' \
>>> --relaxed-isolation \
>>> --query "SELECT id, first_name, last_name, mod_time FROM mytable" \
>>> --split-by id \
>>> --merge-key id \
>>> --incremental lastmodified \
>>> --check-column mod_time \
>>> --last-value "1900-01-01 00:00:00.000" \
>>> --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
>>>
>>> The first time the job runs, it creates 4 files like:
>>> part-m-00000.snappy
>>> part-m-00002.snappy
>>> part-m-00003.snappy
>>> part-m-00004.snappy
>>>
>>> It did not need to do the "merge" step since there was no existing
>>> data.
>>>
>>> However, the next time it runs, it pulls over modified rows from SQL
>>> Server and then "merges" them into the existing data and creates files:
>>> part-r-00000
>>> part-r-00001
>>> part-r-00002
>>> ...
>>> part-r-00020
>>> part-r-00031
>>>
>>> which are uncompressed TSV files.
>>>
>>> The Sqoop Metastore has the compression settings saved:
>>> % sqoop job --show import__test__mydb__mytable
>>> ...
>>> enable.compression = true
>>> compression.codec = org.apache.hadoop.io.compress.SnappyCodec
>>> ...
>>>
>>> Since the files are named "part-m-0000X.snappy" after the first run, I
>>> am guessing that the "-m-" in the name means the mappers created them
>>> (and also since I specified 4 mappers).
>>>
>>> On the second run, I am guessing that the (32?) reducers created the
>>> output since merging was necessary and the files have "-r-" in the
>>> name.
>>>
>>> Is this a bug or expected behavior?
>>> Are there other settings that tell the reducers to honor the
>>> compression settings?
>>> If it is a bug, where do I create an issue (JIRA) for it?
>
> --
> Mauricio Aristizabal
> Manager - Business Intelligence + Data Science | Impact Radius
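
For anyone who wants to check whether their own target directory shows the same symptom, here is a minimal Python sketch (not part of Sqoop). It relies only on the file naming described in the thread: Hadoop writes map-task output as part-m-NNNNN and reduce-task output as part-r-NNNNN, and a compression codec appends its extension (".snappy" here). The names `classify`, `uncompressed_parts`, and `PartFile` are hypothetical helpers made up for this example, and the file listing is assumed to come from wherever you list the directory (for example the output of an HDFS directory listing).

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# part-m-NNNNN = map task output, part-r-NNNNN = reduce task output.
# A codec, when applied, appends an extension such as ".snappy".
PART_FILE = re.compile(r"part-(?P<task>[mr])-(?P<index>\d+)(?P<ext>\.\w+)?")


class PartFile(NamedTuple):
    name: str
    task: str                 # "map" or "reduce"
    extension: Optional[str]  # e.g. ".snappy", or None if there is no extension


def classify(name: str) -> Optional[PartFile]:
    """Parse a Hadoop part file name; return None for anything else (e.g. _SUCCESS)."""
    match = PART_FILE.fullmatch(name)
    if match is None:
        return None
    task = "map" if match.group("task") == "m" else "reduce"
    return PartFile(name, task, match.group("ext"))


def uncompressed_parts(names: Iterable[str], expected_ext: str = ".snappy") -> List[PartFile]:
    """Return the part files whose extension is not the expected codec extension."""
    parts = (classify(n) for n in names)
    return [p for p in parts if p is not None and p.extension != expected_ext]


if __name__ == "__main__":
    # Hypothetical listing that mixes the two runs described in the thread.
    listing = ["part-m-00000.snappy", "part-r-00000", "part-r-00031", "_SUCCESS"]
    for part in uncompressed_parts(listing):
        print(f"{part.name}: written by a {part.task} task, no {'.snappy'} extension")
```

This only inspects file names, so it is a quick check rather than proof: a file without the extension is almost certainly uncompressed text, as in the report above, but the sketch does not open the files to confirm it.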
