We've never been able to get sqoop-merge to use compression, even after setting the MR output compression properties. I do hope you guys can figure out the issue; it would save us a lot of space.
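For anyone who wants to confirm the same symptom on their own target directory, here is a rough sketch of a checker. The function name `classify_parts` is made up for illustration, and it assumes that mapper output is named `part-m-*`, reducer (merge) output is named `part-r-*`, and Snappy-compressed files carry a `.snappy` suffix, as in the listing quoted below.

```shell
#!/bin/sh
# Hypothetical helper: reads file names (one per line) on stdin and prints
# name, producing stage, and whether a .snappy suffix is present.
# Assumes the part-m-* / part-r-* naming seen in the quoted thread.
classify_parts() {
  while IFS= read -r name; do
    # Work out which stage wrote the file; skip anything else (_SUCCESS etc.).
    case "$name" in
      part-m-*) stage="map" ;;
      part-r-*) stage="reduce" ;;
      *) continue ;;
    esac
    # The suffix is the only compression signal checked here.
    case "$name" in
      *.snappy) comp="snappy" ;;
      *) comp="uncompressed" ;;
    esac
    printf '%s\t%s\t%s\n' "$name" "$stage" "$comp"
  done
}

# Example use against HDFS (TARGET_DIR is a placeholder you must set yourself):
#   hdfs dfs -ls "$TARGET_DIR" | awk '{print $NF}' | sed 's|.*/||' | classify_parts
```

If the merge step is ignoring the codec, every `reduce` row should come back as `uncompressed`. This only inspects file names, so it would not catch compressed data written without a suffix.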
On Tue, May 5, 2015 at 1:45 PM, Michael Arena <[email protected]> wrote:

> Sqoop 1.4.5-cdh5.2.1 on Cloudera CDH 5.2.1 cluster
>
> From: Abraham Elmahrek
> Reply-To: "[email protected]"
> Date: Tuesday, May 5, 2015 at 4:18 PM
> To: "[email protected]"
> Subject: Re: Sqoop Incremental job does not honor the compression settings after the initial run
>
> This seems like a bug. Which version of Sqoop are you using?
>
> On Tue, May 5, 2015 at 12:50 PM, Michael Arena <[email protected]> wrote:
>
>> I am incrementally loading data from SQL Server to Hadoop using an Oozie Sqoop Action.
>> Oozie runs a saved job in the Sqoop Metastore as created below:
>>
>> sqoop job \
>>   --create import__test__mydb__mytable \
>>   --meta-connect *** \
>>   -- import \
>>   --connect "jdbc:sqlserver://mydbserver:1433;databaseName=mydb;" \
>>   --username **** \
>>   --password-file **** \
>>   --num-mappers 4 \
>>   --target-dir /***/***/mytable \
>>   --fields-terminated-by '\t' --input-fields-terminated-by '\t' \
>>   --null-string '\\N' --null-non-string '\\N' \
>>   --input-null-string '\\N' --input-null-non-string '\\N' \
>>   --relaxed-isolation \
>>   --query "SELECT id, first_name, last_name, mod_time FROM mytable" \
>>   --split-by id \
>>   --merge-key id \
>>   --incremental lastmodified \
>>   --check-column mod_time \
>>   --last-value "1900-01-01 00:00:00.000" \
>>   --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
>>
>> The initial time the job runs, it creates 4 files like:
>>
>> part-m-00000.snappy
>> part-m-00002.snappy
>> part-m-00003.snappy
>> part-m-00004.snappy
>>
>> It did not need to do the "merge" step since there was no existing data.
>>
>> However, the next time it runs, it pulls over modified rows from SQL Server and then "merges" them into the existing data and creates files:
>>
>> part-r-00000
>> part-r-00001
>> part-r-00002
>> ...
>> part-r-00020
>> part-r-00031
>>
>> which are uncompressed TSV files.
>>
>> The Sqoop Metastore has the compression settings saved:
>>
>> % sqoop job --show import__test__mydb__mytable
>> ...
>> enable.compression = true
>> compression.codec = org.apache.hadoop.io.compress.SnappyCodec
>> ...
>>
>> Since the files are named "part-m-0000X.snappy" after the first run, I am guessing that the "-m-" in the name means the mappers created them (and also since I specified 4 mappers).
>>
>> On the second run, I am guessing that the (32?) reducers created the output since there was merging necessary and the files have "-r-" in the name.
>>
>> Is this a bug or expected behavior?
>> Is there some other settings to tell the reducers to honor the compression settings?
>> If it is a bug, where do I create an issue (JIRA) for it?
>>
>> How are you engaging with millennials at your organization? Earn “Lifetime Loyalty with Effective Millennial Engagement” by signing up for our next webinar. Join us *Tuesday, May 12 at 1:00 EDT* to obtain the tools you need to earn brand loyalty from this important demographic. Click here <http://content.paytronix.com/Lifetime-Loyalty_0515_sig.html> to register!

--
Mauricio Aristizabal
Manager - Business Intelligence + Data Science | Impact Radius
