This seems like a bug. Which version of Sqoop are you using?

On Tue, May 5, 2015 at 12:50 PM, Michael Arena <[email protected]> wrote:
> I am incrementally loading data from SQL Server to Hadoop using an
> Oozie Sqoop Action.
> Oozie runs a saved job in the Sqoop Metastore as created below:
>
> sqoop job \
> --create import__test__mydb__mytable \
> --meta-connect *** \
> -- import \
> --connect "jdbc:sqlserver://mydbserver:1433;databaseName=mydb;" \
> --username **** \
> --password-file **** \
> --num-mappers 4 \
> --target-dir /***/***/mytable \
> --fields-terminated-by '\t' --input-fields-terminated-by '\t' \
> --null-string '\\N' --null-non-string '\\N' \
> --input-null-string '\\N' --input-null-non-string '\\N' \
> --relaxed-isolation \
> --query "SELECT id, first_name, last_name, mod_time FROM mytable" \
> --split-by id \
> --merge-key id \
> --incremental lastmodified \
> --check-column mod_time \
> --last-value "1900-01-01 00:00:00.000" \
> --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
>
> The initial time the job runs, it creates 4 files like:
> part-m-00000.snappy
> part-m-00002.snappy
> part-m-00003.snappy
> part-m-00004.snappy
>
> It did not need to do the "merge" step since there was no existing data.
>
> However, the next time it runs, it pulls over modified rows from SQL
> Server and then "merges" them into the existing data and creates files:
> part-r-00000
> part-r-00001
> part-r-00002
> ...
> part-r-00020
> part-r-00031
>
> which are uncompressed TSV files.
>
> The Sqoop Metastore has the compression settings saved:
> % sqoop job --show import__test__mydb__mytable
> ...
> enable.compression = true
> compression.codec = org.apache.hadoop.io.compress.SnappyCodec
> ...
>
> Since the files are named "part-m-0000X.snappy" after the first run, I
> am guessing that the "-m-" in the name means the mappers created them (and
> also since I specified 4 mappers).
>
> On the second run, I am guessing that the (32?) reducers created the
> output since there was merging necessary and the files have "-r-" in the
> name.
>
> Is this a bug or expected behavior?
> Is there some other setting that tells the reducers to honor the
> compression settings?
> If it is a bug, where do I create an issue (JIRA) for it?
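
The guess about the file names matches the usual Hadoop convention: "part-m-" marks map task output and "part-r-" marks reduce task output. To check a target directory quickly after each run, a minimal shell sketch follows. The function name classify_part is hypothetical, and it only inspects file names (it does not open the files), so treat the "uncompressed" label as "no .snappy suffix" rather than proof about the contents.

```shell
#!/bin/sh
# classify_part NAME
# Prints "NAME KIND COMPRESSION", where KIND is derived from the
# "-m-" / "-r-" marker in a Hadoop part file name and COMPRESSION
# is derived only from the file name suffix.
classify_part() {
  name=$1

  # Task type that wrote the file, judged by the name prefix.
  case "$name" in
    part-m-*) kind=map ;;
    part-r-*) kind=reduce ;;
    *)        kind=unknown ;;
  esac

  # Compression, judged by the suffix only (assumption: Snappy output
  # carries a ".snappy" extension, as in the first run described above).
  case "$name" in
    *.snappy) comp=snappy ;;
    *)        comp=uncompressed ;;
  esac

  printf '%s %s %s\n' "$name" "$kind" "$comp"
}

# Sample names taken from the two runs described in the quoted message.
for f in part-m-00000.snappy part-r-00000; do
  classify_part "$f"
done
```

On a cluster, the names could be taken from a listing of the job's target directory instead of the hard-coded samples. Any "reduce uncompressed" line after an incremental run would confirm that the merge step is the one dropping the codec.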
