Re: Sqoop Incremental job does not honor the compression settings after the initial run

2015-05-05 Thread Michael Arena
Sqoop 1.4.5-cdh5.2.1 on Cloudera CDH 5.2.1 cluster

From: Abraham Elmahrek
Reply-To: user@sqoop.apache.org
Date: Tuesday, May 5, 2015 at 4:18 PM
To: user@sqoop.apache.org
Subject: Re: Sqoop Incremental job does not honor the compression settings 
after the initial run

This seems like a bug. Which version of Sqoop are you using?

On Tue, May 5, 2015 at 12:50 PM, Michael Arena 
mar...@paytronix.com wrote:
I am incrementally loading data from SQL Server to Hadoop using an Oozie Sqoop 
Action.
Oozie runs a saved job in the Sqoop Metastore as created below:

sqoop job \
   --create import__test__mydb__mytable \
   --meta-connect *** \
   -- import \
   --connect 'jdbc:sqlserver://mydbserver:1433;databaseName=mydb;' \
   --username  \
   --password-file  \
   --num-mappers 4 \
   --target-dir /***/***/mytable \
   --fields-terminated-by '\t' --input-fields-terminated-by '\t' \
   --null-string '\\N' --null-non-string '\\N' \
   --input-null-string '\\N' --input-null-non-string '\\N' \
   --relaxed-isolation \
   --query 'SELECT id, first_name, last_name, mod_time FROM mytable WHERE $CONDITIONS' \
   --split-by id \
   --merge-key id \
   --incremental lastmodified \
   --check-column mod_time \
   --last-value '1900-01-01 00:00:00.000' \
   --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
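
The Oozie action then simply executes the saved job on each run; outside of Oozie, the equivalent would be roughly:

sqoop job \
   --meta-connect *** \
   --exec import__test__mydb__mytable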


The first time the job runs, it creates four files like:
part-m-00000.snappy
part-m-00001.snappy
part-m-00002.snappy
part-m-00003.snappy

It did not need to do the merge step since there was no existing data.
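
As a rough sanity check that the first run's output really is Snappy-compressed (with <target-dir> standing in for the real path, which is elided above), listing the directory and dumping one file should show readable TSV, since hadoop fs -text decompresses known codecs:

hdfs dfs -ls <target-dir>
hadoop fs -text <target-dir>/part-m-00000.snappy | head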

However, the next time it runs, it pulls over modified rows from SQL Server and 
then merges them into the existing data and creates files:
part-r-00000
part-r-00001
part-r-00002
...
part-r-00020
part-r-00031

which are uncompressed TSV files.


The Sqoop Metastore has the compression settings saved:
% sqoop job --show import__test__mydb__mytable
...
enable.compression = true
compression.codec = org.apache.hadoop.io.compress.SnappyCodec
...


Since the files are named part-m-0000X.snappy after the first run, I am 
guessing that the -m- in the name means the mappers created them (consistent 
with the 4 mappers I specified).

On the second run, I am guessing that the (32?) reducers created the output, 
since merging was necessary and the files have -r- in the name.

Is this a bug or expected behavior?
Are there other settings to tell the reducers to honor the compression 
settings?
If it is a bug, where do I create an issue (JIRA) for it?
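
One workaround I have not tried yet (so this is only a guess) would be to pass 
the Hadoop output-compression settings as generic -D options when defining the 
job; generic options must come directly after the tool name, and would 
presumably then apply to the merge (reduce) job as well:

sqoop job \
   --create import__test__mydb__mytable \
   --meta-connect *** \
   -- import \
   -D mapreduce.output.fileoutputformat.compress=true \
   -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
   ... (remaining options as above)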




Re: Sqoop Incremental job does not honor the compression settings after the initial run

2015-05-05 Thread Mauricio Aristizabal
We've never been able to get sqoop-merge to use compression, even after trying
to set the MapReduce output properties. I do hope you guys can figure out the
issue; it would save us a lot of space.

On Tue, May 5, 2015 at 1:45 PM, Michael Arena mar...@paytronix.com wrote:

