Markus Kemper created SQOOP-3130:
------------------------------------
Summary: Sqoop (export + --export-dir + Avro files) is not obeying --num-mappers requested
Key: SQOOP-3130
URL: https://issues.apache.org/jira/browse/SQOOP-3130
Project: Sqoop
Issue Type: Bug
Reporter: Markus Kemper
When using Sqoop (export + --export-dir + Avro files), Sqoop does not obey the
requested --num-mappers value; instead it creates one map task per Avro data file.
One known workaround for this issue is to use the Sqoop --hcatalog options.
Please see the test case below, which demonstrates both the issue and the workaround.
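For background, the likely mechanism (an illustrative sketch, not Sqoop source code): Sqoop's text export plans its input with a CombineFileInputFormat-style strategy that packs many files into the requested number of splits, whereas the Avro export path appears to plan one split per input file, so --num-mappers is ignored. A minimal Python simulation of the two planning strategies (the function names and the size-based packing heuristic are invented for illustration):

```python
# Illustrative simulation only -- not Sqoop's actual code.
# Two ways an input format might plan map splits over HDFS files.

def per_file_splits(file_sizes, num_mappers):
    """One split per file; the requested mapper count is ignored
    (the behavior observed with Avro-file exports)."""
    return [[size] for size in file_sizes]

def combined_splits(file_sizes, num_mappers):
    """Pack files into at most num_mappers splits by total size
    (a CombineFileInputFormat-style strategy, as used by text exports)."""
    target = sum(file_sizes) / num_mappers
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if current_size >= target and len(splits) < num_mappers - 1:
            splits.append(current)
            current, current_size = [], 0
    if current:
        splits.append(current)
    return splits

# Ten part files of roughly 250 KB each, as in the Avro import below.
sizes = [250_000] * 10
print(len(per_file_splits(sizes, 2)))   # 10 splits -> 10 map tasks
print(len(combined_splits(sizes, 2)))   # 2 splits  -> 2 map tasks
```

With ten equally sized files and a request for 2 mappers, the per-file strategy yields 10 map tasks while the combining strategy yields 2, matching the JobSubmitter output in STEP 04 versus STEP 06 of the test case.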
*Test Case*
{noformat}
#################
# STEP 01 - Create Data
#################
for i in {1..100000}; do d=`date +"%Y-%m-%d %H:%M:%S" --date="+$i days"`; echo "$i,$d,row data" >> ./data.csv; done
ls -l ./*;
wc data.csv
hdfs dfs -mkdir -p /user/root/external/t1
hdfs dfs -put ./data.csv /user/root/external/t1/data.csv
hdfs dfs -ls -R /user/root/external/t1/
Output:
-rw-r--r-- 1 root root 3488895 Feb 1 11:20 ./data.csv
~~~~~
100000 300000 3488895 data.csv
~~~~~
-rw-r--r-- 3 root root 3488895 2017-02-01 11:26 /user/root/external/t1/data.csv
#################
# STEP 02 - Create RDBMS Table and Export Data
#################
export MYCONN=jdbc:oracle:thin:@oracle.cloudera.com:1521/db11g;
export MYUSER=sqoop
export MYPSWD=cloudera
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1_text (c1 int, c2 date, c3 varchar(10))"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1 --input-fields-terminated-by ',' --num-mappers 1
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
Output:
17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Transferred 3.3274 MB in 24.8037 seconds (137.3688 KB/sec)
17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~~
------------------------
| COUNT(*) |
------------------------
| 100000 |
------------------------
#################
# STEP 03 - Import Data as Text Creating 10 HDFS Files
#################
sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --target-dir /user/root/external/t1_text --delete-target-dir --num-mappers 10 --split-by C1 --as-textfile
hdfs dfs -ls /user/root/external/t1_text/part*
Output:
17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Transferred 3.518 MB in 57.0517 seconds (63.1434 KB/sec)
17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
~~~~~
-rw-r--r-- 3 root root 358894 2017-02-01 11:38 /user/root/external/t1_text/part-m-00000
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00001
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00002
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00003
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00004
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00005
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00006
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00007
-rw-r--r-- 3 root root 370000 2017-02-01 11:38 /user/root/external/t1_text/part-m-00008
-rw-r--r-- 3 root root 370001 2017-02-01 11:38 /user/root/external/t1_text/part-m-00009
#################
# STEP 04 - Export the 10 Text-Formatted Files Using 2 Splits
#################
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by ',' --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
Output:
------------------------
| COUNT(*) |
------------------------
| 0 |
------------------------
~~~~~
17/02/01 11:47:26 INFO input.FileInputFormat: Total input paths to process : 10
17/02/01 11:47:26 INFO mapreduce.JobSubmitter: number of splits:2
<SNIP>
17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in 31.7104 seconds (113.6324 KB/sec)
17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*) |
------------------------
| 100000 |
------------------------
#################
# STEP 05 - Import Data as Avro Creating 10 HDFS Files
#################
sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --target-dir /user/root/external/t1_avro --delete-target-dir --num-mappers 10 --split-by C1 --as-avrodatafile
hdfs dfs -ls /user/root/external/t1_avro/*.avro
Output:
17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Transferred 2.3703 MB in 68.454 seconds (35.4568 KB/sec)
17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
~~~~~
-rw-r--r-- 3 root root 231119 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00000.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00001.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00002.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00003.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00004.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00005.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00006.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00007.avro
-rw-r--r-- 3 root root 250477 2017-02-01 11:57 /user/root/external/t1_avro/part-m-00008.avro
-rw-r--r-- 3 root root 250478 2017-02-01 11:56 /user/root/external/t1_avro/part-m-00009.avro
#################
# STEP 06 - Export the 10 Avro-Formatted Files Using 2 Splits (reproduction of issue)
#################
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --export-dir /user/root/external/t1_avro --input-fields-terminated-by ',' --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
Output:
17/02/01 12:01:07 INFO input.FileInputFormat: Total input paths to process : 10
17/02/01 12:01:08 INFO mapreduce.JobSubmitter: number of splits:10
<================== not correct: should be 2 splits, not 10
<SNIP>
17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Transferred 2.4497 MB in 57.1965 seconds (43.8575 KB/sec)
17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*) |
------------------------
| 100000 |
------------------------
#################
# STEP 07 - Export the 10 Avro-Formatted Files Using 2 Splits via the HCat Options (workaround)
#################
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
beeline -u "jdbc:hive2://hs2.coe.cloudera.com:10000" -n user1 -e "use default; drop table t1_avro; create external table t1_avro (c1 int, c2 string, c3 string) row format delimited fields terminated by ',' stored as avro location 'hdfs:///user/root/external/t1_avro'; select count(*) from t1_avro;"
sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT --hcatalog-database default --hcatalog-table t1_avro --num-mappers 2
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select count(*) from t1_text"
Output:
------------------------
| COUNT(*) |
------------------------
| 0 |
------------------------
~~~~~
+---------+--+
| _c0 |
+---------+--+
| 100000 |
+---------+--+
~~~~~
17/02/01 13:41:54 INFO mapred.FileInputFormat: Total input paths to process : 10
17/02/01 13:41:54 INFO mapreduce.JobSubmitter: number of splits:2
<================ correct!
<SNIP>
17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Transferred 2.5225 MB in 48.7286 seconds (53.0082 KB/sec)
17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Exported 100000 records.
~~~~~
------------------------
| COUNT(*) |
------------------------
| 100000 |
------------------------
{noformat}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)