[
https://issues.apache.org/jira/browse/SQOOP-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854178#comment-15854178
]
Markus Kemper commented on SQOOP-3130:
--------------------------------------
After reviewing this issue further, another possible workaround is to set
(mapreduce.input.fileinputformat.split.minsize) to help control the number of
map tasks used when exporting Avro data files. This method is not likely to be
100% deterministic if the volume of data exceeds the boundaries implied by the
"minsize" and the requested number of map tasks; however, it is still
preferable to one map task per Avro data file. The examples below demonstrate
this workaround, and a note on sizing the "minsize" value follows them.
*Use (mapreduce.input.fileinputformat.split.minsize) without \--num-mappers*
{noformat}
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
"delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
"select count(*) from t1_text"
sqoop export -Dmapreduce.input.fileinputformat.split.minsize=134217728
--connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT
--export-dir /user/root/external/t1_text --input-fields-terminated-by ','
17/02/06 05:56:25 INFO input.FileInputFormat: Total input paths to process : 10
17/02/06 05:56:25 INFO mapreduce.JobSubmitter: number of splits:4 <===========
uses default --num-mappers value (4)
<SNIP>
17/02/06 05:57:11 INFO mapreduce.ExportJobBase: Transferred 3.519 MB in 47.9893
seconds (75.0887 KB/sec)
17/02/06 05:57:11 INFO mapreduce.ExportJobBase: Exported 100000 records.
{noformat}
*Use (mapreduce.input.fileinputformat.split.minsize) with \--num-mappers*
{noformat}
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
"delete from t1_text"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
"select count(*) from t1_text"
sqoop export -Dmapreduce.input.fileinputformat.split.minsize=134217728
--connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT
--export-dir /user/root/external/t1_text --input-fields-terminated-by ','
--num-mappers 2
17/02/06 05:59:12 INFO input.FileInputFormat: Total input paths to process : 10
17/02/06 05:59:12 INFO mapreduce.JobSubmitter: number of splits:2 <===========
uses requested --num-mappers value (2)
<SNIP>
17/02/06 05:59:42 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in
32.3481 seconds (111.3925 KB/sec)
17/02/06 05:59:42 INFO mapreduce.ExportJobBase: Exported 100000 records.
{noformat}
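*Sizing (mapreduce.input.fileinputformat.split.minsize) (illustrative sketch)*

A rough way to choose a value for (mapreduce.input.fileinputformat.split.minsize) is to derive it from the total bytes under the export directory and the desired number of map tasks, so that each split is asked to hold at least its share of the data. The sketch below is illustrative only and is not part of the original test case; EXPORT_DIR and MAPPERS are placeholder values, and the resulting split count still depends on how the export input format combines the Avro files.
{noformat}
# Hypothetical sizing helper (not part of the original test case).
# EXPORT_DIR and MAPPERS are placeholders to substitute for your own job.
EXPORT_DIR=/user/root/external/t1_text
MAPPERS=2

# Total bytes under the export directory (first column of -du -s).
TOTAL_BYTES=$(hdfs dfs -du -s "$EXPORT_DIR" | awk '{print $1}')

# Ask for at least TOTAL_BYTES / MAPPERS bytes per split (rounded up), which
# discourages the input format from creating more than MAPPERS splits.
MINSIZE=$(( (TOTAL_BYTES + MAPPERS - 1) / MAPPERS ))

sqoop export -Dmapreduce.input.fileinputformat.split.minsize=$MINSIZE \
  --connect $MYCONN --username $MYUSER --password $MYPSWD --table T1_TEXT \
  --export-dir "$EXPORT_DIR" --input-fields-terminated-by ',' \
  --num-mappers $MAPPERS
{noformat}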
> Sqoop (export + --export-dir + Avro files) is not obeying --num-mappers
> requested
> ---------------------------------------------------------------------------------
>
> Key: SQOOP-3130
> URL: https://issues.apache.org/jira/browse/SQOOP-3130
> Project: Sqoop
> Issue Type: Bug
> Reporter: Markus Kemper
>
> When using Sqoop (export + --export-dir + Avro files), it does not obey the
> requested --num-mappers; instead, it creates one map task per Avro data
> file.
> One known workaround for this issue is to use the Sqoop --hcatalog options.
> Please see the test case below demonstrating the issue and workaround.
> *Test Case*
> {noformat}
> #################
> # STEP 01 - Create Data
> #################
> for i in {1..100000}; do d=`date +"%Y-%m-%d %H:%M:%S" --date="+$i days"`;
> echo "$i,$d,row data" >> ./data.csv; done
> ls -l ./*;
> wc data.csv
> hdfs dfs -mkdir -p /user/root/external/t1
> hdfs dfs -put ./data.csv /user/root/external/t1/data.csv
> hdfs dfs -ls -R /user/root/external/t1/
> Output:
> -rw-r--r-- 1 root root 3488895 Feb 1 11:20 ./data.csv
> ~~~~~
> 100000 300000 3488895 data.csv
> ~~~~~
> -rw-r--r-- 3 root root 3488895 2017-02-01 11:26
> /user/root/external/t1/data.csv
> #################
> # STEP 02 - Create RDBMS Table and Export Data
> #################
> export MYCONN=jdbc:oracle:thin:@oracle.cloudera.com:1521/db11g;
> export MYUSER=sqoop
> export MYPSWD=cloudera
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "drop table t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "create table t1_text (c1 int, c2 date, c3 varchar(10))"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table
> T1_TEXT --export-dir /user/root/external/t1 --input-fields-terminated-by ','
> --num-mappers 1
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "select count(*) from t1_text"
> Output:
> 17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Transferred 3.3274 MB in
> 24.8037 seconds (137.3688 KB/sec)
> 17/02/01 11:33:31 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~~
> ------------------------
> | COUNT(*) |
> ------------------------
> | 100000 |
> ------------------------
> #################
> # STEP 03 - Import Data as Text Creating 10 HDFS Files
> #################
> sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table
> T1_TEXT --target-dir /user/root/external/t1_text --delete-target-dir
> --num-mappers 10 --split-by C1 --as-textfile
> hdfs dfs -ls /user/root/external/t1_text/part*
> Output:
> 17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Transferred 3.518 MB in
> 57.0517 seconds (63.1434 KB/sec)
> 17/02/01 11:38:26 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
> ~~~~~
> -rw-r--r-- 3 root root 358894 2017-02-01 11:38
> /user/root/external/t1_text/part-m-00000
> -rw-r--r-- 3 root root 370000 2017-02-01 11:38
> /user/root/external/t1_text/part-m-00001
> -rw-r--r-- 3 root root 370000 2017-02-01 11:38
> /user/root/external/t1_text/part-m-00002
> -rw-r--r-- 3 root root 370000 2017-02-01 11:38
> /user/root/external/t1_text/part-m-00003
> -rw-r--r-- 3 root root 370000 2017-02-01 11:38
> /user/root/external/t1_text/part-m-00004
> -rw-r--r-- 3 root root 370000 2017-02-01 11:38
> /user/root/external/t1_text/part-m-00005
> -rw-r--r-- 3 root root 370000 2017-02-01 11:38
> /user/root/external/t1_text/part-m-00006
> -rw-r--r-- 3 root root 370000 2017-02-01 11:38
> /user/root/external/t1_text/part-m-00007
> -rw-r--r-- 3 root root 370000 2017-02-01 11:38
> /user/root/external/t1_text/part-m-00008
> -rw-r--r-- 3 root root 370001 2017-02-01 11:38
> /user/root/external/t1_text/part-m-00009
> #################
> # STEP 04 - Export the 10 Text-Formatted Files Using 2 Splits
> #################
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "delete from t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "select count(*) from t1_text"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table
> T1_TEXT --export-dir /user/root/external/t1_text --input-fields-terminated-by
> ',' --num-mappers 2
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "select count(*) from t1_text"
> Output:
> ------------------------
> | COUNT(*) |
> ------------------------
> | 0 |
> ------------------------
> ~~~~~
> 17/02/01 11:47:26 INFO input.FileInputFormat: Total input paths to process :
> 10
> 17/02/01 11:47:26 INFO mapreduce.JobSubmitter: number of splits:2
> <SNIP>
> 17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Transferred 3.5189 MB in
> 31.7104 seconds (113.6324 KB/sec)
> 17/02/01 11:47:55 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~
> ------------------------
> | COUNT(*) |
> ------------------------
> | 100000 |
> ------------------------
> #################
> # STEP 05 - Import Data as Avro Creating 10 HDFS Files
> #################
> sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table
> T1_TEXT --target-dir /user/root/external/t1_avro --delete-target-dir
> --num-mappers 10 --split-by C1 --as-avrodatafile
> hdfs dfs -ls /user/root/external/t1_avro/*.avro
> Output:
> 17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Transferred 2.3703 MB in
> 68.454 seconds (35.4568 KB/sec)
> 17/02/01 11:57:38 INFO mapreduce.ImportJobBase: Retrieved 100000 records.
> ~~~~~
> -rw-r--r-- 3 root root 231119 2017-02-01 11:57
> /user/root/external/t1_avro/part-m-00000.avro
> -rw-r--r-- 3 root root 250477 2017-02-01 11:57
> /user/root/external/t1_avro/part-m-00001.avro
> -rw-r--r-- 3 root root 250477 2017-02-01 11:57
> /user/root/external/t1_avro/part-m-00002.avro
> -rw-r--r-- 3 root root 250477 2017-02-01 11:57
> /user/root/external/t1_avro/part-m-00003.avro
> -rw-r--r-- 3 root root 250477 2017-02-01 11:57
> /user/root/external/t1_avro/part-m-00004.avro
> -rw-r--r-- 3 root root 250477 2017-02-01 11:57
> /user/root/external/t1_avro/part-m-00005.avro
> -rw-r--r-- 3 root root 250477 2017-02-01 11:57
> /user/root/external/t1_avro/part-m-00006.avro
> -rw-r--r-- 3 root root 250477 2017-02-01 11:57
> /user/root/external/t1_avro/part-m-00007.avro
> -rw-r--r-- 3 root root 250477 2017-02-01 11:57
> /user/root/external/t1_avro/part-m-00008.avro
> -rw-r--r-- 3 root root 250478 2017-02-01 11:56
> /user/root/external/t1_avro/part-m-00009.avro
> #################
> # STEP 06 - Export the 10 Avro-Formatted Files Using 2 Splits (reproduction
> of the issue)
> #################
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "delete from t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "select count(*) from t1_text"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table
> T1_TEXT --export-dir /user/root/external/t1_avro --input-fields-terminated-by
> ',' --num-mappers 2
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "select count(*) from t1_text"
> Output:
> 17/02/01 12:01:07 INFO input.FileInputFormat: Total input paths to process :
> 10
> 17/02/01 12:01:08 INFO mapreduce.JobSubmitter: number of splits:10
> <================== not correct, should be only 2 not 10
> <SNIP>
> 17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Transferred 2.4497 MB in
> 57.1965 seconds (43.8575 KB/sec)
> 17/02/01 12:02:02 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~
> ------------------------
> | COUNT(*) |
> ------------------------
> | 100000 |
> ------------------------
> #################
> # STEP 07 - Export the 10 Avro-Formatted Files Using 2 Splits with the HCat
> Options (workaround)
> #################
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "delete from t1_text"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "select count(*) from t1_text"
> beeline -u "jdbc:hive2://hs2.coe.cloudera.com:10000" -n user1 -e "use
> default; drop table t1_avro; create external table t1_avro (c1 int, c2
> string, c3 string) row format delimited fields terminated by ',' stored as
> avro location 'hdfs:///user/root/external/t1_avro'; select count(*) from
> t1_avro;"
> sqoop export --connect $MYCONN --username $MYUSER --password $MYPSWD --table
> T1_TEXT --hcatalog-database default --hcatalog-table t1_avro --num-mappers 2
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query
> "select count(*) from t1_text"
> Output:
> ------------------------
> | COUNT(*) |
> ------------------------
> | 0 |
> ------------------------
> ~~~~~
> +---------+--+
> | _c0 |
> +---------+--+
> | 100000 |
> +---------+--+
> ~~~~~
> 17/02/01 13:41:54 INFO mapred.FileInputFormat: Total input paths to process :
> 10
> 17/02/01 13:41:54 INFO mapreduce.JobSubmitter: number of splits:2
> <================ correct!
> <SNIP>
> 17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Transferred 2.5225 MB in
> 48.7286 seconds (53.0082 KB/sec)
> 17/02/01 13:42:34 INFO mapreduce.ExportJobBase: Exported 100000 records.
> ~~~~~
> ------------------------
> | COUNT(*) |
> ------------------------
> | 100000 |
> ------------------------
> {noformat}