[
https://issues.apache.org/jira/browse/SQOOP-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637951#comment-15637951
]
Markus Kemper commented on SQOOP-3047:
--------------------------------------
Linking SQOOP-3046
> Add support for (import + --hive-import + --as-parquet) when Parquet table already exists
> ------------------------------------------------------------------------------------------
>
> Key: SQOOP-3047
> URL: https://issues.apache.org/jira/browse/SQOOP-3047
> Project: Sqoop
> Issue Type: Improvement
> Components: hive-integration
> Reporter: Markus Kemper
>
> When a Parquet table already exists, (import + --hive-import + --as-parquetfile) will fail due to a mismatch between the JSON schema that Hive produces and the JSON schema that Sqoop generates. The test case below demonstrates how to reproduce the issue as well as how to work around it.
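> A rough illustration of the mismatch (a sketch using the Avro library with simplified versions of the two schemas, not Sqoop's actual validation code): the Hive-created table describes c_date and c_timestamp as string, while Sqoop generates long (epoch milliseconds) for the same columns, so the two record schemas are not equal and the write is rejected.
> {noformat}
> // Hedged sketch: parse simplified forms of the two schemas seen in the error
> // output below and compare them; the union types of c_date/c_timestamp differ.
> import org.apache.avro.Schema;
>
> public class SchemaMismatchDemo {
>     public static void main(String[] args) {
>         // Schema derived from the existing Hive table (date/timestamp stored as string).
>         Schema hiveSchema = new Schema.Parser().parse(
>             "{\"type\":\"record\",\"name\":\"t1\",\"fields\":["
>           + "{\"name\":\"c_int\",\"type\":[\"null\",\"int\"],\"default\":null},"
>           + "{\"name\":\"c_date\",\"type\":[\"null\",\"string\"],\"default\":null},"
>           + "{\"name\":\"c_timestamp\",\"type\":[\"null\",\"string\"],\"default\":null}]}");
>
>         // Schema Sqoop generates from the JDBC metadata (date/timestamp mapped to long).
>         Schema sqoopSchema = new Schema.Parser().parse(
>             "{\"type\":\"record\",\"name\":\"t1\",\"fields\":["
>           + "{\"name\":\"c_int\",\"type\":[\"null\",\"int\"],\"default\":null},"
>           + "{\"name\":\"c_date\",\"type\":[\"null\",\"long\"],\"default\":null},"
>           + "{\"name\":\"c_timestamp\",\"type\":[\"null\",\"long\"],\"default\":null}]}");
>
>         System.out.println("schemas equal: " + hiveSchema.equals(sqoopSchema)); // prints false
>     }
> }
> {noformat}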
> SETUP (create parquet table with Sqoop import and Beeline CTAS)
> {noformat}
> STEP 01: Create MySQL Table
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1 (c_int int, c_date date, c_timestamp timestamp)"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "describe t1"
> -----------------------------------------------------------------------------------------------
> | Field        | Type      | Null | Key | Default           | Extra                       |
> -----------------------------------------------------------------------------------------------
> | c_int        | int(11)   | YES  |     | (null)            |                             |
> | c_date       | date      | YES  |     | (null)            |                             |
> | c_timestamp  | timestamp | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
> -----------------------------------------------------------------------------------------------
> STEP 02: Insert and Select Row
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "insert into t1 values (1, current_date(), current_timestamp())"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select * from t1"
> --------------------------------------------------
> | c_int | c_date | c_timestamp |
> --------------------------------------------------
> | 1 | 2016-10-26 | 2016-10-26 14:30:33.0 |
> --------------------------------------------------
> STEP 03: Create Hive Tables
> beeline -u jdbc:hive2:// -e "use default; drop table t1"
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hcatalog-database default --hcatalog-table t1_text --create-hcatalog-table --hcatalog-storage-stanza 'stored as parquet' --num-mappers 1
> beeline -u jdbc:hive2:// -e "use default; create table t1 stored as parquet as select * from t1_text; show create table t1;"
> +----------------------------------------------------+--+
> | createtab_stmt |
> +----------------------------------------------------+--+
> | CREATE TABLE `t1`( |
> | `c_int` int, |
> | `c_date` string, |
> | `c_timestamp` string) |
> | ROW FORMAT SERDE |
> | 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' |
> | STORED AS INPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' |
> | OUTPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
> | LOCATION |
> | 'hdfs://<namenode>:8020/user/hive/warehouse/t1' |
> | TBLPROPERTIES ( |
> | 'COLUMN_STATS_ACCURATE'='true', |
> | 'numFiles'='1', |
> | 'numRows'='2', |
> | 'rawDataSize'='6', |
> | 'totalSize'='605', |
> | 'transient_lastDdlTime'='1478298298') |
> +----------------------------------------------------+--+
> {noformat}
> REPRODUCE ISSUE (import --hive-import append and overwrite)
> {noformat}
> STEP 01: Attempt --hive-import --append
>
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --append
> 16/11/02 20:54:23 WARN mapreduce.DataDrivenImportJob: Target Hive table 't1' exists! Sqoop will append data into the existing Hive table. Consider using --hive-overwrite, if you do NOT intend to do appending.
> 16/11/02 20:54:24 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema: {"type":"record","name":"t1","fields":[{"name":"c_int","type":["null","int"],"doc":"Converted from 'int'","default":null},{"name":"c_date","type":["null","string"],"doc":"Converted from 'string'","default":null},{"name":"c_timestamp","type":["null","string"],"doc":"Converted from 'string'","default":null}]}
> Actual schema: {"type":"record","name":"t1","doc":"Sqoop import of t1","fields":[{"name":"c_int","type":["null","int"],"default":null,"columnName":"c_int","sqlType":"4"},{"name":"c_date","type":["null","long"],"default":null,"columnName":"c_date","sqlType":"91"},{"name":"c_timestamp","type":["null","long"],"default":null,"columnName":"c_timestamp","sqlType":"93"}],"tableName":"t1"}
> STEP 02: Attempt --hive-import --hive-overwrite
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --hive-overwrite
> 16/11/02 20:56:55 INFO hive.metastore: Connected to metastore.
> 16/11/02 20:56:56 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema: {"type":"record","name":"t1","fields":[{"name":"c_int","type":["null","int"],"doc":"Converted from 'int'","default":null},{"name":"c_date","type":["null","string"],"doc":"Converted from 'string'","default":null},{"name":"c_timestamp","type":["null","string"],"doc":"Converted from 'string'","default":null}]}
> Actual schema: {"type":"record","name":"t1","doc":"Sqoop import of t1","fields":[{"name":"c_int","type":["null","int"],"default":null,"columnName":"c_int","sqlType":"4"},{"name":"c_date","type":["null","long"],"default":null,"columnName":"c_date","sqlType":"91"},{"name":"c_timestamp","type":["null","long"],"default":null,"columnName":"c_timestamp","sqlType":"93"}],"tableName":"t1"}
> {noformat}
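> The sqlType values in the "Actual schema" above explain the long columns: Sqoop builds its Avro schema from the JDBC metadata, and DATE (sqlType 91) and TIMESTAMP (sqlType 93) end up as Avro long (epoch milliseconds). A simplified approximation of that mapping (a sketch, not the actual Sqoop code):
> {noformat}
> import java.sql.Types;
>
> public class SqlTypeToAvro {
>     // Rough sketch of the JDBC-type -> Avro-type mapping; the real Sqoop code differs.
>     static String avroType(int sqlType) {
>         switch (sqlType) {
>             case Types.INTEGER:   return "int";   // sqlType 4  -> c_int
>             case Types.DATE:      return "long";  // sqlType 91 -> c_date (epoch millis)
>             case Types.TIMESTAMP: return "long";  // sqlType 93 -> c_timestamp (epoch millis)
>             default:              return "string";
>         }
>     }
>
>     public static void main(String[] args) {
>         System.out.println("c_int       -> " + avroType(Types.INTEGER));
>         System.out.println("c_date      -> " + avroType(Types.DATE));
>         System.out.println("c_timestamp -> " + avroType(Types.TIMESTAMP));
>     }
> }
> {noformat}
> The Hive-created table, in contrast, reports c_date and c_timestamp as string, hence the mismatch and the IOException in both the --append and --hive-overwrite cases.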
> WORKAROUND (create parquet table with Sqoop import --hive-import)
> {noformat}
> STEP 01: Create Parquet Table with Sqoop
> beeline -u jdbc:hive2:// -e "use default; drop table t1"
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1
> hdfs dfs -ls -R /user/hive/warehouse/t1
> hdfs dfs -cat /user/hive/warehouse/t1/.metadata/schemas/1.avsc
> Output:
> drwxr-xr-x   - root hive            0 2016-11-02 21:12 /user/hive/warehouse/t1/.metadata
> drwxr-xr-x   - root hive            0 2016-11-02 21:12 /user/hive/warehouse/t1/.metadata/schemas
> -rw-r--r--   3 root hive          522 2016-11-02 21:12 /user/hive/warehouse/t1/.metadata/schemas/1.avsc
> drwxr-xr-x   - root hive            0 2016-11-02 21:12 /user/hive/warehouse/t1/.signals
> -rw-r--r--   3 root hive            0 2016-11-02 21:12 /user/hive/warehouse/t1/.signals/unbounded
> -rw-r--r--   3 root supergroup    864 2016-11-02 21:12 /user/hive/warehouse/t1/69f39f03-7b60-472c-a436-0764061fd7e5.parquet
> ---
> {
> "type" : "record",
> "name" : "t1",
> "doc" : "Sqoop import of t1",
> "fields" : [ {
> "name" : "c_int",
> "type" : [ "null", "int" ],
> "default" : null,
> "columnName" : "c_int",
> "sqlType" : "4"
> }, {
> "name" : "c_date",
> "type" : [ "null", "long" ],
> "default" : null,
> "columnName" : "c_date",
> "sqlType" : "91"
> }, {
> "name" : "c_timestamp",
> "type" : [ "null", "long" ],
> "default" : null,
> "columnName" : "c_timestamp",
> "sqlType" : "93"
> } ],
> "tableName" : "t1"
> }
> STEP 02: Demonstrate (import overwrite)
>
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --hive-overwrite
> beeline -u jdbc:hive2:// -e "use default; select * from t1;"
> +-----------+----------------+-----------------+--+
> | t1.c_int | t1.c_date | t1.c_timestamp |
> +-----------+----------------+-----------------+--+
> | 1 | 1478070000000 | 1478142514000 |
> +-----------+----------------+-----------------+--+
> STEP 03: Demonstrate (import append)
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "insert into t1 values (2, current_date(), current_timestamp())"
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --where "c_int = 2" --append
> beeline -u jdbc:hive2:// -e "use default; select * from t1;"
> +-----------+----------------+-----------------+--+
> | t1.c_int | t1.c_date | t1.c_timestamp |
> +-----------+----------------+-----------------+--+
> | 1 | 1478070000000 | 1478142514000 |
> | 2 | 1478242800000 | 1478297764000 |
> +-----------+----------------+-----------------+--+
> {noformat}
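> Note that the c_date and c_timestamp values Hive returns from the Sqoop-created Parquet table are epoch milliseconds (the Avro long representation noted above). A quick sanity check (a sketch; the literal values are taken from the SELECT output above):
> {noformat}
> import java.time.Instant;
>
> public class EpochMillisCheck {
>     public static void main(String[] args) {
>         long cDate = 1478070000000L;       // t1.c_date from the SELECT above
>         long cTimestamp = 1478142514000L;  // t1.c_timestamp from the SELECT above
>         // Prints the corresponding UTC instants for the stored epoch-millisecond values.
>         System.out.println(Instant.ofEpochMilli(cDate));
>         System.out.println(Instant.ofEpochMilli(cTimestamp));
>     }
> }
> {noformat}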
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)