Markus Kemper created SQOOP-3046:
------------------------------------
Summary: Add support for (import + --hcatalog* + --as-parquetfile)
Key: SQOOP-3046
URL: https://issues.apache.org/jira/browse/SQOOP-3046
Project: Sqoop
Issue Type: Improvement
Components: hive-integration
Reporter: Markus Kemper
This is a request to support Sqoop import with the --hcatalog* options when
writing Parquet data files. Today the combination fails: HCatalog drives the
write through the generic mapred getRecordWriter() entry point, but Hive's
MapredParquetOutputFormat implements that method only as a stub that throws
(code snippet below). The test case that follows demonstrates the failure.
CODE SNIPPET:
{noformat}
../MapredParquetOutputFormat.java
69   @Override
70   public RecordWriter<Void, ParquetHiveRecord> getRecordWriter(
71       final FileSystem ignored,
72       final JobConf job,
73       final String name,
74       final Progressable progress
75       ) throws IOException {
76     throw new RuntimeException("Should never be used");
77   }
{noformat}
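The failure mode can be illustrated without a cluster: Hive's MapredParquetOutputFormat implements the Parquet write path only through the Hive-specific getHiveRecordWriter() entry point (from HiveOutputFormat), while the inherited mapred-style getRecordWriter() is a stub that throws. HCatalog's FileOutputFormatContainer only knows the generic mapred interface, so it hits the stub. A minimal, self-contained sketch of that interface mismatch (the class and method shapes below are illustrative stand-ins, not the real Hive/HCatalog types):

```java
// Stand-in for the mapred RecordWriter interface.
interface RecordWriter {
    void write(String row);
}

// Stand-in for Hive's HiveOutputFormat: the Hive-specific write entry point.
interface HiveOutputFormat {
    RecordWriter getHiveRecordWriter(String path);
}

// Mimics MapredParquetOutputFormat: the Hive path works, the mapred path throws.
class ParquetLikeOutputFormat implements HiveOutputFormat {
    // Hive's own write path: fully implemented.
    @Override
    public RecordWriter getHiveRecordWriter(String path) {
        return row -> System.out.println("parquet<" + path + ">: " + row);
    }

    // Generic mapred-style path: a stub, cf. MapredParquetOutputFormat.java:76.
    public RecordWriter getRecordWriter(String path) {
        throw new RuntimeException("Should never be used");
    }
}

public class Repro {
    public static void main(String[] args) {
        ParquetLikeOutputFormat fmt = new ParquetLikeOutputFormat();

        // Hive's Parquet writes go through the Hive-specific method and succeed:
        fmt.getHiveRecordWriter("/tmp/t1").write("row-1");

        // HCatalog's FileOutputFormatContainer only calls the generic method,
        // which is the call that fails in the stack trace below:
        try {
            fmt.getRecordWriter("/tmp/t1").write("row-1");
        } catch (RuntimeException e) {
            System.out.println("HCatalog-style call fails: " + e.getMessage());
        }
    }
}
```

A fix would presumably need HCatalog (or Sqoop's HCatalog integration) to route Parquet writes through the Hive-specific entry point rather than the generic one.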
TEST CASE:
{noformat}
STEP 01 : Create MySQL Table
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1 (c_int int, c_date date, c_timestamp timestamp)"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "describe t1"
------------------------------------------------------------------------------------------
| Field       | Type      | Null | Key | Default           | Extra                       |
------------------------------------------------------------------------------------------
| c_int       | int(11)   | YES  |     | (null)            |                             |
| c_date      | date      | YES  |     | (null)            |                             |
| c_timestamp | timestamp | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
------------------------------------------------------------------------------------------
STEP 02 : Insert and Select Row
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "insert into t1 values (1, current_date(), current_timestamp())"
sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select * from t1"
--------------------------------------------------
| c_int | c_date | c_timestamp |
--------------------------------------------------
| 1 | 2016-10-26 | 2016-10-26 14:30:33.0 |
--------------------------------------------------
STEP 03 : Drop Hive Table and Import Using --hcatalog Options (stored as parquet)
beeline -u jdbc:hive2:// -e "use default; drop table t1"
sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hcatalog-database default --hcatalog-table t1 --create-hcatalog-table --hcatalog-storage-stanza 'stored as parquet' --num-mappers 1
[sqoop console debug]
16/11/02 20:25:15 INFO mapreduce.Job: Task Id : attempt_1478089149450_0046_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: Should never be used
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getRecordWriter(MapredParquetOutputFormat.java:76)
    at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:102)
    at org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:767)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
[yarn maptask debug]
2016-11-02 20:25:15,565 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: 1=1 AND 1=1
2016-11-02 20:25:15,583 DEBUG [main] org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat: Creating db record reader for db product: MYSQL
2016-11-02 20:25:15,613 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
2016-11-02 20:25:15,614 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
2016-11-02 20:25:15,620 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
2016-11-02 20:25:15,633 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: Should never be used
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getRecordWriter(MapredParquetOutputFormat.java:76)
    at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:102)
    at org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:767)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
{noformat}
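Until the --hcatalog* + Parquet combination is supported, one possible workaround (untested here, and only applicable when the plain Hive import path is acceptable) is to drop the --hcatalog* options and use Sqoop's existing --hive-import with --as-parquetfile, which writes Parquet through Sqoop's own writer rather than through HCatalog/MapredParquetOutputFormat:

{noformat}
# Hypothetical workaround sketch: same source table as the test case above,
# Parquet via the Hive import path instead of the HCatalog path.
sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD \
  --table t1 \
  --hive-import --hive-table t1 \
  --as-parquetfile \
  --num-mappers 1
{noformat}

Note this does not give the other benefits of the HCatalog integration (e.g. the --hcatalog-storage-stanza control over table DDL), so it is a stopgap rather than a substitute for this improvement.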
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)