Thanks for testing and confirming that the export works. One thing to note is that imports are not allowed into Parquet tables with HCatalog (see HIVE-7502). It has been a long-standing issue since we created the Sqoop HCatalog integration a few years ago ☹
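In other words, an import along these lines will fail when HCatalog tries to write the Parquet data (a hypothetical sketch; the connect string and table names are placeholders, and it assumes the target Hive table is stored as Parquet):

# hypothetical example: assumes teste1_parquet is a Hive table STORED AS PARQUET
sqoop import \
  --connect jdbc:postgresql://localhost/postgres \
  --table test1 \
  --hcatalog-database default \
  --hcatalog-table teste1_parquet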
Thanks,
Venkat

From: Markus Kemper <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, October 25, 2016 at 12:01 PM
To: "[email protected]" <[email protected]>
Subject: Re: Sqoop export from Hive table stored as Parquet

Awesomeness, and thank you for helping with the other forums.

Markus Kemper
Customer Operations Engineer
www.cloudera.com

On Tue, Oct 25, 2016 at 2:56 PM, Douglas Spadotto <[email protected]> wrote:

Hi Markus,

It worked fine end to end. Here is the edited output:

SOURCE:

hive> describe extended teste1;
OK
id                      int
nome                    string

Detailed Table Information	Table(tableName:teste1, dbName:default, owner:cloudera, createTime:1477408342, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null), FieldSchema(name:nome, type:string, comment:null)], location:hdfs://quickstart.cloudera:8020/user/hive/warehouse/teste1, inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=2, transient_lastDdlTime=1477408508, COLUMN_STATS_ACCURATE=true, totalSize=645, numRows=3, rawDataSize=6}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.08 seconds, Fetched: 4 row(s)

hive> select * from teste1;
OK
...
1	Douglas
2	Spadotto
3	Doug
Time taken: 0.323 seconds, Fetched: 3 row(s)
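(If you want to reproduce this end to end, the two tables can be created with something like the DDL below. The Hive side matches the describe output above; the PostgreSQL column types are my guess based on the JDBC type info in the Sqoop log further down, so treat them as an assumption.)

# Hive side, matching the describe output above:
hive -e "CREATE TABLE teste1 (id INT, nome STRING) STORED AS PARQUET"
# PostgreSQL side; column types inferred from the JDBC type info below (assumption):
psql -d postgres -c "CREATE TABLE test1 (id INTEGER, nome VARCHAR(50))"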
SQOOP:

[cloudera@quickstart ~]$ sqoop export --connect jdbc:postgresql://localhost/postgres --table test1 --hcatalog-table teste1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/10/25 11:42:19 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.8.0
16/10/25 11:42:19 INFO manager.SqlManager: Using default fetchSize of 1000
16/10/25 11:42:19 INFO tool.CodeGenTool: Beginning code generation
16/10/25 11:42:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "test1" AS t LIMIT 1
16/10/25 11:42:19 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/50329ee88534a5c3b6df8b37aadf2131/test1.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/10/25 11:42:20 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/50329ee88534a5c3b6df8b37aadf2131/test1.jar
16/10/25 11:42:20 INFO mapreduce.ExportJobBase: Beginning export of test1
16/10/25 11:42:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/10/25 11:42:21 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
16/10/25 11:42:21 INFO mapreduce.ExportJobBase: Configuring HCatalog for export job
16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: Configuring HCatalog specific details for job
16/10/25 11:42:21 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "test1" AS t LIMIT 1
16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: Database column names projected : [id, nome]
16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: Database column name - info map : id : [Type : 4,Precision : 10,Scale : 0] nome : [Type : 12,Precision : 50,Scale : 0]   <-- Got the structure here!
16/10/25 11:42:21 INFO hive.metastore: Trying to connect to metastore with URI thrift://quickstart.cloudera:9083
16/10/25 11:42:21 INFO hive.metastore: Opened a connection to metastore, current connections: 1
16/10/25 11:42:21 INFO hive.metastore: Connected to metastore.
16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: HCatalog full table schema fields = [id, nome]
...
16/10/25 11:42:22 WARN hcat.SqoopHCatUtilities: No files under /usr/lib/hive-hcatalog/share/hcatalog/storage-handlers to add to distributed cache for hcatalog job
16/10/25 11:42:22 INFO hcat.SqoopHCatUtilities: Configuring HCatalog for export job
16/10/25 11:42:22 INFO hcat.SqoopHCatUtilities: Ignoring configuration request for HCatalog info
16/10/25 11:42:22 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/10/25 11:42:22 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/10/25 11:42:22 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/10/25 11:42:22 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/192.168.26.129:8032
16/10/25 11:42:49 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/10/25 11:42:49 INFO mapred.FileInputFormat: Total input paths to process : 2
16/10/25 11:42:50 INFO mapreduce.JobSubmitter: number of splits:2
...
16/10/25 11:43:07 INFO mapreduce.ExportJobBase: Transferred 15.9258 KB in 45.1568 seconds (361.1415 bytes/sec)
16/10/25 11:43:07 INFO mapreduce.ExportJobBase: Exported 3 records.
16/10/25 11:43:07 INFO hive.metastore: Closed a connection to metastore, current connections: 0

DESTINATION:

[cloudera@quickstart ~]$ psql -d postgres -c "select * from test1"
 id |   nome
----+----------
  1 | Douglas
  2 | Spadotto
  3 | Doug
(3 rows)

Now I'll go off into the Internet and answer everyone who posted this same question on other forums. :)

Regards,

Douglas

On Tue, Oct 25, 2016 at 4:00 PM, Markus Kemper <[email protected]> wrote:

Glad to hear it. Let us know how it goes.

Markus Kemper
Customer Operations Engineer
www.cloudera.com

On Tue, Oct 25, 2016 at 1:55 PM, Douglas Spadotto <[email protected]> wrote:

Hi Markus,

Thank you! I tried this (the hcatalog options) myself a few minutes after I hit "send" on the e-mail. It worked fine: Sqoop was able to read the Parquet structure. Only my MR job crashed, and that was due to my unstable environment.

It looks like I'm on my way to the solution.
Cheers,

Douglas

On Tue, Oct 25, 2016 at 3:32 PM, Markus Kemper <[email protected]> wrote:

Hello Douglas,

The only workaround that I am aware of is to use the Sqoop --hcatalog options, for example:

sqoop export --connect <jdbc_connection_string> --table <rdbms_table> --hcatalog-database <hive_database> --hcatalog-table <hive_table>
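Filled in with the names from your mail below, that should look something like this (untested on my side; if --hcatalog-database is omitted, Sqoop uses the "default" Hive database):

# the template above with the connect string and table names from this thread:
sqoop export \
  --connect jdbc:postgresql://localhost/postgres \
  --table test1 \
  --hcatalog-database default \
  --hcatalog-table teste1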
Markus Kemper
Customer Operations Engineer
www.cloudera.com

On Tue, Oct 25, 2016 at 1:21 PM, Douglas Spadotto <[email protected]> wrote:

Hello everyone,

In the past few months I have seen quite a few messages about Parquet support in Sqoop, all about importing, and some of them worked well. But for exporting, I'm receiving the error below when trying to export from a Hive table stored as Parquet to PostgreSQL:

[cloudera@quickstart ~]$ sqoop export --connect jdbc:postgresql://localhost/postgres --table test1 --export-dir /user/hive/warehouse/teste1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/10/25 09:19:09 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.8.0
16/10/25 09:19:09 INFO manager.SqlManager: Using default fetchSize of 1000
16/10/25 09:19:09 INFO tool.CodeGenTool: Beginning code generation
16/10/25 09:19:10 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "test1" AS t LIMIT 1
16/10/25 09:19:10 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/019c3435216213411e2de14c483af692/test1.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/10/25 09:19:11 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/019c3435216213411e2de14c483af692/test1.jar
16/10/25 09:19:11 INFO mapreduce.ExportJobBase: Beginning export of test1
16/10/25 09:19:12 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/10/25 09:19:12 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
16/10/25 09:19:13 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "test1" AS t LIMIT 1
16/10/25 09:19:13 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://quickstart.cloudera:8020/user/hive/warehouse/teste1/.metadata
org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://quickstart.cloudera:8020/user/hive/warehouse/teste1/.metadata
	at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.checkExists(FileSystemMetadataProvider.java:562)
	at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.find(FileSystemMetadataProvider.java:605)
	at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.load(FileSystemMetadataProvider.java:114)
	at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.load(FileSystemDatasetRepository.java:197)
	at org.kitesdk.data.Datasets.load(Datasets.java:108)
	at org.kitesdk.data.Datasets.load(Datasets.java:140)
	at org.kitesdk.data.mapreduce.DatasetKeyInputFormat$ConfigBuilder.readFrom(DatasetKeyInputFormat.java:92)
	at org.kitesdk.data.mapreduce.DatasetKeyInputFormat$ConfigBuilder.readFrom(DatasetKeyInputFormat.java:139)
	at org.apache.sqoop.mapreduce.JdbcExportJob.configureInputFormat(JdbcExportJob.java:84)
	at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:432)
	at org.apache.sqoop.manager.SqlManager.exportTable(SqlManager.java:931)
	at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:81)
	at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)
	at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
	at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
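My reading of the stack trace (an assumption on my part, based only on the trace above) is that with --export-dir Sqoop hands the Parquet directory to the Kite SDK, which looks for a .metadata descriptor that Kite-created datasets carry but a Hive-created Parquet table does not; that is easy to check by listing the table directory:

# does /user/hive/warehouse/teste1 contain a .metadata directory?
# (for a Hive-created Parquet table, it should not)
hdfs dfs -ls /user/hive/warehouse/teste1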
I saw a recent JIRA opened about this, https://issues.apache.org/jira/browse/SQOOP-2907, and I'm wondering: is there any workaround for this?

Thanks in advance,

Douglas

--
Visit: http://canseidesercowboy.wordpress.com/
Follow: @dougspadotto or @excowboys
-----
Frodo: "I wish none of this had happened."
Gandalf: "So do all who live to see such times, but that is not for them to decide. All we have to decide is what to do with the time that is given to us." -- Lord of the Rings: The Fellowship of the Ring (2001)