Awesome, and thank you for helping out on the other forums.

Markus Kemper
Customer Operations Engineer
http://www.cloudera.com


On Tue, Oct 25, 2016 at 2:56 PM, Douglas Spadotto <[email protected]>
wrote:

> Hi Markus,
>
> It worked fine end to end. Here is the edited output:
>
> *SOURCE:*
>
> hive> describe extended teste1;
> OK
> id                   int
> nome                 string
>
> Detailed Table Information Table(tableName:teste1, dbName:default,
> owner:cloudera, createTime:1477408342, lastAccessTime:0, retention:0,
> sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null),
> FieldSchema(name:nome, type:string, comment:null)],
> location:hdfs://quickstart.cloudera:8020/user/hive/warehouse/teste1, 
> *inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat,*
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe,
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[],
> skewedColValues:[], skewedColValueLocationMaps:{}),
> storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=2,
> transient_lastDdlTime=1477408508, COLUMN_STATS_ACCURATE=true,
> totalSize=645, numRows=3, rawDataSize=6}, viewOriginalText:null,
> viewExpandedText:null, tableType:MANAGED_TABLE)
> Time taken: 0.08 seconds, Fetched: 4 row(s)
>
> hive> select * from teste1;
> OK
> ...
> 1 Douglas
> 2 Spadotto
> 3 Doug
> Time taken: 0.323 seconds, Fetched: 3 row(s)
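>
> (For reference, in case anyone wants to reproduce this: the DDL below is
> my reconstruction from the describe output above, not the original script,
> but something like it should recreate the source table.)
>
> hive> CREATE TABLE teste1 (id INT, nome STRING) STORED AS PARQUET;
> hive> INSERT INTO TABLE teste1 VALUES (1, 'Douglas'), (2, 'Spadotto'), (3, 'Doug');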
>
> *SQOOP:*
>
> [cloudera@quickstart ~]$ sqoop export --connect
> jdbc:postgresql://localhost/postgres --table test1 --hcatalog-table
> teste1
> Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will
> fail.
> Please set $ACCUMULO_HOME to the root of your Accumulo installation.
> 16/10/25 11:42:19 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.8.0
> 16/10/25 11:42:19 INFO manager.SqlManager: Using default fetchSize of 1000
> 16/10/25 11:42:19 INFO tool.CodeGenTool: Beginning code generation
> 16/10/25 11:42:19 INFO manager.SqlManager: Executing SQL statement: SELECT
> t.* FROM "test1" AS t LIMIT 1
> 16/10/25 11:42:19 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is
> /usr/lib/hadoop-mapreduce
> Note: /tmp/sqoop-cloudera/compile/50329ee88534a5c3b6df8b37aadf2131/test1.java
> uses or overrides a deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> 16/10/25 11:42:20 INFO orm.CompilationManager: Writing jar file:
> /tmp/sqoop-cloudera/compile/50329ee88534a5c3b6df8b37aadf2131/test1.jar
> 16/10/25 11:42:20 INFO mapreduce.ExportJobBase: Beginning export of test1
> 16/10/25 11:42:21 INFO Configuration.deprecation: mapred.jar is
> deprecated. Instead, use mapreduce.job.jar
> 16/10/25 11:42:21 INFO Configuration.deprecation: mapred.map.max.attempts
> is deprecated. Instead, use mapreduce.map.maxattempts
> 16/10/25 11:42:21 INFO mapreduce.ExportJobBase: Configuring HCatalog for
> export job
> 16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: Configuring HCatalog
> specific details for job
> 16/10/25 11:42:21 INFO manager.SqlManager: Executing SQL statement: SELECT
> t.* FROM "test1" AS t LIMIT 1
> 16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: Database column names
> projected : [id, nome]
> 16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: Database column name -
> info map :
> id : [Type : 4,Precision : 10,Scale : 0]
> nome : [Type : 12,Precision : 50,Scale : 0]  *<-- Got the structure here!*
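>
> (Aside: those Type codes are java.sql.Types constants; 4 is INTEGER and
> 12 is VARCHAR, which matches the id and nome columns.)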
>
> 16/10/25 11:42:21 INFO hive.metastore: Trying to connect to metastore with
> URI thrift://quickstart.cloudera:9083
> 16/10/25 11:42:21 INFO hive.metastore: Opened a connection to metastore,
> current connections: 1
> 16/10/25 11:42:21 INFO hive.metastore: Connected to metastore.
> 16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: HCatalog full table schema
> fields = [id, nome]
> ...
> 16/10/25 11:42:22 WARN hcat.SqoopHCatUtilities: No files under
> /usr/lib/hive-hcatalog/share/hcatalog/storage-handlers to add to
> distributed cache for hcatalog job
> 16/10/25 11:42:22 INFO hcat.SqoopHCatUtilities: Configuring HCatalog for
> export job
> 16/10/25 11:42:22 INFO hcat.SqoopHCatUtilities: Ignoring configuration
> request for HCatalog info
> 16/10/25 11:42:22 INFO Configuration.deprecation: 
> mapred.reduce.tasks.speculative.execution
> is deprecated. Instead, use mapreduce.reduce.speculative
> 16/10/25 11:42:22 INFO Configuration.deprecation:
> mapred.map.tasks.speculative.execution is deprecated. Instead, use
> mapreduce.map.speculative
> 16/10/25 11:42:22 INFO Configuration.deprecation: mapred.map.tasks is
> deprecated. Instead, use mapreduce.job.maps
> 16/10/25 11:42:22 INFO client.RMProxy: Connecting to ResourceManager at
> quickstart.cloudera/192.168.26.129:8032
> 16/10/25 11:42:49 INFO Configuration.deprecation: mapred.input.dir is
> deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
> 16/10/25 11:42:49 INFO mapred.FileInputFormat: Total input paths to
> process : 2
> 16/10/25 11:42:50 INFO mapreduce.JobSubmitter: number of splits:2
> ...
> 16/10/25 11:43:07 INFO mapreduce.ExportJobBase: Transferred 15.9258 KB in
> 45.1568 seconds (361.1415 bytes/sec)
> 16/10/25 11:43:07 INFO mapreduce.ExportJobBase: Exported 3 records.
> 16/10/25 11:43:07 INFO hive.metastore: Closed a connection to metastore,
> current connections: 0
>
> *DESTINATION:*
>
> [cloudera@quickstart ~]$ psql -d postgres -c "select * from test1"
>  id |   nome
> ----+----------
>   1 | Douglas
>   2 | Spadotto
>   3 | Doug
> (3 rows)
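>
> (An extra sanity check, not from the original run: the count below should
> return 3, matching the 3 records Sqoop reported exporting.)
>
> [cloudera@quickstart ~]$ psql -d postgres -c "select count(*) from test1"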
>
> Now I'll go off into the Internet and answer everyone who posted this same
> question on other forums. :)
>
> Regards,
>
> Douglas
>
>
> On Tue, Oct 25, 2016 at 4:00 PM, Markus Kemper <[email protected]>
> wrote:
>
>> Glad to hear it. Let us know how it goes.
>>
>>
>> Markus Kemper
>> Customer Operations Engineer
>> http://www.cloudera.com
>>
>>
>> On Tue, Oct 25, 2016 at 1:55 PM, Douglas Spadotto <[email protected]
>> > wrote:
>>
>>> Hi Markus,
>>>
>>> Thank you!
>>>
>>> I tried this (the hcatalog options) myself a few minutes after I hit
>>> "send" on the e-mail. It worked fine; Sqoop was able to read the Parquet
>>> structure. Only my MapReduce job crashed, and that was due to my unstable
>>> environment.
>>>
>>> It looks like I'm on my way to the solution.
>>>
>>> Cheers,
>>>
>>> Douglas
>>>
>>> On Tue, Oct 25, 2016 at 3:32 PM, Markus Kemper <[email protected]>
>>> wrote:
>>>
>>>> Hello Douglas,
>>>>
>>>> The only workaround that I am aware of is to use the Sqoop --hcatalog
>>>> options, for example:
>>>>
>>>> sqoop export --connect <jdbc_connection_string> --table <rdbms_table>
>>>> --hcatalog-database <hive_database> --hcatalog-table <hive_table>
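>>>>
>>>> The --hcatalog options make Sqoop read the table definition from the
>>>> Hive metastore instead of going through the Kite dataset path, which
>>>> expects a .metadata descriptor under the table directory (that is what
>>>> your stack trace is complaining about). For your table, something like
>>>> this should work (your table appears to be in the "default" database,
>>>> in which case --hcatalog-database can also be omitted):
>>>>
>>>> sqoop export --connect jdbc:postgresql://localhost/postgres \
>>>>   --table test1 --hcatalog-database default --hcatalog-table teste1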
>>>>
>>>>
>>>>
>>>> Markus Kemper
>>>> Customer Operations Engineer
>>>> http://www.cloudera.com
>>>>
>>>>
>>>> On Tue, Oct 25, 2016 at 1:21 PM, Douglas Spadotto <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> Over the past few months I have seen quite a few messages about
>>>>> Parquet support in Sqoop, all of them about importing. Some of those
>>>>> approaches worked well.
>>>>>
>>>>> For exporting, however, I am receiving the error below when trying to
>>>>> export a Hive table stored as Parquet to PostgreSQL:
>>>>>
>>>>> [cloudera@quickstart ~]$ sqoop export --connect
>>>>> jdbc:postgresql://localhost/postgres --table test1  --export-dir
>>>>> /user/hive/warehouse/teste1
>>>>> Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports
>>>>> will fail.
>>>>> Please set $ACCUMULO_HOME to the root of your Accumulo installation.
>>>>> 16/10/25 09:19:09 INFO sqoop.Sqoop: Running Sqoop version:
>>>>> 1.4.6-cdh5.8.0
>>>>> 16/10/25 09:19:09 INFO manager.SqlManager: Using default fetchSize of
>>>>> 1000
>>>>> 16/10/25 09:19:09 INFO tool.CodeGenTool: Beginning code generation
>>>>> 16/10/25 09:19:10 INFO manager.SqlManager: Executing SQL statement:
>>>>> SELECT t.* FROM "test1" AS t LIMIT 1
>>>>> 16/10/25 09:19:10 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is
>>>>> /usr/lib/hadoop-mapreduce
>>>>> Note: 
>>>>> /tmp/sqoop-cloudera/compile/019c3435216213411e2de14c483af692/test1.java
>>>>> uses or overrides a deprecated API.
>>>>> Note: Recompile with -Xlint:deprecation for details.
>>>>> 16/10/25 09:19:11 INFO orm.CompilationManager: Writing jar file:
>>>>> /tmp/sqoop-cloudera/compile/019c3435216213411e2de14c483af692/test1.jar
>>>>> 16/10/25 09:19:11 INFO mapreduce.ExportJobBase: Beginning export of
>>>>> test1
>>>>> 16/10/25 09:19:12 INFO Configuration.deprecation: mapred.jar is
>>>>> deprecated. Instead, use mapreduce.job.jar
>>>>> 16/10/25 09:19:12 INFO Configuration.deprecation:
>>>>> mapred.map.max.attempts is deprecated. Instead, use
>>>>> mapreduce.map.maxattempts
>>>>> 16/10/25 09:19:13 INFO manager.SqlManager: Executing SQL statement:
>>>>> SELECT t.* FROM "test1" AS t LIMIT 1
>>>>> 16/10/25 09:19:13 ERROR sqoop.Sqoop: Got exception running Sqoop:
>>>>> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not
>>>>> exist: hdfs://quickstart.cloudera:8020/user/hive/warehouse/teste1/.metadata
>>>>> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not
>>>>> exist: hdfs://quickstart.cloudera:8020/user/hive/warehouse/teste1/.metadata
>>>>> at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.checkExists(FileSystemMetadataProvider.java:562)
>>>>> at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.find(FileSystemMetadataProvider.java:605)
>>>>> at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.load(FileSystemMetadataProvider.java:114)
>>>>> at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.load(FileSystemDatasetRepository.java:197)
>>>>> at org.kitesdk.data.Datasets.load(Datasets.java:108)
>>>>> at org.kitesdk.data.Datasets.load(Datasets.java:140)
>>>>> at org.kitesdk.data.mapreduce.DatasetKeyInputFormat$ConfigBuilder.readFrom(DatasetKeyInputFormat.java:92)
>>>>> at org.kitesdk.data.mapreduce.DatasetKeyInputFormat$ConfigBuilder.readFrom(DatasetKeyInputFormat.java:139)
>>>>> at org.apache.sqoop.mapreduce.JdbcExportJob.configureInputFormat(JdbcExportJob.java:84)
>>>>> at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:432)
>>>>> at org.apache.sqoop.manager.SqlManager.exportTable(SqlManager.java:931)
>>>>> at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:81)
>>>>> at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)
>>>>> at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>> at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
>>>>> at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
>>>>> at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
>>>>> at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
>>>>>
>>>>> I saw that a JIRA was recently opened about this,
>>>>> https://issues.apache.org/jira/browse/SQOOP-2907, and I am wondering
>>>>> if there is any workaround.
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Douglas
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
> --
> Visit: http://canseidesercowboy.wordpress.com/
> Follow: @dougspadotto or @excowboys
> -----
> Frodo: "I wish none of this had happened."
> Gandalf: "So do all who live to see such times, but that is not for them
> to decide. All we have to decide is what to do with the time that is given
> to us."
> -- Lord of the Rings: The Fellowship of the Ring (2001)
>
