I've had a little more luck with this after upgrading to CDH 5.3. The
Oracle direct connector seems to be working well with HCatalog integration
and the various output file formats. However, it seems that Parquet doesn't
work with HCatalog integration. When using "stored as parquet" as the
--hcatalog-storage-stanza, all the mappers fail with the error below.
15/02/02 17:17:03 INFO mapreduce.Job: Task Id : attempt_1422914679712_0003_m_000042_1, Status : FAILED
Error: java.lang.RuntimeException: Should never be used
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getRecordWriter(MapredParquetOutputFormat.java:79)
    at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.getRecordWriter(FileOutputFormatContainer.java:103)
    at org.apache.hive.hcatalog.mapreduce.HCatOutputFormat.getRecordWriter(HCatOutputFormat.java:260)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:644)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Has anyone had any luck sqooping directly to Parquet with Decimal and
Timestamp types?
On Tue, Dec 2, 2014 at 6:17 PM, Joshua Baxter <[email protected]>
wrote:
> I'm using Sqoop, OraOop and the --as-parquetfile switch to pull down
> partitions of a large fact table and getting some great speed. There aren't
> any columns I can evenly split by with the default connector, but with
> OraOop I can get evenly sized Parquet files that I can use directly in
> Impala and Hive without incurring remote reads. A couple of things I have
> noticed, though:
>
> - Decimal fields are getting exported as strings. SQOOP-1445 refers to
> this, but it sounds like a fix isn't planned due to the HCatalog support.
> Unfortunately the direct connectors, apart from Netezza, are not currently
> supported by HCatalog.
> - You need to use the option -Doraoop.timestamp.string=false, otherwise you
> get a "Not in union ["long","null"]: 2014-07-24 00:00:00" exception due to
> the intermediate file format. However, the resulting Parquet column is a
> double rather than a Hive- or Impala-compatible timestamp (a rough stopgap
> covering both of these issues is sketched after this list).
>
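> One possible stopgap, rather than a fix: put a view over the imported files
> that casts the columns back on read. The sketch below is only that, a
> sketch: the database, view and table names are made up (it assumes there is
> already an external table, fact_raw here, over the imported directory), it
> assumes DECIMAL(38,10) is wide enough, and it assumes the timestamp double
> holds seconds since the epoch (if it turns out to be milliseconds, divide
> by 1000 first).
>
> # Stopgap sketch only -- placeholder names, and the casts assume
> # seconds-since-epoch doubles and a DECIMAL(38,10) target precision.
> impala-shell -q "
> CREATE VIEW landing_db.fact_v AS
> SELECT
>   COL1, COL2, COL3, COL4,
>   CAST(COL5 AS DECIMAL(38,10)) AS COL5,
>   CAST(COL6 AS TIMESTAMP)      AS COL6
> FROM landing_db.fact_raw;"
>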
> Here is what I am running now.
>
> sqoop import -Doraoop.chunk.method=ROWID -Doraoop.timestamp.string=false
> -Doraoop.import.partitions=${PARTITION} \
> --direct \
> --connect jdbc:oracle:thin:@//${DATABASE} \
> --table "${TABLE}" \
> --columns COL1,COL2,COL3,COL4,COL5,COL6 \
> --map-column-java COL1=Long,COL2=Long,COL3=Long,COL4=Long \
> --m 48 \
> --target-dir /user/joshba/LANDING_PAD/TABLE-${PARTITION}/ \
> --delete-target-dir
>
> COL1-4 are stored as NUMBER(38,0) but don't hold anything larger than a
> long, so I've remapped those to save space. COL5 is a Decimal and COL6 is a
> DATE. Is there any way I can remap these as well so that they are written
> into the Parquet file as DECIMAL- and timestamp-compatible types
> respectively, so there isn't a need to redefine these columns?
>
> Many Thanks
>
> Josh
>