Hi,

I am trying to create an external table on an S3 bucket, but I'm
receiving the following error in the process:

hive> CREATE EXTERNAL TABLE ping_prod
    > PARTITIONED BY(day string)
    > ROW FORMAT SERDE
    > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS INPUTFORMAT
    > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT
    > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION 's3a://path-to-bucket.com/data/'
    > TBLPROPERTIES (
    > 'avro.schema.url'='s3a://path-to-bucket.com/avro/schema.avsc');
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:com.amazonaws.AmazonClientException: Unable to
unmarshall response (Failed to sanitize XML document destined for
handler class 
com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler).
Response Code: 200, Response Text: OK)

See: https://gist.github.com/RikHeijdens/582371b5e6d24abc7471 for a
complete stacktrace.

This S3 bucket is very large (at least 800 TB) and contains about 400
directories of Avro-serialized data. Each directory contains a day's
worth of data.

I suspected that this might be caused by the size of the bucket and
the number of files in it, so I tried to create an external table on
a subset (1 day) of the data.
That worked fine and didn't cause any problems.
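
For illustration, a single-day definition along these lines would look
roughly as follows. The table name ping_prod_one_day and the directory
name 2016-01-01 are placeholders made up for this sketch, not the real
names, and the PARTITIONED BY clause is omitted on the assumption that
only one directory is involved:

```sql
-- Sketch only: same SerDe and schema as the failing statement,
-- but LOCATION points at one day's directory instead of the whole prefix.
-- 'ping_prod_one_day' and '2016-01-01' are hypothetical placeholders.
CREATE EXTERNAL TABLE ping_prod_one_day
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3a://path-to-bucket.com/data/2016-01-01/'
TBLPROPERTIES (
'avro.schema.url'='s3a://path-to-bucket.com/avro/schema.avsc');
```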

I was wondering whether this is a known issue, and why it is happening.
I think it's an out-of-memory error. If that's the case, why would
Hive need so much memory to create an external table?
Also, are there any workarounds for this problem?

I'm running HDP-2.3.4.0-3485 and I am using the following Hive version:
[root@docker-ambari tmp]# hive --version
WARNING: Use "yarn jar" to launch YARN applications.
Hive 1.2.1.2.3.4.0-3485
Subversion git://c66-slave-20176e25-2/grid/0/jenkins/workspace/HDP-build-centos6/bigtop/build/hive/rpm/BUILD/hive-1.2.1.2.3.4.0 -r efb067075854961dfa41165d5802a62ae334a2db
Compiled by jenkins on Wed Dec 16 04:01:39 UTC 2015
From source with checksum 4ecc763ed826fd070121da702cbd17e9

Thanks,
Rik
