Hi, I am trying to create an external table on an S3 bucket, but I'm receiving the following error in the process:
hive> CREATE EXTERNAL TABLE ping_prod
    > PARTITIONED BY (day string)
    > ROW FORMAT SERDE
    >   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS INPUTFORMAT
    >   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT
    >   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION 's3a://path-to-bucket.com/data/'
    > TBLPROPERTIES (
    >   'avro.schema.url'='s3a://path-to-bucket.com/avro/schema.avsc');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to sanitize XML document destined for handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK)

See https://gist.github.com/RikHeijdens/582371b5e6d24abc7471 for a complete stack trace.

This S3 bucket is very large (at least 800 TB) and contains about 400 directories of Avro-serialized data, with each directory holding one day's worth of data. I suspected the problem might be caused by the size of the bucket and the number of files in it, so I tried creating an external table on a subset (one day) of the data. That worked fine and didn't cause any problems (the statement I used for that test is sketched in the P.S. below).

I was wondering whether this is a known issue, and why it is happening. I think it's an out-of-memory error; if that's the case, why would Hive need so much memory just to create an external table? Also, are there any workarounds for this problem?

I'm running HDP-2.3.4.0-3485 and I am using the following Hive version:

[root@docker-ambari tmp]# hive --version
WARNING: Use "yarn jar" to launch YARN applications.
Hive 1.2.1.2.3.4.0-3485
Subversion git://c66-slave-20176e25-2/grid/0/jenkins/workspace/HDP-build-centos6/bigtop/build/hive/rpm/BUILD/hive-1.2.1.2.3.4.0 -r efb067075854961dfa41165d5802a62ae334a2db
Compiled by jenkins on Wed Dec 16 04:01:39 UTC 2015
From source with checksum 4ecc763ed826fd070121da702cbd17e9

Thanks,
Rik
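P.S. For reference, the single-day test that succeeded looked roughly like the following. The table name ping_test and the day=2015-12-01 directory name are just illustrations of how my daily directories are laid out, not the literal paths:

-- Same Avro table definition as above, but pointed at a single
-- day's directory instead of the bucket root, so only one
-- directory has to be listed. Table name and path are illustrative.
CREATE EXTERNAL TABLE ping_test
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3a://path-to-bucket.com/data/day=2015-12-01/'
TBLPROPERTIES (
  'avro.schema.url'='s3a://path-to-bucket.com/avro/schema.avsc');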