Hello Spark community,

I currently have a Spark 1.3.1 batch driver, deployed in YARN-cluster mode
on an EMR cluster (AMI 3.7.0), that reads its input through a HiveContext,
in particular SELECTing data from an EXTERNAL TABLE backed by S3. The
table has dynamic partitions and contains *hundreds of small GZip files*.
Since collating those files on the source side is not feasible at the
moment, I observe that, by default, Spark maps the SELECT query into as
many tasks as there are files under the table root path (plus partitions),
e.g. 860 files == 860 tasks to complete the Spark stage of that read
operation.

This behaviour obviously creates considerable overhead and often results
in failed stages due to OOM exceptions and subsequent executor crashes.
Regardless of the input size I can manage to handle, I would really
appreciate any suggestion on how to collate the input partitions while
reading, or, at least, how to reduce the number of tasks spawned by the
Hive query.
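To make concrete what I mean by "collate while reading": the only
workaround I can think of at the Spark level is a shuffle-free coalesce
right after the read, along these lines (table name hypothetical, and I am
not sure this is the right layer at which to solve it):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val df = hiveContext.sql("SELECT * FROM my_table")  // hypothetical, as above
// coalesce with shuffle = false merges input partitions without a stage
// boundary, so the read stage itself should run with ~50 tasks, each
// scanning several of the small gzip files sequentially
val collated: RDD[Row] = df.rdd.coalesce(50, shuffle = false)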

Looking at
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-differences.html#emr-hive-gzip-splits
I tried setting:

hiveContext.sql("set hive.hadoop.supports.splittable.combineinputformat=true")


before creating and querying the external table, but it resulted in NO
change. I also tried setting it in the hive-site.xml on the cluster, but
observed the same behaviour.

Thanks in advance to anyone who can offer any hints.

Best regards,
Roberto
