Please, community, I'd really appreciate your opinion on this topic. Best regards, Roberto
---------- Forwarded message ----------
From: Roberto Coluccio <roberto.coluc...@gmail.com>
Date: Sat, Jul 25, 2015 at 6:28 PM
Subject: [Spark + Hive + EMR + S3] Issue when reading from Hive external table backed on S3 with large amount of small files
To: user@spark.apache.org

Hello Spark community,

I currently have a Spark 1.3.1 batch driver, deployed in YARN-cluster mode on an EMR cluster (AMI 3.7.0), that reads input data through a HiveContext, in particular SELECTing data from an EXTERNAL TABLE backed by S3. The table has dynamic partitions and contains *hundreds of small GZip files*.

Since collating those files on the source side is currently not feasible, I observe that, by default, Spark maps the SELECT query into as many tasks as there are files under the table root path (+partitions), e.g. 860 files === 860 tasks to complete the Spark stage of that read operation. This behaviour obviously creates considerable overhead and often results in failed stages due to OOM exceptions and subsequent executor crashes.

Regardless of the input size I can manage to handle, I would really appreciate any suggestion on how to collate the input partitions while reading, or at least on how to reduce the number of tasks spawned by the Hive query.

Following http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-differences.html#emr-hive-gzip-splits I tried setting:

  hiveContext.sql("set hive.hadoop.supports.splittable.combineinputformat=true")

before creating the external table to read from and querying it, but it resulted in NO changes. I also tried setting it in hive-site.xml on the cluster, but observed the same behaviour.

Thanks to whoever can give me any hints.

Best regards,
Roberto
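
P.S. To make the scenario concrete, here is a minimal sketch (Scala, Spark 1.3 API) of the read as described above, including the combine-input-format setting I tried and, as a possible mitigation I'm considering for the downstream stages, a coalesce of the result. The table name my_external_table, the 128 MB split-size value and the coalesce target of 64 are purely illustrative:

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext()            // configuration supplied by spark-submit in YARN-cluster mode
  val hiveContext = new HiveContext(sc)

  // Ask Hive to combine small (even gzipped) input files into larger splits.
  // Whether Spark's HiveContext scan actually honours these settings is
  // exactly the open question of this thread.
  hiveContext.sql("set hive.hadoop.supports.splittable.combineinputformat=true")
  hiveContext.sql("set mapred.max.split.size=134217728")  // ~128 MB, illustrative value

  // SELECT from the EXTERNAL TABLE backed by S3 (hypothetical table name).
  val df = hiveContext.sql("SELECT * FROM my_external_table")

  // Even if the scan itself still spawns one task per file, collapsing the
  // partitions right after the read keeps the task count of the subsequent
  // stages manageable; coalesce avoids a full shuffle. 64 is illustrative.
  val collapsed = df.rdd.coalesce(64)

Of course, the coalesce would only help the stages after the read, not the one-task-per-file scan itself, which is why I'd still like a way to combine the splits at read time.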