We have data stored in S3, partitioned by several columns, following this hierarchy: s3://bucket/data/column1=X/column2=Y/parquet-files
We run a Spark job on an EMR cluster (1 master, 3 slaves) and noticed the following:

A) When we declare the initial DataFrame over the whole dataset (val df = sqlContext.read.parquet("s3://bucket/data/")), the driver splits the work into several tasks (259) that are performed by the executors, and we believe the driver gets back the Parquet metadata.

Question: The above takes about 25 minutes for our dataset. We believe it should be a lazy operation (we are not performing any actions), yet something is clearly happening: all the executors are reading from S3. We have tried mergeSchema=false and setting the schema explicitly via .schema(someSchema). Is there any way to speed this up?

B) When we declare the initial DataFrame scoped to the first partition column (val df = sqlContext.read.parquet("s3://bucket/data/column1=X")), all the work (reading the Parquet metadata) seems to be done by the driver, and no job is submitted to Spark.

Question: Why does (A) send the work to the executors while (B) does not?

The above is on EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-querying-parquet-data-partitioned-in-S3-tp28809.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
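For reference, here is a minimal sketch of the two workarounds mentioned above, written against the Spark 2.x SparkSession API. The schema fields are made up for illustration; only the bucket layout comes from the question. `mergeSchema` and `basePath` are standard options of the Parquet data source; supplying an explicit schema lets Spark skip footer-based schema inference on load, though partition discovery (listing S3 prefixes) can still take time on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType}

object PartitionedRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-read").getOrCreate()

    // Hypothetical schema for the dataset; providing it up front avoids
    // reading Parquet footers just to infer the schema.
    val someSchema = StructType(Seq(
      StructField("value", StringType),    // assumed data column
      StructField("column1", StringType),  // partition column
      StructField("column2", StringType)   // partition column
    ))

    // Case (A): read the whole dataset, with schema merging disabled and
    // the schema given explicitly.
    val df = spark.read
      .option("mergeSchema", "false")
      .schema(someSchema)
      .parquet("s3://bucket/data/")

    // Case (B) variant: read only one partition, but keep column1/column2
    // visible as partition columns by pointing basePath at the table root.
    val dfScoped = spark.read
      .option("basePath", "s3://bucket/data/")
      .schema(someSchema)
      .parquet("s3://bucket/data/column1=X")
  }
}
```

Without `basePath`, reading s3://bucket/data/column1=X directly drops column1 from the resulting schema, since Spark treats the given path as the table root.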