We have data stored in S3, partitioned by several columns. Say the layout
follows this hierarchy:
s3://bucket/data/column1=X/column2=Y/parquet-files
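
For reference, that layout is what a partitioned write produces; a minimal sketch,
assuming a placeholder DataFrame df with columns column1 and column2:

    // Hypothetical write that produces the layout above; df is a placeholder.
    df.write
      .partitionBy("column1", "column2")
      .parquet("s3://bucket/data/")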

We run a Spark job on an EMR cluster (1 master, 3 slaves) and noticed the
following:

A) - When we declare the initial DataFrame over the whole dataset (val df =
sqlContext.read.parquet("s3://bucket/data/")), the driver splits the work
into several tasks (259) that are executed by the executors, and we believe
the driver then collects the Parquet metadata from them.

Question: The above takes about 25 minutes for our dataset. We expected it to
be lazy (we are not performing any actions), yet something is clearly
happening: all the executors are reading from S3. We have tried
mergeSchema=false and setting the schema explicitly via .schema(someSchema),
roughly as in the sketch below. Is there any way to speed this up?
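
For reference, a minimal sketch of what we tried; someSchema below is only a
placeholder for our real schema:

    import org.apache.spark.sql.types._

    // Placeholder schema standing in for the real one.
    val someSchema = StructType(Seq(
      StructField("column1", StringType),
      StructField("column2", StringType),
      StructField("value", DoubleType)
    ))

    // Supply the schema up front and disable schema merging, hoping to avoid
    // Spark reading Parquet footers across the whole partition tree.
    val df = sqlContext.read
      .option("mergeSchema", "false")
      .schema(someSchema)
      .parquet("s3://bucket/data/")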

B) - When we declare the initial DataFrame scoped to the first partition
column (val df = sqlContext.read.parquet("s3://bucket/data/column1=X")),
all the work (reading the Parquet metadata) seems to be done by the driver,
and no Spark job is submitted.

Question: Why does (A) send the work to executors but (B) does not?
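
For completeness, two standard alternative ways to scope the read (a sketch
only, not something we have measured): the basePath option keeps column1 as a
column of the DataFrame, while the filter variant relies on partition pruning.

    import org.apache.spark.sql.functions.col

    // Alternative 1: scope to column1=X but keep column1 as a column by
    // telling Spark where the partitioned table root is.
    val dfScoped = sqlContext.read
      .option("basePath", "s3://bucket/data/")
      .parquet("s3://bucket/data/column1=X")

    // Alternative 2: read the whole table and rely on partition pruning, so
    // only the column1=X directories should actually be scanned.
    val dfPruned = sqlContext.read
      .parquet("s3://bucket/data/")
      .filter(col("column1") === "X")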

The above is for EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0.



