We have data stored in S3, partitioned by several columns, following this hierarchy: s3://bucket/data/column1=X/column2=Y/parquet-files
We run a Spark job on an EMR cluster (1 master, 3 slaves) and noticed the following:

A) When we declare the initial DataFrame over the whole dataset (val df = sqlContext.read.parquet("s3://bucket/data/")), the driver splits the work into several tasks (259) that are performed by the executors, and we believe the driver gets back the Parquet metadata.

Question: the above takes about 25 minutes for our dataset. We believe it should be a lazy operation (we are not performing any actions), yet something is clearly happening: all the executors are reading from S3. We have tried mergeSchema=false and setting the schema explicitly via .schema(someSchema). Is there any way to speed this up?

B) When we declare the initial DataFrame scoped to the first partition column (val df = sqlContext.read.parquet("s3://bucket/data/column1=X")), all the work (getting the Parquet metadata) seems to be done by the driver, and no Spark job is submitted.

Question: why does (A) send the work to the executors while (B) does not?

The above is on EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0.

--
hivehome.com <http://www.hivehome.com>
Hive | London | Cambridge | Houston | Toronto
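For concreteness, here is a minimal sketch of the two reads described above, assuming Spark's standard mergeSchema and basePath Parquet options; the schema fields are hypothetical stand-ins for someSchema, not our real columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("read-partitioned").getOrCreate()

// Hypothetical schema: supplying it up front skips schema inference, and
// mergeSchema=false avoids reconciling footers across all Parquet files.
val someSchema = StructType(Seq(
  StructField("column1", StringType),
  StructField("column2", StringType),
  StructField("value", DoubleType) // placeholder data column
))

// (A) Read the whole partitioned table.
val df = spark.read
  .schema(someSchema)
  .option("mergeSchema", "false")
  .parquet("s3://bucket/data/")

// (B) Read a single partition, but keep column1 as a column by telling
// Spark where the root of the partitioned table is.
val dfX = spark.read
  .schema(someSchema)
  .option("basePath", "s3://bucket/data/")
  .parquet("s3://bucket/data/column1=X")
```

Without the basePath option, the (B) read treats s3://bucket/data/column1=X as the table root, so partition discovery never sees column1.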