We have data stored in S3, partitioned by several columns, following this hierarchy: s3://bucket/data/column1=X/column2=Y/parquet-files
We run a Spark job on an EMR cluster (1 master, 3 slaves) and noticed the following:

A) When we declare the initial DataFrame over the whole dataset (val df = sqlContext.read.parquet("s3://bucket/data/")), the driver splits the work into several tasks (259) that are performed by the executors, and we believe the driver gets back the Parquet metadata.

Question: the above takes about 25 minutes for our dataset. We believe it should be a lazy operation (we are not performing any actions), yet something is clearly happening: all the executors are reading from S3. We have tried mergeSchema=false and setting the schema explicitly via .schema(someSchema). Is there any way to speed this up?

B) When we declare the initial DataFrame scoped to the first partition column (val df = sqlContext.read.parquet("s3://bucket/data/column1=X")), all the work (getting the Parquet metadata) seems to be done by the driver, and no Spark job is submitted.

Question: why does (A) send the work to the executors while (B) does not?

The above is on EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0.

--
hivehome.com <http://www.hivehome.com>
Hive | London | Cambridge | Houston | Toronto
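For concreteness, here is a minimal sketch of the two reads described above, assuming Spark's standard mergeSchema and basePath Parquet options; the schema fields are hypothetical stand-ins for someSchema, not our real columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("read-partitioned").getOrCreate()

// Hypothetical schema: supplying it up front skips schema inference, and
// mergeSchema=false avoids reconciling footers across all Parquet files.
val someSchema = StructType(Seq(
  StructField("column1", StringType),
  StructField("column2", StringType),
  StructField("value", DoubleType) // placeholder data column
))

// (A) Read the whole partitioned table.
val df = spark.read
  .schema(someSchema)
  .option("mergeSchema", "false")
  .parquet("s3://bucket/data/")

// (B) Read a single partition, but keep column1 as a column by telling
// Spark where the root of the partitioned table is.
val dfX = spark.read
  .schema(someSchema)
  .option("basePath", "s3://bucket/data/")
  .parquet("s3://bucket/data/column1=X")
```

Without the basePath option, the (B) read treats s3://bucket/data/column1=X as the table root, so partition discovery never sees column1.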