[Spark-sql]: DF parquet read write multiple tasks

2018-04-02 Thread snjv
Spark: 2.2
Number of cores: 128 (all allocated to Spark)
Filesystem: Alluxio 1.6
Block size on Alluxio: 32MB
Input1 size: 586MB (150M records with only 1 column as int)
Input2 size: 50MB (10M records with only 1 column as int)
Input1 is spread across 20 parquet files; each file size is

[Spark sql]: Re-execution of same operation takes less time than 1st

2018-04-02 Thread snjv
Hi,
When we execute the same operation twice, Spark takes less time (~40%) on the second run than on the first. Our operation is like this:
- Read 150M rows (spread across multiple parquet files) into a DF.
- Read 10M rows (spread across multiple parquet files) into another DF.
- Do an intersect operation.
Size of 150M row file:
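The operation described in the post can be sketched in Spark (Scala) roughly as follows; the Alluxio paths and application name are placeholders, not taken from the original message:

```scala
import org.apache.spark.sql.SparkSession

object IntersectExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-intersect") // hypothetical app name
      .getOrCreate()

    // Hypothetical Alluxio paths; the post does not give the actual locations.
    val big   = spark.read.parquet("alluxio://master:19998/data/input1") // ~150M rows
    val small = spark.read.parquet("alluxio://master:19998/data/input2") // ~10M rows

    // Spark's optimizer rewrites Intersect as a distinct + left-semi join,
    // so both inputs are read and shuffled on every run. Spark itself does
    // not reuse results across jobs unless .cache()/.persist() is called;
    // the faster second run is therefore likely explained by warmed caches
    // (OS page cache, Alluxio, JVM JIT) rather than by Spark re-using output.
    val common = big.intersect(small)
    println(common.count())

    spark.stop()
  }
}
```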