Hi,

Our environment:
Spark: 2.2
Number of cores: 128 (all allocated to Spark)
Filesystem: Alluxio 1.6
Block size on Alluxio: 32 MB
Input1 size: 586 MB (150M records, a single int column), spread across 20 Parquet files, so roughly 29 MB per file on average
Input2 size: 50 MB (10M records, a single int column)
When we execute the same operation twice, Spark takes about 40% less time on the second run than on the first.
Our operation is as follows (a rough code sketch is below):
1. Read the 150M rows (spread across multiple Parquet files) into a DataFrame.
2. Read the 10M rows (spread across multiple Parquet files) into another DataFrame.
3. Intersect the two DataFrames.
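In code it looks roughly like this (the Alluxio paths are placeholders, and the timing wrapper is a simplified stand-in for how we measure the two runs):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-intersect")
  .getOrCreate()

// Placeholder Alluxio paths; the real locations are omitted here.
val input1 = spark.read.parquet("alluxio://master:19998/data/input1") // 150M rows, single int column
val input2 = spark.read.parquet("alluxio://master:19998/data/input2") // 10M rows, single int column

// One full execution of the operation; count() forces the job to run.
def runOnce(): Long = {
  val start = System.nanoTime()
  input1.intersect(input2).count()
  (System.nanoTime() - start) / 1000000 // elapsed milliseconds
}

println(s"first run:  ${runOnce()} ms")
println(s"second run: ${runOnce()} ms") // consistently ~40% faster for us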