Hi,

When we execute the same operation twice, Spark takes roughly 40% less time
on the second run than on the first.
Our operation is like this:
Read 150M rows (spread across multiple Parquet files) into a DataFrame.
Read 10M rows (spread across multiple Parquet files) into another DataFrame.
Do an intersect operation between the two.

Size of the 150M-row dataset: 587 MB
Size of the 10M-row dataset: 50 MB

If the first execution takes around 20 seconds, the next one takes just
10-12 seconds.
Is there a specific reason for this? Is there an optimization we can apply
to get the same speed on the first run?

Regards
Sanjeev



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
