Hi,

When we execute the same operation twice, Spark takes about 40% less time on the second run than on the first. The operation is:

1. Read 150M rows (spread across multiple parquet files) into a DataFrame.
2. Read 10M rows (spread across multiple parquet files) into another DataFrame.
3. Intersect the two DataFrames.
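For reference, the steps above can be sketched in Scala roughly like this (the paths and app name are placeholders, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("intersect-example").getOrCreate()

// Hypothetical paths; substitute the real parquet locations.
val largeDF = spark.read.parquet("/data/large")  // ~150M rows
val smallDF = spark.read.parquet("/data/small")  // ~10M rows

// intersect() returns the rows present in both DataFrames (set semantics,
// duplicates removed), which requires a shuffle of both inputs.
val common = largeDF.intersect(smallDF)
common.count()  // the action that triggers the job
```

One thing to note: if a DataFrame is reused across several actions, calling `largeDF.persist()` (or `.cache()`) before the first action keeps the data in executor memory, so later actions skip re-reading and re-decoding the parquet files.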
The 150M-row dataset is 587 MB on disk; the 10M-row dataset is 50 MB. If the first execution takes around 20 seconds, the next one takes only 10-12 seconds. Is there a specific reason for this? Is there any optimization we can apply so that the first run is faster as well?

Regards,
Sanjeev