Hi all,

*Query 1)*
I need some serious help! I'm running several types of feature engineering on a dataset and trying to benchmark it by tweaking different Spark properties. I can't work out why a single machine runs faster than a 3-worker cluster, even though most of the operations in the code are distributed. The timings I collected from the different runs are:

Remote Spark benchmarking (4-node cluster: 1 driver, 3 workers)
Cluster details: 12 GB RAM, 6 cores per node.
Medium data -> 100,000 samples (0.1 million rows) [data placed in the local file system; same path and same data on all worker nodes]

Runs (time taken for the feature engineering pipeline to finish):
1) 482.20 secs; --num-executors 3 --executor-cores 5 --executor-memory 4G
2) 467.38 secs; --num-executors 10 --executor-cores 6 --executor-memory 11G
3) 459.89 secs; --num-executors 3 --executor-cores 6 --executor-memory 8G
4) 476.62 secs; --num-executors 3 --executor-cores 5 --executor-memory 4G --conf spark.memory.fraction=0.2
5) 575.93 secs; --num-executors 3 --executor-cores 5 --executor-memory 4G --conf spark.default.parallelism=200

Medium data -> 100,000 samples (0.1 million rows) [data placed in the local file system]
1) 594.18 secs
2) 528.60 secs (on the single driver node [local])
3) 323.65 secs (on my laptop: 16 GB RAM, 8 cores)
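For reference, this is roughly how the cluster runs above are launched. It's only a sketch: the script name (pipeline.py) and data path are placeholders, and the spark.sql.shuffle.partitions value is just something I'm considering trying, since run 5 got slower with spark.default.parallelism=200 and a much smaller partition count (around 2x the 15 usable cores) might suit 0.1 million rows better:

```shell
# Sketch only: script name and data path are placeholders.
# --num-executors implies a YARN master; 3 executors x 5 cores leaves one
# core per worker node free for the OS and daemons.
spark-submit \
  --master yarn \
  --num-executors 3 \
  --executor-cores 5 \
  --executor-memory 4G \
  --conf spark.sql.shuffle.partitions=30 \
  pipeline.py file:///path/to/medium_data
```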
*Query 2)*
Below is the event timeline of the same code, taken from the Spark UI. Can anyone provide some insight into why there are two big gaps between the parallel tasks? Does it mean that no operation is happening during that time? I'm fairly new to Spark UI monitoring; can anyone suggest other aspects I should monitor to optimize further?

Thanks,
Aakash.
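P.S. To check whether those gaps are driver-side work (gaps between jobs in the event timeline generally mean the driver is busy between actions, not that the cluster is computing), I plan to time each step of the pipeline explicitly and match the printed durations against the UI. A minimal sketch; the stage labels and the commented-out calls are placeholders, not my actual pipeline:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print wall-clock time per step; a step whose time shows up here but
    # produces no tasks in the Spark UI is running on the driver, not the
    # cluster, and would explain a gap in the event timeline.
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.2f} secs")

# Placeholder stages -- substitute the real pipeline steps.
with timed("read + cache"):
    pass  # e.g. df = spark.read.csv(path).cache(); df.count()
with timed("feature engineering"):
    pass  # e.g. model = pipeline.fit(df); model.transform(df).count()
```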