Hi all,

*Query 1)*

Need some serious help! I'm running feature engineering of different types on
a dataset and trying to benchmark it by tweaking different Spark
properties.

I can't figure out why a single machine is running faster than a 3-node
cluster, even though most of the operations in the code are distributed.

The timings I collected from the different runs are:

Remote Spark benchmarking (4-node cluster: 1 driver, 3 workers) -

Cluster details: 12 GB RAM, 6 cores per node.

Medium data -> 100,000 samples (0.1 million rows) [Data placed on the local
file system; same path, same data on all worker nodes]
Runs -
1) Time Taken for the Feature Engineering Pipeline to finish:
482.20375990867615 secs.; --num-executors 3 --executor-cores 5
--executor-memory 4G
2) Time Taken for the Feature Engineering Pipeline to finish:
467.3759717941284 secs.; --num-executors 10 --executor-cores 6
--executor-memory 11G
3) Time Taken for the Feature Engineering Pipeline to finish:
459.885710477829 secs.; --num-executors 3 --executor-cores 6
--executor-memory 8G
4) Time Taken for the Feature Engineering Pipeline to finish:
476.61902809143066 secs.; --num-executors 3 --executor-cores 5
--executor-memory 4G --conf spark.memory.fraction=0.2
5) Time Taken for the Feature Engineering Pipeline to finish:
575.9314386844635 secs.; --num-executors 3 --executor-cores 5
--executor-memory 4G --conf spark.default.parallelism=200
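For reference, a full launch command matching run 1's settings might look like the sketch below. Only the three resource flags come from the runs above; the master URL placeholder and the application file name `feature_engineering.py` are assumptions, not taken from the post.

```shell
# Hypothetical spark-submit invocation reconstructing run 1's resource
# settings; <driver-host> and the script name are placeholders.
SUBMIT_CMD="spark-submit \
  --master spark://<driver-host>:7077 \
  --num-executors 3 \
  --executor-cores 5 \
  --executor-memory 4G \
  feature_engineering.py"
echo "$SUBMIT_CMD"
```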

Medium data -> 100,000 samples (0.1 million rows) [Data placed on the local
file system]
1) Time Taken for the Feature Engineering Pipeline to finish:
594.1818737983704 secs.
2) Time Taken for the Feature Engineering Pipeline to finish:
528.6015181541443 secs. (on single driver node [local])
3) Time Taken for the Feature Engineering Pipeline to finish:
323.6546362755467 secs. (on my laptop - 16 GB RAM, 8 cores).
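For reproducibility, the wall-clock timings in the logs above can be captured with a small harness like this; `pipeline_fn` is a hypothetical stand-in for the actual feature-engineering entry point, which isn't shown here.

```python
import time

def timed_run(pipeline_fn, *args, **kwargs):
    """Run a pipeline callable and print its wall-clock time in the
    same format as the log lines above. `pipeline_fn` is a hypothetical
    placeholder; substitute the real entry point."""
    start = time.perf_counter()
    result = pipeline_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Time Taken for the Feature Engineering Pipeline to finish: "
          f"{elapsed} secs.")
    return result
```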

*Query 2)*

Below is the event timeline for the same code, taken from the Spark UI.
Can you provide some insight into why there are two big gaps between the
parallel tasks? Does it mean that no operation is happening during that
time? I'm fairly new to Spark UI monitoring - can anyone suggest other
aspects I should monitor to optimize further?

[Spark UI event timeline screenshot]
Thanks,
Aakash.
