I have this 2-node cluster setup, where each node has 4 cores.

            MASTER
    (Worker-on-master)              (Worker-on-node1)

(slaves file contains: master, node1)
SPARK_WORKER_INSTANCES=1
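
For completeness, the driver connects to this standalone cluster roughly as below; the hostname `master` and the default port 7077 are assumptions about my setup:

    import org.apache.spark.{SparkConf, SparkContext}

    // Assumed standalone master URL for the 2-node setup above.
    val conf = new SparkConf()
      .setAppName("SparkPi")
      .setMaster("spark://master:7077")
    val spark = new SparkContext(conf)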


I am trying to understand Spark's parallelize behavior. The SparkPi example has 
this code:
    import scala.math.random

    val slices = 8
    val n = 100000 * slices
    // `spark` here is the SparkContext, as in the original SparkPi example
    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)


As per the documentation: "Spark will run one task for each slice of the cluster. 
Typically you want 2-4 slices for each CPU in your cluster". I set slices to 8, 
which means the working set will be divided among 8 tasks on the cluster; in 
turn, each worker node gets 4 tasks (one task per core).
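
As a sanity check on this (a minimal sketch, reusing `spark`, `n`, and `slices` 
from the example above), I believe the partition count can be read off the RDD 
directly:

    // Sketch: each partition (slice) should become one task per stage.
    val rdd = spark.parallelize(1 to n, slices)
    println("partitions = " + rdd.partitions.length)  // I expect this to print 8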

Questions:
   i)   Where can I see task-level details? Inside the executors page I don't see 
a task breakdown, so I can't see the effect of slices in the UI.
   ii)  How can I programmatically find the working set size for the map function 
above? I assume it is n/slices (100000 above); see the sketch after this list.
   iii) Are the multiple tasks run by an executor executed sequentially, or in 
parallel in multiple threads?
   iv)  What is the reasoning behind 2-4 slices per CPU?
   v)   I assume ideally we should tune SPARK_WORKER_INSTANCES to correspond to 
the number of cores per node?
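
For (ii), this is the kind of check I have in mind; it is only a sketch and 
assumes `spark`, `n`, and `slices` from the example above:

    // Sketch: count how many elements each task (partition) actually receives.
    val sizes = spark.parallelize(1 to n, slices)
      .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
      .collect()
    sizes.foreach { case (idx, size) =>
      println(s"partition $idx -> $size elements")
    }
    // With an even split I would expect each partition to hold
    // n / slices = 100000 elements.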

Bests,
-Monir
