For a HadoopRDD, the Spark scheduler first calculates the number of tasks from the input splits. Usually people use this with HDFS data, so in that case the split count comes from the HDFS blocks. If the HDFS datanodes are co-located with the Spark cluster, the scheduler will try to run each task on the datanode that holds that task's input split to achieve higher throughput. Otherwise, all of the nodes are considered equally fit to run any task, and Spark just load balances across them.
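If it helps, here's a minimal sketch (the HDFS path is a made-up example) that shows both points: the partition count of a HadoopRDD follows the input splits, and each partition reports the hosts holding its block as preferred locations, which is what the scheduler uses when placing tasks:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    object HadoopRddLocalitySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HadoopRddLocalitySketch"))

        // Build a HadoopRDD directly; with HDFS input there is one partition
        // per input split, and by default one split per HDFS block.
        val rdd = sc.hadoopFile("hdfs:///user/example/input",   // hypothetical path
          classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

        // Number of tasks the stage reading this RDD will get.
        println(s"partitions (tasks): ${rdd.partitions.length}")

        // Hosts holding each split's block; the scheduler prefers these nodes
        // when they also run Spark executors, otherwise any node will do.
        rdd.partitions.foreach { p =>
          println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
        }

        sc.stop()
      }
    }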
On Sat, Apr 19, 2014 at 9:25 PM, David Thomas <dt5434...@gmail.com> wrote:
> During a Spark stage, how are tasks split among the workers? Specifically
> for a HadoopRDD, who determines which worker has to get which task?