For a HadoopRDD, the Spark scheduler first calculates the number of tasks
based on the input splits. Most people use this with HDFS data, in which
case the splits correspond to HDFS blocks. If the HDFS datanodes are
co-located with the Spark cluster, the scheduler will try to run each task
on a datanode that holds that task's input, to achieve higher throughput.
Otherwise, all nodes are considered equally fit to run any task, and Spark
simply load balances across them.
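
For example, in spark-shell (just a rough sketch, not tested -- the HDFS
path is a placeholder) you can look at both pieces of information from the
driver:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.TextInputFormat

  // hadoopFile builds a HadoopRDD directly; one partition (and hence one
  // task in the stage) is created per input split, i.e. per HDFS block here.
  val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](
    "hdfs:///data/example.txt")
  println("tasks in the stage: " + rdd.partitions.length)

  // preferredLocations exposes the datanodes holding each split; the
  // scheduler uses these as hints when executors run on those hosts.
  rdd.partitions.foreach { p =>
    println("partition " + p.index + " -> " +
      rdd.preferredLocations(p).mkString(", "))
  }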


On Sat, Apr 19, 2014 at 9:25 PM, David Thomas <dt5434...@gmail.com> wrote:

> During a Spark stage, how are tasks split among the workers? Specifically
> for a HadoopRDD, who determines which worker has to get which task?
>
