Hi All,

    When a Spark job is running, an executor on Node A may have some
partitions cached. Later, for another stage, the scheduler tries to assign
a task to Node A to process a cached partition (PROCESS_LOCAL). But
meanwhile Node A is busy with other tasks. The scheduler waits for the
spark.locality.wait interval, times out, and assigns the task to some other
Node B at NODE_LOCAL level. The executor on Node B then has to fetch the
cached partition from Node A, which adds network I/O to Node A plus some
extra CPU for that I/O. Eventually, every node can end up with a task
waiting to fetch a cached partition from Node A, so the Spark job / cluster
is effectively blocked on a single node.
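
For reference, the locality timeout described above is tunable (a possible
mitigation, not a fix for the underlying problem). These are standard Spark
settings, shown here as spark-defaults.conf entries:

```
# How long the scheduler waits before giving up on a more-local
# placement and falling back to the next locality level (default: 3s).
spark.locality.wait           3s

# Per-level overrides, if the fallback should be tuned separately:
spark.locality.wait.process   3s
spark.locality.wait.node      3s
```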

A Spark JIRA has been created: https://issues.apache.org/jira/browse/SPARK-13718

Beginning with Spark 1.2, Spark introduced the External Shuffle Service,
which lets executors fetch shuffle files from an external service instead
of from each other, offloading that work from the Spark executors.
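
For comparison, the External Shuffle Service is enabled with a standard
configuration flag (shown as a spark-defaults.conf entry; it also requires
the shuffle service to be running on each node, e.g. as a YARN auxiliary
service):

```
# Executors fetch shuffle files from the external service on each
# node rather than directly from other executors.
spark.shuffle.service.enabled   true
```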

We want to check whether a similar external service has been implemented
for serving cached partitions to other executors.


Thanks, Prabhu Joseph
