You just want to be able to replicate hot cached blocks, right?

On Tuesday, March 8, 2016, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
> Hi All,
>
> When a Spark job is running, one of the Spark executors on node A has some
> partitions cached. Later, for some other stage, the scheduler tries to
> assign a task to node A to process a cached partition (PROCESS_LOCAL). But
> meanwhile node A is occupied with other tasks and is busy. The scheduler
> waits for the spark.locality.wait interval, times out, and finds some
> other node B that is NODE_LOCAL. The executor on node B then fetches the
> cached partition from node A, which adds network I/O to that node and also
> some extra CPU for the I/O. Eventually, every node has a task waiting to
> fetch some cached partition from node A, and so the Spark job / cluster is
> effectively blocked on a single node.
>
> A Spark JIRA has been created: https://issues.apache.org/jira/browse/SPARK-13718
>
> Beginning with Spark 1.2, Spark introduced the External Shuffle Service,
> which enables executors to fetch shuffle files from an external service
> instead of from each other, offloading that work from the Spark executors.
>
> We want to check whether a similar external service exists for
> transferring cached partitions to other executors.
>
> Thanks,
> Prabhu Joseph
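For what it's worth, replicating hot cached blocks is already possible today without any external service, via the replicated storage levels (MEMORY_ONLY_2, MEMORY_AND_DISK_2), which keep each cached block on two executors so a NODE_LOCAL task on node B can read a local replica instead of pulling from a busy node A. A minimal sketch (the input path is a placeholder, and `sc` is assumed to be an existing SparkContext in a spark-shell session):

```
import org.apache.spark.storage.StorageLevel

// Placeholder dataset; in practice this would be the hot RDD that many
// downstream stages read repeatedly.
val hot = sc.textFile("hdfs:///data/hot-input")

// MEMORY_ONLY_2 stores each cached block on two executors, so the
// scheduler has two PROCESS_LOCAL candidates per partition instead of one.
hot.persist(StorageLevel.MEMORY_ONLY_2)

// Force an action to materialize the replicated cache.
hot.count()
```

This doubles the memory footprint of the cache, so it only pays off for genuinely hot data. Separately, raising spark.locality.wait (e.g. `--conf spark.locality.wait=10s` on spark-submit) makes the scheduler wait longer for the PROCESS_LOCAL slot on node A before falling back to NODE_LOCAL, which mitigates the remote-fetch storm at the cost of scheduling latency.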