I don't want to simply replicate all cached blocks; I am trying to find a way to solve the issue I mentioned in the mail above. Keeping replicas of all cached blocks would add more cost for customers.
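For context, the only way to replicate cached blocks in Spark today is to cache with a replicated storage level, which duplicates every partition and roughly doubles the memory footprint. A minimal sketch (the input path is a placeholder; this needs a running Spark cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("cache-replication-sketch")
val sc   = new SparkContext(conf)

// Hypothetical input path for illustration only.
val rdd = sc.textFile("hdfs:///path/to/input")

// MEMORY_ONLY_2 keeps two in-memory copies of each cached partition on
// different executors. A task can then run PROCESS_LOCAL on either replica,
// but every block is duplicated, which is the extra cost mentioned above.
rdd.persist(StorageLevel.MEMORY_ONLY_2)
```

This is why replicating only the hot blocks (or serving them from an external service) is attractive: the all-or-nothing storage level pays the replication cost for cold partitions too.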
On Wed, Mar 9, 2016 at 9:50 AM, Reynold Xin <r...@databricks.com> wrote:
> You just want to be able to replicate hot cached blocks, right?
>
> On Tuesday, March 8, 2016, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>
>> Hi All,
>>
>> When a Spark job is running and one of the Spark executors on Node A
>> has some partitions cached, the scheduler may later, for some other stage,
>> try to assign a task to Node A to process a cached partition
>> (PROCESS_LOCAL). But meanwhile Node A is occupied with other tasks and is
>> busy. The scheduler waits for the spark.locality.wait interval, times out,
>> and tries to find some other node B which is NODE_LOCAL. The executor on
>> Node B then has to fetch the cached partition from Node A, which adds
>> network I/O to that node and also some extra CPU for the I/O. Eventually,
>> every node has a task waiting to fetch some cached partition from Node A,
>> and so the Spark job / cluster is basically blocked on a single node.
>>
>> A Spark JIRA has been created: https://issues.apache.org/jira/browse/SPARK-13718
>>
>> Beginning with Spark 1.2, Spark introduced the External Shuffle Service,
>> which lets executors fetch shuffle files from an external service instead
>> of from each other, offloading that work from the Spark executors.
>>
>> We want to check whether a similar external service has been implemented
>> for transferring cached partitions to other executors.
>>
>> Thanks,
>> Prabhu Joseph
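As a stopgap until something like SPARK-13718 exists, the cascade described above can be dampened by tuning the locality wait, so tasks fall back from PROCESS_LOCAL sooner instead of queuing behind the busy node. A sketch of the relevant settings in spark-defaults.conf (the 1s values are illustrative, not a recommendation; this trades locality for reduced head-of-line blocking):

```
# Shorter waits make the scheduler give up on the preferred (busy) node
# faster and run the task NODE_LOCAL or ANY elsewhere.
spark.locality.wait          1s
spark.locality.wait.process  1s
spark.locality.wait.node     1s
```

This does not remove the remote fetch cost from Node A; it only stops every task in the stage from stalling on the locality timeout first.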