If you know the partition IDs, you can launch a job that runs tasks on only
those partitions by calling sc.runJob
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1686>.
For example, we do this in IndexedRDD
<https://github.com/amplab/spark-indexedrdd/blob/f0c42dcad1f49ce36140f0c1f7d2c3ed61ed373e/src/main/scala/edu/berkeley/cs/amplab/spark/indexedrdd/IndexedRDDLike.scala#L100>
to get particular keys without launching a task on every partition.
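A minimal sketch of that runJob overload, assuming a local SparkContext and an illustrative 10-partition RDD (the app name, RDD, and partition indices below are examples, not from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("runJob-sketch").setMaster("local[4]"))

    // An example RDD split across 10 partitions.
    val rdd = sc.parallelize(1 to 100, numSlices = 10)

    // Launch tasks only on partitions 4 and 5; the other eight
    // partitions are never touched.
    val results: Array[Seq[Int]] =
      sc.runJob(rdd, (iter: Iterator[Int]) => iter.toSeq, Seq(4, 5))

    results.foreach(r => println(r.mkString(", ")))

The returned array has one element per requested partition, so results(0) here holds the contents of partition 4.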

Ankur <http://www.ankurdave.com/>

On Sun, May 17, 2015 at 8:32 AM, mas <mas.ha...@gmail.com> wrote:

> I have distributed my RDD across, say, 10 nodes. I want to fetch the data
> that resides on a particular node, say "node 5". How can I achieve this?
> I have tried the mapPartitionsWithIndex function to filter the data of the
> corresponding node; however, it is pretty expensive.
>
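For contrast, a sketch of the mapPartitionsWithIndex approach described in the question, reusing the rdd from the sketch above (the target index is illustrative). It is expensive because a task is still launched on every partition, even though all but one return nothing:

    val targetPartition = 5  // illustrative partition index
    val onlyOnePartition = rdd.mapPartitionsWithIndex { (index, iter) =>
      if (index == targetPartition) iter else Iterator.empty
    }
    // Tasks run on all 10 partitions; 9 of them just return empty iterators.
    onlyOnePartition.collect()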
