If you know the partition IDs, you can launch a job that runs tasks on only those partitions by calling sc.runJob <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1686>. For example, we do this in IndexedRDD <https://github.com/amplab/spark-indexedrdd/blob/f0c42dcad1f49ce36140f0c1f7d2c3ed61ed373e/src/main/scala/edu/berkeley/cs/amplab/spark/indexedrdd/IndexedRDDLike.scala#L100> to get particular keys without launching a task on every partition.
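To illustrate, here is a minimal sketch of the sc.runJob approach (the partition IDs 3 and 5 and the app name are just placeholders for this example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RunJobOnPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("runJob-example").setMaster("local[4]"))

    // An RDD split into 10 partitions.
    val rdd = sc.parallelize(1 to 100, numSlices = 10)

    // Launch tasks on partitions 3 and 5 only; no tasks run on the
    // other 8 partitions. runJob returns one result per partition.
    val results: Array[Seq[Int]] =
      sc.runJob(rdd, (iter: Iterator[Int]) => iter.toSeq, Seq(3, 5))

    results.foreach(part => println(part.mkString(", ")))
    sc.stop()
  }
}
```

This avoids the full scan you would pay with a filter over mapPartitionsWithIndex, since the scheduler never creates tasks for the partitions you don't ask for.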
Ankur <http://www.ankurdave.com/>

On Sun, May 17, 2015 at 8:32 AM, mas <mas.ha...@gmail.com> wrote:
> I have distributed my RDD across, say, 10 nodes. I want to fetch the data
> that resides on a particular node, say "node 5". How can I achieve this?
> I have tried the mapPartitionsWithIndex function to filter the data of the
> corresponding node, but it is pretty expensive.