It depends on how big subsetids is.
If subsetids is small, you can just collect it to the driver, broadcast it
as a set, and filter exactly as you stated.
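A minimal sketch of that broadcast approach, assuming sc is your
SparkContext and that MyObject exposes a getId: Long accessor (both taken
from your snippets below):

```scala
// Collect the (small) id RDD to the driver and broadcast it as a Set[Long].
val idSet = sc.broadcast(subsetids.collect().toSet)

// Filter on the executors using the broadcast set; only one copy of the
// set is shipped per executor rather than per task.
val subsetObjs = allobjects.filter(x => idSet.value.contains(x.getId))
```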
If subsetids is too big for that, you need to join it with the second RDD:
    subsetids.map(i => (i, i))   // transform ids to key/value form for use with join
      .join(subsetObjs)          // keep only the objects with matching ids; note
                                 // subsetObjs should be in the (id, object) form
      .map(_._2)                 // get back to the original subsetObjs form of (id, object)
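Putting it together against the allobjects RDD from your question, a sketch
(again assuming MyObject has a getId: Long accessor) that ends with a plain
RDD[MyObject] rather than keyed pairs:

```scala
// Key the objects by id so they can be joined.
val objsById = allobjects.map(o => (o.getId, o))   // (id, object) form

val subsetObjs = subsetids.map(i => (i, i))        // key the ids by themselves
  .join(objsById)                                  // inner join: keeps only matching ids
  .map(_._2._2)                                    // unwrap (id, (id, object)) -> object
```

The join here is an inner join, so ids in subsetids with no matching object
are silently dropped.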
I hope this helps.
-Nathan
On Wed, Feb 19, 2014 at 5:23 PM, Soumya Simanta <[email protected]> wrote:
> I've an RDD that contains ids (Long).
>
> subsetids
>
> res22: org.apache.spark.rdd.RDD[Long]
>
>
> I've another RDD that has an Object (MyObject) where one of the fields is
> an id (Long).
>
> allobjects
>
> res23: org.apache.spark.rdd.RDD[MyObject] = MappedRDD[272]
>
> Now I want to run filter on allobjects so that I can get the subset that
> matches the ids present in my first RDD (i.e., subsetids)
>
> Say something like -
>
> val subsetObjs = allobjects.filter( x => subsetids.contains(x.getId) )
>
> However, there is no "contains" method, so I'm looking for the most
> efficient way to achieve this in Spark.
>
> Thanks.
>
>
>
>
--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: [email protected]