Hi,

I’m looking for a way to compare subsets of an RDD intelligently.

Let's say I have an RDD with key/value pairs of type (Int -> T). I
eventually need to say "compare all the values of key 1 with all the values
of key 2, and compare the values of key 3 to the values of key 5 and key
7". How would I go about doing this efficiently?

The way I’m currently thinking of doing it is by creating a List of
filtered RDDs and then using RDD.cartesian()


import org.apache.spark.rdd.RDD

// r: RDD[(Int, T)] is the full data set
def filterSubset[T](b: Int, r: RDD[(Int, T)]): RDD[(Int, T)] =
  r.filter { case (key, _) => key == b }

val keyPairs: Seq[(Int, Int)] = Seq((1, 2), (3, 5), (3, 7)) // all the key pairs to compare

val rddPairs = keyPairs.map {
  case (a, b) =>
    filterSubset(a, r).cartesian(filterSubset(b, r))
}

rddPairs.map { /* whatever I want to compare… */ }



I would then iterate the list and perform a map on each of the RDDs of
pairs to gather the relational data that I need.
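
To make that concrete, something along these lines (compare is just a
placeholder for whatever pairwise check I end up needing):

// rddPairs comes from the snippet above; each element is an
// RDD[((Int, T), (Int, T))] produced by the cartesian product.
val results = rddPairs.map { pairRdd =>
  pairRdd.map { case ((_, v1), (_, v2)) =>
    compare(v1, v2) // hypothetical comparison function
  }.collect()
}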



What I can't tell about this idea is whether it would be extremely
inefficient to set up possibly hundreds of map jobs and then iterate
through them. In this case, would Spark's lazy evaluation optimize the
data shuffling between all of the maps? If not, can someone please
recommend a more efficient way to approach this issue?
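
To illustrate the scale I'm worried about: each pair RDD is built from two
filters over r, so running even a trivial action per pair launches its own
job and, as far as I understand, re-scans r every time unless it is cached.
Rough sketch:

r.cache() // I assume I'd at least cache r so each filterSubset doesn't re-read the source

rddPairs.foreach { pairRdd =>
  // count() is just a stand-in action; with hundreds of key pairs this
  // means hundreds of separate Spark jobs, one per pair RDD.
  println(pairRdd.count())
}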


Thank you for your help
