Hi, I'm running this piece of code in my program:

    smallRdd.join(largeRdd)
      .groupBy { case (id, (_, X(a, _, _))) => a }
      .map { case (a, iterable) => a -> iterable.size }
      .sortBy({ case (_, count) => count }, ascending = false)
      .take(k)
where smallRdd is an RDD of (Long, Unit) with thousands of entries (it was an RDD of Longs, but I needed a tuple for the join), and largeRdd is an RDD of (Long, X), where X is a case class containing three Longs, with millions of entries. /Both RDDs have already been sorted by key./

What I want is, out of the intersection of the two RDDs (by key, the first Long in each tuple), to find the top k values that appear most often as the first field of X (a in this example).

This code works but takes way too long: it can take up to 10 minutes on RDDs of sizes 20,000 and 8,000,000. I've been playing around with commenting out different lines and rerunning, and I can't find a clear bottleneck; each line seems to be quite costly. I am wondering if anyone can think of a better way to do this:

* In particular, should I use IndexedRDD for the join? Would it significantly improve the join performance?
* I really don't like the fact that I'm sorting thousands of entries just to get the top k, where usually k << smallRdd.count. Is there some kind of select-kth I can do on RDDs (and then just filter out the bigger elements), or any other way to improve what's happening here?

Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-optimizing-some-spark-code-tp23006.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
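P.S. One restructuring I've been considering (an untested sketch, assuming smallRdd's keys fit in driver memory, which ~20,000 Longs should): broadcast the small key set so the join becomes a map-side filter, count with reduceByKey instead of groupBy, and use top(k) so only k elements are kept per partition instead of sorting everything:

```scala
// Broadcast the small key set; this replaces the shuffle-heavy join
// with a cheap filter over largeRdd (assumes smallRdd fits on the driver).
val keys = sc.broadcast(smallRdd.keys.collect().toSet)

val topK = largeRdd
  .filter { case (id, _) => keys.value.contains(id) } // keep only joined keys
  .map { case (_, X(a, _, _)) => a -> 1L }            // count occurrences of a
  .reduceByKey(_ + _)                                 // combines map-side, unlike groupBy
  .top(k)(Ordering.by { case (_, count) => count })   // bounded-size selection, no full sort
```

top(k) (and its ascending twin takeOrdered) uses a bounded priority queue per partition, so it avoids the full sortBy + take. No idea yet whether it beats IndexedRDD here.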