Hi, 
I'm running this piece of code in my program:

    smallRdd.join(largeRdd)
      .groupBy { case (id, (_, X(a, _, _))) => a }
      .map { case (a, iterable) => a -> iterable.size }
      .sortBy({ case (_, count) => count }, ascending = false)
      .take(k)

where, basically:
smallRdd is an RDD of (Long, Unit) with thousands of entries
(it was an RDD of Longs, but I needed a tuple for the join), and
largeRdd is an RDD of (Long, X), where X is a case class containing 3 Longs,
with millions of entries.
(Both RDDs have already been sorted by key.)

What I want is: out of the intersection of the two RDDs (by key, which is
the first Long in the tuple), find the top k values with the most appearances
of the same first field of X (a in this example).
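For reference, here is the same computation on plain Scala collections (no
Spark, made-up sample data; X mirrors the case class above), just to make the
intent concrete:

```scala
// Plain-collections sketch of the intended semantics, with hypothetical data.
case class X(a: Long, b: Long, c: Long)

val smallKeys = Set(1L, 2L)                  // the keys held in smallRdd
val large = Seq(                             // (Long, X), as in largeRdd
  1L -> X(10, 0, 0), 1L -> X(10, 0, 0), 2L -> X(10, 0, 0),
  3L -> X(30, 0, 0), 1L -> X(20, 0, 0))
val k = 1

val topK = large
  .filter { case (id, _) => smallKeys(id) }  // the join: keep intersecting keys
  .groupBy { case (_, x) => x.a }            // group on X's first field
  .map { case (a, hits) => a -> hits.size }  // count appearances of each a
  .toSeq
  .sortBy { case (_, count) => -count }      // most frequent first
  .take(k)
```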

This code works but takes way too long: it can take up to 10 minutes on RDDs
of sizes 20,000 and 8,000,000. I've been playing around with commenting out
different lines and rerunning, and I can't find a clear bottleneck; each line
seems to be quite costly.

I am wondering if anyone can think of a better way to do this. In particular:
* Should I use IndexedRDD for the join? Would it significantly improve the
join performance?
* I really don't like the fact that I'm sorting thousands of entries just to
get the top k, where usually k << smallRdd.count. Is there some kind of
select-kth I can do on RDDs (and then just filter out the larger elements)?
Or any other way to improve what's happening here?
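What I'm imagining is something like a bounded min-heap of size k instead of a
full sort (which, if I understand correctly, is the idea behind Spark's
RDD.top / takeOrdered, applied per partition). A plain-Scala sketch of that
idea, with a hypothetical topK helper:

```scala
import scala.collection.mutable

// Bounded top-k selection: keep a heap of at most k elements instead of
// sorting everything; O(n log k) rather than O(n log n).
def topK[T](items: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] = {
  // ord.reverse makes the queue's head the SMALLEST element w.r.t. ord
  val heap = mutable.PriorityQueue.empty[T](ord.reverse)
  items.foreach { item =>
    if (heap.size < k) heap.enqueue(item)
    else if (ord.gt(item, heap.head)) { heap.dequeue(); heap.enqueue(item) }
  }
  heap.toSeq.sorted(ord.reverse)             // largest first
}

val counts = Seq("a" -> 5, "b" -> 2, "c" -> 9, "d" -7 + 14)  // sample (key, count) pairs
val best = topK(counts.iterator, 2)(Ordering.by[(String, Int), Int](_._2))
```

In Spark itself, I suppose this would just be
rdd.top(k)(Ordering.by(_._2)) in place of the sortBy + take.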

thanks!







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-optimizing-some-spark-code-tp23006.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
