Hi, I have a problem that is easy to solve in plain Scala code, but I cannot
work out how to take the top N from an RDD and get the result back as an RDD.


There are 10000 StudentScore records. The task is to take the top 10 ages,
and then take the top 10 from each of those ages, so the result has 100
entries (10 x 10).

The Scala code is below, but how can I do the same with an RDD? *RDD.take
returns an Array, whereas the other RDD operations return an RDD.*

example Scala code:

import scala.util.Random

case class StudentScore(age: Int, num: Int, score: Int, name: Int)

val scores = for {
  i <- 1 to 10000
} yield {
  StudentScore(Random.nextInt(100), Random.nextInt(100),
    Random.nextInt(), Random.nextInt())
}


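def takeTop(scores: Seq[StudentScore],
            byKey: StudentScore => Int): Seq[(Int, Seq[StudentScore])] = {
  // Group by the key, then pair each group with the sum of its scores.
  // toSeq comes before map so that groups with equal sums are not merged.
  val groupedScore = scores.groupBy(byKey).toSeq
    .map { case (_, groupScores) =>
      (groupScores.foldLeft(0)((acc, v) => acc + v.score), groupScores)
    }
  // Sort by total score, highest first, and keep the top 10 groups.
  groupedScore.sortBy(_._1)(Ordering[Int].reverse).take(10)
}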

val topScores = for {
  (_, ageScores) <- takeTop(scores, _.age)
  (_, numScores) <- takeTop(ageScores, _.num)
} yield {
  numScores
}

topScores.size
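
To make the RDD.take problem concrete, here is a minimal sketch of what I mean on the Spark side. It is only an illustration under assumptions: it runs with a local master, uses plain Int values instead of StudentScore, and the object name and app name are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical object name, for illustration only.
object TakeReturnsArray {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to keep the sketch runnable.
    val conf = new SparkConf()
      .setAppName("take-returns-array")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val scoreRdd: RDD[Int] = sc.parallelize(1 to 10000)

      // A transformation such as map stays distributed: the result is an RDD.
      val doubled: RDD[Int] = scoreRdd.map(_ * 2)

      // take is an action: it returns a local Array on the driver.
      val firstTen: Array[Int] = doubled.take(10)

      // From here on only Scala collection methods apply, not RDD ones.
      println(firstTen.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

So after the first take I am holding an Array, and the second step (top 10 within each age) can no longer be expressed as RDD operations.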


-- 

~Yours, Xuefeng Wu/吴雪峰  敬上
