Kevin (Sangwoo) Kim wrote:
If there are not too many keys, you can do something like this:
val data = List(
  ("A", Set("1", "2", "3")),
  ("A", Set("1", "2", "4")),
  ("B", Set("1", "2", "3"))
)
val rdd = sc.parallelize(data)
rdd.persist()
rdd.filter(_._1 == "A").flatMap(_._2).distinct.count  // unique strings under "A": 4
rdd.filter(_._1 == "B").flatMap(_._2).distinct.count  // unique strings under "B": 3
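If there are many keys, a single pass that never merges whole Sets per key may scale better (an editor's sketch, not from the thread; on an RDD the chain would be flatMap, then distinct(), then countByKey()). The same logic is shown here on plain Scala collections so it runs without a SparkContext; `data` and `counts` are made-up names:

```scala
// Sketch: per-key unique counts without materializing one large Set per key.
// RDD equivalent (assumption, standard RDD API):
//   rdd.flatMap { case (k, s) => s.map(e => (k, e)) }.distinct().countByKey()
val data = List(
  ("A", Set("1", "2", "3")),
  ("A", Set("1", "2", "4")),
  ("B", Set("1", "2", "3"))
)

val counts: Map[String, Int] =
  data
    .flatMap { case (k, s) => s.map(e => (k, e)) } // explode each Set into (key, element) pairs
    .distinct                                      // deduplicate the pairs globally
    .groupBy(_._1)                                 // group the surviving pairs by key
    .map { case (k, pairs) => (k, pairs.size) }    // unique element count per key
```

Because only small (key, element) pairs are shuffled and deduplicated, no single large Set is ever built on one node.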
I want to compute over an RDD[(String, Set[String])] in which some of the
'Set[String]' values are very large.
--
val hoge: RDD[(String, Set[String])] = ...
val reduced = hoge.reduceByKey(_ ++ _) // builds very large Sets (shuffle read size: 7GB)
val counted = reduced.map { case (key, strSeq) => (key, strSeq.size) }
What I want is the unique count for each key. A plain map() or
countByKey() does not give a unique count, because duplicate strings are
likely to be counted more than once...
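To illustrate the double-counting concern (an editor's sketch on plain Scala collections, not part of the original message; `data` and `naive` are made-up names): flattening all elements and counting per key without deduplication inflates the count wherever a string appears in more than one of a key's Sets:

```scala
val data = List(
  ("A", Set("1", "2", "3")),
  ("A", Set("1", "2", "4")),
  ("B", Set("1", "2", "3"))
)

// Count all elements per key, with no deduplication step.
val naive: Map[String, Int] =
  data
    .flatMap { case (k, s) => s.map(e => (k, e)) } // all (key, element) pairs, duplicates kept
    .groupBy(_._1)
    .map { case (k, pairs) => (k, pairs.size) }
// "1" and "2" appear in both of A's Sets, so A is over-counted: 6 instead of 4.
```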