You can use aggregateByKey as one option:

    val input: RDD[(Int, String)] = ...
    val test = input.aggregateByKey(ListBuffer.empty[String])((a, b) => a += b, (a, b) => a ++ b)

From: Jianguo Li
Date: Monday, June 22, 2015 at 5:12 PM
To: user@spark.apache.org
Subject: workaround for groupByKey

Hi,

I am processing an RDD of key-value pairs. The key is a user_id, and the value is a website url the user has visited. Since I need to know all the urls each user has visited, I am tempted to call groupByKey on this RDD. However, since there could be millions of users and urls, the shuffle caused by groupByKey proves to be a major bottleneck. Is there any workaround? I want to end up with an RDD of key-value pairs, where the key is a user_id and the value is a list of all the urls visited by that user.

Thanks,

Jianguo
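To see why the two functions passed to aggregateByKey avoid shuffling raw values the way groupByKey does, here is a minimal plain-Scala sketch (no Spark required) of the same semantics: the first function (seqOp) folds each value into a per-partition accumulator before the shuffle, and the second (combOp) merges accumulators from different partitions afterwards. The object and helper names are hypothetical, for illustration only.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical illustration of aggregateByKey's two-phase merge; not Spark code.
object AggregateSketch {
  // seqOp: fold one value into a partition-local accumulator (mutates in place)
  def seqOp(acc: ListBuffer[String], url: String): ListBuffer[String] = acc += url

  // combOp: merge two partition-local accumulators after the shuffle
  def combOp(a: ListBuffer[String], b: ListBuffer[String]): ListBuffer[String] = a ++ b

  // Simulate aggregateByKey over a list of "partitions" of (user_id, url) pairs
  def groupUrls(partitions: Seq[Seq[(Int, String)]]): Map[Int, List[String]] = {
    // Phase 1: within each partition, combine values per key with seqOp,
    // so only one (key, accumulator) pair per key leaves each partition
    val perPartition = partitions.map { part =>
      part.foldLeft(Map.empty[Int, ListBuffer[String]]) { case (m, (k, url)) =>
        m.updated(k, seqOp(m.getOrElse(k, ListBuffer.empty[String]), url))
      }
    }
    // Phase 2: across partitions, merge the accumulators per key with combOp
    perPartition.flatten
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).reduce(combOp).toList }
  }
}
```

Usage, with two hypothetical partitions:

    val urls = AggregateSketch.groupUrls(Seq(
      Seq((1, "a.com"), (2, "b.com"), (1, "c.com")),
      Seq((2, "d.com"))
    ))
    // urls(1) == List("a.com", "c.com"); urls(2) == List("b.com", "d.com")

The pre-shuffle fold in phase 1 is what makes aggregateByKey (like reduceByKey) cheaper than groupByKey, which ships every individual value across the network.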