This is as much of a Scala question as a Spark question. I have an RDD:
    val rdd1: RDD[(Long, Array[Long])]

This RDD has duplicate keys that I can collapse like so:

    val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b)

If I start with an Array of primitive longs in rdd1, will rdd2 also have Arrays of primitive longs? I suspect, based on my memory usage, that this is not the case.

Also, would it be more efficient to start with

    val rdd1: RDD[(Long, ArrayBuffer[Long])]

and then do

    val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b).mapValues(_.toArray)
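
For reference, here is a minimal, self-contained sketch of both variants that runs locally; the SparkContext setup, the sample data, and the getClass probe are made up purely for illustration and are not my real job:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable.ArrayBuffer

    object ArrayVsBufferDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("array-vs-buffer").setMaster("local[2]"))

        // Made-up sample data with duplicate keys.
        val pairs = Seq(1L -> Array(10L, 11L), 1L -> Array(12L), 2L -> Array(20L))

        // Variant 1: concatenate Array[Long] values directly.
        val rdd1: RDD[(Long, Array[Long])] = sc.parallelize(pairs)
        val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b)

        // An Array[Long] is a JVM long[]; this prints "class [J" if the merged
        // values come back as primitive long arrays.
        println(rdd2.first()._2.getClass)

        // Variant 2: accumulate into ArrayBuffer[Long], convert to Array at the end.
        val rdd1b: RDD[(Long, ArrayBuffer[Long])] =
          sc.parallelize(pairs.map { case (k, v) => (k, ArrayBuffer(v: _*)) })
        val rdd2b: RDD[(Long, Array[Long])] =
          rdd1b.reduceByKey((a, b) => a ++ b).mapValues(_.toArray)

        // toArray with a Long ClassTag should also yield a primitive long[].
        println(rdd2b.first()._2.getClass)

        sc.stop()
      }
    }

The getClass probe only shows the type of the final arrays, though; what I really care about is what the values look like during the shuffle and in the intermediate buffers, which is where I suspect the boxing (and the memory usage) comes from.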