It seems that ++ does the right thing on arrays of longs, and gives you back 
another primitive array:

scala> val a = Array[Long](1,2,3)
a: Array[Long] = Array(1, 2, 3)

scala> val b = Array[Long](1,2,3)
b: Array[Long] = Array(1, 2, 3)

scala> a ++ b
res0: Array[Long] = Array(1, 2, 3, 1, 2, 3)

scala> res0.getClass
res1: Class[_ <: Array[Long]] = class [J

The problem might be that a lot of intermediate space is allocated as you merge 
values two by two. In particular, if a key has N arrays mapping to it, your 
code will allocate O(N^2) space, because each reduce step copies both inputs 
into a fresh array: first one of size 1, then 2, then 3, etc. You can make this 
faster by using aggregateByKey instead, and using an intermediate data structure 
other than an Array to do the merging (ideally you'd find a growable 
ArrayBuffer-like class specialized for Longs, but you can also just try 
ArrayBuffer).

Matei



> On Oct 21, 2014, at 1:08 PM, Akshat Aranya <aara...@gmail.com> wrote:
> 
> This is as much of a Scala question as a Spark question
> 
> I have an RDD:
> 
> val rdd1: RDD[(Long, Array[Long])]
> 
> This RDD has duplicate keys that I can collapse like this:
> 
> val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) => a++b)
> 
> If I start with an Array of primitive longs in rdd1, will rdd2 also have 
> Arrays of primitive longs?  I suspect, based on my memory usage, that this is 
> not the case.
> 
> Also, would it be more efficient to do this:
> 
> val rdd1: RDD[(Long, ArrayBuffer[Long])]
> 
> and then
> 
> val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) => 
> a++b).mapValues(_.toArray)
> 

