When iterating over a HadoopRDD created with SparkContext.sequenceFile, I
noticed that if I don't copy the key (as below), every tuple in the RDD
ends up with the same value as the last one read. The Writable object is
clearly being reused by the record reader, so if I don't clone it I'm in
trouble.
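To illustrate the pattern outside of Spark: the same surprise happens with any iterator that hands back one mutable holder object each time, the way a Hadoop RecordReader reuses its Writable key/value instances. (`Holder` below is just an illustrative stand-in for LongWritable, not anything from Hadoop.)

```scala
// No Spark needed: an iterator that reuses one mutable holder, the way a
// Hadoop RecordReader reuses its Writable key/value objects.
// `Holder` is an illustrative stand-in for LongWritable.
class Holder(var value: Long)

object ReuseDemo {
  def main(args: Array[String]): Unit = {
    val holder = new Holder(0L)
    // The iterator yields the SAME object each time, only mutating its field,
    // so the materialized list holds three references to one object.
    val collected = Iterator.tabulate(3) { i => holder.value = i; holder }.toList
    println(collected.map(_.value))  // List(2, 2, 2) -- every slot sees the last value
    // Copying the primitive out while iterating preserves each value.
    val copied = Iterator.tabulate(3) { i => holder.value = i; holder.value }.toList
    println(copied)                  // List(0, 1, 2)
  }
}
```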

Say my sequence files have keys of type LongWritable:

val hadoopRdd = sc.sequenceFile(..)
val filteredRdd = hadoopRdd.filter(..)

Now if I run the following to print the first 10 keys as Longs, I see the
same value printed 10 times:
filteredRdd.take(10).foreach(t => println(t._1.get()))

Now if I copy the key out first, the 10 unique keys print correctly:
val hadoopRdd = sc.sequenceFile(..)
val mappedRdd = hadoopRdd.map(t => (t._1.get(), t._2))
val filteredRdd = mappedRdd.filter(..)
filteredRdd.take(10).foreach(t => println(t._1))
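For what it's worth, the value Writable appears to be reused the same way, so my current workaround copies both sides before doing anything else with the RDD. A sketch of the pattern, assuming Text values (`loadCopied` and the path argument are just illustrative names, not anything from the Spark API):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext

// Sketch: extract immutable copies of both key and value immediately,
// before any filter, cache, collect, or shuffle can capture the reused
// Writable objects. Assumes LongWritable keys and Text values.
def loadCopied(sc: SparkContext, path: String) =
  sc.sequenceFile(path, classOf[LongWritable], classOf[Text])
    .map { case (k, v) => (k.get(), v.toString) }  // Long and String are immutable
```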

When are users expected to make such copies of objects when performing RDD
operations?

Ameet
