When iterating over a HadoopRDD created using SparkContext.sequenceFile, I noticed that if I don't copy the key as below, every tuple in the RDD has the same value as the last one seen. Clearly the object is being recycled, so if I don't clone the object, I'm in trouble.
Say my sequence files have keys of type LongWritable:

    val hadoopRdd = sc.sequenceFile(..)
    val filteredRdd = hadoopRdd.filter(..)

Now if I run the below to print the first 10 keys as Longs, I see the same value printed 10 times:

    filteredRdd.take(10).foreach(t => println(t._1.get()))

But if I copy the key out first, it prints the 10 unique keys correctly:

    val hadoopRdd = sc.sequenceFile(..)
    val mappedRdd = hadoopRdd.map(t => (t._1.get(), t._2))
    val filteredRdd = mappedRdd.filter(..)
    filteredRdd.take(10).foreach(t => println(t._1))

When are users expected to make such copies of objects when performing RDD operations?

Ameet
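For anyone following along without a cluster handy, the behaviour described above can be reproduced in plain Scala, with no Spark or Hadoop required. This is a minimal sketch, assuming the essence of the problem is that the RecordReader hands out the same mutable object for every record; `LongHolder`, `records`, and `ReuseDemo` are hypothetical names for illustration, standing in for LongWritable and the underlying reader:

```scala
// Hypothetical stand-in for a mutable Writable such as LongWritable.
class LongHolder(var value: Long)

object ReuseDemo {
  // Iterator that, like a Hadoop RecordReader, reuses one mutable
  // object for every record it hands out, mutating it in place.
  def records(n: Int): Iterator[LongHolder] = {
    val holder = new LongHolder(0L)
    (1 to n).iterator.map { i => holder.value = i; holder }
  }

  def main(args: Array[String]): Unit = {
    // Buffering the raw objects: every list entry aliases the same
    // holder, so all ten entries show the last value seen.
    val raw = records(10).toList
    println(raw.map(_.value).distinct)   // a single value, not ten

    // Copying the primitive out as each record arrives (like the
    // t._1.get() in the map above) preserves each record's value.
    val copied = records(10).map(_.value).toList
    println(copied)                      // ten distinct values
  }
}
```

The `map(t => (t._1.get(), t._2))` workaround in the question is the same idea: extract an immutable value before any operation (like `take` or `collect`) buffers references to the reused object.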
