Hi, I’m getting some seemingly invalid results when I collect an RDD. This is happening in both Spark 1.1.0 and 1.2.0, using Java 8 on Mac.
See the following code snippet:

    JavaRDD<Thing> rdd = pairRDD.values();
    rdd.foreach(e -> System.out.println("RDD Foreach: " + e));
    rdd.collect().forEach(e -> System.out.println("Collected Foreach: " + e));

I would expect the output of the two statements to be identical, but instead I see:

    RDD Foreach: Thing1
    RDD Foreach: Thing2
    RDD Foreach: Thing3
    RDD Foreach: Thing4
    (…snip…)
    Collected Foreach: Thing1
    Collected Foreach: Thing1
    Collected Foreach: Thing1
    Collected Foreach: Thing2

So essentially all of the valid entries except one are replaced by an equal number of duplicate objects. I’ve tried various map and filter operations, and the contents of the RDD always appear correct until I call collect(). I’ve also found that calling cache() on the RDD materialises the duplication, so that the RDD foreach then prints the duplicates too.

Any suggestions for how I can go about debugging this would be massively appreciated.

Cheers,
Tristan
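P.S. In case it helps, here is a minimal standalone sketch (no Spark involved) of one pattern that can produce exactly this foreach-vs-collect discrepancy: a reader that mutates and returns the same record object on every call, in the style of Hadoop RecordReaders reusing Writables. The Thing class and reusingReader method here are hypothetical stand-ins, not Spark APIs. Printing each element while it is current looks correct, but collecting the references stores N pointers to one object, so every entry shows the last value written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Hypothetical mutable record type, standing in for "Thing".
    static class Thing {
        String name;
        Thing set(String n) { this.name = n; return this; }
        @Override public String toString() { return name; }
    }

    // Simulates a record reader that mutates and returns the SAME object
    // on every call to next(), rather than allocating a fresh one.
    static Iterator<Thing> reusingReader(List<String> names) {
        final Thing shared = new Thing();
        final Iterator<String> it = names.iterator();
        return new Iterator<Thing>() {
            public boolean hasNext() { return it.hasNext(); }
            public Thing next() { return shared.set(it.next()); }
        };
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Thing1", "Thing2", "Thing3");

        // "foreach"-style: each element is printed while it is the current
        // value of the shared object, so the output looks correct.
        reusingReader(names)
            .forEachRemaining(t -> System.out.println("Foreach: " + t));
        // -> Foreach: Thing1 / Thing2 / Thing3

        // "collect"-style: storing the references keeps three pointers to
        // the one shared object, so every entry shows its final value.
        List<Thing> collected = new ArrayList<>();
        reusingReader(names).forEachRemaining(collected::add);
        collected.forEach(t -> System.out.println("Collected: " + t));
        // -> Collected: Thing3 / Thing3 / Thing3
    }
}
```

If something like this is happening, making a defensive copy of each record (e.g. in a map before collect()) would be the usual workaround.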