Hi,

I’m getting what appear to be invalid results when I collect an RDD. This
happens in both Spark 1.1.0 and 1.2.0, using Java 8 on Mac.

See the following code snippet:

JavaRDD<Thing> rdd = pairRDD.values();
rdd.foreach(e -> System.out.println("RDD Foreach: " + e));
rdd.collect().forEach(e -> System.out.println("Collected Foreach: " + e));

I would expect the output of the two loops to be identical, but instead I
see:

RDD Foreach: Thing1
RDD Foreach: Thing2
RDD Foreach: Thing3
RDD Foreach: Thing4
(…snip…)
Collected Foreach: Thing1
Collected Foreach: Thing1
Collected Foreach: Thing1
Collected Foreach: Thing2

So essentially all but one of the valid entries are replaced by an
equivalent number of duplicates of a single object. I’ve tried various map
and filter operations, but the contents of the RDD always appear correct
until I call collect(). I’ve also found that calling cache() on the RDD
materialises the duplication, so that the RDD foreach then displays the
duplicates too...
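For what it’s worth, this is roughly how I’ve been checking whether the
collected entries are literally the same object rather than equal-but-distinct
copies (IdentityHashMap keys by reference identity; Thing here is just a
hypothetical stand-in for my real class, not the actual code):

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;

public class IdentityCheck {
    // Hypothetical stand-in for the real Thing class.
    static class Thing {
        final String name;
        Thing(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    // Counts how many distinct object references a list contains.
    // IdentityHashMap compares keys with ==, not equals(), so repeated
    // references to one instance collapse to a single entry.
    static <T> int distinctReferences(List<T> collected) {
        IdentityHashMap<T, Boolean> seen = new IdentityHashMap<>();
        for (T t : collected) {
            seen.put(t, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        Thing t = new Thing("Thing1");
        // Simulates the symptom: the same instance repeated in the list.
        List<Thing> suspicious = Arrays.asList(t, t, t);
        System.out.println("distinct refs: " + distinctReferences(suspicious));
    }
}
```

Running this sort of check against the real collect() output shows far
fewer distinct references than list entries, which is why I suspect object
reuse rather than genuinely duplicated data.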

Any suggestions for how I can go about debugging this would be massively
appreciated.

Cheers
Tristan
