Hi,
I’m getting some seemingly invalid results when I collect an RDD. This
happens in both Spark 1.1.0 and 1.2.0, using Java 8 on Mac.
See the following code snippet:
JavaRDD<Thing> rdd = pairRDD.values();
rdd.foreach( e -> System.out.println( "RDD Foreach: " + e ) );
rdd.collect().forEach( e ->
It sounds a lot like your values are mutable classes and you are
mutating or reusing them somewhere? It might work until you actually
try to materialize them all and find many point to the same object.
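A minimal, Spark-free sketch of what this reuse bug looks like. The class and method names here are hypothetical, invented for illustration: a reader hands back the same mutable object for every record, so storing the references (as `collect()` effectively does) leaves every element pointing at one object holding the last value written.

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {
    // Mutable record, standing in for an Avro-deserialised data class.
    static class MutableRecord {
        int value;
    }

    // Simulates a reader that reuses one record object, as Hadoop InputFormats do.
    static List<MutableRecord> collectWithoutCopy(int n) {
        MutableRecord shared = new MutableRecord();
        List<MutableRecord> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            shared.value = i;  // "deserialise" the next record into the same buffer
            out.add(shared);   // buggy: every element is the same reference
        }
        return out;
    }

    public static void main(String[] args) {
        List<MutableRecord> collected = collectWithoutCopy(3);
        // All elements point at one shared object holding the last value written:
        System.out.println(collected.get(0).value + " " + collected.get(2).value); // prints "2 2"
    }
}
```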
On Thu, Dec 18, 2014 at 10:06 AM, Tristan Blakers tris...@blackfrog.org wrote:
I suspected the same thing, but because the underlying data classes are
deserialised by Avro, I think they have to be mutable: you need to provide
a no-args constructor with settable fields.
Nothing is being cached in my code anywhere, and this can be reproduced
using data directly out of the
Being mutable is fine; reusing and mutating the objects is the issue.
And yes, the objects you get back from Hadoop InputFormats are reused by
Hadoop. You should just map the objects to a clone before using them
anywhere you need them all to exist independently at once, like before a
collect().
(That
Recording the outcome here for posterity. Based on Sean’s advice, I’ve
confirmed that making defensive copies of records that will be collected
avoids this problem. It does seem like Avro is being a bit too aggressive
when deciding it’s safe to reuse an object for a new record.
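The defensive-copy fix can be sketched without Spark. This is a hypothetical, Spark-free stand-in for `rdd.map(r -> new Record(r)).collect()`: the copy must happen while iterating, before the reader overwrites its shared buffer, which is why a `map` to a clone works in Spark (all names below are invented for illustration; a real Avro class would use its generated copy machinery rather than this hand-written copy constructor).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class DefensiveCopyDemo {
    // Mutable record with a copy constructor; stands in for an Avro-generated class.
    static class Record {
        int value;
        Record() {}
        Record(Record other) { this.value = other.value; }
    }

    // Iterator that reuses one Record object, like a Hadoop InputFormat.
    static Iterator<Record> reusingIterator(int n) {
        Record buffer = new Record();
        return new Iterator<Record>() {
            int i = 0;
            public boolean hasNext() { return i < n; }
            public Record next() { buffer.value = i++; return buffer; }
        };
    }

    // The fix: clone each record as it streams past (rdd.map(r -> new Record(r))
    // in Spark terms), so the collected elements are independent objects.
    static List<Record> collectWithCopy(int n) {
        List<Record> out = new ArrayList<>();
        Iterator<Record> it = reusingIterator(n);
        while (it.hasNext()) {
            out.add(new Record(it.next())); // defensive copy at iteration time
        }
        return out;
    }

    public static void main(String[] args) {
        List<Record> collected = collectWithCopy(3);
        // Each element kept its own value instead of the last one written:
        System.out.println(collected.get(0).value + " " + collected.get(2).value); // prints "0 2"
    }
}
```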