Incorrect results when calling collect() ?

2014-12-18 Thread Tristan Blakers
Hi, I’m getting some seemingly invalid results when I collect an RDD. This is happening in both Spark 1.1.0 and 1.2.0, using Java 8 on Mac. See the following code snippet: JavaRDD<Thing> rdd = pairRDD.values(); rdd.foreach( e -> System.out.println( "RDD Foreach: " + e ) ); rdd.collect().forEach( e ->

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Sean Owen
It sounds a lot like your values are mutable classes and you are mutating or reusing them somewhere? It might work until you actually try to materialize them all and find many point to the same object. On Thu, Dec 18, 2014 at 10:06 AM, Tristan Blakers tris...@blackfrog.org wrote: Hi, I’m
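
The aliasing effect described above can be reproduced without Spark. The following is a minimal plain-Java sketch, not the poster's code: the `Thing` class and the `readAllSharingOneInstance` helper are hypothetical stand-ins for a data class and for a reader that refills a single instance for every record.

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {
    // Hypothetical mutable record, standing in for the poster's data class.
    static final class Thing {
        int id;

        @Override
        public String toString() {
            return "Thing(" + id + ")";
        }
    }

    // Hypothetical reader that refills ONE shared instance for every record
    // and hands out the same reference each time.
    static List<Thing> readAllSharingOneInstance(int n) {
        Thing reused = new Thing();
        List<Thing> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            reused.id = i;
            // Streaming over the records looks correct: each value is
            // printed before the next record overwrites it.
            System.out.println("Foreach: " + reused);
            // Materializing stores the same reference again.
            out.add(reused);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Thing> collected = readAllSharingOneInstance(3);
        // Every element is the same object, so all show the last record.
        collected.forEach(e -> System.out.println("Collected: " + e));
    }
}
```

Run as written, the "Foreach" lines show ids 0, 1 and 2, while all three "Collected" lines show Thing(2), because the list holds three references to one object.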

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Tristan Blakers
Suspected the same thing, but because the underlying data classes are deserialised by Avro, I think they have to be mutable, as you need to provide the no-args constructor with settable fields. Nothing is being cached in my code anywhere, and this can be reproduced using data directly out of the

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Sean Owen
Being mutable is fine; reusing and mutating the objects is the issue. And yes, the objects you get back from Hadoop are reused by Hadoop InputFormats. You should just map the objects to a clone before using them where you need them all to exist independently at once, such as before a collect(). (That
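
The "map to a clone" advice can be sketched in plain Java as follows. This is an illustration under assumptions, not Spark or Avro API usage: `Thing`, its copy constructor, and `readAllWithCopies` are hypothetical names. In a Spark job the copy would presumably sit in a map step ahead of collect(), with the copy made however the real record class allows.

```java
import java.util.ArrayList;
import java.util.List;

public class CopyBeforeCollect {
    // Hypothetical mutable record with a copy constructor.
    static final class Thing {
        int id;

        Thing() {}

        // Defensive copy: a new, independent object with the same field values.
        Thing(Thing other) {
            this.id = other.id;
        }

        @Override
        public String toString() {
            return "Thing(" + id + ")";
        }
    }

    // Hypothetical reader that still reuses one instance internally,
    // but copies each record before it is stored.
    static List<Thing> readAllWithCopies(int n) {
        Thing reused = new Thing();
        List<Thing> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            reused.id = i;
            out.add(new Thing(reused)); // store a copy, not the shared reference
        }
        return out;
    }

    public static void main(String[] args) {
        readAllWithCopies(3).forEach(e -> System.out.println("Collected: " + e));
    }
}
```

Each stored element is now its own object, so later overwrites of the reused instance no longer affect what was already collected. The cost is one extra allocation per record, which is why the copy is only worth making where the records must all exist at once.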

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Tristan Blakers
Noting the outcome here for the record. Based on Sean’s advice, I’ve confirmed that making defensive copies of records that will be collected avoids this problem. It does seem like Avro is being a bit too aggressive when deciding it’s safe to reuse an object for a new record. On 18 December