Being mutable is fine; reusing and mutating the objects is the issue. And yes, the objects you get back from Hadoop are reused by Hadoop InputFormats. You should map each object to a clone before any operation that needs them all to exist independently at once, like a collect().
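Continuing from your snippet, something roughly like this (untested sketch; I'm assuming Thing has a copy constructor or some other deep-copy method, so substitute whatever your Avro-generated classes actually provide):

// The Hadoop record reader reuses its record objects, so copy each
// element before anything that needs all of them to exist at once,
// such as collect() or cache():
JavaRDD<Thing> copied = pairRDD.values().map( t -> new Thing(t) );
List<Thing> results = copied.collect();  // each element is now an independent object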
(That said... generally speaking collect() involves copying from workers to
the driver, which necessarily means a copy anyway. I suspect this isn't
working that way for you since you're running it all locally?)

On Thu, Dec 18, 2014 at 10:42 AM, Tristan Blakers <tris...@blackfrog.org> wrote:
> Suspected the same thing, but because the underlying data classes are
> deserialised by Avro I think they have to be mutable, as you need to
> provide the no-args constructor with settable fields.
>
> Nothing is being cached in my code anywhere, and this can be reproduced
> using data directly out of the newAPIHadoopRDD() call. Debugs added to the
> constructors of the various classes show that the right number are being
> constructed, though the watches set on some of the fields aren’t always
> triggering, so I suspect maybe the serialisation is doing something a bit
> too clever?
>
> Tristan
>
> On 18 December 2014 at 21:25, Sean Owen <so...@cloudera.com> wrote:
>>
>> It sounds a lot like your values are mutable classes and you are
>> mutating or reusing them somewhere? It might work until you actually
>> try to materialize them all and find many point to the same object.
>>
>> On Thu, Dec 18, 2014 at 10:06 AM, Tristan Blakers <tris...@blackfrog.org>
>> wrote:
>> > Hi,
>> >
>> > I’m getting some seemingly invalid results when I collect an RDD. This
>> > is happening in both Spark 1.1.0 and 1.2.0, using Java 8 on Mac.
>> >
>> > See the following code snippet:
>> >
>> > JavaRDD<Thing> rdd = pairRDD.values();
>> > rdd.foreach( e -> System.out.println( "RDD Foreach: " + e ) );
>> > rdd.collect().forEach( e -> System.out.println( "Collected Foreach: " + e ) );
>> >
>> > I would expect the results from the two outputs to be identical, but
>> > instead I see:
>> >
>> > RDD Foreach: Thing1
>> > RDD Foreach: Thing2
>> > RDD Foreach: Thing3
>> > RDD Foreach: Thing4
>> > (…snip…)
>> > Collected Foreach: Thing1
>> > Collected Foreach: Thing1
>> > Collected Foreach: Thing1
>> > Collected Foreach: Thing2
>> >
>> > So essentially all but one of the valid entries are replaced by an
>> > equivalent number of duplicate objects. I’ve tried various map and
>> > filter operations, but the results in the RDD always appear correct
>> > until I try to collect() the results. I’ve also found that calling
>> > cache() on the RDD materialises the duplication, such that the RDD
>> > foreach displays the duplicates too...
>> >
>> > Any suggestions for how I can go about debugging this would be massively
>> > appreciated.
>> >
>> > Cheers
>> > Tristan

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org