Being mutable is fine; reusing and mutating the objects is the issue. And yes, the objects you get back from Hadoop are reused by Hadoop InputFormats. You should map each object to a clone before any operation that needs them all to exist independently at once, like a collect().
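Continuing from your snippet, something roughly like this (untested sketch; I'm assuming Thing has a copy constructor or some other deep-copy method, so substitute whatever your Avro-generated classes actually provide):

// The Hadoop record reader reuses its record objects, so copy each
// element before anything that needs all of them to exist at once,
// such as collect() or cache():
JavaRDD<Thing> copied = pairRDD.values().map( t -> new Thing(t) );
List<Thing> results = copied.collect();  // each element is now an independent object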
(That said... generally speaking collect() involves copying from workers to
the driver, which necessarily means a copy anyway. I suspect this isn't
working that way for you since you're running it all locally?)

On Thu, Dec 18, 2014 at 10:42 AM, Tristan Blakers <tris...@blackfrog.org> wrote:
> Suspected the same thing, but because the underlying data classes are
> deserialised by Avro I think they have to be mutable, as you need to
> provide the no-args constructor with settable fields.
>
> Nothing is being cached in my code anywhere, and this can be reproduced
> using data directly out of the newAPIHadoopRDD() call. Debugs added to the
> constructors of the various classes show that the right number are being
> constructed, though the watches set on some of the fields aren’t always
> triggering, so I suspect maybe the serialisation is doing something a bit
> too clever?
>
> Tristan
>
> On 18 December 2014 at 21:25, Sean Owen <so...@cloudera.com> wrote:
>>
>> It sounds a lot like your values are mutable classes and you are
>> mutating or reusing them somewhere? It might work until you actually
>> try to materialize them all and find many point to the same object.
>>
>> On Thu, Dec 18, 2014 at 10:06 AM, Tristan Blakers <tris...@blackfrog.org>
>> wrote:
>> > Hi,
>> >
>> > I’m getting some seemingly invalid results when I collect an RDD. This
>> > is happening in both Spark 1.1.0 and 1.2.0, using Java 8 on Mac.
>> >
>> > See the following code snippet:
>> >
>> > JavaRDD<Thing> rdd = pairRDD.values();
>> > rdd.foreach( e -> System.out.println( "RDD Foreach: " + e ) );
>> > rdd.collect().forEach( e -> System.out.println( "Collected Foreach: " + e ) );
>> >
>> > I would expect the results from the two outputs to be identical, but
>> > instead I see:
>> >
>> > RDD Foreach: Thing1
>> > RDD Foreach: Thing2
>> > RDD Foreach: Thing3
>> > RDD Foreach: Thing4
>> > (…snip…)
>> > Collected Foreach: Thing1
>> > Collected Foreach: Thing1
>> > Collected Foreach: Thing1
>> > Collected Foreach: Thing2
>> >
>> > So essentially all but one of the valid entries are replaced by an
>> > equivalent number of duplicate objects. I’ve tried various map and
>> > filter operations, but the results in the RDD always appear correct
>> > until I try to collect() the results. I’ve also found that calling
>> > cache() on the RDD materialises the duplication, such that the RDD
>> > foreach displays the duplicates too...
>> >
>> > Any suggestions for how I can go about debugging this would be massively
>> > appreciated.
>> >
>> > Cheers
>> > Tristan

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org