This is because the HadoopRDD (and the underlying Hadoop InputFormat)
reuses objects to avoid allocation. It is somewhat tricky to fix in Spark
itself, but in most cases you can clone each record as you read it, so that
you are not collecting references to the same object over and over again.
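The pitfall can be illustrated without Spark at all. Below is a minimal Python sketch; the `Record` class and `read_records` reader are hypothetical stand-ins for a Hadoop `Writable` and a `RecordReader` that reuses one instance per split. The fix mirrors what you would do in Spark: map each record to a copy before calling collect.

```python
import copy

class Record:
    """Stand-in for a mutable, reused record (like a Hadoop Writable)."""
    def __init__(self):
        self.value = None

def read_records(lines):
    # Hypothetical reader that, like Hadoop's RecordReader, reuses ONE
    # Record instance for every row instead of allocating a new one.
    rec = Record()
    for line in lines:
        rec.value = line
        yield rec

lines = ["a", "b", "c"]

# Collecting the yielded objects directly stores the SAME object three
# times, so every entry reflects the last value read ("c").
broken = [r for r in read_records(lines)]
print([r.value for r in broken])

# Cloning each record before collecting keeps a distinct snapshot.
fixed = [copy.copy(r) for r in read_records(lines)]
print([r.value for r in fixed])
```

This is why `rdd.foreach(println)` can look correct (each record is printed while it still holds its current value) while `rdd.collect` returns duplicates of the last record.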

https://issues.apache.org/jira/browse/SPARK-1018

http://mail-archives.apache.org/mod_mbox/spark-user/201308.mbox/%3ccaf_kkpzrq4otyqvwcoc6plaz9x9_sfo33u4ysatki5ptqoy...@mail.gmail.com%3E


On Thu, Sep 18, 2014 at 12:43 AM, vasiliy <zadonsk...@gmail.com> wrote:

> I posted an example in my previous post. Tested on Spark 1.0.2, 1.2.0-SNAPSHOT
> and 1.1.0 for Hadoop 2.4.0, on Windows and Linux servers with Hortonworks
> Hadoop 2.4, in local[4] mode. Any ideas about this Spark behavior?
>
>
> Akhil Das-2 wrote
> > Can you dump out a small piece of data while doing rdd.collect and
> > rdd.foreach(println)?
> >
> > Thanks
> > Best Regards
> >
> > On Wed, Sep 17, 2014 at 12:26 PM, vasiliy <zadonskiyd@> wrote:
> >
> >> it also appears in streaming hdfs fileStream
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-spark-user-list.1001560.n3.nabble.com/collect-on-hadoopFile-RDD-returns-wrong-results-tp14368p14425.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: user-help@spark.apache.org
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/collect-on-hadoopFile-RDD-returns-wrong-results-tp14368p14527.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
