Are you using the Avro format by any chance? Some input formats reuse the same record object for every row, so the records need to be deep-copied before caching or aggregating. Try something like:

    val input = sc.newAPIHadoopRDD(...)
    val rdd = input.map(deepCopyTransformation).map(...)
    rdd.cache()
    rdd.saveAsTextFile(...)

where deepCopyTransformation is a function that deep copies every object.
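For Avro, a minimal sketch of such a copy function might look like the following. This assumes the RDD carries (AvroKey[GenericRecord], NullWritable) pairs; the function name and input types are illustrative, not taken from your pipeline:

    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.avro.mapred.AvroKey
    import org.apache.hadoop.io.NullWritable

    // Hadoop RecordReaders reuse a single record instance per partition,
    // so rows cached without a copy can all end up pointing at the
    // last-read record. Deep-copying each record breaks that sharing.
    def deepCopyTransformation(pair: (AvroKey[GenericRecord], NullWritable)): GenericRecord = {
      val record = pair._1.datum()
      // GenericData.deepCopy clones the record against its own schema
      GenericData.get().deepCopy(record.getSchema, record)
    }

That would also explain why removing rdd.cache() makes the problem disappear: without caching, each row is consumed before the shared object is overwritten by the next read, so the mutation is never observed.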
On 26 February 2016 at 19:41, Yan Yang <y...@wealthfront.com> wrote:
> Hi
>
> I am pretty new to Spark, and after experimenting with our pipelines I
> ran into this weird issue.
>
> The Scala code is as below:
>
> val input = sc.newAPIHadoopRDD(...)
> val rdd = input.map(...)
> rdd.cache()
> rdd.saveAsTextFile(...)
>
> I found rdd to consist of 80+K identical rows. To be more precise, the
> number of rows is right, but all rows are identical.
>
> The truly weird part is that if I remove rdd.cache(), everything works
> just fine. I have encountered this issue on a few occasions.
>
> Thanks
> Yan