This is because Hadoop Writables are being reused. Map each record to a
custom type first, and then do further operations, including cache(), on that.
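A minimal sketch of that mapping, assuming the input is (LongWritable, Text) pairs from newAPIHadoopRDD (the key/value types here are my assumption, not stated in the thread):

    // Hadoop RecordReaders reuse the same Writable instances across records,
    // so cache() would otherwise retain many references to one mutated object.
    // Copying each value into an immutable Scala type breaks that aliasing.
    val input = sc.newAPIHadoopRDD(...)   // assumed RDD[(LongWritable, Text)]
    val rdd = input.map { case (_, value) =>
      value.toString                      // Text#toString allocates a fresh String
    }
    rdd.cache()
    rdd.saveAsTextFile(...)

After the map, every cached element is its own immutable String rather than a reference to the single reused Writable.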
Regards
Sab
On 27-Feb-2016 9:11 am, "Yan Yang" wrote:
> Hi
>
> I am pretty new to Spark, and after experimentation on our pipelines. I
Are you using the Avro format by any chance?
Some formats reuse record objects, so their records need to be deep-copied
before caching or aggregating.
Try something like
val input = sc.newAPIHadoopRDD(...)
val rdd = input.map(deepCopyTransformation).map(...)
rdd.cache()
rdd.saveAsTextFile(...)
where deepCopyTransformation is
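The message appears to be cut off here in the archive. A plausible shape for deepCopyTransformation, assuming the RDD holds (AvroKey[GenericRecord], NullWritable) pairs as produced by Avro's Hadoop input format (these types are my assumption, not from the original message):

    // Hypothetical deep copy for Avro records read via newAPIHadoopRDD.
    // GenericData.get().deepCopy clones the record, so the reused buffer
    // backing the AvroKey no longer aliases every cached row.
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.avro.mapred.AvroKey
    import org.apache.hadoop.io.NullWritable

    def deepCopyTransformation(
        pair: (AvroKey[GenericRecord], NullWritable)): GenericRecord = {
      val record = pair._1.datum()
      GenericData.get().deepCopy(record.getSchema, record)
    }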
Hi
I am pretty new to Spark, and after experimenting with our pipelines I ran
into this weird issue.
The Scala code is as below:
val input = sc.newAPIHadoopRDD(...)
val rdd = input.map(...)
rdd.cache()
rdd.saveAsTextFile(...)
I found the rdd to consist of 80K+ identical rows. To be more precise,