You should clone your data after reading Avro. Spark's hadoopFile is backed by a Hadoop RecordReader, and AvroInputFormat (like most Hadoop input formats) reuses a single wrapper/record object for every record in a split. Any transformation that holds on to the records, such as building up a List inside reduceByKey, ends up with many references to the same mutated object. The partitionBy variant only appears to work because the extra shuffle serializes each record, which makes copies as a side effect. The real fix is to map each record through a deep copy right after reading.
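Something along these lines should do it (untested sketch; it assumes T is a class generated by the Avro compiler, i.e. a SpecificRecord, and the readAvro helper name is mine):

import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.avro.specific.{SpecificData, SpecificRecord}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

import scala.reflect.ClassTag

// Read Avro files and defensively copy every record before handing it on.
def readAvro[T <: SpecificRecord : ClassTag](sc: SparkContext, path: String): RDD[T] =
  sc.hadoopFile(path, classOf[AvroInputFormat[T]], classOf[AvroWrapper[T]],
      classOf[NullWritable])
    .map { case (wrapper, _) =>
      val datum = wrapper.datum()
      // The RecordReader reuses one AvroWrapper/record instance per split,
      // so deepCopy materializes a fresh object for each record.
      SpecificData.get().deepCopy(datum.getSchema, datum)
    }

With that in place, your original expression should return correct results without the explicit partitionBy:

val data: RDD[T] = readAvro[T](sparkContext, path)
val good: RDD[(String, List[T])] = data.map(r => (r.id, List(r))).reduceByKey(_ ++ _)

(For generic records, GenericData.get().deepCopy works the same way.)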
On 18 November 2015 at 06:28, tovbinm <tovb...@gmail.com> wrote:
> Howdy,
>
> We've noticed a strange behavior with Avro serialized data and reduceByKey
> RDD functionality. Please see below:
>
> // We're reading a bunch of Avro serialized data
> val data: RDD[T] = sparkContext.hadoopFile(path,
>   classOf[AvroInputFormat[T]], classOf[AvroWrapper[T]], classOf[NullWritable])
>
> // Incorrect data returned
> val bad: RDD[(String, List[T])] =
>   data.map(r => (r.id, List(r))).reduceByKey(_ ++ _)
>
> // After adding the partitioner we get everything as expected
> val good: RDD[(String, List[T])] =
>   data.map(r => (r.id, List(r)))
>     .partitionBy(Partitioner.defaultPartitioner(data))
>     .reduceByKey(_ ++ _)
>
> Any ideas?
>
> Thanks in advance