Howdy,

We've noticed strange behavior when combining Avro-serialized data with the reduceByKey RDD operation. Please see below:
    // We're reading a bunch of Avro-serialized data
    val data: RDD[T] = sparkContext.hadoopFile(path,
      classOf[AvroInputFormat[T]],
      classOf[AvroWrapper[T]],
      classOf[NullWritable])

    // Incorrect data returned
    val bad: RDD[(String, List[T])] =
      data.map(r => (r.id, List(r))).reduceByKey(_ ++ _)

    // After adding the partitioner we get everything as expected
    val good: RDD[(String, List[T])] =
      data.map(r => (r.id, List(r)))
          .partitionBy(Partitioner.defaultPartitioner(data))
          .reduceByKey(_ ++ _)

Any ideas? Thanks in advance.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Incorrect-results-with-reduceByKey-tp25410.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
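One theory we're considering (not yet confirmed for our case): Hadoop input formats, including AvroInputFormat, typically reuse a single record object for every row, so any transformation that holds on to references — like building Lists that reduceByKey combines map-side before any shuffle — ends up with every element mutated to the last record read. The plain-Scala sketch below (no Spark; the Record class and readAll helper are made up purely for illustration) reproduces that aliasing effect:

```scala
// Hypothetical sketch of the record-reuse pitfall. Record and readAll
// are invented stand-ins for an Avro record and a Hadoop RecordReader.
final class Record(var id: String, var value: Int)

object ReuseDemo {
  // Simulates a RecordReader that reuses ONE mutable object for every row.
  // When copy = false, every returned element aliases that single object.
  def readAll(rows: List[(String, Int)], copy: Boolean): List[Record] = {
    val reused = new Record("", 0)
    rows.map { case (id, v) =>
      reused.id = id
      reused.value = v
      if (copy) new Record(reused.id, reused.value) else reused
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = List(("a", 1), ("b", 2), ("c", 3))

    // Without copying, all elements have been mutated to the last row read.
    val bad = readAll(rows, copy = false)
    println(bad.map(r => (r.id, r.value)))   // every entry is the last record

    // Defensive copies at read time keep each element's data intact.
    val good = readAll(rows, copy = true)
    println(good.map(r => (r.id, r.value)))
  }
}
```

If this is what's happening, copying each record (e.g. deep-copying the Avro datum) in a map right after hadoopFile should also fix it — and it would explain why partitionBy helps: the extra shuffle serializes each record, which effectively copies it before reduceByKey combines anything.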