Incorrect results with reduceByKey

2015-11-17 Thread tovbinm
Howdy,

We've noticed strange behavior when using the RDD reduceByKey operation on Avro-serialized data. Please see below:

    // We're reading a bunch of Avro-serialized data
    val data: RDD[T] = sparkContext.hadoopFile(path, classOf[AvroInputFormat[T]], classOf[AvroWrapper[T]], classOf[NullWritable])

Re: Incorrect results with reduceByKey

2015-11-18 Thread tovbinm
Deep copying the data solved the issue:

    data.map(r => { val t = SpecificData.get().deepCopy(r.getSchema, r); (t.id, List(t)) }).reduceByKey(_ ++ _)

(noted here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003)

Thanks Igor Berman, for
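
For readers wondering why a deep copy changes the result: Hadoop record readers reuse one mutable object for every record, so any references kept across records (for example inside the List built for reduceByKey) all point at the last record read. The following is a minimal sketch of that effect in plain Scala, without Spark or Avro. The names Record, ReuseDemo, reader and groupById are hypothetical stand-ins invented for illustration, and copy() plays the role that SpecificData.get().deepCopy plays in the fix above.

```scala
// Hypothetical stand-in for a mutable Avro record; not a real Avro class.
final class Record(var id: Int, var payload: String) {
  // Plays the role of SpecificData.get().deepCopy(schema, record).
  def copy(): Record = new Record(id, payload)
}

object ReuseDemo {
  // Mimics a record reader that hands out the SAME mutable instance for every row.
  def reader(rows: Seq[(Int, String)]): Iterator[Record] = {
    val shared = new Record(0, "")
    rows.iterator.map { case (id, payload) =>
      shared.id = id
      shared.payload = payload
      shared
    }
  }

  // Groups records by id while retaining references to the records themselves,
  // which is roughly what (t.id, List(t)) followed by reduceByKey(_ ++ _) does.
  def groupById(records: Iterator[Record]): Map[Int, List[String]] =
    records
      .map(r => (r.id, r))   // the key is read now, the record reference is kept
      .toList
      .groupBy(_._1)
      .map { case (id, pairs) => id -> pairs.map(_._2.payload) }

  def main(args: Array[String]): Unit = {
    val rows = Seq((1, "a"), (2, "b"), (1, "c"))

    // Without copying, every retained reference points at the last row read.
    val broken = groupById(reader(rows))

    // Copying each record as it is read gives every entry its own data.
    val fixed = groupById(reader(rows).map(_.copy()))

    println(s"without copy: $broken")
    println(s"with copy:    $fixed")
  }
}
```

In this sketch the uncopied run yields the payload "c" for every entry under both keys, while the copied run keeps "a" and "c" under key 1 and "b" under key 2. A real Spark job behaves analogously only under the assumption that records are cached, collected, or aggregated by reference before being copied.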

Metadata is not propagating with Dataset.map()

2017-01-16 Thread tovbinm
Hello,

It seems that metadata is not propagating when using Dataset.map(). Is there a workaround? Below are the steps to reproduce:

    import spark.implicits._
    val columnName = "col1"
    val meta = new MetadataBuilder().putString("foo", "bar").build()
    val schema =