Howdy,
We've noticed strange behavior with Avro-serialized data and the reduceByKey RDD operation. Please see below:
// We're reading a bunch of Avro serialized data
val data: RDD[T] = sparkContext.hadoopFile(path,
  classOf[AvroInputFormat[T]], classOf[AvroWrapper[T]], classOf[NullWritable])
  .map(_._1.datum()) // unwrap the AvroWrapper key; hadoopFile returns RDD[(AvroWrapper[T], NullWritable)]
Without a deep copy, reduceByKey produced incorrect groups; deep copying each record solved the issue:
data.map { r =>
  val t = SpecificData.get().deepCopy(r.getSchema, r)
  (t.id, List(t))
}.reduceByKey(_ ++ _)
(noted here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003)
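The likely root cause (hedged): Hadoop's RecordReader reuses a single mutable record instance for every row it reads, so collecting or grouping bare references leaves every element pointing at the last record read. A minimal Spark-free sketch of the pitfall, using a hypothetical mutable `Record` class in place of the reused Avro record:

```scala
// Record is a stand-in for the single mutable instance a RecordReader reuses.
class Record(var id: Int, var value: String)

object ReuseDemo {
  // Simulates a reader that hands out the SAME instance for every row.
  def readAll(rows: Seq[(Int, String)]): Seq[Record] = {
    val reused = new Record(0, "")
    rows.map { case (id, v) =>
      reused.id = id
      reused.value = v
      reused // BUG: every element of the result is the same object
    }
  }

  // Same reader, but each row is copied out — analogous to SpecificData.deepCopy.
  def readAllCopied(rows: Seq[(Int, String)]): Seq[Record] = {
    val reused = new Record(0, "")
    rows.map { case (id, v) =>
      reused.id = id
      reused.value = v
      new Record(reused.id, reused.value) // fresh copy per row
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq((1, "a"), (2, "b"))
    println(readAll(rows).map(_.id))       // List(2, 2) — stale references
    println(readAllCopied(rows).map(_.id)) // List(1, 2) — correct
  }
}
```

The same aliasing happens inside an RDD partition, which is why the deepCopy in the map above fixes the reduceByKey results.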
Thanks,
Igor Berman
Hello,
It seems that column metadata does not propagate through Dataset.map(). Is there
a workaround?
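One possible workaround (a hedged sketch, not a confirmed fix): since map() builds a fresh schema for the result, re-attach the metadata afterwards with Column.as(alias, metadata). This assumes Spark SQL on the classpath; the column name and metadata key mirror the repro below.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

object MetadataWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("meta").getOrCreate()
    import spark.implicits._

    val meta = new MetadataBuilder().putString("foo", "bar").build()
    // Attach metadata to col1.
    val df = Seq("a", "b").toDF("col1").select(col("col1").as("col1", meta))

    // Dataset.map produces a new schema, dropping the metadata...
    val mapped = df.as[String].map(identity).toDF("col1")
    // ...so re-attach it explicitly after the map:
    val restored = mapped.select(col("col1").as("col1", meta))
    println(restored.schema("col1").metadata) // metadata with foo -> bar is back
    spark.stop()
  }
}
```

This re-attachment has to be repeated after every map(), so it is a workaround rather than a fix for the propagation itself.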
Below are the steps to reproduce:
import spark.implicits._
val columnName = "col1"
val meta = new MetadataBuilder().putString("foo", "bar").build()
val schema =