I have a bit of a strange situation:

*****************
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper, AvroKey}
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.{NullWritable, WritableUtils}

val path = "/path/to/data.avro"

val rdd = sc.newAPIHadoopFile(path,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

rdd.take(10).foreach( x => println( x._1.datum() ))
*****************

In this situation, I get the right number of records returned, and if I look at the contents of rdd I see the individual records as Tuple2's... however, if I println on each one as shown above, I get the same result every time. Apparently this has to do with something in Spark or Avro keeping a reference to (and reusing) the item it's iterating over, so I need to clone the object before I use it. However, if I try to clone it (from the spark-shell console), I get:

*****************
rdd.take(10).foreach( x => {
  val clonedDatum = x._1.datum().clone()
  println(clonedDatum.datum())
})

<console>:37: error: method clone in class Object cannot be accessed in
org.apache.avro.generic.GenericRecord
 Access to protected method clone not permitted because
 prefix type org.apache.avro.generic.GenericRecord does not conform to
 class $iwC where the access take place
        val clonedDatum = x._1.datum().clone()
*****************

So, how can I clone the datum? Seems I'm not the only one who ran into this problem: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/102. I can't figure out how to fix it in my case without hacking away like the person in the linked issue did.

Suggestions?

-- Chris Miller
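P.S. In case it helps frame what I'm after, this is roughly the shape of workaround I have in mind: copy each datum inside the map, before anything is collected back to the driver, so the (possibly reused) underlying object is never shared. This is an untested sketch that reuses the rdd defined above; I'm assuming GenericData.get().deepCopy is an acceptable way to make an independent copy of a GenericRecord, which is exactly the part I'm unsure about:

*****************
import org.apache.avro.generic.GenericData

// Copy each record while the map function still sees the "live" object,
// i.e. before anything is collected back to the driver.
val copies = rdd.map { case (avroKey, _) =>
  val datum = avroKey.datum()
  // deepCopy builds a new, independent record from the datum's own schema
  GenericData.get().deepCopy(datum.getSchema, datum)
}

copies.take(10).foreach(println)
*****************

If shipping GenericData.Record objects back to the driver turns out to be its own problem, mapping to datum.toString (or some serializable class of mine) right after the copy would be an alternative, but that feels like it sidesteps the cloning question rather than answering it.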