Figured it out ... I needed to use saveAsNewAPIHadoopFile, but was trying to call it on myDF.rdd directly instead of converting it to a PairRDD first.
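
In case it helps anyone else, here is roughly what the working version looks like. This is only a sketch of the approach: toAvroRecord is a hypothetical helper that builds the Avro-generated MyClass record from a Row, and the rest assumes the sc / myDF / outputPath from my original mail.

import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroParquetOutputFormat

// Register the Avro schema with the output format via a Hadoop Job.
val job = Job.getInstance(sc.hadoopConfiguration)
AvroParquetOutputFormat.setSchema(job, MyClass.SCHEMA$)

// saveAsNewAPIHadoopFile lives on PairRDDFunctions, so map each Row to a
// (key, value) pair first; AvroParquetOutputFormat ignores the key.
val pairRDD = myDF.rdd.map { row =>
  (null: Void, toAvroRecord(row)) // toAvroRecord: hypothetical Row -> MyClass converter
}

pairRDD.saveAsNewAPIHadoopFile(
  outputPath,
  classOf[Void],
  classOf[MyClass],
  classOf[AvroParquetOutputFormat],
  job.getConfiguration)

This bypasses myDF.write.parquet entirely, so ParquetRelation never gets the chance to swap in its own write support class.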
On Mon, Oct 19, 2015 at 2:14 PM, Alex Nastetsky <alex.nastet...@vervemobile.com> wrote:

> Using Spark 1.5.1, Parquet 1.7.0.
>
> I'm trying to write Avro/Parquet files. I have this code:
>
> sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS,
>   classOf[AvroWriteSupport].getName)
> AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$)
> myDF.write.parquet(outputPath)
>
> The problem is that the write support class gets overwritten in
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation#prepareJobForWrite:
>
> val writeSupportClass =
>   if (dataSchema.map(_.dataType).forall(ParquetTypesConverter.isPrimitiveType)) {
>     classOf[MutableRowWriteSupport]
>   } else {
>     classOf[RowWriteSupport]
>   }
> ParquetOutputFormat.setWriteSupportClass(job, writeSupportClass)
>
> So it doesn't seem to actually write Avro data. When I look at the metadata
> of the Parquet files it writes, it looks like this:
>
> extra: org.apache.spark.sql.parquet.row.metadata =
> {"type":"struct","fields":[{"name":"foo","type":"string","nullable":true,"metadata":{}},{"name":"bar","type":"long","nullable":true,"metadata":{}}]}
>
> I would expect to see something like "extra: avro.schema" instead.