Hi, which version of Spark are you using?
On Mon, Mar 21, 2016 at 12:28 PM, Sebastian Piu <sebastian....@gmail.com> wrote:

> We use this, but not sure how the schema is stored:
>
> Job job = Job.getInstance();
> ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
> AvroParquetOutputFormat.setSchema(job, schema);
> LazyOutputFormat.setOutputFormatClass(job,
>     new ParquetOutputFormat<T>().getClass());
> job.getConfiguration().set(
>     "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
> job.getConfiguration().set("parquet.enable.summary-metadata", "false");
>
> // save the file
> rdd.mapToPair(me -> new Tuple2<>(null, me))
>     .saveAsNewAPIHadoopFile(
>         String.format("%s/%s", path, timeStamp.milliseconds()),
>         Void.class,
>         clazz,
>         LazyOutputFormat.class,
>         job.getConfiguration());
>
> On Mon, 21 Mar 2016, 05:55 Manivannan Selvadurai, <smk.manivan...@gmail.com> wrote:
>
>> Hi All,
>>
>> In my current project there is a requirement to store Avro data
>> (JSON format) as Parquet files. I was able to use AvroParquetWriter
>> on its own to create the Parquet files, and along with the data those
>> files also had the Avro schema stored in their footer.
>>
>> But when I tried using Spark Streaming I could not find a way to
>> store the data with the Avro schema information. The closest I got was
>> to create a DataFrame from the JSON RDDs and store it as Parquet; there
>> the Parquet files had a Spark-specific schema in their footer.
>>
>> Is this the right approach, or is there a better one? Please guide me.
>>
>> We are using Spark 1.4.1.
>>
>> Thanks in advance!!
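For reference, a minimal self-contained sketch of the approach Sebastian describes, assuming an RDD of Avro GenericRecords and parquet-avro's AvroParquetOutputFormat (package org.apache.parquet.avro in recent releases; the 1.6.x line bundled with Spark 1.4 uses the older parquet.avro package instead). AvroToParquet, save, records, and outputPath are illustrative names, not anything from the thread:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetOutputFormat;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public final class AvroToParquet {

    // Saves an RDD of Avro records as Parquet files. AvroParquetOutputFormat
    // wires in AvroWriteSupport, which embeds the Avro schema in the Parquet
    // footer metadata (under the "parquet.avro.schema" key; older releases
    // used "avro.schema"), so readers can recover the original Avro schema.
    public static void save(JavaRDD<GenericRecord> records,
                            Schema schema,
                            String outputPath) throws Exception {
        Job job = Job.getInstance();
        AvroParquetOutputFormat.setSchema(job, schema);

        // Parquet output formats consume (key, value) pairs and ignore the
        // key, hence the null/Void key.
        records
            .mapToPair(r -> new Tuple2<Void, GenericRecord>(null, r))
            .saveAsNewAPIHadoopFile(
                outputPath,
                Void.class,
                GenericRecord.class,
                AvroParquetOutputFormat.class,
                job.getConfiguration());
    }
}

With a DStream the same call can go inside foreachRDD; the summary-metadata and _SUCCESS-marker settings in the snippet above are optional housekeeping rather than part of the schema story. This also explains the behaviour you saw: DataFrame.write goes through Spark SQL's own Parquet write support, which records a Spark SQL schema in the footer instead of the Avro one.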