Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-22 Thread Manivannan Selvadurai
I should have phrased it differently: an Avro schema has additional properties, such as whether a field is required. Right now the JSON data that I have gets stored as optional fields in the Parquet file. Is there a way to model the Parquet file schema closer to the Avro schema? I tried using the
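For context, a minimal sketch of the distinction in question (the record and field names are hypothetical): in Avro a field is required unless its type is a union with null, and parquet-avro's schema converter maps that onto Parquet's required/optional repetition, whereas a DataFrame inferred from JSON marks every column nullable.

    // import org.apache.avro.Schema;
    // "id" has a non-null type, so parquet-avro writes it as a REQUIRED
    // Parquet column; "note" is a ["null","string"] union, so it becomes
    // OPTIONAL.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"note\",\"type\":[\"null\",\"string\"],\"default\":null}]}");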

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Michael Armbrust
> But when I tried using Spark Streaming I could not find a way to store the
> data with the Avro schema information. The closest that I got was to create
> a DataFrame using the JSON RDDs and store them as parquet. Here the parquet
> files had a Spark-specific schema in their footer.
> Does this
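For reference, a minimal sketch of the DataFrame-from-JSON approach described in the quote above, using Spark 1.x Java APIs (the RDD name and output path are assumptions, not from the original message):

    // import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.SQLContext;
    // jsonRdd: an existing JavaRDD<String> of JSON documents (assumed).
    DataFrame df = sqlContext.read().json(jsonRdd);
    // Spark infers its own schema (all fields nullable) and embeds it in the
    // Parquet footer; the original Avro schema is not preserved.
    df.write().parquet("hdfs:///tmp/json-parquet");  // path is hypothetical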

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Manivannan Selvadurai
Hi, which version of Spark are you using?

On Mon, Mar 21, 2016 at 12:28 PM, Sebastian Piu wrote:
> We use this, but not sure how the schema is stored
>
> Job job = Job.getInstance();
> ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
>

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Sebastian Piu
We use this, but not sure how the schema is stored:

    Job job = Job.getInstance();
    ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
    AvroParquetOutputFormat.setSchema(job, schema);
    LazyOutputFormat.setOutputFormatClass(job, new ParquetOutputFormat().getClass());
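For completeness, a rough sketch of how a setup like the above can be wired into a Spark job (the pair RDD, the GenericRecord value type, and the output path are assumptions, not from the original message):

    // Assumes a JavaPairRDD<Void, GenericRecord> named 'records' has already
    // been built; ParquetOutputFormat ignores the key.
    Job job = Job.getInstance();
    ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
    AvroParquetOutputFormat.setSchema(job, schema);
    records.saveAsNewAPIHadoopFile(
        "hdfs:///tmp/events-parquet",   // output path (hypothetical)
        Void.class,
        GenericRecord.class,
        ParquetOutputFormat.class,
        job.getConfiguration());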

Best way to store Avro Objects as Parquet using SPARK

2016-03-20 Thread Manivannan Selvadurai
Hi All,

In my current project there is a requirement to store Avro data (in JSON format) as Parquet files. I was able to use AvroParquetWriter separately to create the Parquet files. The Parquet files, along with the data, also had the 'avro schema' stored in them as a part of their
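For reference, a minimal sketch of driving AvroParquetWriter directly, as described above (the path, schema, and record are placeholders; the simple constructor is used as a plausible fit for a 2016-era parquet-avro):

    // import org.apache.avro.generic.GenericRecord;
    // import org.apache.hadoop.fs.Path;
    // import org.apache.parquet.avro.AvroParquetWriter;
    // Writes GenericRecords to a Parquet file; parquet-avro also stores the
    // Avro schema string in the file footer metadata.
    Path path = new Path("file:///tmp/events.parquet");  // hypothetical path
    AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<>(path, schema);
    try {
        writer.write(record);  // record: a GenericRecord matching 'schema'
    } finally {
        writer.close();
    }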