TaskEnd Metrics

2016-05-06 Thread Manivannan Selvadurai
Hi All, I tried to run a simple Spark program to find out the metrics collected while executing the program. What I observed is that I'm able to get TaskMetrics.inputMetrics data such as records read, bytes read, etc., but I do not get any metrics about the output. I ran the below code in
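
A minimal listener sketch for watching output metrics at task end, assuming a Spark 2.x SparkListener (on the 1.5.x line you would extend JavaSparkListener instead, and outputMetrics() comes back as an Option); the class name is made up for illustration:

    import org.apache.spark.executor.OutputMetrics;
    import org.apache.spark.scheduler.SparkListener;
    import org.apache.spark.scheduler.SparkListenerTaskEnd;

    public class OutputMetricsListener extends SparkListener {
      @Override
      public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        if (taskEnd.taskMetrics() == null) return;  // metrics may be absent for failed tasks
        OutputMetrics out = taskEnd.taskMetrics().outputMetrics();
        System.out.println("task " + taskEnd.taskInfo().taskId()
            + ": records written = " + out.recordsWritten()
            + ", bytes written = " + out.bytesWritten());
      }
    }

    // register before running the job:
    // sc.addSparkListener(new OutputMetricsListener());

One possible reason for seeing input metrics but nothing on the output side is that, in the 1.5.x era, output metrics were only populated on certain Hadoop-backed write paths.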

Fwd: Adding metadata information to parquet files

2016-04-17 Thread Manivannan Selvadurai
Just a reminder!! Hi All, I'm trying to ingest data from Kafka as Parquet files. I use Spark 1.5.2 and I'm looking for a way to store the source schema in the Parquet file, the way the Avro schema gets stored as metadata when using the AvroParquetWriter. Any help much
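
For checking what actually landed in a file's footer, a short read-side sketch with parquet-mr (the path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class FooterDump {
      public static void main(String[] args) throws Exception {
        ParquetMetadata footer = ParquetFileReader.readFooter(
            new Configuration(),
            new Path("/tmp/events.parquet"),          // placeholder path
            ParquetMetadataConverter.NO_FILTER);
        // key/value footer metadata: AvroParquetWriter stores the Avro
        // schema here, and custom entries show up here too
        footer.getFileMetaData().getKeyValueMetaData()
              .forEach((k, v) -> System.out.println(k + " = " + v));
      }
    }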

Adding metadata information to parquet files

2016-04-14 Thread Manivannan Selvadurai
Hi All, I'm trying to ingest data from Kafka as Parquet files. I use Spark 1.5.2 and I'm looking for a way to store the source schema in the Parquet file, the way the Avro schema gets stored as metadata when using the AvroParquetWriter. Any help much appreciated.
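
One way to get custom key/value pairs into the footer is to wrap a WriteSupport, which is the same footer mechanism AvroParquetWriter uses for the Avro schema. A sketch, assuming you write through parquet-mr's WriteSupport API; the class name and the "source.schema" key are made up for illustration:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.parquet.hadoop.api.WriteSupport;
    import org.apache.parquet.io.api.RecordConsumer;

    // Wraps an existing WriteSupport and injects extra key/value pairs
    // into the Parquet footer when the file is initialized.
    public class MetadataWriteSupport<T> extends WriteSupport<T> {
      private final WriteSupport<T> delegate;
      private final Map<String, String> extra;

      public MetadataWriteSupport(WriteSupport<T> delegate, Map<String, String> extra) {
        this.delegate = delegate;
        this.extra = extra;
      }

      @Override
      public WriteContext init(Configuration conf) {
        WriteContext ctx = delegate.init(conf);
        Map<String, String> merged = new HashMap<>(ctx.getExtraMetaData());
        merged.putAll(extra);  // e.g. "source.schema" -> the source schema JSON
        return new WriteContext(ctx.getSchema(), merged);
      }

      @Override
      public void prepareForWrite(RecordConsumer consumer) {
        delegate.prepareForWrite(consumer);
      }

      @Override
      public void write(T record) {
        delegate.write(record);
      }
    }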

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-22 Thread Manivannan Selvadurai
I should have phrased it differently: the Avro schema has additional properties, such as required fields. Right now the JSON data that I have gets stored as optional fields in the Parquet file. Is there a way to model the Parquet file schema close to the Avro schema? I tried using the
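
On the required/optional point: parquet-avro's schema converter keeps non-nullable Avro fields REQUIRED, whereas Spark's own writer marks columns it considers nullable (which JSON-derived data usually is) as optional. A small sketch to inspect the mapping, using a made-up two-field schema:

    import org.apache.avro.Schema;
    import org.apache.parquet.avro.AvroSchemaConverter;
    import org.apache.parquet.schema.MessageType;

    public class SchemaCheck {
      public static void main(String[] args) {
        Schema avro = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"                // non-nullable
          + "{\"name\":\"tag\",\"type\":[\"null\",\"string\"]}]}"); // nullable union
        MessageType parquet = new AvroSchemaConverter().convert(avro);
        System.out.println(parquet);
        // prints roughly:
        //   message Event {
        //     required binary id (UTF8);
        //     optional binary tag (UTF8);
        //   }
      }
    }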

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Manivannan Selvadurai
guration().set("parquet.enable.summary-metadata", "false");
>
> // save the file
> rdd.mapToPair(me -> new Tuple2(null, me))
>     .saveAsNewAPIHadoopFile(
>         String.format("%s/%s", path, timeStamp.milliseconds()),
>         Void.class,
>         clazz,
>         LazyOutputFormat.cla
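
A fuller version of that save path, as a sketch inside a method declared throws Exception: it assumes records is a JavaRDD<GenericRecord> and SCHEMA is the source Avro schema, writes through AvroParquetOutputFormat directly (the LazyOutputFormat wrapper from the thread is dropped for brevity), and uses a placeholder output path:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.parquet.avro.AvroParquetOutputFormat;
    import scala.Tuple2;

    Job job = Job.getInstance();
    AvroParquetOutputFormat.setSchema(job, SCHEMA);
    job.getConfiguration().set("parquet.enable.summary-metadata", "false");

    records
        .mapToPair(r -> new Tuple2<Void, GenericRecord>(null, r))
        .saveAsNewAPIHadoopFile(
            "/out/events",                     // placeholder path
            Void.class,
            GenericRecord.class,
            AvroParquetOutputFormat.class,
            job.getConfiguration());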

Best way to store Avro Objects as Parquet using SPARK

2016-03-20 Thread Manivannan Selvadurai
Hi All, In my current project there is a requirement to store Avro data (JSON format) as Parquet files. I was able to use AvroParquetWriter separately to create the Parquet files. The Parquet files, along with the data, also had the 'avro schema' stored on them as a part of their
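
For reference, the standalone AvroParquetWriter route described above, as a minimal sketch (the schema and path are placeholders; on current parquet-mr versions the builder API replaces this constructor):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;

    public class AvroToParquet {
      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\","
          + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<>(new Path("/tmp/events.parquet"), schema);

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", "42");
        writer.write(rec);
        writer.close();  // the footer now carries the Avro schema as metadata
      }
    }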