Hi All,
In my current project there is a requirement to store Avro data
(JSON format) as Parquet files.
I was able to use AvroParquetWriter separately to create the Parquet
files. Along with the data, the Parquet files also had the Avro schema
stored in them as part of their metadata.
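(For anyone following along, a minimal AvroParquetWriter sketch of that kind is below. The
schema and output path are made up for illustration, and this uses the older constructor;
newer parquet-avro releases expose a builder instead.)

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;

    public class AvroToParquetExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical Avro schema, for illustration only.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"payload\",\"type\":\"string\"}]}");

            AvroParquetWriter<GenericRecord> writer =
                new AvroParquetWriter<GenericRecord>(new Path("/tmp/events.parquet"), schema);
            try {
                GenericRecord record = new GenericData.Record(schema);
                record.put("id", 1L);
                record.put("payload", "{\"hello\":\"world\"}");
                // The Avro schema goes into the Parquet footer metadata along with the data.
                writer.write(record);
            } finally {
                writer.close();
            }
        }
    }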
> ...guration().set("parquet.enable.summary-metadata", "false");
>
> // save the file
> rdd.mapToPair(me -> new Tuple2<>(null, me))
>    .saveAsNewAPIHadoopFile(
>        String.format("%s/%s", path, timeStamp.milliseconds()),
>        Void.class,
>        clazz,
>        LazyOutputFormat.class,
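The quoted snippet is cut off above. For reference, a complete version of this kind of write
path might look like the sketch below. It is not the code from the original message: it assumes
the records are Avro GenericRecords and uses AvroParquetOutputFormat from parquet-avro directly;
the method name, schema handling, and path are placeholders.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.parquet.avro.AvroParquetOutputFormat;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    public class ParquetSave {
        // Assumed helper: write an RDD of GenericRecord as Parquet; the Avro
        // schema set on the job is also written into the file footer metadata.
        public static void saveAsParquet(JavaRDD<GenericRecord> rdd,
                                         Schema schema,
                                         String outputPath) throws Exception {
            Job job = Job.getInstance();
            // Tell the Avro write support which schema to use.
            AvroParquetOutputFormat.setSchema(job, schema);
            // Skip the _metadata / _common_metadata summary files.
            job.getConfiguration().set("parquet.enable.summary-metadata", "false");

            rdd.mapToPair(record -> new Tuple2<Void, GenericRecord>(null, record))
               .saveAsNewAPIHadoopFile(
                   outputPath,
                   Void.class,
                   GenericRecord.class,
                   AvroParquetOutputFormat.class,
                   job.getConfiguration());
        }
    }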
I should have phrased it differently: the Avro schema has additional
properties like 'required', etc. Right now the JSON data that I have gets
stored as optional fields in the Parquet file. Is there a way to model the
Parquet file schema close to the Avro schema? I tried using the
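The message is cut off above, so the approach the poster tried is not visible. As one
illustration of how parquet-avro maps Avro fields to Parquet repetition (not the poster's code;
the class name and schema below are made up), Avro fields that are not declared as a union with
null come out as 'required' rather than 'optional':

    import org.apache.avro.Schema;
    import org.apache.parquet.avro.AvroSchemaConverter;
    import org.apache.parquet.schema.MessageType;

    public class SchemaMappingCheck {
        public static void main(String[] args) {
            // Hypothetical Avro schema: "id" is non-nullable, "name" is nullable.
            String avroJson =
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null}]}";
            Schema avroSchema = new Schema.Parser().parse(avroJson);

            // The Parquet message type AvroParquetWriter would use:
            // "id" becomes required, "name" becomes optional.
            MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
            System.out.println(parquetSchema);
        }
    }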
Hi All,
I tried to run a simple Spark program to find out the metrics
collected while executing it. What I observed is that I'm able to get
TaskMetrics.inputMetrics data like records read, bytes read, etc., but I do
not get any metrics about the output.
I ran the below code in
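The snippet the message refers to is cut off here. As a general illustration only (not the
original code), one way to watch both input and output metrics is to register a SparkListener
and read them from each completed task. This assumes Spark 1.5.x, where inputMetrics and
outputMetrics are Scala Options and are only defined for tasks that actually read from or write
to Hadoop storage; the class name and the jsc variable are placeholders.

    import org.apache.spark.JavaSparkListener;
    import org.apache.spark.executor.InputMetrics;
    import org.apache.spark.executor.OutputMetrics;
    import org.apache.spark.executor.TaskMetrics;
    import org.apache.spark.scheduler.SparkListenerTaskEnd;

    // Prints per-task input and output metrics as tasks finish.
    public class TaskMetricsListener extends JavaSparkListener {
        @Override
        public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
            TaskMetrics metrics = taskEnd.taskMetrics();
            if (metrics == null) {
                return;
            }
            if (metrics.inputMetrics().isDefined()) {
                InputMetrics in = metrics.inputMetrics().get();
                System.out.println("records read: " + in.recordsRead()
                    + ", bytes read: " + in.bytesRead());
            }
            if (metrics.outputMetrics().isDefined()) {
                OutputMetrics out = metrics.outputMetrics().get();
                System.out.println("records written: " + out.recordsWritten()
                    + ", bytes written: " + out.bytesWritten());
            }
        }
    }

    // Registration (jsc is an existing JavaSparkContext):
    //   jsc.sc().addSparkListener(new TaskMetricsListener());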
Hi All,
I'm trying to ingest data from Kafka as Parquet files. I use Spark 1.5.2,
and I'm looking for a way to store the source schema in the Parquet file,
the way the Avro schema gets stored as metadata when using the
AvroParquetWriter. Any help is much appreciated.
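For reference, here is a small sketch of how the schema stored by AvroParquetWriter can be read
back from the file footer's key/value metadata. The file path is a placeholder, and the exact
metadata key the schema is stored under depends on the parquet-avro version.

    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class FooterMetadataDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder path to a Parquet file written with AvroParquetWriter.
            Path file = new Path("hdfs:///tmp/events/part-00000.parquet");

            // Read only the footer; the Avro schema written by parquet-avro
            // appears in the file's key/value metadata.
            ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
            Map<String, String> keyValueMetaData =
                footer.getFileMetaData().getKeyValueMetaData();
            for (Map.Entry<String, String> entry : keyValueMetaData.entrySet()) {
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }
        }
    }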
Just a reminder!!