Hello All- I have run into a somewhat perplexing issue that has plagued me for several months now (with haphazard workarounds). I am trying to create an Avro schema (as I understand it, a schema-enforced format for serializing arbitrary data) to convert some complex JSON files (arbitrary and nested) to Parquet in a pipeline.
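For concreteness, here is a toy version of the problem (a hand-made example, not my real data). A record like this:

{"id": 1, "user": {"name": "alice", "tags": ["a", "b"]}}

would need an Avro schema roughly like the following, where any field that is absent from some of the files has to become a nullable union with a default:

{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": ["null", "long"], "default": null},
    {"name": "user", "type": ["null", {
        "type": "record",
        "name": "User",
        "fields": [
          {"name": "name", "type": ["null", "string"], "default": null},
          {"name": "tags", "type": ["null", {"type": "array", "items": "string"}], "default": null}
        ]
      }], "default": null}
  ]
}

The pain point is that I do not know the full set of (nested) field names up front, so I cannot hand-write this schema.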
I am wondering if there is a way to get the superset of field names I need for this use case while staying in Apache Spark, rather than dropping down to Hadoop MapReduce. I think Apache Arrow, which is still under development, might eventually make this unnecessary by treating JSON as a first-class citizen, but that is still a ways off. The closest I have gotten is sketched below.
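This is roughly what I have been experimenting with (a minimal sketch, assuming Spark 2.x's SparkSession; on 1.x the equivalent is sqlContext.read.json, and the paths here are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-superset")
  .master("local[*]") // local testing only; normally supplied by spark-submit
  .getOrCreate()

// Spark's JSON reader scans the input and infers a single StructType that is
// the union of every field observed across all records and files; conflicting
// types are widened (numerics promoted, otherwise falling back to string).
val df = spark.read.json("/data/events/*.json") // made-up path; expects one JSON object per line

// The inferred schema is the superset of field names, nested structs included.
df.printSchema()

// If Parquet is the end goal, this skips the intermediate Avro schema entirely:
df.write.parquet("/data/events_parquet") // made-up output path

This does give me a merged schema, but I am not sure the inference is sound for truly arbitrary nesting, or whether there is a better way to extract the superset of field names in Spark.

Any guidance would be sincerely appreciated! Thanks! John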