Hello All- I have run into a somewhat perplexing issue that has plagued me for several months now (with haphazard workarounds). I am trying to create an Avro schema (as I understand it, a schema-enforced format for serializing arbitrary data) to convert some complex JSON files (arbitrary and nested) to Parquet in a pipeline.
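For concreteness, here is a toy version of the problem (a hand-made example, not my real data). A record like this:

{"id": 1, "user": {"name": "alice", "tags": ["a", "b"]}}

would need an Avro schema roughly like the following, where any field that is absent from some of the files has to become a nullable union with a default:

{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": ["null", "long"], "default": null},
    {"name": "user", "type": ["null", {
        "type": "record",
        "name": "User",
        "fields": [
          {"name": "name", "type": ["null", "string"], "default": null},
          {"name": "tags", "type": ["null", {"type": "array", "items": "string"}], "default": null}
        ]
      }], "default": null}
  ]
}

The pain point is that I do not know the full set of (nested) field names up front, so I cannot hand-write this schema.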
I am wondering if there is a way to get the superset of field names I need for this use case while staying in Apache Spark, rather than dropping down to Hadoop MapReduce. I think Apache Arrow, which is still under development, might eventually make this unnecessary by treating JSON as a first-class citizen, but that is still a ways off. The closest I have gotten is sketched below.
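This is roughly what I have been experimenting with (a minimal sketch, assuming Spark 2.x's SparkSession; on 1.x the equivalent is sqlContext.read.json, and the paths here are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-superset")
  .master("local[*]") // local testing only; normally supplied by spark-submit
  .getOrCreate()

// Spark's JSON reader scans the input and infers a single StructType that is
// the union of every field observed across all records and files; conflicting
// types are widened (numerics promoted, otherwise falling back to string).
val df = spark.read.json("/data/events/*.json") // made-up path; expects one JSON object per line

// The inferred schema is the superset of field names, nested structs included.
df.printSchema()

// If Parquet is the end goal, this skips the intermediate Avro schema entirely:
df.write.parquet("/data/events_parquet") // made-up output path

This does give me a merged schema, but I am not sure the inference is sound for truly arbitrary nesting, or whether there is a better way to extract the superset of field names in Spark.

Any guidance would be sincerely appreciated! Thanks! John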