Hello All-

First off, I just wanted to thank you all for this great project.  Given
the scale and heterogeneity of modern data sources, Drill has killer use
cases.

I did want to inquire about a use case I have been researching where I
think Drill could be very useful in my ETL pipeline.  I just want to
articulate it and get some opinions.

I have an HDFS directory of JSON files in the following format:

https://www.hl7.org/fhir/bundle-transaction.json.html

The issue is that I would like to treat each individual file as a single
record, since each file corresponds to one entity of interest (only one
patient resource per bundle).  I'm curious as to how Drill differs from
Apache Spark (which I am currently using) in this respect.  I've found
Spark's off-the-shelf methods ineffective here, and my attempts to use
sc.wholeTextFiles() with subsequent RDD mapping operations have been very
inefficient and memory-intensive.
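
For reference, my current Spark attempt looks roughly like the sketch
below (simplified; the HDFS path is a placeholder, and I'm assuming the
standard bundle layout of an "entry" array holding "resource" objects):

import json
from pyspark import SparkContext

sc = SparkContext(appName="fhir-bundles")

# wholeTextFiles gives one (path, content) pair per file, so each whole
# bundle becomes a single element -- but the entire file is held in memory
# as one string.
bundles = sc.wholeTextFiles("hdfs:///path/to/bundles/")  # placeholder path

# Parse each file as one JSON document and pull out its resource entries.
resources_per_bundle = bundles \
    .map(lambda kv: json.loads(kv[1])) \
    .map(lambda bundle: [e.get("resource") for e in bundle.get("entry", [])])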

Given that a bundle can contain an arbitrary number of resources AND
arbitrary nesting depth within those resources, it is challenging to
flatten them effectively and, ideally, save them out as Parquet file(s).
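
The only generic approach I've come up with so far is to recursively walk
each resource and build dotted key paths, along the lines of the sketch
below (the naming convention is just one possibility I've been trying):

def flatten(node, prefix=""):
    """Recursively flatten nested dicts/lists into a single dict mapping
    dotted key paths to leaf values (list indices become path segments)."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten(value, prefix + key + "."))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            flat.update(flatten(value, prefix + str(i) + "."))
    else:
        flat[prefix.rstrip(".")] = node
    return flat

Since every bundle can yield a different set of key paths, there is no
stable schema to hand to a Parquet writer ahead of time, which is the part
I'm hoping Drill might handle more gracefully.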

Any advice or pointers as to whether Drill might be a good fit for this
use case would be most appreciated!

Cheers,
John
