Re: Apache Drill-Querying One JSON File Per Record Option

John Radin Fri, 15 Jan 2016 12:04:26 -0800

Hello Jason-

Thank you kindly for your reply!  Yes, I thought Drill would be able to
read one file per record.  I'll need to explore Drill's flatten operations
for sure.  I've just found it to be a major challenge to work with this
FHIR Bundle JSON format (one patient resource bundled with several other
resources).  I've been curious if my best approach in my ETL pipeline for
this regardless of platform (Drill or Spark) would actually be:


JSON->Avro->Parquet

My rationale is to create and maintain an Avro schema from the JSON, and
then supply that schema to the Parquet conversion stage.
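
To make the first stage concrete, here is a rough Python sketch of deriving
an Avro-style schema from one sample JSON document. Everything named here is
an assumption for illustration only: the helper infer_avro_schema, the way
nested record names are built from the field path, and the tiny sample
bundle are mine, not part of any FHIR or Avro tooling. It also assumes every
array is homogeneous, so it only inspects the first element, and the result
has not been validated by an Avro library.

```python
import json


def infer_avro_schema(value, name="Root"):
    """Hypothetical helper: build an Avro-style schema for one JSON value."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        # Assumption: arrays are homogeneous, so the first item is representative.
        items = infer_avro_schema(value[0], name + "Item") if value else "null"
        return {"type": "array", "items": items}
    if isinstance(value, dict):
        # Each nesting level becomes its own named record, so the schema
        # spells out every level explicitly.
        fields = [
            {"name": key, "type": infer_avro_schema(val, name + "_" + key)}
            for key, val in value.items()
        ]
        return {"type": "record", "name": name, "fields": fields}
    raise TypeError("unsupported JSON value: %r" % (value,))


# Illustrative sample only, far smaller than a real FHIR bundle.
sample = json.loads("""
{"resourceType": "Bundle",
 "entry": [{"resource": {"resourceType": "Patient", "active": true}}]}
""")

schema = infer_avro_schema(sample, "Bundle")
print(json.dumps(schema, indent=2))
```

A schema inferred from a single file only covers the fields and nesting
levels that file happens to contain, so in practice the schemas from many
bundles would need to be merged (or the schema written by hand) before it
is handed to the Parquet stage.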

Do you think these schema files would potentially help?  FHIR Schemas &
Schematrons <https://www.hl7.org/fhir/fhir-all-xsd.zip>

Thanks again,
John

On Thu, Jan 14, 2016 at 2:16 PM, Jason Altekruse <[email protected]>
wrote:

> Hi John,
>
> Thank you for your support. We are trying to build the most useful tool for
> analytics across data sources, and we are always glad to hear we are on the
> right track.
>
> I am a little confused about your question. If you point Drill at a file
> with that JSON in it, it will read it as a single record.
>
> You mention wanting to flatten the data out and put in Parquet files. Have
> you tried working with the FLATTEN function in Drill? [1]
>
> Drill does not currently support something like a recursive flatten; each
> level of flattening requires an explicit call to the FLATTEN function. So
> I'm not sure if you will be able to do exactly what you want if the
> documents really can have arbitrary nesting depths. Parquet also lacks
> support for recursive data structure definitions: the metadata requires a
> complete schema, explicitly giving each level of nesting, to be provided
> when you start writing the file (Drill will do this automatically for you
> during a CTAS statement, but it will just use whatever levels of nesting it
> read out of your JSON as the Parquet schema).
>
> How much you want to flatten is going to depend on the kind of analysis you
> need to do. There are a lot of different lists in this dataset at various
> levels of nesting. I think you are likely going to want to flatten out at
> least the `entry` array, although I'm not quite sure how analysis across
> these lists full of 'comment' fields would work in your case. It might make
> sense to store these as lists and flatten them in different queries
> invoking analysis of only some of the lists.
>
> I actually just answered another question about flattening a complex JSON
> structure this morning; you may find my comments over there useful for
> learning about Drill. [2]
>
> [1] - https://drill.apache.org/docs/flatten/
> [2] - http://mail-archives.apache.org/mod_mbox/drill-user/201601.mbox/%3CCAMpYv7C3CqY6D8x5CC3H955n4CSDTuqY3a8PfZwT1m2dhEyN7w%40mail.gmail.com%3E
>
> On Fri, Jan 8, 2016 at 11:47 AM, John Radin <[email protected]> wrote:
>
> > Hello All-
> >
> > First off, I just wanted to thank you all for this great project.  Given
> > the scale and heterogeneity of modern data sources, Drill has killer use
> > cases.
> >
> > I did want to inquire about a use case I have been researching where I
> > think Drill could be very useful in my ETL pipeline.  I just want to
> > articulate it and get some opinions.
> >
> > I have an HDFS directory of files in the following JSON format:
> >
> > https://www.hl7.org/fhir/bundle-transaction.json.html
> >
> > The issue is that I would like to treat each individual file as a record,
> > since each one corresponds to one entity of interest (only one patient
> > resource per bundle).  I'm curious as to how Drill differs from Apache
> > Spark (which I am currently using) on this.  I've found Apache Spark's
> > off-the-shelf methods ineffective in this respect, and my attempts to use
> > sc.wholeTextFiles() and subsequent RDD mapping operations to be very
> > inefficient/memory intensive.
> >
> > Given that a bundle can contain an arbitrary # of resources AND arbitrary
> > nesting depth of those resources, it is challenging to find a way to
> > flatten them effectively and ideally save them in parquet file(s).
> >
> > Any advice or pointers as to whether Drill might be a solution to my use
> > case would be most appreciated!
> >
> > Cheers,
> > John
> >
>



-- 
John Radin
Forward Deployed Data Scientist | Lumiata
[email protected] | 434-327-7311

www.lumiata.com
optimizing care, elevating health
