Food for thought about intra-document operation

Damien Profeta Mon, 25 Sep 2017 06:10:51 -0700

Hello,

A few formats handled by Drill make it possible to work with documents, meaning nested and repeated structures instead of just tables. JSON and Parquet are the two that come to my mind right now. Document modeling is a great way to express complex objects and is used a lot in my company. Drill is able to handle such documents but, unfortunately, it cannot do much computation on them. By computation I mean filtering branches of the document, computing statistics (avg, min, max) on part of the document, and so on. That would be very useful in an analytics tool.


_What can be done_

The question then is how to express the computation we want to do on the document. I have found multiple ways to handle that and I don't really know which one is the best, hence this mail to lay out what I have found and, maybe, start a discussion.

First, if we look back at the Dremel paper, which is the basis of the Parquet format and also one of the inspirations for Drill, Dremel adds the special keyword "WITHIN" to SQL to specify that the computation has to be done within a document. What is very powerful with this keyword is that it allows you to generate documents and doesn't force you to flatten everything. You can find examples of its usage in Google's successor to Dremel, BigQuery, and its documentation: https://cloud.google.com/bigquery/docs/legacy-nested-repeated.

But it seems that it was problematic for Google, because they now propose a SQL dialect for BigQuery that seems to be compliant with SQL 2011 to handle such computation. I am not familiar with SQL 2011, but the BigQuery documentation says it integrates keywords for nested and repeated structures. You can see how this is done in BigQuery here: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays . Basically, what I have seen is that they leverage the UNNEST and ARRAY keywords and are then able to use JOIN or CROSS JOIN to describe the aggregation.

In Impala, they have added a way to write a subquery on a complex type in such a way that the subquery only acts intra-document. I have no idea if this is standard SQL or not. For an example, on the page https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types look at the phrase: “The subquery labelled SUBQ1 is correlated:”.

In Presto, you can apply lambda functions to a map or an array to transform the structure and filter it. So you have the filter and map_filter functions to filter arrays and maps respectively. (cf. https://prestodb.io/docs/current/functions/lambda.html#filter)


_Example_

To give a short example, let's say we have a flight with a group of passengers in it. A document would be:

{ "flightnb": 1234, "group": [{"age": 30, "gender": "M"}, {"age": 15, "gender": "F"}, {"age": 10, "gender": "F"}, {"age": 30, "gender": "F"}] }

The database would hold millions of such documents and I want to know the average age of the male passengers for every flight.
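To pin down the expected result, here is a minimal Python sketch of that computation on the example document above. It is only a reference for the semantics, not a description of how Drill or any of the engines mentioned here would evaluate it; the function name avg_male_age_per_flight is made up for illustration.

```python
from statistics import mean

# The example document from this mail, as a one-element "database".
flights = [
    {
        "flightnb": 1234,
        "group": [
            {"age": 30, "gender": "M"},
            {"age": 15, "gender": "F"},
            {"age": 10, "gender": "F"},
            {"age": 30, "gender": "F"},
        ],
    },
]


def avg_male_age_per_flight(documents):
    """Return {flightnb: average age of male passengers}, computed inside each document.

    Flights without any male passenger map to None (an assumption of this
    sketch; a SQL engine would typically yield NULL in that case).
    """
    result = {}
    for doc in documents:
        # Filter one branch of the document, without flattening the whole dataset.
        male_ages = [p["age"] for p in doc["group"] if p["gender"] == "M"]
        result[doc["flightnb"]] = mean(male_ages) if male_ages else None
    return result


print(avg_male_age_per_flight(flights))
```

For the single document above, the only male passenger is 30, so flight 1234 gets an average of 30.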

In Dremel, the query would be something like:
select flightnb, avg(male_age) within record from (select group.age as male_age from flight where group.gender = "M")

With SQL, it would be something like:
select flightnb, avg(male_age) from (array(select g.age as male_age from unnest(group) as g where g.gender = "M") as male_age)
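One way to read the UNNEST style is "flatten each document into one row per passenger, then filter and aggregate per flight". A rough Python sketch of that reading follows. The helpers unnest_group and avg_age_by_flight are hypothetical names, the second flight is made-up data, and none of this claims to reflect how Calcite or BigQuery actually plan such a query.

```python
from collections import defaultdict

# First document is the example from this mail; the second is made up for illustration.
flights = [
    {"flightnb": 1234,
     "group": [{"age": 30, "gender": "M"}, {"age": 15, "gender": "F"},
               {"age": 10, "gender": "F"}, {"age": 30, "gender": "F"}]},
    {"flightnb": 5678,
     "group": [{"age": 20, "gender": "M"}, {"age": 50, "gender": "M"}]},
]


def unnest_group(documents):
    """Flatten: yield one row per passenger, carrying the flight number along."""
    for doc in documents:
        for passenger in doc["group"]:
            yield {"flightnb": doc["flightnb"], **passenger}


def avg_age_by_flight(rows, gender):
    """Filter the flat rows on gender, then group by flightnb and average the ages."""
    ages = defaultdict(list)
    for row in rows:
        if row["gender"] == gender:
            ages[row["flightnb"]].append(row["age"])
    return {nb: sum(values) / len(values) for nb, values in ages.items()}


print(avg_age_by_flight(unnest_group(flights), "M"))
```

Note one consequence of flattening first in this sketch: a flight with no male passenger simply disappears from the result, whereas a computation done within each document can still return one row per flight.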

With Impala, it would be something like:
select flightnb, avg(male) from flight, select g.age from group as g where g.gender = "M" as male

With Presto, it would be something like:
select flightnb, avg(male) from flight, filter(group, x -> x.gender = "M") as male

I am not sure at all about my SQL queries, but they should give you a rough idea of the different ways to express the initial query.

So many different ways to express the same query… I would personally go for the SQL way of expressing things to implement it in Drill, especially because Calcite is already able to parse UNNEST and ARRAY, but that's only my first thought.


Best regards,

Damien
