Hello,

A few format handled by Drill enable to work with document, meaning nested and repeated structure instead of just tables. Json and Parquet are the two that come to my mind right now. Document modeling is a great way to express complex object and is used a lot in my company. Drill is able to handle them but unfortunately, it cannot make much computation on it. By computation I mean, filtering branches of the document, computing statistics (avg, min, max) on part of the document … That would be very useful as an analytic tools.

_What can be done_

The question then is how to express the computation we want to do on the document. I have found multiple ways to handle that and I don't really know which one is the best hence the mail to expose what I have found to initiate discussion, maybe.

First, in we look back at the Dremel paper which is the base of the parquet format and also one of the example for drill, dremel is adding the special keyword "WITHIN" to SQL to specify that the computation has to be done within a document. What is very powerful with this keyword is that it allows you to generate document and doesn't force you to flatten everything. You can find exemple of it usage in the google successor of Dremel: BigQuery and its documentation : https://cloud.google.com/bigquery/docs/legacy-nested-repeated.

But it seems that it was problematic for Google, because they now propose a SQL that seems to be compliant with SQL 2011 for Bigquery to handle such computation. I am not familiar with SQL 2011 but it is told in BigQuery documentation to integrated the keywords for nested and repeated structure. You can have a view about how this is done in BigQuery here: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays . Basically, what I have seen is that they leverage UNNEST and ARRAY keyword and then are able to use JOIN or CROSS JOIN to describe the aggregation.

In Impala, they have added a way to add a subquery on a complex type in such a way that the subquery only act intra-document. I have no idea if this is standard SQL or not. In page https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types look at the phrase: “The subquery labelled SUBQ1 is correlated:” for example.

In Presto, you can apply lambda function to map/array to transform the structure and apply filter on it. So you have filter, map_filter function to filter array and map respectively. (cf https://prestodb.io/docs/current/functions/lambda.html#filter)

_Example_

If I want to make a short example, let’s say we have a flight with a group of passengers in it. A document would be :

{ “flightnb”:1234, “group”:[{“age”:30,”gender”:”M”},{“age”:15,”gender”:”F”}, {“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}

The database would be millions of such document and I want to know the average age of the male passenger for every flight.

In Dremel, the query would be something like: select flightnb, avg(male_age) within record from (select groups.age as male_age from flight where group.gender = "M")

With sql, it would be something like: select flightnb, avg(male_age) from (array(select g.age as male_age from unnest(group)as g where g.gender = "M") as male_age)

With impala it would be something like: select flightnb, avg(male) from flight, select g.age from groups as g where g.gender = “M” as male

With presto, it would be something like:  select flightnb, avg(male) from flight, filter(group,x->x.gender = "M")as male

I am not sure at all about my SQL queries but it should give you a rough idea about the different ways to express the inital query.

So many different ways to express the same query… I would personally go for the SQL way of expressing things to implement it in Drill, especially because calcite is already able to parse unnest, array, but that’s only my first thought.

Best regards,

Damien

Reply via email to