Hello,
A few format handled by Drill enable to work with document, meaning
nested and repeated structure instead of just tables. Json and Parquet
are the two that come to my mind right now. Document modeling is a great
way to express complex object and is used a lot in my company. Drill is
able to handle them but unfortunately, it cannot make much computation
on it. By computation I mean, filtering branches of the document,
computing statistics (avg, min, max) on part of the document … That
would be very useful as an analytic tools.
_What can be done_
The question then is how to express the computation we want to do on the
document. I have found multiple ways to handle that and I don't really
know which one is the best hence the mail to expose what I have found to
initiate discussion, maybe.
First, in we look back at the Dremel paper which is the base of the
parquet format and also one of the example for drill, dremel is adding
the special keyword "WITHIN" to SQL to specify that the computation has
to be done within a document. What is very powerful with this keyword is
that it allows you to generate document and doesn't force you to flatten
everything. You can find exemple of it usage in the google successor of
Dremel: BigQuery and its documentation :
https://cloud.google.com/bigquery/docs/legacy-nested-repeated.
But it seems that it was problematic for Google, because they now
propose a SQL that seems to be compliant with SQL 2011 for Bigquery to
handle such computation. I am not familiar with SQL 2011 but it is told
in BigQuery documentation to integrated the keywords for nested and
repeated structure. You can have a view about how this is done in
BigQuery here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
Basically, what I have seen is that they leverage UNNEST and ARRAY
keyword and then are able to use JOIN or CROSS JOIN to describe the
aggregation.
In Impala, they have added a way to add a subquery on a complex type in
such a way that the subquery only act intra-document. I have no idea if
this is standard SQL or not. In page
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types
look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
example.
In Presto, you can apply lambda function to map/array to transform the
structure and apply filter on it. So you have filter, map_filter
function to filter array and map respectively. (cf
https://prestodb.io/docs/current/functions/lambda.html#filter)
_Example_
If I want to make a short example, let’s say we have a flight with a
group of passengers in it. A document would be :
{ “flightnb”:1234,
“group”:[{“age”:30,”gender”:”M”},{“age”:15,”gender”:”F”},
{“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}
The database would be millions of such document and I want to know the
average age of the male passenger for every flight.
In Dremel, the query would be something like: select flightnb,
avg(male_age) within record from (select groups.age as male_age from
flight where group.gender = "M")
With sql, it would be something like: select flightnb, avg(male_age)
from (array(select g.age as male_age from unnest(group)as g where
g.gender = "M") as male_age)
With impala it would be something like: select flightnb, avg(male) from
flight, select g.age from groups as g where g.gender = “M” as male
With presto, it would be something like: select flightnb, avg(male)
from flight, filter(group,x->x.gender = "M")as male
I am not sure at all about my SQL queries but it should give you a rough
idea about the different ways to express the inital query.
So many different ways to express the same query… I would personally go
for the SQL way of expressing things to implement it in Drill,
especially because calcite is already able to parse unnest, array, but
that’s only my first thought.
Best regards,
Damien
- Food for thought about intra-document operation Damien Profeta
-