Damien, thanks for initiating the discussion..indeed this would be a very useful enhancement. Currently, Drill provides repeated_contains() for filtering and repeated_count() for count aggregates on arrays but not the general purpose intra-document operations that you need based on your example. I haven't gone through all the alternatives but in addition to what you have described, you might also want to look at SQL++ ( https://ci.apache.org/projects/asterixdb/sqlpp/manual.html) which has been adopted by AsterixDB and has syntax extensions to SQL for unstructured data.
-Aman On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta <damien.prof...@amadeus.com> wrote: > Hello, > > A few format handled by Drill enable to work with document, meaning nested > and repeated structure instead of just tables. Json and Parquet are the two > that come to my mind right now. Document modeling is a great way to express > complex object and is used a lot in my company. Drill is able to handle > them but unfortunately, it cannot make much computation on it. By > computation I mean, filtering branches of the document, computing > statistics (avg, min, max) on part of the document … That would be very > useful as an analytic tools. > > _What can be done_ > > The question then is how to express the computation we want to do on the > document. I have found multiple ways to handle that and I don't really know > which one is the best hence the mail to expose what I have found to > initiate discussion, maybe. > > First, in we look back at the Dremel paper which is the base of the > parquet format and also one of the example for drill, dremel is adding the > special keyword "WITHIN" to SQL to specify that the computation has to be > done within a document. What is very powerful with this keyword is that it > allows you to generate document and doesn't force you to flatten > everything. You can find exemple of it usage in the google successor of > Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu > ery/docs/legacy-nested-repeated. > > But it seems that it was problematic for Google, because they now propose > a SQL that seems to be compliant with SQL 2011 for Bigquery to handle such > computation. I am not familiar with SQL 2011 but it is told in BigQuery > documentation to integrated the keywords for nested and repeated structure. > You can have a view about how this is done in BigQuery here: > https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays . > Basically, what I have seen is that they leverage UNNEST and ARRAY keyword > and then are able to use JOIN or CROSS JOIN to describe the aggregation. > > In Impala, they have added a way to add a subquery on a complex type in > such a way that the subquery only act intra-document. I have no idea if > this is standard SQL or not. In page https://www.cloudera.com/docum > entation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types > look at the phrase: “The subquery labelled SUBQ1 is correlated:” for > example. > > In Presto, you can apply lambda function to map/array to transform the > structure and apply filter on it. So you have filter, map_filter function > to filter array and map respectively. (cf https://prestodb.io/docs/curre > nt/functions/lambda.html#filter) > > _Example_ > > If I want to make a short example, let’s say we have a flight with a group > of passengers in it. A document would be : > > { “flightnb”:1234, “group”:[{“age”:30,”gender”:”M”},{“age”:15,”gender”:”F”}, > {“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]} > > The database would be millions of such document and I want to know the > average age of the male passenger for every flight. > > In Dremel, the query would be something like: select flightnb, > avg(male_age) within record from (select groups.age as male_age from flight > where group.gender = "M") > > With sql, it would be something like: select flightnb, avg(male_age) from > (array(select g.age as male_age from unnest(group)as g where g.gender = > "M") as male_age) > > With impala it would be something like: select flightnb, avg(male) from > flight, select g.age from groups as g where g.gender = “M” as male > > With presto, it would be something like: select flightnb, avg(male) from > flight, filter(group,x->x.gender = "M")as male > > I am not sure at all about my SQL queries but it should give you a rough > idea about the different ways to express the inital query. > > So many different ways to express the same query… I would personally go > for the SQL way of expressing things to implement it in Drill, especially > because calcite is already able to parse unnest, array, but that’s only my > first thought. > > Best regards, > > Damien > >