Damien,
thanks for initiating the discussion..indeed this would be a very useful
enhancement.  Currently, Drill provides repeated_contains()  for filtering
and repeated_count() for count aggregates on arrays but not the general
purpose intra-document operations that you need based on your example.
I haven't gone through all the alternatives but in addition to what you
have described,  you might also want to look at SQL++ (
https://ci.apache.org/projects/asterixdb/sqlpp/manual.html) which has been
adopted by AsterixDB and has syntax extensions to SQL for unstructured
data.

-Aman

On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta <damien.prof...@amadeus.com>
wrote:

> Hello,
>
> A few format handled by Drill enable to work with document, meaning nested
> and repeated structure instead of just tables. Json and Parquet are the two
> that come to my mind right now. Document modeling is a great way to express
> complex object and is used a lot in my company. Drill is able to handle
> them but unfortunately, it cannot make much computation on it. By
> computation I mean, filtering branches of the document, computing
> statistics (avg, min, max) on part of the document … That would be very
> useful as an analytic tools.
>
> _What can be done_
>
> The question then is how to express the computation we want to do on the
> document. I have found multiple ways to handle that and I don't really know
> which one is the best hence the mail to expose what I have found to
> initiate discussion, maybe.
>
> First, in we look back at the Dremel paper which is the base of the
> parquet format and also one of the example for drill, dremel is adding the
> special keyword "WITHIN" to SQL to specify that the computation has to be
> done within a document. What is very powerful with this keyword is that it
> allows you to generate document and doesn't force you to flatten
> everything. You can find exemple of it usage in the google successor of
> Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu
> ery/docs/legacy-nested-repeated.
>
> But it seems that it was problematic for Google, because they now propose
> a SQL that seems to be compliant with SQL 2011 for Bigquery to handle such
> computation. I am not familiar with SQL 2011 but it is told in BigQuery
> documentation to integrated the keywords for nested and repeated structure.
> You can have a view about how this is done in BigQuery here:
> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays .
> Basically, what I have seen is that they leverage UNNEST and ARRAY keyword
> and then are able to use JOIN or CROSS JOIN to describe the aggregation.
>
> In Impala, they have added a way to add a subquery on a complex type in
> such a way that the subquery only act intra-document. I have no idea if
> this is standard SQL or not. In page https://www.cloudera.com/docum
> entation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types
> look at the phrase: “The subquery labelled SUBQ1 is correlated:” for
> example.
>
> In Presto, you can apply lambda function to map/array to transform the
> structure and apply filter on it. So you have filter, map_filter function
> to filter array and map respectively. (cf https://prestodb.io/docs/curre
> nt/functions/lambda.html#filter)
>
> _Example_
>
> If I want to make a short example, let’s say we have a flight with a group
> of passengers in it. A document would be :
>
> { “flightnb”:1234, “group”:[{“age”:30,”gender”:”M”},{“age”:15,”gender”:”F”},
> {“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]}
>
> The database would be millions of such document and I want to know the
> average age of the male passenger for every flight.
>
> In Dremel, the query would be something like: select flightnb,
> avg(male_age) within record from (select groups.age as male_age from flight
> where group.gender = "M")
>
> With sql, it would be something like: select flightnb, avg(male_age) from
> (array(select g.age as male_age from unnest(group)as g where g.gender =
> "M") as male_age)
>
> With impala it would be something like: select flightnb, avg(male) from
> flight, select g.age from groups as g where g.gender = “M” as male
>
> With presto, it would be something like:  select flightnb, avg(male) from
> flight, filter(group,x->x.gender = "M")as male
>
> I am not sure at all about my SQL queries but it should give you a rough
> idea about the different ways to express the inital query.
>
> So many different ways to express the same query… I would personally go
> for the SQL way of expressing things to implement it in Drill, especially
> because calcite is already able to parse unnest, array, but that’s only my
> first thought.
>
> Best regards,
>
> Damien
>
>

Reply via email to