I have attached a work-in-progress design doc to https://issues.apache. org/jira/browse/DRILL-5999. Hope to finalize the design over the next week or so.
-Aman On Wed, Nov 29, 2017 at 9:36 AM, Aman Sinha <amansi...@apache.org> wrote: > Damien, > for the intra-document operations, it would be useful to add support for > LATERAL joins (SQL standard), which in conjunction with UNNEST (or FLATTEN) > should address the use case you have. I have filed a JIRA for this: > https://issues.apache.org/jira/browse/DRILL-5999. > > -Aman > > On Tue, Sep 26, 2017 at 12:04 AM, Damien Profeta < > damien.prof...@amadeus.com> wrote: > >> Hello Aman, >> >> AsterixDb seems to follow the standard SQL with a few minor modifications >> and add functions to ease aggregations (array_count, array_avg…) >> >> That would tend to confirm at least that the support of unnest is a good >> idea to improve Drill. >> >> Best regards >> >> Damien >> >> ** >> >> On 09/25/2017 07:53 PM, Aman Sinha wrote: >> >>> Damien, >>> thanks for initiating the discussion..indeed this would be a very useful >>> enhancement. Currently, Drill provides repeated_contains() for >>> filtering >>> and repeated_count() for count aggregates on arrays but not the general >>> purpose intra-document operations that you need based on your example. >>> I haven't gone through all the alternatives but in addition to what you >>> have described, you might also want to look at SQL++ ( >>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html) which has >>> been >>> adopted by AsterixDB and has syntax extensions to SQL for unstructured >>> data. >>> >>> -Aman >>> >>> On Mon, Sep 25, 2017 at 6:10 AM, Damien Profeta < >>> damien.prof...@amadeus.com> >>> wrote: >>> >>> Hello, >>>> >>>> A few format handled by Drill enable to work with document, meaning >>>> nested >>>> and repeated structure instead of just tables. Json and Parquet are the >>>> two >>>> that come to my mind right now. Document modeling is a great way to >>>> express >>>> complex object and is used a lot in my company. Drill is able to handle >>>> them but unfortunately, it cannot make much computation on it. By >>>> computation I mean, filtering branches of the document, computing >>>> statistics (avg, min, max) on part of the document … That would be very >>>> useful as an analytic tools. >>>> >>>> _What can be done_ >>>> >>>> The question then is how to express the computation we want to do on the >>>> document. I have found multiple ways to handle that and I don't really >>>> know >>>> which one is the best hence the mail to expose what I have found to >>>> initiate discussion, maybe. >>>> >>>> First, in we look back at the Dremel paper which is the base of the >>>> parquet format and also one of the example for drill, dremel is adding >>>> the >>>> special keyword "WITHIN" to SQL to specify that the computation has to >>>> be >>>> done within a document. What is very powerful with this keyword is that >>>> it >>>> allows you to generate document and doesn't force you to flatten >>>> everything. You can find exemple of it usage in the google successor of >>>> Dremel: BigQuery and its documentation : https://cloud.google.com/bigqu >>>> ery/docs/legacy-nested-repeated. >>>> >>>> But it seems that it was problematic for Google, because they now >>>> propose >>>> a SQL that seems to be compliant with SQL 2011 for Bigquery to handle >>>> such >>>> computation. I am not familiar with SQL 2011 but it is told in BigQuery >>>> documentation to integrated the keywords for nested and repeated >>>> structure. >>>> You can have a view about how this is done in BigQuery here: >>>> https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays . >>>> Basically, what I have seen is that they leverage UNNEST and ARRAY >>>> keyword >>>> and then are able to use JOIN or CROSS JOIN to describe the aggregation. >>>> >>>> In Impala, they have added a way to add a subquery on a complex type in >>>> such a way that the subquery only act intra-document. I have no idea if >>>> this is standard SQL or not. In page https://www.cloudera.com/docum >>>> entation/enterprise/5-5-x/topics/impala_complex_types.html#c >>>> omplex_types >>>> look at the phrase: “The subquery labelled SUBQ1 is correlated:” for >>>> example. >>>> >>>> In Presto, you can apply lambda function to map/array to transform the >>>> structure and apply filter on it. So you have filter, map_filter >>>> function >>>> to filter array and map respectively. (cf >>>> https://prestodb.io/docs/curre >>>> nt/functions/lambda.html#filter) >>>> >>>> _Example_ >>>> >>>> If I want to make a short example, let’s say we have a flight with a >>>> group >>>> of passengers in it. A document would be : >>>> >>>> { “flightnb”:1234, “group”:[{“age”:30,”gender”:”M >>>> ”},{“age”:15,”gender”:”F”}, >>>> {“age”:10,”gender”:”F”},{“age”:30,”gender”:”F”}]} >>>> >>>> The database would be millions of such document and I want to know the >>>> average age of the male passenger for every flight. >>>> >>>> In Dremel, the query would be something like: select flightnb, >>>> avg(male_age) within record from (select groups.age as male_age from >>>> flight >>>> where group.gender = "M") >>>> >>>> With sql, it would be something like: select flightnb, avg(male_age) >>>> from >>>> (array(select g.age as male_age from unnest(group)as g where g.gender = >>>> "M") as male_age) >>>> >>>> With impala it would be something like: select flightnb, avg(male) from >>>> flight, select g.age from groups as g where g.gender = “M” as male >>>> >>>> With presto, it would be something like: select flightnb, avg(male) >>>> from >>>> flight, filter(group,x->x.gender = "M")as male >>>> >>>> I am not sure at all about my SQL queries but it should give you a rough >>>> idea about the different ways to express the inital query. >>>> >>>> So many different ways to express the same query… I would personally go >>>> for the SQL way of expressing things to implement it in Drill, >>>> especially >>>> because calcite is already able to parse unnest, array, but that’s only >>>> my >>>> first thought. >>>> >>>> Best regards, >>>> >>>> Damien >>>> >>>> >>>> >> >