Re: Re: UNNEST() operation for arrays

Tim Armstrong Mon, 13 Sep 2021 22:36:36 -0700

I think compatibility with Postgres derivatives generally makes sense as a
goal, even if the behaviour isn't ideal - definitely much easier than
rewriting everything to a different syntax.


With your rewrite proposal, would you also be able to express these queries
by unnesting in the FROM clause?

SELECT unnest(t.a) c1, unnest(t.b) c1
FROM tbl t
=>
SELECT _tmp1 c1, _tmp2 c2
FROM tbl t, UNNEST(t.a, t.b) (_tmp1, _tmp2)

I'm not sure a feature flag would work well for this, since you
might potentially have a more complex FROM clause with other unnesting and
I'm not sure how you would model that. I suspect it works out cleaner to
model the zipped arrays as a TableRef - for most purposes in analysis and
planning it's going to be treated the same as a TableRef that is just a
plain array of structs - the analysis/planner code for TableRefs, joins,
etc isn't the easiest to extend.

I guess you could also rewrite it to something like this too, which would
really let you isolate the special handling to a simple SELECT block:
SELECT v._tmp1 c1, v._tmp2 c2
FROM tbl t, (SELECT t.a __tmp1, t.b __tmp2 FROM t.a ZIP JOIN t.b) v

Another question: what happens if the two arrays are from different table
references? Is that allowed? Do they still get zipped together?

On Mon, 13 Sept 2021 at 04:43, Gabor Kaszab <[email protected]> wrote:

> Hey All,
> Thanks for taking a look and sharing your view. The whole idea of having
> multiple UNNEST() functions in the select list (and zipping the results)
> came from a user as a direct feature request as they try to move to Impala
> from Greenplum and as they say they can't rewrite their existing queries.
>
> The most straightforward way would be to implement UNNEST() as a UDTF but
> Impala doesn't support them, and anyway, as mentioned above Impala doesn't
> support even UDFs with nested type parameters.
>
> My idea was something like introducing an unnest SlotRef or such, change
> the SQL parser to translate each UNNEST(array) into this SlotRef and then
> do a query rewrite based on that so that we can process the unnest. My
> preference would be to rewrite the query to the syntax we have now with
> joining the arrays in the FROM clause as collection table refs and then
> have a feature flag (e.g. unnest_join_type) to indicate whether we want the
> joined or the zipped behaviour. For the zipped one we can modify the
> existing UNNEST node to be able to receive more arrays and then do the zip
> itself.
>
> What do you think?
>
> Gabor
>
> On Mon, Sep 13, 2021 at 9:35 AM 彭斗 <[email protected]> wrote:
>
> > Hi  All,
> >
> >
> > Currently IMPALA don't support complex data type as input/output
> > parameters for Bulltin Functions / User Defined Functions,
> > If IMPALA can support this, the Unnest()  or other  operation on  complex
> > data type may easy to achieve.
> >
> >
> >
> >
> > --
> >
> > Best Regards
> > Dou
> >
> >
> >
> >
> >
> > At 2021-09-13 15:08:52, "Quanlong Huang" <[email protected]>
> wrote:
> > >This seems like the explode() UDTF in Hive:
> > >
> >
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
> > >
> > >Hive doesn't support multiple UDTF in the SelectList. I think we can
> make
> > >the same choice that only supports one UNNEST() in the SelectList. If
> > >multiple columns need to be unfolded, they should be wrapped in the same
> > >UNNEST() function with a keyword/func specifying how to join the rows.
> > E.g.
> > >
> > >-- cross join by default
> > >select UNNEST(array_a, array_b) from tbl;
> > >-- specifying the join type
> > >select UNNEST(zip array_a, array_b) from tbl;
> > >select UNNEST(arrays_zip(array_a, array_b)) from tbl;
> > >
> > >explode() in Hive only supports one column. The above syntax is inspired
> > by
> > >SparkSQL:
> > >
> >
> https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
> > >https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip
> > >
> > >On a separate note, what about using the same name(explode) as Hive? I
> > >think it'd be helpful for migrations into Impala, especially if we also
> > >support the LATERAL VIEW syntax.
> > >
> > >Thanks,
> > >Quanlong
> > >
> > >On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <
> [email protected]>
> > >wrote:
> > >
> > >> The syntax/behaviour seems kinda unusual to me - do any other systems
> > aside
> > >> from postgres implement this? UNNEST() in the from clause seems more
> > >> common, and it is generally not consistent with the rest of SQL for
> > SELECT
> > >> clause entries to affect the set of rows returned.
> > >>
> > >> I'm not sure but it seems like it might be tricky to make the select
> > clause
> > >> in the Impala frontend do this.
> > >>
> > >> Some way of doing this in FROM clause seems more intuitive and
> > consistent
> > >> with the rest of sql, e.g.
> > >>
> > >> select id, e1, e2
> > >> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
> > >>
> > >> I'm tempted to say that it should be called ZIP() since that's the
> name
> > of
> > >> this operation in functional programming, but I don't think inventing
> > new
> > >> SQL keywords is really ideal.
> > >>
> > >> I'm also a little curious about the use case for this. I think ideally
> > you
> > >> would represent two arrays with correlated indices as an array of
> > structs
> > >> instead of two arrays, but I can appreciate that schema designs may
> not
> > >> always be fully normalised.
> > >>
> > >> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <[email protected]>
> wrote:
> > >>
> > >> > Hi Gabor,
> > >> > An implicit assumption in the proposed syntax is that there is some
> > >> > association between an item at a particular position in the first
> > array
> > >> and
> > >> > the same position in the second array.
> > >> > However,  in general the arrays are not ordered in a semantic sense.
> > e.g
> > >> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.
>  The
> > >> > proposed UNNEST behavior will not produce any rows where {Honda,
> > Accord}
> > >> > appear together which would be unexpected.  Although one could make
> > the
> > >> > argument that in order to get this behavior you could fall back to
> the
> > >> > current syntax of putting the arrays in the FROM clause, it could be
> > >> > confusing to explain when one should be used vs other.  It would be
> > >> useful
> > >> > to see what other systems (apart from Postgres) which have more
> native
> > >> > complex types support do in this case (e.g Presto, Couchbase,
> > Bigquery).
> > >> >
> > >> > One alternative I can think of is to make the association between
> the
> > >> items
> > >> > in multiple arrays be more explicit e.g by using an outer join on
> the
> > >> > ordinal position. That way, items at the same position appear
> together
> > >> in a
> > >> > row and for non-matches we produce Null for the smaller array. I
> > haven't
> > >> > thought through the exact syntax for this yet.
> > >> >
> > >> > -Aman
> > >> >
> > >> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <
> > [email protected]>
> > >> > wrote:
> > >> >
> > >> > > Hey,
> > >> > >
> > >> > > I'm working on implementing a new UNNEST() operation that does
> what
> > it
> > >> > > says, unnests arrays. But here the trick would come when there are
> > >> > multiple
> > >> > > UNNESTS() in the select list. Let me explain the desired behaviour
> > with
> > >> > an
> > >> > > example:
> > >> > >
> > >> > > Let's assume the following table:
> > >> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
> > >> > > An the following query:
> > >> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
> > >> > >
> > >> > > Here the expected result is to not produce a join on the two
> arrays
> > as
> > >> we
> > >> > > would with the original way of giving them in the select lit, but
> to
> > >> put
> > >> > > the values of the arrays "next to each other" like this:
> > >> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
> > >> > > I have investigated some options how this could be implemented and
> > >> > created
> > >> > > a doc for my notes. Let me share this with the community so that
> > people
> > >> > can
> > >> > > share their preference. Any feedback is welcome!
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> > >> > >
> > >> > > Cheers,
> > >> > > Gabor
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >>
> >
>

Re: Re: UNNEST() operation for arrays

Reply via email to