Hi all,

I just wanted to point out a little gotcha we got while using the
collect_set UDF :

You should not perform directly a GROUP BY over a collect_set(...), because
the
set is cast as an array and is not necessarilly sorted.

For example, we ran a query looking like this;

SELECT
set,
COUNT(1) as nb
GROUP BY set
FROM
(
SELECT
colA
collect_set(colB) as set
FROM db.table
GROUP BY colA
) T
;

and got :

[A] 10205
[B] 93856
[A,B] 34865
[B,A] 48324

We had to replace it with a sort_array(collect_set(...)).

I just wanted to point out that perhaps this subtelty should be mentionned
in the doc of the collect_set UDF...

As a corollary, do you guys think a collect_sorted_set using a TreeSet
would be useful (and/or
more efficient than using sort_array(collect_set)) ?

Regards,

Furcy

Reply via email to