Hi all, I just wanted to point out a little gotcha we got while using the collect_set UDF :
You should not perform directly a GROUP BY over a collect_set(...), because the set is cast as an array and is not necessarilly sorted. For example, we ran a query looking like this; SELECT set, COUNT(1) as nb GROUP BY set FROM ( SELECT colA collect_set(colB) as set FROM db.table GROUP BY colA ) T ; and got : [A] 10205 [B] 93856 [A,B] 34865 [B,A] 48324 We had to replace it with a sort_array(collect_set(...)). I just wanted to point out that perhaps this subtelty should be mentionned in the doc of the collect_set UDF... As a corollary, do you guys think a collect_sorted_set using a TreeSet would be useful (and/or more efficient than using sort_array(collect_set)) ? Regards, Furcy