Hi Yong, please see if this works. The trick is to rebuild the accumulator map at each step: drop the current element's existing entry, if any, and put it back with an incremented count:

-- aggregate an array into a map of element -> count
SELECT aggregate(
  array('a', 'b', 'a'),
  cast(map() as map<string, int>),
  (acc, x) -> map_concat(
    map_filter(acc, (k, v) -> k != x),
    map(x, coalesce(acc[x], 0) + 1))) AS element_counts
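A few notes on why it is written this way: under the default
spark.sql.mapKeyDedupPolicy=EXCEPTION, map_concat fails on duplicate keys, so
the current element's old entry has to be filtered out before the incremented
one is concatenated back in. acc[x] returns NULL when the key is not in the
map yet (with the non-ANSI defaults in 3.1), hence the coalesce(..., 0). The
cast on map() just gives the empty "start" map a concrete map<string, int>
type. For the array above this should come back with 'a' -> 2 and 'b' -> 1.

Since you mentioned having multiple array columns, the same expression should
drop in per column. A sketch against a hypothetical table "events" with an
array<string> column "tags" (both names made up here):

SELECT aggregate(tags,
  cast(map() as map<string, int>),
  (acc, x) -> map_concat(
    map_filter(acc, (k, v) -> k != x),
    map(x, coalesce(acc[x], 0) + 1))) AS tag_counts
FROM events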

Thanks,
Vijay

On 2023/05/05 19:32:04 Yong Zhang wrote:
> Hi, this is in a Spark 3.1 environment.
>
> For some reason, I can ONLY do this in Spark SQL, not in a Scala or PySpark 
> environment.
>
> I want to aggregate an array into a Map of the counts of the elements within 
> that array, but in Spark SQL.
> I know that there is an aggregate function available like
>
> aggregate(expr, start, merge [, finish])
>
> But I want to know if this can be done in Spark SQL only, and:
>
>   *   How to represent an empty Map as the "start" element above
>   *   How to merge each element (a String type) into the Map (incrementing its 
> count if it already exists in the Map, or adding a new (element -> 1) entry 
> if it does not)
>
> Like the following example -> 
> https://docs.databricks.com/sql/language-manual/functions/aggregate.html
>
> SELECT aggregate(array(1, 2, 3, 4),
>                  named_struct('sum', 0, 'cnt', 0),
>                  (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt + 1),
>                  acc -> acc.sum / acc.cnt) AS avg
>
> I wonder:
> select
>       aggregate(
>       array('a','b','a'),
>       map('', 0),
>       (acc, x) -> ???,
>       acc -> acc) as output
>
> How do I write the logic after "(acc, x) -> ", so that I can output a map of 
> the count of each element in the array?
> I know I can "explode" and then group by + count, but since I have multiple 
> array columns to transform, I want to do it in a higher-order-function way, 
> in pure Spark SQL.
>
> Thanks
>
>
