Re: value_counts after group_by

Weston Pace Thu, 25 Aug 2022 17:45:13 -0700

> Is there a way to get value_counts of a given column after doing table 
> group_by?

Is your goal to group by some key and then get the value counts of an
entirely different, non-key column?  If so, then no, not today, at
least not directly.  The only group-by node we have is a hash-group-by,
and it only accepts "hash aggregate functions".  These are listed in
[1].  value_counts does not have a "hash aggregate" variant, although
adding one would make sense.

Indirectly, you can use the "list" aggregate function as a sort of escape hatch:

```
import pyarrow as pa
import pyarrow.compute as pc

tab = pa.Table.from_pydict({
    'state': ['Washington', 'Washington', 'Colorado', 'Colorado', 'Colorado'],
    'city': ['Seattle', 'Seattle', 'Denver', 'Colorado Springs', 'Denver'],
    'temp': [70, 75, 83, 89, 94]
})

grouped = tab.group_by('state').aggregate([('city', 'list')])
print(grouped)

# pyarrow.Table
# city_list: list<item: string>
#   child 0, item: string
# state: string
# ----
# city_list: [[["Seattle","Seattle"],["Denver","Colorado Springs","Denver"]]]
# state: [["Washington","Colorado"]]
```

You could then use a for-loop to walk through each cell of city_list
and run value_counts on that array.

> If its not possible, can you please point me the relevant cpp/python files I 
> need to modify for this to work?

You would need to create a "hash aggregate function" for value_counts
(it would presumably be called hash_value_counts to match the existing
pattern).  The starting point for understanding such functions would
probably be [2].  Each hash-aggregate kernel consists of 5 different
functions (init, resize, consume, merge, and finalize) that you will
need to provide.  You can use any of the other hash_* functions as
examples for how you might implement these.  Basically, these
functions take in a column of values and a column of ids and they
update some kind of running state (one per thread).  At the end of the
stream the various thread states are merged together and the finalize
function turns this final state into an output array.
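
To make the five roles a bit more concrete, here is a toy model in
plain Python.  This is not the Arrow C++ API: the class and method
names are made up for illustration, and it assumes group ids already
agree across threads, which a real kernel may not be able to assume.

```python
from collections import Counter


class ToyHashValueCounts:
    """Hypothetical per-thread state: one Counter per group id."""

    def __init__(self):
        # "init": start with no groups.
        self.counters = []

    def resize(self, num_groups):
        # "resize": make room for any newly seen group ids.
        while len(self.counters) < num_groups:
            self.counters.append(Counter())

    def consume(self, values, group_ids):
        # "consume": one batch of values plus the group id of each row.
        for value, gid in zip(values, group_ids):
            self.counters[gid][value] += 1

    def merge(self, other):
        # "merge": fold another thread's state into this one.
        self.resize(len(other.counters))
        for mine, theirs in zip(self.counters, other.counters):
            mine.update(theirs)

    def finalize(self):
        # "finalize": turn the running state into one output per group.
        return [dict(c) for c in self.counters]


# Two "threads", each seeing part of the stream (0 = Washington, 1 = Colorado).
a = ToyHashValueCounts()
a.resize(2)
a.consume(['Seattle', 'Denver'], [0, 1])

b = ToyHashValueCounts()
b.resize(2)
b.consume(['Seattle', 'Colorado Springs', 'Denver'], [0, 1, 1])

a.merge(b)
result = a.finalize()
print(result)
```

A real implementation would keep this state in Arrow builders or
buffers rather than Python dicts, but the overall flow should be
similar.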

Work is underway on a guide to help with authoring new kernel
functions.  The current PR for this guide can be found at [3].

[1] https://arrow.apache.org/docs/cpp/compute.html#grouped-aggregations-group-by
[2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L678
[3] https://github.com/apache/arrow/pull/13933

On Thu, Aug 25, 2022 at 10:26 AM Suresh V <suresh0...@gmail.com> wrote:
>
> Hi,
>
> Is there a way to get value_counts of a given column after doing table 
> group_by?
>
> If its not possible, can you please point me the relevant cpp/python files I 
> need to modify for this to work?
>
> Thanks
>
>
