> Is there a way to get value_counts of a given column after doing table
> group_by?

Is your goal to group by some key and then get the value counts of an entirely different non-key column? If so, then no, not today, at least not directly. The only group-by node we have is a hash-group-by, and it can only accept "hash aggregate functions". These are defined in [1]. value_counts does not have a "hash aggregate" variant, though it does seem like one would make sense.

Indirectly, you can use the "list" aggregate function as a sort of escape hatch:

```
import pyarrow as pa
import pyarrow.compute as pc  # needed for pc.value_counts in the follow-up step

tab = pa.Table.from_pydict({
    'state': ['Washington', 'Washington', 'Colorado', 'Colorado', 'Colorado'],
    'city': ['Seattle', 'Seattle', 'Denver', 'Colorado Springs', 'Denver'],
    'temp': [70, 75, 83, 89, 94]
})

# Collect every city of a state into one list cell per group
grouped = pa.TableGroupBy(tab, 'state').aggregate([('city', 'list')])
print(grouped)
# pyarrow.Table
# city_list: list<item: string>
#   child 0, item: string
# state: string
# ----
# city_list: [[["Seattle","Seattle"],["Denver","Colorado Springs","Denver"]]]
# state: [["Washington","Colorado"]]
```

You could then use a for-loop to walk through each cell of city_list and run value_counts on that array.

> If its not possible, can you please point me the relevant cpp/python files I
> need to modify for this to work?

You would need to create a "hash aggregate function" for value_counts (it would presumably be called hash_value_counts to match the existing pattern). The starting point for understanding such functions would probably be [2]. Each hash-aggregate kernel consists of 5 different functions (init, resize, consume, merge, and finalize) that you will need to provide. You can use any of the other hash_* functions as examples of how you might implement these.

Basically, these functions take in a column of values and a column of group ids, and they update some kind of running state (one per thread). At the end of the stream, the various thread states are merged together, and the finalize function turns this final state into an output array.
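To make the five-function lifecycle a bit more concrete, here is a toy model in plain Python. It is only a sketch of the idea, not Arrow's C++ kernel interface: the class name ValueCountsState and its method signatures are hypothetical, and it assumes every thread uses the same group id numbering (a real merge would also have to map one thread's group ids onto the other's).

```python
from collections import Counter


class ValueCountsState:
    """Hypothetical per-thread state: one Counter per group id."""

    def __init__(self):
        # init: start with no groups
        self.groups = []

    def resize(self, num_groups):
        # resize: grow the state so that ids 0..num_groups-1 are valid
        while len(self.groups) < num_groups:
            self.groups.append(Counter())

    def consume(self, values, group_ids):
        # consume: fold one batch of (value, group id) pairs into the state
        for value, group_id in zip(values, group_ids):
            self.groups[group_id][value] += 1

    def merge(self, other):
        # merge: combine another thread's state into this one
        # (assumes both states share the same group id numbering)
        self.resize(len(other.groups))
        for mine, theirs in zip(self.groups, other.groups):
            mine.update(theirs)

    def finalize(self):
        # finalize: turn the state into one output cell per group
        return [sorted(counter.items()) for counter in self.groups]


# Two "threads" each see part of the data; group 0 = Washington, 1 = Colorado
thread_a = ValueCountsState()
thread_a.resize(2)
thread_a.consume(['Seattle', 'Denver'], [0, 1])

thread_b = ValueCountsState()
thread_b.resize(2)
thread_b.consume(['Seattle', 'Colorado Springs', 'Denver'], [0, 1, 1])

thread_a.merge(thread_b)
result = thread_a.finalize()
print(result)
# [[('Seattle', 2)], [('Colorado Springs', 1), ('Denver', 2)]]
```

A real hash_value_counts kernel would keep this state in Arrow builders or hash tables rather than Python objects, and would emit an Arrow array from finalize, but the division of work between the five functions should look roughly like this.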

Work is underway on a guide to help with authoring new kernel functions. The current PR for this guide can be found at [3].

[1] https://arrow.apache.org/docs/cpp/compute.html#grouped-aggregations-group-by
[2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernel.h#L678
[3] https://github.com/apache/arrow/pull/13933

On Thu, Aug 25, 2022 at 10:26 AM Suresh V <suresh0...@gmail.com> wrote:
>
> Hi,
>
> Is there a way to get value_counts of a given column after doing table
> group_by?
>
> If its not possible, can you please point me the relevant cpp/python files I
> need to modify for this to work?
>
> Thanks
>