Thank you very much for the response @wesm. Looking forward to the changes,
and hopefully to gaining enough knowledge to start contributing to the
project. I am planning to go the Cython route with a custom aggregator for
now. To be honest, I am not sure how much we would gain from a single-pass
approach versus what we might lose by giving up CPU-friendly vectorization.
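
For context, here is a rough sketch of the chunk-wise combine logic I have
in mind, written in plain pandas rather than the Cython aggregator (the
"data/" path and the column names 'a', 'b', 'c', 'x' are just placeholders).
The idea is that sum/count/min/max are all decomposable, so per-chunk
partials can be merged exactly afterwards: sum of sums, sum of counts, min
of mins, max of maxes.

import pandas as pd
import pyarrow.dataset as ds

# placeholder path and column names
dataset = ds.dataset("data/", format="parquet")
keys = ["a", "b", "c"]

partials = []
for batch in dataset.to_batches():
    chunk = batch.to_pandas()
    # partial aggregates for this chunk only
    part = chunk.groupby(keys)["x"].agg(["sum", "count", "min", "max"])
    partials.append(part)

combined = pd.concat(partials)
# merge the per-chunk partials; each statistic has its own combine rule
result = combined.groupby(level=keys).agg(
    {"sum": "sum", "count": "sum", "min": "min", "max": "max"}
)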

On Wed, Apr 7, 2021 at 1:53 PM Wes McKinney <[email protected]> wrote:

> We are working on implementing a streaming aggregation to be available in
> Python but it probably won’t be available until the 5.0 release. I am not
> sure solving this problem efficiently is possible at 100GB scale with the
> tools currently available in pyarrow.
>
> On Wed, Apr 7, 2021 at 12:41 PM Suresh V <[email protected]> wrote:
>
>> Hi.. I am trying to compute aggregates on large datasets (100GB) stored
>> in parquet format. The current approach is to use scan/fragment to load chunks
>> iteratively into memory, and I would like to run the equivalent of the following
>> on each chunk using pc.compute functions
>>
>> df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])
>>
>> My understanding is that pc.compute needs to scan the entire array for
>> each of the functions. Please let me know if that is not the case and how
>> to optimize it.
>>
>> Thanks
>>
>
