Thank you very much for the response @wesm. Looking forward to the changes, and hopefully to gaining enough knowledge to start contributing to the project. I am planning to go the Cython route with a custom aggregator for now. To be honest, I am not sure how much we would gain from a single pass versus what we would potentially lose by giving up CPU-friendly vectorization.
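
For concreteness, here is a rough sketch of the single-pass custom aggregator I have in mind, in plain Python for now (the inner loop is what I would move to Cython). The dataset path and the column names ('a', 'b', 'c' as group keys, 'v' as the value column) are placeholders for illustration, not the real schema.

    import pyarrow.dataset as ds

    # Placeholder path and schema: group keys 'a', 'b', 'c' and value column 'v'.
    dataset = ds.dataset("data/", format="parquet")

    # (a, b, c) -> [sum, count, min, max], updated in a single pass over the rows.
    state = {}

    for batch in dataset.to_batches():
        chunk = batch.to_pandas()
        for a, b, c, v in zip(chunk['a'], chunk['b'], chunk['c'], chunk['v']):
            key = (a, b, c)
            s = state.get(key)
            if s is None:
                state[key] = [v, 1, v, v]
            else:
                s[0] += v
                s[1] += 1
                if v < s[2]:
                    s[2] = v
                if v > s[3]:
                    s[3] = v

    for key, (total, count, vmin, vmax) in state.items():
        print(key, total, count, vmin, vmax)
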
On Wed, Apr 7, 2021 at 1:53 PM Wes McKinney <[email protected]> wrote:

> We are working on implementing a streaming aggregation to be available in
> Python, but it probably won’t be available until the 5.0 release. I am not
> sure solving this problem efficiently is possible at 100GB scale with the
> tools currently available in pyarrow.
>
> On Wed, Apr 7, 2021 at 12:41 PM Suresh V <[email protected]> wrote:
>
>> Hi .. I am trying to compute aggregates on large datasets (100GB) stored
>> in parquet format. The current approach is to use scan/fragment to load
>> chunks iteratively into memory, and I would like to run the equivalent of
>> the following on each chunk using pc.compute functions:
>>
>> df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])
>>
>> My understanding is that pc.compute needs to scan the entire array for
>> each of the functions. Please let me know if that is not the case and how
>> to optimize it.
>>
>> Thanks
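
P.S. For comparison, a minimal sketch of the chunk-at-a-time approach described in the quoted message, using pandas for the per-chunk groupby since, as far as I can tell, pyarrow's compute functions don't expose a grouped aggregation yet. Because sum, count, min and max are all mergeable, the per-chunk partials can be combined at the end without holding the full dataset in memory. The path and column names ('a', 'b', 'c', 'v') are again placeholders.

    import pandas as pd
    import pyarrow.dataset as ds

    # Placeholder path and schema, as in the earlier sketch.
    dataset = ds.dataset("data/", format="parquet")

    partials = []
    for batch in dataset.to_batches():
        chunk = batch.to_pandas()
        # Vectorized per-chunk partial aggregates; each statistic is mergeable.
        partials.append(
            chunk.groupby(['a', 'b', 'c'])['v'].agg(['sum', 'count', 'min', 'max'])
        )

    # Merge the partials: sums and counts add up, mins/maxes reduce with min/max.
    merged = pd.concat(partials)
    result = merged.groupby(level=['a', 'b', 'c']).agg(
        {'sum': 'sum', 'count': 'sum', 'min': 'min', 'max': 'max'}
    )
    print(result.head())
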
