Hi, I am trying to compute aggregates on large datasets (~100 GB) stored in Parquet format. My current approach is to use scan/fragment to load chunks iteratively into memory, and I would like to run the equivalent of the following on each chunk using pyarrow.compute (pc) functions:
df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])

My understanding is that each pc compute function needs to scan the entire array separately. Please let me know if that is not the case, and how I could optimize this. Thanks!
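For reference, here is a minimal sketch of the per-chunk approach I described; the dataset path and the value column `x` are placeholders, and it uses `Table.group_by` (available in newer pyarrow versions) rather than individual pc calls, so it may not match my setup exactly:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder path and column names for illustration only.
dataset = ds.dataset("path/to/parquet_dir", format="parquet")

for fragment in dataset.get_fragments():
    for batch in fragment.to_batches(columns=["a", "b", "c", "x"]):
        table = pa.Table.from_batches([batch])
        # Per-chunk equivalent of df.groupby(['a', 'b', 'c']).agg(...):
        partial = table.group_by(["a", "b", "c"]).aggregate(
            [("x", "sum"), ("x", "count"), ("x", "min"), ("x", "max")]
        )
        # The `partial` results from each chunk would still need to be
        # merged across chunks to get the final aggregates.
```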
