Hi, I am trying to compute aggregates on large datasets (~100 GB) stored in Parquet format. My current approach is to use scan/fragment to load chunks iteratively into memory, and I would like to run the equivalent of the following on each chunk using pyarrow.compute (pc) functions:
df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])

My understanding is that each pc compute function needs to scan the entire array separately. Please let me know if that is not the case, and how I could optimize this. Thanks!
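For reference, here is a minimal sketch of the per-chunk approach I described; the dataset path and the value column `x` are placeholders, and it uses `Table.group_by` (available in newer pyarrow versions) rather than individual pc calls, so it may not match my setup exactly:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder path and column names for illustration only.
dataset = ds.dataset("path/to/parquet_dir", format="parquet")

for fragment in dataset.get_fragments():
    for batch in fragment.to_batches(columns=["a", "b", "c", "x"]):
        table = pa.Table.from_batches([batch])
        # Per-chunk equivalent of df.groupby(['a', 'b', 'c']).agg(...):
        partial = table.group_by(["a", "b", "c"]).aggregate(
            [("x", "sum"), ("x", "count"), ("x", "min"), ("x", "max")]
        )
        # The `partial` results from each chunk would still need to be
        # merged across chunks to get the final aggregates.
```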
