nosterlu opened a new issue, #13747:
URL: https://github.com/apache/arrow/issues/13747
I have a large dataset that I would like to use `group_by` on without having
to read the entire table into memory first.
After reading the documentation, it seems `dataset.to_batches` is the best
way of doing this? But it gets really complex for aggregation methods other
than, say, `count` and `sum`.
I implemented it as below for `count` and `sum`, but for other, more complex
aggregations I am still forced to read the entire table into memory.
```python
import pyarrow

# ds, columns, filters, group_by and agg are defined elsewhere
tables = []
for batch in ds.to_batches(columns=columns, filter=filters, batch_size=1_000_000):
    t = pyarrow.Table.from_batches([batch])
    # aggregate each batch separately, then combine the partial results
    tables.append(t.group_by(group_by).aggregate(agg))
table = pyarrow.concat_tables(tables)
# then after this I use group_by again on the concatenated table
# with `sum` as aggregation method
```
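To make the problem concrete, here is a rough sketch of how a `mean` could be
decomposed into per-batch partial aggregates with this approach (`key` and
`value` are placeholder column names, and I assume `value` is a floating-point
column). Aggregations like `median` or `quantile` do not decompose into
batch-wise partials like this, which is why I end up reading the whole table
for them:

```python
import pyarrow
import pyarrow.compute as pc

# Sketch only: ds is the opened dataset; "key"/"value" are placeholder columns.
partials = []
for batch in ds.to_batches(columns=["key", "value"], batch_size=1_000_000):
    t = pyarrow.Table.from_batches([batch])
    # Per-batch partials: sum and (non-null) count are enough to
    # reconstruct the global mean later.
    partials.append(
        t.group_by("key").aggregate([("value", "sum"), ("value", "count")])
    )

# Second pass: summing the per-batch sums and counts gives the
# global sum and count for each group.
combined = (
    pyarrow.concat_tables(partials)
    .group_by("key")
    .aggregate([("value_sum", "sum"), ("value_count", "sum")])
)

# mean = global sum / global count
mean = pc.divide(combined["value_sum_sum"], combined["value_count_sum"])
result = combined.append_column("value_mean", mean)
```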
Thankful for any pointers or comments!