nosterlu opened a new issue, #13747:
URL: https://github.com/apache/arrow/issues/13747
I have a large dataset that I would like to use `group_by` on without having
to read the entire table into memory first.
After reading the documentation, it seems `dataset.to_batches` is the best
way of doing this? But it gets really complex for aggregation methods other
than, say, `count` and `sum`.
I implemented it as below for `count` and `sum`, but for other, more complex
aggregations I am still forced to read the entire table into memory.
```python
import pyarrow

# ds, columns, filters, group_by and agg are defined elsewhere
tables = []
for batch in ds.to_batches(columns=columns, filter=filters, batch_size=1_000_000):
    t = pyarrow.Table.from_batches([batch])
    # aggregate each batch separately, then combine the partial results
    tables.append(t.group_by(group_by).aggregate(agg))
table = pyarrow.concat_tables(tables)
# then after this I use group_by again on the concatenated table
# with `sum` as aggregation method
```
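To make the problem concrete, here is a rough sketch of how a `mean` could be
decomposed into per-batch partial aggregates with this approach (`key` and
`value` are placeholder column names, and I assume `value` is a floating-point
column). Aggregations like `median` or `quantile` do not decompose into
batch-wise partials like this, which is why I end up reading the whole table
for them:

```python
import pyarrow
import pyarrow.compute as pc

# Sketch only: ds is the opened dataset; "key"/"value" are placeholder columns.
partials = []
for batch in ds.to_batches(columns=["key", "value"], batch_size=1_000_000):
    t = pyarrow.Table.from_batches([batch])
    # Per-batch partials: sum and (non-null) count are enough to
    # reconstruct the global mean later.
    partials.append(
        t.group_by("key").aggregate([("value", "sum"), ("value", "count")])
    )

# Second pass: summing the per-batch sums and counts gives the
# global sum and count for each group.
combined = (
    pyarrow.concat_tables(partials)
    .group_by("key")
    .aggregate([("value_sum", "sum"), ("value_count", "sum")])
)

# mean = global sum / global count
mean = pc.divide(combined["value_sum_sum"], combined["value_count_sum"])
result = combined.append_column("value_mean", mean)
```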
Thankful for any pointers or comments!