Hey folks,

I'm working with the PyArrow API for Tables and RecordBatches, and I'm
trying to chunk a Table into a list of RecordBatches, each with a target
chunk size, for example splitting a 10 GB Table into several 512 MB chunks.

I'm having a hard time doing this with the existing API. The
Table.to_batches method has an optional parameter `max_chunksize`, which is
documented as "Maximum size for RecordBatch chunks. Individual chunks may
be smaller depending on the chunk layout of individual columns." It seems
like exactly what I want, but I've run into a couple of edge cases.
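To make the goal concrete, here is roughly the property I'm after (an
illustrative sketch only; `desired_chunking` and the assertions are just my
way of describing the hoped-for outcome, not an existing API):

```
import pyarrow as pa

target_batch_size = 512 * 1024 * 1024  # 512 MB, in bytes

def desired_chunking(table: pa.Table) -> list[pa.RecordBatch]:
    # Hoped-for behavior: the batches cover every row of the table and
    # each batch is at most roughly target_batch_size bytes.
    batches = table.to_batches(target_batch_size)
    assert sum(len(b) for b in batches) == len(table)
    assert all(b.nbytes <= target_batch_size for b in batches)
    return batches
```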

Edge case 1: a Table created from many RecordBatches
```
import pyarrow as pa

pylist = [{'n_legs': 2, 'animals': 'Flamingo'},
          {'n_legs': 4, 'animals': 'Dog'}]
pylist_tbl = pa.Table.from_pylist(pylist)
pylist_tbl.nbytes
# > 35

multiplier = 2048
bigger_pylist_tbl = pa.Table.from_batches(pylist_tbl.to_batches() * multiplier)
bigger_pylist_tbl.nbytes
# > 591872  (~578 KB)

target_batch_size = 512 * 1024 * 1024  # 512 MB
len(bigger_pylist_tbl.to_batches(target_batch_size))
# > 2048
# expected: 1 RecordBatch
```
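
My current guess for this case (an assumption on my part, not something I've
confirmed in the docs) is that to_batches respects the Table's existing chunk
layout, so one workaround I'm experimenting with is to consolidate the chunks
first with `combine_chunks()`:

```
# Workaround sketch for edge case 1, assuming the 2048 batches come from
# to_batches() honoring the table's existing chunk boundaries.
combined_tbl = bigger_pylist_tbl.combine_chunks()  # one chunk per column
len(combined_tbl.to_batches(target_batch_size))
# I'd expect 1 RecordBatch here, since the whole table is only ~578 KB.
```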

Edge case 2: a very large Table with a single RecordBatch
```
# the file is already saved on disk
with pa.memory_map('table_10000000.arrow', 'r') as source:
    huge_arrow_tbl = pa.ipc.open_file(source).read_all()

huge_arrow_tbl.nbytes
# > 7188263146  (~6.69 GB)
len(huge_arrow_tbl)
# > 10000000

target_batch_size = 512 * 1024 * 1024  # 512 MB
len(huge_arrow_tbl.to_batches(target_batch_size))
# > 1
# expected: (6.69 GB // 512 MB) + 1 ≈ 14 RecordBatches
```
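
In the meantime, the workaround I'm sketching below assumes that
`max_chunksize` is actually interpreted as a row count rather than a byte
size (which would explain both results above); the helper
`to_batches_by_bytes` is my own name, not part of PyArrow:

```
import pyarrow as pa

def to_batches_by_bytes(table: pa.Table,
                        target_bytes: int = 512 * 1024 * 1024):
    # Estimate how many rows fit in the byte budget from the table's
    # average row size in bytes.
    avg_row_bytes = max(table.nbytes // max(len(table), 1), 1)
    rows_per_batch = max(target_bytes // avg_row_bytes, 1)
    # combine_chunks() first so pre-existing small chunks (edge case 1)
    # don't force many tiny batches.
    return table.combine_chunks().to_batches(max_chunksize=rows_per_batch)
```

This is only an approximation, since row sizes can vary a lot, which is part
of why I'm looking for a canonical approach.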

I'm currently exploring the underlying implementation of to_batches
<https://github.com/apache/arrow/blob/b8fff043c6cb351b1fad87fa0eeaf8dbc550e37c/python/pyarrow/table.pxi#L4182C26-L4250>
and TableBatchReader::ReadNext
<https://github.com/apache/arrow/blob/b8fff043c6cb351b1fad87fa0eeaf8dbc550e37c/cpp/src/arrow/table.cc#L641-L691>.
Please let me know if there is a canonical way to achieve the chunking
behavior described above.

Thanks,
Kevin
