Hey folks,

I'm working with the PyArrow API for Tables and RecordBatches, and I'm trying to chunk a Table into a list of RecordBatches, each no larger than a target byte size. For example, splitting a 10 GB Table into several 512 MB chunks.
I'm having a hard time doing this with the existing API. Table.to_batches has an optional parameter `max_chunksize`, documented as "Maximum size for RecordBatch chunks. Individual chunks may be smaller depending on the chunk layout of individual columns." That seems like exactly what I want, but I've run into a couple of edge cases.

Edge case 1: a Table created from many small RecordBatches

```
import pyarrow as pa

pylist = [{'n_legs': 2, 'animals': 'Flamingo'},
          {'n_legs': 4, 'animals': 'Dog'}]
pylist_tbl = pa.Table.from_pylist(pylist)
pylist_tbl.nbytes
# > 35

multiplier = 2048
bigger_pylist_tbl = pa.Table.from_batches(pylist_tbl.to_batches() * multiplier)
bigger_pylist_tbl.nbytes
# > 591872 (578.00 KB)

target_batch_size = 512 * 1024 * 1024  # 512 MB
len(bigger_pylist_tbl.to_batches(max_chunksize=target_batch_size))
# > 2048
# expected: 1 RecordBatch, since the whole table is only 578 KB
```

Edge case 2: a really big Table with a single RecordBatch

```
import pyarrow as pa

# file already saved on disk
with pa.memory_map('table_10000000.arrow', 'r') as source:
    huge_arrow_tbl = pa.ipc.open_file(source).read_all()

huge_arrow_tbl.nbytes
# > 7188263146 (6.69 GB)
len(huge_arrow_tbl)
# > 10000000

target_batch_size = 512 * 1024 * 1024  # 512 MB
len(huge_arrow_tbl.to_batches(max_chunksize=target_batch_size))
# > 1
# expected: (6.69 GB // 512 MB) + 1 = 14 RecordBatches
```

I'm currently exploring the underlying implementations of to_batches <https://github.com/apache/arrow/blob/b8fff043c6cb351b1fad87fa0eeaf8dbc550e37c/python/pyarrow/table.pxi#L4182C26-L4250> and TableBatchReader::ReadNext <https://github.com/apache/arrow/blob/b8fff043c6cb351b1fad87fa0eeaf8dbc550e37c/cpp/src/arrow/table.cc#L641-L691>. If I'm reading ReadNext correctly, `max_chunksize` is enforced as a maximum row count rather than a byte size, and the reader never merges rows across existing chunk boundaries, which would explain both results above: 2048 two-row batches in edge case 1, and one 10-million-row batch in edge case 2.

Please let me know if anyone knows a canonical way to get the byte-based chunking behavior described above.

Thanks,
Kevin
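P.S. For context, here's the rough workaround I've been prototyping in the meantime. It's only a sketch under two assumptions I'd love to have confirmed: that `max_chunksize` really is a row count, and that row sizes are roughly uniform across the table. The helper name split_table_by_bytes is mine, not part of the PyArrow API.

```
import pyarrow as pa

def split_table_by_bytes(table: pa.Table, max_bytes: int) -> list[pa.RecordBatch]:
    # Sketch only: translate a byte budget into a row budget using the
    # table's average in-memory row size. Assumes roughly uniform rows.
    if table.num_rows == 0:
        return table.to_batches()
    avg_row_bytes = max(1, table.nbytes // table.num_rows)
    rows_per_batch = max(1, max_bytes // avg_row_bytes)
    # combine_chunks() rewrites the table as a single chunk (at the cost
    # of a copy), so to_batches isn't limited by the tiny pre-existing
    # chunk boundaries from edge case 1.
    return table.combine_chunks().to_batches(max_chunksize=rows_per_batch)
```

On the 6.69 GB table above this gives 14 batches of roughly 512 MB each, but the extra copy from combine_chunks and the uniform-row-size assumption are why I'm still hoping there's a canonical API for this.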