westonpace commented on issue #11469:
URL: https://github.com/apache/arrow/issues/11469#issuecomment-947152899
So one of the points of confusion with the python implementation is that it refers to `feather` and IPC files as separate things. This is unfortunately a bit of legacy: Feather v2 is the same thing as the Arrow IPC file format. The "feather" calls in python are rather limited, as you have noticed, and only give you full table reads. The IPC functionality, however, is more extensive. So to read a feather file in python in a streaming fashion you will use a `pyarrow.ipc.RecordBatchFileReader` (with the corresponding `RecordBatchFileWriter` used to write the file in multiple batches in the first place). There is some documentation on this here: https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files

So, for example:

```
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.Table.from_pydict({'a': range(100)})

# Split the table into batches of 10 rows as it is written
with ipc.RecordBatchFileWriter('test.arrow', table.schema) as writer:
    writer.write_table(table, max_chunksize=10)

# Read the file back one batch at a time
with ipc.RecordBatchFileReader('test.arrow') as reader:
    for batch_index in range(reader.num_record_batches):
        batch = reader.get_batch(batch_index)
        print(f'Read in batch {batch_index} which had {batch.num_rows} rows')
```

The second thing to note, as you can see in the example, is that iterative reading is only supported if the file was written as multiple batches. If your giant feather file was written as one giant record batch then you will be unable to read it in a streaming fashion using pyarrow today.
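If you do find yourself with a single-batch file, a one-time rewrite can split it into smaller batches so that later reads can stream. Here is a minimal sketch; the file names and the 64k-row chunk size are made up, and note that `read_all()` still has to load the whole file into memory once to do the conversion:

```
import pyarrow as pa
import pyarrow.ipc as ipc

# Hypothetical file names; adjust to your data
with ipc.open_file('big_single_batch.arrow') as reader:
    print(f'File contains {reader.num_record_batches} batch(es)')
    table = reader.read_all()  # one full load, unavoidable for the rewrite

# Rewrite the same data as many smaller batches
with ipc.new_file('big_rechunked.arrow', table.schema) as writer:
    writer.write_table(table, max_chunksize=64 * 1024)  # ~64k rows per batch
```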
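And going back to the first point, a quick way to convince yourself that the two APIs describe the same format is to write a file with `pyarrow.feather` and open it with the IPC reader. This is just a sketch with a made-up file name; one caveat is that `write_feather` compresses with lz4 by default, which the IPC reader handles as long as your pyarrow build includes lz4 support:

```
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.ipc as ipc

table = pa.Table.from_pydict({'a': range(100)})

# Written through the feather API...
feather.write_feather(table, 'test.feather')

# ...but it is an ordinary Arrow IPC file, so the IPC reader opens it
with ipc.open_file('test.feather') as reader:
    print(f'{reader.num_record_batches} batch(es), schema: {reader.schema}')
```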
