westonpace commented on issue #11469:
URL: https://github.com/apache/arrow/issues/11469#issuecomment-947152899
So one of the points of confusion with the python implementation is that it refers to `feather` and IPC files as separate things. This is unfortunately a bit of legacy: Feather v2 is the same thing as the Arrow IPC file format. The "feather" calls in python are rather limited, as you have noticed, and only give you full table reads. The IPC functionality, however, is more extensive. So to read a feather file in python in a streaming fashion you will use a `pyarrow.ipc.RecordBatchFileReader` (with the corresponding `RecordBatchFileWriter` used to write the file in multiple batches in the first place). There is some documentation on this here: https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files

So, for example:

```
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.Table.from_pydict({'a': range(100)})

# Split the table into batches of 10 rows as it is written
with ipc.RecordBatchFileWriter('test.arrow', table.schema) as writer:
    writer.write_table(table, max_chunksize=10)

# Read the file back one batch at a time
with ipc.RecordBatchFileReader('test.arrow') as reader:
    for batch_index in range(reader.num_record_batches):
        batch = reader.get_batch(batch_index)
        print(f'Read in batch {batch_index} which had {batch.num_rows} rows')
```

The second thing to note, as you can see in the example, is that iterative reading is only supported if the file was written as multiple batches. If your giant feather file was written as one giant record batch then you will be unable to read it in a streaming fashion using pyarrow today.
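If you do find yourself with a single-batch file, a one-time rewrite can split it into smaller batches so that later reads can stream. Here is a minimal sketch; the file names and the 64k-row chunk size are made up, and note that `read_all()` still has to load the whole file into memory once to do the conversion:

```
import pyarrow as pa
import pyarrow.ipc as ipc

# Hypothetical file names; adjust to your data
with ipc.open_file('big_single_batch.arrow') as reader:
    print(f'File contains {reader.num_record_batches} batch(es)')
    table = reader.read_all()  # one full load, unavoidable for the rewrite

# Rewrite the same data as many smaller batches
with ipc.new_file('big_rechunked.arrow', table.schema) as writer:
    writer.write_table(table, max_chunksize=64 * 1024)  # ~64k rows per batch
```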
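And going back to the first point, a quick way to convince yourself that the two APIs describe the same format is to write a file with `pyarrow.feather` and open it with the IPC reader. This is just a sketch with a made-up file name; one caveat is that `write_feather` compresses with lz4 by default, which the IPC reader handles as long as your pyarrow build includes lz4 support:

```
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.ipc as ipc

table = pa.Table.from_pydict({'a': range(100)})

# Written through the feather API...
feather.write_feather(table, 'test.feather')

# ...but it is an ordinary Arrow IPC file, so the IPC reader opens it
with ipc.open_file('test.feather') as reader:
    print(f'{reader.num_record_batches} batch(es), schema: {reader.schema}')
```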
