Hi team. I recently started playing with the Python bindings for Apache
Arrow, first to learn how the library works and then to use it in our ML platform.
Currently we need to give our users a way to upload their datasets to
our storage platform (mainly GCS and S3). Once a user has uploaded a dataset,
we need to download it so that we can convert each of its records (rows)
to a specific format we use internally in our platform (protobuf models).
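For context, the conversion step is conceptually per-row, roughly like the
sketch below (the actual protobuf construction is elided, and to_pylist() is
just a stand-in for whatever row accessor we end up using):

    import pyarrow as pa

    def convert_batch(batch: pa.RecordBatch):
        # One internal protobuf message per row; the message
        # construction itself is omitted in this sketch.
        for row in batch.to_pylist():
            yield row  # stand-in for building our protobuf model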
Our main concern is performance: we would like to avoid downloading the
entire dataset. We are really interested in knowing whether it is possible
to fetch each RecordBatch of a dataset (an Arrow file) stored in a GCS
bucket via streaming, for instance with RecordBatchStreamReader. I'm not
sure whether this is possible without downloading the entire dataset first.
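To make the goal concrete, this is roughly what we are hoping to end up with
(an untested sketch on my side; the bucket path is just an example and the
loop body stands in for our per-row protobuf conversion):

    from pyarrow import fs
    import pyarrow as pa

    gcs = fs.GcsFileSystem(anonymous=True)
    with gcs.open_input_stream("bucket/bigfile.arrow") as source:
        reader = pa.ipc.open_stream(source)
        # Ideally each RecordBatch is fetched from GCS on demand,
        # so the whole file never has to be held locally.
        for batch in reader:
            print(batch.num_rows)  # stand-in for the per-row conversion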
I ran some small tests with GcsFileSystem, open_input_stream and
ipc.open_stream:
    from pyarrow import fs
    import pyarrow as pa

    gcs = fs.GcsFileSystem(anonymous=True)
    with gcs.open_input_stream("bucket/bigfile.arrow") as source:
        reader: pa.ipc.RecordBatchStreamReader = pa.ipc.open_stream(source)
I'm not sure whether I'm missing some important detail here, but in any
case I always get the same error:
    pyarrow.lib.ArrowInvalid: Expected to read 1330795073 metadata bytes, but
    only read 40168302
I hope you can give me some pointers on how to handle streaming
RecordBatches from a dataset stored on an external storage filesystem.
Thank you in advance!
Albert