Hi All,

I'm building a server with Flight RPC to store data and then allow
GetFlightInfo->DoGet of either the entire data or just particular record
batches. The server serializes data into the File format / random access
format to allow efficient get of the particular record batches.

I have a couple of questions in the area of reading the data using
RecordBatchFileReader.

---

Skimming through the underlying C++ implementation, I see that the
RecordBatchFileReader caches 'some stuff' internally.

>> This makes me wonder, what is the recommendation for the lifecycle of
the RecordBatchFileReader instances? Is it a good idea to keep a single
instance of RecordBatchFileReader open as long as possible?

*My guess is yes especially when working with files on S3 this could cut
down some network calls. *

---

Now the other thing is... when I'm trying to send out all data as response
to DoGet, it seems that the only way to do so is to create GeneratorStream
and pass a generator that yields  0..num_record_batches from
RecordBatchFileReader

>> Is the GeneratorStream the only way to stream out all data that is
serialized in IPC file format? Perhaps I missed something?

If the GeneratorStream is the right answer, my concern here is that, for
every batch the C++ code will be calling 'up' to Python, grabbing GIL on
the way, getting the batch, then back to C++.

How about having a RecordBatchReader that would one-time read
0..num_record_batches from RecordBatchFileReader? I have seen a couple of
adapters into RecordBatchReader elsewhere in the codebase so perhaps it's
not totally off?  If the adapter would be 'end-to-end' (new
RecordBatchReader implementation in C++ , wrapped using Cython), then
passing such a thing to `pyarrow.flight.RecordBatchStream` would mean all
the sending can be done in C++. Plus I think it makes some sense to have a
common interface to read all batches regardless of the format (stream vs
random access). What do you think?

*Note: I have experimented with building this but I think I hit a wall. The
work in C++ seems straightforward (I have found few other places where
adapters to RecordBatchReader are done, so that was good inspiration).
However I run into problems in Cython because the RecordBatchFileReader is
only defined in ipc.pxi and does not have `cdef class` block anywhere in
*.pxd files. And so - as far as my cython experience goes - the extension
cannot get a hold of the RecordBatchFileReader's underlying C++ instance.
I'm very new to all the cython and Python/C++ extensions, so perhaps there
is some other way? Of course I can still build MyOwnRecordBatchFileReader
in cython but I would rather play with PyArrow types.*

Thank you and best regards,
Lubo Slivka

Reply via email to