Hi David,

Thanks for the confirmation on the file reader instance lifecycle.

Indeed, the issue is that RecordBatchFileReader is a distinct interface
from RecordBatchReader - which imho is understandable, as the file reader
allows repeated, random access while RecordBatchReader is about getting
batches in order, one by one, once. What seems to be missing is a method
like RecordBatchFileReader.to_reader() that would return a
RecordBatchReader for reading all batches from the underlying instance once.
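
To make this concrete, here is a rough sketch of the usage I have in mind
(to_reader() is the proposed method and does not exist today; the path is
made up):

    import pyarrow as pa
    import pyarrow.flight as flight

    file_reader = pa.ipc.open_file(pa.memory_map("/data/dataset.arrow"))
    reader = file_reader.to_reader()  # proposed: returns a RecordBatchReader
    # ...which RecordBatchStream could then serve entirely from C++:
    stream = flight.RecordBatchStream(reader)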

*For the sake of completeness: before I got the requirement to serve random
batches of data, the server stored data in the IPC stream format and then
used pyarrow.flight.RecordBatchStream with great success.*
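
That earlier setup was essentially this (server and path names are just
illustrative):

    import pyarrow as pa
    import pyarrow.flight as flight

    class DataServer(flight.FlightServerBase):
        def do_get(self, context, ticket):
            # open_stream() returns a RecordBatchReader, so
            # RecordBatchStream drives the whole send loop from C++.
            source = pa.OSFile("/data/dataset.arrows", "rb")
            reader = pa.ipc.open_stream(source)
            return flight.RecordBatchStream(reader)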

How should we proceed from here? Shall I open a JIRA ticket, or..?

Thanks again,
Lubo


On Thu, Mar 17, 2022 at 11:22 PM David Li <[email protected]> wrote:

> Hi Lubo,
>
> Yes, opening a file (whether locally or on S3 or elsewhere) does have to
> do some work up front to read the file footer, load the schema, etc. before
> any batches can be read. (See ARROW-15340 for an optimization that would
> cut down on the number of I/O operations here.) It should generally be
> minor overhead (and as always, profile/benchmark your particular use
> case), but it's best not to repeatedly close/re-open the reader if you can
> help it.
>
> For streaming data from a record batch reader in DoGet, use
> RecordBatchStream:
> https://arrow.apache.org/docs/python/generated/pyarrow.flight.RecordBatchStream.html#pyarrow.flight.RecordBatchStream
> So long as the record batch reader itself is not implemented in Python,
> this will bypass the GIL/will not call back into Python for each batch. (It
> will extract the underlying C++ RecordBatchReader object from the Python
> reader, and use that directly.)
>
> Ah, though: is the issue that RecordBatchFileReader is a distinct
> interface from RecordBatchReader? We should indeed fix that…
>
> -David
>
> On Thu, Mar 17, 2022, at 17:44, Lubomir Slivka wrote:
>
> Hi All,
>
> I'm building a server with Flight RPC to store data and then allow
> GetFlightInfo->DoGet of either the entire data or just particular record
> batches. The server serializes data into the IPC file (random access)
> format to allow efficient retrieval of particular record batches.
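>
> The serialization is roughly like this (a sketch; the path is made up and
> `table` stands for the data being stored):
>
>     import pyarrow as pa
>
>     # Write the data in the IPC random-access (file) format, batch by
>     # batch, so that individual batches can be fetched later.
>     with pa.OSFile("/data/dataset.arrow", "wb") as sink:
>         with pa.ipc.new_file(sink, table.schema) as writer:
>             for batch in table.to_batches():
>                 writer.write_batch(batch)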
>
> I have a couple of questions in the area of reading the data using
> RecordBatchFileReader.
>
> ---
>
> Skimming through the underlying C++ implementation, I see that the
> RecordBatchFileReader caches 'some stuff' internally.
>
> >> This makes me wonder: what is the recommended lifecycle for
> RecordBatchFileReader instances? Is it a good idea to keep a single
> instance of RecordBatchFileReader open for as long as possible?
>
> *My guess is yes; especially when working with files on S3, this could cut
> down on some network calls.*
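>
> To be concrete, by "keeping it open" I mean something like this (a sketch;
> the cache and helper are made up):
>
>     import pyarrow as pa
>
>     # Keep one long-lived reader per dataset instead of re-opening (and
>     # re-reading the footer) on every request.
>     _readers = {}
>
>     def get_file_reader(path):
>         if path not in _readers:
>             _readers[path] = pa.ipc.open_file(pa.memory_map(path))
>         return _readers[path]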
>
> ---
>
> Now the other thing is... when I'm trying to send out all data as a
> response to DoGet, it seems that the only way to do so is to create a
> GeneratorStream and pass a generator that yields batches
> 0..num_record_batches from the RecordBatchFileReader.
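>
> Concretely, the DoGet handler currently does something like this
> (_get_file_reader is a made-up helper returning the long-lived
> RecordBatchFileReader):
>
>     import pyarrow.flight as flight
>
>     def do_get(self, context, ticket):
>         file_reader = self._get_file_reader(ticket)
>         generator = (
>             file_reader.get_batch(i)
>             for i in range(file_reader.num_record_batches)
>         )
>         # Every batch is fetched via this Python generator.
>         return flight.GeneratorStream(file_reader.schema, generator)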
>
> >> Is the GeneratorStream the only way to stream out all data that is
> serialized in the IPC file format? Or perhaps I missed something?
>
> If the GeneratorStream is the right answer, my concern here is that, for
> every batch, the C++ code will be calling 'up' to Python, grabbing the GIL
> on the way, getting the batch, and then going back to C++.
>
> How about having a RecordBatchReader that would read batches
> 0..num_record_batches from a RecordBatchFileReader once? I have seen a
> couple of adapters into RecordBatchReader elsewhere in the codebase, so
> perhaps it's not totally off? If the adapter were 'end-to-end' (a new
> RecordBatchReader implementation in C++, wrapped using Cython), then
> passing such a thing to `pyarrow.flight.RecordBatchStream` would mean all
> the sending could be done in C++. Plus I think it makes sense to have a
> common interface for reading all batches regardless of the format (stream
> vs. random access). What do you think?
>
> *Note: I have experimented with building this, but I think I hit a wall.
> The work in C++ seems straightforward (I found a few other places where
> adapters to RecordBatchReader are implemented, so that was good
> inspiration). However, I ran into problems in Cython because
> RecordBatchFileReader is only defined in ipc.pxi and does not have a `cdef
> class` block anywhere in the *.pxd files. And so, as far as my Cython
> experience goes, the extension cannot get hold of RecordBatchFileReader's
> underlying C++ instance. I'm very new to Cython and Python/C++ extensions,
> so perhaps there is some other way? Of course, I could still build
> MyOwnRecordBatchFileReader in Cython, but I would rather work with PyArrow
> types.*
>
> Thank you and best regards,
> Lubo Slivka
>
>
>
