OK, thanks. I have created a new ticket: https://issues.apache.org/jira/browse/ARROW-15969.

On Fri, Mar 18, 2022 at 1:48 PM David Li <[email protected]> wrote:

> Yes, please file a Jira, it would be good to have a conversion for this
> use case.
>
> On Fri, Mar 18, 2022, at 03:43, Lubomir Slivka wrote:
>
> > Hi David,
> >
> > Thanks for the confirmation on the file reader instance lifecycle.
> >
> > Indeed, the issue is that RecordBatchFileReader is a distinct interface
> > from RecordBatchReader - which IMHO is understandable, as the file reader
> > can do repeated, random access while RecordBatchReader is about getting
> > batches in order, one by one, once. I guess what is missing is a method
> > like RecordBatchFileReader.to_reader() that would return a
> > RecordBatchReader to read all batches from the underlying instance once.
> >
> > *For the sake of completeness: before I got the requirement to serve
> > random batches of data, the server was storing data in the IPC stream
> > format and then used pyarrow.flight.RecordBatchStream with great
> > success.*
> >
> > How should I proceed from here? Shall I open a JIRA ticket, or..?
> >
> > Thanks again,
> > Lubo
> >
> > On Thu, Mar 17, 2022 at 11:22 PM David Li <[email protected]> wrote:
> >
> > > Hi Lubo,
> > >
> > > Yes, opening a file (whether locally or on S3 or elsewhere) does have
> > > to do some work up front to read the file footer, load the schema,
> > > etc. before any batches can be read (see ARROW-15340 for an
> > > optimization that would cut down on the number of I/O operations
> > > here). It should generally be minor overhead (and as always,
> > > profile/benchmark your particular use case), but it's best not to
> > > repeatedly close/re-open the reader if you can help it.
> > >
> > > For streaming data from a record batch reader in DoGet, use
> > > RecordBatchStream:
> > > https://arrow.apache.org/docs/python/generated/pyarrow.flight.RecordBatchStream.html#pyarrow.flight.RecordBatchStream
> > > So long as the record batch reader itself is not implemented in
> > > Python, this will bypass the GIL / will not call back into Python for
> > > each batch. (It will extract the underlying C++ RecordBatchReader
> > > object from the Python reader, and use that directly.)
> > >
> > > Ah, though: is the issue that RecordBatchFileReader is a distinct
> > > interface from RecordBatchReader? We should indeed fix that…
> > >
> > > -David
> > >
> > > On Thu, Mar 17, 2022, at 17:44, Lubomir Slivka wrote:
> > >
> > > > Hi All,
> > > >
> > > > I'm building a server with Flight RPC to store data and then allow
> > > > GetFlightInfo -> DoGet of either the entire data or just particular
> > > > record batches. The server serializes data into the file format /
> > > > random access format to allow efficient retrieval of particular
> > > > record batches.
> > > >
> > > > I have a couple of questions in the area of reading the data using
> > > > RecordBatchFileReader.
> > > >
> > > > ---
> > > >
> > > > Skimming through the underlying C++ implementation, I see that the
> > > > RecordBatchFileReader caches 'some stuff' internally.
> > > >
> > > > >> This makes me wonder: what is the recommendation for the
> > > > lifecycle of RecordBatchFileReader instances? Is it a good idea to
> > > > keep a single instance of RecordBatchFileReader open as long as
> > > > possible?
> > > >
> > > > *My guess is yes; especially when working with files on S3, this
> > > > could cut down on some network calls.*
> > > >
> > > > ---
> > > >
> > > > Now the other thing is... when I'm trying to send out all data as a
> > > > response to DoGet, it seems that the only way to do so is to create
> > > > a GeneratorStream and pass a generator that yields batches
> > > > 0..num_record_batches from the RecordBatchFileReader.
> > > >
> > > > >> Is the GeneratorStream the only way to stream out all data that
> > > > is serialized in the IPC file format? Perhaps I missed something?
> > > >
> > > > If the GeneratorStream is the right answer, my concern here is
> > > > that, for every batch, the C++ code will be calling 'up' to Python,
> > > > grabbing the GIL on the way, getting the batch, then going back to
> > > > C++.
> > > >
> > > > How about having a RecordBatchReader that would do a one-time read
> > > > of batches 0..num_record_batches from the RecordBatchFileReader? I
> > > > have seen a couple of adapters into RecordBatchReader elsewhere in
> > > > the codebase, so perhaps it's not totally off? If the adapter were
> > > > 'end-to-end' (a new RecordBatchReader implementation in C++,
> > > > wrapped using Cython), then passing such a thing to
> > > > `pyarrow.flight.RecordBatchStream` would mean all the sending can
> > > > be done in C++. Plus, I think it makes some sense to have a common
> > > > interface to read all batches regardless of the format (stream vs
> > > > random access). What do you think?
> > > >
> > > > *Note: I have experimented with building this, but I think I hit a
> > > > wall. The work in C++ seems straightforward (I found a few other
> > > > places where adapters to RecordBatchReader are done, so that was
> > > > good inspiration). However, I ran into problems in Cython because
> > > > RecordBatchFileReader is only defined in ipc.pxi and does not have
> > > > a `cdef class` block anywhere in the *.pxd files. And so - as far
> > > > as my Cython experience goes - the extension cannot get hold of the
> > > > RecordBatchFileReader's underlying C++ instance. I'm very new to
> > > > Cython and Python/C++ extensions, so perhaps there is some other
> > > > way? Of course, I can still build MyOwnRecordBatchFileReader in
> > > > Cython, but I would rather play with PyArrow types.*
> > > >
> > > > Thank you and best regards,
> > > > Lubo Slivka
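
---

For reference, a minimal sketch (in Python) of the long-lived reader
pattern discussed above, assuming a hypothetical file path "data.arrow"
that is not from the thread. Opening the reader once amortizes the
footer/schema read over all subsequent batch accesses:

    import pyarrow.ipc as ipc

    # Opening the random-access (file) format reads the footer and schema
    # up front - the overhead David mentions.
    reader = ipc.open_file("data.arrow")

    # The same reader can then serve arbitrary batches repeatedly without
    # re-reading the footer.
    print(reader.num_record_batches)
    batch = reader.get_batch(0)  # random access to a single record batch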
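
The GeneratorStream approach Lubo describes would look roughly like the
sketch below (the server wiring and the ticket-to-file mapping are made up
for illustration). Each yield crosses from C++ back into Python and takes
the GIL, which is exactly the per-batch overhead being questioned:

    import pyarrow.ipc as ipc
    import pyarrow.flight as flight

    class DemoServer(flight.FlightServerBase):
        def do_get(self, context, ticket):
            reader = ipc.open_file("data.arrow")  # hypothetical path

            def batches():
                for i in range(reader.num_record_batches):
                    # C++ calls back up into Python (grabbing the GIL)
                    # once per batch to fetch the next item.
                    yield reader.get_batch(i)

            return flight.GeneratorStream(reader.schema, batches())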
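
By contrast, the RecordBatchStream path David recommends keeps the send
loop in C++ when it is handed a C++-backed RecordBatchReader - for
example, a stream-format reader like the one Lubo used before the
random-access requirement came up ("data.arrows" is again a hypothetical
path):

    import pyarrow.ipc as ipc
    import pyarrow.flight as flight

    class StreamingServer(flight.FlightServerBase):
        def do_get(self, context, ticket):
            # RecordBatchStreamReader is backed by a C++ RecordBatchReader,
            # so RecordBatchStream unwraps it and sends all batches without
            # calling back into Python per batch.
            stream_reader = ipc.open_stream("data.arrows")
            return flight.RecordBatchStream(stream_reader)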
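
Finally, the file-reader-to-stream conversion that ARROW-15969 asks for
can be approximated in pure Python today with
RecordBatchReader.from_batches. Note this sketch does not achieve the
ticket's goal - the generator still runs in Python, so each batch is still
fetched under the GIL; a native adapter in C++ (as Lubo proposes) would
remove that hop:

    import pyarrow as pa

    def file_reader_to_stream(file_reader):
        # Read batches 0..num_record_batches exactly once, exposed through
        # the one-shot RecordBatchReader interface.
        batches = (file_reader.get_batch(i)
                   for i in range(file_reader.num_record_batches))
        return pa.RecordBatchReader.from_batches(file_reader.schema, batches)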
