> > The operation flow would be like this, or what would it look like? > Client ---> GetFlightInfo (query/update operation in payload) ---> Server > ---> Results (non-streamed)
This is roughly the flow I was imagining if the server chooses to send back inlined data. -Micah On Tue, Mar 8, 2022 at 11:27 AM Gavin Ray <ray.gavi...@gmail.com> wrote: > Thank you for doing this, left a few questions on the GH issue > > I would adopt this proposal as soon as it makes it into nightlies > (or possibly earlier if it's just a matter of regenerating the proto > definitions) > > The operation flow would be like this, or what would it look like? > > Client ---> GetFlightInfo (query/update operation in payload) ---> Server > ---> Results (non-streamed) > > > > > On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> Some people have already left comments on >> https://github.com/apache/arrow/pull/12571 More eyes on it would be >> appreciated. If there aren't more comments, I'll try to start >> implementing >> this feature in Flight next week, and hopefully have a vote after it is >> supported in Java and C++/Python. >> >> >> Thanks, >> Micah >> >> On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> >> > I put together straw-man proposal in PR [1] for the Flight changes. >> > Ultimately, it seemed based on the use-cases discussed inlining the >> data on >> > the Ticket made the most sense. This might be overly complex (I'm not >> sure >> > how I feel about a enum indicating partial vs full results) but welcome >> > feedback. Once we get consensus on this proposal, I can add changes to >> > Flight SQL and try to provide reference implementations. >> > >> > [1] https://github.com/apache/arrow/pull/12571 >> > >> > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield <emkornfi...@gmail.com> >> > wrote: >> > >> >> Would it make sense to make this part of DoGet since it >> >>> still would be returning a record batch >> >> >> >> I would lean against this. I think in many cases the client doesn't >> know >> >> the size of the data that it expects. Leaving the flexibility on the >> >> server side to send back inlined data when it thinks it makes sense, >> or a >> >> bunch of tickets when there is in fact a lot of data seems like the >> best >> >> option here. >> >> >> >> For cases like previewing data, you usually just want to get a small >> >>> amount >> >>> of data quickly. >> >> >> >> This is interesting and might be an additional use case. If we did >> >> decide to extend FlightInfo we might also want a way of annotating >> inlined >> >> data with its corresponding ticket. That way even for large results, >> you >> >> could still send back a small preview if desired. >> >> >> >> After considering it a little bit I think I'm sold that inlined data >> >> should not replace a ticket. So in my mind the open question is >> whether >> >> the client needs to actively opt-in to inlined data. The scenarios I >> could >> >> come with where inlined data isn't useful are: >> >> 1. The client is an old client and isn't aware inline data might be >> >> returned. In this case the main cost is of extra data on the wire and >> >> storing it as unknown fields [1]. >> >> 2. The client is a new client but still doesn't want to get inline >> data >> >> (it might want to distribute all consumption to other processes). Same >> >> cost is paid as option 1. >> >> >> >> Are there other scenarios? If servers choose reasonable limits on what >> >> data to inline, the extra complexity of negotiating with the client in >> this >> >> case might not be worth the benefits. >> >> >> >> Cheers, >> >> Micah >> >> >> >> >> >> [1] >> https://developers.google.com/protocol-buffers/docs/proto3#unknowns >> >> >> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cutl...@gmail.com> >> wrote: >> >> >> >>> I think this would be a useful feature and be nice to have in Flight >> >>> core. >> >>> For cases like previewing data, you usually just want to get a small >> >>> amount >> >>> of data quickly. Would it make sense to make this part of DoGet since >> it >> >>> still would be returning a record batch? Perhaps a Ticket could be >> made >> >>> to >> >>> have an optional FlightDescriptor that would serve as an all-in-one >> shot? >> >>> >> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li <lidav...@apache.org> wrote: >> >>> >> >>> > I agree with something along Antoine's proposal, though: maybe we >> >>> should >> >>> > be more structured with the flags (akin to what Micah mentioned with >> >>> the >> >>> > Feature enum). >> >>> > >> >>> > Also, the flag could be embedded into the Flight SQL messages >> instead. >> >>> (So >> >>> > in effect, Flight would only add the capability to return data with >> >>> > FlightInfo, and it's up to applications, like Flight SQL, to decide >> how >> >>> > they want to take advantage of that.) >> >>> > >> >>> > I think having a completely separate method and return type and >> having >> >>> to >> >>> > poll for it beforehand somewhat defeats the purpose of having >> it/would >> >>> be >> >>> > much harder of a transition. >> >>> > >> >>> > Also: it should be `repeated FlightInfo inline_data` right? In case >> we >> >>> > also need dictionary batches? >> >>> > >> >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote: >> >>> > > Can we just add the following field to the FlightDescriptor >> message: >> >>> > > >> >>> > > bool accept_inline_data = 4; >> >>> > > >> >>> > > and this one to the FlightInfo message: >> >>> > > >> >>> > > FlightData inline_data = 100; >> >>> > > >> >>> > > Then new clients can `accept_inline_data` to true (the default >> being >> >>> > > false if omitted) to signal servers that they can put the data if >> >>> > > `inline_data` if deemed small enough. >> >>> > > >> >>> > > (the `accept_inline_data` field could also be used to the Criteria >> >>> > > message) >> >>> > > >> >>> > > >> >>> > > Alternatively, if the FlightDescriptor expansion looks a bit dirty >> >>> > > (FlightDescriptor being used in other contexts where >> >>> > > `accept_inline_data` makes no sense), we can instead define a new >> >>> > > method: >> >>> > > >> >>> > > rpc GetFlightInfoEx(GetFlightInfoRequest) returns (FlightInfo) >> {} >> >>> > > >> >>> > > with: >> >>> > > >> >>> > > message GetFlightInfoRequest { >> >>> > > FlightDescriptor flight_descriptor = 1; >> >>> > > bool accept_inline_data = 2; >> >>> > > } >> >>> > > >> >>> > > Regards >> >>> > > >> >>> > > Antoine. >> >>> > > >> >>> > > >> >>> > > On Mon, 28 Feb 2022 11:29:12 -0800 >> >>> > > James Duong <jam...@bitquilltech.com.INVALID> wrote: >> >>> > >> This seems reasonable, however we need to account for existing >> >>> Flight >> >>> > >> clients that were written before this. >> >>> > >> >> >>> > >> It seems like the server will need to still handle the ticket >> >>> returned >> >>> > for >> >>> > >> getStream() for clients that are unaware of the small result >> >>> > optimization. >> >>> > >> >> >>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li <lidav...@apache.org> >> >>> wrote: >> >>> > >> >> >>> > >> > Ah, that makes more sense, that would be a reasonable >> extension to >> >>> > Flight >> >>> > >> > overall. (While we're at it, I think it would help to have an >> >>> > app_metadata >> >>> > >> > field in FlightInfo as well.) >> >>> > >> > >> >>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote: >> >>> > >> > >> >> >>> > >> > >> But it seems reasonable to add a one-shot query path using >> >>> DoGet. >> >>> > >> > > >> >>> > >> > > >> >>> > >> > > I was thinking more of adding a bytes field to FlightInfo >> that >> >>> > could >> >>> > >> > store >> >>> > >> > > arrow data. That way GetFlightInfo would be the only RPC >> >>> necessary >> >>> > for >> >>> > >> > > small results when executing a CMD. The client doesn't >> >>> necessarily >> >>> > know >> >>> > >> > > whether a query will return large or small results. >> >>> > >> > > >> >>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li < >> lidav...@apache.org> >> >>> > wrote: >> >>> > >> > > >> >>> > >> > >> I think the focus was on large result sets (though I don't >> >>> recall >> >>> > this >> >>> > >> > >> being discussed before) and supporting multi-node setups >> (hence >> >>> > >> > >> GetFlightInfo/DoGet are separated). But it seems reasonable >> to >> >>> add >> >>> > a >> >>> > >> > >> one-shot query path using DoGet. >> >>> > >> > >> >> >>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote: >> >>> > >> > >> > I saw the same. A small, stateless query ability would be >> >>> nice >> >>> > >> > >> (connection >> >>> > >> > >> > open, initialization, query in one message, the resultset >> in >> >>> > the >> >>> > >> > response >> >>> > >> > >> > in one message) >> >>> > >> > >> > >> >>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield < >> >>> > emkornfi...@gmail.com> >> >>> > >> > >> wrote: >> >>> > >> > >> > >> >>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm not >> sure >> >>> if >> >>> > I'm >> >>> > >> > >> missing >> >>> > >> > >> >> it but is there any optimization for small results? My >> >>> concern >> >>> > is >> >>> > >> > that >> >>> > >> > >> the >> >>> > >> > >> >> overhead of the RPCs for the DoGet after executing the >> query >> >>> > could >> >>> > >> > add >> >>> > >> > >> >> non-trivial latency for smaller results. >> >>> > >> > >> >> >> >>> > >> > >> >> Has anybody else thought about this/investigated it? Am >> I >> >>> > >> > understanding >> >>> > >> > >> >> this correctly? >> >>> > >> > >> >> >> >>> > >> > >> >> Thanks, >> >>> > >> > >> >> Micah >> >>> > >> > >> >> >> >>> > >> > >> >> >>> > >> > >> >>> > >> >> >>> > >> >> >>> > >> >>> >> >> >> >