I put together straw-man proposal in PR [1] for the Flight changes. Ultimately, it seemed based on the use-cases discussed inlining the data on the Ticket made the most sense. This might be overly complex (I'm not sure how I feel about a enum indicating partial vs full results) but welcome feedback. Once we get consensus on this proposal, I can add changes to Flight SQL and try to provide reference implementations.
[1] https://github.com/apache/arrow/pull/12571 On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > Would it make sense to make this part of DoGet since it >> still would be returning a record batch > > I would lean against this. I think in many cases the client doesn't know > the size of the data that it expects. Leaving the flexibility on the > server side to send back inlined data when it thinks it makes sense, or a > bunch of tickets when there is in fact a lot of data seems like the best > option here. > > For cases like previewing data, you usually just want to get a small amount >> of data quickly. > > This is interesting and might be an additional use case. If we did decide > to extend FlightInfo we might also want a way of annotating inlined data > with its corresponding ticket. That way even for large results, you could > still send back a small preview if desired. > > After considering it a little bit I think I'm sold that inlined data > should not replace a ticket. So in my mind the open question is whether > the client needs to actively opt-in to inlined data. The scenarios I could > come with where inlined data isn't useful are: > 1. The client is an old client and isn't aware inline data might be > returned. In this case the main cost is of extra data on the wire and > storing it as unknown fields [1]. > 2. The client is a new client but still doesn't want to get inline data > (it might want to distribute all consumption to other processes). Same > cost is paid as option 1. > > Are there other scenarios? If servers choose reasonable limits on what > data to inline, the extra complexity of negotiating with the client in this > case might not be worth the benefits. > > Cheers, > Micah > > > [1] https://developers.google.com/protocol-buffers/docs/proto3#unknowns > > On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cutl...@gmail.com> wrote: > >> I think this would be a useful feature and be nice to have in Flight core. >> For cases like previewing data, you usually just want to get a small >> amount >> of data quickly. Would it make sense to make this part of DoGet since it >> still would be returning a record batch? Perhaps a Ticket could be made to >> have an optional FlightDescriptor that would serve as an all-in-one shot? >> >> On Tue, Mar 1, 2022 at 8:44 AM David Li <lidav...@apache.org> wrote: >> >> > I agree with something along Antoine's proposal, though: maybe we should >> > be more structured with the flags (akin to what Micah mentioned with the >> > Feature enum). >> > >> > Also, the flag could be embedded into the Flight SQL messages instead. >> (So >> > in effect, Flight would only add the capability to return data with >> > FlightInfo, and it's up to applications, like Flight SQL, to decide how >> > they want to take advantage of that.) >> > >> > I think having a completely separate method and return type and having >> to >> > poll for it beforehand somewhat defeats the purpose of having it/would >> be >> > much harder of a transition. >> > >> > Also: it should be `repeated FlightInfo inline_data` right? In case we >> > also need dictionary batches? >> > >> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote: >> > > Can we just add the following field to the FlightDescriptor message: >> > > >> > > bool accept_inline_data = 4; >> > > >> > > and this one to the FlightInfo message: >> > > >> > > FlightData inline_data = 100; >> > > >> > > Then new clients can `accept_inline_data` to true (the default being >> > > false if omitted) to signal servers that they can put the data if >> > > `inline_data` if deemed small enough. >> > > >> > > (the `accept_inline_data` field could also be used to the Criteria >> > > message) >> > > >> > > >> > > Alternatively, if the FlightDescriptor expansion looks a bit dirty >> > > (FlightDescriptor being used in other contexts where >> > > `accept_inline_data` makes no sense), we can instead define a new >> > > method: >> > > >> > > rpc GetFlightInfoEx(GetFlightInfoRequest) returns (FlightInfo) {} >> > > >> > > with: >> > > >> > > message GetFlightInfoRequest { >> > > FlightDescriptor flight_descriptor = 1; >> > > bool accept_inline_data = 2; >> > > } >> > > >> > > Regards >> > > >> > > Antoine. >> > > >> > > >> > > On Mon, 28 Feb 2022 11:29:12 -0800 >> > > James Duong <jam...@bitquilltech.com.INVALID> wrote: >> > >> This seems reasonable, however we need to account for existing Flight >> > >> clients that were written before this. >> > >> >> > >> It seems like the server will need to still handle the ticket >> returned >> > for >> > >> getStream() for clients that are unaware of the small result >> > optimization. >> > >> >> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li <lidav...@apache.org> >> wrote: >> > >> >> > >> > Ah, that makes more sense, that would be a reasonable extension to >> > Flight >> > >> > overall. (While we're at it, I think it would help to have an >> > app_metadata >> > >> > field in FlightInfo as well.) >> > >> > >> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote: >> > >> > >> >> > >> > >> But it seems reasonable to add a one-shot query path using >> DoGet. >> > >> > > >> > >> > > >> > >> > > I was thinking more of adding a bytes field to FlightInfo that >> > could >> > >> > store >> > >> > > arrow data. That way GetFlightInfo would be the only RPC >> necessary >> > for >> > >> > > small results when executing a CMD. The client doesn't >> necessarily >> > know >> > >> > > whether a query will return large or small results. >> > >> > > >> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li <lidav...@apache.org> >> > wrote: >> > >> > > >> > >> > >> I think the focus was on large result sets (though I don't >> recall >> > this >> > >> > >> being discussed before) and supporting multi-node setups (hence >> > >> > >> GetFlightInfo/DoGet are separated). But it seems reasonable to >> add >> > a >> > >> > >> one-shot query path using DoGet. >> > >> > >> >> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote: >> > >> > >> > I saw the same. A small, stateless query ability would be nice >> > >> > >> (connection >> > >> > >> > open, initialization, query in one message, the resultset in >> > the >> > >> > response >> > >> > >> > in one message) >> > >> > >> > >> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield < >> > emkornfi...@gmail.com> >> > >> > >> wrote: >> > >> > >> > >> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm not sure >> if >> > I'm >> > >> > >> missing >> > >> > >> >> it but is there any optimization for small results? My >> concern >> > is >> > >> > that >> > >> > >> the >> > >> > >> >> overhead of the RPCs for the DoGet after executing the query >> > could >> > >> > add >> > >> > >> >> non-trivial latency for smaller results. >> > >> > >> >> >> > >> > >> >> Has anybody else thought about this/investigated it? Am I >> > >> > understanding >> > >> > >> >> this correctly? >> > >> > >> >> >> > >> > >> >> Thanks, >> > >> > >> >> Micah >> > >> > >> >> >> > >> > >> >> > >> > >> > >> >> > >> >> > >> >