I put together straw-man proposal in PR [1] for the Flight changes.
Ultimately, it seemed based on the use-cases discussed inlining the data on
the Ticket made the most sense.  This might be overly complex (I'm not sure
how I feel about a enum indicating partial vs full results) but welcome
feedback.  Once we get consensus on this proposal, I can add changes to
Flight SQL and try to provide reference implementations.

[1] https://github.com/apache/arrow/pull/12571

On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Would it make sense to make this part of DoGet since it
>> still would be returning a record batch
>
> I would lean against this. I think in many cases the client doesn't know
> the size of the data that it expects.  Leaving the flexibility on the
> server side to send back inlined data when it thinks it makes sense, or a
> bunch of tickets when there is in fact a lot of data seems like the best
> option here.
>
> For cases like previewing data, you usually just want to get a small amount
>> of data quickly.
>
> This is interesting and might be an additional use case.  If we did decide
> to extend FlightInfo we might also want a way of annotating inlined data
> with its corresponding ticket.  That way even for large results, you could
> still send back a small preview if desired.
>
> After considering it a little bit I think I'm sold that inlined data
> should not replace a ticket.  So in my mind the open question is whether
> the client needs to actively opt-in to inlined data.  The scenarios I could
> come with where inlined data isn't useful are:
> 1.  The client is an old client and isn't aware inline data might be
> returned.  In this case the main cost is of extra data on the wire and
> storing it as unknown fields [1].
> 2.  The client is a new client but still doesn't want to get inline data
> (it might want to distribute all consumption to other processes).  Same
> cost is paid as option 1.
>
> Are there other scenarios?  If servers choose reasonable limits on what
> data to inline, the extra complexity of negotiating with the client in this
> case might not be worth the benefits.
>
> Cheers,
> Micah
>
>
> [1] https://developers.google.com/protocol-buffers/docs/proto3#unknowns
>
> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cutl...@gmail.com> wrote:
>
>> I think this would be a useful feature and be nice to have in Flight core.
>> For cases like previewing data, you usually just want to get a small
>> amount
>> of data quickly. Would it make sense to make this part of DoGet since it
>> still would be returning a record batch? Perhaps a Ticket could be made to
>> have an optional FlightDescriptor that would serve as an all-in-one shot?
>>
>> On Tue, Mar 1, 2022 at 8:44 AM David Li <lidav...@apache.org> wrote:
>>
>> > I agree with something along Antoine's proposal, though: maybe we should
>> > be more structured with the flags (akin to what Micah mentioned with the
>> > Feature enum).
>> >
>> > Also, the flag could be embedded into the Flight SQL messages instead.
>> (So
>> > in effect, Flight would only add the capability to return data with
>> > FlightInfo, and it's up to applications, like Flight SQL, to decide how
>> > they want to take advantage of that.)
>> >
>> > I think having a completely separate method and return type and having
>> to
>> > poll for it beforehand somewhat defeats the purpose of having it/would
>> be
>> > much harder of a transition.
>> >
>> > Also: it should be `repeated FlightInfo inline_data` right? In case we
>> > also need dictionary batches?
>> >
>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote:
>> > > Can we just add the following field to the FlightDescriptor message:
>> > >
>> > >   bool accept_inline_data = 4;
>> > >
>> > > and this one to the FlightInfo message:
>> > >
>> > >   FlightData inline_data = 100;
>> > >
>> > > Then new clients can `accept_inline_data` to true (the default being
>> > > false if omitted) to signal servers that they can put the data if
>> > > `inline_data` if deemed small enough.
>> > >
>> > > (the `accept_inline_data` field could also be used to the Criteria
>> > > message)
>> > >
>> > >
>> > > Alternatively, if the FlightDescriptor expansion looks a bit dirty
>> > > (FlightDescriptor being used in other contexts where
>> > > `accept_inline_data` makes no sense), we can instead define a new
>> > > method:
>> > >
>> > >   rpc GetFlightInfoEx(GetFlightInfoRequest) returns (FlightInfo) {}
>> > >
>> > > with:
>> > >
>> > > message GetFlightInfoRequest {
>> > >   FlightDescriptor flight_descriptor = 1;
>> > >   bool accept_inline_data = 2;
>> > > }
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> > >
>> > >
>> > > On Mon, 28 Feb 2022 11:29:12 -0800
>> > > James Duong <jam...@bitquilltech.com.INVALID> wrote:
>> > >> This seems reasonable, however we need to account for existing Flight
>> > >> clients that were written before this.
>> > >>
>> > >> It seems like the server will need to still handle the ticket
>> returned
>> > for
>> > >> getStream() for clients that are unaware of the small result
>> > optimization.
>> > >>
>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li <lidav...@apache.org>
>> wrote:
>> > >>
>> > >> > Ah, that makes more sense, that would be a reasonable extension to
>> > Flight
>> > >> > overall. (While we're at it, I think it would help to have an
>> > app_metadata
>> > >> > field in FlightInfo as well.)
>> > >> >
>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote:
>> > >> > >>
>> > >> > >> But it seems reasonable to add a one-shot query path using
>> DoGet.
>> > >> > >
>> > >> > >
>> > >> > > I was thinking more of adding a bytes field to FlightInfo that
>> > could
>> > >> > store
>> > >> > > arrow data.  That way GetFlightInfo would be the only RPC
>> necessary
>> > for
>> > >> > > small results when executing a CMD.  The client doesn't
>> necessarily
>> > know
>> > >> > > whether a query will return large or small results.
>> > >> > >
>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li <lidav...@apache.org>
>> > wrote:
>> > >> > >
>> > >> > >> I think the focus was on large result sets (though I don't
>> recall
>> > this
>> > >> > >> being discussed before) and supporting multi-node setups (hence
>> > >> > >> GetFlightInfo/DoGet are separated). But it seems reasonable to
>> add
>> > a
>> > >> > >> one-shot query path using DoGet.
>> > >> > >>
>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote:
>> > >> > >> > I saw the same. A small, stateless query ability would be nice
>> > >> > >> (connection
>> > >> > >> > open, initialization, query in one message, the resultset in
>> > the
>> > >> > response
>> > >> > >> > in one message)
>> > >> > >> >
>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield <
>> > emkornfi...@gmail.com>
>> > >> > >> wrote:
>> > >> > >> >
>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm not sure
>> if
>> > I'm
>> > >> > >> missing
>> > >> > >> >> it but is there any optimization for small results?  My
>> concern
>> > is
>> > >> > that
>> > >> > >> the
>> > >> > >> >> overhead of the RPCs for the DoGet after executing the query
>> > could
>> > >> > add
>> > >> > >> >> non-trivial latency for smaller results.
>> > >> > >> >>
>> > >> > >> >> Has anybody else thought about this/investigated it?  Am I
>> > >> > understanding
>> > >> > >> >> this correctly?
>> > >> > >> >>
>> > >> > >> >> Thanks,
>> > >> > >> >> Micah
>> > >> > >> >>
>> > >> > >>
>> > >> >
>> > >>
>> > >>
>> >
>>
>

Reply via email to