Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Micah Kornfield Fri, 25 Jun 2021 15:07:49 -0700

Sorry for the second reply:


>    1. In our case we do expect relatively frequent changes in the schema
>    of the batch being sent out. I don't see that pattern changing in the mid
>    term for a good reason. However long term maybe it will be possible to
>    leverage separate RPC calls. I left some description in the comments thread
>    here
>    
> <https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?disco=AAAAMkkJ9xQ>
>    .
>
> I think we should clarify between two types of changes:
1.  Completely new  unique schema needed.
2.  Relatively frequent switches between a few known schemas.

On Fri, Jun 25, 2021 at 2:58 PM Micah Kornfield <[email protected]>
wrote:

>
>>    1. Re complexity of "one schema at a time" vs "schema id based": i
>>    think they are not much different, right? In fact the second one is more 
>> of
>>    an optimization to the first one which is beneficial to us. Anyways even
>>    with the first approach you need to add some logic of schema change. 
>> Adding
>>    the schema identification is not that complex at first glance.
>>
>> The second approach requires more state being maintained on the client
> side and makes it more difficult to bring schema evolution back into the
> IPC Stream format (i.e. it would live only in flight) without additional
> changes.
>
> In both cases there are also potentially thorny issues related to
> dictionary usages that we will need to resolve.  I haven't thought deeply
> about it but it seems that these might be more complex on a multiplexed
> channel.
>
>
> On Fri, Jun 25, 2021 at 11:52 AM Gosh Arzumanyan <[email protected]>
> wrote:
>
>> Hi Micah,
>>
>> Sure, let me do it here:
>>
>>    1. In our case we do expect relatively frequent changes in the schema
>>    of the batch being sent out. I don't see that pattern changing in the mid
>>    term for a good reason. However long term maybe it will be possible to
>>    leverage separate RPC calls. I left some description in the comments 
>> thread
>>    here
>>    
>> <https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?disco=AAAAMkkJ9xQ>
>>    .
>>    2. Regarding the union of structs trick: while it is a good
>>    workaround for most of the cases as of now there are also some aspects
>>    which deserve consideration:
>>       1. There is a  space overhead
>>       2. There is an additional glue code to make it work
>>       3. Considering that this solution is bubbling up frequently in
>>       similar contexts, seems like users might benefit if it  was "natively"
>>       supported and they could just focus on populating the schemas they 
>> really
>>       need to.
>>    3. Re complexity of "one schema at a time" vs "schema id based": i
>>    think they are not much different, right? In fact the second one is more 
>> of
>>    an optimization to the first one which is beneficial to us. Anyways even
>>    with the first approach you need to add some logic of schema change. 
>> Adding
>>    the schema identification is not that complex at first glance.
>>
>> Cheers,
>> Gosh
>>
>> On Fri, Jun 25, 2021 at 6:25 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> >
>>> > 1. It seems like renaming stream_id to schema_id and delegating
>>> "logical
>>> > stream" distinction to app_metadata mitigates the "multiplexing" point
>>> > while at the same time it gives enough flexibility to address both
>>> Nate's
>>> > and our use cases.
>>>
>>>
>>> I don't think this is the case.  It seems that having no additional
>>> fields
>>> added and then sending a new schema when necessary combined with Union of
>>> Structs would solve most use-cases.  The main downside could be potential
>>> performance implications if the schema is changing frequently.  Gosh,
>>> could
>>> you address why this wouldn't be sufficient (either here or on the doc).
>>>
>>> Thanks,
>>> -Micah
>>>
>>> On Fri, Jun 25, 2021 at 5:30 AM Gosh Arzumanyan <[email protected]>
>>> wrote:
>>>
>>> > Hi guys,
>>> >
>>> > Thanks for sharing your insights/concerns! I also left some comments
>>> based
>>> > on the discussion we had. Briefly:
>>> >
>>> > 1. It seems like renaming stream_id to schema_id and delegating
>>> "logical
>>> > stream" distinction to app_metadata mitigates the "multiplexing" point
>>> > while at the same time it gives enough flexibility to address both
>>> Nate's
>>> > and our use cases.
>>> > 2. To David's point about other transports: in fact currently we are
>>> using
>>> > other transports(aside from gRPC) so we don't wanna depend on only gRPC
>>> > features.
>>> >
>>> > Cheers,
>>> > Gosh
>>> >
>>> > On Wed, Jun 23, 2021 at 10:40 PM David Li <[email protected]> wrote:
>>> >
>>> > > Thanks for chiming in - I've replied in the doc. Scoping it to just
>>> > schema
>>> > > evolution would be preferable, but I'm not sure if Gosh's usecase
>>> > requires
>>> > > more flexibility than that or not.
>>> > >
>>> > > Again, though, given that 1) gRPC recycles a connection, so repeated
>>> > calls
>>> > > aren't necessarily expensive and 2) encoding tricks like
>>> > union-of-structs,
>>> > > any solution needs to be weighed against those/we should make sure to
>>> > > document why they aren't sufficient. (For instance, 1) is hampered
>>> by the
>>> > > use of L7 load balancers and/or client-side load balancing policies
>>> in
>>> > gRPC
>>> > > and assumes statefulness which is undesirable in general. There's
>>> also
>>> > the
>>> > > eventual desire to have a transport besides gRPC someday.)
>>> > >
>>> > > -David
>>> > >
>>> > > On Wed, Jun 23, 2021, at 16:24, Nate Bauernfeind wrote:
>>> > >
>>> > > Thanks for writing this up! I added a few general comments, but have
>>> a
>>> > > question on the approach because it's not quite what I was expecting.
>>> > >
>>> > > I am slightly concerned that the proposal looks more like support for
>>> > > "multiplexing" IPC streams into a single RPC stream rather than
>>> support
>>> > for
>>> > > a changing Schema of an otherwise consistently logical stream. gRPC
>>> > already
>>> > > does a good job decoupling RPC streams from one another. I feel that
>>> > > throwing that idea into the IPC stream increases client-library
>>> > > implementation cost by quite a lot.
>>> > >
>>> > > Why is it not good enough to replace the Schema when we see a
>>> duplicate?
>>> > > This is undoubtedly less work across all client implementations.
>>> > >
>>> > > The benefit I see is that you might have two schemas that you swap
>>> > between
>>> > > frequently then you can indicate with a single integer. If that's
>>> what
>>> > you
>>> > > want to support I would rather think of them as `schema_id` instead
>>> of
>>> > > `stream_id` and not give this impression that multiplexing is a
>>> goal. As
>>> > > you have proposed, it seems that the "done writing for a stream"
>>> needs a
>>> > > callback notifying the user receiving the stream that a logical
>>> subset of
>>> > > the flight is complete. Alternatively, if they aren't independent
>>> streams
>>> > > (to the end-user), we could tell the Arrow layer that a particular
>>> schema
>>> > > is no longer needed without also needing to communicate further
>>> > downstream.
>>> > >
>>> > > On Wed, Jun 23, 2021 at 1:39 PM David Li <[email protected]>
>>> wrote:
>>> > >
>>> > > > Ah to be clear, the API is indeed inconsistent - DoExchange was
>>> added
>>> > > some
>>> > > > time later (and by its nature returning a FlightDataStream would
>>> not
>>> > have
>>> > > > been possible, since it's meant to be able to interleave
>>> > > reading/writing).
>>> > > > But really, DoGet is indeed the odd one out in the C++ API and it
>>> may
>>> > be
>>> > > > worth correcting. You could also perhaps imagine making a
>>> > > FlightDataStream
>>> > > > implementation that accepts a closure and provides it a fake
>>> writer, if
>>> > > the
>>> > > > API mismatch is hard to work with...
>>> > > >
>>> > > > That said: this has some benefits, e.g. for a Python service that
>>> > returns
>>> > > > a Table, that means data can be fed into gRPC entirely in C++
>>> without
>>> > > > having to bounce into Python for each chunk.
>>> > > >
>>> > > > Best,
>>> > > > David
>>> > > >
>>> > > > On Wed, Jun 23, 2021, at 15:33, Gosh Arzumanyan wrote:
>>> > > > > Hi David,
>>> > > > >
>>> > > > > Got you. In fact I was looking at this more from the point of
>>> view of
>>> > > > consistency of the API in terms of "inputs" and thought DoExchange
>>> is
>>> > > kind
>>> > > > of a DoGet+ so might make sense to have the same classes being
>>> utilized
>>> > > in
>>> > > > both places. But again, I might be missing something and I get the
>>> > point
>>> > > > about breaking change.
>>> > > > >
>>> > > > > Cheers,
>>> > > > > Gosh
>>> > > > >
>>> > > > > On Wed, Jun 23, 2021 at 2:58 PM David Li <[email protected]>
>>> > wrote:
>>> > > > >> __
>>> > > > >> It's mostly a quirk of implementation (and just for
>>> clarification,
>>> > > > they're all nearly identical on the format/protocol level).
>>> > > > >>
>>> > > > >> DoGet is conceptualized as your application returning a readable
>>> > > stream
>>> > > > of batches, instead of your application imperatively writing
>>> batches to
>>> > > the
>>> > > > client. (This is different than how Flight is implemented in
>>> Java.) You
>>> > > > would normally not implement FlightDataStream - you would return a
>>> > > > RecordBatchStream.
>>> > > > >>
>>> > > > >> DoGet could not have FlightMessageWriter as a return type as
>>> that
>>> > > > wouldn't make sense, but it could accept an instance of that as a
>>> > > parameter
>>> > > > instead, much like DoExchange. That would be a breaking change.
>>> > > > >>
>>> > > > >> Best,
>>> > > > >> David
>>> > > > >>
>>> > > > >> On Wed, Jun 23, 2021, at 08:47, Gosh Arzumanyan wrote:
>>> > > > >>> Hi David,
>>> > > > >>>
>>> > > > >>> Going through the ArrowFlight API: got confused a bit on DoGet
>>> and
>>> > > > >>> DoPut/DoExachange apis: the former one expects FlightDataStream
>>> > which
>>> > > > talks
>>> > > > >>> in already serialized message terms while the latter to
>>> > > > >>> accept FlightMessageReader/Writer which expect the user to
>>> pass in
>>> > > > >>> RecordBatches etc. Is there any reason why the DoGet can't have
>>> > > > >>> FlightMessageWriter as a return type?
>>> > > > >>>
>>> > > > >>> Cheers,
>>> > > > >>> Gosh
>>> > > > >>>
>>> > > > >>> On Mon, Jun 21, 2021 at 9:47 PM Gosh Arzumanyan <
>>> [email protected]
>>> > >
>>> > > > wrote:
>>> > > > >>>
>>> > > > >>> > Thanks David!
>>> > > > >>> >
>>> > > > >>> > I also responded/added more suggestions/questions to the
>>> doc. I
>>> > > > think it
>>> > > > >>> > makes sense to have two sections: one purely protocol
>>> oriented
>>> > and
>>> > > > second
>>> > > > >>> > API oriented(examples in c++ or in any other language should
>>> make
>>> > > > the idea
>>> > > > >>> > easier to digest).
>>> > > > >>> >
>>> > > > >>> > Thanks for the reference too!
>>> > > > >>> >
>>> > > > >>> > Cheers,
>>> > > > >>> > Gosh
>>> > > > >>> >
>>> > > > >>> > On Mon, Jun 21, 2021 at 4:41 PM David Li <
>>> [email protected]>
>>> > > > wrote:
>>> > > > >>> >
>>> > > > >>> >> Thanks! I've left some initial comments/suggestions to
>>> expand it
>>> > > in
>>> > > > terms
>>> > > > >>> >> of the format definitions and not the C++ APIs.
>>> > > > >>> >>
>>> > > > >>> >> I'll also note something like this was proposed a long time
>>> ago
>>> > -
>>> > > > there's
>>> > > > >>> >> not very much discussion about it there but for reference:
>>> > > > >>> >>
>>> > > >
>>> > >
>>> >
>>> https://lists.apache.org/thread.html/0e5ba78c48cdd0e357f3a4a6d8affd31767c34376b62c001910823af%40%3Cdev.arrow.apache.org%3E
>>> > > > >>> >> (or see the thread '[Discuss][FlightRPC] Extensions to
>>> Flight:
>>> > > > >>> >> "DoBidirectional"' from 2019-2020). It might be good to
>>> address
>>> > > why
>>> > > > the
>>> > > > >>> >> proposed workaround there (union-of-structs) is
>>> insufficient for
>>> > > > the use
>>> > > > >>> >> cases here (and in FlightSQL).
>>> > > > >>> >>
>>> > > > >>> >> -David
>>> > > > >>> >>
>>> > > > >>> >> On Mon, Jun 21, 2021, at 08:22, Gosh Arzumanyan wrote:
>>> > > > >>> >> > Ah sorry, comments should work now.
>>> > > > >>> >> >
>>> > > > >>> >> > Cheers,
>>> > > > >>> >> > Gosh
>>> > > > >>> >> >
>>> > > > >>> >> > On Mon., 21 Jun. 2021, 14:18 David Li, <
>>> [email protected]
>>> > > > <mailto:
>>> > > > >>> >> lidavidm%40apache.org>> wrote:
>>> > > > >>> >> >
>>> > > > >>> >> > > Thanks! Will give it a look.
>>> > > > >>> >> > >
>>> > > > >>> >> > > Would you mind opening it up for comments?
>>> > > > >>> >> > >
>>> > > > >>> >> > > -David
>>> > > > >>> >> > >
>>> > > > >>> >> > > On 2021/06/21 11:56:24, Gosh Arzumanyan <
>>> [email protected]
>>> > > > <mailto:
>>> > > > >>> >> gosharz%40gmail.com>> wrote:
>>> > > > >>> >> > > > Hi folks,
>>> > > > >>> >> > > >
>>> > > > >>> >> > > > Started putting some thoughts together here:
>>> > > > >>> >> > > >
>>> > > > >>> >> > >
>>> > > > >>> >>
>>> > > >
>>> > >
>>> >
>>> https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?usp=sharing
>>> > > > >>> >> > > > Any feedback is welcome!
>>> > > > >>> >> > > >
>>> > > > >>> >> > > > Cheers,
>>> > > > >>> >> > > > Gosh
>>> > > > >>> >> > > >
>>> > > > >>> >> > >
>>> > > > >>> >> >
>>> > > > >>> >>
>>> > > > >>> >
>>> > > > >>>
>>> > > > >>
>>> > > >
>>> > >
>>> > >
>>> > > --
>>> > >
>>> > >
>>> > >
>>> >
>>>
>>

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Reply via email to