Sorry for the second reply:
> 1. In our case we do expect relatively frequent changes in the schema > of the batch being sent out. I don't see that pattern changing in the mid > term for a good reason. However long term maybe it will be possible to > leverage separate RPC calls. I left some description in the comments thread > here > > <https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?disco=AAAAMkkJ9xQ> > . > > I think we should clarify between two types of changes: 1. Completely new unique schema needed. 2. Relatively frequent switches between a few known schemas. On Fri, Jun 25, 2021 at 2:58 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > >> 1. Re complexity of "one schema at a time" vs "schema id based": i >> think they are not much different, right? In fact the second one is more >> of >> an optimization to the first one which is beneficial to us. Anyways even >> with the first approach you need to add some logic of schema change. >> Adding >> the schema identification is not that complex at first glance. >> >> The second approach requires more state being maintained on the client > side and makes it more difficult to bring schema evolution back into the > IPC Stream format (i.e. it would live only in flight) without additional > changes. > > In both cases there are also potentially thorny issues related to > dictionary usages that we will need to resolve. I haven't thought deeply > about it but it seems that these might be more complex on a multiplexed > channel. > > > On Fri, Jun 25, 2021 at 11:52 AM Gosh Arzumanyan <gosh...@gmail.com> > wrote: > >> Hi Micah, >> >> Sure, let me do it here: >> >> 1. In our case we do expect relatively frequent changes in the schema >> of the batch being sent out. I don't see that pattern changing in the mid >> term for a good reason. However long term maybe it will be possible to >> leverage separate RPC calls. I left some description in the comments >> thread >> here >> >> <https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?disco=AAAAMkkJ9xQ> >> . >> 2. Regarding the union of structs trick: while it is a good >> workaround for most of the cases as of now there are also some aspects >> which deserve consideration: >> 1. There is a space overhead >> 2. There is an additional glue code to make it work >> 3. Considering that this solution is bubbling up frequently in >> similar contexts, seems like users might benefit if it was "natively" >> supported and they could just focus on populating the schemas they >> really >> need to. >> 3. Re complexity of "one schema at a time" vs "schema id based": i >> think they are not much different, right? In fact the second one is more >> of >> an optimization to the first one which is beneficial to us. Anyways even >> with the first approach you need to add some logic of schema change. >> Adding >> the schema identification is not that complex at first glance. >> >> Cheers, >> Gosh >> >> On Fri, Jun 25, 2021 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> >>> > >>> > 1. It seems like renaming stream_id to schema_id and delegating >>> "logical >>> > stream" distinction to app_metadata mitigates the "multiplexing" point >>> > while at the same time it gives enough flexibility to address both >>> Nate's >>> > and our use cases. >>> >>> >>> I don't think this is the case. It seems that having no additional >>> fields >>> added and then sending a new schema when necessary combined with Union of >>> Structs would solve most use-cases. The main downside could be potential >>> performance implications if the schema is changing frequently. Gosh, >>> could >>> you address why this wouldn't be sufficient (either here or on the doc). >>> >>> Thanks, >>> -Micah >>> >>> On Fri, Jun 25, 2021 at 5:30 AM Gosh Arzumanyan <gosh...@gmail.com> >>> wrote: >>> >>> > Hi guys, >>> > >>> > Thanks for sharing your insights/concerns! I also left some comments >>> based >>> > on the discussion we had. Briefly: >>> > >>> > 1. It seems like renaming stream_id to schema_id and delegating >>> "logical >>> > stream" distinction to app_metadata mitigates the "multiplexing" point >>> > while at the same time it gives enough flexibility to address both >>> Nate's >>> > and our use cases. >>> > 2. To David's point about other transports: in fact currently we are >>> using >>> > other transports(aside from gRPC) so we don't wanna depend on only gRPC >>> > features. >>> > >>> > Cheers, >>> > Gosh >>> > >>> > On Wed, Jun 23, 2021 at 10:40 PM David Li <lidav...@apache.org> wrote: >>> > >>> > > Thanks for chiming in - I've replied in the doc. Scoping it to just >>> > schema >>> > > evolution would be preferable, but I'm not sure if Gosh's usecase >>> > requires >>> > > more flexibility than that or not. >>> > > >>> > > Again, though, given that 1) gRPC recycles a connection, so repeated >>> > calls >>> > > aren't necessarily expensive and 2) encoding tricks like >>> > union-of-structs, >>> > > any solution needs to be weighed against those/we should make sure to >>> > > document why they aren't sufficient. (For instance, 1) is hampered >>> by the >>> > > use of L7 load balancers and/or client-side load balancing policies >>> in >>> > gRPC >>> > > and assumes statefulness which is undesirable in general. There's >>> also >>> > the >>> > > eventual desire to have a transport besides gRPC someday.) >>> > > >>> > > -David >>> > > >>> > > On Wed, Jun 23, 2021, at 16:24, Nate Bauernfeind wrote: >>> > > >>> > > Thanks for writing this up! I added a few general comments, but have >>> a >>> > > question on the approach because it's not quite what I was expecting. >>> > > >>> > > I am slightly concerned that the proposal looks more like support for >>> > > "multiplexing" IPC streams into a single RPC stream rather than >>> support >>> > for >>> > > a changing Schema of an otherwise consistently logical stream. gRPC >>> > already >>> > > does a good job decoupling RPC streams from one another. I feel that >>> > > throwing that idea into the IPC stream increases client-library >>> > > implementation cost by quite a lot. >>> > > >>> > > Why is it not good enough to replace the Schema when we see a >>> duplicate? >>> > > This is undoubtedly less work across all client implementations. >>> > > >>> > > The benefit I see is that you might have two schemas that you swap >>> > between >>> > > frequently then you can indicate with a single integer. If that's >>> what >>> > you >>> > > want to support I would rather think of them as `schema_id` instead >>> of >>> > > `stream_id` and not give this impression that multiplexing is a >>> goal. As >>> > > you have proposed, it seems that the "done writing for a stream" >>> needs a >>> > > callback notifying the user receiving the stream that a logical >>> subset of >>> > > the flight is complete. Alternatively, if they aren't independent >>> streams >>> > > (to the end-user), we could tell the Arrow layer that a particular >>> schema >>> > > is no longer needed without also needing to communicate further >>> > downstream. >>> > > >>> > > On Wed, Jun 23, 2021 at 1:39 PM David Li <lidav...@apache.org> >>> wrote: >>> > > >>> > > > Ah to be clear, the API is indeed inconsistent - DoExchange was >>> added >>> > > some >>> > > > time later (and by its nature returning a FlightDataStream would >>> not >>> > have >>> > > > been possible, since it's meant to be able to interleave >>> > > reading/writing). >>> > > > But really, DoGet is indeed the odd one out in the C++ API and it >>> may >>> > be >>> > > > worth correcting. You could also perhaps imagine making a >>> > > FlightDataStream >>> > > > implementation that accepts a closure and provides it a fake >>> writer, if >>> > > the >>> > > > API mismatch is hard to work with... >>> > > > >>> > > > That said: this has some benefits, e.g. for a Python service that >>> > returns >>> > > > a Table, that means data can be fed into gRPC entirely in C++ >>> without >>> > > > having to bounce into Python for each chunk. >>> > > > >>> > > > Best, >>> > > > David >>> > > > >>> > > > On Wed, Jun 23, 2021, at 15:33, Gosh Arzumanyan wrote: >>> > > > > Hi David, >>> > > > > >>> > > > > Got you. In fact I was looking at this more from the point of >>> view of >>> > > > consistency of the API in terms of "inputs" and thought DoExchange >>> is >>> > > kind >>> > > > of a DoGet+ so might make sense to have the same classes being >>> utilized >>> > > in >>> > > > both places. But again, I might be missing something and I get the >>> > point >>> > > > about breaking change. >>> > > > > >>> > > > > Cheers, >>> > > > > Gosh >>> > > > > >>> > > > > On Wed, Jun 23, 2021 at 2:58 PM David Li <lidav...@apache.org> >>> > wrote: >>> > > > >> __ >>> > > > >> It's mostly a quirk of implementation (and just for >>> clarification, >>> > > > they're all nearly identical on the format/protocol level). >>> > > > >> >>> > > > >> DoGet is conceptualized as your application returning a readable >>> > > stream >>> > > > of batches, instead of your application imperatively writing >>> batches to >>> > > the >>> > > > client. (This is different than how Flight is implemented in >>> Java.) You >>> > > > would normally not implement FlightDataStream - you would return a >>> > > > RecordBatchStream. >>> > > > >> >>> > > > >> DoGet could not have FlightMessageWriter as a return type as >>> that >>> > > > wouldn't make sense, but it could accept an instance of that as a >>> > > parameter >>> > > > instead, much like DoExchange. That would be a breaking change. >>> > > > >> >>> > > > >> Best, >>> > > > >> David >>> > > > >> >>> > > > >> On Wed, Jun 23, 2021, at 08:47, Gosh Arzumanyan wrote: >>> > > > >>> Hi David, >>> > > > >>> >>> > > > >>> Going through the ArrowFlight API: got confused a bit on DoGet >>> and >>> > > > >>> DoPut/DoExachange apis: the former one expects FlightDataStream >>> > which >>> > > > talks >>> > > > >>> in already serialized message terms while the latter to >>> > > > >>> accept FlightMessageReader/Writer which expect the user to >>> pass in >>> > > > >>> RecordBatches etc. Is there any reason why the DoGet can't have >>> > > > >>> FlightMessageWriter as a return type? >>> > > > >>> >>> > > > >>> Cheers, >>> > > > >>> Gosh >>> > > > >>> >>> > > > >>> On Mon, Jun 21, 2021 at 9:47 PM Gosh Arzumanyan < >>> gosh...@gmail.com >>> > > >>> > > > wrote: >>> > > > >>> >>> > > > >>> > Thanks David! >>> > > > >>> > >>> > > > >>> > I also responded/added more suggestions/questions to the >>> doc. I >>> > > > think it >>> > > > >>> > makes sense to have two sections: one purely protocol >>> oriented >>> > and >>> > > > second >>> > > > >>> > API oriented(examples in c++ or in any other language should >>> make >>> > > > the idea >>> > > > >>> > easier to digest). >>> > > > >>> > >>> > > > >>> > Thanks for the reference too! >>> > > > >>> > >>> > > > >>> > Cheers, >>> > > > >>> > Gosh >>> > > > >>> > >>> > > > >>> > On Mon, Jun 21, 2021 at 4:41 PM David Li < >>> lidav...@apache.org> >>> > > > wrote: >>> > > > >>> > >>> > > > >>> >> Thanks! I've left some initial comments/suggestions to >>> expand it >>> > > in >>> > > > terms >>> > > > >>> >> of the format definitions and not the C++ APIs. >>> > > > >>> >> >>> > > > >>> >> I'll also note something like this was proposed a long time >>> ago >>> > - >>> > > > there's >>> > > > >>> >> not very much discussion about it there but for reference: >>> > > > >>> >> >>> > > > >>> > > >>> > >>> https://lists.apache.org/thread.html/0e5ba78c48cdd0e357f3a4a6d8affd31767c34376b62c001910823af%40%3Cdev.arrow.apache.org%3E >>> > > > >>> >> (or see the thread '[Discuss][FlightRPC] Extensions to >>> Flight: >>> > > > >>> >> "DoBidirectional"' from 2019-2020). It might be good to >>> address >>> > > why >>> > > > the >>> > > > >>> >> proposed workaround there (union-of-structs) is >>> insufficient for >>> > > > the use >>> > > > >>> >> cases here (and in FlightSQL). >>> > > > >>> >> >>> > > > >>> >> -David >>> > > > >>> >> >>> > > > >>> >> On Mon, Jun 21, 2021, at 08:22, Gosh Arzumanyan wrote: >>> > > > >>> >> > Ah sorry, comments should work now. >>> > > > >>> >> > >>> > > > >>> >> > Cheers, >>> > > > >>> >> > Gosh >>> > > > >>> >> > >>> > > > >>> >> > On Mon., 21 Jun. 2021, 14:18 David Li, < >>> lidav...@apache.org >>> > > > <mailto: >>> > > > >>> >> lidavidm%40apache.org>> wrote: >>> > > > >>> >> > >>> > > > >>> >> > > Thanks! Will give it a look. >>> > > > >>> >> > > >>> > > > >>> >> > > Would you mind opening it up for comments? >>> > > > >>> >> > > >>> > > > >>> >> > > -David >>> > > > >>> >> > > >>> > > > >>> >> > > On 2021/06/21 11:56:24, Gosh Arzumanyan < >>> gosh...@gmail.com >>> > > > <mailto: >>> > > > >>> >> gosharz%40gmail.com>> wrote: >>> > > > >>> >> > > > Hi folks, >>> > > > >>> >> > > > >>> > > > >>> >> > > > Started putting some thoughts together here: >>> > > > >>> >> > > > >>> > > > >>> >> > > >>> > > > >>> >> >>> > > > >>> > > >>> > >>> https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?usp=sharing >>> > > > >>> >> > > > Any feedback is welcome! >>> > > > >>> >> > > > >>> > > > >>> >> > > > Cheers, >>> > > > >>> >> > > > Gosh >>> > > > >>> >> > > > >>> > > > >>> >> > > >>> > > > >>> >> > >>> > > > >>> >> >>> > > > >>> > >>> > > > >>> >>> > > > >> >>> > > > >>> > > >>> > > >>> > > -- >>> > > >>> > > >>> > > >>> > >>> >>