Re: [Flight Extension] Request for Comments

David Li Wed, 03 Mar 2021 15:22:53 -0800

Ah okay, thank you for clarifying! In that case, if each payload has two 
batches with different purposes - might it make sense to just make that two 
different payloads, and set a flag/enum in the metadata to indicate how to 
interpret the batch? Then you'd be officially the same as Arrow Flight :)
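
A rough sketch of that idea follows. It is not Barrage's or Flight's actual API: `BatchKind`, `Payload`, `split_payload`, and `kind_of` are hypothetical names, and the one-byte tag is only an assumption standing in for whatever flatbuffer metadata would really be used. Each payload carries a single batch plus a small metadata tag saying how to interpret it.

```python
from dataclasses import dataclass
from enum import IntEnum


class BatchKind(IntEnum):
    """Hypothetical tag telling the receiver how to interpret a batch."""
    ADDED_ROWS = 0
    MODIFIED_ROWS = 1


@dataclass
class Payload:
    """Stand-in for one Flight-style message: opaque metadata plus one batch."""
    app_metadata: bytes  # assumption: a single byte holding a BatchKind
    batch: dict          # column name -> values (placeholder for a record batch)


def split_payload(added: dict, modified: dict) -> list:
    """Turn one two-batch update into two single-batch payloads."""
    return [
        Payload(bytes([BatchKind.ADDED_ROWS]), added),
        Payload(bytes([BatchKind.MODIFIED_ROWS]), modified),
    ]


def kind_of(payload: Payload) -> BatchKind:
    """Read the tag back out of the metadata on the receiving side."""
    return BatchKind(payload.app_metadata[0])


payloads = split_payload(
    added={"sym": ["A", "B"], "price": [1.0, 2.0]},
    modified={"price": [2.5]},
)
for p in payloads:
    print(kind_of(p).name, sorted(p.batch))
```

With this shape, each message holds exactly one batch, and the receiver dispatches on the tag rather than on the position of a batch within a combined payload.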


As a side note - is said UI browser-based? Another project recently was 
planning to look at JavaScript support for Flight (using WebSockets as the 
transport, IIRC) and it might make sense to join forces if that's a path you 
were also going to pursue. 

Best,
David

On Wed, Mar 3, 2021, at 18:05, Nate Bauernfeind wrote:
> Thanks for the interest =).
> 
> > However, if I understand right, you're sending data without a fixed
> > schema [...]
> 
> The dataset does have a known schema ahead of time, which is similar to
> Flight. However, as you point out, the subscription can change which
> columns it is interested in without re-acquiring data for columns it was
> already subscribed to. This is mostly for convenience. We use it primarily
> to limit which columns are sent to our user interface until the user
> scrolls them into view.
> 
> The enhancement of the RecordBatch here, aside from the additional
> metadata, is only in that the payload has two sets of RecordBatch payloads.
> The first payload is for added rows: every added row must send data for
> each subscribed column, so, given the subscribed columns, this is otherwise
> fixed width (in the number of columns / buffers). The second payload is for
> modified rows. Here we only send the columns that have rows that are
> modified. Aside from this difference, I have been aiming to be compatible
> enough to be able to reuse the payload parsing that is already written for
> Arrow.
> 
> > I don't quite see why it couldn't be carried as metadata on the side of a
> > record batch, instead of having to duplicate the record batch structure
> > [...]
> 
> Whoa, this is a good point. I have iterated on this a few times to get it
> closer to Arrow's setup and did not realize that `BarrageData` is now
> officially identical to `FlightData`. This is an instance of being too
> close to the project and forgetting to step back once in a while.
> 
> > Flight already has a bidirectional streaming endpoint, DoExchange, that
> > allows arbitrary payloads (with mixed metadata/data or only one of the
> > two), which seems like it should be able to cover the SubscriptionRequest
> > endpoint.
> 
> This is exactly the kind of feedback I'm looking for! I wasn't seeing the
> solution where the client-side stream doesn't actually need a payload and
> the subscription changes can be described with another flatbuffer
> metadata type. I like that.
> 
> Thanks David!
> Nate
> 
> On Wed, Mar 3, 2021 at 3:28 PM David Li wrote:
> 
> > Hey Nate,
> >
> > Thanks for sharing this & for the detailed docs and writeup. I think your
> > use case is interesting, but I'd like to clarify a few things.
> >
> > I would say Arrow Flight doesn't try to impose a particular model, but I
> > agree that Barrage does things that aren't easily doable with Flight.
> > Flight does name concepts in a way that suggests how to apply it to
> > something that looks like a database, but you can mostly think of Flight as
> > an efficient way to transfer Arrow data over the network upon which you can
> > layer further semantics.
> >
> > However, if I understand right, you're sending data without a fixed
> > schema, in the sense that each BarrageRecordBatch may have only a subset of
> > the columns declared up front, or may carry new columns? I think this is
> > the main thing you can't easily do currently, as Flight (and Arrow IPC in
> > general) assumes a fixed schema (and expects all columns in a batch to have
> > the same length).
> >
> > Otherwise, the encoding for identifying rows and changes is interesting,
> > but I don't quite see why it couldn't be carried as metadata on the side of
> > a record batch, instead of having to duplicate the record batch structure,
> > except for the aforementioned schema issue. And in that case it might be
> > better to work out the schema evolution issue & any ergonomic issues with
> > Flight's existing metadata fields/API that would prevent you from using
> > them, as that way you (and we!) don't have to fully duplicate one of
> > Arrow's format definitions. Similarly, Flight already has a bidirectional
> > streaming endpoint, DoExchange, that allows arbitrary payloads (with mixed
> > metadata/data or only one of the two), which seems like it should be able
> > to cover the SubscriptionRequest endpoint.
> >
> > Best,
> > David
> >
> > On Wed, Mar 3, 2021, at 16:08, Nate Bauernfeind wrote:
> > > Hello,
> > >
> > > My colleagues at Deephaven Data Labs and I have been addressing problems at
> > > the intersection of data-driven applications, data science, and updating
> > > (/ticking) data for some years.
> > >
> > > Deephaven has a query engine that supports updating tabular data via a
> > > protocol that communicates precise changes about datasets, such as 1) which
> > > rows were removed, 2) which rows were added, 3) which rows were modified
> > > (and for which columns). We are inspired by Arrow and would like to adopt a
> > > version of this protocol that adheres to goals similar to Arrow and Arrow
> > > Flight.
> > >
> > > Out of the box, Arrow Flight is insufficient to represent such a stream of
> > > changes. For example, because you cannot identify a particular row within
> > > an Arrow Flight, you cannot indicate which rows were removed or modified.
> > >
> > > The project integrates with Arrow Flight at the header-metadata level. We
> > > have preliminarily named the project Barrage as in a "barrage of arrows"
> > > which plays in the same "namespace" as a "flight of arrows."
> > >
> > > We built this as part of an initiative to modernize and open up our table
> > > IPC mechanisms. This is part of a larger open source effort which will
> > > become more visible in the next month or so once we've finished the work
> > > necessary to share our core software components, including a unified static
> > > and real time query engine complete with data visualization tools, a REPL
> > > experience, Jupyter integration, and more.
> > >
> > > I would like to find out:
> > > - if we have understood the primary goals of Arrow, and are honoring them
> > > as closely as possible
> > > - if there are other projects that might benefit from sharing this
> > > extension of Arrow Flight
> > > - if there are any gaps that are best addressed early on to maximize future
> > > compatibility
> > >
> > > A great place to digest the concepts that differ from Arrow Flight is here:
> > > https://deephaven.github.io/barrage/Concepts.html
> > >
> > > The proposed protocol can be perused here:
> > > https://github.com/deephaven/barrage
> > >
> > > Internally, we already have a Java server and Java client implemented as a
> > > working proof of concept for our use case.
> > >
> > > I really look forward to your feedback; thank you!
> > >
> > > Nate Bauernfeind
> > >
> > > Deephaven Data Labs - https://deephaven.io/
> > > --
> > >
> >
>
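
For readers skimming the thread, the change model from the original message (rows removed, rows added, rows modified for specific columns) can be sketched roughly as below. This is only an illustration under assumptions: `apply_update`, the integer row keys, the dict-of-dicts table, and the remove/add/modify ordering are all hypothetical and are not taken from the Barrage protocol.

```python
def apply_update(table, removed, added, modified):
    """Apply one update to `table` (row key -> {column: value}) in place.

    removed:  iterable of row keys to drop
    added:    row key -> full row (a value for every subscribed column)
    modified: row key -> only the columns that changed
    The remove / add / modify order used here is an assumption.
    """
    for key in removed:
        del table[key]
    for key, row in added.items():
        table[key] = dict(row)
    for key, changes in modified.items():
        table[key].update(changes)
    return table


table = {0: {"sym": "A", "price": 1.0}, 1: {"sym": "B", "price": 2.0}}
apply_update(
    table,
    removed=[0],
    added={2: {"sym": "C", "price": 3.0}},
    modified={1: {"price": 2.5}},
)
print(table)
```

Note that added rows carry every column while modified rows carry only the changed ones, mirroring the two-batch split discussed above.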
