Ah okay, thank you for clarifying! In that case, if each payload has two batches with different purposes - might it make sense to just make that two different payloads, and set a flag/enum in the metadata to indicate how to interpret the batch? Then you'd be officially the same as Arrow Flight :)
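To make the suggestion above concrete, here is a minimal sketch in plain Python (no Arrow dependency) of splitting one two-part update into two separately tagged messages. The names `BatchKind`, `TaggedBatch`, and `split_update` are hypothetical and exist only for illustration; they are not part of Barrage or Arrow Flight, and in a real Flight stream the tag would presumably travel in the message's application metadata rather than in a Python object.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class BatchKind(Enum):
    """Hypothetical flag telling the reader how to interpret a batch."""
    ADDED = 1
    MODIFIED = 2


@dataclass
class TaggedBatch:
    kind: BatchKind                  # would be carried as metadata beside the batch
    columns: Dict[str, List[Any]]    # column name -> column values


def split_update(added: Dict[str, List[Any]],
                 modified: Dict[str, List[Any]]) -> List[TaggedBatch]:
    """Turn one update holding two batches into two tagged messages.

    Empty parts are skipped here; that is a choice made for this sketch,
    not something the thread prescribes.
    """
    messages = []
    if added:
        messages.append(TaggedBatch(BatchKind.ADDED, added))
    if modified:
        messages.append(TaggedBatch(BatchKind.MODIFIED, modified))
    return messages


# Added rows carry every subscribed column; modified rows carry only
# the columns that actually changed.
messages = split_update(
    added={"sym": ["A", "B"], "px": [1.0, 2.0]},
    modified={"px": [2.5]},
)
for message in messages:
    print(message.kind.name, sorted(message.columns))
```

Each message then has a single, ordinary batch shape, and the reader dispatches on the flag instead of parsing a payload with two batch sections.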
As a side note - is said UI browser-based? Another project recently was planning to look at JavaScript support for Flight (using WebSockets as the transport, IIRC) and it might make sense to join forces if that's a path you were also going to pursue.

Best,
David

On Wed, Mar 3, 2021, at 18:05, Nate Bauernfeind wrote:
> Thanks for the interest =).
>
> > However, if I understand right, you're sending data without a fixed
> > schema [...]
>
> The dataset does have a known schema ahead of time, which is similar to
> Flight. However, as you point out, the subscription can change which
> columns it is interested in without re-acquiring data for columns it was
> already subscribed to. This is mostly for convenience. We use it primarily
> to limit which columns are sent to our user interface until the user
> scrolls them into view.
>
> The enhancement of the RecordBatch here, aside from the additional
> metadata, is only in that the payload has two sets of RecordBatch payloads.
> The first payload is for added rows, every added row must send data for
> each column subscribed; based on the subscribed columns this is otherwise
> fixed width (in the number of columns / buffers). The second payload is for
> modified rows. Here we only send the columns that have rows that are
> modified. Aside from this difference, I have been aiming to be compatible
> enough to be able to reuse the payload parsing that is already written for
> Arrow.
>
> > I don't quite see why it couldn't be carried as metadata on the side of a
> > record batch, instead of having to duplicate the record batch structure
> > [...]
>
> Whoa, this is a good point. I have iterated on this a few times to get it
> closer to Arrow's setup and did not realize that 'BarrageData' is now
> officially identical to `FlightData`. This is an instance of being too
> close to the project and forgetting to step back once in a while.
>
> > Flight already has a bidirectional streaming endpoint, DoExchange, that
> > allows arbitrary payloads (with mixed metadata/data or only one of the
> > two), which seems like it should be able to cover the SubscriptionRequest
> > endpoint.
>
> This is exactly the kind of feedback I'm looking for! I wasn't seeing the
> solution where the client-side stream doesn't actually need payload and
> that the subscription changes can be described with another flatbuffer
> metadata type. I like that.
>
> Thanks David!
> Nate
>
> On Wed, Mar 3, 2021 at 3:28 PM David Li <lidav...@apache.org> wrote:
>
> > Hey Nate,
> >
> > Thanks for sharing this & for the detailed docs and writeup. I think your
> > use case is interesting, but I'd like to clarify a few things.
> >
> > I would say Arrow Flight doesn't try to impose a particular model, but I
> > agree that Barrage does things that aren't easily doable with Flight.
> > Flight does name concepts in a way that suggests how to apply it to
> > something that looks like a database, but you can mostly think of Flight as
> > an efficient way to transfer Arrow data over the network upon which you can
> > layer further semantics.
> >
> > However, if I understand right, you're sending data without a fixed
> > schema, in the sense that each BarrageRecordBatch may have only a subset of
> > the columns declared up front, or may carry new columns? I think this is
> > the main thing you can't easily do currently, as Flight (and Arrow IPC in
> > general) assumes a fixed schema (and expects all columns in a batch to have
> > the same length).
> >
> > Otherwise, the encoding for identifying rows and changes is interesting,
> > but I don't quite see why it couldn't be carried as metadata on the side of
> > a record batch, instead of having to duplicate the record batch structure,
> > except for the aforementioned schema issue.
> > And in that case it might be
> > better to work out the schema evolution issue & any ergonomic issues with
> > Flight's existing metadata fields/API that would prevent you from using
> > them, as that way you (and we!) don't have to fully duplicate one of
> > Arrow's format definitions. Similarly, Flight already has a bidirectional
> > streaming endpoint, DoExchange, that allows arbitrary payloads (with mixed
> > metadata/data or only one of the two), which seems like it should be able
> > to cover the SubscriptionRequest endpoint.
> >
> > Best,
> > David
> >
> > On Wed, Mar 3, 2021, at 16:08, Nate Bauernfeind wrote:
> > > Hello,
> > >
> > > My colleagues at Deephaven Data Labs and I have been addressing problems at
> > > the intersection of data-driven applications, data science, and updating
> > > (/ticking) data for some years.
> > >
> > > Deephaven has a query engine that supports updating tabular data via a
> > > protocol that communicates precise changes about datasets, such as 1) which
> > > rows were removed, 2) which rows were added, 3) which rows were modified
> > > (and for which columns). We are inspired by Arrow and would like to adopt a
> > > version of this protocol that adheres to goals similar to Arrow and Arrow
> > > Flight.
> > >
> > > Out of the box, Arrow Flight is insufficient to represent such a stream of
> > > changes. For example, because you cannot identify a particular row within
> > > an Arrow Flight, you cannot indicate which rows were removed or modified.
> > >
> > > The project integrates with Arrow Flight at the header-metadata level. We
> > > have preliminarily named the project Barrage as in a "barrage of arrows"
> > > which plays in the same "namespace" as a "flight of arrows."
> > >
> > > We built this as part of an initiative to modernize and open up our table
> > > IPC mechanisms.
> > > This is part of a larger open source effort which will
> > > become more visible in the next month or so once we've finished the work
> > > necessary to share our core software components, including a unified static
> > > and real time query engine complete with data visualization tools, a REPL
> > > experience, Jupyter integration, and more.
> > >
> > > I would like to find out:
> > > - if we have understood the primary goals of Arrow, and are honoring them
> > >   as closely as possible
> > > - if there are other projects that might benefit from sharing this
> > >   extension of Arrow Flight
> > > - if there are any gaps that are best addressed early on to maximize future
> > >   compatibility
> > >
> > > A great place to digest the concepts that differ from Arrow Flight is here:
> > > https://deephaven.github.io/barrage/Concepts.html
> > >
> > > The proposed protocol can be perused here:
> > > https://github.com/deephaven/barrage
> > >
> > > Internally, we already have a Java server and Java client implemented as a
> > > working proof of concept for our use case.
> > >
> > > I really look forward to your feedback; thank you!
> > >
> > > Nate Bauernfeind
> > >
> > > Deephaven Data Labs - https://deephaven.io/
> > > --
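
For readers following the thread, the change model in the original message (removed rows, added rows, and modified rows per column) can be pictured with a small plain-Python sketch. It is only an illustration under loud assumptions: rows are identified by an integer key, the table is a dictionary of rows, and `apply_update` is a hypothetical helper. None of this reflects Barrage's actual wire format or API.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]      # column name -> value
Table = Dict[int, Row]    # assumed integer row key -> row


def apply_update(table: Table,
                 removed: Iterable[int],
                 added: Dict[int, Row],
                 modified: Dict[str, Dict[int, Any]]) -> Table:
    """Apply one update in place: removals, then additions, then modifications."""
    # 1) Rows that were removed.
    for key in removed:
        table.pop(key, None)
    # 2) Rows that were added; each carries every subscribed column.
    for key, row in added.items():
        table[key] = dict(row)
    # 3) Modified rows, grouped by column so that only the columns
    #    that actually changed need to be sent.
    for column, changes in modified.items():
        for key, value in changes.items():
            table[key][column] = value
    return table


table = {0: {"sym": "A", "px": 1.0}, 1: {"sym": "B", "px": 2.0}}
apply_update(
    table,
    removed=[0],
    added={2: {"sym": "C", "px": 3.0}},
    modified={"px": {1: 2.5}},
)
print(table)
```

The sketch also shows why row identity matters: without a stable key per row, neither the removals nor the per-column modifications could be addressed to a particular row.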