Thanks for the interest =).

> However, if I understand right, you're sending data without a fixed
> schema [...]

The dataset does have a known schema ahead of time, which is similar to
Flight. However, as you point out, the subscription can change which
columns it is interested in without re-acquiring data for columns it was
already subscribed to. This is mostly for convenience. We use it primarily
to limit which columns are sent to our user interface until the user
scrolls them into view.

The enhancement of the RecordBatch here, aside from the additional
metadata, is only that each message carries two sets of RecordBatch
payloads. The first payload is for added rows; every added row must send
data for each subscribed column, so for a given set of subscribed columns
this payload is fixed width (in the number of columns / buffers). The
second payload is for modified rows; here we send only the columns that
have modified rows. Aside from this difference, I have been aiming to be
compatible enough to reuse the payload parsing that is already written for
Arrow.
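
To make the two-payload shape concrete, here is a minimal sketch in plain
Python. It is only an illustration of the rule described above, not
Barrage's actual wire format; `UpdateSketch` and its field names are
hypothetical, and ordinary lists stand in for Arrow buffers.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class UpdateSketch:
    """Hypothetical stand-in for one update message (not the real format)."""

    # Columns the client is currently subscribed to.
    subscribed: List[str]
    # Added rows: one value list per subscribed column, all the same length.
    added: Dict[str, List[Any]] = field(default_factory=dict)
    # Modified rows: only the columns that actually changed appear here.
    modified: Dict[str, List[Any]] = field(default_factory=dict)

    def validate(self) -> None:
        # The added payload must cover every subscribed column exactly.
        if set(self.added) != set(self.subscribed):
            raise ValueError("added payload must cover every subscribed column")
        # All added columns must describe the same number of rows.
        if len({len(values) for values in self.added.values()}) > 1:
            raise ValueError("added columns must all have the same length")
        # The modified payload may be any subset of the subscribed columns.
        unknown = set(self.modified) - set(self.subscribed)
        if unknown:
            raise ValueError(f"modified columns not subscribed: {sorted(unknown)}")


# Two rows added (both columns sent), one existing row modified in "price" only.
update = UpdateSketch(
    subscribed=["price", "size"],
    added={"price": [1.0, 2.0], "size": [10, 20]},
    modified={"price": [3.5]},
)
update.validate()
```

The modified payload here is allowed to have fewer columns than the added
payload, which is the one place this differs from a single RecordBatch.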

> I don't quite see why it couldn't be carried as metadata on the side of a
> record batch, instead of having to duplicate the record batch structure
> [...]

Whoa, this is a good point. I have iterated on this a few times to get it
closer to Arrow's setup and did not realize that `BarrageData` is now
officially identical to `FlightData`. This is an instance of being too
close to the project and forgetting to step back once in a while.

> Flight already has a bidirectional streaming endpoint, DoExchange, that
> allows arbitrary payloads (with mixed metadata/data or only one of the
> two), which seems like it should be able to cover the SubscriptionRequest
> endpoint.

This is exactly the kind of feedback I'm looking for! I wasn't seeing the
solution where the client-side stream doesn't actually need a payload and
the subscription changes can be described with another flatbuffer metadata
type. I like that.
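
As a rough sketch of that idea (my own illustration, not an agreed design):
each client-side message carries only metadata and no data body. JSON
stands in for the flatbuffer type here so the example stays self-contained,
and `ClientMessage`, `subscription_change`, and `read_subscription` are
hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClientMessage:
    """Hypothetical client-to-server message on the exchange stream."""

    # Application metadata describing the subscription change.
    app_metadata: bytes
    # No record batch body is needed on the client side of the stream.
    data_body: Optional[bytes] = None


def subscription_change(columns: List[str]) -> ClientMessage:
    # JSON is an assumption for this sketch; the idea above is a flatbuffer type.
    payload = json.dumps({"columns": sorted(columns)}).encode("utf-8")
    return ClientMessage(app_metadata=payload)


def read_subscription(message: ClientMessage) -> List[str]:
    # Server side: recover the requested columns from the metadata alone.
    return json.loads(message.app_metadata.decode("utf-8"))["columns"]


msg = subscription_change(["size", "price"])
assert msg.data_body is None
assert read_subscription(msg) == ["price", "size"]
```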

Thanks David!
Nate

On Wed, Mar 3, 2021 at 3:28 PM David Li <lidav...@apache.org> wrote:

> Hey Nate,
>
> Thanks for sharing this & for the detailed docs and writeup. I think your
> use case is interesting, but I'd like to clarify a few things.
>
> I would say Arrow Flight doesn't try to impose a particular model, but I
> agree that Barrage does things that aren't easily doable with Flight.
> Flight does name concepts in a way that suggests how to apply it to
> something that looks like a database, but you can mostly think of Flight as
> an efficient way to transfer Arrow data over the network upon which you can
> layer further semantics.
>
> However, if I understand right, you're sending data without a fixed
> schema, in the sense that each BarrageRecordBatch may have only a subset of
> the columns declared up front, or may carry new columns? I think this is
> the main thing you can't easily do currently, as Flight (and Arrow IPC in
> general) assumes a fixed schema (and expects all columns in a batch to have
> the same length).
>
> Otherwise, the encoding for identifying rows and changes is interesting,
> but I don't quite see why it couldn't be carried as metadata on the side of
> a record batch, instead of having to duplicate the record batch structure,
> except for the aforementioned schema issue. And in that case it might be
> better to work out the schema evolution issue & any ergonomic issues with
> Flight's existing metadata fields/API that would prevent you from using
> them, as that way you (and we!) don't have to fully duplicate one of
> Arrow's format definitions. Similarly, Flight already has a bidirectional
> streaming endpoint, DoExchange, that allows arbitrary payloads (with mixed
> metadata/data or only one of the two), which seems like it should be able
> to cover the SubscriptionRequest endpoint.
>
> Best,
> David
>
> On Wed, Mar 3, 2021, at 16:08, Nate Bauernfeind wrote:
> > Hello,
> >
> > My colleagues at Deephaven Data Labs and I have been addressing
> > problems at the intersection of data-driven applications, data
> > science, and updating (/ticking) data for some years.
> >
> > Deephaven has a query engine that supports updating tabular data via a
> > protocol that communicates precise changes about datasets, such as 1)
> > which rows were removed, 2) which rows were added, 3) which rows were
> > modified (and for which columns). We are inspired by Arrow and would
> > like to adopt a version of this protocol that adheres to goals similar
> > to Arrow and Arrow Flight.
> >
> > Out of the box, Arrow Flight is insufficient to represent such a stream
> > of changes. For example, because you cannot identify a particular row
> > within an Arrow Flight, you cannot indicate which rows were removed or
> > modified.
> >
> > The project integrates with Arrow Flight at the header-metadata level. We
> > have preliminarily named the project Barrage as in a "barrage of arrows"
> > which plays in the same "namespace" as a "flight of arrows."
> >
> > We built this as part of an initiative to modernize and open up our table
> > IPC mechanisms. This is part of a larger open source effort which will
> > become more visible in the next month or so once we've finished the work
> > necessary to share our core software components, including a unified
> > static and real time query engine complete with data visualization
> > tools, a REPL experience, Jupyter integration, and more.
> >
> > I would like to find out:
> > - if we have understood the primary goals of Arrow, and are honoring them
> > as closely as possible
> > - if there are other projects that might benefit from sharing this
> > extension of Arrow Flight
> > - if there are any gaps that are best addressed early on to maximize
> > future compatibility
> >
> > A great place to digest the concepts that differ from Arrow Flight is
> > here:
> > https://deephaven.github.io/barrage/Concepts.html
> >
> > The proposed protocol can be perused here:
> > https://github.com/deephaven/barrage
> >
> > Internally, we already have a java server and java client implemented
> > as a working proof of concept for our use case.
> >
> > I really look forward to your feedback; thank you!
> >
> > Nate Bauernfeind
> >
> > Deephaven Data Labs - https://deephaven.io/
> > --
> >
>
