Hopefully this thread isn't too stale to pick back up with an open-ended question. What interface would a Barrage client library expose? With Flight, application code cares about RecordBatches, but with Barrage it seems as though a client library ought to handle updating the table and expose that updated view to the client application. But what, specifically, would that view be?
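
To make the question concrete, here is a rough Go sketch of one shape such a view could take. It is only an illustration: `Row`, `Update`, `Table`, `Apply`, and `Snapshot` are names made up for this example and are not taken from Barrage or from Nate's documentation. It assumes the library has already decoded an incremental update into added, removed, and modified rows, and that removals are applied first, then additions, then modifications. The one property it tries to show is that an update is applied as a unit, so a reader never sees added rows without the matching modifications.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Row holds one row's column values keyed by column name (hypothetical type).
type Row map[string]any

// Update is a hypothetical, already-decoded incremental update. The fields
// echo the added/removed/modified metadata discussed in this thread; they are
// not the actual Barrage flatbuffer definitions.
type Update struct {
	Added    map[int64]Row // full rows for newly added row keys
	Removed  []int64       // row keys to drop
	Modified map[int64]Row // only the changed columns of existing rows
}

// Table is the client-side view that application code would read.
type Table struct {
	mu   sync.RWMutex
	rows map[int64]Row
}

func NewTable() *Table { return &Table{rows: map[int64]Row{}} }

// Apply builds the next state on a copy and publishes it in one step, so
// readers never observe a half-applied update.
func (t *Table) Apply(u Update) {
	t.mu.Lock()
	defer t.mu.Unlock()
	next := make(map[int64]Row, len(t.rows)+len(u.Added))
	for k, r := range t.rows {
		next[k] = r
	}
	for _, k := range u.Removed {
		delete(next, k)
	}
	for k, r := range u.Added {
		next[k] = r
	}
	for k, changed := range u.Modified {
		old, ok := next[k]
		if !ok {
			continue // assumption: ignore modifications to rows we do not hold
		}
		merged := make(Row, len(old))
		for c, v := range old {
			merged[c] = v
		}
		for c, v := range changed {
			merged[c] = v
		}
		next[k] = merged
	}
	t.rows = next
}

// Snapshot returns the row keys in ascending order together with the rows.
// Published maps are never mutated by Apply, so the result stays consistent.
func (t *Table) Snapshot() ([]int64, map[int64]Row) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	keys := make([]int64, 0, len(t.rows))
	for k := range t.rows {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	return keys, t.rows
}

func main() {
	tbl := NewTable()
	tbl.Apply(Update{Added: map[int64]Row{
		1: {"sym": "A", "px": 1.0},
		2: {"sym": "B", "px": 2.0},
	}})
	tbl.Apply(Update{Removed: []int64{1}, Modified: map[int64]Row{2: {"px": 2.5}}})
	keys, rows := tbl.Snapshot()
	fmt.Println(keys, rows[2])
}
```

A row-oriented map is the simplest thing to reason about here; a real library would more likely keep Arrow column buffers and expose the snapshot as a RecordBatch, which is part of what I'm asking about.
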
In the last few months I've built out some Flight services that would benefit from a protocol like Barrage, and it renewed my interest enough to casually start a Go implementation based on Nate's documentation, just as a way of wrapping my head around the problem. I was watching the repo Nate shared, which ultimately led to the Java implementation embedded in Deephaven's open source offering, but since that is part of a larger application, it's a little hard to tell where the lines would be drawn.

Paul

On Tue, Mar 9, 2021 at 9:45 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> > As for schema evolution, I agree with what Micah proposes as a first step. That would again add some overhead, perhaps. As for feasibility, at least on the C++/Python side, I think there would be a decent amount of refactoring needed, and there's also the question of how to expose this in the API - the APIs there are based on reader/writer interfaces that don't expose schema evolution.
>
> One more option, which might be too slow: if a schema change is necessary, a new Flight endpoint is communicated and a new RPC is used? (Reusing the same underlying channel could mitigate some performance issues here.)
>
> On Tue, Mar 9, 2021 at 3:17 PM David Li <lidav...@apache.org> wrote:
>
> > There's not really any convention for the app_metadata field or any of the other application-defined fields (e.g. DoAction, Criteria). That said, I wouldn't necessarily worry about conflicting with other projects - if a client connects to a Barrage service, presumably it knows what to expect. And an arbitrary Flight client connecting to an arbitrary Flight server isn't really something we've thought about.
> > For instance, see the Flight SQL proposal on this mailing list, which similarly defines expected message formats and schemas for various fields - but doesn't provide any sort of reflection or way for a completely generic client to discover what's going on from first principles. (There is no OpenAPI/Swagger for Flight!)
> >
> > As for schema evolution, I agree with what Micah proposes as a first step. That would again add some overhead, perhaps. As for feasibility, at least on the C++/Python side, I think there would be a decent amount of refactoring needed, and there's also the question of how to expose this in the API - the APIs there are based on reader/writer interfaces that don't expose schema evolution.
> >
> > It may be cleaner on the Java side given you've poked there already. That said, even if the Flight API is flexible but not so convenient, presumably part of the value of Barrage is to take that and present a clean interface with a stable schema again.
> >
> > Best,
> > David
> >
> > On Tue, Mar 9, 2021, at 00:03, Micah Kornfield wrote:
> > > > You know what? This is actually a nicer solution than I am giving it credit for. I've been trying to think about how to handle the Integer.MAX_VALUE limit that arrow strongly suggests to maintain compatibility with Java, while still respecting the need to apply an update atomically.
> > >
> > > For Flight, the constraint actually is a maximum of a 32-bit length payload (I don't recall exactly if it is 2GB or 4GB but either way, you are probably going to run into issues sending a single payload anywhere near that large).
> > >
> > > > Are you suggesting this pattern of messages per incremental update?
> > > > - FlightData with [the new] metadata header that includes added/removed/modified information, the number of add record batches, and the number of modified record batches.
> > > > Noting that there could be more than one record batch per added or modified to enable serializing more than 2^31-1 rows in a single update. Also noting that it would have an empty body (similar to Schema).
> > > > - A set of FlightData record batches using the normal RecordBatch flatbuffer.
> > > > - A set of FlightData record batches also using the normal RecordBatch flatbuffer.
> > >
> > > I haven't thought about this too deeply. I think depending on recovery needs it could differ. One place to start is to avoid an extra metadata message, and just have a marker bit indicating there are more messages coming that are required to be in this transaction, and another bit/value indicating end of transaction.
> > >
> > > > My biggest concern with this approach is that small updates are likely going to have significant overhead. Maybe it won't matter, but it is the first thing that jumps out. We do typically coalesce updates somewhere between 50ms and 1s depending on the sensitivity of the listener; so maybe that's enough to eliminate my concern. I might just need to get data/statistics to get a better feeling for this concern.
> > >
> > > I think this is definitely something to measure. I wouldn't expect the performance differential to be that large.
> > >
> > > > Regarding the schema evolution idea:
> > > > What can I do to get started? Does it make sense to target the feature as a new field in the protobuf so that it can be used in contexts with other header metadata types? Do you have time to riff on the format that will apply to the other contexts? I believe all I would need is a bitset identifying which columns are included, but if enabling/disabling features is a nice-to-have then a bitset is going to be a bit weak.
> > > > I can also, for now, cheat and send empty field nodes and empty buffers for those columns (but I am, already, slightly concerned with overhead).
> > >
> > > I think David might be able to give more guidance. My recollection of the library specifics is hazy, but I think we could potentially just interpret a new schema arriving as indicating all record batches after that schema would follow the new schema. Would that work for your use case? David would probably be able to give guidance on how feasible a change like that would be. Typically, before we officially alter the specification we want to see working implementations in Java and C++ that pass an integration test. But I think we can figure out the specifics here if we can understand concrete requirements.
> > >
> > > On Mon, Mar 8, 2021 at 6:42 PM Nate Bauernfeind <natebauernfe...@deephaven.io> wrote:
> > >
> > > > > note that FlightData already has a separate app_metadata field
> > > >
> > > > That is an interesting point; are there any conventions on how to use the app_metadata compatibly without stepping on other ideas/projects doing the same? It would be convenient for the server to verify that the client is making the request that the server interprets. Do projects use a magic number prefix? Or possibly is there some sort of common header? I suspect that other projects may benefit from having the ability to publish incremental updates, too. So, I'm just curious if there is any pre-existing domain-knowledge in this respect.
> > > >
> > > > Nate
> > > >
> > > > On Mon, Mar 8, 2021 at 1:55 PM David Li <lidav...@apache.org> wrote:
> > > >
> > > > > Hey - pretty much, I think.
> > > > > I'd just like to note that FlightData already has a separate app_metadata field, for metadata on top of any Arrow-level data, so you could ship the Barrage metadata alongside the first record batch, without having to modify anything about the record batch itself, and without having to define a new metadata header at the Arrow level - everything could be implemented on top of the existing definitions.
> > > > >
> > > > > David
> > > > >
> > > > > On Sat, Mar 6, 2021, at 01:07, Nate Bauernfeind wrote:
> > > > > > Eww. I didn't specify why I had two sets of record batches. Slightly revised:
> > > > > >
> > > > > > Are you suggesting this pattern of messages per incremental update?
> > > > > > - FlightData with [the new] metadata header that includes added/removed/modified information, the number of add record batches, and the number of modified record batches. Noting that there could be more than one record batch per added or modified to enable serializing more than 2^31-1 rows in a single update. Also noting that it would have an empty body (similar to Schema).
> > > > > > - A set of FlightData record batches using the normal RecordBatch flatbuffer for added rows.
> > > > > > - A set of FlightData record batches also using the normal RecordBatch flatbuffer for modified rows.
> > > > > >
> > > > > > On Fri, Mar 5, 2021 at 11:00 PM Nate Bauernfeind <natebauernfe...@deephaven.io> wrote:
> > > > > >
> > > > > > > > It seems that atomic application could also be something controlled in metadata (i.e. this is batch 1 or X)?
> > > > > > >
> > > > > > > You know what? This is actually a nicer solution than I am giving it credit for.
> > > > > > > I've been trying to think about how to handle the Integer.MAX_VALUE limit that arrow strongly suggests to maintain compatibility with Java, while still respecting the need to apply an update atomically.
> > > > > > >
> > > > > > > Alright, yeah, I'm game with this approach.
> > > > > > >
> > > > > > > > Right - presumably this could go in the Flight metadata instead of having to be inlined into the batch's metadata.
> > > > > > >
> > > > > > > I'm not sure I follow. These fields (addedRows, addedRowsIncluded, removedRows, modifiedRows, and modifiedRowsIncluded) apply only to a specific atomic incremental update. For a given update these are the indices for the rows that were added/removed/modified -- and therefore cannot be part of the "global" Flight metadata.
> > > > > > >
> > > > > > > Are you suggesting this pattern of messages per incremental update?
> > > > > > > - FlightData with [the new] metadata header that includes added/removed/modified information, the number of add record batches, and the number of modified record batches. Noting that there could be more than one record batch per added or modified to enable serializing more than 2^31-1 rows in a single update. Also noting that it would have an empty body (similar to Schema).
> > > > > > > - A set of FlightData record batches using the normal RecordBatch flatbuffer.
> > > > > > > - A set of FlightData record batches also using the normal RecordBatch flatbuffer.
> > > > > > >
> > > > > > > My biggest concern with this approach is that small updates are likely going to have significant overhead.
> > > > > > > Maybe it won't matter, but it is the first thing that jumps out. We do typically coalesce updates somewhere between 50ms and 1s depending on the sensitivity of the listener; so maybe that's enough to eliminate my concern. I might just need to get data/statistics to get a better feeling for this concern.
> > > > > > >
> > > > > > > Regarding the schema evolution idea:
> > > > > > > What can I do to get started? Does it make sense to target the feature as a new field in the protobuf so that it can be used in contexts with other header metadata types? Do you have time to riff on the format that will apply to the other contexts? I believe all I would need is a bitset identifying which columns are included, but if enabling/disabling features is a nice-to-have then a bitset is going to be a bit weak. I can also, for now, cheat and send empty field nodes and empty buffers for those columns (but I am, already, slightly concerned with overhead).
> > > > > > >
> > > > > > > So, based on the feedback so far, I should be able to boil down the way I integrate with Arrow to, more or less, a pair of flatbuffers. I'm going to start riffing on these changes and see where I end up. Feel free to jump up and down if I misunderstood you.
> > > > > > >
> > > > > > > On Fri, Mar 5, 2021 at 9:23 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > > >
> > > > > > >> > And then having two sets of buffers, is the same as having two record batches, albeit you need both sets to be delivered together, as noted.
> > > > > > >>
> > > > > > >> It seems that atomic application could also be something controlled in metadata (i.e. this is batch 1 or X)?
> > > > > > >>
> > > > > > >> The schema evolution question is interesting, it could be useful in other contexts as well. (e.g. switching dictionary encoding on/off).
> > > > > > >>
> > > > > > >> -Micah
> > > > > > >>
> > > > > > >> On Fri, Mar 5, 2021 at 11:42 AM David Li <lidav...@apache.org> wrote:
> > > > > > >>
> > > > > > >> > (responses inline)
> > > > > > >> >
> > > > > > >> > On Thu, Mar 4, 2021, at 17:26, Nate Bauernfeind wrote:
> > > > > > >> > > Regarding the BarrageRecordBatch:
> > > > > > >> > >
> > > > > > >> > > I have been concatenating them; it’s one batch with two sets of arrow payloads. They don’t have separate metadata headers; the update is to be applied atomically. I have only studied the Java Arrow Flight implementation, and I believe it is usable maybe with some minor changes. The piece of code in Flight that does the deserialization takes two parallel lists/iterators, a `Buffer` list (these describe the length of a section of the body payload) and a `FieldNode` list (these describe num rows and null_count). Each field node is 2-3 buffers depending on schema type. Buffers are allowed to have length of 0, to omit their payloads; this, for example, is how you omit the validity buffer when null_count is zero.
> > > > > > >> > >
> > > > > > >> > > The proposed barrage payload keeps this structural pattern (list of buffer, list of field node) with the following modifications:
> > > > > > >> > > - we only include field nodes / buffers for subscribed columns
> > > > > > >> > > - the first set of field nodes are for added rows; these may be omitted if there are no added rows included in the update
> > > > > > >> > > - the second set of field nodes are for modified rows; we omit columns that have no modifications included in the update
> > > > > > >> > >
> > > > > > >> > > I believe the only thing that is missing is the ability to control the field types to be deserialized (like a third list/iterator parallel to field nodes and buffers).
> > > > > > >> >
> > > > > > >> > Right. I think we're on the same page here, but looking at this from different angles. I think being able to control which columns to deserialize/being able to only include a subset of buffers, is essentially equivalent to having a stream with schema evolution. And then having two sets of buffers, is the same as having two record batches, albeit you need both sets to be delivered together, as noted. Regardless, we can work out how to handle this.
> > > > > > >> >
> > > > > > >> > > Note that the BarrageRecordBatch.addedRowsIncluded, BarrageFieldNode.addedRows, BarrageFieldNode.modifiedRows and BarrageFieldNode.includedRows (all part of the flatbuffer metadata) are intended to be used by code one layer of abstraction higher than the actual wire-format parser. The parser doesn't really need them except to know which columns to expect in the payload. Technically, we could encode the field nodes / buffers as empty, too (but why be wasteful if this information is already encoded?).
> > > > > > >> >
> > > > > > >> > Right - presumably this could go in the Flight metadata instead of having to be inlined into the batch's metadata.
> > > > > > >> >
> > > > > > >> > > Regarding Browser Flight Support:
> > > > > > >> > >
> > > > > > >> > > Was this company FactSet by chance? (I saw they are mentioned in the JS thread that recently was bumped on the dev list.)
> > > > > > >> > >
> > > > > > >> > > I looked at the ticket and wanted to comment how we are handling bi-directional streams for our web-ui. We use ArrowFlight's concept of Ticket to allow a client to create and identify temporary state (new tables / views / REPL sessions / etc). Any bidirectional stream we support also has a server-streaming only variant with the ability for the client to attach a Ticket to reference/identify that stream.
> > > > > > >> > > The client may then send a message, out-of-band, to the Ticket. They are sequenced by the client (since gRPC doesn't guarantee ordered delivery) and delivered to the piece of code controlling that server-stream. It does require that the server be a bit stateful; but it works =).
> > > > > > >> >
> > > > > > >> > I still can't figure out who it was and now I wonder if it was all in my imagination. I'm hoping they'll see this and chime in, in the spirit of community participation :)
> > > > > > >> >
> > > > > > >> > I agree bidirectionality will be a challenge. I think WebSockets has been proposed as well, but that is also stateful (well, as soon as you have bidirectionality, you're going to have statefulness).
> > > > > > >> >
> > > > > > >> > > On Thu, Mar 4, 2021 at 6:58 AM David Li <lidav...@apache.org> wrote:
> > > > > > >> > >
> > > > > > >> > > > Re: the multiple batches, that makes sense. In that case, depending on how exactly the two record batches are laid out, I'd suggest considering a Union of Struct columns (where a Struct is essentially interchangeable with a record batch or table) - that would let you encode two distinct record batches inside the same physical batch. Or if the two batches have identical schemas, you could just concatenate them and include indices in your metadata.
> > > > > > >> > > >
> > > > > > >> > > > As for browser Flight support - there's an existing ticket:
> > > > > > >> > > > https://issues.apache.org/jira/browse/ARROW-9860
> > > > > > >> > > >
> > > > > > >> > > > I was sure I had seen another organization talking about browser support recently, but now I can't find them. I'll update here if I do figure it out.
> > > > > > >> > > >
> > > > > > >> > > > Best,
> > > > > > >> > > > David
> > > > > > >> > > >
> > > > > > >> > > > On Wed, Mar 3, 2021, at 21:00, Nate Bauernfeind wrote:
> > > > > > >> > > > > > if each payload has two batches with different purposes [...]
> > > > > > >> > > > >
> > > > > > >> > > > > The purposes of the payloads are slightly different, however they are intended to be applied atomically. If there are guarantees by the table operation generating the updates then those guarantees are only valid on each boundary of applying the update to your local state. In a sense, one is relatively useless without the other. Record batches fit well in map-reduce paradigms / algorithms, but what we have is stateful to enable/support incremental updates. For example, sorting a flight of data is best done map-reduce-style and requires one to re-sort the entire data set when it changes.
> > > > > > >> > > > > Our approach focuses on producing incremental updates which are used to manipulate your existing client state using a much smaller footprint (in both time and space). You can imagine, in the sort scenario, if you evaluate the table after adding rows but before modifying existing rows your table won’t be sorted between the two updates. The client would then need to wait until it receives the pair of RecordBatches anyways, so it seems more natural to deliver them together.
> > > > > > >> > > > >
> > > > > > >> > > > > > As a side note - is said UI browser-based? Another project recently was planning to look at JavaScript support for Flight (using WebSockets as the transport, IIRC) and it might make sense to join forces if that’s a path you were also going to pursue.
> > > > > > >> > > > >
> > > > > > >> > > > > Yes, our UI runs in the browser, although table operations themselves run on the server to keep the browser lean and fast. That said, the browser isn’t the only target for the API we’re iterating on. We’re engaged in a rewrite to unify our “first-class” Java API for intra-engine (server, heavyweight client) usage and our cross-language (Javascript/C++/C#/Python) “open” API.
> > > > > > >> > > > > Our existing customers use the engine to drive multi-process data applications, REPL/notebook experiences, and dashboards. We are preserving these capabilities as we make the engine available as open source software. One goal of the OSS effort is to produce a singular modern API that’s more interoperable with the data science and development community as a whole. In the interest of minimizing entry/egress points, we are migrating to gRPC for everything in addition to the data IPC layer, so not just the barrage/arrow-flight piece.
> > > > > > >> > > > >
> > > > > > >> > > > > The point of all this is to make the Deephaven engine as accessible as possible for a broad user base, including developers using the API from their language of choice or scripts/code running co-located within an engine process. Our software can be used to explore or build applications and visualizations around static as well as real-time data (imagine joins, aggregations, sorts, filters, time-series joins, etc), perform table operations with code or with a few clicks in a GUI, or as a building-block in a multi-stage data pipeline.
> > > > > > >> > > > > We think making ourselves as interoperable as possible with tools built on Arrow is an important part of attaining this goal.
> > > > > > >> > > > >
> > > > > > >> > > > > That said, we have run into quite a few pain points migrating to gRPC, such as 1) no client-side streaming is supported by any browser, 2) today, server-side streams require a proxy layer of some sort (such as envoy), 3) flatbuffer’s javascript/typescript support is a little weak, and I’m sure there are others that aren’t coming to mind at the moment. We have some interesting solutions to these problems, but, today, these issues are a decent chunk of our focus. That said, the UI is usable today by our enterprise clients, but it interacts with the server over websockets and a protocol that is heavily influenced by 10 years of existing proprietary java-to-java IPC (which is NOT friendly to being robust over intermittent failures). Today, we’re just heads-down going the gRPC route and hoping that eventually browsers get around to better support for some of this stuff (so, maybe one day a proxy isn’t required, etc).
> > > > > > >> > > > > Some of our RPCs make most sense as bidirectional streams, but to support our web-ui we also have a server-streaming variant that we can pass data to “out-of-band” via a unary call referencing the particular server stream. It’s fun stuff! I’m actually very excited about it even if the text doesn’t sound that way =).
> > > > > > >> > > > >
> > > > > > >> > > > > If you can point me to that project/person/post we’d love to get in touch and are excited to share whatever can be shared.
> > > > > > >> > > > >
> > > > > > >> > > > > Nate
> > > > > > >> > > > >
> > > > > > >> > > > > On Wed, Mar 3, 2021 at 4:22 PM David Li <lidav...@apache.org> wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Ah okay, thank you for clarifying! In that case, if each payload has two batches with different purposes - might it make sense to just make that two different payloads, and set a flag/enum in the metadata to indicate how to interpret the batch? Then you'd be officially the same as Arrow Flight :)
> > > > > > >> > > > > >
> > > > > > >> > > > > > As a side note - is said UI browser-based?
> > > > > > >> > > > > > Another project recently was planning to look at JavaScript support for Flight (using WebSockets as the transport, IIRC) and it might make sense to join forces if that's a path you were also going to pursue.
> > > > > > >> > > > > >
> > > > > > >> > > > > > Best,
> > > > > > >> > > > > > David
> > > > > > >> > > > > >
> > > > > > >> > > > > > On Wed, Mar 3, 2021, at 18:05, Nate Bauernfeind wrote:
> > > > > > >> > > > > > > Thanks for the interest =).
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > However, if I understand right, you're sending data without a fixed schema [...]
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > The dataset does have a known schema ahead of time, which is similar to Flight. However, as you point out, the subscription can change which columns it is interested in without re-acquiring data for columns it was already subscribed to. This is mostly for convenience. We use it primarily to limit which columns are sent to our user interface until the user scrolls them into view.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > The enhancement of the RecordBatch here, aside from the additional metadata, is only in that the payload has two sets of RecordBatch payloads.
The first payload is for added rows, every added row must send data for each column subscribed; based on the subscribed columns this is otherwise fixed width (in the number of columns / buffers). The second payload is for modified rows. Here we only send the columns that have rows that are modified. Aside from this difference, I have been aiming to be compatible enough to be able to reuse the payload parsing that is already written for Arrow.

> I don't quite see why it couldn't be carried as metadata on the side of a record batch, instead of having to duplicate the record batch structure [...]

Whoa, this is a good point. I have iterated on this a few times to get it closer to Arrow's setup and did not realize that 'BarrageData' is now officially identical to `FlightData`. This is an instance of being too close to the project and forgetting to step back once in a while.
> Flight already has a bidirectional streaming endpoint, DoExchange, that allows arbitrary payloads (with mixed metadata/data or only one of the two), which seems like it should be able to cover the SubscriptionRequest endpoint.

This is exactly the kind of feedback I'm looking for! I wasn't seeing the solution where the client-side stream doesn't actually need payload and that the subscription changes can be described with another flatbuffer metadata type. I like that.

Thanks David!
Nate

On Wed, Mar 3, 2021 at 3:28 PM David Li <lidav...@apache.org> wrote:

Hey Nate,

Thanks for sharing this & for the detailed docs and writeup. I think your use case is interesting, but I'd like to clarify a few things.

I would say Arrow Flight doesn't try to impose a particular model, but I agree that Barrage does things that aren't easily doable with Flight.
Flight does name concepts in a way that suggests how to apply it to something that looks like a database, but you can mostly think of Flight as an efficient way to transfer Arrow data over the network upon which you can layer further semantics.

However, if I understand right, you're sending data without a fixed schema, in the sense that each BarrageRecordBatch may have only a subset of the columns declared up front, or may carry new columns? I think this is the main thing you can't easily do currently, as Flight (and Arrow IPC in general) assumes a fixed schema (and expects all columns in a batch to have the same length).
Otherwise, the encoding for identifying rows and changes is interesting, but I don't quite see why it couldn't be carried as metadata on the side of a record batch, instead of having to duplicate the record batch structure, except for the aforementioned schema issue. And in that case it might be better to work out the schema evolution issue & any ergonomic issues with Flight's existing metadata fields/API that would prevent you from using them, as that way you (and we!) don't have to fully duplicate one of Arrow's format definitions. Similarly, Flight already has a bidirectional streaming endpoint, DoExchange, that allows arbitrary payloads (with mixed metadata/data or only one of the two), which seems like it should be able to cover the SubscriptionRequest endpoint.
Best,
David

On Wed, Mar 3, 2021, at 16:08, Nate Bauernfeind wrote:

Hello,

My colleagues at Deephaven Data Labs and I have been addressing problems at the intersection of data-driven applications, data science, and updating (/ticking) data for some years.

Deephaven has a query engine that supports updating tabular data via a protocol that communicates precise changes about datasets, such as 1) which rows were removed, 2) which rows were added, 3) which rows were modified (and for which columns). We are inspired by Arrow and would like to adopt a version of this protocol that adheres to goals similar to Arrow and Arrow Flight.

Out of the box, Arrow Flight is insufficient to represent such a stream of changes.
For example, because you cannot identify a particular row within an Arrow Flight, you cannot indicate which rows were removed or modified.

The project integrates with Arrow Flight at the header-metadata level. We have preliminarily named the project Barrage as in a "barrage of arrows" which plays in the same "namespace" as a "flight of arrows."

We built this as part of an initiative to modernize and open up our table IPC mechanisms. This is part of a larger open source effort which will become more visible in the next month or so once we've finished the work necessary to share our core software components, including a unified static and real time query engine complete with data visualization tools, a REPL experience, Jupyter integration, and more.
I would like to find out:
- if we have understood the primary goals of Arrow, and are honoring them as closely as possible
- if there are other projects that might benefit from sharing this extension of Arrow Flight
- if there are any gaps that are best addressed early on to maximize future compatibility

A great place to digest the concepts that differ from Arrow Flight are here:
https://deephaven.github.io/barrage/Concepts.html

The proposed protocol can be perused here:
https://github.com/deephaven/barrage

Internally, we already have a java server and java client implemented as a working proof of concept for our use case.

I really look forward to your feedback; thank you!
Nate Bauernfeind

Deephaven Data Labs - https://deephaven.io/
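
For anyone trying to picture the update model described in the original message (removed rows, added rows carrying every subscribed column, and modified rows carrying only the changed columns), here is a minimal sketch of how a client might fold one such update into a local table. This is not Barrage's wire format or API: `TableUpdate`, `apply_update`, the integer row keys, the dict-of-rows storage, and the removed/added/modified application order are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TableUpdate:
    """Hypothetical update message (illustrative only, not the Barrage format)."""
    removed: List[int] = field(default_factory=list)                   # row keys to drop
    added: Dict[int, Dict[str, Any]] = field(default_factory=dict)     # key -> full row
    modified: Dict[str, Dict[int, Any]] = field(default_factory=dict)  # column -> {key: value}


def apply_update(table: Dict[int, Dict[str, Any]],
                 columns: List[str],
                 update: TableUpdate) -> None:
    """Apply one update in place: removals, then additions, then modifications.

    The ordering is an assumption of this sketch.
    """
    for key in update.removed:
        del table[key]  # a KeyError here means the update refers to an unknown row
    for key, row in update.added.items():
        # Added rows must carry data for every subscribed column.
        missing = [c for c in columns if c not in row]
        if missing:
            raise ValueError(f"added row {key} lacks subscribed columns {missing}")
        table[key] = {c: row[c] for c in columns}
    for column, changes in update.modified.items():
        # Modified rows only carry the columns that actually changed.
        for key, value in changes.items():
            table[key][column] = value


# Usage: an initial snapshot followed by one incremental update.
columns = ["sym", "price"]
table: Dict[int, Dict[str, Any]] = {}
apply_update(table, columns, TableUpdate(
    added={0: {"sym": "A", "price": 1.0}, 1: {"sym": "B", "price": 2.0}}))
apply_update(table, columns, TableUpdate(
    removed=[0],
    added={2: {"sym": "C", "price": 3.0}},
    modified={"price": {1: 2.5}}))
```

After the second call, row 0 is gone, row 2 is present in full, and only the `price` cell of row 1 has changed. A real client would presumably hold Arrow arrays rather than Python dicts; the dict is used here only to keep the row-identity idea visible.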