One question I have is about the choice of protobuf: FlatBuffers seems to have better support for zero-copy, and it works with gRPC as well. What's the rationale behind picking protobuf over FlatBuffers?
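For reference, a minimal sketch of the protobuf framing that the zero-copy question turns on. A length-delimited field is a varint tag, a varint length, and then the raw bytes, so a large body can be sent as a small header plus the untouched buffer, and sliced back out on read without copying. The field number 5 and the helper names are hypothetical, chosen only for illustration; this is not the Flight implementation.

```python
LENGTH_DELIMITED = 2  # protobuf wire type for bytes/string/embedded messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("varint sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(buf: memoryview, pos: int):
    """Decode a varint starting at pos; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def frame_bytes_field(field_number: int, payload: memoryview) -> list:
    """Return [header, payload]; the payload buffer itself is not copied."""
    tag = (field_number << 3) | LENGTH_DELIMITED
    header = encode_varint(tag) + encode_varint(payload.nbytes)
    return [header, payload]


def read_bytes_field(message: memoryview):
    """Parse one length-delimited field; return (field_number, body_view).

    body_view is a slice of the input buffer, not a copy.
    """
    tag, pos = decode_varint(message, 0)
    if tag & 0x07 != LENGTH_DELIMITED:
        raise ValueError("expected a length-delimited field")
    length, pos = decode_varint(message, pos)
    return tag >> 3, message[pos:pos + length]
```

The two parts returned by `frame_bytes_field` could be handed to a gather-style write (for example `socket.sendmsg`) so the body never has to be copied into one contiguous message. Whether a given gRPC implementation exposes enough of the byte stream to do this is exactly the per-language question raised in the quoted thread.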

On Thu, Aug 16, 2018 at 7:41 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Julian,
>
> Thanks for chiming in.
>
> On Thu, Aug 16, 2018 at 1:16 PM, Julian Hyde <jh...@apache.org> wrote:
> > If your use case is SQL RPC, then you are getting close to Avatica's
> > territory. Avatica[1] is a protocol for implementing
> > language-independent JDBC and ODBC stacks.
>
> I'm not proposing to develop a SQL RPC system inside Apache Arrow. But
> Arrow Flight could be used to build one
>
> >
> > Now, I agree that many ODBC implementations are inefficient. Some ODBC
> > stacks make more round trips than necessary, and do more copying than
> > necessary. In Avatica we are trying to squeeze out those
> > inefficiencies, for example minimizing the number of RPCs. We would
> > also love to use Arrow as the data format and reduce copying on the
> > server side and client side.
>
> Indeed -- what I would like to see instead is for Avatica to _use_
> Arrow Flight to provide an alternative platform to offer Arrow-native
> connectivity in addition to the slower JDBC and ODBC standards.
>
> >
> > But conversely, people who start with a simple RPC use case - send
> > SQL, get the results - may soon find themselves needing a more complex
> > protocol - authentication, sessions, prepared statements, bind
> > variables, getting metadata before executing, cursors, skipping over
> > rows. In other words, find themselves wanting substantial portions of
> > an ODBC or JDBC driver.
> >
> > You could find yourselves building Avatica all over again. We saw all
> > of this happen in XML-RPC, and it was sad.
>
> Agreed. I don't think this is in the cards, and what's being proposed
> now is orthogonal.
>
> >
> > I suggest to keep flight for the truly simple use case, and for the
> > more complex use case, invest effort putting Arrow into Avatica. We
> > are always happy to welcome new contributors.
>
> +1
>
> >
> > Julian
> >
> > [1] https://calcite.apache.org/avatica/docs/
> > On Thu, Aug 16, 2018 at 7:56 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> To give some extra color on my personal motivation for interest in
> >> Arrow Flight:
> >>
> >> Systems that expose databases on a network frequently send data very
> >> slowly. For example, ODBC is in general extremely slow. What I would
> >> like to see is servers that can expose a "sql" action type.
> >>
> >> So, in consideration of the protocol as it stands now [1], example
> >> session goes like this:
> >>
> >> * Client issues ListActions -> returns one or more ActionType, suppose
> >> one is "sql"
> >> * Client issues DoAction with type sql and body "select * from $TABLE"
> >> * Server returns stream URI for query result set and Ticket in the
> >> Result proto
> >> * Client issues GetFlightInfo using URI to obtain schema of result set
> >> * Client issues DoGet with ticket returned by sql DoAction
> >>
> >> There's some possible refinements to this workflow; for example, if we
> >> wanted to enable DoAction to return more structured results (e.g. to
> >> avoid the extra GetFlightInfo RPC to get the schema of the query
> >> result set)
> >>
> >> - Wes
> >>
> >> [1]: https://github.com/apache/arrow/blob/c52897274035f8b5192d7647b9711c68d9c54ccc/java/flight/src/main/protobuf/flight.proto
> >>
> >> On Thu, Aug 16, 2018 at 10:29 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> >> > I'm out of town this week (vacation) and will be reviewing your
> >> > feedback next week. Thanks for the feedback!
> >> >
> >> > On Thu, Aug 9, 2018, 8:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >
> >> >> hi folks,
> >> >>
> >> >> I left some feedback on this PR. If others could take a look
> >> >> (particularly at the .proto service definition) that would be useful.
> >> >>
> >> >> We should decide on an approach to getting multiple production-worthy
> >> >> Flight/RPC implementations ready to go. It would be a good goal to
> >> >> deliver (end-to-end send/receive data between Python and Java, or
> >> >> Python and other Python processes) in the next couple releases.
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Wed, May 30, 2018 at 12:44 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> >> >> > Correct, I'm maintaining standard protobuf encoding so a consumer that
> >> >> > doesn't go byte by byte can still consumer/produce the messages.
> >> >> >
> >> >> > More impls: for sure.
> >> >> >
> >> >> > On Wed, May 30, 2018 at 9:01 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >> >
> >> >> >> I see; looking more closely I see you've sidestepped the standard
> >> >> >> Protobuf serialization to write the stream as tagged components:
> >> >> >>
> >> >> >> https://github.com/apache/arrow/compare/master...jacques-n:flight#diff-02cfc9235e22653fce8a7636c9f95507R241
> >> >> >>
> >> >> >> and then reading the fields of the message tag by tag
> >> >> >>
> >> >> >> https://github.com/apache/arrow/compare/master...jacques-n:flight#diff-02cfc9235e22653fce8a7636c9f95507R159
> >> >> >>
> >> >> >> Would it be correct that if a GRPC implementation doesn't provide
> >> >> >> sufficient access to the byte stream (or if it doesn't care enough
> >> >> >> about zero copy) that you could allow GRPC to return an instance of
> >> >> >> the FlightData structure?
> >> >> >>
> >> >> >> I expect we'd want to see a few interoperable implementations (I
> >> >> >> suggest Java, C++, Go) to harden the fine details.
> >> >> >>
> >> >> >> - Wes
> >> >> >>
> >> >> >> On Mon, May 28, 2018 at 3:32 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> >> >> >> > Cutting through the layers of GRPC will be a per language approach thing.
> >> >> >> > Assuming that each GRPC language implementation does a good job of
> >> >> >> > separating message encapsulation from the base library, this should be
> >> >> >> > straight-forward-ish. Hope improves around this as I see creation of
> >> >> >> > non-protobuf protocols built on top of the base GRPC [1]. How to do this in
> >> >> >> > each language will probably take time looking at the GRPC internals for
> >> >> >> > that language but can be a secondary step once you get the protocol working
> >> >> >> > (you can just pay for extra copies until then).
> >> >> >> >
> >> >> >> > In my Java approach I believe I do one read copy and zero write copies
> >> >> >> > (needs more testing) which was my target. (Getting to zero-copy on read
> >> >> >> > means a lot more complexity because your socket-reading has to be protocol
> >> >> >> > aware: even our bespoke layer in Dremio doesn't try to do that. I'd guess
> >> >> >> > KRPC does the same but haven't reviewed the code to confirm.)
> >> >> >> >
> >> >> >> > Will try to get some more slides/readme and a proper proposed patch up soon.
> >> >> >> >
> >> >> >> > [1] https://grpc.io/blog/flatbuffers
> >> >> >> >
> >> >> >> > On Mon, May 28, 2018 at 1:05 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >> >> >
> >> >> >> >> hey Jacques,
> >> >> >> >>
> >> >> >> >> This is great news, I look forward to digging into this. My biggest
> >> >> >> >> initial question is the Protobuf encapsulation, specifically:
> >> >> >> >>
> >> >> >> >> https://github.com/jacques-n/arrow/blob/flight/java/flight/src/main/protobuf/flight.proto#L99
> >> >> >> >>
> >> >> >> >> My understanding of Protocol Buffers is that on read, the "data_body"
> >> >> >> >> memory would be copied out of the serialized protobuf that came across
> >> >> >> >> the wire. Your comment in the .proto says this "comes last in the
> >> >> >> >> definition to help with sidecar patterns" -- my read is that it would
> >> >> >> >> be up to us to do our own sidecar implementation, similar to how
> >> >> >> >> Apache Kudu has zero-copy sidecars in their KRPC system [1] (the
> >> >> >> >> comment there describes pretty much exactly the problem we have). I
> >> >> >> >> saw that you also replied on a GRPC thread about this issue [2]. Could
> >> >> >> >> you summarize what (if anything) stands in the way to get zero-copy on
> >> >> >> >> write and read?
> >> >> >> >>
> >> >> >> >> - Wes
> >> >> >> >>
> >> >> >> >> [1]: https://github.com/apache/kudu/blob/master/src/kudu/rpc/rpc_sidecar.h#L34
> >> >> >> >> [2]: https://github.com/grpc/grpc-java/issues/1054#issuecomment-391692087
> >> >> >> >>
> >> >> >> >> On Thu, May 24, 2018 at 6:57 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> >> >> >> >> > FYI, if you want to see an example server you can run with a GRPC generated
> >> >> >> >> > client, you can run the ExampleFlightServer located at [1]. Very basic
> >> >> >> >> > 'test' with that class and client is located at [2].
> >> >> >> >> >
> >> >> >> >> > [1] https://github.com/jacques-n/arrow/tree/flight/java/flight/src/main/java/org/apache/arrow/flight/example
> >> >> >> >> > [2] https://github.com/jacques-n/arrow/blob/flight/java/flight/src/test/java/org/apache/arrow/flight/example/TestExampleServer.java
> >> >> >> >> >
> >> >> >> >> > On Thu, May 24, 2018 at 11:51 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> >> >> >> >> >
> >> >> >> >> >> Hey All,
> >> >> >> >> >>
> >> >> >> >> >> I used my Strata talk today as a forcing function to make additional
> >> >> >> >> >> progress on a GRPC-based Arrow RPC protocol [1]. I’m calling it “Apache
> >> >> >> >> >> Arrow Flight”. You can take a look at the work here [2]. I’ll work to clean
> >> >> >> >> >> up my work and explain my thoughts about the protocol in the coming days.
> >> >> >> >> >> High-level: use protobuf as a encapsulation format so that any client that
> >> >> >> >> >> is supported in GRPC will work. However, we can optimize the read/write
> >> >> >> >> >> path for targeted languages and hand control the
> >> >> >> >> >> serialization/deserialization and memory handling. (I did that in this Java
> >> >> >> >> >> patch [3][4][5].) I also looked at starting to use GRPC generated bindings
> >> >> >> >> >> within Python but it looks like some glue code may be needed in the C++
> >> >> >> >> >> layer since Python delegates down frequently. I also am still trying to
> >> >> >> >> >> understand GRPC back-pressure patterns and whether the protocol
> >> >> >> >> >> realistically needs to change to cover real-world high performance use
> >> >> >> >> >> cases.
> >> >> >> >> >>
> >> >> >> >> >> I’ll send out some slides about the ideas and update README, etc. soon.
> >> >> >> >> >>
> >> >> >> >> >> Thanks,
> >> >> >> >> >> Jacques
> >> >> >> >> >>
> >> >> >> >> >> [1] https://github.com/jacques-n/arrow/blob/flight/java/flight/src/main/protobuf/flight.proto
> >> >> >> >> >> [2] http://github.com/jacques-n/arrow/
> >> >> >> >> >> [3] https://github.com/jacques-n/arrow/tree/flight/java/flight/src/main/java/org/apache/arrow/flight/grpc
> >> >> >> >> >> [4] https://github.com/jacques-n/arrow/blob/flight/java/flight/src/main/java/org/apache/arrow/flight/ArrowMessage.java#L253
> >> >> >> >> >> [5] https://github.com/jacques-n/arrow/blob/flight/java/flight/src/main/java/org/apache/arrow/flight/ArrowMessage.java#L185
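
The example session Wes outlines in the quoted thread (ListActions, DoAction with type "sql", GetFlightInfo, DoGet) can be sketched as a toy in-memory model. Everything here is hypothetical: the class, the method names, the `toy://` URI scheme, and the naive "parsing" of the query are stand-ins for illustration only. It is not the Flight API and involves no gRPC or Arrow buffers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionResult:
    """Stand-in for the Result proto: where the result lives and how to fetch it."""
    uri: str
    ticket: bytes


class ToyFlightServer:
    """In-memory stand-in for a server exposing a "sql" action (hypothetical)."""

    def __init__(self, tables):
        self._tables = tables   # table name -> (schema, rows)
        self._schemas = {}      # stream URI -> schema
        self._results = {}      # ticket -> rows

    def list_actions(self):
        return ["sql"]

    def do_action(self, action_type, body):
        if action_type != "sql":
            raise ValueError(f"unsupported action type: {action_type}")
        # Toy "query engine": only understands "select * from <table>".
        table = body.split()[-1]
        schema, rows = self._tables[table]
        index = len(self._results)
        uri = f"toy://results/{index}"
        ticket = f"ticket-{index}".encode()
        self._schemas[uri] = schema
        self._results[ticket] = rows
        return ActionResult(uri=uri, ticket=ticket)

    def get_flight_info(self, uri):
        return self._schemas[uri]

    def do_get(self, ticket):
        yield from self._results[ticket]


def run_query(server, sql):
    """Client side of the session: four calls, mirroring the bullet list."""
    if "sql" not in server.list_actions():            # ListActions
        raise RuntimeError("server does not expose a sql action")
    result = server.do_action("sql", sql)             # DoAction -> URI + Ticket
    schema = server.get_flight_info(result.uri)       # GetFlightInfo -> schema
    rows = list(server.do_get(result.ticket))         # DoGet -> data stream
    return schema, rows
```

Written out this way, the refinement Wes mentions is easy to see: if `do_action` also returned the schema, the `get_flight_info` round trip could be dropped.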