(Apologies for the formatting) On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles <k...@google.com> wrote:
> Hello everyone! > > We are busily working on a Runner API (for building and transmitting > pipelines) > and a Fn API (for invoking user-defined functions found within pipelines) > as > outlined in the Beam technical vision [1]. Both of these require a > language-independent serialization technology for interoperability between > SDKs > and runners. > > The Fn API includes a high-bandwidth data plane where bundles are > transmitted > via some serialization/RPC envelope (inside the envelope, the stream of > elements is encoded with a coder) to transfer bundles between the runner > and > the SDK, so performance is extremely important. There are many choices for > high > performance serialization, and we would like to start the conversation > about > what serialization technology is best for Beam. > > The goal of this discussion is to arrive at consensus on the question: > What > serialization technology should we use for the data plane envelope of the > Fn > API? > > To facilitate community discussion, we looked at the available > technologies and > tried to narrow the choices based on three criteria: > > - Performance: What is the size of serialized data? How do we expect the > technology to affect pipeline speed and cost? etc > > - Language support: Does the technology support the most widespread > language > for data processing? Does it have a vibrant ecosystem of contributed > language bindings? etc > > - Community: What is the adoption of the technology? How mature is it? > How > active is development? How is the documentation? etc > > Given these criteria, we came up with four technologies that are good > contenders. All have similar & adequate schema capabilities. > > - Apache Avro: Does not require code gen, but embedding the schema in the > data > could be an issue. Very popular. > > - Apache Thrift: Probably a bit faster and compact than Avro. A huge > number of > language supported. > > - Protocol Buffers 3: Incorporates the lessons that Google has learned > through > long-term use of Protocol Buffers. > > - FlatBuffers: Some benchmarks imply great performance from the zero-copy > mmap > idea. We would need to run representative experiments. > > I want to emphasize that this is a community decision, and this thread is > just > the conversation starter for us all to weigh in. We just wanted to do some > legwork to focus the discussion if we could. > > And there's a minor follow-up question: Once we settle here, is that > technology > also suitable for the low-bandwidth Runner API for defining pipelines, or > does > anyone think we need to consider a second technology (like JSON) for > usability > reasons? > > [1] > https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38 > >