(Apologies for the formatting)

On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles <k...@google.com> wrote:

> Hello everyone!
>
> We are busily working on a Runner API (for building and transmitting
> pipelines)
> and a Fn API (for invoking user-defined functions found within pipelines)
> as
> outlined in the Beam technical vision [1]. Both of these require a
> language-independent serialization technology for interoperability between
> SDKs
> and runners.
>
> The Fn API includes a high-bandwidth data plane where bundles are
> transmitted
> via some serialization/RPC envelope (inside the envelope, the stream of
> elements is encoded with a coder) to transfer bundles between the runner
> and
> the SDK, so performance is extremely important. There are many choices for
> high
> performance serialization, and we would like to start the conversation
> about
> what serialization technology is best for Beam.
>
> The goal of this discussion is to arrive at consensus on the question:
> What
> serialization technology should we use for the data plane envelope of the
> Fn
> API?
>
> To facilitate community discussion, we looked at the available
> technologies and
> tried to narrow the choices based on three criteria:
>
>  - Performance: What is the size of serialized data? How do we expect the
>    technology to affect pipeline speed and cost? etc
>
>  - Language support: Does the technology support the most widespread
> language
>    for data processing? Does it have a vibrant ecosystem of contributed
>    language bindings? etc
>
>  - Community: What is the adoption of the technology? How mature is it?
> How
>    active is development? How is the documentation? etc
>
> Given these criteria, we came up with four technologies that are good
> contenders. All have similar & adequate schema capabilities.
>
>  - Apache Avro: Does not require code gen, but embedding the schema in the
> data
>    could be an issue. Very popular.
>
>  - Apache Thrift: Probably a bit faster and compact than Avro. A huge
> number of
>    language supported.
>
>  - Protocol Buffers 3: Incorporates the lessons that Google has learned
> through
>    long-term use of Protocol Buffers.
>
>  - FlatBuffers: Some benchmarks imply great performance from the zero-copy
> mmap
>    idea. We would need to run representative experiments.
>
> I want to emphasize that this is a community decision, and this thread is
> just
> the conversation starter for us all to weigh in. We just wanted to do some
> legwork to focus the discussion if we could.
>
> And there's a minor follow-up question: Once we settle here, is that
> technology
> also suitable for the low-bandwidth Runner API for defining pipelines, or
> does
> anyone think we need to consider a second technology (like JSON) for
> usability
> reasons?
>
> [1]
> https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
>
>

Reply via email to