In the Runner API proposal doc, there are 10+ different types with several
fields each.
Is it important to have a code generator for the schema? It would:
* simplify the SDK development process
* reduce errors due to differences between custom implementations

I'm not familiar with any tool that can take a JSON schema (e.g.
http://json-schema.org/) and generate code in multiple languages. Does anyone
know of one?
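For illustration only (the type and field names below are hypothetical, not
taken from the proposal doc), a generator consuming such a schema might emit a
small value class per type in each target language, e.g. in Java:

    // Hypothetical output of a schema-driven code generator for one Runner API
    // type. The class name and fields are made up for illustration; the real
    // types are the ones defined in the Runner API proposal doc.
    public final class PrimitiveTransform {
      private final String id;
      private final String urn;

      public PrimitiveTransform(String id, String urn) {
        this.id = id;
        this.urn = urn;
      }

      public String getId() { return id; }
      public String getUrn() { return urn; }
    }

Hand-writing 10+ such types in every SDK is exactly where small differences
between implementations tend to creep in.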


For the Data Plane API, a Runner and SDK must be able to encode elements
such as WindowedValues and KVs in such a way that both sides can interpret
them. For example, a Runner will be required to implement GBK, so it must be
able to read the windowing information from the transmitted "bytes";
additionally, it will need to be able to split KV<K, V> records apart and
recreate KV<K, Iterable<V>> for the SDK. Since Coders are the dominant way of
encoding things, the Data Plane API will transmit "bytes" with the element
boundaries encoded in some way. Aljoscha, I agree with you that a good choice
for transmitting bytes between VMs/languages is very important.
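As a minimal sketch of that Runner-side work, assuming a purely illustrative
record framing of [key length][key bytes][value length][value bytes] (window
and timestamp fields elided for brevity), the splitting and regrouping could
look roughly like this in Java:

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class GbkSketch {
      // Split coder-encoded KV<K, V> records out of a byte stream and regroup
      // them by key, conceptually producing KV<K, Iterable<V>> for the SDK.
      static Map<String, List<byte[]>> groupByKey(byte[] encodedRecords)
          throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(encodedRecords));
        Map<String, List<byte[]>> grouped = new HashMap<>();
        while (in.available() > 0) {
          byte[] key = new byte[in.readInt()];    // length-prefixed key bytes
          in.readFully(key);
          byte[] value = new byte[in.readInt()];  // length-prefixed value bytes
          in.readFully(value);
          // Only the key bytes need to be understood here; values stay opaque.
          grouped.computeIfAbsent(new String(key, StandardCharsets.UTF_8),
              k -> new ArrayList<>()).add(value);
        }
        return grouped;
      }
    }

With a framing like this the Runner never needs to understand the value
encoding, only where elements begin and end.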
Even though we are mostly transmitting "bytes", error handling and connection
handling are still important. For example, if we were to use gRPC and proto3
with a bidirectional streaming API, we would get:
* the Runner and SDK can each push data in both directions (stream from/to
GBK, stream from/to state)
* error handling
* code generation of client libraries
* HTTP/2
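To make those bullets concrete, here is a rough sketch (names like FnDataStub
and Elements are placeholders, not a proposed API) of what the Java side of
such a bidirectional stream could look like with a gRPC-generated async stub:

    import io.grpc.stub.StreamObserver;

    public class DataPlaneSketch {
      // Placeholder for a proto message carrying coder-encoded element bytes.
      static class Elements {
        byte[] encodedElements;  // concatenated, coder-encoded elements
        int elementCount;        // how many elements the bytes contain
      }

      // Shape of the async stub gRPC generates for a bidi streaming RPC.
      interface FnDataStub {
        StreamObserver<Elements> data(StreamObserver<Elements> responseObserver);
      }

      void run(FnDataStub stub) {
        // Elements pushed by the SDK arrive on this observer...
        StreamObserver<Elements> fromSdk = new StreamObserver<Elements>() {
          @Override public void onNext(Elements batch) {
            // hand bytes to GBK, state, etc.
          }
          @Override public void onError(Throwable t) {
            // gRPC surfaces connection/RPC failures here
          }
          @Override public void onCompleted() {
            // SDK finished its side of the bundle
          }
        };
        // ...and the Runner pushes elements to the SDK on the returned observer.
        StreamObserver<Elements> toSdk = stub.data(fromSdk);
        toSdk.onNext(new Elements());  // push a batch of coder-encoded elements
        toSdk.onCompleted();
      }
    }

The error-handling and client-library bullets essentially fall out of this
shape: onError carries connection and RPC failures, and the stub itself is
generated from the proto service definition.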

As for the encoding, any SDK can choose any serialization it wants, such as
Kryo, but interoperability with other languages would then require those
languages to implement parts of the Kryo serialization spec just to interpret
the "bytes". Thus certain types like KV & WindowedValue should be encoded in a
way that allows for this interoperability.
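As one illustration (this layout is not a proposal, just an example of an
encoding that any language can produce and consume with plain length prefixes
rather than a Java- or Kryo-specific spec):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class InteropEncodingSketch {
      // Encode one windowed KV element as: timestamp, then length-prefixed
      // window, key, and value bytes (each produced by the relevant coder).
      static byte[] encodeWindowedKv(long timestampMillis, byte[] windowBytes,
          byte[] keyBytes, byte[] valueBytes) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeLong(timestampMillis);    // event-time timestamp, big-endian
        out.writeInt(windowBytes.length);  // window coder output
        out.write(windowBytes);
        out.writeInt(keyBytes.length);     // key coder output
        out.write(keyBytes);
        out.writeInt(valueBytes.length);   // value coder output
        out.write(valueBytes);
        out.flush();
        return bos.toByteArray();
      }
    }

Any SDK could still use Kryo or anything else internally, as long as what
crosses the Fn API boundary for types like KV and WindowedValue follows an
agreed, language-neutral layout along these lines.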






On Fri, Jun 17, 2016 at 3:20 AM, Amit Sela <amitsel...@gmail.com> wrote:

> +1 on Aljoscha's comment; not sure where the benefit is in having a
> "schematic" serialization.
>
> I know that Spark, and I think Flink as well, uses Kryo
> <https://github.com/EsotericSoftware/kryo> for serialization (to be
> accurate, it's Chill <https://github.com/twitter/chill> for Spark), and I
> found it very impressive even compared to "manual" serializations;
> i.e., it seems to outperform Spark's "native" Encoders (1.6+) for
> primitives.
> In addition it clearly supports Java and Scala, and there are 3rd party
> libraries for Clojure and Objective-C.
>
> I guess my bottom line here agrees with Kenneth - performance and
> interoperability - but I'm just not sure that schema-based serializers are
> *always* the fastest.
>
> As for pipeline serialization, since performance is not the main issue there
> and usability would be very important, I say +1 for JSON.
>
> For anyone who has spent some time benchmarking serialization libraries, now
> is the time to speak up ;)
>
> Thanks,
> Amit
>
> On Fri, Jun 17, 2016 at 12:47 PM Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > Hi,
> > am I correct in assuming that the transmitted envelopes would mostly
> > contain coder-serialized values? If so, wouldn't the header of an
> > envelope just be the number of contained bytes and number of values?
> > I'm probably missing something but with these assumptions I don't see
> > the benefit of using something like Avro/Thrift/Protobuf for
> > serializing the main-input value envelopes. We would just need a
> > system that can send byte data really fast between languages/VMs.
> >
> > By the way, another interesting question (at least for me) is how other
> > data, such as side-inputs, is going to arrive at the DoFn if we want to
> > support a general interface for different languages.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles <k...@google.com.invalid>
> > wrote:
> >
> > > (Apologies for the formatting)
> > >
> > > On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles <k...@google.com>
> > > wrote:
> > >
> > > > Hello everyone!
> > > >
> > > > We are busily working on a Runner API (for building and transmitting
> > > > pipelines) and a Fn API (for invoking user-defined functions found
> > > > within pipelines) as outlined in the Beam technical vision [1]. Both
> > > > of these require a language-independent serialization technology for
> > > > interoperability between SDKs and runners.
> > > >
> > > > The Fn API includes a high-bandwidth data plane where bundles are
> > > > transmitted via some serialization/RPC envelope (inside the envelope,
> > > > the stream of elements is encoded with a coder) to transfer bundles
> > > > between the runner and the SDK, so performance is extremely
> > > > important. There are many choices for high performance serialization,
> > > > and we would like to start the conversation about what serialization
> > > > technology is best for Beam.
> > > >
> > > > The goal of this discussion is to arrive at consensus on the
> > > > question: What serialization technology should we use for the data
> > > > plane envelope of the Fn API?
> > > >
> > > > To facilitate community discussion, we looked at the available
> > > > technologies and tried to narrow the choices based on three criteria:
> > > >
> > > >  - Performance: What is the size of serialized data? How do we
> > > >    expect the technology to affect pipeline speed and cost? etc
> > > >
> > > >  - Language support: Does the technology support the most widespread
> > > >    languages for data processing? Does it have a vibrant ecosystem of
> > > >    contributed language bindings? etc
> > > >
> > > >  - Community: What is the adoption of the technology? How mature is
> > > >    it? How active is development? How is the documentation? etc
> > > >
> > > > Given these criteria, we came up with four technologies that are good
> > > > contenders. All have similar & adequate schema capabilities.
> > > >
> > > >  - Apache Avro: Does not require code gen, but embedding the schema
> > > >    in the data could be an issue. Very popular.
> > > >
> > > >  - Apache Thrift: Probably a bit faster and more compact than Avro.
> > > >    A huge number of languages supported.
> > > >
> > > >  - Protocol Buffers 3: Incorporates the lessons that Google has
> > > >    learned through long-term use of Protocol Buffers.
> > > >
> > > >  - FlatBuffers: Some benchmarks imply great performance from the
> > > >    zero-copy mmap idea. We would need to run representative
> > > >    experiments.
> > > >
> > > > I want to emphasize that this is a community decision, and this
> > > > thread is just the conversation starter for us all to weigh in. We
> > > > just wanted to do some legwork to focus the discussion if we could.
> > > >
> > > > And there's a minor follow-up question: Once we settle here, is that
> > > > technology also suitable for the low-bandwidth Runner API for
> > > > defining pipelines, or does anyone think we need to consider a second
> > > > technology (like JSON) for usability reasons?
> > > >
> > > > [1]
> > > > https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
> > > >
> > > >
> > >
> >
>
