Re: Beam Fn API

Robert Bradshaw Wed, 31 May 2017 12:06:06 -0700

Thank! Looks good. I've added some comments to the doc.

On Wed, May 31, 2017 at 7:00 AM, Etienne Chauchot <echauc...@gmail.com>
wrote:


> Thanks for all these docs! They are exactly what was needed for new
> contributors as discussed in this thread
>
> https://lists.apache.org/thread.html/ac93d29424e19d57097373b
> 78f3f5bcbc701e4b51385a52a6e27b7ed@%3Cdev.beam.apache.org%3E
>
> Etienne
>
>
> Le 31/05/2017 à 11:12, Aljoscha Krettek a écrit :
>
>> Thanks for banging these out Lukasz. I’ll try and read them all this week.
>>
>> We’re also planning to add support for the Fn API to the Flink Runner so
>> that we can execute Python programs. I’m sure we’ll get some valuable
>> feedback for you while doing that.
>>
>> On 26. May 2017, at 22:49, Lukasz Cwik <lc...@google.com.INVALID> wrote:
>>>
>>> I would like to share another document about the Fn API. This document
>>> specifically discusses how to access side inputs, access remote
>>> references
>>> (e.g. large iterables for hot keys produced by a GBK), and support user
>>> state.
>>> https://s.apache.org/beam-fn-state-api-and-bundle-processing
>>>
>>> The document does require a strong foundation in the Apache Beam model
>>> and
>>> a good understanding of the prior shared docs:
>>> * How to process a bundle: https://s.apache.org/beam-fn-api
>>> -processing-a-bundle
>>> * How to send and receive data: https://s.apache.org/beam-fn-api
>>> -send-and-receive-data
>>>
>>> I could really use the help of runner contributors to review the caching
>>> semantics within the SDK harness and whether they would work well for the
>>> runner they contribute to the most.
>>>
>>> On Sun, May 21, 2017 at 6:40 PM, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>> Manu, the goal is to share here initially, update the docs addressing
>>>> people's comments, and then publish them on the website once they are
>>>> stable enough.
>>>>
>>>> On Sun, May 21, 2017 at 5:54 PM, Manu Zhang <owenzhang1...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Lukasz. The following two links were somehow incorrectly
>>>>> formatted
>>>>> in your mail.
>>>>>
>>>>> * How to process a bundle:
>>>>> https://s.apache.org/beam-fn-api-processing-a-bundle
>>>>> * How to send and receive data:
>>>>> https://s.apache.org/beam-fn-api-send-and-receive-data
>>>>>
>>>>> By the way, is there a way to find them from the Beam website ?
>>>>>
>>>>>
>>>>> On Fri, May 19, 2017 at 6:44 AM Lukasz Cwik <lc...@google.com.invalid>
>>>>> wrote:
>>>>>
>>>>> Now that I'm back from vacation and the 2.0.0 release is not taking all
>>>>>>
>>>>> my
>>>>>
>>>>>> time, I am focusing my attention on working on the Beam Portability
>>>>>> framework, specifically the Fn API so that we can get Python and other
>>>>>> language integrations work with any runner.
>>>>>>
>>>>>> For new comers, I would like to reshare the overview:
>>>>>> https://s.apache.org/beam-fn-api
>>>>>>
>>>>>> And for those of you who have been following this thread and
>>>>>>
>>>>> contributors
>>>>>
>>>>>> focusing on Runner integration with Apache Beam:
>>>>>> * How to process a bundle: https://s.apache.org/beam-fn-a
>>>>>>
>>>>> pi-processing-a-
>>>>>
>>>>>> bundle
>>>>>> * How to send and receive data: https://s.apache.org/
>>>>>> beam-fn-api-send-and-receive-data
>>>>>>
>>>>>> If you want to dive deeper, you should look at:
>>>>>> * Runner API Protobuf: https://github.com/apache/beam
>>>>>> /blob/master/sdks/
>>>>>> common/runner-api/src/main/proto/beam_runner_api.proto
>>>>>> <https://github.com/apache/beam/blob/master/sdks/common/runn
>>>>>>
>>>>> er-api/src/main/proto/beam_runner_api.proto>
>>>>>
>>>>>> * Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/
>>>>>> common/fn-api/src/main/proto/beam_fn_api.proto
>>>>>> <https://github.com/apache/beam/blob/master/sdks/common/fn-
>>>>>>
>>>>> api/src/main/proto/beam_fn_api.proto>
>>>>>
>>>>>> * Java SDK Harness: https://github.com/apache/beam/tree/master/sdks/
>>>>>> java/harness
>>>>>> <https://github.com/apache/beam/tree/master/sdks/java/harness>
>>>>>> * Python SDK Harness: https://github.com/apache/beam
>>>>>> /tree/master/sdks/
>>>>>> python/apache_beam/runners/worker
>>>>>> <https://github.com/apache/beam/tree/master/sdks/python/apac
>>>>>>
>>>>> he_beam/runners/worker>
>>>>>
>>>>>> Next I'm planning on talking about Beam Fn State API and will need
>>>>>> help
>>>>>> from Runner contributors to talk about caching semantics and key
>>>>>> spaces
>>>>>>
>>>>> and
>>>>>
>>>>>> whether the integrations mesh well with current Runner
>>>>>> implementations.
>>>>>>
>>>>> The
>>>>>
>>>>>> State API is meant to support user state, side inputs, and
>>>>>> re-iteration
>>>>>>
>>>>> for
>>>>>
>>>>>> large values produced by GroupByKey.
>>>>>>
>>>>>> On Tue, Jan 24, 2017 at 9:46 AM, Lukasz Cwik <lc...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>> Yes, I was using a Pipeline that was:
>>>>>>> Read(10 GiBs of KV (10,000,000 values)) -> GBK -> IdentityParDo (a
>>>>>>>
>>>>>> batch
>>>>>
>>>>>> pipeline in the global window using the default trigger)
>>>>>>>
>>>>>>> In Google Cloud Dataflow, the shuffle step uses the binary
>>>>>>>
>>>>>> representation
>>>>>
>>>>>> to compare keys, so the above pipeline would normally be converted to
>>>>>>>
>>>>>> the
>>>>>
>>>>>> following two stages:
>>>>>>> Read -> GBK Writer
>>>>>>> GBK Reader -> IdentityParDo
>>>>>>>
>>>>>>> Note that the GBK Writer and GBK Reader need to use a coder to encode
>>>>>>>
>>>>>> and
>>>>>
>>>>>> decode the value.
>>>>>>>
>>>>>>> When using the Fn API, those two stages expanded because of the Fn
>>>>>>> Api
>>>>>>> crossings using a gRPC Write/Read pair:
>>>>>>> Read -> gRPC Write -> gRPC Read -> GBK Writer
>>>>>>> GBK Reader -> gRPC Write -> gRPC Read -> IdentityParDo
>>>>>>>
>>>>>>> In my naive prototype implementation, the coder was used to encode
>>>>>>> elements at the gRPC steps. This meant that the coder was
>>>>>>> encoding/decoding/encoding in the first stage and
>>>>>>> decoding/encoding/decoding in the second stage. This tripled the
>>>>>>>
>>>>>> amount
>>>>>
>>>>>> of
>>>>>>
>>>>>>> times the coder was being invoked per element. This additional use of
>>>>>>>
>>>>>> the
>>>>>
>>>>>> coder accounted for ~12% (80% of the 15%) of the extra execution time.
>>>>>>>
>>>>>> This
>>>>>>
>>>>>>> implementation is quite inefficient and would benefit from merging
>>>>>>> the
>>>>>>>
>>>>>> gRPC
>>>>>>
>>>>>>> Read + GBK Writer into one actor and also the GBK Reader + gRPC Write
>>>>>>>
>>>>>> into
>>>>>>
>>>>>>> another actor allowing for the creation of a fast path that can skip
>>>>>>>
>>>>>> parts
>>>>>>
>>>>>>> of the decode/encode cycle through the coder. By using a byte array
>>>>>>>
>>>>>> view
>>>>>
>>>>>> over the logical stream, one can minimize the number of byte array
>>>>>>>
>>>>>> copies
>>>>>
>>>>>> which plagued my naive implementation. This can be done by only
>>>>>>>
>>>>>> parsing
>>>>>
>>>>>> the
>>>>>>
>>>>>>> element boundaries out of the stream to produce those logical byte
>>>>>>>
>>>>>> array
>>>>>
>>>>>> views. I have a very rough estimate that performing this optimization
>>>>>>>
>>>>>> would
>>>>>>
>>>>>>> reduce the 12% overhead to somewhere between 4% and 6%.
>>>>>>>
>>>>>>> The remaining 3% (15% - 12%) overhead went to many parts of gRPC:
>>>>>>> marshalling/unmarshalling protos
>>>>>>> handling/managing the socket
>>>>>>> flow control
>>>>>>> ...
>>>>>>>
>>>>>>> Finally, I did try experiments with different buffer sizes (10KB,
>>>>>>>
>>>>>> 100KB,
>>>>>
>>>>>> 1000KB), flow control (separate thread[1] vs same thread with
>>>>>>>
>>>>>> phaser[2]),
>>>>>
>>>>>> and channel type [3] (NIO, epoll, domain socket), but coder overhead
>>>>>>>
>>>>>> easily
>>>>>>
>>>>>>> dominated the differences in these other experiments.
>>>>>>>
>>>>>>> Further analysis would need to be done to more accurately distill
>>>>>>> this
>>>>>>> down.
>>>>>>>
>>>>>>> 1: https://github.com/lukecwik/incubator-beam/blob/
>>>>>>> fn_api/sdks/java/harness/src/main/java/org/apache/beam/fn/ha
>>>>>>>
>>>>>> rness/stream/
>>>>>
>>>>>> BufferingStreamObserver.java
>>>>>>> 2: https://github.com/lukecwik/incubator-beam/blob/
>>>>>>> fn_api/sdks/java/harness/src/main/java/org/apache/beam/fn/ha
>>>>>>>
>>>>>> rness/stream/
>>>>>
>>>>>> DirectStreamObserver.java
>>>>>>> 3: https://github.com/lukecwik/incubator-beam/blob/
>>>>>>>
>>>>>>> fn_api/sdks/java/harness/src/main/java/org/apache/beam/fn/ha
>>>>>>
>>>>> rness/channel/
>>>>>
>>>>>> ManagedChannelFactory.java
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 24, 2017 at 8:04 AM, Ismaël Mejía <ieme...@gmail.com>
>>>>>>>
>>>>>> wrote:
>>>>>
>>>>>> Awesome job Lukasz, Excellent, I have to confess the first time I
>>>>>>>>
>>>>>>> heard
>>>>>
>>>>>> about
>>>>>>>> the Fn API idea I was a bit incredulous, but you are making it real,
>>>>>>>> amazing!
>>>>>>>>
>>>>>>>> Just one question from your document, you said that 80% of the extra
>>>>>>>>
>>>>>>> (15%)
>>>>>>
>>>>>>> time
>>>>>>>> goes into encoding and decoding the data for your test case, can you
>>>>>>>> expand
>>>>>>>> in
>>>>>>>> your current ideas to improve this? (I am not sure I completely
>>>>>>>>
>>>>>>> understand
>>>>>>
>>>>>>> the
>>>>>>>> issue).
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jan 23, 2017 at 7:10 PM, Lukasz Cwik
>>>>>>>>
>>>>>>> <lc...@google.com.invalid>
>>>>>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> Responded inline.
>>>>>>>>>
>>>>>>>>> On Sat, Jan 21, 2017 at 8:20 AM, Amit Sela <amitsel...@gmail.com>
>>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This is truly amazing Luke!
>>>>>>>>>>
>>>>>>>>>> If I understand this right, the runner executing the DoFn will
>>>>>>>>>>
>>>>>>>>> delegate
>>>>>>>>
>>>>>>>>> the
>>>>>>>>>
>>>>>>>>>> function code and input data (and state, coders, etc.) to the
>>>>>>>>>>
>>>>>>>>> container
>>>>>>>>
>>>>>>>>> where it will execute with the user's SDK of choice, right ?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, that is correct.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I wonder how the containers relate to the underlying engine's
>>>>>>>>>>
>>>>>>>>> worker
>>>>>
>>>>>> processes ? is it a 1-1, container per worker ? if there's less
>>>>>>>>>>
>>>>>>>>> "work"
>>>>>>
>>>>>>> for
>>>>>>>>>
>>>>>>>>>> the worker's Java process (for example) now and it becomes a
>>>>>>>>>>
>>>>>>>>> sort of
>>>>>
>>>>>> "dispatcher", would that change the resource allocation commonly
>>>>>>>>>>
>>>>>>>>> used
>>>>>>
>>>>>>> for
>>>>>>>>
>>>>>>>>> the same Pipeline so that the worker processes would require less
>>>>>>>>>> resources, while giving those to the container ?
>>>>>>>>>>
>>>>>>>>>> I think with the four services (control, data, state, logging) you
>>>>>>>>>
>>>>>>>> can
>>>>>
>>>>>> go
>>>>>>>>
>>>>>>>>> with a 1-1 relationship or break it up more finely grained and
>>>>>>>>>
>>>>>>>> dedicate
>>>>>>
>>>>>>> some machines to have specific tasks. Like you could have a few
>>>>>>>>>
>>>>>>>> machines
>>>>>>
>>>>>>> dedicated to log aggregation which all the workers push their logs
>>>>>>>>>
>>>>>>>> to.
>>>>>
>>>>>> Similarly, you could have some machines that have a lot of memory
>>>>>>>>>
>>>>>>>> which
>>>>>>
>>>>>>> would be better to be able to do shuffles in memory and then this
>>>>>>>>>
>>>>>>>> cluster
>>>>>>>>
>>>>>>>>> of high memory machines could front the data service. I believe
>>>>>>>>>
>>>>>>>> there
>>>>>
>>>>>> is a
>>>>>>>>
>>>>>>>>> lot of flexibility based upon what a runner can do and what it
>>>>>>>>>
>>>>>>>> specializes
>>>>>>>>
>>>>>>>>> in and believe that with more effort comes more possibilities
>>>>>>>>>
>>>>>>>> albeit
>>>>>
>>>>>> with
>>>>>>>>
>>>>>>>>> increased internal complexity.
>>>>>>>>>
>>>>>>>>> The layout of resources depends on whether the services and SDK
>>>>>>>>>
>>>>>>>> containers
>>>>>>>>
>>>>>>>>> are co-hosted on the same machine or whether there is a different
>>>>>>>>> architecture in play. In a co-hosted configuration, it seems likely
>>>>>>>>>
>>>>>>>> that
>>>>>>
>>>>>>> the SDK container will get more resources but is dependent on the
>>>>>>>>>
>>>>>>>> runner
>>>>>>
>>>>>>> and pipeline shape (shuffle heavy dominated pipelines will look
>>>>>>>>>
>>>>>>>> different
>>>>>>>>
>>>>>>>>> then ParDo dominated pipelines).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> About executing sub-graphs, would it be true to say that as long
>>>>>>>>>>
>>>>>>>>> as
>>>>>
>>>>>> there's
>>>>>>>>>
>>>>>>>>>> no shuffle, you could keep executing in the same container ?
>>>>>>>>>>
>>>>>>>>> meaning
>>>>>
>>>>>> that
>>>>>>>>
>>>>>>>>> the graph is broken into sub-graphs by shuffles ?
>>>>>>>>>>
>>>>>>>>>> The only thing that is required is that the Apache Beam model is
>>>>>>>>>
>>>>>>>> preserved
>>>>>>>>
>>>>>>>>> so typical break points will be at shuffles and language crossing
>>>>>>>>>
>>>>>>>> points
>>>>>>
>>>>>>> (e.g. Python ParDo -> Java ParDo). A runner is free to break up the
>>>>>>>>>
>>>>>>>> graph
>>>>>>>>
>>>>>>>>> even more for other reasons.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have to dig-in deeper, so I could have more questions ;-)
>>>>>>>>>>
>>>>>>>>> thanks
>>>>>
>>>>>> Luke!
>>>>>>>>
>>>>>>>>> On Sat, Jan 21, 2017 at 1:52 AM Lukasz Cwik
>>>>>>>>>>
>>>>>>>>> <lc...@google.com.invalid
>>>>>>
>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I updated the PR description to contain the same.
>>>>>>>>>>>
>>>>>>>>>>> I would start by looking at the API/object model definitions
>>>>>>>>>>>
>>>>>>>>>> found
>>>>>
>>>>>> in
>>>>>>>>
>>>>>>>>> beam_fn_api.proto
>>>>>>>>>>> <
>>>>>>>>>>> https://github.com/lukecwik/incubator-beam/blob/fn_api/
>>>>>>>>>>>
>>>>>>>>>> sdks/common/fn-api/src/main/proto/beam_fn_api.proto
>>>>>>>>>>
>>>>>>>>>>> Then depending on your interest, look at the following:
>>>>>>>>>>> * FnHarness.java
>>>>>>>>>>> <
>>>>>>>>>>> https://github.com/lukecwik/incubator-beam/blob/fn_api/
>>>>>>>>>>>
>>>>>>>>>> sdks/java/harness/src/main/java/org/apache/beam/fn/
>>>>>>>>>>
>>>>>>>>> harness/FnHarness.java
>>>>>>>>>
>>>>>>>>>> is the main entry point.
>>>>>>>>>>> * org.apache.beam.fn.harness.data
>>>>>>>>>>> <
>>>>>>>>>>> https://github.com/lukecwik/incubator-beam/tree/fn_api/
>>>>>>>>>>>
>>>>>>>>>> sdks/java/harness/src/main/java/org/apache/beam/fn/harness/data
>>>>>>>>>>
>>>>>>>>>>> contains the most interesting bits of code since it is able to
>>>>>>>>>>>
>>>>>>>>>> multiplex
>>>>>>>>>
>>>>>>>>>> a
>>>>>>>>>>
>>>>>>>>>>> gRPC stream into multiple logical streams of elements bound for
>>>>>>>>>>>
>>>>>>>>>> multiple
>>>>>>>>>
>>>>>>>>>> concurrent process bundle requests. It also contains the code
>>>>>>>>>>>
>>>>>>>>>> to
>>>>>
>>>>>> take
>>>>>>>>
>>>>>>>>> multiple logical outbound streams and multiplex them back onto
>>>>>>>>>>>
>>>>>>>>>> a
>>>>>
>>>>>> gRPC
>>>>>>>>
>>>>>>>>> stream.
>>>>>>>>>>> * org.apache.beam.runners.core
>>>>>>>>>>> <
>>>>>>>>>>> https://github.com/lukecwik/incubator-beam/tree/fn_api/
>>>>>>>>>>>
>>>>>>>>>> sdks/java/harness/src/main/java/org/apache/beam/runners/core
>>>>>>>>>>
>>>>>>>>>>> contains additional runners akin to the DoFnRunner found in
>>>>>>>>>>>
>>>>>>>>>> runners-core
>>>>>>>>>
>>>>>>>>>> to
>>>>>>>>>>
>>>>>>>>>>> support sources and gRPC endpoints.
>>>>>>>>>>>
>>>>>>>>>>> Unless your really interested in how domain sockets, epoll, nio
>>>>>>>>>>>
>>>>>>>>>> channel
>>>>>>>>
>>>>>>>>> factories or how stream readiness callbacks work in gRPC, I
>>>>>>>>>>>
>>>>>>>>>> would
>>>>>
>>>>>> avoid
>>>>>>>>
>>>>>>>>> the
>>>>>>>>>>
>>>>>>>>>>> packages org.apache.beam.fn.harness.channel and
>>>>>>>>>>> org.apache.beam.fn.harness.stream. Similarly I would avoid
>>>>>>>>>>> org.apache.beam.fn.harness.fn and
>>>>>>>>>>>
>>>>>>>>>> org.apache.beam.fn.harness.fake
>>>>>
>>>>>> as
>>>>>>>>
>>>>>>>>> they
>>>>>>>>>>
>>>>>>>>>>> don't add anything meaningful to the api.
>>>>>>>>>>>
>>>>>>>>>>> Code package descriptions:
>>>>>>>>>>>
>>>>>>>>>>> org.apache.beam.fn.harness.FnHarness: main entry point
>>>>>>>>>>> org.apache.beam.fn.harness.control: Control service client and
>>>>>>>>>>>
>>>>>>>>>> individual
>>>>>>>>>>
>>>>>>>>>>> request handlers
>>>>>>>>>>> org.apache.beam.fn.harness.data: Data service client and
>>>>>>>>>>>
>>>>>>>>>> logical
>>>>>
>>>>>> stream
>>>>>>>>>
>>>>>>>>>> multiplexing
>>>>>>>>>>> org.apache.beam.runners.core: Additional runners akin to the
>>>>>>>>>>>
>>>>>>>>>> DoFnRunner
>>>>>>>>
>>>>>>>>> found in runners-core to support sources and gRPC endpoints
>>>>>>>>>>> org.apache.beam.fn.harness.logging: Logging client
>>>>>>>>>>>
>>>>>>>>>> implementation
>>>>>
>>>>>> and
>>>>>>>>
>>>>>>>>> JUL
>>>>>>>>>>
>>>>>>>>>>> logging handler adapter
>>>>>>>>>>> org.apache.beam.fn.harness.channel: gRPC channel management
>>>>>>>>>>> org.apache.beam.fn.harness.stream: gRPC stream management
>>>>>>>>>>> org.apache.beam.fn.harness.fn: Java 8 functional interface
>>>>>>>>>>>
>>>>>>>>>> extensions
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 20, 2017 at 1:26 PM, Kenneth Knowles
>>>>>>>>>>>
>>>>>>>>>> <k...@google.com.invalid
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> This is awesome! Any chance you could roadmap the PR for us
>>>>>>>>>>>>
>>>>>>>>>>> with
>>>>>
>>>>>> some
>>>>>>>>
>>>>>>>>> links
>>>>>>>>>>>
>>>>>>>>>>>> into the most interesting bits?
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jan 20, 2017 at 12:19 PM, Robert Bradshaw <
>>>>>>>>>>>> rober...@google.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Also, note that we can still support the "simple" case. For
>>>>>>>>>>>>>
>>>>>>>>>>>> example,
>>>>>>>>>
>>>>>>>>>> if the user supplies us with a jar file (as they do now) a
>>>>>>>>>>>>>
>>>>>>>>>>>> runner
>>>>>>>>
>>>>>>>>> could launch it as a subprocesses and communicate with it
>>>>>>>>>>>>>
>>>>>>>>>>>> via
>>>>>
>>>>>> this
>>>>>>>>
>>>>>>>>> same Fn API or install it in a fixed container itself--the
>>>>>>>>>>>>>
>>>>>>>>>>>> user
>>>>>>
>>>>>>> doesn't *need* to know about docker or manually manage
>>>>>>>>>>>>>
>>>>>>>>>>>> containers
>>>>>>>>
>>>>>>>>> (and
>>>>>>>>>>
>>>>>>>>>>> indeed the Fn API could be used in-process, cross-process,
>>>>>>>>>>>>> cross-container, and even cross-machine).
>>>>>>>>>>>>>
>>>>>>>>>>>>> However docker provides a nice cross-language way of
>>>>>>>>>>>>>
>>>>>>>>>>>> specifying
>>>>>>
>>>>>>> the
>>>>>>>>
>>>>>>>>> environment including all dependencies (especially for
>>>>>>>>>>>>>
>>>>>>>>>>>> languages
>>>>>>
>>>>>>> like
>>>>>>>>>
>>>>>>>>>> Python or C where the equivalent of a cross-platform,
>>>>>>>>>>>>>
>>>>>>>>>>>> self-contained
>>>>>>>>>
>>>>>>>>>> jar isn't as easy to produce) and is strictly more powerful
>>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>
>>>>>>> flexible (specifically it isolates the runtime environment
>>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>
>>>>>> one
>>>>>>>>
>>>>>>>>> can
>>>>>>>>>>
>>>>>>>>>>> even use it for local testing).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Slicing a worker up like this without sacrificing
>>>>>>>>>>>>>
>>>>>>>>>>>> performance
>>>>>
>>>>>> is an
>>>>>>>>
>>>>>>>>> ambitious goal, but essential to the story of being able to
>>>>>>>>>>>>>
>>>>>>>>>>>> mix
>>>>>>
>>>>>>> and
>>>>>>>>
>>>>>>>>> match runners and SDKs arbitrarily, and I think this is a
>>>>>>>>>>>>>
>>>>>>>>>>>> great
>>>>>>
>>>>>>> start.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jan 20, 2017 at 9:39 AM, Lukasz Cwik
>>>>>>>>>>>>>
>>>>>>>>>>>> <lc...@google.com.invalid
>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Your correct, a docker container is created that contains
>>>>>>>>>>>>>>
>>>>>>>>>>>>> the
>>>>>>
>>>>>>> execution
>>>>>>>>>>>
>>>>>>>>>>>> environment the user wants or the user re-uses an
>>>>>>>>>>>>>>
>>>>>>>>>>>>> existing
>>>>>
>>>>>> one
>>>>>>
>>>>>>> (allowing
>>>>>>>>>>>>
>>>>>>>>>>>>> for a user to embed all their code/dependencies or use a
>>>>>>>>>>>>>>
>>>>>>>>>>>>> container
>>>>>>>>>
>>>>>>>>>> that
>>>>>>>>>>>
>>>>>>>>>>>> can
>>>>>>>>>>>>>
>>>>>>>>>>>>>> deploy code/dependencies on demand).
>>>>>>>>>>>>>> A user creates a pipeline saying which docker container
>>>>>>>>>>>>>>
>>>>>>>>>>>>> they
>>>>>
>>>>>> want
>>>>>>>>
>>>>>>>>> to
>>>>>>>>>>
>>>>>>>>>>> use
>>>>>>>>>>>>
>>>>>>>>>>>>> (this starts to allow for multiple container definitions
>>>>>>>>>>>>>>
>>>>>>>>>>>>> within a
>>>>>>>>
>>>>>>>>> single
>>>>>>>>>>>>
>>>>>>>>>>>>> pipeline to support multiple languages, versioning, ...).
>>>>>>>>>>>>>> A runner would then be responsible for launching one or
>>>>>>>>>>>>>>
>>>>>>>>>>>>> more
>>>>>
>>>>>> of
>>>>>>>>
>>>>>>>>> these
>>>>>>>>>>
>>>>>>>>>>> containers in a cluster manager of their choice (scaling
>>>>>>>>>>>>>>
>>>>>>>>>>>>> up
>>>>>
>>>>>> or
>>>>>>
>>>>>>> down
>>>>>>>>>
>>>>>>>>>> the
>>>>>>>>>>>
>>>>>>>>>>>> number of instances depending on demand/load/...).
>>>>>>>>>>>>>> A runner then interacts with the docker containers over
>>>>>>>>>>>>>>
>>>>>>>>>>>>> the
>>>>>
>>>>>> gRPC
>>>>>>>>
>>>>>>>>> service
>>>>>>>>>>>>
>>>>>>>>>>>>> definitions to delegate processing to.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jan 20, 2017 at 4:56 AM, Jean-Baptiste Onofré <
>>>>>>>>>>>>>>
>>>>>>>>>>>>> j...@nanthrax.net
>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Luke,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> that's really great and very promising !
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's really ambitious but I like the idea. Just to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> clarify:
>>>>>
>>>>>> the
>>>>>>>>
>>>>>>>>> purpose
>>>>>>>>>>>>
>>>>>>>>>>>>> of
>>>>>>>>>>>>>
>>>>>>>>>>>>>> using gRPC is once the docker container is running,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> then we
>>>>>
>>>>>> can
>>>>>>>>
>>>>>>>>> "interact"
>>>>>>>>>>>>>
>>>>>>>>>>>>>> with the container to spread and delegate processing to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the
>>>>>
>>>>>> docker
>>>>>>>>>
>>>>>>>>>> container, correct ?
>>>>>>>>>>>>>>> The users/devops have to setup the docker containers as
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> prerequisite.
>>>>>>>>>>>
>>>>>>>>>>>> Then, the "location" of the containers (kind of
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> container
>>>>>
>>>>>> registry)
>>>>>>>>>>
>>>>>>>>>>> is
>>>>>>>>>>>
>>>>>>>>>>>> set
>>>>>>>>>>>>>
>>>>>>>>>>>>>> via the pipeline options and used by gRPC ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Luke !
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 01/19/2017 03:56 PM, Lukasz Cwik wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have been prototyping several components towards the
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Beam
>>>>>>
>>>>>>> technical
>>>>>>>>>>>
>>>>>>>>>>>> vision of being able to execute an arbitrary language
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> using
>>>>>>
>>>>>>> an
>>>>>>>>
>>>>>>>>> arbitrary
>>>>>>>>>>>>>
>>>>>>>>>>>>>> runner.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to share this overview [1] of what I have
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> been
>>>>>>
>>>>>>> working
>>>>>>>>>>
>>>>>>>>>>> towards. I also share this PR [2] with a proposed API,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> service
>>>>>>>>
>>>>>>>>> definitions
>>>>>>>>>>>>>
>>>>>>>>>>>>>> and partial implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1: https://s.apache.org/beam-fn-api
>>>>>>>>>>>>>>>> 2: https://github.com/apache/beam/pull/1801
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please comment on the overview within this thread, and
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> any
>>>>>
>>>>>> specific
>>>>>>>>>>
>>>>>>>>>>> code
>>>>>>>>>>>>>
>>>>>>>>>>>>>> comments on the PR directly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Luke
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>>>>>> jbono...@apache.org
>>>>>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>>>>>> Talend - http://www.talend.com
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>
>>>>
>

Re: Beam Fn API

Reply via email to