Most of it should probably go to https://beam.apache.org/contribute/portability/
Also for reference, here is the prototype doc:
https://s.apache.org/beam-portability-team-doc

Thomas

On Fri, May 18, 2018 at 5:35 AM, Kenneth Knowles <k...@google.com> wrote:
> This is awesome. Would you be up for adding a brief description at
> https://beam.apache.org/contribute/#works-in-progress and maybe a pointer
> to a gdoc with something like the contents of this email? (My reasoning is
> that (a) the contribution guide should stay concise, but (b) all this
> detail is helpful, yet (c) the detail may be ever-changing, so a separate
> web page is not the best format.)
>
> Kenn
>
> On Thu, May 17, 2018 at 3:13 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> Hi all,
>>
>> A little over a month ago, a large group of Beam community members built
>> a prototype of a portable Flink runner - that is, a runner that can
>> execute Beam pipelines on Flink via the Portability API
>> <https://s.apache.org/beam-runner-api>. The prototype was developed in a
>> separate branch <https://github.com/bsidhom/beam/tree/hacking-job-server>
>> and was successfully demonstrated at Flink Forward, where it ran Python
>> and Go pipelines in a limited setting.
>>
>> Since then, a smaller group of us (Ankur Goenka, Axel Magnuson, Ben
>> Sidhom and I) have been working on productionizing the prototype to
>> address its limitations and do things "the right way", preparing to
>> reuse this work for developing other portable runners (e.g. Spark). This
>> involves a surprising amount of work, since many important design and
>> implementation concerns could be ignored for the purposes of a
>> prototype. I wanted to give an update on where we stand now.
>>
>> Our immediate milestone in sight is to *run Java and Python batch
>> WordCount examples against a distributed remote Flink cluster*. That
>> involves a few moving parts, roughly in order of appearance:
>>
>> *Job submission:*
>> - The SDK is configured to use a "portable runner", whose responsibility
>> is to run the pipeline against a given JobService endpoint (see the
>> sketch after these lists).
>> - The portable runner converts the pipeline to a portable Pipeline
>> proto.
>> - The runner determines which artifacts it needs to stage, and stages
>> them against an ArtifactStagingService.
>> - A Flink-specific JobService receives the Pipeline proto, performs some
>> optimizations (e.g. fusion) and translates it to Flink datasets and
>> functions.
>>
>> *Job execution:*
>> - A Flink function executes a fused chain of Beam transforms (an
>> "executable stage") by converting the input and the stage to bundles and
>> executing them against an SDK harness.
>> - The function starts the proper SDK harness, as well as auxiliary
>> services (e.g. artifact retrieval, side input handling), and wires them
>> together.
>> - The function feeds the data to the harness and receives data back.
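>>
>> To make the submission path concrete, here is a rough sketch of what
>> running a pipeline through the portable runner might look like from the
>> Python SDK. The runner name and the --job_endpoint flag reflect the
>> in-flight PRs and should be treated as assumptions that may change as
>> this work lands:
>>
>>   import apache_beam as beam
>>   from apache_beam.options.pipeline_options import PipelineOptions
>>
>>   # Point the SDK at a running JobService endpoint (e.g. the Flink
>>   # JobService). The portable runner converts the pipeline to a
>>   # Pipeline proto, stages artifacts, and submits the job there.
>>   options = PipelineOptions([
>>       '--runner=PortableRunner',
>>       '--job_endpoint=localhost:8099',
>>   ])
>>
>>   with beam.Pipeline(options=options) as p:
>>     (p
>>      | 'Read' >> beam.Create(['a whale', 'a snail'])
>>      | 'Split' >> beam.FlatMap(lambda line: line.split())
>>      | 'Count' >> beam.combiners.Count.PerElement()
>>      | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
>>      | 'Write' >> beam.io.WriteToText('/tmp/counts'))
>>
>> Note that nothing here is Flink-specific: the same program should submit
>> to any JobService implementation, which is what makes the work reusable
>> for other portable runners.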
>>
>> *And here is our status of implementation for these parts:* basically,
>> almost everything is either done or in review.
>>
>> *Job submission:*
>> - General-purpose portable runner in the Python SDK: done
>> <https://github.com/apache/beam/pull/5301>; Java SDK: also done
>> <https://github.com/apache/beam/pull/5150>
>> - Artifact staging from the Python SDK: in review (PR
>> <https://github.com/apache/beam/pull/5273>, PR
>> <https://github.com/apache/beam/pull/5251>); in Java, it's also done
>> - Flink JobService: in review <https://github.com/apache/beam/pull/5262>
>> - Translation from a Pipeline proto to Flink datasets and functions:
>> done <https://github.com/apache/beam/pull/5226>
>> - ArtifactStagingService implementation that stages artifacts to a
>> location on a distributed filesystem: in development (design is clear)
>>
>> *Job execution:*
>> - Flink function for executing via an SDK harness: done
>> <https://github.com/apache/beam/pull/5285>
>> - APIs for managing the lifecycle of an SDK harness: done
>> <https://github.com/apache/beam/pull/5152>
>> - Specific implementation of those APIs using Docker: part done
>> <https://github.com/apache/beam/pull/5189>, part in review
>> <https://github.com/apache/beam/pull/5392>
>> - ArtifactRetrievalService that retrieves artifacts from the location
>> where ArtifactStagingService staged them: in development
>>
>> We expect that the in-review parts will be done, and the in-development
>> parts developed, in the next 2-3 weeks. We will, of course, update the
>> community when this important milestone is reached.
>>
>> *After that, the next milestones include:*
>> - Set up Java, Python and Go ValidatesRunner tests to run against the
>> portable Flink runner, and get them to pass
>> - Bring Python and Go to parity with Java in terms of such test coverage
>> - Implement the portable Spark runner, with a similar lifecycle but
>> reusing almost all of the Flink work
>> - Add support for streaming to both (which requires SDF - that work is
>> progressing in parallel and by this point should be in a suitable place)
>>
>> *For people who would like to get involved in this effort:* you can
>> already help out by improving ValidatesRunner test coverage in Python
>> and Go. Java has >300 such tests, while Python has only a handful. There
>> will be a large amount of parallelizable work once we get the VR test
>> suites running - stay tuned. SDF + portability is also expected to
>> produce a lot of parallelizable work up for grabs within several weeks.
>>
>> Thanks!
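>>
>> P.S. To make the call for ValidatesRunner coverage concrete, here is a
>> minimal sketch of what such a test looks like in Python. The
>> @attr('ValidatesRunner') nose marker is how the Python suite tags these
>> tests today; treat the exact names here as assumptions:
>>
>>   import unittest
>>
>>   from nose.plugins.attrib import attr
>>
>>   import apache_beam as beam
>>   from apache_beam.testing.test_pipeline import TestPipeline
>>   from apache_beam.testing.util import assert_that, equal_to
>>
>>   class FlatMapTest(unittest.TestCase):
>>
>>     # The attribute lets the VR suites select this test and run it
>>     # against a particular runner (eventually, the portable Flink
>>     # runner).
>>     @attr('ValidatesRunner')
>>     def test_flat_map(self):
>>       with TestPipeline() as p:
>>         result = (p
>>                   | beam.Create(['a b', 'c'])
>>                   | beam.FlatMap(lambda s: s.split()))
>>         assert_that(result, equal_to(['a', 'b', 'c']))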