Re: Unbounded source translation for portable pipelines

Eugene Kirpichov Wed, 27 Jun 2018 12:34:21 -0700

Hi!

Those instructions are not current and I think should be discarded as they
referred to a particular effort that is over - +Ankur Goenka
<[email protected]> is, I believe, working on the remaining finishing
touches for running from a clean clone of Beam master and documenting how
to do that; could you help Thomas so we can start looking at what the
streaming runner is missing?


We'll need to document this in a more prominent place. When we get to a
state where we can run Python WordCount from master, we'll need to document
it somewhere on the main portability page and/or the getting started guide;
when we can run something more serious, e.g. Tensorflow pipelines, that
will be worth a Beam blog post and worth documenting in the TFX
documentation.

On Wed, Jun 27, 2018 at 5:35 AM Thomas Weise <[email protected]> wrote:

> Hi Eugene,
>
> The basic streaming translation is already in place from the prototype,
> though I have not verified it on the master branch yet.
>
> Are the user instructions for the portable Flink runner at
> https://s.apache.org/beam-portability-team-doc current?
>
> (I don't have a dependency on SDF since we are going to use custom native
> Flink sources/sinks at this time.)
>
> Thanks,
> Thomas
>
>
> On Tue, Jun 26, 2018 at 2:13 AM Eugene Kirpichov <[email protected]>
> wrote:
>
>> Hi!
>>
>> Wanted to let you know that I've just merged the PR that adds
>> checkpointable SDF support to the portable reference runner (ULR) and the
>> Java SDK harness:
>>
>> https://github.com/apache/beam/pull/5566
>>
>> So now we have a reference implementation of SDF support in a portable
>> runner, and a reference implementation of SDF support in a portable SDK
>> harness.
>> From here on, we need to replicate this support in other portable runners
>> and other harnesses. The obvious targets are Flink and Python respectively.
>>
>> Chamikara was going to work on the Python harness. +Thomas Weise
>> <[email protected]> Would you be interested in the Flink portable streaming
>> runner side? It is of course blocked by having the rest of that runner
>> working in streaming mode though (the batch mode is practically done - will
>> send you a separate note about the status of that).
>>
>> On Fri, Mar 23, 2018 at 12:20 PM Eugene Kirpichov <[email protected]>
>> wrote:
>>
>>> Luke is right - unbounded sources should go through SDF. I am currently
>>> working on adding such support to Fn API.
>>> The relevant document is s.apache.org/beam-breaking-fusion (note: it
>>> focuses on a much more general case, but also considers in detail the
>>> specific case of running unbounded sources on Fn API), and the first
>>> related PR is https://github.com/apache/beam/pull/4743 .
>>>
>>> Ways you can help speed up this effort:
>>> - Make necessary changes to Apex runner per se to support regular SDFs
>>> in streaming (without portability). They will likely largely carry over to
>>> portable world. I recall that the Apex runner had some level of support of
>>> SDFs, but didn't pass the ValidatesRunner tests yet.
>>> - (general to Beam, not Apex-related per se) Implement the translation
>>> of Read.from(UnboundedSource) via impulse, which will require implementing
>>> an SDF that reads from a given UnboundedSource (taking the UnboundedSource
>>> as an element). This should be fairly straightforward and will allow all
>>> portable runners to take advantage of existing UnboundedSource's.
>>>
>>>
>>> On Fri, Mar 23, 2018 at 3:08 PM Lukasz Cwik <[email protected]> wrote:
>>>
>>>> Using impulse is a precursor for both bounded and unbounded SDF.
>>>>
>>>> This JIRA represents the work that would be to add support for
>>>> unbounded SDF using portability APIs:
>>>> https://issues.apache.org/jira/browse/BEAM-2939
>>>>
>>>>
>>>> On Fri, Mar 23, 2018 at 11:46 AM Thomas Weise <[email protected]> wrote:
>>>>
>>>>> So for streaming, we will need the Impulse translation for bounded
>>>>> input, identical with batch, and then in addition to that support for SDF?
>>>>>
>>>>> Any pointers what's involved in adding the SDF support? Is it runner
>>>>> specific? Does the ULR cover it?
>>>>>
>>>>>
>>>>> On Fri, Mar 23, 2018 at 11:26 AM, Lukasz Cwik <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> All "sources" in portability will use splittable DoFns for execution.
>>>>>>
>>>>>> Specifically, runners will need to be able to checkpoint unbounded
>>>>>> sources to get a minimum viable pipeline working.
>>>>>> For bounded pipelines, a DoFn can read the contents of a bounded
>>>>>> source.
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 23, 2018 at 10:52 AM Thomas Weise <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm looking at the portable pipeline translation for streaming. I
>>>>>>> understand that for batch pipelines, it is sufficient to translate 
>>>>>>> Impulse.
>>>>>>>
>>>>>>> What is the intended path to support unbounded sources?
>>>>>>>
>>>>>>> The goal here is to get a minimum translation working that will
>>>>>>> allow streaming wordcount execution.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>

Re: Unbounded source translation for portable pipelines

Reply via email to