Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

Luke Cwik Mon, 19 Oct 2020 10:18:14 -0700

+Rose Nguyen <[email protected]> suggested that instead of just a blog,
we should add the majority of the current blog's content to the core
programming guide and either drop the blog and/or have a much smaller blog
that links to the docs.


I think this is a great idea, what do others think?

On Wed, Oct 14, 2020 at 10:51 AM Luke Cwik <[email protected]> wrote:

> Thanks Alexey, that is correct.
>
> On Wed, Oct 14, 2020 at 10:33 AM Alexey Romanenko <
> [email protected]> wrote:
>
>> Thanks Luke, just I guess that the proper link should be this one:
>>
>> https://docs.google.com/document/d/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
>>
>> On 13 Oct 2020, at 00:23, Luke Cwik <[email protected]> wrote:
>>
>> I have a draft[1] off the blog ready. Please take a look.
>>
>> 1:
>> http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo
>>
>> On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik <[email protected]> wrote:
>>
>>>
>>>
>>> On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik <[email protected]> wrote:
>>>>
>>>>> For the 2.25 release the Java Direct, Flink, Jet, Samza, Twister2 will
>>>>> use SDF powered Read transforms. Users can opt-out
>>>>> with --experiments=use_deprecated_read.
>>>>>
>>>>
>>>> Huzzah! In our release notes maybe be clear about the expectations for
>>>> users:
>>>>
>>>> Done in https://github.com/apache/beam/pull/13015
>>>
>>>
>>>>  - semantics are expected to be the same: file bugs for any change in
>>>> results
>>>>  - perf may vary: file bugs or write to user@
>>>>
>>>> I was unable to get Spark done for 2.25 as I found out that Spark
>>>>> streaming doesn't support watermark holds[1]. If someone knows more about
>>>>> the watermark system in Spark I could use some guidance here as I believe 
>>>>> I
>>>>> have a version of unbounded SDF support written for Spark (I get all the
>>>>> expected output from tests, just that watermarks aren't being held back so
>>>>> PAssert fails).
>>>>>
>>>>
>>>> Spark's watermarks are not comparable to Beam's. The rule as I
>>>> understand it is that any data that is later than `max(seen timestamps) -
>>>> allowedLateness` is dropped. One difference is that dropping is relative to
>>>> the watermark instead of expiring windows, like early versions of Beam. The
>>>> other difference is that it track the latest event (some call it a "high
>>>> water mark" because it is the highest datetime value seen) where Beam's
>>>> watermark is an approximation of the earliest (some call it a "low water
>>>> mark" because it is a guarantee that it will not dip lower). When I chatted
>>>> about this with Amit in the early days, it was necessary to implement a
>>>> Beam-style watermark using Spark state. I think that may still be the case,
>>>> for correct results.
>>>>
>>>>
>>> In the Spark implementation I saw that watermark holds weren't wired at
>>> all to control Sparks watermarks and this was causing triggers to fire too
>>> early.
>>>
>>>
>>>> Also, I started a doc[2] to produce an updated blog post since the
>>>>> original SplittableDoFn blog from 2017 is out of date[3]. I was thinking 
>>>>> of
>>>>> making this a new blog post and having the old blog post point to it. We
>>>>> could also remove the old blog post and or update it. Any thoughts?
>>>>>
>>>>
>>>> New blog post w/ pointer from the old one.
>>>>
>>>> Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read
>>>>> expansion into each of the runners instead of having it within Read
>>>>> transform within beam-sdks-java-core.
>>>>>
>>>>
>>>> Approved! I did CC a bunch of runner authors already. I think the
>>>> important thing is if a default changes we should be sure everyone is OK
>>>> with the perf changes, and everyone is confident that no incorrect results
>>>> are produced. The abstractions between sdk-core, runners-core-*, and
>>>> individual runners is important to me:
>>>>
>>>>  - The SDK's job is to produce a portable, un-tweaked pipeline so
>>>> moving flags out of SDK core (and IOs) ASAP is super important.
>>>>  - The runner's job is to execute that pipeline, if they can, however
>>>> they want. If a runner wants to run Read transforms differently/directly
>>>> that is fine. If a runner is incapable of supporting SDF, then Read is
>>>> better than nothing. Etc.
>>>>  - The runners-core-* job is to just be internal libraries for runner
>>>> authors to share code, and should not make any decisions about the Beam
>>>> model, etc.
>>>>
>>>> Kenn
>>>>
>>>> 1: https://github.com/apache/beam/pull/12603
>>>>> 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
>>>>> 3: https://beam.apache.org/blog/splittable-do-fn/
>>>>> 4: https://github.com/apache/beam/pull/13006
>>>>>
>>>>>
>>>>> On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks Luke! I've had a pass.
>>>>>>
>>>>>> -Max
>>>>>>
>>>>>> On 28.08.20 01:22, Luke Cwik wrote:
>>>>>> > As an update.
>>>>>> >
>>>>>> > Direct and Twister2 are done.
>>>>>> > Samza: is ready for review[1].
>>>>>> > Flink: is almost ready for review. [2] lays all the groundwork for
>>>>>> the
>>>>>> > migration and [3] finishes the migration (there is a timeout
>>>>>> happening
>>>>>> > in FlinkSubmissionTest that I'm trying to figure out).
>>>>>> > No further updates on Spark[4] or Jet[5].
>>>>>> >
>>>>>> > @Maximilian Michels <mailto:[email protected]> or @[email protected]
>>>>>> > <mailto:[email protected]>, can either of you take a look at
>>>>>> the
>>>>>> > Flink PRs?
>>>>>> > @[email protected] <mailto:[email protected]>, Since Xinyu
>>>>>> delegated
>>>>>> > to you, can you take another look at the Samza PR?
>>>>>> >
>>>>>> > 1: https://github.com/apache/beam/pull/12617
>>>>>> > 2: https://github.com/apache/beam/pull/12706
>>>>>> > 3: https://github.com/apache/beam/pull/12708
>>>>>> > 4: https://github.com/apache/beam/pull/12603
>>>>>> > 5: https://github.com/apache/beam/pull/12616
>>>>>> >
>>>>>> > On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe
>>>>>> > <[email protected] <mailto:[email protected]>> wrote:
>>>>>> >
>>>>>> >     Hi Luke
>>>>>> >
>>>>>> >     Will take a look at this as soon as possible and get back to
>>>>>> you.
>>>>>> >
>>>>>> >     Best Regards,
>>>>>> >     Pulasthi
>>>>>> >
>>>>>> >     On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <[email protected]
>>>>>> >     <mailto:[email protected]>> wrote:
>>>>>> >
>>>>>> >         I have made some good progress here and have gotten to the
>>>>>> >         following state for non-portable runners:
>>>>>> >
>>>>>> >         DirectRunner[1]: Merged. Supports Read.Bounded and
>>>>>> Read.Unbounded.
>>>>>> >         Twister2[2]: Ready for review. Supports Read.Bounded, the
>>>>>> >         current runner doesn't support unbounded pipelines.
>>>>>> >         Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes.
>>>>>> Not
>>>>>> >         certain about level of unbounded pipeline support coverage
>>>>>> since
>>>>>> >         Spark uses its own tiny suite of tests to get unbounded
>>>>>> pipeline
>>>>>> >         coverage instead of the validates runner set.
>>>>>> >         Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded
>>>>>> definitely
>>>>>> >         needs additional work.
>>>>>> >         Sazma[5]: WIP. Supports Read.Bounded. Not certain about
>>>>>> level of
>>>>>> >         unbounded pipeline support coverage since Spark uses its own
>>>>>> >         tiny suite of tests to get unbounded pipeline coverage
>>>>>> instead
>>>>>> >         of the validates runner set.
>>>>>> >         Flink: Unstarted.
>>>>>> >
>>>>>> >         @Pulasthi Supun Wickramasinghe <mailto:
>>>>>> [email protected]> ,
>>>>>> >         can you help me with the Twister2 PR[2]?
>>>>>> >         @Ismaël Mejía <mailto:[email protected]>, is PR[3] the
>>>>>> expected
>>>>>> >         level of support for unbounded pipelines and hence ready
>>>>>> for review?
>>>>>> >         @Jozsef Bartok <mailto:[email protected]>, can you help
>>>>>> me out
>>>>>> >         to get support for unbounded splittable DoFn's into Jet[4]?
>>>>>> >         @Xinyu Liu <mailto:[email protected]>, is PR[5] the
>>>>>> expected
>>>>>> >         level of support for unbounded pipelines and hence ready
>>>>>> for review?
>>>>>> >
>>>>>> >         1: https://github.com/apache/beam/pull/12519
>>>>>> >         2: https://github.com/apache/beam/pull/12594
>>>>>> >         3: https://github.com/apache/beam/pull/12603
>>>>>> >         4: https://github.com/apache/beam/pull/12616
>>>>>> >         5: https://github.com/apache/beam/pull/12617
>>>>>> >
>>>>>> >         On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik <
>>>>>> [email protected]
>>>>>> >         <mailto:[email protected]>> wrote:
>>>>>> >
>>>>>> >             There shouldn't be any changes required since the
>>>>>> wrapper
>>>>>> >             will smoothly transition the execution to be run as an
>>>>>> SDF.
>>>>>> >             New IOs should strongly prefer to use SDF since it
>>>>>> should be
>>>>>> >             simpler to write and will be more flexible but they can
>>>>>> use
>>>>>> >             the "*Source"-based APIs. Eventually we'll deprecate the
>>>>>> >             APIs but we will never stop supporting them. Eventually
>>>>>> they
>>>>>> >             should all be migrated to use SDF and if there is
>>>>>> another
>>>>>> >             major Beam version, we'll finally be able to remove
>>>>>> them.
>>>>>> >
>>>>>> >             On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko
>>>>>> >             <[email protected] <mailto:
>>>>>> [email protected]>>
>>>>>> >             wrote:
>>>>>> >
>>>>>> >                 Hi Luke,
>>>>>> >
>>>>>> >                 Great to hear about such progress on this!
>>>>>> >
>>>>>> >                 Talking about opt-out for all runners in the future,
>>>>>> >                 will it require any code changes for current
>>>>>> >                 “*Source”-based IOs or the wrappers should
>>>>>> completely
>>>>>> >                 smooth this transition?
>>>>>> >                 Do we need to require to create new IOs only based
>>>>>> on
>>>>>> >                 SDF or again, the wrappers should help to avoid
>>>>>> this?
>>>>>> >
>>>>>> >>                 On 10 Aug 2020, at 22:59, Luke Cwik <
>>>>>> [email protected]
>>>>>> >>                 <mailto:[email protected]>> wrote:
>>>>>> >>
>>>>>> >>                 In the past couple of months wrappers[1, 2] have
>>>>>> been
>>>>>> >>                 added to the Beam Java SDK which can execute
>>>>>> >>                 BoundedSource and UnboundedSource as Splittable
>>>>>> DoFns.
>>>>>> >>                 These have been opt-out for portable pipelines
>>>>>> (e.g.
>>>>>> >>                 Dataflow runner v2, XLang pipelines on Flink/Spark)
>>>>>> >>                 and opt-in using an experiment for all other
>>>>>> pipelines.
>>>>>> >>
>>>>>> >>                 I would like to start making the non-portable
>>>>>> >>                 pipelines starting with the DirectRunner[3] to be
>>>>>> >>                 opt-out with the plan that eventually all runners
>>>>>> will
>>>>>> >>                 only execute splittable DoFns and the
>>>>>> >>                 BoundedSource/UnboundedSource specific execution
>>>>>> logic
>>>>>> >>                 from the runners will be removed.
>>>>>> >>
>>>>>> >>                 Users will be able to opt-in any pipeline using the
>>>>>> >>                 experiment 'use_sdf_read' and opt-out with the
>>>>>> >>                 experiment 'use_deprecated_read'. (For portable
>>>>>> >>                 pipelines these experiments were 'beam_fn_api' and
>>>>>> >>                 'beam_fn_api_use_deprecated_read' respectively and
>>>>>> I
>>>>>> >>                 have added these two additional aliases to make the
>>>>>> >>                 experience less confusing).
>>>>>> >>
>>>>>> >>                 1:
>>>>>> >>
>>>>>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275
>>>>>> >>                 2:
>>>>>> >>
>>>>>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449
>>>>>> >>                 3: https://github.com/apache/beam/pull/12519
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >     --
>>>>>> >     Pulasthi S. Wickramasinghe
>>>>>> >     PhD Candidate  | Research Assistant
>>>>>> >     School of Informatics and Computing | Digital Science Center
>>>>>> >     Indiana University, Bloomington
>>>>>> >     cell: 224-386-9035 <(224)%20386-9035> <tel:(224)%20386-9035>
>>>>>> >
>>>>>>
>>>>>
>>

Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

Reply via email to