Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

Alexey Romanenko Wed, 14 Oct 2020 10:34:09 -0700

Thanks Luke, though I think the proper link should be this one:
https://docs.google.com/document/d/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE


> On 13 Oct 2020, at 00:23, Luke Cwik wrote:
> 
> I have a draft[1] of the blog ready. Please take a look.
> 
> 1: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo
> 
> On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik wrote:
> 
> 
> On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles wrote:
> 
> 
> On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik wrote:
> For the 2.25 release, the Java Direct, Flink, Jet, Samza, and Twister2 runners
> will use SDF-powered Read transforms. Users can opt out with
> --experiments=use_deprecated_read.
> 
> Huzzah! In our release notes maybe be clear about the expectations for users:
> 
> Done in https://github.com/apache/beam/pull/13015
>  
>  - semantics are expected to be the same: file bugs for any change in results
>  - perf may vary: file bugs or write to user@
> 
> I was unable to get Spark done for 2.25 as I found out that Spark streaming 
> doesn't support watermark holds[1]. If someone knows more about the watermark 
> system in Spark I could use some guidance here as I believe I have a version 
> of unbounded SDF support written for Spark (I get all the expected output 
> from tests, just that watermarks aren't being held back so PAssert fails).
> 
> Spark's watermarks are not comparable to Beam's. The rule as I understand it 
> is that any data that is later than `max(seen timestamps) - allowedLateness` 
> is dropped. One difference is that dropping is relative to the watermark 
> instead of expiring windows, like early versions of Beam. The other 
> difference is that it tracks the latest event (some call it a "high water 
> mark" because it is the highest datetime value seen) where Beam's watermark 
> is an approximation of the earliest (some call it a "low water mark" because 
> it is a guarantee that it will not dip lower). When I chatted about this with 
> Amit in the early days, it was necessary to implement a Beam-style watermark 
> using Spark state. I think that may still be the case, for correct results.
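
A minimal sketch of the distinction Kenn describes, in plain Java with no Spark or Beam dependency. The class and method names are hypothetical, and modeling the low water mark as the minimum over still-pending (held) timestamps is a simplifying assumption, not how either system is implemented.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Collection;
import java.util.Collections;

/** Illustrative only: models the two watermark notions from the thread, not real Spark/Beam APIs. */
final class WatermarkSketch {
  private WatermarkSketch() {}

  /** "High water mark": the latest event timestamp seen so far. */
  static Instant highWaterMark(Collection<Instant> seen) {
    return Collections.max(seen);
  }

  /** Rule as described for Spark: drop data older than max(seen) - allowedLateness. */
  static boolean droppedByHighWaterMark(
      Instant element, Collection<Instant> seen, Duration allowedLateness) {
    return element.isBefore(highWaterMark(seen).minus(allowedLateness));
  }

  /**
   * "Low water mark": a lower bound on timestamps that may still arrive.
   * Assumption: approximated here as the minimum over pending/held timestamps.
   */
  static Instant lowWaterMark(Collection<Instant> pending) {
    return Collections.min(pending);
  }

  /** An element is late under the low water mark only if it is older than that bound. */
  static boolean lateByLowWaterMark(Instant element, Collection<Instant> pending) {
    return element.isBefore(lowWaterMark(pending));
  }
}
```

With seen timestamps 10:00 and 10:30 and an allowed lateness of 5 minutes, an element stamped 10:10 falls behind the high-water-mark threshold (10:25) and would be dropped. If work stamped 10:05 is still held back, the low water mark stays at 10:05 and the same element is not late. This is why a hold that is not wired into the watermark can let triggers fire too early.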
> 
> 
> In the Spark implementation I saw that watermark holds weren't wired at all 
> to control Spark's watermarks and this was causing triggers to fire too early.
>  
> Also, I started a doc[2] to produce an updated blog post since the original 
> SplittableDoFn blog from 2017 is out of date[3]. I was thinking of making 
> this a new blog post and having the old blog post point to it. We could also 
> remove the old blog post and/or update it. Any thoughts?
> 
> New blog post w/ pointer from the old one.
> 
> Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read 
> expansion into each of the runners instead of having it within Read transform 
> within beam-sdks-java-core.
> 
> Approved! I did CC a bunch of runner authors already. I think the important 
> thing is if a default changes we should be sure everyone is OK with the perf 
> changes, and everyone is confident that no incorrect results are produced. 
> The abstractions between sdk-core, runners-core-*, and individual runners are 
> important to me:
> 
>  - The SDK's job is to produce a portable, un-tweaked pipeline so moving 
> flags out of SDK core (and IOs) ASAP is super important.
>  - The runner's job is to execute that pipeline, if they can, however they 
> want. If a runner wants to run Read transforms differently/directly that is 
> fine. If a runner is incapable of supporting SDF, then Read is better than 
> nothing. Etc.
>  - The runners-core-* job is to just be internal libraries for runner authors 
> to share code, and should not make any decisions about the Beam model, etc.
> 
> Kenn
> 
> 1: https://github.com/apache/beam/pull/12603
> 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
> 3: https://beam.apache.org/blog/splittable-do-fn/
> 4: https://github.com/apache/beam/pull/13006
> 
> 
> On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels wrote:
> Thanks Luke! I've had a pass.
> 
> -Max
> 
> On 28.08.20 01:22, Luke Cwik wrote:
> > As an update.
> > 
> > Direct and Twister2 are done.
> > Samza: is ready for review[1].
> > Flink: is almost ready for review. [2] lays all the groundwork for the 
> > migration and [3] finishes the migration (there is a timeout happening 
> > in FlinkSubmissionTest that I'm trying to figure out).
> > No further updates on Spark[4] or Jet[5].
> > 
> > @Maximilian Michels or @[email protected], can either of you take a look at
> > the Flink PRs?
> > @[email protected], since Xinyu delegated to you, can you take another look
> > at the Samza PR?
> > 
> > 1: https://github.com/apache/beam/pull/12617
> > 2: https://github.com/apache/beam/pull/12706
> > 3: https://github.com/apache/beam/pull/12708
> > 4: https://github.com/apache/beam/pull/12603
> > 5: https://github.com/apache/beam/pull/12616
> > 
> > On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe wrote:
> > 
> >     Hi Luke
> > 
> >     Will take a look at this as soon as possible and get back to you.
> > 
> >     Best Regards,
> >     Pulasthi
> > 
> >     On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik wrote:
> > 
> >         I have made some good progress here and have gotten to the
> >         following state for non-portable runners:
> > 
> >         DirectRunner[1]: Merged. Supports Read.Bounded and Read.Unbounded.
> >         Twister2[2]: Ready for review. Supports Read.Bounded, the
> >         current runner doesn't support unbounded pipelines.
> >         Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes. Not
> >         certain about level of unbounded pipeline support coverage since
> >         Spark uses its own tiny suite of tests to get unbounded pipeline
> >         coverage instead of the validates runner set.
> >         Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely
> >         needs additional work.
> >         Samza[5]: WIP. Supports Read.Bounded. Not certain about level of
> >         unbounded pipeline support coverage since Spark uses its own
> >         tiny suite of tests to get unbounded pipeline coverage instead
> >         of the validates runner set.
> >         Flink: Unstarted.
> > 
> >         @Pulasthi Supun Wickramasinghe, can you help me with the
> >         Twister2 PR[2]?
> >         @Ismaël Mejía, is PR[3] the expected level of support for
> >         unbounded pipelines and hence ready for review?
> >         @Jozsef Bartok, can you help me out to get support for
> >         unbounded splittable DoFns into Jet[4]?
> >         @Xinyu Liu, is PR[5] the expected level of support for
> >         unbounded pipelines and hence ready for review?
> > 
> >         1: https://github.com/apache/beam/pull/12519
> >         2: https://github.com/apache/beam/pull/12594
> >         3: https://github.com/apache/beam/pull/12603
> >         4: https://github.com/apache/beam/pull/12616
> >         5: https://github.com/apache/beam/pull/12617
> > 
> >         On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik wrote:
> > 
> >             There shouldn't be any changes required since the wrapper
> >             will smoothly transition the execution to be run as an SDF.
> >             New IOs should strongly prefer to use SDF since it should be
> >             simpler to write and will be more flexible but they can use
> >             the "*Source"-based APIs. Eventually we'll deprecate the
> >             APIs but we will never stop supporting them. Eventually they
> >             should all be migrated to use SDF and if there is another
> >             major Beam version, we'll finally be able to remove them.
> > 
> >             On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko wrote:
> > 
> >                 Hi Luke,
> > 
> >                 Great to hear about such progress on this!
> > 
> >                 Regarding opt-out for all runners in the future,
> >                 will it require any code changes to current
> >                 “*Source”-based IOs, or should the wrappers completely
> >                 smooth this transition?
> >                 Do we need to require that new IOs be based only on
> >                 SDF, or again, should the wrappers help to avoid this?
> > 
> >>                 On 10 Aug 2020, at 22:59, Luke Cwik wrote:
> >>
> >>                 In the past couple of months wrappers[1, 2] have been
> >>                 added to the Beam Java SDK which can execute
> >>                 BoundedSource and UnboundedSource as Splittable DoFns.
> >>                 These have been opt-out for portable pipelines (e.g.
> >>                 Dataflow runner v2, XLang pipelines on Flink/Spark)
> >>                 and opt-in using an experiment for all other pipelines.
> >>
> >>                 I would like to start making the non-portable
> >>                 pipelines opt-out, starting with the DirectRunner[3],
> >>                 with the plan that eventually all runners will
> >>                 only execute splittable DoFns and the
> >>                 BoundedSource/UnboundedSource-specific execution logic
> >>                 will be removed from the runners.
> >>
> >>                 Users will be able to opt-in any pipeline using the
> >>                 experiment 'use_sdf_read' and opt-out with the
> >>                 experiment 'use_deprecated_read'. (For portable
> >>                 pipelines these experiments were 'beam_fn_api' and
> >>                 'beam_fn_api_use_deprecated_read' respectively and I
> >>                 have added these two additional aliases to make the
> >>                 experience less confusing).
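
A minimal sketch of how the opt-in and opt-out experiments and their aliases described above could be resolved, in plain Java without the Beam SDK. The class, the method names, and the `runnerDefaultsToSdf` parameter are hypothetical; the rule that an opt-out wins when both kinds of experiment are present is an assumption, not something stated in the thread.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: a stand-alone model of the flags, not code from the Beam SDK. */
final class ReadExperimentSketch {
  private ReadExperimentSketch() {}

  /** Collects values from every --experiments=a,b,c argument. */
  static Set<String> parseExperiments(String... args) {
    String prefix = "--experiments=";
    Set<String> experiments = new HashSet<>();
    for (String arg : args) {
      if (arg.startsWith(prefix)) {
        for (String name : arg.substring(prefix.length()).split(",")) {
          if (!name.trim().isEmpty()) {
            experiments.add(name.trim());
          }
        }
      }
    }
    return experiments;
  }

  /** Decides whether Read would run as a splittable DoFn under this model. */
  static boolean useSdfRead(Set<String> experiments, boolean runnerDefaultsToSdf) {
    boolean optOut =
        experiments.contains("use_deprecated_read")
            || experiments.contains("beam_fn_api_use_deprecated_read");
    boolean optIn =
        experiments.contains("use_sdf_read") || experiments.contains("beam_fn_api");
    if (optOut) {
      return false; // Assumption: an explicit opt-out takes precedence.
    }
    return optIn || runnerDefaultsToSdf;
  }
}
```

Under this model, a runner that already defaults to SDF needs no flag, and passing `--experiments=use_deprecated_read` switches it back to the legacy Read execution.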
> >>
> >>                 1: https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275
> >>                 2: https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449
> >>                 3: https://github.com/apache/beam/pull/12519
> > 
> > 
> > 
> >     -- 
> >     Pulasthi S. Wickramasinghe
> >     PhD Candidate  | Research Assistant
> >     School of Informatics and Computing | Digital Science Center
> >     Indiana University, Bloomington
> >     cell: 224-386-9035
> >
