Thanks Luke, just I guess that the proper link should be this one: https://docs.google.com/document/d/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
> On 13 Oct 2020, at 00:23, Luke Cwik <lc...@google.com> wrote: > > I have a draft[1] off the blog ready. Please take a look. > > 1: > http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo > > <http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo> > On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik <lc...@google.com > <mailto:lc...@google.com>> wrote: > > > On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles <k...@apache.org > <mailto:k...@apache.org>> wrote: > > > On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik <lc...@google.com > <mailto:lc...@google.com>> wrote: > For the 2.25 release the Java Direct, Flink, Jet, Samza, Twister2 will use > SDF powered Read transforms. Users can opt-out with > --experiments=use_deprecated_read. > > Huzzah! In our release notes maybe be clear about the expectations for users: > > Done in https://github.com/apache/beam/pull/13015 > <https://github.com/apache/beam/pull/13015> > > - semantics are expected to be the same: file bugs for any change in results > - perf may vary: file bugs or write to user@ > > I was unable to get Spark done for 2.25 as I found out that Spark streaming > doesn't support watermark holds[1]. If someone knows more about the watermark > system in Spark I could use some guidance here as I believe I have a version > of unbounded SDF support written for Spark (I get all the expected output > from tests, just that watermarks aren't being held back so PAssert fails). > > Spark's watermarks are not comparable to Beam's. The rule as I understand it > is that any data that is later than `max(seen timestamps) - allowedLateness` > is dropped. One difference is that dropping is relative to the watermark > instead of expiring windows, like early versions of Beam. The other > difference is that it track the latest event (some call it a "high water > mark" because it is the highest datetime value seen) where Beam's watermark > is an approximation of the earliest (some call it a "low water mark" because > it is a guarantee that it will not dip lower). When I chatted about this with > Amit in the early days, it was necessary to implement a Beam-style watermark > using Spark state. I think that may still be the case, for correct results. > > > In the Spark implementation I saw that watermark holds weren't wired at all > to control Sparks watermarks and this was causing triggers to fire too early. > > Also, I started a doc[2] to produce an updated blog post since the original > SplittableDoFn blog from 2017 is out of date[3]. I was thinking of making > this a new blog post and having the old blog post point to it. We could also > remove the old blog post and or update it. Any thoughts? > > New blog post w/ pointer from the old one. > > Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read > expansion into each of the runners instead of having it within Read transform > within beam-sdks-java-core. > > Approved! I did CC a bunch of runner authors already. I think the important > thing is if a default changes we should be sure everyone is OK with the perf > changes, and everyone is confident that no incorrect results are produced. > The abstractions between sdk-core, runners-core-*, and individual runners is > important to me: > > - The SDK's job is to produce a portable, un-tweaked pipeline so moving > flags out of SDK core (and IOs) ASAP is super important. > - The runner's job is to execute that pipeline, if they can, however they > want. If a runner wants to run Read transforms differently/directly that is > fine. If a runner is incapable of supporting SDF, then Read is better than > nothing. Etc. > - The runners-core-* job is to just be internal libraries for runner authors > to share code, and should not make any decisions about the Beam model, etc. > > Kenn > > 1: https://github.com/apache/beam/pull/12603 > <https://github.com/apache/beam/pull/12603> > 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE > <http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE> > 3: https://beam.apache.org/blog/splittable-do-fn/ > <https://beam.apache.org/blog/splittable-do-fn/> > 4: https://github.com/apache/beam/pull/13006 > <https://github.com/apache/beam/pull/13006> > > > On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels <m...@apache.org > <mailto:m...@apache.org>> wrote: > Thanks Luke! I've had a pass. > > -Max > > On 28.08.20 01:22, Luke Cwik wrote: > > As an update. > > > > Direct and Twister2 are done. > > Samza: is ready for review[1]. > > Flink: is almost ready for review. [2] lays all the groundwork for the > > migration and [3] finishes the migration (there is a timeout happening > > in FlinkSubmissionTest that I'm trying to figure out). > > No further updates on Spark[4] or Jet[5]. > > > > @Maximilian Michels <mailto:m...@apache.org <mailto:m...@apache.org>> or > > @t...@apache.org <mailto:t...@apache.org> > > <mailto:thomas.we...@gmail.com <mailto:thomas.we...@gmail.com>>, can either > > of you take a look at the > > Flink PRs? > > @ke.wu...@icloud.com <mailto:ke.wu...@icloud.com> > > <mailto:ke.wu...@icloud.com <mailto:ke.wu...@icloud.com>>, Since Xinyu > > delegated > > to you, can you take another look at the Samza PR? > > > > 1: https://github.com/apache/beam/pull/12617 > > <https://github.com/apache/beam/pull/12617> > > 2: https://github.com/apache/beam/pull/12706 > > <https://github.com/apache/beam/pull/12706> > > 3: https://github.com/apache/beam/pull/12708 > > <https://github.com/apache/beam/pull/12708> > > 4: https://github.com/apache/beam/pull/12603 > > <https://github.com/apache/beam/pull/12603> > > 5: https://github.com/apache/beam/pull/12616 > > <https://github.com/apache/beam/pull/12616> > > > > On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe > > <pulasthi...@gmail.com <mailto:pulasthi...@gmail.com> > > <mailto:pulasthi...@gmail.com <mailto:pulasthi...@gmail.com>>> wrote: > > > > Hi Luke > > > > Will take a look at this as soon as possible and get back to you. > > > > Best Regards, > > Pulasthi > > > > On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <lc...@google.com > > <mailto:lc...@google.com> > > <mailto:lc...@google.com <mailto:lc...@google.com>>> wrote: > > > > I have made some good progress here and have gotten to the > > following state for non-portable runners: > > > > DirectRunner[1]: Merged. Supports Read.Bounded and Read.Unbounded. > > Twister2[2]: Ready for review. Supports Read.Bounded, the > > current runner doesn't support unbounded pipelines. > > Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes. Not > > certain about level of unbounded pipeline support coverage since > > Spark uses its own tiny suite of tests to get unbounded pipeline > > coverage instead of the validates runner set. > > Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely > > needs additional work. > > Sazma[5]: WIP. Supports Read.Bounded. Not certain about level of > > unbounded pipeline support coverage since Spark uses its own > > tiny suite of tests to get unbounded pipeline coverage instead > > of the validates runner set. > > Flink: Unstarted. > > > > @Pulasthi Supun Wickramasinghe <mailto:pulasthi...@gmail.com > > <mailto:pulasthi...@gmail.com>> , > > can you help me with the Twister2 PR[2]? > > @Ismaël Mejía <mailto:ieme...@gmail.com > > <mailto:ieme...@gmail.com>>, is PR[3] the expected > > level of support for unbounded pipelines and hence ready for review? > > @Jozsef Bartok <mailto:jo...@hazelcast.com > > <mailto:jo...@hazelcast.com>>, can you help me out > > to get support for unbounded splittable DoFn's into Jet[4]? > > @Xinyu Liu <mailto:xinyuliu...@gmail.com > > <mailto:xinyuliu...@gmail.com>>, is PR[5] the expected > > level of support for unbounded pipelines and hence ready for review? > > > > 1: https://github.com/apache/beam/pull/12519 > > <https://github.com/apache/beam/pull/12519> > > 2: https://github.com/apache/beam/pull/12594 > > <https://github.com/apache/beam/pull/12594> > > 3: https://github.com/apache/beam/pull/12603 > > <https://github.com/apache/beam/pull/12603> > > 4: https://github.com/apache/beam/pull/12616 > > <https://github.com/apache/beam/pull/12616> > > 5: https://github.com/apache/beam/pull/12617 > > <https://github.com/apache/beam/pull/12617> > > > > On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik <lc...@google.com > > <mailto:lc...@google.com> > > <mailto:lc...@google.com <mailto:lc...@google.com>>> wrote: > > > > There shouldn't be any changes required since the wrapper > > will smoothly transition the execution to be run as an SDF. > > New IOs should strongly prefer to use SDF since it should be > > simpler to write and will be more flexible but they can use > > the "*Source"-based APIs. Eventually we'll deprecate the > > APIs but we will never stop supporting them. Eventually they > > should all be migrated to use SDF and if there is another > > major Beam version, we'll finally be able to remove them. > > > > On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko > > <aromanenko....@gmail.com <mailto:aromanenko....@gmail.com> > > <mailto:aromanenko....@gmail.com <mailto:aromanenko....@gmail.com>>> > > wrote: > > > > Hi Luke, > > > > Great to hear about such progress on this! > > > > Talking about opt-out for all runners in the future, > > will it require any code changes for current > > “*Source”-based IOs or the wrappers should completely > > smooth this transition? > > Do we need to require to create new IOs only based on > > SDF or again, the wrappers should help to avoid this? > > > >> On 10 Aug 2020, at 22:59, Luke Cwik <lc...@google.com > >> <mailto:lc...@google.com> > >> <mailto:lc...@google.com <mailto:lc...@google.com>>> wrote: > >> > >> In the past couple of months wrappers[1, 2] have been > >> added to the Beam Java SDK which can execute > >> BoundedSource and UnboundedSource as Splittable DoFns. > >> These have been opt-out for portable pipelines (e.g. > >> Dataflow runner v2, XLang pipelines on Flink/Spark) > >> and opt-in using an experiment for all other pipelines. > >> > >> I would like to start making the non-portable > >> pipelines starting with the DirectRunner[3] to be > >> opt-out with the plan that eventually all runners will > >> only execute splittable DoFns and the > >> BoundedSource/UnboundedSource specific execution logic > >> from the runners will be removed. > >> > >> Users will be able to opt-in any pipeline using the > >> experiment 'use_sdf_read' and opt-out with the > >> experiment 'use_deprecated_read'. (For portable > >> pipelines these experiments were 'beam_fn_api' and > >> 'beam_fn_api_use_deprecated_read' respectively and I > >> have added these two additional aliases to make the > >> experience less confusing). > >> > >> 1: > >> > >> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275 > >> > >> <https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275> > >> 2: > >> > >> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449 > >> > >> <https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449> > >> 3: https://github.com/apache/beam/pull/12519 > >> <https://github.com/apache/beam/pull/12519> > > > > > > > > -- > > Pulasthi S. Wickramasinghe > > PhD Candidate | Research Assistant > > School of Informatics and Computing | Digital Science Center > > Indiana University, Bloomington > > cell: 224-386-9035 <tel:(224)%20386-9035> <tel:(224)%20386-9035> > >