+Rose Nguyen <rtngu...@google.com> suggested that instead of just a blog, we should add the majority of the current blog's content to the core programming guide and either drop the blog or have a much smaller blog that links to the docs.
I think this is a great idea. What do others think?

On Wed, Oct 14, 2020 at 10:51 AM Luke Cwik <lc...@google.com> wrote:

> Thanks Alexey, that is correct.
>
> On Wed, Oct 14, 2020 at 10:33 AM Alexey Romanenko <aromanenko....@gmail.com> wrote:
>
>> Thanks Luke, just I guess that the proper link should be this one:
>>
>> https://docs.google.com/document/d/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
>>
>> On 13 Oct 2020, at 00:23, Luke Cwik <lc...@google.com> wrote:
>>
>> I have a draft[1] of the blog ready. Please take a look.
>>
>> 1: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo
>>
>> On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles <k...@apache.org> wrote:
>>>
>>>> On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> For the 2.25 release the Java Direct, Flink, Jet, Samza, Twister2 will
>>>>> use SDF powered Read transforms. Users can opt-out
>>>>> with --experiments=use_deprecated_read.
>>>>>
>>>>
>>>> Huzzah! In our release notes maybe be clear about the expectations for
>>>> users:
>>>>
>>> Done in https://github.com/apache/beam/pull/13015
>>>
>>>> - semantics are expected to be the same: file bugs for any change in
>>>> results
>>>> - perf may vary: file bugs or write to user@
>>>>
>>>>> I was unable to get Spark done for 2.25 as I found out that Spark
>>>>> streaming doesn't support watermark holds[1]. If someone knows more about
>>>>> the watermark system in Spark I could use some guidance here as I believe
>>>>> I have a version of unbounded SDF support written for Spark (I get all the
>>>>> expected output from tests, just that watermarks aren't being held back so
>>>>> PAssert fails).
>>>>>
>>>>
>>>> Spark's watermarks are not comparable to Beam's. The rule as I
>>>> understand it is that any data that is later than
>>>> `max(seen timestamps) - allowedLateness` is dropped.
>>>> One difference is that dropping is relative to the watermark instead of
>>>> expiring windows, like early versions of Beam. The other difference is
>>>> that it tracks the latest event (some call it a "high water mark" because
>>>> it is the highest datetime value seen) where Beam's watermark is an
>>>> approximation of the earliest (some call it a "low water mark" because it
>>>> is a guarantee that it will not dip lower). When I chatted about this with
>>>> Amit in the early days, it was necessary to implement a Beam-style
>>>> watermark using Spark state. I think that may still be the case, for
>>>> correct results.
>>>>
>>> In the Spark implementation I saw that watermark holds weren't wired at
>>> all to control Spark's watermarks and this was causing triggers to fire too
>>> early.
>>>
>>>>> Also, I started a doc[2] to produce an updated blog post since the
>>>>> original SplittableDoFn blog from 2017 is out of date[3]. I was thinking
>>>>> of making this a new blog post and having the old blog post point to it.
>>>>> We could also remove the old blog post and/or update it. Any thoughts?
>>>>>
>>>>
>>>> New blog post w/ pointer from the old one.
>>>>
>>>>> Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read
>>>>> expansion into each of the runners instead of having it within Read
>>>>> transform within beam-sdks-java-core.
>>>>>
>>>>
>>>> Approved! I did CC a bunch of runner authors already. I think the
>>>> important thing is if a default changes we should be sure everyone is OK
>>>> with the perf changes, and everyone is confident that no incorrect results
>>>> are produced. The abstractions between sdk-core, runners-core-*, and
>>>> individual runners are important to me:
>>>>
>>>> - The SDK's job is to produce a portable, un-tweaked pipeline so
>>>> moving flags out of SDK core (and IOs) ASAP is super important.
>>>> - The runner's job is to execute that pipeline, if they can, however
>>>> they want.
>>>> If a runner wants to run Read transforms differently/directly
>>>> that is fine. If a runner is incapable of supporting SDF, then Read is
>>>> better than nothing. Etc.
>>>> - The runners-core-* job is to just be internal libraries for runner
>>>> authors to share code, and should not make any decisions about the Beam
>>>> model, etc.
>>>>
>>>> Kenn
>>>>
>>>>> 1: https://github.com/apache/beam/pull/12603
>>>>> 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
>>>>> 3: https://beam.apache.org/blog/splittable-do-fn/
>>>>> 4: https://github.com/apache/beam/pull/13006
>>>>>
>>>>> On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels <m...@apache.org> wrote:
>>>>>
>>>>>> Thanks Luke! I've had a pass.
>>>>>>
>>>>>> -Max
>>>>>>
>>>>>> On 28.08.20 01:22, Luke Cwik wrote:
>>>>>> > As an update.
>>>>>> >
>>>>>> > Direct and Twister2 are done.
>>>>>> > Samza: is ready for review[1].
>>>>>> > Flink: is almost ready for review. [2] lays all the groundwork for the
>>>>>> > migration and [3] finishes the migration (there is a timeout happening
>>>>>> > in FlinkSubmissionTest that I'm trying to figure out).
>>>>>> > No further updates on Spark[4] or Jet[5].
>>>>>> >
>>>>>> > @Maximilian Michels <mailto:m...@apache.org> or @t...@apache.org
>>>>>> > <mailto:thomas.we...@gmail.com>, can either of you take a look at the
>>>>>> > Flink PRs?
>>>>>> > @ke.wu...@icloud.com <mailto:ke.wu...@icloud.com>, Since Xinyu delegated
>>>>>> > to you, can you take another look at the Samza PR?
>>>>>> >
>>>>>> > 1: https://github.com/apache/beam/pull/12617
>>>>>> > 2: https://github.com/apache/beam/pull/12706
>>>>>> > 3: https://github.com/apache/beam/pull/12708
>>>>>> > 4: https://github.com/apache/beam/pull/12603
>>>>>> > 5: https://github.com/apache/beam/pull/12616
>>>>>> >
>>>>>> > On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe
>>>>>> > <pulasthi...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi Luke
>>>>>> >
>>>>>> > Will take a look at this as soon as possible and get back to you.
>>>>>> >
>>>>>> > Best Regards,
>>>>>> > Pulasthi
>>>>>> >
>>>>>> > On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <lc...@google.com> wrote:
>>>>>> >
>>>>>> > I have made some good progress here and have gotten to the
>>>>>> > following state for non-portable runners:
>>>>>> >
>>>>>> > DirectRunner[1]: Merged. Supports Read.Bounded and Read.Unbounded.
>>>>>> > Twister2[2]: Ready for review. Supports Read.Bounded, the
>>>>>> > current runner doesn't support unbounded pipelines.
>>>>>> > Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes. Not
>>>>>> > certain about level of unbounded pipeline support coverage since
>>>>>> > Spark uses its own tiny suite of tests to get unbounded pipeline
>>>>>> > coverage instead of the validates runner set.
>>>>>> > Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely
>>>>>> > needs additional work.
>>>>>> > Samza[5]: WIP. Supports Read.Bounded. Not certain about level of
>>>>>> > unbounded pipeline support coverage since Spark uses its own
>>>>>> > tiny suite of tests to get unbounded pipeline coverage instead
>>>>>> > of the validates runner set.
>>>>>> > Flink: Unstarted.
>>>>>> >
>>>>>> > @Pulasthi Supun Wickramasinghe <mailto:pulasthi...@gmail.com>,
>>>>>> > can you help me with the Twister2 PR[2]?
>>>>>> > @Ismaël Mejía <mailto:ieme...@gmail.com>, is PR[3] the expected
>>>>>> > level of support for unbounded pipelines and hence ready for review?
>>>>>> > @Jozsef Bartok <mailto:jo...@hazelcast.com>, can you help me out
>>>>>> > to get support for unbounded splittable DoFn's into Jet[4]?
>>>>>> > @Xinyu Liu <mailto:xinyuliu...@gmail.com>, is PR[5] the expected
>>>>>> > level of support for unbounded pipelines and hence ready for review?
>>>>>> >
>>>>>> > 1: https://github.com/apache/beam/pull/12519
>>>>>> > 2: https://github.com/apache/beam/pull/12594
>>>>>> > 3: https://github.com/apache/beam/pull/12603
>>>>>> > 4: https://github.com/apache/beam/pull/12616
>>>>>> > 5: https://github.com/apache/beam/pull/12617
>>>>>> >
>>>>>> > On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik <lc...@google.com> wrote:
>>>>>> >
>>>>>> > There shouldn't be any changes required since the wrapper
>>>>>> > will smoothly transition the execution to be run as an SDF.
>>>>>> > New IOs should strongly prefer to use SDF since it should be
>>>>>> > simpler to write and will be more flexible but they can use
>>>>>> > the "*Source"-based APIs. Eventually we'll deprecate the
>>>>>> > APIs but we will never stop supporting them. Eventually they
>>>>>> > should all be migrated to use SDF and if there is another
>>>>>> > major Beam version, we'll finally be able to remove them.
>>>>>> >
>>>>>> > On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko
>>>>>> > <aromanenko....@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi Luke,
>>>>>> >
>>>>>> > Great to hear about such progress on this!
>>>>>> >
>>>>>> > Talking about opt-out for all runners in the future,
>>>>>> > will it require any code changes for current
>>>>>> > “*Source”-based IOs or the wrappers should completely
>>>>>> > smooth this transition?
>>>>>> > Do we need to require to create new IOs only based on
>>>>>> > SDF or again, the wrappers should help to avoid this?
>>>>>> >
>>>>>> >> On 10 Aug 2020, at 22:59, Luke Cwik <lc...@google.com> wrote:
>>>>>> >>
>>>>>> >> In the past couple of months wrappers[1, 2] have been
>>>>>> >> added to the Beam Java SDK which can execute
>>>>>> >> BoundedSource and UnboundedSource as Splittable DoFns.
>>>>>> >> These have been opt-out for portable pipelines (e.g.
>>>>>> >> Dataflow runner v2, XLang pipelines on Flink/Spark)
>>>>>> >> and opt-in using an experiment for all other pipelines.
>>>>>> >>
>>>>>> >> I would like to start making the non-portable
>>>>>> >> pipelines starting with the DirectRunner[3] to be
>>>>>> >> opt-out with the plan that eventually all runners will
>>>>>> >> only execute splittable DoFns and the
>>>>>> >> BoundedSource/UnboundedSource specific execution logic
>>>>>> >> from the runners will be removed.
>>>>>> >>
>>>>>> >> Users will be able to opt-in any pipeline using the
>>>>>> >> experiment 'use_sdf_read' and opt-out with the
>>>>>> >> experiment 'use_deprecated_read'. (For portable
>>>>>> >> pipelines these experiments were 'beam_fn_api' and
>>>>>> >> 'beam_fn_api_use_deprecated_read' respectively and I
>>>>>> >> have added these two additional aliases to make the
>>>>>> >> experience less confusing).
>>>>>> >>
>>>>>> >> 1: https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275
>>>>>> >> 2: https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449
>>>>>> >> 3: https://github.com/apache/beam/pull/12519
>>>>>> >
>>>>>> > --
>>>>>> > Pulasthi S. Wickramasinghe
>>>>>> > PhD Candidate | Research Assistant
>>>>>> > School of Informatics and Computing | Digital Science Center
>>>>>> > Indiana University, Bloomington
>>>>>> > cell: 224-386-9035
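
A footnote to the watermark discussion quoted above. Kenn describes Spark's watermark as tracking the latest event time seen, with data older than `max(seen timestamps) - allowedLateness` dropped, while Beam's watermark approximates the earliest timestamp still expected and can be kept back by holds. The following is only a minimal sketch of that contrast in plain Python. It is not Spark's or Beam's implementation: the names `HighWaterMark`, `accept` and `low_watermark` are made up for illustration, and timestamps are plain integers.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class HighWaterMark:
    """Hypothetical model of the rule described in the thread: remember the
    latest timestamp seen and drop anything older than
    max(seen timestamps) - allowed_lateness."""

    allowed_lateness: int
    max_seen: Optional[int] = None

    def accept(self, timestamp: int) -> bool:
        # Drop the element if it falls behind the high-water-mark threshold.
        if self.max_seen is not None and timestamp < self.max_seen - self.allowed_lateness:
            return False
        # Otherwise keep it and advance the high water mark if needed.
        self.max_seen = timestamp if self.max_seen is None else max(self.max_seen, timestamp)
        return True


def low_watermark(source_estimates: Iterable[int], holds: Iterable[int] = ()) -> int:
    """Hypothetical model of a Beam-style lower bound: the earliest timestamp
    that any source estimate or watermark hold says may still appear."""
    return min([*source_estimates, *holds])


tracker = HighWaterMark(allowed_lateness=20)
decisions = [tracker.accept(t) for t in (10, 50, 12)]
print(decisions)                            # [True, True, False]
print(low_watermark([10, 50]))              # 10
print(low_watermark([10, 50], holds=[5]))   # 5
```

With an allowed lateness of 20, the element at timestamp 12 is dropped under the first rule once 50 has been seen. A lower-bound watermark over the same inputs stays at 10, or at 5 while a hold is in place, so the same element would not yet count as late. That seems consistent with Luke's observation that triggers fired too early when holds were not wired in.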