Re: [Go SDK] Direct Runner Replacement: Prism
Here are the next two chunks!

https://github.com/apache/beam/pull/25476 - Coder / element / bytes handling internally for prism.
https://github.com/apache/beam/pull/25478 - Worker fnAPI handling.

It took a bit to get a baseline of unit testing in for these, since they were covered by whole pipeline runs. Coders in particular, since they currently live in the package with the pipeline tests, so it was harder to ensure coverage in a vacuum.

But they did force a bit of documentation improvements, and exposed a neglected inefficiency I had in the original coder structure. So small pain now, but it will make sure future development is a bit easier, as convenient as "just write a pipeline" is for testing. Sometimes you just want to ensure the protocol works.

On Thu, Feb 9, 2023 at 2:50 PM Kenneth Knowles wrote:

> Just a +100 to the idea of this runner. Having an easy-to-read, portable-execution, batch & streaming, parallel, local runner, that exercises plenty of advanced model features... solid gold!
>
> On Thu, Feb 9, 2023 at 12:01 PM Robert Burke via dev wrote:
>
>> Here are the first of the smaller PRs:
>>
>> https://github.com/apache/beam/pull/25404 -> Adds READMEs and updates go.mod so later changes don't collide there.
>> https://github.com/apache/beam/pull/25405 -> Adds internal/urns package for extracting URNs from the protos.
>> https://github.com/apache/beam/pull/25406 -> Adds internal/config package for parsing and accessing the configuration of variants and handlers in the runner.
>>
>> These are independent changes, and small enough for quicker review. The remaining larger packages can be submitted more piecemeal once these are in.
>>
>> On Wed, Feb 8, 2023 at 3:23 PM Robert Burke wrote:
>>
>>> Hello Beam!
>>>
>>> == tl;dr; ==
>>>
>>> I wrote a local, portable Beam runner in Go to replace the Go direct runner. I'd like to contribute it to the Beam Repo. The Big PR with everything is here: https://github.com/apache/beam/pull/25391
>>>
>>> I'll be sending smaller PRs out for review to get it into the repo. Take a look at the big one, don't mind the mess, but do ask questions, or offer constructive suggestions to make it clearer. There are ample TODOs that could be added. This thread will be kept up to date with the progress.
>>>
>>> Highlights:
>>> Avoids false positive issues the Go Direct runner has, especially around serialization issues.
>>> Single transform at a time execution.
>>> Watermark propagation through Graph for GBKs and Side Input windowing.
>>> Will be capable of testing the whole Go SDK, in time.
>>> Will be capable of being a stand alone single binary runner, in time.
>>> ++Many opportunities for contribution after getting into the repo!++
>>>
>>> Lowlights:
>>> Only for Go SDK, for now.
>>> ~~Many unimplemented features~~
>>>
>>> Where to start reading?
>>>
>>> Vision README:
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/README.md
>>>
>>> Code Structure README:
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/README.md
>>>
>>> executePipeline entrypoint:
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/execute.go#L41
>>>
>>> == The long version ==
>>>
>>> Since last year, I was puttering away at making a Portable Beam Runner authored in Go. Partly because I wanted to learn the "runner" half of Beam, and partly because the Go Direct Runner (and most other direct runners) are not good at testing.
>>>
>>> I managed to get it roughly ready for basic batch execution by end of February 2022, and then 2022 got away from me. And I couldn't pick it up until the end of the year.
>>>
>>> I gave a talk about this at Beam Summit 2022 https://2022.beamsummit.org/sessions/portable-go-beam-runner/ that covers my motivation for it. Loosely, Beam has a Testing Problem. There are large parts of Beam execution that matter for real world performance and correctness, but the facilities to test these don't exist. For example, take Combiner Lifting: if a combiner is unlifted, but implements AddInput... then Merge is never called, leaving it untested. And the user has no control over this, or may not even be aware of it. How a DoFn is executed matters for coverage, and user confidence. In particular for Streaming jobs, users will tend to try things out on their Prod runner, but that doesn't help if one is testing on local Flink, but executing on Google Cloud Dataflow, which behave very differently.
>>>
>>> Regardless of whether you agree with that thesis... I wanted to fill that gap. I wanted a runner that could be configured to test those situations, and in particular, make it easier to develop SDKs and all the features of Beam that don't get their own blog p
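The combiner lifting point in the message above can be made concrete with a small simulation. The sketch below is plain Go and does not import Beam; `meanFn` only mirrors the method names of a Go SDK CombineFn (CreateAccumulator, AddInput, MergeAccumulators, ExtractOutput), and `runUnlifted` / `runLifted` are hypothetical stand-ins for how a runner might choose to execute it. It is meant to show why the merge step goes unexercised when a runner never lifts the combiner, not how any real runner is implemented.

```go
package main

import "fmt"

// meanAccum is the accumulator for a hypothetical mean combiner.
type meanAccum struct {
	Sum   float64
	Count int
}

// meanFn mirrors the shape of a CombineFn and counts merge calls so
// the two execution strategies can be compared.
type meanFn struct {
	mergeCalls int
}

func (f *meanFn) CreateAccumulator() meanAccum { return meanAccum{} }

func (f *meanFn) AddInput(a meanAccum, v float64) meanAccum {
	a.Sum += v
	a.Count++
	return a
}

func (f *meanFn) MergeAccumulators(a, b meanAccum) meanAccum {
	f.mergeCalls++
	return meanAccum{Sum: a.Sum + b.Sum, Count: a.Count + b.Count}
}

func (f *meanFn) ExtractOutput(a meanAccum) float64 {
	if a.Count == 0 {
		return 0
	}
	return a.Sum / float64(a.Count)
}

// runUnlifted simulates an unlifted combine: every value for the key is
// added to a single accumulator, so MergeAccumulators is never called.
func runUnlifted(f *meanFn, values []float64) float64 {
	acc := f.CreateAccumulator()
	for _, v := range values {
		acc = f.AddInput(acc, v)
	}
	return f.ExtractOutput(acc)
}

// runLifted simulates a lifted combine: each bundle is pre-combined into
// its own accumulator, and the partial accumulators are then merged.
func runLifted(f *meanFn, bundles [][]float64) float64 {
	merged := f.CreateAccumulator()
	for _, bundle := range bundles {
		acc := f.CreateAccumulator()
		for _, v := range bundle {
			acc = f.AddInput(acc, v)
		}
		merged = f.MergeAccumulators(merged, acc)
	}
	return f.ExtractOutput(merged)
}

func main() {
	values := []float64{1, 2, 3, 4}

	unlifted := &meanFn{}
	fmt.Println(runUnlifted(unlifted, values), unlifted.mergeCalls) // 2.5 0

	lifted := &meanFn{}
	fmt.Println(runLifted(lifted, [][]float64{values[:2], values[2:]}), lifted.mergeCalls) // 2.5 2
}
```

Both strategies produce the same mean, but only the lifted one ever runs the merge code. A bug in MergeAccumulators would therefore stay hidden on a runner that always takes the unlifted path.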
Re: Launch Dataflow Flex Templates from Go
Hi Shivam,

Thanks a lot for your response. I did check the HTTP request, but I wanted to see if I can use the Google API client library. The docs show a Python example for it, shown below. I wanted to know if there is something similar for Go.

    from googleapiclient.discovery import build

    # project = 'your-gcp-project'
    # job = 'unique-job-name'
    # template = 'gs://dataflow-templates/latest/Word_Count'
    # parameters = {
    #     'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
    #     'output': 'gs:///wordcount/outputs',
    # }

    dataflow = build('dataflow', 'v1b3')
    request = dataflow.projects().templates().launch(
        projectId=project,
        gcsPath=template,
        body={
            'jobName': job,
            'parameters': parameters,
        }
    )

    response = request.execute()

Regards,
Ashok

On Wed, Feb 15, 2023 at 4:22 PM Shivam Singhal wrote:

> There shouldn’t be much change in the API request irrespective of the SDK language
>
> On Wed, 15 Feb 2023 at 10:50, Shivam Singhal wrote:
>
>> Hey Ashok,
>>
>> If you already have a flex template file and the docker image built, you can use the Dataflow API to run the template.
>>
>> https://cloud.google.com/dataflow/docs/reference/rest
>>
>> On Wed, 15 Feb 2023 at 04:49, Ashok KS wrote:
>>
>>> Hello Beam Community,
>>>
>>> I have written a Dataflow pipeline using Python SDK and I would be creating a Flex template with it.
>>>
>>> My task is to launch this Flex Template from Cloud Functions which would be in Go. I found the package below but couldn't find any sample.
>>>
>>> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.templates/launch
>>>
>>> I could find examples in Python to launch templates.
>>> Can someone please share an example in Go to launch a Dataflow Flex template?
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>> Ashok
Re: Launch Dataflow Flex Templates from Go
There shouldn’t be much change in the API request irrespective of the SDK language

On Wed, 15 Feb 2023 at 10:50, Shivam Singhal wrote:

> Hey Ashok,
>
> If you already have a flex template file and the docker image built, you can use the Dataflow API to run the template.
>
> https://cloud.google.com/dataflow/docs/reference/rest
>
> On Wed, 15 Feb 2023 at 04:49, Ashok KS wrote:
>
>> Hello Beam Community,
>>
>> I have written a Dataflow pipeline using Python SDK and I would be creating a Flex template with it.
>>
>> My task is to launch this Flex Template from Cloud Functions which would be in Go. I found the package below but couldn't find any sample.
>>
>> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.templates/launch
>>
>> I could find examples in Python to launch templates.
>> Can someone please share an example in Go to launch a Dataflow Flex template?
>>
>> Thank you in advance.
>>
>> Regards,
>> Ashok
Re: Launch Dataflow Flex Templates from Go
Hey Ashok,

If you already have a flex template file and the docker image built, you can use the Dataflow API to run the template.

https://cloud.google.com/dataflow/docs/reference/rest

On Wed, 15 Feb 2023 at 04:49, Ashok KS wrote:

> Hello Beam Community,
>
> I have written a Dataflow pipeline using Python SDK and I would be creating a Flex template with it.
>
> My task is to launch this Flex Template from Cloud Functions which would be in Go. I found the package below but couldn't find any sample.
>
> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.templates/launch
>
> I could find examples in Python to launch templates.
> Can someone please share an example in Go to launch a Dataflow Flex template?
>
> Thank you in advance.
>
> Regards,
> Ashok
Launch Dataflow Flex Templates from Go
Hello Beam Community,

I have written a Dataflow pipeline using Python SDK and I would be creating a Flex template with it.

My task is to launch this Flex Template from Cloud Functions which would be in Go. I found the package below but couldn't find any sample.

https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.templates/launch

I could find examples in Python to launch templates.
Can someone please share an example in Go to launch a Dataflow Flex template?

Thank you in advance.

Regards,
Ashok
Re: OpenJDK8 / OpenJDK11 container deprecation
SGTM. I asked on the PR if this could impact users, but having read the Docker release calendar I am not concerned. The last update to the old version was in 2019, and the introduction of compatible versions was 2020.

On Tue, Feb 14, 2023 at 3:01 PM Byron Ellis via user wrote:

> FWIW I am Team Upgrade Docker :-)
>
> On Tue, Feb 14, 2023 at 2:53 PM Luke Cwik via user wrote:
>
>> I made some progress in testing the container and did hit an issue where Ubuntu 22.04 "Jammy" is dependent on the version of Docker installed. It turns out that our boot.go crashes with "runtime/cgo: pthread_create failed: Operation not permitted" because Ubuntu 22.04 uses new syscalls that Docker 18.09.4 doesn't have a seccomp policy for (and it uses a default of deny). We have a couple of choices here:
>> 1) upgrade the version of Docker on Jenkins and require users to similarly use a new enough version of Docker so that this isn't an issue for them
>> 2) use Ubuntu 20.04 "Focal" as the docker container
>>
>> I was using Docker 20.10.21, which is why I didn't hit this issue when testing the change locally.
>>
>> We could also do these, but they seem strictly worse than either of the two options discussed above:
>> A) disable the seccomp policy on Jenkins
>> B) use a custom seccomp policy on Jenkins
>>
>> My suggestion is to upgrade Docker versions on Jenkins and use Ubuntu 22.04, as it will have LTS releases till 2027 and then security patches till 2032, which gives everyone the longest runway till we need to swap OS versions again for users of Apache Beam. Any concerns or ideas?
>>
>> On Thu, Feb 9, 2023 at 10:20 AM Luke Cwik wrote:
>>
>>> Our current Java 8 container is 262 MiB and layers on top of openjdk:8-bullseye, which is 226 MiB compressed, while eclipse-temurin:8 is 92 MiB compressed and eclipse-temurin:8-alpine is 65 MiB compressed.
>>>
>>> I would rather not get into issues with C library differences caused by the alpine project, so I would stick with the safer option and let users choose alpine when building their custom container if they feel it provides a large win for them. We can always swap to alpine in the future as well if the C library differences become a non-issue.
>>>
>>> So swapping to eclipse-temurin will save us a bunch on the container size, which should help with container transfer and hopefully for startup times as well.
>>>
>>> On Tue, Feb 7, 2023 at 5:41 PM Andrew Pilloud wrote:
>>>
>>>> This sounds reasonable to me as well.
>>>>
>>>> I've made swaps like this in the past; the base image of each is probably a bigger factor than the JDK. The openjdk images were based on Debian 11. The default eclipse-temurin images are based on Ubuntu 22.04 with an alpine option. Ubuntu is a Debian derivative but the versions and package names aren't exact matches and Ubuntu tends to update a little faster. For most users I don't think this will matter but users building custom containers may need to make minor changes. The alpine option will be much smaller (which could be a significant improvement) but would be a more significant change to the environment.
>>>>
>>>> On Tue, Feb 7, 2023 at 5:18 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>>>>
>>>>> Seems reasonable to me.
>>>>>
>>>>> On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user <u...@beam.apache.org> wrote:
>>>>>
>>>>>> As per [1], the JDK8 and JDK11 containers that Apache Beam uses have stopped being built and supported since July 2022. I have filed [2] to track the resolution of this issue.
>>>>>>
>>>>>> Based upon [1], almost everyone is swapping to the eclipse-temurin container[3] as their base based upon the linked issues from the deprecation notice[1]. The eclipse-temurin container is released under these licenses:
>>>>>> Apache License, Version 2.0
>>>>>> Eclipse Distribution License 1.0 (BSD)
>>>>>> Eclipse Public License 2.0
>>>>>> - (Secondary) GNU General Public License, version 2 with OpenJDK Assembly Exception
>>>>>> - (Secondary) GNU General Public License, version 2 with the GNU Classpath Exception
>>>>>>
>>>>>> I propose that we swap all our containers to the eclipse-temurin containers[3].
>>>>>>
>>>>>> Open to other ideas and also would be great to hear about your experience in any other projects that you have had to make a similar decision.
>>>>>>
>>>>>> 1: https://github.com/docker-library/openjdk/issues/505
>>>>>> 2: https://github.com/apache/beam/issues/25371
>>>>>> 3: https://hub.docker.com/_/eclipse-temurin
Re: OpenJDK8 / OpenJDK11 container deprecation
FWIW I am Team Upgrade Docker :-)

On Tue, Feb 14, 2023 at 2:53 PM Luke Cwik via user wrote:

> I made some progress in testing the container and did hit an issue where Ubuntu 22.04 "Jammy" is dependent on the version of Docker installed. It turns out that our boot.go crashes with "runtime/cgo: pthread_create failed: Operation not permitted" because Ubuntu 22.04 uses new syscalls that Docker 18.09.4 doesn't have a seccomp policy for (and it uses a default of deny). We have a couple of choices here:
> 1) upgrade the version of Docker on Jenkins and require users to similarly use a new enough version of Docker so that this isn't an issue for them
> 2) use Ubuntu 20.04 "Focal" as the docker container
>
> I was using Docker 20.10.21, which is why I didn't hit this issue when testing the change locally.
>
> We could also do these, but they seem strictly worse than either of the two options discussed above:
> A) disable the seccomp policy on Jenkins
> B) use a custom seccomp policy on Jenkins
>
> My suggestion is to upgrade Docker versions on Jenkins and use Ubuntu 22.04, as it will have LTS releases till 2027 and then security patches till 2032, which gives everyone the longest runway till we need to swap OS versions again for users of Apache Beam. Any concerns or ideas?
>
> On Thu, Feb 9, 2023 at 10:20 AM Luke Cwik wrote:
>
>> Our current Java 8 container is 262 MiB and layers on top of openjdk:8-bullseye, which is 226 MiB compressed, while eclipse-temurin:8 is 92 MiB compressed and eclipse-temurin:8-alpine is 65 MiB compressed.
>>
>> I would rather not get into issues with C library differences caused by the alpine project, so I would stick with the safer option and let users choose alpine when building their custom container if they feel it provides a large win for them. We can always swap to alpine in the future as well if the C library differences become a non-issue.
>>
>> So swapping to eclipse-temurin will save us a bunch on the container size, which should help with container transfer and hopefully for startup times as well.
>>
>> On Tue, Feb 7, 2023 at 5:41 PM Andrew Pilloud wrote:
>>
>>> This sounds reasonable to me as well.
>>>
>>> I've made swaps like this in the past; the base image of each is probably a bigger factor than the JDK. The openjdk images were based on Debian 11. The default eclipse-temurin images are based on Ubuntu 22.04 with an alpine option. Ubuntu is a Debian derivative but the versions and package names aren't exact matches and Ubuntu tends to update a little faster. For most users I don't think this will matter but users building custom containers may need to make minor changes. The alpine option will be much smaller (which could be a significant improvement) but would be a more significant change to the environment.
>>>
>>> On Tue, Feb 7, 2023 at 5:18 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>>>
>>>> Seems reasonable to me.
>>>>
>>>> On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user wrote:
>>>>
>>>>> As per [1], the JDK8 and JDK11 containers that Apache Beam uses have stopped being built and supported since July 2022. I have filed [2] to track the resolution of this issue.
>>>>>
>>>>> Based upon [1], almost everyone is swapping to the eclipse-temurin container[3] as their base based upon the linked issues from the deprecation notice[1]. The eclipse-temurin container is released under these licenses:
>>>>> Apache License, Version 2.0
>>>>> Eclipse Distribution License 1.0 (BSD)
>>>>> Eclipse Public License 2.0
>>>>> - (Secondary) GNU General Public License, version 2 with OpenJDK Assembly Exception
>>>>> - (Secondary) GNU General Public License, version 2 with the GNU Classpath Exception
>>>>>
>>>>> I propose that we swap all our containers to the eclipse-temurin containers[3].
>>>>>
>>>>> Open to other ideas and also would be great to hear about your experience in any other projects that you have had to make a similar decision.
>>>>>
>>>>> 1: https://github.com/docker-library/openjdk/issues/505
>>>>> 2: https://github.com/apache/beam/issues/25371
>>>>> 3: https://hub.docker.com/_/eclipse-temurin
Re: OpenJDK8 / OpenJDK11 container deprecation
I made some progress in testing the container and did hit an issue where Ubuntu 22.04 "Jammy" is dependent on the version of Docker installed. It turns out that our boot.go crashes with "runtime/cgo: pthread_create failed: Operation not permitted" because Ubuntu 22.04 uses new syscalls that Docker 18.09.4 doesn't have a seccomp policy for (and it uses a default of deny). We have a couple of choices here:
1) upgrade the version of Docker on Jenkins and require users to similarly use a new enough version of Docker so that this isn't an issue for them
2) use Ubuntu 20.04 "Focal" as the docker container

I was using Docker 20.10.21, which is why I didn't hit this issue when testing the change locally.

We could also do these, but they seem strictly worse than either of the two options discussed above:
A) disable the seccomp policy on Jenkins
B) use a custom seccomp policy on Jenkins

My suggestion is to upgrade Docker versions on Jenkins and use Ubuntu 22.04, as it will have LTS releases till 2027 and then security patches till 2032, which gives everyone the longest runway till we need to swap OS versions again for users of Apache Beam. Any concerns or ideas?

On Thu, Feb 9, 2023 at 10:20 AM Luke Cwik wrote:

> Our current Java 8 container is 262 MiB and layers on top of openjdk:8-bullseye, which is 226 MiB compressed, while eclipse-temurin:8 is 92 MiB compressed and eclipse-temurin:8-alpine is 65 MiB compressed.
>
> I would rather not get into issues with C library differences caused by the alpine project, so I would stick with the safer option and let users choose alpine when building their custom container if they feel it provides a large win for them. We can always swap to alpine in the future as well if the C library differences become a non-issue.
>
> So swapping to eclipse-temurin will save us a bunch on the container size, which should help with container transfer and hopefully for startup times as well.
>
> On Tue, Feb 7, 2023 at 5:41 PM Andrew Pilloud wrote:
>
>> This sounds reasonable to me as well.
>>
>> I've made swaps like this in the past; the base image of each is probably a bigger factor than the JDK. The openjdk images were based on Debian 11. The default eclipse-temurin images are based on Ubuntu 22.04 with an alpine option. Ubuntu is a Debian derivative but the versions and package names aren't exact matches and Ubuntu tends to update a little faster. For most users I don't think this will matter but users building custom containers may need to make minor changes. The alpine option will be much smaller (which could be a significant improvement) but would be a more significant change to the environment.
>>
>> On Tue, Feb 7, 2023 at 5:18 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>>
>>> Seems reasonable to me.
>>>
>>> On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user wrote:
>>>
>>>> As per [1], the JDK8 and JDK11 containers that Apache Beam uses have stopped being built and supported since July 2022. I have filed [2] to track the resolution of this issue.
>>>>
>>>> Based upon [1], almost everyone is swapping to the eclipse-temurin container[3] as their base based upon the linked issues from the deprecation notice[1]. The eclipse-temurin container is released under these licenses:
>>>> Apache License, Version 2.0
>>>> Eclipse Distribution License 1.0 (BSD)
>>>> Eclipse Public License 2.0
>>>> - (Secondary) GNU General Public License, version 2 with OpenJDK Assembly Exception
>>>> - (Secondary) GNU General Public License, version 2 with the GNU Classpath Exception
>>>>
>>>> I propose that we swap all our containers to the eclipse-temurin containers[3].
>>>>
>>>> Open to other ideas and also would be great to hear about your experience in any other projects that you have had to make a similar decision.
>>>>
>>>> 1: https://github.com/docker-library/openjdk/issues/505
>>>> 2: https://github.com/apache/beam/issues/25371
>>>> 3: https://hub.docker.com/_/eclipse-temurin
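Option 1 in the message above asks users to run "a new enough version of Docker". One way a tool could guard against that is a simple version comparison, sketched below in standard-library Go. The helper names (parseVersion, versionAtLeast) are hypothetical, and using 20.10.21 as the threshold is only an assumption: it is the version reported to work in this thread, while 18.09.4 is the one reported to fail. The thread does not state the true minimum.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns a dotted version such as "20.10.21" into its numeric
// components. A suffix such as "-ce" or "+build" is ignored.
func parseVersion(v string) ([]int, error) {
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	parts := strings.Split(v, ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("invalid version component %q in %q: %w", p, v, err)
		}
		nums[i] = n
	}
	return nums, nil
}

// versionAtLeast reports whether have >= minimum, comparing component by
// component and treating missing components as zero.
func versionAtLeast(have, minimum string) (bool, error) {
	h, err := parseVersion(have)
	if err != nil {
		return false, err
	}
	m, err := parseVersion(minimum)
	if err != nil {
		return false, err
	}
	for i := 0; i < len(h) || i < len(m); i++ {
		var a, b int
		if i < len(h) {
			a = h[i]
		}
		if i < len(m) {
			b = m[i]
		}
		if a != b {
			return a > b, nil
		}
	}
	return true, nil
}

func main() {
	// Assumption: 20.10.21 is used as the threshold only because it is the
	// known-good version mentioned in the thread.
	const knownGood = "20.10.21"
	for _, v := range []string{"18.09.4", "20.10.21"} {
		ok, err := versionAtLeast(v, knownGood)
		if err != nil {
			panic(err)
		}
		fmt.Println(v, ok)
	}
}
```

In practice the installed version string would come from the Docker daemon itself, and a version check is only a proxy: what actually matters is whether the daemon's default seccomp profile permits the syscalls the container image needs.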
Re: Beam Release 2.46
Hello Danny,

Do you mind if I shadow you while you do this?

Best,
Damon

On Thu, Feb 9, 2023 at 3:17 PM Kenneth Knowles wrote:

> Excellent! Keep that release train rolling.
>
> On Thu, Feb 9, 2023 at 9:28 AM Ahmet Altay via dev wrote:
>
>> Thank you Danny!
>>
>> On Wed, Feb 8, 2023 at 6:46 AM Danny McCormick via dev <dev@beam.apache.org> wrote:
>>
>>> Hey everyone, I would like to volunteer myself to do the 2.46.0 release.
>>>
>>> I will cut the branch Feb 22 [1], and cherrypick any blocking fixes afterwards. Please review the current release blockers [2] and remove the 2.46 milestone if they don't meet the criteria at [3].
>>>
>>> Thanks,
>>> Danny
>>>
>>> [1] https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>>> [2] https://github.com/apache/beam/milestone/9
>>> [3] https://beam.apache.org/contribute/release-blocking/