Re: [Go SDK] Direct Runner Replacement: Prism

2023-02-14 Thread Robert Burke via dev
Here are the next two chunks!

https://github.com/apache/beam/pull/25476 - Coder / element / bytes
handling internally for prism.
https://github.com/apache/beam/pull/25478 - Worker fnAPI handling.

Took a bit to get a baseline of unit testing in for these, since they were
covered by whole pipeline runs.
Coders in particular, since they currently live in the package with the
pipeline tests, so it was harder to ensure
coverage in a vacuum.

But they did force some documentation improvements, and a fix for a neglected
inefficiency I had in the original coder structure.

So small pain now, but will make sure future development is a bit easier,
as convenient as "just write a pipeline" is for testing.
Sometimes you just want to ensure the protocol works.

On Thu, Feb 9, 2023 at 2:50 PM Kenneth Knowles  wrote:

> Just a +100 to the idea of this runner. Having an easy-to-read,
> portable-execution, batch & streaming, parallel, local runner, that
> exercises plenty of advanced model features... solid gold!
>
> On Thu, Feb 9, 2023 at 12:01 PM Robert Burke via dev 
> wrote:
>
>> Here are the first of the smaller PRs:
>>
>> https://github.com/apache/beam/pull/25404 -> Adds READMEs and updates
>> go.mod so later changes don't collide there.
>> https://github.com/apache/beam/pull/25405 -> Adds internal/urns package
>> for extracting URNs from the protos.
>> https://github.com/apache/beam/pull/25406 -> Adds internal/config
>> package for parsing and accessing the configuration of variants and
>> handlers in the runner.
>>
>> These are independent changes, and small enough for quicker review. The
>> remaining larger packages can be submitted more piecemeal once these are in.
>>
>>
>>
>> On Wed, Feb 8, 2023 at 3:23 PM Robert Burke  wrote:
>>
>>> Hello Beam!
>>>
>>> == tl;dr; ==
>>>
>>> I wrote a local, portable Beam runner in Go to replace the Go direct
>>> runner.  I'd like to contribute it to the Beam Repo. The Big PR with
>>> everything is here: https://github.com/apache/beam/pull/25391
>>>
>>> I'll be sending smaller PRs out for review to get it into the repo. Take
>>> a look at the big one, don't mind the mess, but do ask questions, or offer
>>> constructive suggestions to make it clearer. There are ample TODOs that
>>> could be added. This thread will be kept up to date with the progress.
>>>
>>> Highlights:
>>> Avoids false positive issues the Go Direct runner has, especially around
>>> serialization issues.
>>> Single transform at a time execution.
>>> Watermark propagation through Graph for GBKs and Side Input windowing.
>>> Will be capable of testing the whole Go SDK, in time.
>>> Will be capable of being a stand alone single binary runner, in time.
>>> ++Many opportunities for contribution after getting into the repo!++
>>>
>>> Lowlights:
>>> Only for Go SDK, for now.
>>> ~~Many unimplemented features~~
>>>
>>> Where to start reading?
>>>
>>> Vision README:
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/README.md
>>>
>>>
>>> Code Structure README:
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/README.md
>>>
>>>
>>> executePipeline entrypoint:
>>> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/execute.go#L41
>>>
>>>
>>>
>>> == The long version ==
>>>
>>> Since last year, I was puttering away at making a Portable Beam Runner
>>> authored in Go. Partly because I wanted to learn the "runner" half of Beam,
>>> and partly because the Go Direct Runner (and most other direct runners)
>>> are not good at testing.
>>>
>>> I managed to get it roughly ready for basic batch execution by end of
>>> February 2022, and then 2022 got away from me. And I couldn't pick it up
>>> until the end of the year.
>>>
>>> I gave a talk about this at Beam Summit 2022
>>> https://2022.beamsummit.org/sessions/portable-go-beam-runner/ that
>>> covers my motivation for it. Loosely, Beam has a Testing Problem. There are
>>> large parts of Beam execution that matter for real world performance and
>>> correctness, but the facilities to test these don't exist.  For example,
>>> take Combiner Lifting: if a combiner is unlifted, but implements
>>> AddInput... then Merge is never called, leaving it untested. And the user
>>> has no control over this, or may not even be aware of it. How a DoFn is
>>> executed matters for coverage, and user confidence.  In particular for
>>> Streaming jobs, users will tend to try things out on their Prod runner, but
>>> that doesn't help if one is testing on local Flink, but executing on Google
>>> Cloud Dataflow, which behave very differently.
>>>
>>> Regardless of whether you agree with that thesis...  I wanted to fill
>>> that gap. I wanted a runner that could be configured to test those
>>> situations, and in particular, make it easier to develop SDKs and all the
>>> features of Beam that don't get their own blog p

Re: Launch Dataflow Flex Templates from Go

2023-02-14 Thread Ashok KS
Hi Shivam,

Thanks a lot for your response. I did check the http request. But I wanted
to see if I can use the Google API client Library.
The docs show a Python example for it shown below. I wanted to know if
there is something similar with Go.

from googleapiclient.discovery import build

# project = 'your-gcp-project'
# job = 'unique-job-name'
# template = 'gs://dataflow-templates/latest/Word_Count'
# parameters = {
# 'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
# 'output': 'gs:///wordcount/outputs',
# }

dataflow = build('dataflow', 'v1b3')
request = dataflow.projects().templates().launch(
    projectId=project,
    gcsPath=template,
    body={
        'jobName': job,
        'parameters': parameters,
    }
)

response = request.execute()



Regards,

Ashok


On Wed, Feb 15, 2023 at 4:22 PM Shivam Singhal 
wrote:

> There shouldn’t be much change in the API request irrespective of the SDK
> language
>
> On Wed, 15 Feb 2023 at 10:50, Shivam Singhal 
> wrote:
>
>> Hey Ashok,
>>
>> If you already have a flex template file and the docker image built, you
>> can use the Dataflow API to run the template.
>>
>> https://cloud.google.com/dataflow/docs/reference/rest
>>
>>
>> On Wed, 15 Feb 2023 at 04:49, Ashok KS  wrote:
>>
>>> Hello Beam Community,
>>>
>>> I have written a Dataflow pipeline using Python SDK and I would be
>>> creating a Flex template with it.
>>>
>>> My task is to launch this Flex Template from Cloud Functions which would
>>> be in Go. I found the package below but couldn't find any sample.
>>>
>>>
>>> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.templates/launch
>>>
>>> I could find examples in Python to launch templates.
>>> Can someone please share an example in Go to launch a Dataflow Flex
>>> template?
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>> Ashok
>>>
>>


Re: Launch Dataflow Flex Templates from Go

2023-02-14 Thread Shivam Singhal
There shouldn’t be much change in the API request irrespective of the SDK
language

On Wed, 15 Feb 2023 at 10:50, Shivam Singhal 
wrote:

> Hey Ashok,
>
> If you already have a flex template file and the docker image built, you
> can use the Dataflow API to run the template.
>
> https://cloud.google.com/dataflow/docs/reference/rest
>
>
> On Wed, 15 Feb 2023 at 04:49, Ashok KS  wrote:
>
>> Hello Beam Community,
>>
>> I have written a Dataflow pipeline using Python SDK and I would be
>> creating a Flex template with it.
>>
>> My task is to launch this Flex Template from Cloud Functions which would
>> be in Go. I found the package below but couldn't find any sample.
>>
>>
>> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.templates/launch
>>
>> I could find examples in Python to launch templates.
>> Can someone please share an example in Go to launch a Dataflow Flex
>> template?
>>
>> Thank you in advance.
>>
>> Regards,
>> Ashok
>>
>


Re: Launch Dataflow Flex Templates from Go

2023-02-14 Thread Shivam Singhal
Hey Ashok,

If you already have a flex template file and the docker image built, you
can use the Dataflow API to run the template.

https://cloud.google.com/dataflow/docs/reference/rest


On Wed, 15 Feb 2023 at 04:49, Ashok KS  wrote:

> Hello Beam Community,
>
> I have written a Dataflow pipeline using Python SDK and I would be
> creating a Flex template with it.
>
> My task is to launch this Flex Template from Cloud Functions which would
> be in Go. I found the package below but couldn't find any sample.
>
>
> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.templates/launch
>
> I could find examples in Python to launch templates.
> Can someone please share an example in Go to launch a Dataflow Flex
> template?
>
> Thank you in advance.
>
> Regards,
> Ashok
>


Launch Dataflow Flex Templates from Go

2023-02-14 Thread Ashok KS
Hello Beam Community,

I have written a Dataflow pipeline using Python SDK and I would be creating
a Flex template with it.

My task is to launch this Flex Template from Cloud Functions which would be
in Go. I found the package below but couldn't find any sample.

https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.templates/launch

I could find examples in Python to launch templates.
Can someone please share an example in Go to launch a Dataflow Flex
template?

Thank you in advance.

Regards,
Ashok


Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-14 Thread Kenneth Knowles
SGTM. I asked on the PR if this could impact users, but having read the
docker release calendar I am not concerned. The last update to the old
version was in 2019, and the introduction of compatible versions was 2020.

On Tue, Feb 14, 2023 at 3:01 PM Byron Ellis via user 
wrote:

> FWIW I am Team Upgrade Docker :-)
>
> On Tue, Feb 14, 2023 at 2:53 PM Luke Cwik via user 
> wrote:
>
>> I made some progress in testing the container and did hit an issue where
>> Ubuntu 22.04 "Jammy" is dependent on the version of Docker installed. It
>> turns out that our boot.go crashes with "runtime/cgo: pthread_create
>> failed: Operation not permitted" because the Ubuntu 22.04 is using new
>> syscalls that Docker 18.09.4 doesn't have a seccomp policy for (and uses a
>> default of deny). We have a couple of choices here:
>> 1) upgrade the version of docker on Jenkins and require users to
>> similarly use a new enough version of Docker so that this isn't an issue
>> for them
>> 2) use Ubuntu 20.04 "Focal" as the docker container
>>
>> I was using Docker 20.10.21 which is why I didn't hit this issue when
>> testing the change locally.
>>
>> We could also do these but they seem strictly worse than either of the
>> two options discussed above:
>> A) disable the seccomp policy on Jenkins
>> B) use a custom seccomp policy on Jenkins
>>
>> My suggestion is to upgrade Docker versions on Jenkins and use Ubuntu
>> 22.04 as it will have LTS releases till 2027 and then security patches till
>> 2032 which gives everyone the longest runway till we need to swap OS
>> versions again for users of Apache Beam. Any concerns or ideas?
>>
>>
>>
>> On Thu, Feb 9, 2023 at 10:20 AM Luke Cwik  wrote:
>>
>>> Our current Java 8 container is 262 MiBs and layers on top of
>>> openjdk:8-bullseye which is 226 MiBs compressed while eclipse-temurin:8 is
>>> 92 MiBs compressed and eclipse-temurin:8-alpine is 65 MiBs compressed.
>>>
>>> I would rather not get into issues with C library differences caused by
>>> the alpine project so I would stick with the safer option and let users
>>> choose alpine when building their custom container if they feel it provides
>>> a large win for them. We can always swap to alpine in the future as well if
>>> the C library differences become a non-issue.
>>>
>>> So swapping to eclipse-temurin will save us a bunch on the container
>>> size which should help with container transfer and hopefully for startup
>>> times as well.
>>>
>>> On Tue, Feb 7, 2023 at 5:41 PM Andrew Pilloud 
>>> wrote:
>>>
>>>> This sounds reasonable to me as well.
>>>>
>>>> I've made swaps like this in the past, the base image of each is
>>>> probably a bigger factor than the JDK. The openjdk images were based on
>>>> Debian 11. The default eclipse-temurin images are based on Ubuntu 22.04
>>>> with an alpine option. Ubuntu is a Debian derivative but the versions and
>>>> package names aren't exact matches and Ubuntu tends to update a little
>>>> faster. For most users I don't think this will matter but users building
>>>> custom containers may need to make minor changes. The alpine option will be
>>>> much smaller (which could be a significant improvement) but would be a more
>>>> significant change to the environment.
>>>>
>>>> On Tue, Feb 7, 2023 at 5:18 PM Robert Bradshaw via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> Seems reasonable to me.
>>>>>
>>>>> On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user <
>>>>> u...@beam.apache.org> wrote:
>>>>> >
>>>>> > As per [1], the JDK8 and JDK11 containers that Apache Beam uses have
>>>>> stopped being built and supported since July 2022. I have filed [2] to
>>>>> track the resolution of this issue.
>>>>> >
>>>>> > Based upon [1], almost everyone is swapping to the eclipse-temurin
>>>>> container[3] as their base based upon the linked issues from the
>>>>> deprecation notice[1]. The eclipse-temurin container is released under
>>>>> these licenses:
>>>>> > Apache License, Version 2.0
>>>>> > Eclipse Distribution License 1.0 (BSD)
>>>>> > Eclipse Public License 2.0
>>>>> > - (Secondary) GNU General Public License, version 2 with OpenJDK
>>>>> Assembly Exception
>>>>> > - (Secondary) GNU General Public License, version 2 with the GNU
>>>>> Classpath Exception
>>>>> >
>>>>> > I propose that we swap all our containers to the eclipse-temurin
>>>>> containers[3].
>>>>> >
>>>>> > Open to other ideas and also would be great to hear about your
>>>>> experience in any other projects that you have had to make a similar
>>>>> decision.
>>>>> >
>>>>> > 1: https://github.com/docker-library/openjdk/issues/505
>>>>> > 2: https://github.com/apache/beam/issues/25371
>>>>> > 3: https://hub.docker.com/_/eclipse-temurin
>>>>>



Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-14 Thread Byron Ellis via dev
FWIW I am Team Upgrade Docker :-)

On Tue, Feb 14, 2023 at 2:53 PM Luke Cwik via user 
wrote:

> I made some progress in testing the container and did hit an issue where
> Ubuntu 22.04 "Jammy" is dependent on the version of Docker installed. It
> turns out that our boot.go crashes with "runtime/cgo: pthread_create
> failed: Operation not permitted" because the Ubuntu 22.04 is using new
> syscalls that Docker 18.09.4 doesn't have a seccomp policy for (and uses a
> default of deny). We have a couple of choices here:
> 1) upgrade the version of docker on Jenkins and require users to similarly
> use a new enough version of Docker so that this isn't an issue for them
> 2) use Ubuntu 20.04 "Focal" as the docker container
>
> I was using Docker 20.10.21 which is why I didn't hit this issue when
> testing the change locally.
>
> We could also do these but they seem strictly worse than either of the two
> options discussed above:
> A) disable the seccomp policy on Jenkins
> B) use a custom seccomp policy on Jenkins
>
> My suggestion is to upgrade Docker versions on Jenkins and use Ubuntu
> 22.04 as it will have LTS releases till 2027 and then security patches till
> 2032 which gives everyone the longest runway till we need to swap OS
> versions again for users of Apache Beam. Any concerns or ideas?
>
>
>
> On Thu, Feb 9, 2023 at 10:20 AM Luke Cwik  wrote:
>
>> Our current Java 8 container is 262 MiBs and layers on top of
>> openjdk:8-bullseye which is 226 MiBs compressed while eclipse-temurin:8 is
>> 92 MiBs compressed and eclipse-temurin:8-alpine is 65 MiBs compressed.
>>
>> I would rather not get into issues with C library differences caused by
>> the alpine project so I would stick with the safer option and let users
>> choose alpine when building their custom container if they feel it provides
>> a large win for them. We can always swap to alpine in the future as well if
>> the C library differences become a non-issue.
>>
>> So swapping to eclipse-temurin will save us a bunch on the container size
>> which should help with container transfer and hopefully for startup times
>> as well.
>>
>> On Tue, Feb 7, 2023 at 5:41 PM Andrew Pilloud 
>> wrote:
>>
>>> This sounds reasonable to me as well.
>>>
>>> I've made swaps like this in the past, the base image of each is
>>> probably a bigger factor than the JDK. The openjdk images were based on
>>> Debian 11. The default eclipse-temurin images are based on Ubuntu 22.04
>>> with an alpine option. Ubuntu is a Debian derivative but the versions and
>>> package names aren't exact matches and Ubuntu tends to update a little
>>> faster. For most users I don't think this will matter but users building
>>> custom containers may need to make minor changes. The alpine option will be
>>> much smaller (which could be a significant improvement) but would be a more
>>> significant change to the environment.
>>>
>>> On Tue, Feb 7, 2023 at 5:18 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> Seems reasonable to me.
>>>>
>>>> On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user
>>>> wrote:
>>>> >
>>>> > As per [1], the JDK8 and JDK11 containers that Apache Beam uses have
>>>> stopped being built and supported since July 2022. I have filed [2] to
>>>> track the resolution of this issue.
>>>> >
>>>> > Based upon [1], almost everyone is swapping to the eclipse-temurin
>>>> container[3] as their base based upon the linked issues from the
>>>> deprecation notice[1]. The eclipse-temurin container is released under
>>>> these licenses:
>>>> > Apache License, Version 2.0
>>>> > Eclipse Distribution License 1.0 (BSD)
>>>> > Eclipse Public License 2.0
>>>> > - (Secondary) GNU General Public License, version 2 with OpenJDK
>>>> Assembly Exception
>>>> > - (Secondary) GNU General Public License, version 2 with the GNU
>>>> Classpath Exception
>>>> >
>>>> > I propose that we swap all our containers to the eclipse-temurin
>>>> containers[3].
>>>> >
>>>> > Open to other ideas and also would be great to hear about your
>>>> experience in any other projects that you have had to make a similar
>>>> decision.
>>>> >
>>>> > 1: https://github.com/docker-library/openjdk/issues/505
>>>> > 2: https://github.com/apache/beam/issues/25371
>>>> > 3: https://hub.docker.com/_/eclipse-temurin
>>>>
>>>


Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-14 Thread Luke Cwik via dev
I made some progress in testing the container and did hit an issue where
Ubuntu 22.04 "Jammy" is dependent on the version of Docker installed. It
turns out that our boot.go crashes with "runtime/cgo: pthread_create
failed: Operation not permitted" because the Ubuntu 22.04 is using new
syscalls that Docker 18.09.4 doesn't have a seccomp policy for (and uses a
default of deny). We have a couple of choices here:
1) upgrade the version of docker on Jenkins and require users to similarly
use a new enough version of Docker so that this isn't an issue for them
2) use Ubuntu 20.04 "Focal" as the docker container

I was using Docker 20.10.21 which is why I didn't hit this issue when
testing the change locally.

We could also do these but they seem strictly worse than either of the two
options discussed above:
A) disable the seccomp policy on Jenkins
B) use a custom seccomp policy on Jenkins

My suggestion is to upgrade Docker versions on Jenkins and use Ubuntu 22.04
as it will have LTS releases till 2027 and then security patches till 2032
which gives everyone the longest runway till we need to swap OS versions
again for users of Apache Beam. Any concerns or ideas?
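
If option 1 is chosen, a guard that compares the installed Docker version against a known-good one could look roughly like the sketch below. The threshold is an assumption: the message only establishes that 20.10.21 worked and 18.09.4 did not, so 20.10.21 is used here as a known-good version, not as the true minimum. `parseVersion` and `atLeast` are hypothetical helpers, and obtaining the version string from the Docker daemon is left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns a string such as "20.10.21" or "18.09.4-ce" into
// its three numeric components, ignoring anything after the first "-".
func parseVersion(s string) ([3]int, error) {
	var out [3]int
	core := strings.SplitN(s, "-", 2)[0]
	parts := strings.Split(core, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("version %q: want MAJOR.MINOR.PATCH", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("version %q: %w", s, err)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether have is the same as or newer than want.
func atLeast(have, want [3]int) bool {
	for i := range have {
		if have[i] != want[i] {
			return have[i] > want[i]
		}
	}
	return true
}

func main() {
	// Assumption: 20.10.21 is known to work; the real minimum may be lower.
	knownGood, err := parseVersion("20.10.21")
	if err != nil {
		panic(err)
	}
	for _, v := range []string{"18.09.4", "20.10.21"} {
		have, err := parseVersion(v)
		if err != nil {
			panic(err)
		}
		fmt.Printf("docker %s at or above known-good: %t\n", v, atLeast(have, knownGood))
	}
}
```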



On Thu, Feb 9, 2023 at 10:20 AM Luke Cwik  wrote:

> Our current Java 8 container is 262 MiBs and layers on top of
> openjdk:8-bullseye which is 226 MiBs compressed while eclipse-temurin:8 is
> 92 MiBs compressed and eclipse-temurin:8-alpine is 65 MiBs compressed.
>
> I would rather not get into issues with C library differences caused by
> the alpine project so I would stick with the safer option and let users
> choose alpine when building their custom container if they feel it provides
> a large win for them. We can always swap to alpine in the future as well if
> the C library differences become a non-issue.
>
> So swapping to eclipse-temurin will save us a bunch on the container size
> which should help with container transfer and hopefully for startup times
> as well.
>
> On Tue, Feb 7, 2023 at 5:41 PM Andrew Pilloud  wrote:
>
>> This sounds reasonable to me as well.
>>
>> I've made swaps like this in the past, the base image of each is probably
>> a bigger factor than the JDK. The openjdk images were based on Debian 11.
>> The default eclipse-temurin images are based on Ubuntu 22.04 with an alpine
>> option. Ubuntu is a Debian derivative but the versions and package names
>> aren't exact matches and Ubuntu tends to update a little faster. For most
>> users I don't think this will matter but users building custom containers
>> may need to make minor changes. The alpine option will be much smaller
>> (which could be a significant improvement) but would be a more significant
>> change to the environment.
>>
>> On Tue, Feb 7, 2023 at 5:18 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Seems reasonable to me.
>>>
>>> On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user 
>>> wrote:
>>> >
>>> > As per [1], the JDK8 and JDK11 containers that Apache Beam uses have
>>> stopped being built and supported since July 2022. I have filed [2] to
>>> track the resolution of this issue.
>>> >
>>> > Based upon [1], almost everyone is swapping to the eclipse-temurin
>>> container[3] as their base based upon the linked issues from the
>>> deprecation notice[1]. The eclipse-temurin container is released under
>>> these licenses:
>>> > Apache License, Version 2.0
>>> > Eclipse Distribution License 1.0 (BSD)
>>> > Eclipse Public License 2.0
>>> > - (Secondary) GNU General Public License, version 2 with OpenJDK
>>> Assembly Exception
>>> > - (Secondary) GNU General Public License, version 2 with the GNU
>>> Classpath Exception
>>> >
>>> > I propose that we swap all our containers to the eclipse-temurin
>>> containers[3].
>>> >
>>> > Open to other ideas and also would be great to hear about your
>>> experience in any other projects that you have had to make a similar
>>> decision.
>>> >
>>> > 1: https://github.com/docker-library/openjdk/issues/505
>>> > 2: https://github.com/apache/beam/issues/25371
>>> > 3: https://hub.docker.com/_/eclipse-temurin
>>>
>>


Re: Beam Release 2.46

2023-02-14 Thread Damon Douglas via dev
Hello Danny,

Do you mind if I shadow you while you do this?

Best,

Damon

On Thu, Feb 9, 2023 at 3:17 PM Kenneth Knowles  wrote:

> Excellent! Keep that release train rolling.
>
> On Thu, Feb 9, 2023 at 9:28 AM Ahmet Altay via dev 
> wrote:
>
>> Thank you Danny!
>>
>> On Wed, Feb 8, 2023 at 6:46 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hey everyone, I would like to volunteer myself to do the 2.46.0 release.
>>>
>>> I will cut the branch Feb 22 [1], and cherrypick any blocking fixes
>>> afterwards. Please review the current release blockers [2] and remove the
>>> 2.46 milestone if they don't meet the criteria at [3].
>>>
>>> Thanks,
>>> Danny
>>>
>>> [1]
>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>>> [2] https://github.com/apache/beam/milestone/9
>>> [3] https://beam.apache.org/contribute/release-blocking/
>>>
>>