Re: Beam Release 2.46

2023-02-09 Thread Kenneth Knowles
Excellent! Keep that release train rolling.

On Thu, Feb 9, 2023 at 9:28 AM Ahmet Altay via dev 
wrote:

> Thank you Danny!
>
> On Wed, Feb 8, 2023 at 6:46 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> Hey everyone, I would like to volunteer myself to do the 2.46.0 release.
>>
>> I will cut the branch Feb 22 [1], and cherrypick any blocking fixes
>> afterwards. Please review the current release blockers [2] and remove the
>> 2.46 milestone if they don't meet the criteria at [3].
>>
>> Thanks,
>> Danny
>>
>> [1]
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>> [2] https://github.com/apache/beam/milestone/9
>> [3] https://beam.apache.org/contribute/release-blocking/
>>
>


Re: [Go SDK] Direct Runner Replacement: Prism

2023-02-09 Thread Kenneth Knowles
Just a +100 to the idea of this runner. Having an easy-to-read,
portable-execution, batch & streaming, parallel, local runner, that
exercises plenty of advanced model features... solid gold!


Re: Python 3.11 support in Apache Beam

2023-02-09 Thread Anand Inguva via dev
Yes, we may need to update all of them.
I can add more information once I dig into the issue (most likely next
week). I will comment on my findings on the issue:
https://github.com/apache/beam/issues/24569 and will periodically update
this thread.

On Tue, Feb 7, 2023 at 5:47 PM Valentyn Tymofieiev 
wrote:

> On Tue, Feb 7, 2023 at 2:35 PM Anand Inguva 
> wrote:
>
>> Yes, it is related to protobuf only. But I think updating these
>> dependencies is required for Python 3.11 since the newer versions have
>> support for Python 3.11 wheels.
>>
> Assuming you refer to protobuf. Yes, there are no wheels for 3.11 for
> protobuf==3.x.x and that can cause friction.
> https://pypi.org/project/protobuf/3.20.3/#files
>
> I would probably narrow the problem further to demonstrate which stubs are
> not being generated, and if the reason is not obvious we can also ask for feedback
> from protobuf maintainers. Also - do we by chance need to update some other
> deps from
> https://github.com/apache/beam/blob/master/sdks/python/build-requirements.txt#L28-L33
> for this to work?
>
> Also: tracking issue for protobuf4 support in Beam:
> https://github.com/apache/beam/issues/24569.
>
>> If we use older versions of these packages, then we have to depend on
>> installing those packages on Python 3.11 from source distributions which is
>> not desired.
>>
>> I am working in parallel on that issue in a different PR
>> https://github.com/apache/beam/pull/24599 but I think this issue should
>> be a blocker for Python 3.11 update.
>>
>> On Tue, Feb 7, 2023 at 5:25 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> Hi Anand,
>>>
>>> On Tue, Feb 7, 2023 at 1:35 PM Anand Inguva via dev 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We are planning to work on adding support for Python 3.11[1] to Apache
>>>> Beam Python SDK.
>>>>
>>>> As part of this effort, we are going to update the python build
>>>> dependencies defined at [2].
>>>>
>>>> Right now, there is an error with the newer version of
>>>> protobuf (4.21.11). It is not generating _urns files.
>>>>
>>>> It can be reproduced by
>>>>
>>>> 1. python setup.py sdist
>>>> 2. pip install dist/apache-beam-x.xx.x.dev0.tar.gz
>>>> 3. switch to python interpreter and run import apache_beam as beam
>>>>
>>> I think the error you are describing is related to protobuf 4, so the
>>> repro should focus on the portion where generation of stubs is happening.
>>> Presumably some stubs are not generated on protobuf 4 + Python 3.11?
>>>
>>>> will lead to *ImportError: cannot import name
>>>> 'beam_runner_api_pb2_urns' from 'apache_beam.portability.api'*. Running
>>>> `python gen_protos.py` to forcefully generate files didn't help either.
>>>>
>>>> If you have encountered this error and found a resolution, please let
>>>> me know (that would be super helpful).
>>>>
>>>> I am going to work on this soon. Please let me know if you want to
>>>> collaborate.
>>>>
>>>> Thanks,
>>>> Anand Inguva
>>>>
>>>> [1] https://github.com/apache/beam/pull/24721
>>>> [2]
>>>> https://github.com/apache/beam/blob/master/sdks/python/build-requirements.txt
>>>
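
Following the suggestion in this thread to narrow the repro to stub generation, here is a minimal sketch that checks which generated files are present on disk without importing apache_beam (importing it is what raises the ImportError). The module name beam_runner_api_pb2_urns comes from the error quoted above; the helper names, the assumption that the stubs are plain .py files under portability/api, and the choice to also check beam_runner_api_pb2.py are illustrative assumptions, not something stated in the thread.

```python
import importlib.util
from pathlib import Path


def package_dir(package_name):
    """Locate a top-level package on disk without importing it."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def missing_files(root, relative_paths):
    """Return the relative paths that do not exist as files under root."""
    return [p for p in relative_paths if not (Path(root) / p).is_file()]


# Assumed file layout: the second name is derived from the ImportError in the
# thread; the first is included to tell "no stubs at all" apart from
# "only the _urns file is missing".
EXPECTED_FILES = [
    "portability/api/beam_runner_api_pb2.py",
    "portability/api/beam_runner_api_pb2_urns.py",
]

if __name__ == "__main__":
    root = package_dir("apache_beam")
    if root is None:
        print("apache_beam is not installed in this environment")
    else:
        for path in missing_files(root, EXPECTED_FILES):
            print("missing generated file:", path)
```

Running this after step 2 of the repro, once under protobuf 3.x and once under protobuf 4.x, would show whether the difference is in what gets generated or in what gets packaged.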


Re: [Go SDK] Direct Runner Replacement: Prism

2023-02-09 Thread Robert Burke via dev
Here are the first of the smaller PRs:

https://github.com/apache/beam/pull/25404 -> Adds READMEs and updates
go.mod so later changes don't collide there.
https://github.com/apache/beam/pull/25405 -> Adds internal/urns package for
extracting URNs from the protos.
https://github.com/apache/beam/pull/25406 -> Adds internal/config package
for parsing and accessing the configuration of variants and handlers in the
runner.

These are independent changes, and small enough for quicker review. The
remaining larger packages can be submitted more piecemeal once these are in.




Re: [Go SDK] Direct Runner Replacement: Prism

2023-02-09 Thread Austin Bennett
Thanks for the work on this; a very welcomed feature/contribution!


Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-09 Thread Luke Cwik via dev
Our current Java 8 container is 262 MiB and layers on top of
openjdk:8-bullseye, which is 226 MiB compressed, while eclipse-temurin:8 is
92 MiB compressed and eclipse-temurin:8-alpine is 65 MiB compressed.

I would rather not get into issues with C library differences caused by the
alpine project so I would stick with the safer option and let users choose
alpine when building their custom container if they feel it provides a
large win for them. We can always swap to alpine in the future as well if
the C library differences become a non-issue.

So swapping to eclipse-temurin will save us a bunch on the container size
which should help with container transfer and hopefully for startup times
as well.
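
To make the size comparison concrete, here is a rough back-of-the-envelope sketch. It assumes the 262 figure is also a compressed size and that the layers Beam adds on top of the base image keep the same size after the swap; neither assumption is stated in this message, and the function name is illustrative.

```python
# Sizes in MiB, as quoted in the message above.
CURRENT_BEAM_JAVA8 = 262
BASE_IMAGES = {
    "openjdk:8-bullseye": 226,
    "eclipse-temurin:8": 92,
    "eclipse-temurin:8-alpine": 65,
}


def estimate_sizes(current_total, current_base, base_images):
    """Estimate the container size on each candidate base image.

    Assumes the layers Beam adds keep their size when the base changes.
    """
    beam_layers = current_total - base_images[current_base]
    return {name: size + beam_layers for name, size in base_images.items()}


if __name__ == "__main__":
    estimates = estimate_sizes(CURRENT_BEAM_JAVA8, "openjdk:8-bullseye", BASE_IMAGES)
    for name, size in estimates.items():
        saved = CURRENT_BEAM_JAVA8 - size
        print(f"{name}: ~{size} MiB ({saved} MiB smaller)")
```

Under those assumptions the Beam layers account for 36 MiB, so the container would be roughly 128 MiB on eclipse-temurin:8 and roughly 101 MiB on the alpine variant.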

On Tue, Feb 7, 2023 at 5:41 PM Andrew Pilloud  wrote:

> This sounds reasonable to me as well.
>
> I've made swaps like this in the past; the base image of each is probably
> a bigger factor than the JDK. The openjdk images were based on Debian 11.
> The default eclipse-temurin images are based on Ubuntu 22.04 with an alpine
> option. Ubuntu is a Debian derivative but the versions and package names
> aren't exact matches and Ubuntu tends to update a little faster. For most
> users I don't think this will matter but users building custom containers
> may need to make minor changes. The alpine option will be much smaller
> (which could be a significant improvement) but would be a more significant
> change to the environment.
>
> On Tue, Feb 7, 2023 at 5:18 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> Seems reasonable to me.
>>
>> On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user 
>> wrote:
>> >
>> > As per [1], the JDK8 and JDK11 containers that Apache Beam uses have
>> > stopped being built and supported since July 2022. I have filed [2] to
>> > track the resolution of this issue.
>> >
>> > Based upon [1], almost everyone is swapping to the eclipse-temurin
>> > container[3] as their base based upon the linked issues from the
>> > deprecation notice[1]. The eclipse-temurin container is released under
>> > these licenses:
>> > Apache License, Version 2.0
>> > Eclipse Distribution License 1.0 (BSD)
>> > Eclipse Public License 2.0
>> > - (Secondary) GNU General Public License, version 2 with OpenJDK
>> >   Assembly Exception
>> > - (Secondary) GNU General Public License, version 2 with the GNU
>> >   Classpath Exception
>> >
>> > I propose that we swap all our containers to the eclipse-temurin
>> > containers[3].
>> >
>> > Open to other ideas and also would be great to hear about your
>> > experience in any other projects that you have had to make a similar
>> > decision.
>> >
>> > 1: https://github.com/docker-library/openjdk/issues/505
>> > 2: https://github.com/apache/beam/issues/25371
>> > 3: https://hub.docker.com/_/eclipse-temurin
>>
>


Google Cloud Bigtable Change Stream Connector Submission Plan

2023-02-09 Thread Tony Tang via dev
Hi there,

I'm a developer on the Google Cloud Bigtable team working on expanding the
BigtableIO connector to stream change data from the database. Streaming
changes from Bigtable uses a brand new API.

I'm writing to explain our plan to submit Cloud Bigtable's new Change
Stream connector for the next release cut (2.46). I've briefly discussed
the plan with @pabloem.

The new API in the Cloud Bigtable Java Client is still being merged
https://github.com/googleapis/java-bigtable/pull/1569. After that, we need
to cut a new release, bump up the Google Cloud BOM version, and then bump
Apache Beam's Google Cloud BOM. Alternatively, we can temporarily bump up
just Cloud Bigtable's Java Client to be ahead of the BOM's version.

Since the API is not in beam's master branch yet, we would like to, in
parallel, incrementally review and merge into a feature branch. In order
for the code to compile and merge into the feature branch, we can
temporarily include a locally built jar of Cloud Bigtable Java Client that
includes the new API. Once beam is updated to include the new APIs, we can
rebase and merge into master. https://github.com/apache/beam/pull/25364 is
the first of the many incremental PRs to be reviewed and submitted to the
feature branch.

Thanks,

Tony


Beam Dependency Check Report (2023-02-09)

2023-02-09 Thread Apache Jenkins Server


Re: Community event cooperation

2023-02-09 Thread Ahmet Altay via dev
Great.

Beam summit submissions are open - https://sessionize.com/beam-summit -- I
am sure we would love to meet and hear about how two projects could be used
together, or two communities collaborate, or learn from your experience in
building a community.

Also, it would be great to see you folks there if more than one person
could attend.

Ahmet

On Wed, Feb 8, 2023 at 11:26 PM 曾辉  wrote:

> Thank you for the information!
>
> Great suggestion, I think, maybe we can submit an issue? If registration
> is still open? Or we have people from the community go to this summit,
> which is doable.
>
> Regards,
>
> Hui Zeng | Community Manager
> M: +86 18819063834
>
> Apache DolphinScheduler Committer
> zeng...@apache.org
> https://twitter.com/Niko_Zeng
>
>
> On Thu, Feb 9, 2023 at 11:53, Ahmet Altay wrote:
>
>> Hi!
>>
>> Thank you for the follow up.
>>
>> Brittany is still part of this community but, as you pointed out, she would
>> probably have less time for Beam after the changes at Google. I assume the
>> best contact will be the mailing list (dev@), and Danielle (who you
>> added) here is also still actively working on community engagement.
>>
>> And great to hear that you have a person working in the United States
>> too.
>>
>> If I remember correctly, we were not able to clearly identify use cases
>> where DolphinScheduler & Beam are used together, and after that it was hard
>> to identify joint activities. Some concrete things we could do: a person
>> from DolphinScheduler could participate in the Beam Summit (June 13 - 15:
>> https://beamsummit.org/). We could coordinate a meetup. Perhaps the
>> DolphinScheduler person in the US could help with that. If they organize a
>> meetup in a place local to one of the Beam community members they could
>> participate. If you have any other concrete ideas please share.
>>
>> Ahmet
>>
>>
>> On Wed, Feb 8, 2023 at 5:19 PM 曾辉  wrote:
>>
>>> Hi Ahmet,
>>>
>>> I was sorry to hear about Google's recent personnel changes, including
>>> Brittany's post on LinkedIn; I hope everyone is well. I don't know what
>>> happened, but I haven't received feedback from the Beam community. We have
>>> always hoped to cooperate with the Beam community so that users from
>>> both projects have a gathering place. I am a little unclear about
>>> who would be better to contact now. If you feel it is unsuitable, you can
>>> also email me privately, and I will contact you in the future. Another piece
>>> of good news is that we have found an evangelist in the United States. If
>>> there are offline activities, we can also support them!
>>>
>>> Regards,
>>>
>>> Hui Zeng | Community Manager
>>> M: +86 18819063834
>>>
>>> Apache DolphinScheduler Committer
>>> zeng...@apache.org
>>> https://twitter.com/Niko_Zeng
>>>
>>>
>>> On Tue, Aug 9, 2022 at 07:03, Ahmet Altay via dev wrote:
>>>
>>>> Hi Niko,
>>>>
>>>> Thank you for reaching out. We do have contributors who might be
>>>> interested in participating but they might have limited time for
>>>> participating in an event. If you can clarify what you are looking for
>>>> (e.g. speaker? help with coordination? etc.) people might be able to give
>>>> you a better answer.
>>>>
>>>> Another note, we have been organizing meetups for a while (with @Danielle
>>>> Syse and @Brittany Hermann doing the recent organizing), we could also
>>>> see if someone from DolphinScheduler could participate as a speaker in
>>>> one of those.
>>>>
>>>> A question for my learning, I am not familiar with DolphinScheduler.
>>>> Are there users or potential use cases where DolphinScheduler & Beam are
>>>> used together?
>>>>
>>>> Thank you!
>>>> Ahmet
>>>>
>>>> On Sun, Aug 7, 2022 at 6:13 PM 曾辉 wrote:
>>>>
>>>>> Anyone interested?
>>>>>
>>>>> On Thu, Aug 4, 2022 at 16:23, 曾辉 wrote:
>>>>>
>>>>>> Hey, Developers in the Apache Beam community, How's your day?
>>>>>>
>>>>>> I'm Apache DolphinScheduler Community Manager, you can call me Niko,
>>>>>> nice to meet you all
>>>>>>
>>>>>> Apache DolphinScheduler is a world-renowned data orchestration tool
>>>>>> that has largely taken the scheduler market in China. Over 1000
>>>>>> companies, including IBM, Tencent, iFlytek, Meituan, 360, China Unicom,
>>>>>> Shein, and SF Express, are relying on its decentralized infrastructure
>>>>>> and no-code DAG interface. Apache DolphinScheduler also owns the largest
>>>>>> developer community in China and each meetup gathers over 3K attendees.
>>>>>>
>>>>>> We would love to find partners like Apache Beam to co-host events in
>>>>>> the Bay Area, to share our resources with fellow Apache teams.
>>>>>>
>>>>>> click the link
>>>>>>
>>>>>> is an introduction to our community programs. If you are interested in
>>>>>> becoming our partner and holding a Meetup together, please contact me in
>>>>>> the mail, or schedule a zoom call to discuss the details sometime next
>>>>>> week.

Re: Exploring existing Beam features to prevent web service API overuse

2023-02-09 Thread Kerry Donny-Clark via dev
Thanks Damon, I appreciate the data-driven effort you are making to test
different approaches to rate limiting in Beam.

On Thu, Feb 9, 2023 at 12:54 AM Damon Douglas 
wrote:

> Hello Everyone,
>
> The following exploratory study proposal aims to evaluate existing
> features of Beam to prevent web service API overuse.  API providers
> typically design for application workloads smaller than parallelized data
> processing.  This presents a challenge when Beam pipelines read from and
> write to these resources.
>
> The study defines primary and secondary measures, identifying experimental
> groups based on a key Beam feature applied to its pipeline design, such as
> windowing or the State API.  The data will foster evidence-based approaches
> to designing a solution to this problem.
>
>
> https://docs.google.com/document/d/1VZ9YphDO7kewBSz5oMXVPHWaib3S03Z6aZ66BhciB3E/edit?usp=sharing=0-ItxMSG72EzfSwVedSz-Zeg
>
> Best,
>
> Damon
>
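
For readers unfamiliar with the problem described above, here is a minimal plain-Python sketch of a token-bucket limiter. It is not one of the study's experimental groups and does not use any Beam API; the class name, its parameters, and the injectable clock are all illustrative assumptions.

```python
import time


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return whether the call may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A limiter like this only bounds the worker that owns it. With N parallel workers each holding its own bucket, the aggregate rate seen by the API is up to N times `rate`, which is the gap between application-scale and pipeline-scale workloads that the proposal sets out to measure.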


Re: [Go SDK] Direct Runner Replacement: Prism

2023-02-09 Thread Jack McCluskey via dev
Congratulations on getting the runner to a state you're happy contributing
to the main repo! I'm happy to help review PRs and get sub-packages in.
Anything that helps developers and users test Beam pipelines more
effectively is a welcome inclusion.

Thanks,

Jack McCluskey

P.S. I'm glad the Prism name stuck, that's definitely one of my finer
branding efforts

On Wed, Feb 8, 2023 at 6:23 PM Robert Burke  wrote:

> Hello Beam!
>
> == tl;dr; ==
>
> I wrote a local, portable Beam runner in Go to replace the Go direct
> runner.  I'd like to contribute it to the Beam Repo. The Big PR with
> everything is here: https://github.com/apache/beam/pull/25391
>
> I'll be sending smaller PRs out for review to get it into the repo. Take a
> look at the big one, don't mind the mess, but do ask questions, or offer
> constructive suggestions to make it clearer. There are ample TODOs that
> could be added. This thread will be kept up to date with the progress.
>
> Highlights:
> Avoids false positive issues the Go Direct runner has, especially around
> serialization issues.
> Single transform at a time execution.
> Watermark propagation through Graph for GBKs and Side Input windowing.
> Will be capable of testing the whole Go SDK, in time.
> Will be capable of being a stand alone single binary runner, in time.
> ++Many opportunities for contribution after getting into the repo!++
>
> Lowlights:
> Only for Go SDK, for now.
> ~~Many unimplemented features~~
>
> Where to start reading?
>
> Vision README:
> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/README.md
>
>
> Code Structure README:
> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/README.md
>
>
> executePipeline entrypoint:
> https://github.com/apache/beam/blob/9044f2d4ae151f4222a2f3e0a3264c1198040181/sdks/go/pkg/beam/runners/prism/internal/execute.go#L41
>
>
>
> == The long version ==
>
> Since last year, I have been puttering away at making a Portable Beam Runner
> authored in Go, partly because I wanted to learn the "runner" half of Beam,
> and partly because the Go Direct Runner (and most other direct runners)
> are not good at testing.
>
> I managed to get it roughly ready for basic batch execution by end of
> February 2022, and then 2022 got away from me, and I couldn't pick it up
> until the end of the year.
>
> I gave a talk about this at Beam Summit 2022
> https://2022.beamsummit.org/sessions/portable-go-beam-runner/ that covers
> my motivation for it. Loosely, Beam has a Testing Problem. There are large
> parts of Beam execution that matter for real world performance and
> correctness, but the facilities to test these don't exist.  For example,
> take Combiner Lifting: if a combiner is unlifted but implements
> AddInput, then Merge is never called, leaving it untested. The user
> has no control over this, and may not even be aware of it. How a DoFn is
> executed matters for coverage, and user confidence.  In particular for
> Streaming jobs, users will tend to try things out on their Prod runner, but
> that doesn't help if one is testing on local Flink, but executing on Google
> Cloud Dataflow, which behave very differently.
>
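
The combiner lifting point in the paragraph above can be illustrated with a small plain-Python simulation. This is a sketch, not Beam or Prism code: the method names follow the usual accumulator-style combiner shape, while `MeanFn`, `run_unlifted`, `run_lifted`, and the `merge_calls` counter are hypothetical names introduced here.

```python
class MeanFn:
    """A mean combiner with accumulator-style methods (illustrative names)."""

    def __init__(self):
        self.merge_calls = 0  # counts how often the merge path is exercised

    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, acc, value):
        total, count = acc
        return (total + value, count + 1)

    def merge_accumulators(self, accs):
        self.merge_calls += 1
        if not accs:
            return self.create_accumulator()
        totals, counts = zip(*accs)
        return (sum(totals), sum(counts))

    def extract_output(self, acc):
        total, count = acc
        return total / count if count else float("nan")


def run_unlifted(fn, values):
    """All values reach a single accumulator, so the merge path never runs."""
    acc = fn.create_accumulator()
    for v in values:
        acc = fn.add_input(acc, v)
    return fn.extract_output(acc)


def run_lifted(fn, bundles):
    """Each bundle is pre-combined, then the partial accumulators are merged."""
    partials = []
    for bundle in bundles:
        acc = fn.create_accumulator()
        for v in bundle:
            acc = fn.add_input(acc, v)
        partials.append(acc)
    return fn.extract_output(fn.merge_accumulators(partials))
```

Both paths return the same mean for the same data, so a test that only compares outputs passes either way. Only the lifted path calls `merge_accumulators`, which is why a runner that never lifts leaves a buggy merge undetected.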
> Regardless of whether you agree with that thesis...  I wanted to fill that
> gap. I wanted a runner that could be configured to test those situations,
> and in particular, make it easier to develop SDKs and all the features of
> Beam that don't get their own blog posts.
>
> Especially for the Go SDK. Java, being the oldest, has arguably the only
> "correct" Beam runner, in the form of the Java Direct Runner. But one can't
> execute Go pipelines on that. Python has a portable execution of its
> runner, but the current state of Python is parallelism-hostile at best. It
> supports a great many things, like Cross Language, but can't support
> streaming execution (ProcessContinuations etc.) at present. Also, being a
> large Python program, it's harder to follow.  The Java Direct Runner, while
> being slightly easier to follow, doesn't have a clear execution flow.
> Neither of them is particularly easy for non-language-experts to stand up
> and use, especially outside of the Beam repo.
>
> The Go SDK's Direct Runner has many flaws, most of which are due to Direct
> execution, rather than Portable Execution.  Implementing features largely
> meant hacking certain things in, so they would be able to be executed. This
> also made supporting and testing Cross Language Transforms, State and
> Timers in Go pipelines a non-starter for users. And that's just the tip.
>
> So I wanted something better. I mentioned it a few times to others, but I
> kept hearing the same refrain: "I want something that does that". Or at
> least they wanted something simpler to understand to hack against
> themselves.
>
> I added more tests, and implemented more features, filed a tracking issue (
> https://github.com/apache/beam/issues/24789),