[ACTION REQUESTED] Please help me draft the Beam Board Report for September 2024

2024-09-11 Thread Kenneth Knowles
Hello all!

The next Beam board report is due today, Wednesday, September 11 (I'm a
little behind). Please draft it together at
https://s.apache.org/beam-draft-report-2024-09

The doc is open for anyone to edit.

Ideas:

 - highlights from CHANGES.md
 - interesting technical discussions
 - integrations with other projects
 - community events
 - major user-facing additions/deprecations

Past reports are at https://whimsy.apache.org/board/minutes/Beam.html for
examples.

Thanks,

Kenn


Re: Sunsetting Beam Python 3.8 Support

2024-08-26 Thread Kenneth Knowles
SGTM

Incidentally I poked around on PyPI for a minute but didn't find even basic
download analytics. Do we have data about usage of Python versions? (This
is not pushback - I'm all for turning things down at a natural pace (or
faster!); I'm just even *more* for having data around it.)
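
One possible source: pypistats.org publishes per-Python-minor-version
download counts. A minimal sketch of pulling them (assuming the public JSON
API documented on that site; not something Beam automates today):

    import requests

    # Download counts for apache-beam, broken down by Python minor version
    # (assumed response shape: {"data": [{"category": "3.8", "downloads": N,
    # ...}, ...]}).
    resp = requests.get(
        "https://pypistats.org/api/packages/apache-beam/python_minor")
    resp.raise_for_status()
    totals = {}
    for row in resp.json()["data"]:
        totals[row["category"]] = totals.get(row["category"], 0) + row["downloads"]
    for version, downloads in sorted(totals.items()):
        print(version, downloads)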

Kenn

On Mon, Aug 26, 2024 at 10:59 AM Jack McCluskey via dev 
wrote:

> Hey everyone,
>
> With Python 3.8 reaching end-of-life in October, I've started the work of
> removing support in the Beam repository. The aim is to target Beam release
> 2.60.0 for this, since the expected release cut date is on October 2nd,
> 2024. The start of this effort is at
> https://github.com/apache/beam/pull/32283/, updating our GitHub Actions
> workflows. For many workflows like our unit test suites this is not a large
> change; the Python version matrix simply omits 3.8 and runs on the
> remaining Python versions as expected. This is more complicated for a
> number of workflows that currently only run on 3.8 or both 3.8 and 3.12, as
> GitHub will not run the updated actions in the main repository until the PR
> updating them is submitted. This can already be seen in some workflow runs
> on the PR where Python 3.8 is no longer being installed in the runner
> environment, leading to failures.
>
> The current plan is to do as much validation of the new workflow files as
> I can before the above PR is submitted (hopefully the week after Beam
> Summit), then focus on getting any potential workflow breakages resolved
> before removing the core Python 3.8 support from the package. There may be
> some instability with our workflows, and I will try my best to resolve
> things as they pop up. This is the first Python version to have support
> dropped since we migrated to GitHub Actions, so there's going to be a
> decent amount of trial and error as we navigate this. That said, if you
> notice problems please let me know! Either file a standalone issue and tag
> me on it (@jrmccluskey) or leave a comment on
> https://github.com/apache/beam/issues/31192 so I can take a look.
>
> Thanks,
>
> Jack McCluskey
>
> --
>
>
> Jack McCluskey
> SWE - DataPLS PLAT/ Dataflow ML
> RDU
> jrmcclus...@google.com
>
>
>


Re: Beam Patch Releases

2024-08-26 Thread Kenneth Knowles
Is the gap between current automation and patch releases just that we can't
choose the base branch to start from?

On Fri, Aug 23, 2024 at 10:40 PM Robert Burke  wrote:

> LGTM with the addendum that if we approve of the patch process, we
> automate the patch PR process via an action like we do for a regular cut.
>
> We've only been able to make our releases faster through this automation,
> there's no sense in dropping that when the criteria for a patch require a
> quick, timely release.
>
> On Fri, Aug 23, 2024, 7:24 AM Kenneth Knowles  wrote:
>
>> This looks great to me.
>>
>> On Fri, Aug 23, 2024 at 4:52 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hey folks, we've now run 2 emergency patch releases in the last year -
>>> both times it has been pretty ad hoc, with someone noticing a major
>>> issue, suggesting a fix, and then someone with available time jumping in to
>>> make the release happen. There hasn't been a clear path on how much voting
>>> is enough/how long we should wait for the release to be voted on. While I
>>> think it has ended up working reasonably well, I'd like to propose a more
>>> formalized process for patch releases. I put together a doc to do this here
>>> -
>>> https://docs.google.com/document/d/1o4UK444hCm1t5KZ9ufEu33e_o400ONAehXUR9A34qc8/edit?usp=sharing
>>>
>>> I think the pieces most folks will probably care about are the criteria
>>> for running a patch release and the voting process, so I've inlined both
>>> below. Please let me know what you think.
>>>
>>> Criteria for patch release:
>>>
>>> While Beam normally releases on a 6 week cadence, with a minor version
>>> bump for each release, it is sometimes necessary to make an emergency patch
>>> release. One of the following criteria must be met to consider a patch
>>> release:
>>>
>>>
>>>-
>>>
>>>A significant new bug was released in the last release. This could
>>>include major losses of functionality for a runner, an SDK bug breaking a
>>>feature, or a transform/IO which no longer works under certain 
>>> conditions.
>>>Regressions which have been around for multiple releases do not meet this
>>>bar.
>>>-
>>>
>>>A major bug was discovered in a previous release which causes data
>>>corruption or loss
>>>-
>>>
>>>A critical vulnerability was discovered which exposes users to
>>>significant security risk.
>>>
>>>
>>> Voting process:
>>>
>>> Because of the time-sensitive nature of emergency patch releases, voting
>>> does not require a 3 day finalization period. However, it does still
>>> require the following:
>>>
>>>
>>>-
>>>
>>>3 approving binding (PMC) votes
>>>-
>>>
>>>0 disapproving (binding or non-binding) votes, or explicit
>>>acknowledgement from the binding voters that it is safe to ignore the
>>>disapproving votes.
>>>
>>>
>>> There are no minimum time requirements on how long the vote must be
>>> open; however, the releaser must include their target timeline in their
>>> release candidate email so that voters can respond accordingly.
>>>
>>> Thanks,
>>> Danny
>>>
>>>


Re: [DISCUSS] Beam 3.0: Paving the Path to the Next Generation Data Processing Framework

2024-08-23 Thread Kenneth Knowles
he weaker parts).
>> - ML is obviously doing well, and Beam's turnkey transform idea is also
>> doing well; we could expand on both.
>> - Whatever we do, we need to make it a non-breaking change. Breaking
>> changes turn out poorly for users and us. We might even gradually get to
>> 3.0.
>> - As we get closer, we should think about a way to market 3.0 with a big
>> bang, I am sure there will be many ideas.
>>
>> Process wish: I hope we can find a structured way to make progress. When
>> there is a lot of excitement, energy, and ideas, we must have a clear
>> process for deciding what to do and how to build it to move this forward.
>>
>> Ahmet
>>
>>
>>
>> On Thu, Aug 22, 2024 at 3:51 PM XQ Hu via dev 
>> wrote:
>>
>>> Thanks a lot for these discussions so far! I really like all of the
>>> thoughts.
>>> If you have some time, please add these thoughts to this public doc:
>>> https://docs.google.com/document/d/13r4NvuvFdysqjCTzMHLuUUXjKTIEY3d7oDNIHT6guww/
>>> Everyone should have write permission. Feel free to add/edit themes
>>> as well.
>>> Again, thanks a lot!
>>> For any folks who will attend Beam Summit 2024, see you all there and
>>> let us have more casual chats during the summit!
>>>
>>> On Thu, Aug 22, 2024 at 5:07 PM Valentyn Tymofieiev via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> >  Key to this will be a push to producing/consuming structured data
>>>> (as has been mentioned) and also well-structured,
>>>> language-agnostic configuration.
>>>>
>>>> > Unstructured data (aka "everything is bytes with coders") is
>>>> overrated and should be an exception not the default. Structured data
>>>> everywhere, with specialized bytes columns.
>>>>
>>>> +1.
>>>>
>>>> I am seeing a tendency in distributed data processing engines to
>>>> heavily recommend and use relational APIs to express data-processing cases
>>>> on structured data, for example,
>>>>
>>>> Flink has introduced the Table API:
>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tableapi/
>>>>
>>>> Spark has recently evolved their Dataframe API into a language-agnostic
>>>> portability layer:
>>>> https://spark.apache.org/docs/latest/spark-connect-overview.html
>>>> Some lesser-known and more recent data processing engines also offer a
>>>> subset of Dataframe or SQL, or a Dataframe API that is later translated
>>>> into SQL.
>>>>
>>>> In contrast, in Beam, the SQL and Dataframe APIs are more limited add-ons,
>>>> natively available in the Java and Python SDKs respectively. It might be
>>>> worthwhile to consider whether introducing a first-class relational API
>>>> would make sense in Beam 3, and how it would impact Beam's cross-runner
>>>> portability story.
>>>>
>>>> On Thu, Aug 22, 2024 at 12:21 PM Robert Bradshaw via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> Echoing many of the comments here, but organizing them under a single
>>>>> theme, I would say a good focus for Beam 3.0 could be centering around
>>>>> being more "transform-centric." Specifically:
>>>>>
>>>>> - Make it easy to mix and match transforms across pipelines and
>>>>> environments (SDKs). Key to this will be a push to producing/consuming
>>>>> structured data (as has been mentioned) and also well-structured,
>>>>> language-agnostic configuration.
>>>>> - Better encapsulation for transforms. The main culprit here is update
>>>>> compatibility, but there may be other issues as well. Let's try to
>>>>> actually solve that for both primitives and composites.
>>>>> - Somewhat related to the above, I would love to actually solve the
>>>>> early/late output issue, and I think retractions and sink triggers are
>>>>> powerful paradigms we could develop to actually solve this issue in a
>>>>> novel way.
>>>>> - Continue to refine the idea of "best practices." This includes the
>>>>> points above, as well as things like robust error handling,
>>>>> monitoring, etc.
>>>>>
>>>>> Once we have these in place we are in a position to offer a powerful
>>>>> catalogue of ea
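
On the relational-API point above: today's DataFrame add-on in the Python
SDK looks roughly like this (a minimal sketch):

    import apache_beam as beam
    from apache_beam.dataframe.convert import to_dataframe

    # The DataFrame API layers deferred, pandas-style operations over a
    # schema'd PCollection.
    with beam.Pipeline() as p:
        rows = p | beam.Create(
            [beam.Row(word="a", n=1), beam.Row(word="b", n=2)])
        df = to_dataframe(rows)
        total = df.n.sum()  # deferred; computed when the pipeline runs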

Re: Beam Patch Releases

2024-08-23 Thread Kenneth Knowles
This looks great to me.

On Fri, Aug 23, 2024 at 4:52 AM Danny McCormick via dev 
wrote:

> Hey folks, we've now run 2 emergency patch releases in the last year -
> both times it has been pretty ad hoc, with someone noticing a major
> issue, suggesting a fix, and then someone with available time jumping in to
> make the release happen. There hasn't been a clear path on how much voting
> is enough/how long we should wait for the release to be voted on. While I
> think it has ended up working reasonably well, I'd like to propose a more
> formalized process for patch releases. I put together a doc to do this here
> -
> https://docs.google.com/document/d/1o4UK444hCm1t5KZ9ufEu33e_o400ONAehXUR9A34qc8/edit?usp=sharing
>
> I think the pieces most folks will probably care about are the criteria for
> running a patch release and the voting process, so I've inlined both below.
> Please let me know what you think.
>
> Criteria for patch release:
>
> While Beam normally releases on a 6 week cadence, with a minor version
> bump for each release, it is sometimes necessary to make an emergency patch
> release. One of the following criteria must be met to consider a patch
> release:
>
>
>-
>
>A significant new bug was released in the last release. This could
>include major losses of functionality for a runner, an SDK bug breaking a
>feature, or a transform/IO which no longer works under certain conditions.
>Regressions which have been around for multiple releases do not meet this
>bar.
>-
>
>A major bug was discovered in a previous release which causes data
>corruption or loss
>-
>
>A critical vulnerability was discovered which exposes users to
>significant security risk.
>
>
> Voting process:
>
> Because of the time-sensitive nature of emergency patch releases, voting
> does not require a 3 day finalization period. However, it does still
> require the following:
>
>
>-
>
>3 approving binding (PMC) votes
>-
>
>0 disapproving (binding or non-binding) votes, or explicit
>acknowledgement from the binding voters that it is safe to ignore the
>disapproving votes.
>
>
> There are no minimum time requirements on how long the vote must be open;
> however, the releaser must include their target timeline in their release
> candidate email so that voters can respond accordingly.
>
> Thanks,
> Danny
>
>


Re: [DISCUSS] Beam 3.0: Paving the Path to the Next Generation Data Processing Framework

2024-08-22 Thread Kenneth Knowles
I think this is a good idea. Fun fact - I think the first time we talked
about "3.0" was 2018.

I don't want to break users with 3.0 TBH, despite that being what a major
version bump suggests. But I also don't want a triple-digit minor version.
I think 3.0 is worthwhile if we have a new emphasis that is very meaningful
to users and contributors.


A couple things I would say from experience with 2.0:

 - A lot of new model features are dropped before completion. Can we make
it easier to evolve? Maybe not, since in a way it is our "instruction set".

 - Transforms that provide straightforward functionality have a big impact:
RunInference, IOs, etc. I like that these get more discussion now, whereas
early in the project a lot of focus was on primitives and runners.

 - Integrations like YAML (and there will be plenty more I'm sure) that
rely on transforms as true no-code black boxes with non-UDF configuration
seem like the next step in abstraction and ease of use.

 - Update compatibility needs, which break through all our abstractions,
have blocked innovative changes and UX improvements, and had a chilling
effect on refactoring and the things that make software continue to
approach Quality.


And a few ideas I have about the future of the space, agreeing with XQ and
Jan

 - Unstructured data (aka "everything is bytes with coders") is overrated
and should be an exception not the default. Structured data everywhere,
with specialized bytes columns. We can make small steps in this direction
(and we are already).
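
A small illustration of that direction, as it looks in today's Python SDK (a
sketch; the field names are invented):

    import apache_beam as beam

    # A structured PCollection: schema'd rows with a dedicated bytes column,
    # rather than opaque "bytes with a coder" elements.
    with beam.Pipeline() as p:
        rows = (
            p
            | beam.Create([("k1", b"\x00\x01"), ("k2", b"\x02")])
            | beam.Map(lambda kv: beam.Row(key=kv[0], payload=kv[1])))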

 - Triggers are really not a great construct. "Sink triggers" map better to
use cases but how to implement them is a long adventure. But we really
can't live without *something* to manage early output / late input, and the
options in all other systems I am aware of are even worse.
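
For reference, the construct in question - triggering attached to the
windowing strategy rather than to the consumer that wants early output -
looks like this in today's Python SDK (a minimal sketch):

    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    # Early firings are configured on the PCollection's windowing, not on
    # the "sink" that actually cares about them.
    with beam.Pipeline() as p:
        counts = (
            p
            | beam.Create(range(120))
            | beam.Map(lambda i: window.TimestampedValue(i, i))
            | beam.WindowInto(
                window.FixedWindows(60),
                trigger=trigger.AfterWatermark(
                    early=trigger.AfterProcessingTime(30)),
                accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
            | beam.Map(lambda i: ("all", i))
            | beam.CombinePerKey(sum))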

And a last thought is that we shouldn't continue to work on last decade's
problems, if we can avoid it. Maybe there is a core to Beam that is
imperfect but good enough (unification of batch & streaming; integration of
many languages; core primitives that apply to any engine capable of
handling our use cases) and what we want to do is focus on what we can
build on top of it. I think this is implied by everything in this thread so
far but I just wanted to say it explicitly.

Kenn

On Tue, Aug 20, 2024 at 9:03 AM Jan Lukavský  wrote:

> Formatting and coloring. :)
>
> 
>
> Hi XQ,
>
> thanks for starting this discussion!
>
> I agree we are getting to a point when discussing a major update of Apache
> Beam might be a good idea. Because such a window of opportunity happens only
> once in (quite many) years, I think we should try to use our current
> experience with the Beam model itself and check if there is any room for
> improvement there. First of all, we have some parts of the model itself
> that are not implemented in Beam 2.0, e.g. retractions. Second, there are
> parts that are known to be error-prone, e.g. triggers. Another topic is
> features that are missing in the current model, e.g. iterations (yes, I
> know, general iterations might not even be possible, but it seems we can
> create reasonable constraints for them to work for cases that really
> matter). Last but not least, we might want to re-think how we structure
> transforms, because that has a direct impact on how expensive it is to
> implement a new runner (GBK/Combine vs stateful ParDo).
>
> Having said that, my suggestion would be to take a higher-level look
> first, define which parts of the model are battle-tested enough that we
> trust them as a definite part of the 3.0 model, question all the others, and
> then iterate over this to come up with a new proposition of the model, with
> focus on what you emphasize - use cases, user-friendly APIs, and concepts
> with as little unexpected behavior as possible. A key part of this should
> be a discussion about how we position Beam on the market - simplicity and
> correctness should be the key points, because practice shows people tend to
> misunderstand streaming concepts (which is absolutely understandable!).
>
> Best,
>
>  Jan
>

Re: BatchElements Overview Documentation

2024-08-16 Thread Kenneth Knowles
I like it!

On Thu, Aug 15, 2024 at 6:59 PM Danny McCormick via dev 
wrote:

> Thanks, left a few comments, but overall this looks like great content!
>
> +1 to everything Robert said as well.
>
> Thanks,
> Danny
>
> On Thu, Aug 15, 2024 at 9:25 PM Robert Burke  wrote:
>
>> I like it!
>>
>> My vote is put it on the beam site
>>
>> Likely linked from here
>>
>> https://beam.apache.org/documentation/ml/about-ml/
>>
>> But also as a sibling to that page, it's in the Python-specific section
>> at least.
>>
>> If it's done in the next few days, before the cut, it would be worth
>> including a callout and link to the page in 2.59.0's CHANGES.md.
>>
>> Website changes are technically outside of the release cycle, which makes
>> updates harder to surface, but if we include them in CHANGES, they become
>> part of the release notes, which get better readership, even if the content
>> has been live longer.
>>
>> On Thu, Aug 15, 2024, 11:58 AM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hey everyone,
>>>
>>> I've taken some time to write up some user-focused context around Beam
>>> Python's BatchElements transform. There are some nuanced considerations
>>> with this transform, especially the stateful implementation, and it felt
>>> like it was worth writing up! The end goal for the content is to be linked
>>> from the BatchElements docstring, either on this doc or on the
>>> documentation website proper. Feel free to take a look and give feedback!
>>>
>>>
>>> https://docs.google.com/document/d/1fOjIjIUH5dxllOGp5Z4ZmpM7BJhAJc2-hNjTnyChvgc/edit?usp=sharing
>>>
>>> Thanks,
>>>
>>> Jack McCluskey
>>>
>>> --
>>>
>>>
>>> Jack McCluskey
>>> SWE - DataPLS PLAT/ Dataflow ML
>>> RDU
>>> jrmcclus...@google.com
>>>
>>>
>>>
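
For readers who haven't used it, BatchElements groups individual elements
into lists for amortized per-batch processing; a minimal usage sketch (the
size bounds shown are its tuning knobs):

    import apache_beam as beam

    # Elements in, batches (lists) out; batch sizes adapt between the bounds.
    with beam.Pipeline() as p:
        batch_sizes = (
            p
            | beam.Create(range(100))
            | beam.BatchElements(min_batch_size=8, max_batch_size=32)
            | beam.Map(len)
            | beam.Map(print))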


Re: [PROPOSAL] 2.8.1 point release due to potential data duplication in KafkaIO

2024-08-16 Thread Kenneth Knowles via dev
Aha yes 2.58.0 and 2.58.1, respectively. I'm not proposing that we create a
release at 88 miles per hour.

Kenn

On Thu, Aug 15, 2024 at 2:10 PM Jack McCluskey via dev 
wrote:

> Hey everyone,
>
> I'm assuming you mean 2.58.1? I'm +1 on a patch release to fix the problem
> though.
>
> Thanks,
>
> Jack McCluskey
>
> On Thu, Aug 15, 2024 at 1:48 PM Kenneth Knowles  wrote:
>
>> Hi all
>>
>> I am proposing a point release because 2.8.0 has a critical data bug. The
>> KafkaIO unconditionally adds a "redistribute allowing duplicates" transform
>> to the read path, whether or not a user requested it.
>>
>> Issue: https://github.com/apache/beam/issues/32196
>> Fix: https://github.com/apache/beam/pull/32134
>>
>> Luckily most runners do not support "allowing duplicates" yet so the
>> impact is minimal. However, let's release an SDK without such a surprise
>> lurking in it.
>>
>> Kenn
>>
>


Re: [VOTE] Release 2.58.1, release candidate #1

2024-08-16 Thread Kenneth Knowles
+1 (binding)

On Fri, Aug 16, 2024 at 11:43 AM Robert Burke  wrote:

> +1 (binding)
>
> Validated the linux-amd64 prism binary with a few pipelines.
>
> On 2024/08/16 00:25:58 Danny McCormick via dev wrote:
> > Hi everyone,
> > Please review and vote on the patch release candidate #1 for the version
> > 2.58.1, as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > I'm still waiting on the PyPI pieces to finish up, but they should be
> done
> > shortly, at which point I'll update the list. Those links will eventually
> > be valid. Given that the only code changes are Java changes, though, I
> > wanted to get this out sooner rather than later.
> >
> > Reviewers are encouraged to test their own use cases with the release
> > candidate, and vote +1 if no issues are found. Only PMC member votes will
> > count towards the final vote, but votes from all community members are
> > encouraged and helpful for finding regressions; you can either test your
> > own use cases [12] or use cases from the validation sheet [9].
> >
> > The complete staging area is available for your review, which includes:
> > * the official Apache source release to be deployed to dist.apache.org
> [1],
> > which is signed with the key with fingerprint D20316F712213422 [2],
> > * all artifacts to be deployed to the Maven Central Repository [3],
> > * source code tag "v2.58.1-RC1" [4],
> > * website pull request listing the release [5], the blog post [5], and
> > publishing the API reference manual [6].
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org [1] and PyPI[7].
> > * Go artifacts and documentation are available at pkg.go.dev [8]
> > * Validation sheet with a tab for 2.58.1 release to help with validation
> > [9].
> > * Docker images published to Docker Hub [10].
> > * PR to run tests against release branch [11].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > For guidelines on how to try the release in your projects, check out our
> > RC testing guide [12].
> >
> > Thanks,
> > Release Manager
> >
> > [1] https://dist.apache.org/repos/dist/dev/beam/2.58.1/
> > [2] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [3]
> https://repository.apache.org/content/repositories/orgapachebeam-1384/
> > [4] https://github.com/apache/beam/tree/v2.58.1-RC1
> > [5] https://github.com/apache/beam/pull/32205
> > [6] https://github.com/apache/beam/pull/32206
> > [7] https://pypi.org/project/apache-beam/2.58.1rc1/
> > [8]
> >
> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.58.1-RC1/go/pkg/beam
> > [9]
> >
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=88573490
> > [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
> > [11] https://github.com/apache/beam/pull/32206
> > [12]
> >
> https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md
> >
>


[PROPOSAL] 2.8.1 point release due to potential data duplication in KafkaIO

2024-08-15 Thread Kenneth Knowles
Hi all

I am proposing a point release because 2.8.0 has a critical data bug. The
KafkaIO unconditionally adds a "redistribute allowing duplicates" transform
to the read path, whether or not a user requested it.

Issue: https://github.com/apache/beam/issues/32196
Fix: https://github.com/apache/beam/pull/32134

Luckily most runners do not support "allowing duplicates" yet so the impact
is minimal. However, let's release an SDK without such a surprise lurking
in it.

Kenn


[RESULT] [VOTE] Release 2.57.0, release candidate #1

2024-06-26 Thread Kenneth Knowles
Thanks for validating and voting everyone!

There are 10 approving votes, 6 of which are binding:

 - Jean-Baptiste Onofré (binding)
 - Jan Lukavský (binding)
 - Robert Bradshaw (binding)
 - Kenneth Knowles (binding)
 - Valentyn Tymofieiev (binding)
 - Chamikara Jayalath (binding)
 - XQ Hu
 - Svetak Sundhar
 - Jeff Kinard
 - Yi Hu

There are no disapproving votes.

I will go ahead and finalize the release now.

Kenn

On Tue, Jun 25, 2024 at 8:39 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> +1
>
> Validated multi-lang Java/Python pipelines.
>
> Thanks,
> Cham
>
> On Tue, Jun 25, 2024 at 4:45 PM Valentyn Tymofieiev via dev <
> dev@beam.apache.org> wrote:
>
>> +1.
>>
>> On Tue, Jun 25, 2024 at 12:18 PM Kenneth Knowles  wrote:
>>
>>> +1 (binding)
>>>
>>> I will continue to wait until 3 work days have passed before concluding
>>> the vote, for plenty of validation time.
>>>
>>> On Mon, Jun 24, 2024 at 8:38 PM Yi Hu via dev 
>>> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Validated DataflowTemplates integration tests (all templates except for
>>>> YamlTemplates, where a separate vote is entered; and except for PythonUDF
>>>> templates due to https://github.com/apache/beam/issues/31680):
>>>> https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/1681
>>>>
>>>> On Mon, Jun 24, 2024 at 10:49 PM Jeff Kinard 
>>>> wrote:
>>>>
>>>>> +1 (non-binding) - Validated against Dataflow YamlTemplate and several
>>>>> Yaml pipeline examples.
>>>>>
>>>>> - Jeff
>>>>>
>>>>> On Fri, Jun 21, 2024 at 4:18 PM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Please review and vote on the release candidate #1 for the version
>>>>>> 2.57.0, as follows:
>>>>>>
>>>>>> [ ] +1, Approve the release
>>>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>>>>
>>>>>> Reviewers are encouraged to test their own use cases with the release
>>>>>> candidate, and vote +1 if no issues are found. Only PMC member votes will
>>>>>> count towards the final vote, but votes from all community members are
>>>>>> encouraged and helpful for finding regressions; you can either test your
>>>>>> own use cases [13] or use cases from the validation sheet [10].
>>>>>>
>>>>>> The complete staging area is available for your review, which
>>>>>> includes:
>>>>>>
>>>>>>- GitHub Release notes [1],
>>>>>>- the official Apache source release to be deployed to
>>>>>>dist.apache.org [2], which is signed with the key with
>>>>>>fingerprint 03DBA3E6ABDD04BFD1558DC16ED551A8AE02461C 
>>>>>> (D20316F712213422 if
>>>>>>automated) [3],
>>>>>>- all artifacts to be deployed to the Maven Central Repository
>>>>>>[4],
>>>>>>- source code tag "v2.57.0-RC1" [5],
>>>>>>- cursed website pull request listing the release [6], the blog
>>>>>>post [6], and publishing the API reference manual [7].
>>>>>>- Python artifacts are deployed along with the source release to
>>>>>>the dist.apache.org [2] and PyPI[8].
>>>>>>- Go artifacts and documentation are available at pkg.go.dev [9]
>>>>>>- Validation sheet with a tab for 2.57.0 release to help with
>>>>>>validation [10].
>>>>>>- Docker images published to Docker Hub [11].
>>>>>>- PR to run tests against release branch [12].
>>>>>>
>>>>>> The vote will be open for at least 72 hours. It is adopted by
>>>>>> majority approval, with at least 3 PMC affirmative votes.
>>>>>>
>>>>>> For guidelines on how to try the release in your projects, check out
>>>>>> our RC testing guide [13].
>>>>>>
>>>>>> Thanks,
>>>>>> Kenn
>>>>>>
>>>>>> [1] https://github.com/apache/beam/milestone/21
>>>>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.57.0/
>>>>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>>>> [4]
>>>>>> https://repository.apache.org/content/repositories/orgapachebeam-1379/
>>>>>> [5] https://github.com/apache/beam/tree/v2.57.0-RC1
>>>>>> [6] https://github.com/apache/beam/pull/31667
>>>>>> [7] https://github.com/apache/beam-site/pull/666
>>>>>> [8] https://pypi.org/project/apache-beam/2.57.0rc1/
>>>>>> [9]
>>>>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.57.0-RC1/go/pkg/beam
>>>>>> [10]
>>>>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?gid=612149473#gid=612149473
>>>>>> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>>>>> [12] https://github.com/apache/beam/pull/31513
>>>>>> [13]
>>>>>> https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md
>>>>>>
>>>>>


Re: [VOTE] Release 2.57.0, release candidate #1

2024-06-25 Thread Kenneth Knowles
+1 (binding)

I will continue to wait until 3 work days have passed before concluding the
vote, for plenty of validation time.

On Mon, Jun 24, 2024 at 8:38 PM Yi Hu via dev  wrote:

> +1 (non-binding)
>
> Validated DataflowTemplates integration tests (all templates except for
> YamlTemplates, where a separate vote is entered; and except for PythonUDF
> templates due to https://github.com/apache/beam/issues/31680):
> https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/1681
>
> On Mon, Jun 24, 2024 at 10:49 PM Jeff Kinard  wrote:
>
>> +1 (non-binding) - Validated against Dataflow YamlTemplate and several
>> Yaml pipeline examples.
>>
>> - Jeff
>>
>> On Fri, Jun 21, 2024 at 4:18 PM Kenneth Knowles  wrote:
>>
>>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #1 for the version
>>> 2.57.0, as follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> Reviewers are encouraged to test their own use cases with the release
>>> candidate, and vote +1 if no issues are found. Only PMC member votes will
>>> count towards the final vote, but votes from all community members are
>>> encouraged and helpful for finding regressions; you can either test your
>>> own use cases [13] or use cases from the validation sheet [10].
>>>
>>> The complete staging area is available for your review, which includes:
>>>
>>>- GitHub Release notes [1],
>>>- the official Apache source release to be deployed to
>>>dist.apache.org [2], which is signed with the key with fingerprint
>>>03DBA3E6ABDD04BFD1558DC16ED551A8AE02461C (D20316F712213422 if automated)
>>>[3],
>>>- all artifacts to be deployed to the Maven Central Repository [4],
>>>- source code tag "v2.57.0-RC1" [5],
>>>- cursed website pull request listing the release [6], the blog post
>>>[6], and publishing the API reference manual [7].
>>>- Python artifacts are deployed along with the source release to the
>>>dist.apache.org [2] and PyPI[8].
>>>- Go artifacts and documentation are available at pkg.go.dev [9]
>>>- Validation sheet with a tab for 2.57.0 release to help with
>>>validation [10].
>>>- Docker images published to Docker Hub [11].
>>>- PR to run tests against release branch [12].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> For guidelines on how to try the release in your projects, check out our
>>> RC testing guide [13].
>>>
>>> Thanks,
>>> Kenn
>>>
>>> [1] https://github.com/apache/beam/milestone/21
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.57.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1379/
>>> [5] https://github.com/apache/beam/tree/v2.57.0-RC1
>>> [6] https://github.com/apache/beam/pull/31667
>>> [7] https://github.com/apache/beam-site/pull/666
>>> [8] https://pypi.org/project/apache-beam/2.57.0rc1/
>>> [9]
>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.57.0-RC1/go/pkg/beam
>>> [10]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?gid=612149473#gid=612149473
>>> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>> [12] https://github.com/apache/beam/pull/31513
>>> [13]
>>> https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md
>>>
>>


Re: [VOTE] Release 2.57.0, release candidate #1

2024-06-24 Thread Kenneth Knowles
I've re-run the jobs failing on the verification PR. There are two that
remain red (or so flaky I cannot get a green in a few reruns):

"PostCommit XVR GoUsingJava Dataflow" - this is perma-red and appears to
have never had a successful run in any context.
"PostCommit Java PVR Spark Batch" - this is extremely flaky with different
tests failing each time.

I do not think these should block the release. They should probably both be
disabled until they can be repaired.

Kenn

On Mon, Jun 24, 2024 at 9:30 AM Kenneth Knowles  wrote:

>
>
> On Mon, Jun 24, 2024 at 9:26 AM Valentyn Tymofieiev via dev <
> dev@beam.apache.org> wrote:
>
>> Ran a Python 3.12 pipeline on Dataflow without issues, noted a suboptimal
>> dependency resolution: https://github.com/apache/beam/issues/31676,
>> verified that this is not a regression in 2.57.0, will follow up separately.
>>
>> https://github.com/apache/beam/pull/31513 has several failures, can you
>> please take a look? If they are not deemed release-blocking, let's add
>> the rationale for why not on the PR. We did this in the previous release.
>>
>
> Will do. I've re-run failed jobs to try to weed out flakes (which I think
> the JPMS and Dataflow V2 ValidatesRunner tests are) and then will
> investigate more closely the ones that come back red again.
>
> Kenn
>
>
>>
>> On Mon, Jun 24, 2024 at 8:32 AM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 (binding)
>>>
>>> The signatures and artifacts all look good. Also tested out some
>>> pipelines with the Python SDK installed into a fresh virtual
>>> environment.
>>>
>>> On Mon, Jun 24, 2024 at 2:20 AM Jan Lukavský  wrote:
>>> >
>>> > +1 (binding)
>>> >
>>> > Tested Java SDK with Flink Runner 1.18.
>>> >
>>> >   Jan
>>> >
>>> > On 6/22/24 06:43, Jean-Baptiste Onofré wrote:
>>> > > +1 (binding)
>>> > >
>>> > > Regards
>>> > > JB
>>> > >
>>> > > On Fri, Jun 21, 2024 at 10:17 PM Kenneth Knowles 
>>> wrote:
>>> > >> Hi everyone,
>>> > >>
>>> > >> Please review and vote on the release candidate #1 for the version
>>> 2.57.0, as follows:
>>> > >>
>>> > >> [ ] +1, Approve the release
>>> > >> [ ] -1, Do not approve the release (please provide specific
>>> comments)
>>> > >>
>>> > >> Reviewers are encouraged to test their own use cases with the
>>> release candidate, and vote +1 if no issues are found. Only PMC member
>>> votes will count towards the final vote, but votes from all community
>>> members are encouraged and helpful for finding regressions; you can either
>>> test your own use cases [13] or use cases from the validation sheet [10].
>>> > >>
>>> > >> The complete staging area is available for your review, which
>>> includes:
>>> > >>
>>> > >> GitHub Release notes [1],
>>> > >> the official Apache source release to be deployed to
>>> dist.apache.org [2], which is signed with the key with fingerprint
>>> 03DBA3E6ABDD04BFD1558DC16ED551A8AE02461C (D20316F712213422 if automated)
>>> [3],
>>> > >> all artifacts to be deployed to the Maven Central Repository [4],
>>> > >> source code tag "v2.57.0-RC1" [5],
>>> > >> cursed website pull request listing the release [6], the blog post
>>> [6], and publishing the API reference manual [7].
>>> > >> Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2] and PyPI[8].
>>> > >> Go artifacts and documentation are available at pkg.go.dev [9]
>>> > >> Validation sheet with a tab for 2.57.0 release to help with
>>> validation [10].
>>> > >> Docker images published to Docker Hub [11].
>>> > >> PR to run tests against release branch [12].
>>> > >>
>>> > >> The vote will be open for at least 72 hours. It is adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>> > >>
>>> > >> For guidelines on how to try the release in your projects, check
>>> out our RC testing guide [13].
>>> > >>
>>> > >> Thanks,
>>> > >> Kenn
>>> > >>
>>> > >> [1] https://github.com/apache/beam/milestone/21
>>> > >> [2] https://dist.apache.org/repos/dist/dev/beam/2.57.0/
>>> > >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> > >> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1379/
>>> > >> [5] https://github.com/apache/beam/tree/v2.57.0-RC1
>>> > >> [6] https://github.com/apache/beam/pull/31667
>>> > >> [7] https://github.com/apache/beam-site/pull/666
>>> > >> [8] https://pypi.org/project/apache-beam/2.57.0rc1/
>>> > >> [9]
>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.57.0-RC1/go/pkg/beam
>>> > >> [10]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?gid=612149473#gid=612149473
>>> > >> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>> > >> [12] https://github.com/apache/beam/pull/31513
>>> > >> [13]
>>> https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md
>>>
>>


[ANNOUNCE] New Committer: XQ Hu

2024-06-24 Thread Kenneth Knowles
Hi all,

Please join me and the rest of the Beam PMC in welcoming a new committer:
XQ Hu (x...@apache.org).

XQ has been with Beam since mid-2022. In that time he's contributed many
fixes throughout the project, including examples, documentation fixes,
bugfixes, test fixes, dependency upgrades, input on dev@ discussions,
release validations, and many many code review comments, all showing that
XQ cares deeply about the Beam project and user experience.

Considering his contributions to the project over this timeframe, the Beam
PMC trusts XQ with the responsibilities of a Beam committer. [1]

Thank you XQ! We look forward to seeing more of your contributions!

Kenn, on behalf of the Apache Beam PMC

[1]
https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


Re: [Discussion] Strategy on drop Java 8 support

2024-06-24 Thread Kenneth Knowles
Steps 1 and 2 sound great. (I tried a first step with
https://github.com/apache/beam/pull/29992 but didn't have bandwidth)

This will make it easier for people to get started with Beam without having
to deal with ancient version compatibility, installing old Java, etc.
As a minor point, many of our build plugins are out of date because
their newer versions have moved on to more modern Java.

Questions:

 - Could Beam move to requiring the latest Java to build, relying on the
"--release" or "--target" flag to build the artifacts we release? (We
need to be sure we don't rely on pieces of the standard library that are
missing in JRE8.)

 - Can we release multi-release jars to provide updated versions for
different JDK versions?

Kenn

On Mon, Jun 24, 2024 at 9:44 AM Yi Hu via dev  wrote:

> Dear Beam developers,
>
> As Java8 has gone through end-of-public-update, many Beam dependencies
> have already deprecated Java8 (see [1]), and dropping Java8 support
> is planned.
>
> Beam hasn't deprecated Java8, moreover, currently Beam CI is using Java8
> to test and release Beam. Referring to other Apache projects, I hereby
> suggest a 3-step process for dropping Java 8 support:
>
> 1. Switch CI to Java11, while retaining bytecode compatibility with Java8
> for Beam releases. Tracked in [1].
>
> This won't affect Beam developers and users currently on Java8, and
> can be done immediately.
>
> 2. Require Java11+ to build Beam, deprecate Java8 support [2].
>
> This still won't affect Beam users currently on Java8, but Beam
> developers who build custom Beam artifacts will need Java11+.
>
> 3. Drop Java8 support
>
> This will affect Beam users. Targeted at a medium-to-long-term future date,
> when Beam's major dependencies have already dropped Java8 support.
>
> There are a few details for further decision
>
> * Java8 ended Premier Support in March 2022; the next LTS, Java11, also
> ended Premier Support in Sept 2023. Should we bump the default Java version
> to 17 for CI at once (while keeping Java8/11 bytecode compatibility for the
> build)?
>
> * Timeline of deprecating Java8.
>   I am volunteering to work on [1], which is targeted at Beam 2.58.0;
> naturally, step (2) would land in Beam 2.59.0 or 2.60.0.
>
> Please provide your thoughts on the general process, and highlight
> particular areas of concern.
>
> [1] https://github.com/apache/beam/issues/31677
>
> [2] https://www.oracle.com/java/technologies/java-se-support-roadmap.html
>
> Regards,
> Yi
>
> --
>
> Yi Hu, (he/him/his)
>
> Software Engineer
>
>
>


Re: [VOTE] Release 2.57.0, release candidate #1

2024-06-24 Thread Kenneth Knowles
On Mon, Jun 24, 2024 at 9:26 AM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> Ran a Python 3.12 pipeline on Dataflow without issues, noted a suboptimal
> dependency resolution: https://github.com/apache/beam/issues/31676,
> verified that this is not a regression in 2.57.0, will follow up separately.
>
> https://github.com/apache/beam/pull/31513 has several failures, can you
> please take a look? If they are not deemed release-blocking, let's add
> the rationale for why not on the PR. We did this in the previous release.
>

Will do. I've re-run failed jobs to try to weed out flakes (which I think
the JPMS and Dataflow V2 ValidatesRunner tests are) and then will
investigate more closely the ones that come back red again.

Kenn


>
> On Mon, Jun 24, 2024 at 8:32 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> +1 (binding)
>>
>> The signatures and artifacts all look good. Also tested out some
>> pipelines with the Python SDK installed into a fresh virtual
>> environment.
>>
>> On Mon, Jun 24, 2024 at 2:20 AM Jan Lukavský  wrote:
>> >
>> > +1 (binding)
>> >
>> > Tested Java SDK with Flink Runner 1.18.
>> >
>> >   Jan
>> >
>> > On 6/22/24 06:43, Jean-Baptiste Onofré wrote:
>> > > +1 (binding)
>> > >
>> > > Regards
>> > > JB
>> > >
>> > > On Fri, Jun 21, 2024 at 10:17 PM Kenneth Knowles 
>> wrote:
>> > >> Hi everyone,
>> > >>
>> > >> Please review and vote on the release candidate #1 for the version
>> 2.57.0, as follows:
>> > >>
>> > >> [ ] +1, Approve the release
>> > >> [ ] -1, Do not approve the release (please provide specific comments)
>> > >>
>> > >> Reviewers are encouraged to test their own use cases with the
>> release candidate, and vote +1 if no issues are found. Only PMC member
>> votes will count towards the final vote, but votes from all community
>> members are encouraged and helpful for finding regressions; you can either
>> test your own use cases [13] or use cases from the validation sheet [10].
>> > >>
>> > >> The complete staging area is available for your review, which
>> includes:
>> > >>
>> > >> GitHub Release notes [1],
>> > >> the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint
>> 03DBA3E6ABDD04BFD1558DC16ED551A8AE02461C (D20316F712213422 if automated)
>> [3],
>> > >> all artifacts to be deployed to the Maven Central Repository [4],
>> > >> source code tag "v2.57.0-RC1" [5],
>> > >> cursed website pull request listing the release [6], the blog post
>> [6], and publishing the API reference manual [7].
>> > >> Python artifacts are deployed along with the source release to the
>> dist.apache.org [2] and PyPI[8].
>> > >> Go artifacts and documentation are available at pkg.go.dev [9]
>> > >> Validation sheet with a tab for 2.57.0 release to help with
>> validation [10].
>> > >> Docker images published to Docker Hub [11].
>> > >> PR to run tests against release branch [12].
>> > >>
>> > >> The vote will be open for at least 72 hours. It is adopted by
>> majority approval, with at least 3 PMC affirmative votes.
>> > >>
>> > >> For guidelines on how to try the release in your projects, check out
>> our RC testing guide [13].
>> > >>
>> > >> Thanks,
>> > >> Kenn
>> > >>
>> > >> [1] https://github.com/apache/beam/milestone/21
>> > >> [2] https://dist.apache.org/repos/dist/dev/beam/2.57.0/
>> > >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> > >> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1379/
>> > >> [5] https://github.com/apache/beam/tree/v2.57.0-RC1
>> > >> [6] https://github.com/apache/beam/pull/31667
>> > >> [7] https://github.com/apache/beam-site/pull/666
>> > >> [8] https://pypi.org/project/apache-beam/2.57.0rc1/
>> > >> [9]
>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.57.0-RC1/go/pkg/beam
>> > >> [10]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?gid=612149473#gid=612149473
>> > >> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>> > >> [12] https://github.com/apache/beam/pull/31513
>> > >> [13]
>> https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md
>>
>


[VOTE] Release 2.57.0, release candidate #1

2024-06-21 Thread Kenneth Knowles
Hi everyone,

Please review and vote on the release candidate #1 for the version 2.57.0,
as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

Reviewers are encouraged to test their own use cases with the release
candidate, and vote +1 if no issues are found. Only PMC member votes will
count towards the final vote, but votes from all community members are
encouraged and helpful for finding regressions; you can either test your
own use cases [13] or use cases from the validation sheet [10].

The complete staging area is available for your review, which includes:

   - GitHub Release notes [1],
   - the official Apache source release to be deployed to dist.apache.org
   [2], which is signed with the key with fingerprint
   03DBA3E6ABDD04BFD1558DC16ED551A8AE02461C (D20316F712213422 if automated)
   [3],
   - all artifacts to be deployed to the Maven Central Repository [4],
   - source code tag "v2.57.0-RC1" [5],
   - cursed website pull request listing the release [6], the blog post
   [6], and publishing the API reference manual [7].
   - Python artifacts are deployed along with the source release to the
   dist.apache.org [2] and PyPI[8].
   - Go artifacts and documentation are available at pkg.go.dev [9]
   - Validation sheet with a tab for 2.57.0 release to help with validation
   [10].
   - Docker images published to Docker Hub [11].
   - PR to run tests against release branch [12].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

For guidelines on how to try the release in your projects, check out our RC
testing guide [13].

Thanks,
Kenn

[1] https://github.com/apache/beam/milestone/21
[2] https://dist.apache.org/repos/dist/dev/beam/2.57.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1379/
[5] https://github.com/apache/beam/tree/v2.57.0-RC1
[6] https://github.com/apache/beam/pull/31667
[7] https://github.com/apache/beam-site/pull/666
[8] https://pypi.org/project/apache-beam/2.57.0rc1/
[9]
https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.57.0-RC1/go/pkg/beam
[10]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?gid=612149473#gid=612149473
[11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
[12] https://github.com/apache/beam/pull/31513
[13]
https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md
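
As a concrete starting point, a minimal smoke test against the candidate
could look like this (a sketch; the RC testing guide [13] has the full
recipes):

    # First: pip install apache-beam==2.57.0rc1
    import apache_beam as beam

    # Tiny end-to-end pipeline on the default DirectRunner as a sanity check.
    with beam.Pipeline() as p:
        (p
         | beam.Create(["a", "b", "a"])
         | beam.combiners.Count.PerElement()
         | beam.Map(print))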


Re: [PROPOSAL] Preparing for 2.57.0 Release

2024-06-05 Thread Kenneth Knowles
Hi all,

I delayed a week but have now cut the 2.57.0 branch.

Kenn

On Thu, May 16, 2024 at 8:41 PM Ahmet Altay via dev 
wrote:

> Thank you Kenn!
>
> On Wed, May 15, 2024 at 8:00 AM Kenneth Knowles  wrote:
>
>> Hi everyone,
>>
>> The next release (2.57.0) branch cut is scheduled on May 29th, 2024,
>> according to the release calendar [1].
>>
>> I volunteer to perform this release. My plan is to cut the branch on that
>> date, and cherrypick release-blocking fixes afterwards, if any.
>>
>> Please help me make sure the release goes smoothly by:
>>
>>  - Making sure that any unresolved release-blocking issues for 2.55.0
>> have their "Milestone" marked as "2.57.0 Release" as soon as possible.
>>  - Reviewing the current release blockers [2] and removing the Milestone if
>> they don't meet the criteria at [3].
>>
>> Let me know if you have any comments/objections/questions.
>>
>> Thanks,
>>
>> Kenn
>>
>> [1]
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>> [2] https://github.com/apache/beam/milestone/21
>> [3] https://beam.apache.org/contribute/release-blocking/
>>
>


Re: design docs that get deleted, etc

2024-06-03 Thread Kenneth Knowles
I like the idea of a "final" step of exporting the Google Doc to markdown
and checking it in, if there is any tool that does an OK job (my experience
with such tools over my lifetime has been underwhelming, but I'm always
open to being pleasantly surprised). PDF would be a close second.
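
A minimal version of that snapshot step (a sketch; assumes a publicly
readable doc, and uses the plain-text export since markdown export support
varies):

    import pathlib
    import requests

    # Hypothetical doc ID; any publicly readable Google Doc can be exported
    # this way without credentials.
    DOC_ID = "REPLACE_WITH_DOC_ID"
    url = f"https://docs.google.com/document/d/{DOC_ID}/export?format=txt"
    text = requests.get(url).text
    pathlib.Path("design-docs").mkdir(exist_ok=True)
    pathlib.Path(f"design-docs/{DOC_ID}.txt").write_text(text)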

Kenn

On Thu, May 30, 2024 at 5:41 AM Jan Lukavský  wrote:

> Yes, I think that the discoverability and durability would be part of
> some "process". There could be "informal" part, where the design can
> take place virtually anywhere and the document can be owned by anyone
> (as we do today), then it would be copied to a durable store (asf wiki?)
> and a [DISCUSS] thread could be started on ML, followed by [VOTE] (if
> needed?) after that, then the document would be marked as accepted and
> will be turned into issues. I believe this is how the FLIPs and KIPs are
> processed.
>
>   Jan
>
> On 5/29/24 17:56, Robert Bradshaw via dev wrote:
> > While I'm not opposed to having a more formal process for proposals to
> > go from idea to consensus to implementation, I'm not sure how much
> > this would solve the primary issues we face (discoverability and
> > durability). But maybe that could be built into the process? At the
> > very least we could have an "index" which would give identifiers (and
> > hopefully good titles) to all the proposals, and maybe have an offline
> > process to snapshot such docs (even just periodically pulling the
> > content to a repo like I do with
> > https://github.com/cython/cython-issues ). I have yet to find a medium
> > (not even wikis) that facilitates conversation/collaborative editing
> > to the extent that docs does, but I agree with the downside that
> > ownership by random individuals can pose a problem.
> >
> > On Wed, May 29, 2024 at 7:07 AM Jan Lukavský  wrote:
> >> Hi,
> >>
> >> regarding changing the way we document past (and more importantly
> >> future) changes, I've always been a big fan of the FLIP analogy [1]. I
> >> would love it if we could make this work for Beam as well, while preserving
> >> the 'informal' part that I believe all of us want to keep. On the other
> >> hand, this could make the design decisions more searchable, transparent
> >> and get more people involved in the process. Having design documents
> >> durable is of course a highly important part of it.
> >>
> >>Jan
> >>
> >> [1] https://lists.apache.org/thread/whfy3706w2d0q6rdk4kwyrzvhfd4b5kg
> >>
> >> On 5/29/24 15:04, Kenneth Knowles wrote:
> >>> Hi all,
> >>>
> >>> Yesterday someone asked me about the design doc linked from
> >>> https://github.com/apache/beam/issues/18297 because it is now a 404.
> >>>
> >>> There are plenty of reasons a Google Doc might no longer be
> >>> accessible. They exist outside the project's control. This is part of
> >>> why ASF projects emphasize having discussions on the dev@ list and
> >>> often put all their designs directly onto some ASF-hosted
> >>> infrastructure, such as a Wiki (Example:
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> ).
> >>> In the early days we had a project-owned shared folder but it fell
> >>> into disuse.
> >>>
> >>> In my opinion, Google Docs are still the best place for design docs to
> >>> get feedback and be revised, but the lack of durability is a downside
> >>> to stay aware of. I've also gotten lots of complaints of lack of
> >>> discoverability and lack of systematization of design docs, neither of
> >>> which would be addressed by a shared folder.
> >>>
> >>> I don't have a proposal or suggestion. I don't think this is super
> >>> urgent, certainly not my personal highest priority, but I thought I'd
> >>> just share this as food for thought.
> >>>
> >>> Kenn
>


design docs that get deleted, etc

2024-05-29 Thread Kenneth Knowles
Hi all,

Yesterday someone asked me about the design doc linked from
https://github.com/apache/beam/issues/18297 because it is now a 404.

There are plenty of reasons a Google Doc might no longer be accessible.
They exist outside the project's control. This is part of why ASF projects
emphasize having discussions on the dev@ list and often put all their
designs directly onto some ASF-hosted infrastructure, such as a Wiki
(Example:
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals).
In the early days we had a project-owned shared folder but it fell into
disuse.

In my opinion, Google Docs are still the best place for design docs to get
feedback and be revised, but the lack of durability is a downside to stay
aware of. I've also gotten lots of complaints of lack of discoverability
and lack of systematization of design docs, neither of which would be
addressed by a shared folder.

I don't have a proposal or suggestion. I don't think this is super urgent,
certainly not my personal highest priority, but I thought I'd just share
this as food for thought.

Kenn


[ACTION REQUESTED] Help me draft the Beam Board Report for June 2024

2024-05-23 Thread Kenneth Knowles
The next Beam board report is due next Wednesday, June 12. Please draft it
together at https://s.apache.org/beam-draft-report-2024-06

The doc is open for anyone to edit.

Ideas:

 - highlights from CHANGES.md
 - interesting technical discussions
 - integrations with other projects
 - community events
 - major user-facing additions/deprecations

Past reports are at https://whimsy.apache.org/board/minutes/Beam.html for
examples.

Thanks,

Kenn


[PROPOSAL] Preparing for 2.57.0 Release

2024-05-15 Thread Kenneth Knowles
Hi everyone,

The next release (2.57.0) branch cut is scheduled on May 29th, 2024,
according to the release calendar [1].

I volunteer to perform this release. My plan is to cut the branch on that
date, and cherrypick release-blocking fixes afterwards, if any.

Please help me make sure the release goes smoothly by:

 - Making sure that any unresolved release-blocking issues for 2.55.0
have their "Milestone" marked as "2.57.0 Release" as soon as possible.
 - Reviewing the current release blockers [2] and removing the Milestone if
they don't meet the criteria at [3].

Let me know if you have any comments/objections/questions.

Thanks,

Kenn

[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2] https://github.com/apache/beam/milestone/21
[3] https://beam.apache.org/contribute/release-blocking/


Re: PCollection#applyWindowingStrategyInternal

2024-04-22 Thread Kenneth Knowles
nd what semantics do you want for this early data. Ideally
>>> triggers should be specified separately at the ParDo level (Beam has no
>>> real notion of Sinks as a special object, so to allow for output
>>> specification it has to be on the ParDo), and the triggers should propagate
>>> up the graph back to the source. This is in contrast to today where we
>>> attach triggering to the windowing information.
>>>
>>> This was a proposal some years back and there was some effort made to
>>> implement it, but the implementation never really got off the ground.
>>>
>>> On Wed, Apr 10, 2024 at 12:43 AM Jan Lukavský  wrote:
>>>
>>>> On 4/9/24 18:33, Kenneth Knowles wrote:
>>>>
>>>> At a top level `setWindowingStrategyInternal` exists to set up the
>>>> metadata without actually assigning windows. If we were more clever we
>>>> might have found a way for it to not be public... it is something that can
>>>> easily lead to an invalid pipeline.
>>>>
>>>> Yes, that was what hit me about one minute after I started this thread.
>>>> :)
>>>>
>>>>
>>>> I think "compatible windows" today in Beam doesn't have very good uses
>>>> anyhow. I do see how when you are flattening PCollections you might also
>>>> want to explicitly have a function that says "and here is how to reconcile
>>>> their different metadata". But is it not reasonable to use
>>>> Window.into(global window)? It doesn't seem like boilerplate to me
>>>> actually, but something you really want to know is happening.
>>>>
>>>> :)
>>>>
>>>> Of course this was the way out, but I was somewhat intuitively seeking
>>>> something that could go this autonomously.
>>>>
>>>> Generally speaking, we might have some room for improvement in the way
>>>> we handle windows and triggers - windows relate only to GBK and stateful
>>>> ParDo, triggers relate to GBK only. They have no semantics if downstream
>>>> processing does not use any of these. There could be a pipeline
>>>> preprocessing stage that would discard (replace with meaningful defaults)
>>>> any of these metadata that is unused, but can cause Pipeline to fail at
>>>> construction time. It is also (to me) somewhat questionable if triggers are
>>>> really a property of a PCollection or a property of a specific transform
>>>> (GBK - ehm, actually (stateless) 'key by' + 'reduce by key', but that is
>>>> completely different story :)) because (non-default) triggers are likely
>>>> not preserved across multiple transforms. Maybe the correct subject of this
>>>> thread could be "are we sure our windowing and triggering semantics is 100%
>>>> correct"? Probably the - wrong - expectations at the beginning of this
>>>> thread were due to conflict in my mental model of how things 'could' work
>>>> as opposed to how they actually work. :)
>>>>
>>>>  Jan
>>>>
>>>>
>>>> Kenn
>>>>
>>>> On Tue, Apr 9, 2024 at 9:19 AM Jan Lukavský  wrote:
>>>>
>>>>> On 4/6/24 21:23, Reuven Lax via dev wrote:
>>>>>
>>>>> So the problem here is that windowFn is a property of the PCollection,
>>>>> not the element, and the result of Flatten is a single PCollection.
>>>>>
>>>>> Yes. That is why Flatten.pCollections() needs the same
>>>>> windowFn.
>>>>>
>>>>>
>>>>> In various cases, there is a notion of "compatible" windows. Basically
>>>>> given window functions W1 and W2, provide a W3 that "works" with both.
>>>>>
>>>>> Exactly this would be a nice feature for Flatten, something like
>>>>> 'windowFn resolve strategy', so that if the user does not know the windowFn of
>>>>> upstream PCollections this can be somehow resolved at pipeline 
>>>>> construction
>>>>> time. Alternatively only as a small syntactic sugar, something like:
>>>>>
>>>>>  
>>>>> Flatten.pCollections().withWindowingStrategy(WindowResolution.into(oneInput.getWindowingStrategy()))
>>>>>
>>>>> or anything similar. This can be done in user code, so it is not
>>>>> something deeper, but might help in some cases.
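
To make the "can be done in user code" point concrete, here is a minimal
sketch of such a helper (the name FlattenWithCommonWindowing and its overall
shape are hypothetical, not Beam API): it re-applies one chosen windowing
strategy to every input and then flattens.

import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.transforms.windowing.WindowFn;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

/** Hypothetical user-code helper: re-window all inputs, then flatten. */
class FlattenWithCommonWindowing<T>
    extends PTransform<PCollectionList<T>, PCollection<T>> {
  private final WindowFn<Object, ?> windowFn;

  FlattenWithCommonWindowing(WindowFn<Object, ?> windowFn) {
    this.windowFn = windowFn;
  }

  @Override
  public PCollection<T> expand(PCollectionList<T> inputs) {
    PCollectionList<T> rewindowed = PCollectionList.empty(inputs.getPipeline());
    int i = 0;
    for (PCollection<T> input : inputs.getAll()) {
      // Unique step names per input keep the pipeline graph stable.
      rewindowed = rewindowed.and(
          input.apply("Rewindow-" + i++, Window.<T>into(windowFn)));
    }
    return rewindowed.apply(Flatten.pCollections());
  }
}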

Re: PCollection#applyWindowingStrategyInternal

2024-04-09 Thread Kenneth Knowles
At a top level `setWindowingStrategyInternal` exists to set up the metadata
without actually assigning windows. If we were more clever we might have
found a way for it to not be public... it is something that can easily lead
to an invalid pipeline.

I think "compatible windows" today in Beam doesn't have very good uses
anyhow. I do see how when you are flattening PCollections you might also
want to explicitly have a function that says "and here is how to reconcile
their different metadata". But is it not reasonable to use
Window.into(global window)? It doesn't seem like boilerplate to me
actually, but something you really want to know is happening.

Kenn
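
As a concrete illustration of that Window.into(global window) approach (a
sketch only; globalStats, slidingStats, and Stat are assumed names from the
scenario upthread):

import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Explicitly re-window both inputs into the global window so Flatten sees a
// single, compatible windowing strategy -- visible in the pipeline, not hidden.
PCollection<Stat> merged =
    PCollectionList.of(
            globalStats.apply("RewindowGlobal", Window.<Stat>into(new GlobalWindows())))
        .and(slidingStats.apply("RewindowSliding", Window.<Stat>into(new GlobalWindows())))
        .apply(Flatten.pCollections());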

On Tue, Apr 9, 2024 at 9:19 AM Jan Lukavský  wrote:

> On 4/6/24 21:23, Reuven Lax via dev wrote:
>
> So the problem here is that windowFn is a property of the PCollection, not
> the element, and the result of Flatten is a single PCollection.
>
> Yes. That is why Flatten.pCollections() needs the same
> windowFn.
>
>
> In various cases, there is a notion of "compatible" windows. Basically
> given window functions W1 and W2, provide a W3 that "works" with both.
>
> Exactly this would be a nice feature for Flatten, something like 'windowFn
> resolve strategy', so that if the user does not know the windowFn of upstream
> PCollections this can be somehow resolved at pipeline construction time.
> Alternatively only as a small syntactic sugar, something like:
>
>  
> Flatten.pCollections().withWindowingStrategy(WindowResolution.into(oneInput.getWindowingStrategy()))
>
> or anything similar. This can be done in user code, so it is not something
> deeper, but might help in some cases. It would be cool if we could reuse
> concepts from other cases where such mechanism is needed.
>
>
> Note that Beam already has something similar with side inputs, since the
> side input often is in a different window than the main input. However main
> input elements are supposed to see side input elements in the same window
> (and in fact main inputs are blocked until the side-input window is ready),
> so we must do a mapping. If for example (and very commonly!) the side input
> is in the global window and the main input is in a fixed window, by default
> we will remap the global-window elements into the main-input's fixed window.
>
> This is a one-sided merge function, there is a 'main' and 'side' input,
> but the generic symmetric merge might be possible as well. E.g. if one
> PCollection of Flatten is in GlobalWindow, I wonder if there are cases
> where users would actually want to do anything else than apply the same
> global windowing strategy to all input PCollections.
>
>  Jan
>
>
> In Side input we also allow the user to control this mapping, so for
> example side input elements could always map to the previous fixed window
> (e.g. while processing window 12-1, you want to see summary data of all
> records in the previous window 11-12). Users can do this by providing a
> WindowMappingFunction to the View - essentially a function from window to
> window. Unfortunately this is hard to use (one must create their own
> PCollectionView class) and very poorly documented, so I doubt many users
> know about this!
>
> Reuven
>
> On Sat, Apr 6, 2024 at 7:09 AM Jan Lukavský  wrote:
>
>> Immediate self-correction, although setting the strategy directly via
>> setWindowingStrategyInternal() *seemed* to be working during Pipeline
>> construction time, during runtime it obviously does not work, because
>> the PCollection was still windowed using the old windowFn. Makes sense to
>> me, but there remains the other question of whether we can make flattening
>> PCollections with incompatible windowFns more user-friendly. The current
>> approach where we require the same windowFn for all input PCollections
>> creates some unnecessary boilerplate code on the user side.
>>
>>   Jan
>>
>> On 4/6/24 15:45, Jan Lukavský wrote:
>> > Hi,
>> >
>> > I came across a case where using
>> > PCollection#applyWindowingStrategyInternal seems legit in user code.
>> > The case is roughly as follows:
>> >
>> >  a) compute some streaming statistics
>> >
>> >  b) apply the same transform (say ComputeWindowedAggregation) with
>> > different parameters on these statistics yielding two windowed
>> > PCollections - first is global with early trigger, the other is
>> > sliding window, the specific parameters of the windowFns are
>> > encapsulated in the ComputeWindowedAggregation transform
>> >
>> >  c) apply the same transform on both of the above PCollections,
>> > yielding two PCollections with the same types, but different windowFns
>> >
>> >  d) flatten these PCollections into single one (e.g. for downstream
>> > processing - joining - or flushing to sink)
>> >
>> > Now, the flatten will not work, because these PCollections have
>> > different windowFns. It would be possible to restore the windowing for
>> > either of them, but it requires somewhat breaking the encapsulation of
>> > the transforms that produce them.

Re: [VOTE] Patch Release 2.55.1, release candidate #2

2024-04-03 Thread Kenneth Knowles
+1 (binding)

Kenn

On Wed, Apr 3, 2024 at 12:58 PM Danny McCormick via dev 
wrote:

> > Also noting that there is no PR postsubmit test suite running against
> the release branch in the vote email. Given the diff, that's also fine
> since previous test runs didn't detect the breakage, but in general we
> should include that for patch releases as well.
>
> Yeah, it didn't seem useful to run these given the diff since they test
> the code, not the produced artifacts (which is more or less our only vector
> for new bugs here). If anyone disagrees I can kick that off, but deflaking
> seems like more trouble than it's worth.
>
> -
>
> Also, from my original email: This release does not include any website
> changes since it is addressing a single bug fix as discussed in
> https://lists.apache.org/thread/kvq1wsj505pvopkq186dnvc0l6ryyfh0.
> I realized it is still worth making the website changes to get the
> versions correct, even though no content will be updated. With that in
> mind, I've also proposed changes to the beam-site [1] and the beam website
> [2]. I still omitted writing a blog post.
>
> Thanks,
> Danny
>
> [1] https://github.com/apache/beam-site/pull/663
> [2] https://github.com/apache/beam/pull/30839
>
> On Wed, Apr 3, 2024 at 12:30 PM Valentyn Tymofieiev 
> wrote:
>
>> Hi Danny,
>>
>> Thanks for volunteering to do this patch release.
>>
>> For review convenience, this is the diff:
>>   - Diff of release branches:
>> https://github.com/apache/beam/compare/release-2.55.0...release-2.55.1
>>   - The diff of tags v2.55.0-RC3 and v2.55.1-RC2:
>> https://github.com/apache/beam/compare/v2.55.0-RC3...v2.55.1-RC2  is
>> somewhat misleading, it looks as though there is a change in the version
>> naming pattern, but upon inspection of gradle.properties for each tag
>> individually, the pattern is the same and doesn't include dev/SNAPSHOT
>> suffixes.
>>
>> > I put together a patch release per the conversation in
>> https://lists.apache.org/thread/kvq1wsj505pvopkq186dnvc0l6ryyfh0.
>>
>> Noting that 2.55.1 doesn't fix another Python SDK known issue that
>> was called out in that thread, which is fine with me, just calling out the
>> difference from previous discussion.
>>
>> Also noting that there is no PR postsubmit test suite running against the
>> release branch in the vote email. Given the diff, that's also fine since
>> previous test runs didn't detect the breakage, but in general we should
>> include that for patch releases as well.
>>
>> +1. Spot-checked some Python SDK artifacts and containers.
>>
>> On Wed, Apr 3, 2024 at 8:08 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> I put together a patch release per the conversation in
>>> https://lists.apache.org/thread/kvq1wsj505pvopkq186dnvc0l6ryyfh0.
>>>
>>> Please review and vote on the release candidate #2 (I messed up rc1) for
>>> the version 2.55.1, as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> Reviewers are encouraged to test their own use cases with the release
>>> candidate, and vote +1 if no issues are found. Only PMC member votes will
>>> count towards the final vote, but votes from all community members are
>>> encouraged and helpful for finding regressions; you can either test your
>>> own use cases [9] or use cases from the validation sheet [7].
>>>
>>> The complete staging area is available for your review, which includes:
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [1], which is signed with the key with fingerprint D20316F712213422 [2],
>>> * all artifacts to be deployed to the Maven Central Repository [3],
>>> * source code tag "v2.55.1-RC2" [4],
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [1] and PyPI[5].
>>> * Go artifacts and documentation are available at pkg.go.dev [6]
>>> * Validation sheet with a tab for 2.55.1 release to help with validation
>>> [7].
>>> * Docker images published to Docker Hub [8].
>>>
>>> This release does not include any website changes since it is addressing
>>> a single bug fix as discussed in
>>> https://lists.apache.org/thread/kvq1wsj505pvopkq186dnvc0l6ryyfh0.
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> For guidelines on how to try the release in your projects, check out our
>>> RC testing guide [9].
>>>
>>> Thanks,
>>> Danny
>>>
>>> [1] https://dist.apache.org/repos/dist/dev/beam/2.55.1/
>>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [3]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1375/
>>> [4] https://github.com/apache/beam/tree/v2.55.1-RC2
>>> [5] https://pypi.org/project/apache-beam/2.55.1rc2/
>>> [6]
>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.55.1-RC2/go/pkg/beam
>>> [7]
>>> 

Re: Supporting Dynamic Destinations in a portable context

2024-04-03 Thread Kenneth Knowles
Let me summarize the most recent proposal on-list to frame my question
about this last suggestion. It looks like this:

1. user has an element, call it `data`
2. user maps `data` to an arbitrary metadata row, call it `dest`
3. we can do things like shuffle on `dest` because it isn't too big
4. we map `dest` to a concrete destination (aka URL) to write to by a
string format that uses fields of `dest`

I believe steps 1-3 are identical in expressivity to non-portable
DynamicDestinations. So, Reuven, the question is for step 4: what are the
mappings from `dest` to URL that cannot be expressed by string formatting
but need SQL or Lua, etc? That would be a useful guide to consideration of
those possibilities.

FWIW I think even if we add a mini-language, string formatting has
better ease of use (can easily be displayed in UI, etc) so it would be the
first choice, and more advanced stuff is a fallback for rare cases. So they
are both valuable and I'd be happy to implement the easier-to-use path
right away while we discuss.

Kenn
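
To make step 4 concrete, a small self-contained sketch of the string
substitution (plain Java, illustrative only; the field naming such as
"dest_info.user" follows the examples in this thread):

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Resolves a destination pattern like
// "long/common/path/to/destination/files/{dest_info.user}" by substituting
// fields of the small `dest` metadata record (steps 2 and 4 above).
public class DestPatternResolver {
  private static final Pattern FIELD = Pattern.compile("\\{([\\w.]+)\\}");

  public static String resolve(String pattern, Map<String, String> destFields) {
    Matcher m = FIELD.matcher(pattern);
    StringBuilder out = new StringBuilder();
    while (m.find()) {
      String value = destFields.get(m.group(1));
      if (value == null) {
        throw new IllegalArgumentException("Unknown field: " + m.group(1));
      }
      m.appendReplacement(out, Matcher.quoteReplacement(value));
    }
    m.appendTail(out);
    return out.toString();
  }
}

// resolve("files/{dest_info.user}", Map.of("dest_info.user", "alice"))
//   returns "files/alice"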

On Tue, Apr 2, 2024 at 2:59 PM Reuven Lax via dev 
wrote:

> I do suspect that over time we'll find more and more cases we can't
> express, and will be asked to extend this little templating in more
> directions. To head that off - could we easily just reuse an existing
> language (SQL, LUA, something of the form?) instead of creating something
> new?
>
> On Tue, Apr 2, 2024 at 8:55 AM Kenneth Knowles  wrote:
>
>> I really like this proposal. I think it has narrowed down and solved the
>> essential problem of not shuffling excess redundant data, and also provides
>> the vast majority of the functionality that a lambda would, with
>> significantly better debugability and usability too, since the dynamic
>> destination pattern string can be in display data, etc.
>>
>> Kenn
>>
>> On Wed, Mar 27, 2024 at 1:58 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> On Wed, Mar 27, 2024 at 10:20 AM Reuven Lax  wrote:
>>>
>>>> Can the prefix still be generated programmatically at graph creation
>>>> time?
>>>>
>>>
>>> Yes. It's just a property of the transform passed by the user at
>>> configuration time.
>>>
>>>
>>>> On Wed, Mar 27, 2024 at 9:40 AM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> On Wed, Mar 27, 2024 at 9:12 AM Reuven Lax  wrote:
>>>>>
>>>>>> This does seem like the best compromise, though I think there will
>>>>>> still end up being performance issues. A common pattern I've seen is that
>>>>>> there is a long common prefix to the dynamic destination followed by the
>>>>>> dynamic component. e.g. the destination might be
>>>>>> long/common/path/to/destination/files/. In this case, the
>>>>>> prefix is often much larger than messages themselves and is what gets
>>>>>> effectively encoded in the lambda.
>>>>>>
>>>>>
>>>>> The idea here is that the destination would be given as a format
>>>>> string, say, "long/common/path/to/destination/files/{dest_info.user}".
>>>>> Another way to put this is that we support (only) "lambdas" that are
>>>>> represented as string substitutions. (The fact that dest_info does not 
>>>>> have
>>>>> to be part of the record, and can be the output of an arbitrary map if 
>>>>> need
>>>>> be, makes this restriction not so bad.)
>>>>>
>>>>> As well as solving the performance issues, I think this is actually a
>>>>> pretty convenient and natural way for the user to name their destination
>>>>> (for the common usecase, even easier than providing a lambda), and has the
>>>>> benefit of being much more transparent than an arbitrary callable as well
>>>>> for introspection (for both machine and human that may look at the
>>>>> resulting pipeline).
>>>>>
>>>>>
>>>>>> I'm not entirely sure how to address this in a portable context. We
>>>>>> might simply have to accept the extra overhead when going cross language.
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Wed, Mar 27, 2024 at 8:51 AM Robert Bradshaw via dev <
>>>>>> dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> Thanks for putting this together, it will be a really useful feature
>>>>>>> to have.

Re: Supporting Dynamic Destinations in a portable context

2024-04-02 Thread Kenneth Knowles
I really like this proposal. I think it has narrowed down and solved the
essential problem of not shuffling excess redundant data, and also provides
the vast majority of the functionality that a lambda would, with
significantly better debugability and usability too, since the dynamic
destination pattern string can be in display data, etc.

Kenn

On Wed, Mar 27, 2024 at 1:58 PM Robert Bradshaw via dev 
wrote:

> On Wed, Mar 27, 2024 at 10:20 AM Reuven Lax  wrote:
>
>> Can the prefix still be generated programmatically at graph creation time?
>>
>
> Yes. It's just a property of the transform passed by the user at
> configuration time.
>
>
>> On Wed, Mar 27, 2024 at 9:40 AM Robert Bradshaw 
>> wrote:
>>
>>> On Wed, Mar 27, 2024 at 9:12 AM Reuven Lax  wrote:
>>>
 This does seem like the best compromise, though I think there will
 still end up being performance issues. A common pattern I've seen is that
 there is a long common prefix to the dynamic destination followed by the
 dynamic component. e.g. the destination might be
 long/common/path/to/destination/files/. In this case, the
 prefix is often much larger than messages themselves and is what gets
 effectively encoded in the lambda.

>>>
>>> The idea here is that the destination would be given as a format string,
>>> say, "long/common/path/to/destination/files/{dest_info.user}". Another way
>>> to put this is that we support (only) "lambdas" that are represented as
>>> string substitutions. (The fact that dest_info does not have to be part of
>>> the record, and can be the output of an arbitrary map if need be, makes
>>> this restriction not so bad.)
>>>
>>> As well as solving the performance issues, I think this is actually a
>>> pretty convenient and natural way for the user to name their destination
>>> (for the common usecase, even easier than providing a lambda), and has the
>>> benefit of being much more transparent than an arbitrary callable as well
>>> for introspection (for both machine and human that may look at the
>>> resulting pipeline).
>>>
>>>
 I'm not entirely sure how to address this in a portable context. We
 might simply have to accept the extra overhead when going cross language.

 Reuven

 On Wed, Mar 27, 2024 at 8:51 AM Robert Bradshaw via dev <
 dev@beam.apache.org> wrote:

> Thanks for putting this together, it will be a really useful feature
> to have.
>
> I am in favor of the string-pattern approaches. I think we need to
> support both the {record=..., dest_info=...} and the elide-fields
> approaches, as the former is nicer when one has a fixed representation for
> the output record (e.g. a proto or avro schema) and the flattened form for
> ease of use in more free-form contexts (e.g. when producing records from
> YAML and SQL).
>
> Also left some comments on the doc.
>
>
> On Wed, Mar 27, 2024 at 6:51 AM Ahmed Abualsaud via dev <
> dev@beam.apache.org> wrote:
>
>> Hey all,
>>
>> There have been some conversations lately about how best to enable
>> dynamic destinations in a portable context. Usually, this comes up for
>> cross-language transforms and more recently for Beam YAML.
>>
>> I've started a short doc outlining some routes we could take. The
>> purpose is to establish a good standard for supporting dynamic 
>> destinations
>> with portability, one that can be applied to most use cases and IOs. 
>> Please
>> take a look and add any thoughts!
>>
>> https://s.apache.org/portable-dynamic-destinations
>>
>> Best,
>> Ahmed
>>
>
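
For illustration, the two element shapes discussed above could look like this
as Beam schema Rows (a sketch; the field names and the recordBytes payload are
assumptions, not part of the proposal doc):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

Schema destInfo = Schema.builder().addStringField("user").build();

// Wrapped form {record=..., dest_info=...}: keeps the output record opaque,
// e.g. an encoded proto or avro payload.
Schema wrapped =
    Schema.builder()
        .addRowField("dest_info", destInfo)
        .addByteArrayField("record")
        .build();
Row wrappedElement =
    Row.withSchema(wrapped)
        .withFieldValue(
            "dest_info", Row.withSchema(destInfo).withFieldValue("user", "alice").build())
        .withFieldValue("record", recordBytes) // assumed byte[] payload
        .build();

// Flattened form: destination fields sit alongside the record's own fields
// and are elided before writing.
Schema flattened = Schema.builder().addStringField("user").addStringField("body").build();
Row flattenedElement =
    Row.withSchema(flattened)
        .withFieldValue("user", "alice")
        .withFieldValue("body", "...")
        .build();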


Re: [ACTION REQUESTED] Help me draft the Beam Board Report for March 2024

2024-03-13 Thread Kenneth Knowles
Thanks! I've submitted the report earlier today.

Kenn

On Mon, Mar 11, 2024 at 6:08 PM XQ Hu  wrote:

> Thanks for the ping! I added several notes and feel free to make more
> changes.
>
> On Mon, Mar 11, 2024 at 2:49 PM Kenneth Knowles  wrote:
>
>> Ping!
>>
>> Would really love help from folks building stuff to report out on what
>> they've built, especially!
>>
>> Kenn
>>
>> On Tue, Mar 5, 2024 at 12:15 PM Kenneth Knowles  wrote:
>>
>>> The next Beam board report is due next Wednesday, March 13. Please draft
>>> it together at https://s.apache.org/beam-draft-report-2024-03. The doc
>>> is open for anyone to edit.
>>>
>>> Ideas:
>>>
>>>  - highlights from CHANGES.md
>>>  - interesting technical discussions
>>>  - integrations with other projects
>>>  - community events
>>>  - major user facing addition/deprecation
>>>
>>> Past reports are at https://whimsy.apache.org/board/minutes/Beam.html for
>>> examples.
>>>
>>> Thanks,
>>>
>>> Kenn
>>>
>>


Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-03-13 Thread Kenneth Knowles
Closing the loop, I went with two URNs and an associated payload in
https://github.com/apache/beam/pull/30545

Kenn
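
For quick reference, a usage sketch of the resulting Java API (method names as
proposed in this thread; verify against the released SDK before relying on
them):

import org.apache.beam.sdk.transforms.Redistribute;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Runner's-choice redistribution; allowing duplicates tells the runner it may
// skip dedup/persistence, so downstream consumers must tolerate replays.
PCollection<KV<String, Long>> rebalanced =
    counts.apply( // `counts` is an assumed upstream PCollection<KV<String, Long>>
        "Redistribute",
        Redistribute.<KV<String, Long>>arbitrarily().allowingDuplicates());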

On Wed, Mar 6, 2024 at 10:54 AM Kenneth Knowles  wrote:

> OK, of course, hacking this up there's already a combinatorial 2x2 that
> perhaps people were alluding to but I missed.
>
> RedistributeByKey (user's choice)
> RedistributeArbitrarily (runner's choice! default may be random keys but
> that is not required)
>
> RedistributeArbitrarilyAllowingDuplicates (this is the use case I am
> trying to get at with the design & impl - basically runner's choice and
> also no need to dedup or persist)
> RedistributeByKeyAllowingDuplicates (is this an important use case? I
> don't know - if so, then it points to some future where you tag any
> transform with this)
>
> So now I kind of want to have two URNs (one per input/output type) and a
> config that allows duplicates.
>
> WDYT? Do the people who liked having separate URNs want to have 4 URNs? We
> can still have whatever end-user SDK interface we need to have regardless.
> I think in Java we want it to look like this regardless:
>
> Redistribute.arbitrarily()
> Redistribute.byKey()
> Redistribute.arbitrarily().allowingDuplicates()
> Redistribute.byKey().allowingDuplicates()
>
> And Python
>
> beam.Redistribute()
> beam.RedistributeByKey()
> beam.Redistribute(allowDuplicates=true)
> beam.RedistributeByKey(allowDuplicates=true)
>
> I'll add end-user APIs to the design doc (and ask for help on Python and
> Go idioms) but they are pretty short and sweet.
>
> Kenn
>
> On Thu, Feb 8, 2024 at 1:45 PM Robert Burke  wrote:
>
>> Was that only October? Wow.
>>
>> Option 2 SGTM, with the adjustment to making the core of the URN
>> "redistribute_allowing_duplicates" instead of building from the unspecified
>> Reshuffle semantics.
>>
>> Transforms getting updated to use the new transform can have their
>> @RequiresStableInput annotation added accordingly if they need that
>> property per previous discussions.
>>
>>
>>
>> On Thu, Feb 8, 2024, 10:31 AM Kenneth Knowles  wrote:
>>
>>>
>>>
>>> On Wed, Feb 7, 2024 at 5:15 PM Robert Burke  wrote:
>>>
>>>> OK, so my stance is that a configurable Reshuffle might be interesting, so
>>>> my vote is +1, along the following lines.
>>>>
>>>> 1. Use a new URN (beam:transform:reshuffle:v2) and attach a new
>>>> ReshufflePayload to it.
>>>>
>>>
>>> Ah, I see there's more than one variation of the "new URN" approach.
>>> Namely, you have a new version of an existing URN prefix, while I had in
>>> mind that it was a totally new base URN. In other words the open question I
>>> meant to pose is between these options:
>>>
>>> 1. beam:transform:reshuffle:v2 + { allowing_duplicates: true }
>>> 2. beam:transform:reshuffle_allowing_duplicates:v1 {}
>>>
>>> The most compelling argument in favor of option 2 is that it could have
>>> a distinct payload type associated with the different URN (maybe parameters
>>> around tweaking how much duplication? I don't know... I actually expect
>>> neither payload to evolve much if at all).
>>>
>>> There were also two comments in favor of option 2 on the design doc.
>>>
>>>   -> Unknown "urns for composite transforms" already default to the
>>>> subtransform graph implementation for most (all?) runners.
>>>>   -> Having a payload to toggle this behavior then can have whatever
>>>> desired behavior we like. It also allows for additional configurations
>>>> added in later on. This is preferable to a plethora of one-off urns IMHO.
>>>> We can have SDKs gate configuration combinations as needed if additional
>>>> ones appear.
>>>>
>>>> 2. It's very cheap to add but also ignore, as the default is "Do what
>>>> we're already doing without change", and not all SDKs need to add it right
>>>> away. It's more important that the portable way is defined at least, so
>>>> it's easy for other SDKs to add and handle it.
>>>>
>>>> I would prefer we have a clear starting point on what Reshuffle does
>>>> though. I remain a fan of "The Reshuffle (v2) Transform is a user
>>>> designated hint to a runner for a change in parallelism. By default, it
>>>> produces an output PCollection that has the same elements as the input
>>>> PCollection".

Re: [ACTION REQUESTED] Help me draft the Beam Board Report for March 2024

2024-03-11 Thread Kenneth Knowles
Ping!

Would really love help from folks building stuff to report out on what
they've built, especially!

Kenn

On Tue, Mar 5, 2024 at 12:15 PM Kenneth Knowles  wrote:

> The next Beam board report is due next Wednesday, March 13. Please draft
> it together at https://s.apache.org/beam-draft-report-2024-03. The doc is
> open for anyone to edit.
>
> Ideas:
>
>  - highlights from CHANGES.md
>  - interesting technical discussions
>  - integrations with other projects
>  - community events
>  - major user facing addition/deprecation
>
> Past reports are at https://whimsy.apache.org/board/minutes/Beam.html for
> examples.
>
> Thanks,
>
> Kenn
>


Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-03-06 Thread Kenneth Knowles
OK, of course, hacking this up there's already a combinatorial 2x2 that perhaps
people were alluding to but I missed.

RedistributeByKey (user's choice)
RedistributeArbitrarily (runner's choice! default may be random keys but
that is not required)

RedistributeArbitrarilyAllowingDuplicates (this is the use case I am trying
to get at with the design & impl - basically runner's choice and also no
need to dedup or persist)
RedistributeByKeyAllowingDuplicates (is this an important use case? I don't
know - if so, then it points to some future where you tag any transform
with this)

So now I kind of want to have two URNs (one per input/output type) and a
config that allows duplicates.

WDYT? Do the people who liked having separate URNs want to have 4 URNs? We
can still have whatever end-user SDK interface we need to have regardless.
I think in Java we want it to look like this regardless:

Redistribute.arbitrarily()
Redistribute.byKey()
Redistribute.arbitrarily().allowingDuplicates()
Redistribute.byKey().allowingDuplicates()

And Python

beam.Redistribute()
beam.RedistributeByKey()
beam.Redistribute(allowDuplicates=true)
beam.RedistributeByKey(allowDuplicates=true)

I'll add end-user APIs to the design doc (and ask for help on Python and Go
idioms) but they are pretty short and sweet.

Kenn

On Thu, Feb 8, 2024 at 1:45 PM Robert Burke  wrote:

> Was that only October? Wow.
>
> Option 2 SGTM, with the adjustment to making the core of the URN
> "redistribute_allowing_duplicates" instead of building from the unspecified
> Reshuffle semantics.
>
> Transforms getting updated to use the new transform can have their
> @RequiresStableInput annotation added accordingly if they need that
> property per previous discussions.
>
>
>
> On Thu, Feb 8, 2024, 10:31 AM Kenneth Knowles  wrote:
>
>>
>>
>> On Wed, Feb 7, 2024 at 5:15 PM Robert Burke  wrote:
>>
>>> OK, so my stance is that a configurable Reshuffle might be interesting, so my
>>> vote is +1, along the following lines.
>>>
>>> 1. Use a new URN (beam:transform:reshuffle:v2) and attach a new
>>> ReshufflePayload to it.
>>>
>>
>> Ah, I see there's more than one variation of the "new URN" approach.
>> Namely, you have a new version of an existing URN prefix, while I had in
>> mind that it was a totally new base URN. In other words the open question I
>> meant to pose is between these options:
>>
>> 1. beam:transform:reshuffle:v2 + { allowing_duplicates: true }
>> 2. beam:transform:reshuffle_allowing_duplicates:v1 {}
>>
>> The most compelling argument in favor of option 2 is that it could have a
>> distinct payload type associated with the different URN (maybe parameters
>> around tweaking how much duplication? I don't know... I actually expect
>> neither payload to evolve much if at all).
>>
>> There were also two comments in favor of option 2 on the design doc.
>>
>>   -> Unknown "urns for composite transforms" already default to the
>>> subtransform graph implementation for most (all?) runners.
>>>   -> Having a payload to toggle this behavior then can have whatever
>>> desired behavior we like. It also allows for additional configurations
>>> added in later on. This is preferable to a plethora of one-off urns IMHO.
>>> We can have SDKs gate configuration combinations as needed if additional
>>> ones appear.
>>>
>>> 2. It's very cheap to add but also ignore, as the default is "Do what
>>> we're already doing without change", and not all SDKs need to add it right
>>> away. It's more important that the portable way is defined at least, so
>>> it's easy for other SDKs to add and handle it.
>>>
>>> I would prefer we have a clear starting point on what Reshuffle does
>>> though. I remain a fan of "The Reshuffle (v2) Transform is a user
>>> designated hint to a runner for a change in parallelism. By default, it
>>> produces an output PCollection that has the same elements as the input
>>> PCollection".
>>>
>>
>> +1 this is a better phrasing of the spec I propose in
>> https://s.apache.org/beam-redistribute but let's not get into it here if
>> we can, and just evaluate the delta from that design to
>> https://s.apache.org/beam-reshuffle-allowing-duplicates
>>
>> Kenn
>>
>>
>>> It remains an open question about what that means for
>>> checkpointing/durability behavior, but that's largely been runner dependent
>>> anyway. I admit the above definition is biased by the uses of Reshuffle I'm

[ACTION REQUESTED] Help me draft the Beam Board Report for March 2024

2024-03-05 Thread Kenneth Knowles
The next Beam board report is due next Wednesday, March 13. Please draft it
together at https://s.apache.org/beam-draft-report-2024-03. The doc is open
for anyone to edit.

Ideas:

 - highlights from CHANGES.md
 - interesting technical discussions
 - integrations with other projects
 - community events
 - major user facing addition/deprecation

Past reports are at https://whimsy.apache.org/board/minutes/Beam.html for
examples.

Thanks,

Kenn


Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Kenneth Knowles
I very much like the idea of a processing-time clock as a parameter
to @ProcessElement. That will be obviously useful and remove a source of
inconsistency, in addition to letting the runner/SDK harness control it. I
also like the idea of passing a Sleeper or similar to @ProcessElement. These
are both good practices for testing and flexibility across runner/SDK
language differences.

In your (a) (b) (c) can you be more specific about which watermarks you are
referring to? Are they the same as in my opening email? If so, then what
you describe is what we already have.

Kenn
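
A small sketch of that testability point (hypothetical user code, not an
existing Beam parameter): the DoFn takes an injectable clock so tests can
control "processing time" deterministically.

import java.io.Serializable;
import java.util.function.Supplier;
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Instant;

class StampFn extends DoFn<String, String> {
  // Serializable functional interface so a lambda can ship with the DoFn.
  interface Clock extends Supplier<Instant>, Serializable {}

  private final Clock clock;

  StampFn(Clock clock) {
    this.clock = clock;
  }

  @ProcessElement
  public void process(@Element String element, OutputReceiver<String> out) {
    out.output(element + "@" + clock.get());
  }
}

// Production: new StampFn(() -> Instant.now())
// Tests:      new StampFn(() -> new Instant(0))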

On Tue, Feb 27, 2024 at 2:40 AM Jan Lukavský  wrote:

> I think that before we introduce a possibly somewhat duplicate new feature
> we should be certain that it is really semantically different. I'll
> rephrase the two cases:
>
>  a) need to wait and block data (delay) - the use case is the motivating
> example of Throttle transform
>
>  b) act without data, not block
>
> Provided we align processing time with local machine clock (or better,
> because of testing, make current processing time available via context to
> @ProcessElement) it seems to possble to unify both cases under slightly
> updated semantics of processing time timer in batch:
>
>  a) processing time timers fire with best-effort, i.e. trying to minimize
> delay between firing timestamp and timer's timestamp
>  b) a timer is valid only in the context of the current key-window; once the
> watermark passes the window GC time for the particular window that created the
> timer, it is ignored
>  c) if a timer has an output timestamp, this timestamp holds the watermark (but
> this is currently probably a no-op, because runners currently do not propagate
> the (per-key) watermark in batch, I assume)
>
> In case b) there might be a need to distinguish cases when the timer has an
> output timestamp; if so, it probably should be taken into account.
>
> Now, such semantics should be quite aligned with what we do in the streaming
> case and what users generally expect. The blocking part can be implemented
> in @ProcessElement using buffer & timer; once there is a need to wait, it can
> be implemented in user code using plain sleep(). That is due to the
> alignment between local time and definition of processing time. If we had
> some reason to be able to run faster-than-wall-clock (as I'm still not in
> favor of that), we could do that using ProcessContext.sleep(). Delaying
> processing in the @ProcessElement should result in backpressuring and
> backpropagation of this backpressure from the Throttle transform to the
> sources as mentioned (of course this is only for the streaming case).
>
> Is there anything missing in such a definition that would still require
> splitting the timers into two distinct features?
>
>  Jan
> On 2/26/24 21:22, Kenneth Knowles wrote:
>
> Yea I like DelayTimer, or SleepTimer, or WaitTimer or some such.
>
> OutputTime is always an event time timestamp so it isn't even allowed to
> be set outside the window (or you'd end up with an element assigned to a
> window that it isn't within, since OutputTime essentially represents
> reserving the right to output an element with that timestamp)
>
> Kenn
>
> On Mon, Feb 26, 2024 at 3:19 PM Robert Burke  wrote:
>
>> Agreed that a retroactive behavior change would be bad, even if tied to a
>> beam version change. I agree that it meshes well with the general theme of
>> State & Timers exposing underlying primitives for implementing Windowing
>> and similar. I'd say the distinction between the two might be additional
>> complexity for users to grok, and would need to be documented well, as both
>> operate in the ProcessingTime domain, but differently.
>>
>> What to call this new timer then? DelayTimer?
>>
>> "A DelayTimer sets an instant in ProcessingTime at which point
>> computations can continue. Runners will prevent the event-time watermark
>> from advancing past the set OutputTime until Processing Time has advanced
>> to at least the provided instant to execute the timers callback. This can
>> be used to allow the runner to constrain pipeline throughput with user
>> guidance."
>>
>> I'd probably add that a timer with an output time outside of the window
>> would not be guaranteed to fire, and that OnWindowExpiry is the correct way
>> to ensure cleanup occurs.
>>
>> No solution to the Looping Timers on Drain problem here, but i think
>> that's ultimately an orthogonal discussion, and will restrain my thoughts
>> on that for now.
>>
>> This isn't a proposal, but exploring the solution space within our
>> problem. We'd want to break down exactly what's different and what's the same
>> for the 3 kinds of timers...

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-26 Thread Kenneth Knowles
Yea I like DelayTimer, or SleepTimer, or WaitTimer or some such.

OutputTime is always an event time timestamp so it isn't even allowed to be
set outside the window (or you'd end up with an element assigned to a
window that it isn't within, since OutputTime essentially represents
reserving the right to output an element with that timestamp)

Kenn
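
In Beam Java that looks roughly like this inside a stateful DoFn (a fragment,
with flushTimer and minBufferedElementTimestamp as assumed names):

// flushTimer is a @TimerId-declared PROCESSING_TIME timer. The explicit
// output timestamp holds the output watermark at the minimum timestamp of
// the buffered data, and must fall within the element's window.
flushTimer
    .withOutputTimestamp(minBufferedElementTimestamp)
    .offset(Duration.standardMinutes(1))
    .setRelative();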

On Mon, Feb 26, 2024 at 3:19 PM Robert Burke  wrote:

> Agreed that a retroactive behavior change would be bad, even if tied to a
> beam version change. I agree that it meshes well with the general theme of
> State & Timers exposing underlying primitives for implementing Windowing
> and similar. I'd say the distinction between the two might be additional
> complexity for users to grok, and would need to be documented well, as both
> operate in the ProcessingTime domain, but differently.
>
> What to call this new timer then? DelayTimer?
>
> "A DelayTimer sets an instant in ProcessingTime at which point
> computations can continue. Runners will prevent the event-time watermark
> from advancing past the set OutputTime until Processing Time has advanced
> to at least the provided instant to execute the timers callback. This can
> be used to allow the runner to constrain pipeline throughput with user
> guidance."
>
> I'd probably add that a timer with an output time outside of the window
> would not be guaranteed to fire, and that OnWindowExpiry is the correct way
> to ensure cleanup occurs.
>
> No solution to the Looping Timers on Drain problem here, but i think
> that's ultimately an orthogonal discussion, and will restrain my thoughts
> on that for now.
>
> This isn't a proposal, but exploring the solution space within our
> problem. We'd want to break down exactly what's different and what's the same for
> the 3 kinds of timers...
>
>
>
>
> On Mon, Feb 26, 2024, 11:45 AM Kenneth Knowles  wrote:
>
>> Pulling out focus points:
>>
>> On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>> > I can't act on something yet [...] but I expect to be able to [...] at
>> some time in the processing-time future.
>>
>> I like this as a clear and internally-consistent feature description. It
>> describes ProcessContinuation and those timers which serve the same purpose
>> as ProcessContinuation.
>>
>> On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>> > I can't think of a batch or streaming scenario where it would be
>> correct to not wait at least that long
>>
>> The main reason we created timers: to take action in the absence of data.
>> The archetypal use case for processing time timers was/is "flush data from
>> state if it has been sitting there too long". For this use case, the right
>> behavior for batch is to skip the timer. It is actually basically incorrect
>> to wait.
>>
>> On Fri, Feb 23, 2024 at 3:54 PM Robert Burke  wrote:
>> > It doesn't require a new primitive.
>>
>> IMO what's being proposed *is* a new primitive. I think it is a good
>> primitive. It is the underlying primitive to ProcessContinuation. It
>> would be user-friendly as a kind of timer. But if we made this the behavior
>> of processing time timers retroactively, it would break everyone using them
>> to flush data who is also reprocessing data.
>>
>> There's two very different use cases ("I need to wait, and block data" vs
>> "I want to act without data, aka NOT wait for data") and I think we should
>> serve both of them, but it doesn't have to be with the same low-level
>> feature.
>>
>> Kenn
>>
>>
>> On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> On Fri, Feb 23, 2024 at 3:54 PM Robert Burke 
>>> wrote:
>>> >
>>> > While I'm currently on the other side of the fence, I would not be
>>> against changing/requiring the semantics of ProcessingTime constructs to be
>>> "must wait and execute" as such a solution, and enables the Proposed
>>> "batch" process continuation throttling mechanism to work as hypothesized
>>> for both "batch" and "streaming" execution.
>>> >
>>> > There's a lot to like, as it leans Beam further into the unification
>>> of Batch and Stream, with one fewer exception (e.g. unifies timer experience
>>> further). It doesn't require a new primitive. It probably matches more with
>>> user expectations anyway

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-26 Thread Kenneth Knowles
On Fri, Feb 23, 2024 at 12:36 AM Jan Lukavský  wrote:
> > > >
> > > > For me it always helps to seek analogy in our physical reality.
> Stream
> > > > processing actually has quite a good analogy for both event-time and
> > > > processing-time - the simplest model for this being relativity
> theory.
> > > > Event-time is the time at which events occur _at distant locations_.
> Due
> > > > to finite and invariant speed of light (which is actually really
> > > > involved in the explanation why any stream processing is inevitably
> > > > unordered) these events are observed (processed) at different times
> > > > (processing time, different for different observers). It is perfectly
> > > > possible for an observer to observe events at a rate that is higher
> than
> > > > one second per second. This also happens in reality for observers
> that
> > > > travel at relativistic speeds (which might be an analogy for fast -
> > > > batch - (re)processing). Besides the invariant speed, there is also
> > > > another invariant - local clock (wall time) always ticks exactly at
> the
> > > > rate of one second per second, no matter what. It is not possible to
> > > > "move faster or slower" through (local) time.
> > > >
> > > > In my understanding the reason why we do not put any guarantees or
> > > > bounds on the delay of firing processing time timers is purely
> technical
> > > > - the processing is (per key) single-threaded, thus any timer has to
> > > > wait before any element processing finishes. This is only a
> consequence of
> > > > a technical solution, not something fundamental.
> > > >
> > > > Having said that, my point is that according to the above analogy, it
> > > > should be perfectly fine to fire processing time timers in batch
> based
> > > > on (local wall) time only. There should be no way of manipulating
> this
> > > > local time (excluding tests). Watermarks should be affected the same
> way
> > > > as any buffering in a state that would happen in a stateful DoFn
> (i.e.
> > > > set timer holds output watermark). We should probably pay attention
> to
> > > > looping timers, but it seems possible to define a valid stopping
> > > > condition (input watermark at infinity).
> > > >
> > > >   Jan
> > > >
> > > > On 2/22/24 19:50, Kenneth Knowles wrote:
> > > > > Forking this thread.
> > > > >
> > > > > The state of processing time timers in this mode of processing is
> not
> > > > > satisfactory and is discussed a lot but we should make everything
> > > > > explicit.
> > > > >
> > > > > Currently, a state and timer DoFn has a number of logical
> watermarks:
> > > > > (apologies for fixed width not coming through in email lists).
> Treat
> > > > > timers as a back edge.
> > > > >
> > > > > > input --(A)---+---(C)--> ParDo(DoFn) --(D)---> output
> > > > > >               ^               |
> > > > > >               +------(B)------+
> > > > > >                    timers
> > > > >
> > > > > (A) Input Element watermark: this is the watermark that promises
> there
> > > > > is no incoming element with a timestamp earlier than it. Each input
> > > > > element's timestamp holds this watermark. Note that *event time
> timers
> > > > > firing is according to this watermark*. But a runner commits
> changes
> > > > > to this watermark *whenever it wants*, in a way that can be
> > > > > > consistent. So the runner can absolutely process *all* the elements
> > > > > before advancing the watermark (A), and only afterwards start
> firing
> > > > > timers.
> > > > >
> > > > > (B) Timer watermark: this is a watermark that promises no timer is
> set
> > > > > with an output timestamp earlier than it. Each timer that has an
> > > > > output timestamp holds this watermark. Note that timers can set new
> > > > > timers, indefinitely, so this may never reach infinity even in a
> drain
> > > > > scenario.
> > > > >
> > > > > (C) (derived) total input watermark: this is a watermark that is
> the
> > > > > minimum of the two above, and ensures that all state for the DoFn
> for
> &

[DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-22 Thread Kenneth Knowles
Forking this thread.

The state of processing time timers in this mode of processing is not
satisfactory and is discussed a lot but we should make everything explicit.

Currently, a state and timer DoFn has a number of logical watermarks:
(apologies for fixed width not coming through in email lists). Treat timers
as a back edge.

input --(A)---+---(C)--> ParDo(DoFn) --(D)---> output
              ^               |
              +------(B)------+
                   timers


(A) Input Element watermark: this is the watermark that promises there is
no incoming element with a timestamp earlier than it. Each input element's
timestamp holds this watermark. Note that *event time timers firing is
according to this watermark*. But a runner commits changes to this
watermark *whenever it wants*, in a way that can be consistent. So the
runner can absolutely process *all* the elements before advancing the
watermark (A), and only afterwards start firing timers.

(B) Timer watermark: this is a watermark that promises no timer is set with
an output timestamp earlier than it. Each timer that has an output
timestamp holds this watermark. Note that timers can set new timers,
indefinitely, so this may never reach infinity even in a drain scenario.

(C) (derived) total input watermark: this is a watermark that is the
minimum of the two above, and ensures that all state for the DoFn for
expired windows can be GCd after calling @OnWindowExpiration.

(D) output watermark: this is a promise that the DoFn will not output
earlier than the watermark. It is held by the total input watermark.

So any timer, processing or not, holds the total input watermark which
prevents window GC, hence the timer must be fired. You can set timers
without a timestamp and they will not hold (B) hence not hold the total
input / GC watermark (C). Then if a timer fires for an expired window, it
is ignored. But in general a timer that sets an output timestamp is saying
that it may produce output, so it *must* be fired, even in batch, for data
integrity. There was a time before timers had output timestamps that we
said that you *always* have to have an @OnWindowExpiration callback for
data integrity, and processing time timers could not hold the watermark.
That is changed now.

One main purpose of processing time timers in streaming is to be a
"timeout" for data buffered in state, to eventually flush. In this case the
output timestamp should be the minimum of the elements in state (or
equivalent). In batch, of course, this kind of timer is not relevant and we
should definitely not wait for it, because the goal is to just get through
all the data. We can justify this by saying that the worker really has no
business having any idea what time it really is, and the runner can just
run the clock at whatever speed it wants.
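
For concreteness, a minimal sketch of that flush-timeout archetype (names are
illustrative):

import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Buffers elements per key and flushes if no new element has arrived for 30
// seconds of processing time -- the "timeout" use of processing time timers.
class FlushAfterIdleFn extends DoFn<KV<String, String>, String> {
  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag();

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(
      @Element KV<String, String> element,
      @StateId("buffer") BagState<String> buffer,
      @TimerId("flush") Timer flush) {
    buffer.add(element.getValue());
    // (Re)arm: fires 30s after the most recent element for this key.
    flush.offset(Duration.standardSeconds(30)).setRelative();
  }

  @OnTimer("flush")
  public void onFlush(
      @StateId("buffer") BagState<String> buffer, OutputReceiver<String> out) {
    for (String value : buffer.read()) {
      out.output(value);
    }
    buffer.clear();
  }
}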

Another purpose, brought up on the Throttle thread, is to wait or backoff.
In this case it would be desired for the timer to actually cause batch
processing to pause and wait. This kind of behavior has not been explored
much. Notably the runner can absolutely process all elements first, then
start to fire any enqueued processing time timers. In the same way that
state in batch can just be in memory, this *could* just be a call to
sleep(). It all seems a bit sketchy so I'd love clearer opinions.

These two are both operational effects - as you would expect for processing
time timers - and they seem to be in conflict. Maybe they just need
different features?

I'd love to hear some more uses of processing time timers from the
community.

Kenn


Re: Throttle PTransform

2024-02-22 Thread Kenneth Knowles
Wow I love your input Reuven. Of course "the source" that you are applying
backpressure to is often a runner's shuffle so it may be state anyhow, but
it is good to give the runner the choice of how to figure that out and
maybe chain backpressure further.

The goal is basically to make a sink that doesn't do its own throttling
behave as well as one that does, so we don't DoS it, right? So that could
be a key design goal and inspiration, which leads to what Reuven describes.
Such sinks will throttle the DoFn by making the IO requests take longer or
returning throttle error codes or both. So we might consider how to emulate
that rather than buffer and timer.

Tangentially, I will start a separate thread and doc about processing time
timers in batch, which we should probably frame as "processing time timers
when historically processing a very large amount of data as fast and
efficiently as possible". I've had this chat with many people and even I
constantly forget the status, conclusion, and rationale for why things are
the way they are. It'll be good to record if not already somewhere.

Kenn
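
A rough sketch of the shard-and-throttle shape Reuven describes below, using
Guava's RateLimiter for brevity (illustrative only; assumes the input was
already keyed into numShards buckets upstream):

import com.google.common.util.concurrent.RateLimiter;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Each shard is processed single-threaded, so blocking in @ProcessElement
// throttles that shard to targetRate / numShards and lets backpressure
// propagate toward the source instead of buffering elements in state.
class PerShardThrottleFn extends DoFn<KV<Integer, String>, String> {
  private final double perShardRatePerSec;
  private transient RateLimiter limiter;

  PerShardThrottleFn(double targetRatePerSec, int numShards) {
    this.perShardRatePerSec = targetRatePerSec / numShards;
  }

  @Setup
  public void setup() {
    limiter = RateLimiter.create(perShardRatePerSec);
  }

  @ProcessElement
  public void process(@Element KV<Integer, String> element, OutputReceiver<String> out) {
    limiter.acquire(); // blocks until this shard's rate budget allows another element
    out.output(element.getValue());
  }
}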

On Thu, Feb 22, 2024 at 2:43 AM Jan Lukavský  wrote:

>
> On 2/21/24 18:27, Reuven Lax via dev wrote:
>
> Agreed, that event-time throttling doesn't make sense here. In theory
> processing-time timers have no SLA - i.e. their firing might be delayed -
> so batch runners aren't violating the model by firing them all at the end;
> however it does make processing time timers less useful in batch, as we see
> here.
>
> Personally, I'm not sure I would use state and timers to implement this,
> and I definitely wouldn't create this many keys. A couple of reasons for
> this:
>   1. If a pipeline is receiving input faster than the throttle rate, the
> proposed technique would shift all those elements into the DoFn's state
> which will keep growing indefinitely. Generally we would prefer to leave
> that backlog in the source instead of copying it into DoFn state.
>   2. In my experience with throttling, having too much parallelism is
> problematic. The issue is that there is some error involved whenever you
> throttle, and this error can accumulate across many shards (and when I've
> done this sort of thing before, I found that the error was often biased in
> one direction). If targeting 100,000 records/sec, this approach (if I
> understand it correctly) would create 100,000 shards and throttle them each
> to one element/sec. I doubt this will actually result in anything close to
> desired throttling.
>   3. Very commonly, the request is to throttle based on bytes/sec, not
> events/sec. Anything we build should be easily extensible to bytes/sec.
>
> What I would suggest (and what Beam users have often done in the past)
> would be to bucket the PCollection into N buckets where N is generally
> smallish (100 buckets, 1000 buckets, depending on the expected throughput);
> runners that support autosharding (such as Dataflow) can automatically
> choose N. Each shard then throttles its output to rate/N. Keeping N no
> larger than necessary minimizes the error introduced into throttling.
>
> We also don't necessarily need state/timers here - each shard is processed
> on a single thread, so those threads can simply throttle calls to
> OutputReceiver.output. This way if the pipeline is exceeding the threshold,
> backpressure will tend to simply leave excess data in the source. This also
> is a simpler design than the proposed one.
>
> A more sophisticated design might combine elements of both - buffering a
> bounded amount of data in state when the threshold is exceeded, but
> severely limiting the state size. However I wouldn't start here - we would
> want to build the simpler implementation first and see how it performs.
>
> +1
>
>
> On Wed, Feb 21, 2024 at 8:53 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> On Wed, Feb 21, 2024 at 12:48 AM Jan Lukavský  wrote:
>> >
>> > Hi,
>> >
>> > I have left a note regarding the proposed splitting of batch and
>> > streaming expansion of this transform. In general, a need for such a split
>> > triggers doubts in me. This signals that either
>> >
>> >   a) the transform does something is should not, or
>> >
>> >   b) Beam model is not complete in terms of being "unified"
>> >
>> > The problem that is described in the document is that in the batch case
>> > timers are not fired appropriately.
>>
>> +1. The underlying flaw is that processing time timers are not handled
>> correctly in batch, but should be (even if it means keeping workers
>> idle?). We should fix this.
>>
>> > This is actually one of the
>> > motivations that led to the introduction of @RequiresTimeSortedInput
>> > annotation and, though mentioned years ago as a question, I do not
>> > remember what arguments were used against enforcing sorting inputs by
>> > timestamp in the batch stateful DoFn as a requirement in the model. That
>> > would enable the appropriate firing of timers while preserving the batch
>> > invariant w

Re: [PROPOSAL] Preparing for 2.55.0 Release

2024-02-22 Thread Kenneth Knowles
Hooray! Thank you!

On Thu, Feb 22, 2024 at 10:24 AM Yi Hu via dev  wrote:

> Hey Beam community,
>
> The next release (2.55.0) branch cut is scheduled on Mar 6th, 2024,
> according to
> the release calendar [1].
>
> I volunteer to perform this release. My plan is to cut the branch on that
> date, and cherrypick release-blocking fixes afterwards, if any.
>
> Please help me make sure the release goes smoothly by:
> - Making sure that any unresolved release-blocking issues for 2.55.0 have
> their "Milestone" marked as "2.55.0 Release" as soon as possible.
> - Reviewing the current release blockers [2] and removing the Milestone if
> they don't meet the criteria at [3].
>
> Let me know if you have any comments/objections/questions.
>
> Thanks,
>
> Yi
>
> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> [2] https://github.com/apache/beam/milestone/19
> [3] https://beam.apache.org/contribute/release-blocking/
>
> --
>
> Yi Hu, (he/him/his)
>
> Software Engineer
>
>
>


Re: Pipeline upgrade to 2.55.0-SNAPSHOT broken for FlinkRunner

2024-02-22 Thread Kenneth Knowles
Great. Let me know if I can help. I broke it after all :-)

Kenn

On Thu, Feb 22, 2024 at 2:58 AM Jan Lukavský  wrote:

> Reasons we use Java serialization are not fundamental, probably only
> historical. Thinking about it, yes, there is a lucky coincidence that we
> currently have to change the serialization because of Flink 1.17 support.
> Flink 1.17 actually removes the legacy java serialization from Flink and
> enforces custom serialization. Therefore, we need to introduce an
> upgrade-compatible change of serialization to support Flink 1.17. This is already
> implemented in [1]. The PR can go further, though. We can replace Java
> serialization of Coder in the TypeSerializerSnapshot and use the portable
> representation of Coder (which will still use Java serialization in some
> cases, but might avoid it at least for well-known coders, moreover Coders
> should be more upgrade-stable classes).
>
> I'll try to restore the SerializablePipelineOptions (copy&paste) in
> FlinkRunner only and rework the serialization in a more stable way (at
> least avoid serializing the CoderTypeSerializer, which references the
> SerializablePipelineOptions).
>
> I created [2] and marked it as a blocker for the 2.55.0 release, because
> otherwise we would break the upgrade.
>
> Thanks for the discussion, it helped a lot.
>
>  Jan
>
> [1] https://github.com/apache/beam/pull/30197
>
> [2] https://github.com/apache/beam/issues/30385
> On 2/21/24 20:33, Kenneth Knowles wrote:
>
> Yea I think we should restore the necessary classes but also fix the
> FlinkRunner. Java serialization is inherently self-update-incompatible.
>
> On Wed, Feb 21, 2024 at 1:35 PM Reuven Lax via dev 
> wrote:
>
>> Is there a fundamental reason we serialize Java classes into Flink
>> savepoints?
>>
>> On Wed, Feb 21, 2024 at 9:51 AM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> We could consider merging the gradle targets without renaming the
>>> classpaths as an intermediate step.
>>>
>>> Optimistically, perhaps there's a small number of classes that we need
>>> to preserve (e.g. SerializablePipelineOptions looks like it was
>>> something specifically intended to be serialized; maybe that an a
>>> handful of others (that implement Serializable) could be left in their
>>> original packages for backwards compatibility reasons?
>>>
>>> On Wed, Feb 21, 2024 at 7:32 AM Jan Lukavský  wrote:
>>> >
>>> > Hi,
>>> >
>>> > while implementing FlinkRunner for Flink 1.17 I tried to verify that a
>>> > running Pipeline is able to successfully upgrade from Flink 1.16 to
>>> > Flink 1.17. There is some change regarding serialization needed for
>>> > Flink 1.17, so this was a concern. Unfortunately recently we merged
>>> > core-construction-java into SDK, which resulted in some classes being
>>> > repackaged. Unfortunately, we serialize some classes into Flink's
>>> > check/savepoints. The renaming of the class therefore ends with the
>>> > following exception trying to restore from the savepoint:
>>> >
>>> > Caused by: java.lang.ClassNotFoundException:
>>> > org.apache.beam.runners.core.construction.SerializablePipelineOptions
>>> >  at java.base/java.net
>>> .URLClassLoader.findClass(URLClassLoader.java:476)
>>> >  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>> >  at
>>> >
>>> org.apache.flink.util.FlinkUserCodeClassLoader.loadClassWithoutExceptionHandling(FlinkUserCodeClassLoader.java:67)
>>> >  at
>>> >
>>> org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:74)
>>> >  at
>>> >
>>> org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:51)
>>> >  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>> >  at
>>> >
>>> org.apache.flink.util.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.loadClass(FlinkUserCodeClassLoaders.java:192)
>>> >  at java.base/java.lang.Class.forName0(Native Method)
>>> >  at java.base/java.lang.Class.forName(Class.java:398)
>>> >  at
>>> >
>>> org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:78)
>>> >  at
>>> >
>>> org.apache.flink.util.InstantiationUtil$FailureTolerantObjectInputStream.readClassDescriptor(InstantiationUtil.java:251)
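
To make the serialization idea above concrete, here is a minimal sketch of a
Flink snapshot that writes the Coder's portable proto form instead of
Java-serializing the classes. CoderTranslators.toBytes/fromBytes stand in
for whatever portable round-trip helper the SDK exposes, and the
one-argument CoderTypeSerializer constructor is likewise a placeholder, not
the actual Beam API:

import java.io.IOException;
import org.apache.beam.sdk.coders.Coder;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.common.typeutils.TypeSerializerSchemaCompatibility;
import org.apache.flink.api.common.typeutils.TypeSerializerSnapshot;
import org.apache.flink.core.memory.DataInputView;
import org.apache.flink.core.memory.DataOutputView;

class CoderSnapshot<T> implements TypeSerializerSnapshot<T> {
  private Coder<T> coder;

  @Override
  public int getCurrentVersion() {
    return 2; // bumped when the wire format of the snapshot itself changes
  }

  @Override
  public void writeSnapshot(DataOutputView out) throws IOException {
    // Portable (proto) encoding of the coder, not Java serialization.
    byte[] payload = CoderTranslators.toBytes(coder); // hypothetical helper
    out.writeInt(payload.length);
    out.write(payload);
  }

  @Override
  public void readSnapshot(int readVersion, DataInputView in, ClassLoader loader)
      throws IOException {
    byte[] payload = new byte[in.readInt()];
    in.readFully(payload);
    coder = CoderTranslators.fromBytes(payload, loader); // hypothetical helper
  }

  @Override
  public TypeSerializer<T> restoreSerializer() {
    // The real CoderTypeSerializer also carries pipeline options; restoring
    // those separately is exactly the part the linked PR needs to untangle.
    return new CoderTypeSerializer<>(coder);
  }

  @Override
  public TypeSerializerSchemaCompatibility<T> resolveSchemaCompatibility(
      TypeSerializer<T> newSerializer) {
    // Simplified: real logic would compare the restored and new coders.
    return TypeSerializerSchemaCompatibility.compatibleAsIs();
  }
}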

Re: Pipeline upgrade to 2.55.0-SNAPSHOT broken for FlinkRunner

2024-02-21 Thread Kenneth Knowles
Yea I think we should restore the necessary classes but also fix the
FlinkRunner. Java serialization is inherently self-update-incompatible.

On Wed, Feb 21, 2024 at 1:35 PM Reuven Lax via dev 
wrote:

> Is there a fundamental reason we serialize java classes into Flink
> savepoints.
>
> On Wed, Feb 21, 2024 at 9:51 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> We could consider merging the gradle targets without renaming the
>> classpaths as an intermediate step.
>>
>> Optimistically, perhaps there's a small number of classes that we need
>> to preserve (e.g. SerializablePipelineOptions looks like it was
>> something specifically intended to be serialized; maybe that and a
>> handful of others (that implement Serializable) could be left in their
>> original packages for backwards compatibility reasons?)
>>
>> On Wed, Feb 21, 2024 at 7:32 AM Jan Lukavský  wrote:
>> >
>> > Hi,
>> >
>> > while implementing FlinkRunner for Flink 1.17 I tried to verify that a
>> > running Pipeline is able to successfully upgrade from Flink 1.16 to
>> > Flink 1.17. There is some change regarding serialization needed for
>> > Flink 1.17, so this was a concern. Unfortunately recently we merged
>> > core-construction-java into SDK, which resulted in some classes being
>> > repackaged. Unfortunately, we serialize some classes into Flink's
>> > check/savepoints. The renaming of the class therefore ends with the
>> > following exception trying to restore from the savepoint:
>> >
>> > Caused by: java.lang.ClassNotFoundException:
>> > org.apache.beam.runners.core.construction.SerializablePipelineOptions
>> >  at java.base/java.net
>> .URLClassLoader.findClass(URLClassLoader.java:476)
>> >  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>> >  at
>> >
>> org.apache.flink.util.FlinkUserCodeClassLoader.loadClassWithoutExceptionHandling(FlinkUserCodeClassLoader.java:67)
>> >  at
>> >
>> org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:74)
>> >  at
>> >
>> org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:51)
>> >  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>> >  at
>> >
>> org.apache.flink.util.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.loadClass(FlinkUserCodeClassLoaders.java:192)
>> >  at java.base/java.lang.Class.forName0(Native Method)
>> >  at java.base/java.lang.Class.forName(Class.java:398)
>> >  at
>> >
>> org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:78)
>> >  at
>> >
>> org.apache.flink.util.InstantiationUtil$FailureTolerantObjectInputStream.readClassDescriptor(InstantiationUtil.java:251)
>> >
>> >
>> > This means that no Pipeline will be able to successfully upgrade from
>> > version prior to 2.55.0 to 2.55.0 (i.e. all Pipelines will have to be
>> > restarted from scratch). I wanted to know how the community would feel
>> > about that; this consequence probably was not clear when we merged the
>> > artifacts. The only option would be to revert the merge and then try to
>> > figure out how to avoid Java serialization in Flink's savepoints. That
>> > would definitely be costly in terms of implementation and even more to
>> > provide ways to transfer old savepoints to the new format (can be
>> > possible using state processor API). I'm aware that Beam provides no
>> > general guarantees about the upgrade compatibility, so it might be fine
>> > to just ignore this; I just wanted to shout this out loud so that we can
>> > make a deliberate decision.
>> >
>> > Best,
>> >
>> >   Jan
>> >
>>
>


Re: [API PROPOSAL] PTransform.getURN, toProto, etc, for Java

2024-02-16 Thread Kenneth Knowles
My opinion regarding the execution side and symmetry is this: it was always
wrong to use the term "PTransform" to describe the thing that is executed
by workers or SDK harnesses. They aren't the same and shouldn't be thought
of or implemented as the same.

The original Dataflow runner had it right - a runner converts Beam into a
physical plan that is composed of physical operations. *Steps* and *stages* in
Dataflow's case. These do involve invoking user DoFns and other UDFs which
are shared with the pipeline. You can take a look at the Dataflow v1 worker
and you'll see the set of useful steps is neither a subset nor a superset
of what you think of as Beam's core transforms. In fact they are entirely
disjoint despite the temptation to suggest that ParDoStep is "just" ParDo,
which is wrong.

The Beam model on the fn api side isn't as good as the original Dataflow
approach when it comes to this clarity. There was a desire to share bits of
the encoding of a DAG between the Pipeline proto and the
ProcessBundleDescriptor, which is understandable. But honestly it might be
extraneous complexity as the Dataflow v1 worker only executes trees. I
don't know quite where we landed on that. But I think re-using PTransform
with execution-oriented URNs to describe instructions to the SDK harness is
primarily misleading/confusing and saves maybe dozens of lines of code.

Which is all to say that how this may or may not impact the execution side
doesn't matter to me. I would view it as an improvement if they diverged
further. But this change - at first - will just be a refactor of where the
code lives that produces the same particular protos.

Kenn

On Thu, Feb 15, 2024 at 2:48 PM Robert Burke  wrote:

> +1
>
> While the current Go SDK has always been portability-first, it was designed
> with a goal of enabling it to back out of that at the time, so it relies fully
> on a broad vertical slice of things to translate to protos and back again,
> leading to difficulties when adding a new core transform.
>
> I have an experimental hobby implementation of a Go SDK for prototyping
> things (mostly seeing if Go Generics can make a pipeline compile time
> typesafe, and the answer is yes... but that's a different email) and went
> with emitting out a FunctionSpec, (urn and payload), the env ID, and
> UniqueName, while inputs and outputs were handled with common code.
>
> I still kept Execution side translation to be graph based at the time,
> because of the lost type information, which required additional graph
> context to build the execution side with the right types (eg for SDK side
> source, sink, and flatten handling).
>
> So I question if full symmetry is required. Eg. There's no reason for
> ExternalTransforms to be converted back on execution side, or for GBKs
> (usually that is, I'm looking at you Typescript SDK!). And conversely,
> there are "Execution Side Only" transforms that are never directly written
> by a pipeline or transform author, but are necessary to execute SDK side
> (combine or SDF components for example), even though those have single user
> side constructs.
>
> That just implies that the toProto and fromProto parts are separable
> though.
>
> But that's just that specific experimental design for that specific
> languages affordances.
>
> It's definitely a big plus to be able to see all the bits for a single
> transform in one file, instead of trying to find the 5-8 different places
> once must add a registration for it. More so in Java where such handler
> registrations can be done via class annotations!
>
> Robert Burke
> Beam Go Busybody
>
> On Thu, Feb 15, 2024, 10:37 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> On Wed, Feb 14, 2024 at 10:28 AM Kenneth Knowles  wrote:
>> >
>> > Hi all,
>> >
>> > TL;DR I want to add some API like PTransform.getURN, toProto and
>> fromProto, etc. to the Java SDK. I want to do this so that making a
>> PTransform support portability is a natural part of writing the transform
>> and not a totally separate thing with tons of boilerplate.
>> >
>> > What do you think?
>>
>> Huge +1 to this direction.
>>
>> IMHO one of the most fundamental things about Beam is its model.
>> Originally this was only expressed in a specific SDK (Java) and then
>> got ported to others, but now that we have portability it's expressed
>> in a language-independent way.
>>
>> The fact that we keep these separate in Java is not buying us
>> anything, and causes a huge amount of boilerplate that'd be great to
>> remove, as well as making the essential model more front-and-center.
>>
>> > I think

[API PROPOSAL] PTransform.getURN, toProto, etc, for Java

2024-02-14 Thread Kenneth Knowles
Hi all,

TL;DR I want to add some API like PTransform.getURN, toProto and fromProto,
etc. to the Java SDK. I want to do this so that making a PTransform support
portability is a natural part of writing the transform and not a totally
separate thing with tons of boilerplate.

What do you think?

I think a particular API can be sorted out most easily in code (which I
will prepare after gathering some feedback).

We already have all the translation logic written, and porting a couple
transforms to it will ensure the API has everything we need. We can refer
to Python and Go for API ideas as well.

Lots of context below, but you can skip it...

-

When we first created the portability framework, we wanted the SDKs to be
"standalone" and not depend on portability. We wanted portability to be an
optional plugin that users could opt in to. That is totally the opposite
now. We want portability to be the main place where Beam is defined, and
then SDKs make that available in language idiomatic ways.

Also when we first created the framework, we were experimenting with
different serialization approaches and we wanted to be independent of
protobuf and gRPC if we could. But now we are pretty committed and it would
be a huge lift to use anything else.

Finally, at the time we created the portability framework, we designed it
to allow composites to have URNs and well-defined specs, rather than just
be language-specific subgraphs, but we didn't really plan to make this easy.

For all of the above, most users depend on portability and on proto. So
separating them is not useful and just creates LOTS of boilerplate and
friction for making new well-defined transforms.

Kenn
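
To make the shape concrete, one possible API along these lines; the names
and signatures here are illustrative only, with the real ones to be settled
in code as suggested above:

import java.io.IOException;
import org.apache.beam.model.pipeline.v1.RunnerApi;

/**
 * Hypothetical sketch: a transform that knows its own portable
 * representation. The actual proposal would hang these methods off
 * PTransform itself and thread SdkComponents through toProto so payloads
 * can register coders and environments.
 */
interface HasPortableRepresentation {

  /** URN identifying this transform, e.g. "beam:transform:group_by_key:v1". */
  String getUrn();

  /** Portable spec for this transform; payload-free by default. */
  default RunnerApi.FunctionSpec toProto() throws IOException {
    return RunnerApi.FunctionSpec.newBuilder().setUrn(getUrn()).build();
  }
}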


Re: [VOTE] Vendored Dependencies Release

2024-02-14 Thread Kenneth Knowles
+1 (binding)

On Wed, Feb 14, 2024 at 10:48 AM Robert Burke  wrote:

> +1 (binding)
>
> On Wed, Feb 14, 2024, 7:35 AM Yi Hu via dev  wrote:
>
>> +1 (non-binding)
>>
>> checked artifact packages not leaking namespace (or under
>> org.apache.beam.vendor.grpc.v1p60p1) and the tests in
>> https://github.com/apache/beam/pull/30212
>>
>>
>>
>>
>> On Tue, Feb 13, 2024 at 4:29 AM Sam Whittle  wrote:
>>
>>> Hi,
>>> Sorry I missed that close step. Done!
>>> Sam
>>>
>>> On Mon, Feb 12, 2024 at 8:32 PM Yi Hu via dev 
>>> wrote:
>>>
 Hi,

 I am trying to open "
 https://repository.apache.org/content/repositories/orgapachebeam-1369/";
 but get "[id=orgapachebeam-1369] exists but is not exposed." It seems the
 staging repository needs to be closed to make it available to the public: [1]

 [1]
 https://docs.google.com/document/d/1ztEoyGkqq9ie5riQxRtMuBu3vb6BUO91mSMn1PU0pDA/edit?disco=vHX80XE

 On Mon, Feb 12, 2024 at 1:44 PM Chamikara Jayalath via dev <
 dev@beam.apache.org> wrote:

> +1 (binding)
>
> Thanks,
> Cham
>
> On Fri, Feb 9, 2024 at 5:25 AM Sam Whittle 
> wrote:
>
>> Please review the release of the following artifacts that we vendor,
>> following the process [5]:
>>
>>  * beam-vendor-grpc-1-60-1:0.2
>>
>> Hi everyone,
>>
>> Please review and vote on the release candidate #1 for the version
>> beam-vendor-grpc-1-60-1:0.2 as follows:
>>
>> [ ] +1, Approve the release
>>
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> The complete staging area is available for your review, which
>> includes:
>>
>> * the official Apache source release to be deployed to
>> dist.apache.org [1], which is signed with the key with fingerprint
>> FCFD152811BF1578 [2],
>>
>> * all artifacts to be deployed to the Maven Central Repository [3],
>>
>> * commit hash "2d08b32e674a1046ba7be0ae5f1e4b7b05b73488" [4].
>>
>> The vote will be open for at least 72 hours. It is adopted by
>> majority approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>>
>> Sam
>>
>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>>
>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>
>> [3]
>> https://repository.apache.org/content/repositories/orgapachebeam-1369/
>>
>> [4]
>> https://github.com/apache/beam/commit/2d08b32e674a1046ba7be0ae5f1e4b7b05b73488
>>
>> [5] https://s.apache.org/beam-release-vendored-artifacts
>>
>


[ANNOUNCE] New Committer: Svetak Sundhar

2024-02-12 Thread Kenneth Knowles
Hi all,

Please join me and the rest of the Beam PMC in welcoming a new committer:
Svetak Sundhar (sve...@apache.org).

Svetak has been with Beam since 2021. Svetak has contributed code to many
areas of Beam, including notebooks, Beam Quest, dataframes, and IOs. We
also want to especially highlight the effort Svetak has put into improving
Beam's documentation, participating in release validation, and evangelizing
Beam.

Considering his contributions to the project over this timeframe, the Beam
PMC trusts Svetak with the responsibilities of a Beam committer. [1]

Thank you Svetak! And we are looking to see more of your contributions!

Kenn, on behalf of the Apache Beam PMC

[1]
https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-02-08 Thread Kenneth Knowles
On Wed, Feb 7, 2024 at 5:15 PM Robert Burke  wrote:

> OK, so my stance is a configurable Reshuffle might be interesting, so my
> vote is +1, along the following lines.
>
> 1. Use a new URN (beam:transform:reshuffle:v2) and attach a new
> ReshufflePayload to it.
>

Ah, I see there's more than one variation of the "new URN" approach.
Namely, you have a new version of an existing URN prefix, while I had in
mind that it was a totally new base URN. In other words the open question I
meant to pose is between these options:

1. beam:transform:reshuffle:v2 + { allowing_duplicates: true }
2. beam:transform:reshuffle_allowing_duplicates:v1 {}

The most compelling argument in favor of option 2 is that it could have a
distinct payload type associated with the different URN (maybe parameters
around tweaking how much duplication? I don't know... I actually expect
neither payload to evolve much if at all).

There were also two comments in favor of option 2 on the design doc.

  -> Unknown "urns for composite transforms" already default to the
> subtransform graph implementation for most (all?) runners.
>   -> Having a payload to toggle this behavior then can have whatever
> desired behavior we like. It also allows for additional configurations
> added in later on. This is preferable to a plethora of one-off urns IMHO.
> We can have SDKs gate configuration combinations as needed if additional
> ones appear.
>
> 2. It's very cheap to add but also ignore, as the default is "Do what
> we're already doing without change", and not all SDKs need to add it right
> away. It's more important that the portable way is defined at least, so
> it's easy for other SDKs to add and handle it.
>
> I would prefer we have a clear starting point on what Reshuffle does
> though. I remain a fan of "The Reshuffle (v2) Transform is a user
> designated hint to a runner for a change in parallelism. By default, it
> produces an output PCollection that has the same elements as the input
> PCollection".
>

+1 this is a better phrasing of the spec I propose in
https://s.apache.org/beam-redistribute but let's not get into it here if we
can, and just evaluate the delta from that design to
https://s.apache.org/beam-reshuffle-allowing-duplicates

Kenn


> It remains an open question about what that means for
> checkpointing/durability behavior, but that's largely been runner dependent
> anyway. I admit the above definition is biased by the uses of Reshuffle I'm
> aware of, which largely are to incur a fusion break in the execution graph.
>
> Robert Burke
> Beam Go Busybody
>
> On 2024/01/31 16:01:33 Kenneth Knowles wrote:
> > On Wed, Jan 31, 2024 at 4:21 AM Jan Lukavský  wrote:
> >
> > > Hi,
> > >
> > > if I understand this proposal correctly, the motivation is actually
> > > reducing latency by bypassing bundle atomic guarantees, bundles after
> "at
> > > least once" Reshuffle would be reconstructed independently of the
> > > pre-shuffle bundling. Provided this is correct, it seems that the
> behavior
> > > is slightly more general than for the case of Reshuffle. We have
> already
> > > some transforms that manipulate a specific property of a PCollection -
> if
> > > it may or might not contain duplicates. That is manipulated in two
> ways -
> > > explicitly removing duplicates based on IDs on sources that generate
> > > duplicates and using @RequiresStableInput, mostly in sinks. These
> > > techniques modify an inherent property of a PCollection, that is if it
> > > contains or does not contain possible duplicates originating from the
> same
> > > input element.
> > >
> > > There are two types of duplicates - duplicate elements in _different
> > > bundles_ (typically from at-least-once sources) and duplicates arising
> due
> > > to bundle reprocessing (affecting only transforms with side-effects,
> that
> > > is what we solve by @RequiresStableInput). The point I'm trying to get
> to -
> > > should we add these properties to PCollections (contains cross-bundle
> > > duplicates vs. does not) and PTransforms ("outputs deduplicated
> elements"
> > > and "requires stable input")? That would allow us to analyze the
> Pipeline
> > > DAG and provide appropriate implementation for Reshuffle
> automatically, so
> > > that a new URN or flag would not be needed. Moreover, this might be
> useful
> > > for a broader range of optimizations.
> > >
> > > WDYT?
> > >
> > These are interesting ideas that could be useful. I think they achieve a

Re: [PROPOSAL] Re-release vendor grpc

2024-02-06 Thread Kenneth Knowles
SGTM. Thanks for doing this!

On Tue, Feb 6, 2024 at 5:20 PM Sam Whittle  wrote:

> Hi everyone,
>
> I would like to volunteer to rerelease the Beam vendored grpc 1.60.1.
> The grpc version will be unchanged but additional jars
> 'io.grpc:grpc-services' and 'io.grpc:grpc-util' will be added due to [1]
> addressing [2]
>
> My plan is to follow the release process [3, 4], which involves preparing
> for the release, building a candidate, voting and finalizing the release. I
> plan on integrating the vendored artifact
> org.apache.beam:beam-vendor-grpc-1_60_1:0.2 into the 2.55.0 release.
>
> Please let me know if you have any comments/objections/questions.
>
> Thanks,
>
> Sam
>
> [1] https://github.com/apache/beam/pull/30196
> [2] https://github.com/apache/beam/issues/24835
> [3] https://github.com/apache/beam/tree/master/vendor
> [4]
> https://docs.google.com/document/d/1ztEoyGkqq9ie5riQxRtMuBu3vb6BUO91mSMn1PU0pDA/edit#heading=h.vhcuqlttpnog
>


Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-01-31 Thread Kenneth Knowles
On Wed, Jan 31, 2024 at 4:21 AM Jan Lukavský  wrote:

> Hi,
>
> if I understand this proposal correctly, the motivation is actually
> reducing latency by bypassing bundle atomic guarantees, bundles after "at
> least once" Reshuffle would be reconstructed independently of the
> pre-shuffle bundling. Provided this is correct, it seems that the behavior
> is slightly more general than for the case of Reshuffle. We have already
> some transforms that manipulate a specific property of a PCollection - if
> it may or might not contain duplicates. That is manipulated in two ways -
> explicitly removing duplicates based on IDs on sources that generate
> duplicates and using @RequiresStableInput, mostly in sinks. These
> techniques modify an inherent property of a PCollection, that is if it
> contains or does not contain possible duplicates originating from the same
> input element.
>
> There are two types of duplicates - duplicate elements in _different
> bundles_ (typically from at-least-once sources) and duplicates arising due
> to bundle reprocessing (affecting only transforms with side-effects, that
> is what we solve by @RequiresStableInput). The point I'm trying to get to -
> should we add these properties to PCollections (contains cross-bundle
> duplicates vs. does not) and PTransforms ("outputs deduplicated elements"
> and "requires stable input")? That would allow us to analyze the Pipeline
> DAG and provide appropriate implementation for Reshuffle automatically, so
> that a new URN or flag would not be needed. Moreover, this might be useful
> for a broader range of optimizations.
>
> WDYT?
>
These are interesting ideas that could be useful. I think they achieve a
different goal in my case. I actually want to explicitly allow
Reshuffle.allowingDuplicates() to skip expensive parts of its
implementation that are used to prevent duplicates.

The property that would make it possible to automate this in the case of
combiners, or at least validate that the pipeline still gives 100% accurate
answers, would be something like @InsensitiveToDuplicateElements which is
longer and less esoteric than @Idempotent. For situations where there is a
source or sink that only has at-least-once guarantees then yea maybe the
property "has duplicates" will let you know that you may as well use the
duplicating reshuffle without any loss. But still, you may not want to
introduce *more* duplicates.

I would say my proposal is a step in this direction that would gain some
experience and tools that we might later use in a more automated way.

Kenn

>  Jan
> On 1/30/24 23:22, Robert Burke wrote:
>
> Is the benefit of this proposal just the bounded deviation from the
> existing reshuffle?
>
> Reshuffle is already rather dictated by arbitrary runner choice, from
> simply ignoring the node, to forcing a materialization break, to a full
> shuffle implementation which has additional side effects.
>
> But model wise I don't believe it guarantees specific checkpointing or
> re-execution behavior as currently specified. The proto only says it
> represents the operation (without specifying the behavior, that is a big
> problem).
>
> I guess my concern here is that it implies/codifies that the existing
> reshuffle has more behavior than it promises outside of the Java SDK.
>
> "Allowing duplicates" WRT reshuffle is tricky. It feels like mostly allows
> an implementation that may mean the inputs into the reshuffle might be
> re-executed for example. But that's always under the runner's discretion ,
> and ultimately it could also prevent even getting the intended benefit of a
> reshuffle (notionally, just a fusion break).
>
> Is there even a valid way to implement the notion of a reshuffle that
> leads to duplicates outside of a retry/resilience case?
>
> ---
>
> To be clear, I'm not against the proposal. I'm against it being
> built on a non-existent foundation. If the behavior isn't already defined,
> it's impossible to specify a real deviation from it.
>
> I'm all for more specific behaviors if it means we actually clarify what the
> original version is in the protos, since it's news to me (just now, because
> I looked) that the Java reshuffle promises GBK-like side effects. But
> that's a long-deprecated transform without a satisfying replacement for
> its usage, so it may be moot.
>
> Robert Burke
>
>
>
> On Tue, Jan 30, 2024, 1:34 PM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Just when you thought I had squeezed all the possible interest out of
>> this most boring-seeming of transforms :-)
>>
>> I wrote up a very quick proposal as a doc [1]. It is short enough that I
>>

Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-01-31 Thread Kenneth Knowles
On Tue, Jan 30, 2024 at 5:22 PM Robert Burke  wrote:

> Is the benefit of this proposal just the bounded deviation from the
> existing reshuffle?
>
> Reshuffle is already rather dictated by arbitrary runner choice, from
> simply ignoring the node, to forcing a materialization break, to a full
> shuffle implementation which has additional side effects.
>
> But model wise I don't believe it guarantees specific checkpointing or
> re-execution behavior as currently specified. The proto only says it
> represents the operation (without specifying the behavior, that is a big
> problem).
>

Indeed, the semantics are specified for reshuffle: the output PCollection
has the same elements as the input PCollection. Beam very deliberately
doesn't define operational characteristics. It is entirely possible that
reshuffle is meaningless for a runner, indeed. I'm not particularly trying
to re-open that can of worms here...

I guess my concern here is that it implies/codifies that the existing
> reshuffle has more behavior than it promises outside of the Java SDK.
>
> "Allowing duplicates" WRT reshuffle is tricky. It feels like mostly allows
> an implementation that may mean the inputs into the reshuffle might be
> re-executed for example. But that's always under the runner's discretion ,
> and ultimately it could also prevent even getting the intended benefit of a
> reshuffle (notionally, just a fusion break).
>

My intent is to be exactly as questionable as the current reshuffle, which
is indeed questionable. The semantics of the newly proposed transform is
that the output PCollection contains the same elements as the input
PCollection, possibly with duplicates. Aka the input is a subset of the
output.

Is there even a valid way to implement the notion of a reshuffle that leads
> to duplicates outside of a retry/resilience case?
>

Sure! ParDo(x -> { output(x); output(x) })

:-) :-) :-)

Kenn


>
> ---
>
> To be clear, I'm not against the proposal. I'm against it being
> built on a non-existent foundation. If the behavior isn't already defined,
> it's impossible to specify a real deviation from it.
>
> I'm all for more specific behaviors if it means we actually clarify what the
> original version is in the protos, since it's news to me (just now, because
> I looked) that the Java reshuffle promises GBK-like side effects. But
> that's a long-deprecated transform without a satisfying replacement for
> its usage, so it may be moot.
>
> Robert Burke
>
>
>
> On Tue, Jan 30, 2024, 1:34 PM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Just when you thought I had squeezed all the possible interest out of
>> this most boring-seeming of transforms :-)
>>
>> I wrote up a very quick proposal as a doc [1]. It is short enough that I
>> will also put the main idea and main question in this email so you can
>> quickly read. Best to put comments in the doc.
>>
>> Main idea: add a variation of Reshuffle that allows duplicates, aka "at
>> least once", so that users and runners can benefit from efficiency if it is
>> possible
>>
>> Main question: is it best as a parameter to existing reshuffle transforms
>> or as new URN(s)? I have proposed it as a parameter but I think either one
>> could work.
>>
>> I would love feedback on the main idea, main question, or anywhere on the
>> doc.
>>
>> Thanks!
>>
>> Kenn
>>
>> [1] https://s.apache.org/beam-reshuffle-allowing-duplicates
>>
>


[DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-01-30 Thread Kenneth Knowles
Hi all,

Just when you thought I had squeezed all the possible interest out of this
most boring-seeming of transforms :-)

I wrote up a very quick proposal as a doc [1]. It is short enough that I
will also put the main idea and main question in this email so you can
quickly read. Best to put comments in the doc.

Main idea: add a variation of Reshuffle that allows duplicates, aka "at
least once", so that users and runners can benefit from efficiency if it is
possible

Main question: is it best as a parameter to existing reshuffle transforms
or as new URN(s)? I have proposed it as a parameter but I think either one
could work.

I would love feedback on the main idea, main question, or anywhere on the
doc.

Thanks!

Kenn

[1] https://s.apache.org/beam-reshuffle-allowing-duplicates
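
Under the "parameter" option, the user-facing shape could look roughly like
the following; allowingDuplicates() is the proposed addition, not an
existing method on Reshuffle:

import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

// Inside some utility class. The output contains at least the elements of
// the input, possibly with duplicates, freeing the runner to use cheaper
// at-least-once machinery.
static <T> PCollection<T> redistributeAtLeastOnce(PCollection<T> input) {
  return input.apply(Reshuffle.<T>viaRandomKey().allowingDuplicates());
}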


Re: [VOTE] Vendored Dependencies Release

2024-01-22 Thread Kenneth Knowles
Notably, the vendored artifact has no impact on the repo until the version
used is also bumped, right? So the release is very low stakes.

Kenn

On Fri, Jan 19, 2024 at 4:55 PM Robert Bradshaw via dev 
wrote:

> Thanks.
>
> +1
>
>
> On Fri, Jan 19, 2024 at 1:24 PM Yi Hu  wrote:
>
>> The process I have been following is [1]. I have also suggested edits to
>> the voting email template to include the self-link. However, can anyone
>> edit this doc so the change can be made? Otherwise it might be better to
>> migrate this doc to
>> https://github.com/apache/beam/tree/master/contributor-docs
>>
>> [1] https://s.apache.org/beam-release-vendored-artifacts
>>
>> On Thu, Jan 18, 2024 at 2:56 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Could you explain the process you used to produce these artifacts?
>>>
>>> On Thu, Jan 18, 2024 at 11:23 AM Kenneth Knowles 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Wed, Jan 17, 2024 at 6:03 PM Yi Hu via dev 
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>>
>>>>> Please review the release of the following artifacts that we vendor:
>>>>>
>>>>>  * beam-vendor-grpc-1_60_1
>>>>>
>>>>>
>>>>> Please review and vote on the release candidate #1 for the version
>>>>> 0.1, as follows:
>>>>>
>>>>> [ ] +1, Approve the release
>>>>>
>>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>>>
>>>>>
>>>>> The complete staging area is available for your review, which includes:
>>>>>
>>>>> * the official Apache source release to be deployed to dist.apache.org
>>>>> [1], which is signed with the key with fingerprint 8935B943A188DE65 [2],
>>>>>
>>>>> * all artifacts to be deployed to the Maven Central Repository [3],
>>>>>
>>>>> * commit hash "52b4a9cb58e486745ded7d53a5b6e2d2312e9551" [4],
>>>>>
>>>>> The vote will be open for at least 72 hours. It is adopted by majority
>>>>> approval, with at least 3 PMC affirmative votes.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Release Manager
>>>>>
>>>>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>>>>>
>>>>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>>>
>>>>> [3]
>>>>> https://repository.apache.org/content/repositories/orgapachebeam-1366/
>>>>>
>>>>> [4]
>>>>> https://github.com/apache/beam/commits/52b4a9cb58e486745ded7d53a5b6e2d2312e9551/
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Yi Hu, (he/him/his)
>>>>>
>>>>> Software Engineer
>>>>>
>>>>>
>>>>>


Re: @RequiresTimeSortedInput adoption by runners

2024-01-19 Thread Kenneth Knowles
In this design space, what we have done in the past is:

1) ensure that runners all reject pipelines they cannot run correctly
2) if there is a default/workaround/slower implementation, provide it as an
override

This is largely ignoring portability but I think/hope it will still work.
At one time I put some effort into ensuring Java Pipeline objects and proto
representations could roundtrip with all the necessary information for
pre-portability runners to still work, which is the same prerequisite for
pre-portable "Override" implementations to still work.

TBH I'm 50/50 on the idea. If something is going to be implemented more
slowly or less scalably as a fallback, I think it may be best to simply be
upfront about being unable to really run it. It would depend on the
situation. For requiring time sorted input, the manual implementation is
probably similar to what a streaming runner might do, so it might make
sense.

Kenn

On Fri, Jan 19, 2024 at 11:05 AM Robert Burke  wrote:

> I certainly don't have the deeper java insight here. So one more portable
> based reply and then I'll step back on the Java specifics.
>
> Portable runners only really have the "unknown Composite" fallback option,
> where if the Composite's URN isn't known to the runner, it should use the
> subgraph that is being wrapped.
>
> I suppose the protocol could be expanded: if a composite transform with a
> ParDo payload and urn has features the runner can't handle, then it could
> use the fallback graph as well.
>
> The SDK would then still need to construct the fallback graph
> into the Pipeline proto. This doesn't sound incompatible with what you've
> suggested the Java SDK could do, but it avoids the runner needing to be
> aware of a specific implementation requirement around a feature it doesn't
> support.  If it has to do something specific to support an SDK specific
> mechanism, that's still supporting the feature, but I fear it's not a great
> road to tread on for runners to add SDK specific implementation details.
>
> If a (portable) runner is going to spend work on doing something to handle
> RequiresTimeSortedInput, it's probably easier to handle it generally than
> to try to enable a Java specific work around. I'm not even sure how that
> could work since the SDK would then need a special interpretation of what a
> runner sent back for it to do any SDK side special backup handling, vs the
> simple execution of the given transform.
>
> It's entirely possible I've over simplified the "fallback" protocol
> described above, so this thread is still useful for my Prism work,
> especially if I see any similar situations once I start on the Java
> Validates Runner suite.
>
> Robert Burke
> Beam Go Busybody
>
> On Fri, Jan 19, 2024, 6:41 AM Jan Lukavský  wrote:
>
>> I was primarily focused on Java SDK (and core-contruction-java), but
>> generally speaking, any SDK can provide default expansion that runners can
>> use so that it is not (should not be) required to implement this manually.
>> Currently, in Java SDK, the annotation is wired up into
>> StatefulDoFnRunner, which (as name suggests) can be used for running
>> stateful DoFns. The problem is that not every runner is using this
>> facility. Java SDK generally supports providing default expansions of
>> transforms, but _only for transforms that do not have to work with dynamic
>> state_. This is not the case for this annotation - a default implementation
>> for @RequiresTimeSortedInput has to take another DoFn as input, and wire
>> its lifecycle in a way that elements are buffered in (dynamically created)
>> buffer and fed into the downstream DoFn only when timer fires.
>>
>> If I narrow down my line of thinking, it would be possible to:
>>  a) create something like "dynamic pipeline expansion", which would make
>> it possible to work with PTransforms in this way (probably would require some
>> ByteBuddy magic)
>>  b) wire this up to DoFnInvoker, which takes DoFn and creates class that
>> is used by runners for feeding data
>>
>> Option b) would ensure that actually all runners support such expansion,
>> but seems to be somewhat hacky and too specific to this case. Moreover, it
>> would require knowledge if the expansion is actually required by the runner
>> (e.g. if the annotation is supported explicitly - most likely for batch
>> execution). Therefore I'd be in favor of option a), this might be reusable
>> by a broader range of default expansions.
>>
>> In other SDKs than Java this might have different implications, the
>> reason why it is somewhat more complicated to do dynamic (or generic?)
>> expansions of PTransforms in Java is mostly due to how DoFns are
>> implemented in terms of annotations and the DoFnInvokers involved for
>> efficiency.
>>
>>  Jan
>>
>> On 1/18/24 18:35, Robert Burke wrote:
>>
>> I agree that variable support across Runners does limit the adoption of a 
>> feature.  But it's also then limited if the SDKs and their local / direct 
>> runners don't y
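
A rough shape of the manual fallback discussed in this thread: buffer
elements in state and replay them in timestamp order once the watermark
passes the window. Heavily simplified: it ignores late data, sorts a whole
window at once, assumes keyed input, and (as the comments note) a real
implementation must also manage the timer's output timestamp:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Instant;

class SortBeforeProcessing<K, V> extends DoFn<KV<K, V>, KV<K, V>> {

  @StateId("buffer")
  private final StateSpec<BagState<TimestampedValue<V>>> bufferSpec = StateSpecs.bag();

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      @Element KV<K, V> element,
      @Timestamp Instant timestamp,
      BoundedWindow window,
      @StateId("buffer") BagState<TimestampedValue<V>> buffer,
      @TimerId("flush") Timer flush) {
    buffer.add(TimestampedValue.of(element.getValue(), timestamp));
    // Fire once the watermark passes the window. A production version would
    // also set the timer's output timestamp (Timer.withOutputTimestamp) to
    // the minimum buffered timestamp, to hold the output watermark.
    flush.set(window.maxTimestamp());
  }

  @OnTimer("flush")
  public void flush(
      @Key K key,
      @StateId("buffer") BagState<TimestampedValue<V>> buffer,
      OutputReceiver<KV<K, V>> out) {
    List<TimestampedValue<V>> sorted = new ArrayList<>();
    buffer.read().forEach(sorted::add);
    sorted.sort(Comparator.comparing(TimestampedValue::getTimestamp));
    for (TimestampedValue<V> value : sorted) {
      // Re-emit in order; a generated expansion would instead invoke the
      // wrapped DoFn here.
      out.outputWithTimestamp(KV.of(key, value.getValue()), value.getTimestamp());
    }
    buffer.clear();
  }
}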

Re: [VOTE] Vendored Dependencies Release

2024-01-18 Thread Kenneth Knowles
+1

On Wed, Jan 17, 2024 at 6:03 PM Yi Hu via dev  wrote:

> Hi everyone,
>
>
> Please review the release of the following artifacts that we vendor:
>
>  * beam-vendor-grpc-1_60_1
>
>
> Please review and vote on the release candidate #1 for the version 0.1, as
> follows:
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
>
> * the official Apache source release to be deployed to dist.apache.org
> [1], which is signed with the key with fingerprint 8935B943A188DE65 [2],
>
> * all artifacts to be deployed to the Maven Central Repository [3],
>
> * commit hash "52b4a9cb58e486745ded7d53a5b6e2d2312e9551" [4],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
>
> Release Manager
>
> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>
> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>
> [3] https://repository.apache.org/content/repositories/orgapachebeam-1366/
>
> [4]
> https://github.com/apache/beam/commits/52b4a9cb58e486745ded7d53a5b6e2d2312e9551/
>
>
> --
>
> Yi Hu, (he/him/his)
>
> Software Engineer
>
>
>


Re: [PROPOSAL] Upgrade vendor grpc

2024-01-12 Thread Kenneth Knowles
Yes, thank you!

On Thu, Jan 11, 2024 at 8:21 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> Sounds good and thanks for doing this :)
>
> - Cham
>
> On Thu, Jan 11, 2024 at 8:06 AM Yi Hu via dev  wrote:
>
>> Hi everyone,
>>
>> I would like to volunteer to upgrade the Beam vendored grpc, as requested
>> by the GitHub Issue [1]. The last update was in Apr 2023 [2]. There have
>> been vulnerabilities in its dependencies as well as potential oom issues
>> found since then (see [1]), and also to include grpc-alts [2].
>>
>> My plan is to follow the release process [3, 4], which involves preparing
>> for the release, building a candidate, voting and finalizing the release.
>> Then the vendored artifact is targeted to be integrated by Beam v2.54.0
>> onwards (cut date Jan 24, 2024).
>>
>> Please let me know if you have any comments/objections/questions.
>>
>> Thanks,
>>
>> Yi
>>
>> [1] https://github.com/apache/beam/issues/29861
>> [2] https://github.com/apache/beam/issues/25746
>> [3] https://github.com/apache/beam/tree/master/vendor
>> [4]
>> https://docs.google.com/document/d/1ztEoyGkqq9ie5riQxRtMuBu3vb6BUO91mSMn1PU0pDA/edit#heading=h.vhcuqlttpnog
>> --
>>
>> Yi Hu, (he/him/his)
>>
>> Software Engineer
>>
>>
>>


Re: ByteBuddy DoFnInvokers Write Up

2024-01-12 Thread Kenneth Knowles
This is really great, and a very good idea to document. Going from "what
does a DoFnSignature and DoFnInvoker look like for a particular DoFn" is
super useful to even explain why these constructions exist. And from there,
you can talk about what the bytecode looks like and what the ByteBuddy to
generate it looks like.

Kenn

On Thu, Jan 11, 2024 at 12:26 PM Ismaël Mejía  wrote:

> Neat! I remember passing long time trying to decipher the DoFnInvoker
> behavior so this will definitely be helpful.
>
> Maybe a good idea to add the link to the Design Documents list for future
> reference
> https://cwiki.apache.org/confluence/display/BEAM/Design+Documents
>
> On Wed, Jan 10, 2024 at 9:15 PM Robert Burke  wrote:
>
>> That's neat! Thanks for writing that up!
>>
>> On Wed, Jan 10, 2024, 11:12 AM John Casey via dev 
>> wrote:
>>
>>> The team at Google recently held an internal hackathon, and my hack
>>> involved modifying how our ByteBuddy DoFnInvokers work. My hack didn't end
>>> up going anywhere, but I learned a lot about how our code generation works.
>>> It turns out we have no documentation or design docs about our code
>>> generation, so I wrote up what I learned.
>>>
>>> Please take a look, and let me know if I got anything wrong, or if you
>>> are looking for more detail
>>>
>>> s.apache.org/beam-bytebuddy-dofninvoker
>>>
>>> John
>>>
>>
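
For a concrete picture of what the write-up covers: given a trivial DoFn,
the invoker ByteBuddy generates is conceptually equivalent to the
hand-written class below (simplified; the actual generated bytecode
implements the full DoFnInvoker interface and differs in detail):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.reflect.DoFnInvoker;

class Lengths extends DoFn<String, Integer> {
  @ProcessElement
  public void process(@Element String word, OutputReceiver<Integer> out) {
    out.output(word.length());
  }
}

// Conceptual stand-in for the generated invoker; only processElement shown.
class LengthsInvoker {
  private final Lengths fn;

  LengthsInvoker(Lengths fn) {
    this.fn = fn;
  }

  // The DoFnSignature tells the generator which parameters process()
  // declares; the generated method pulls each one from the ArgumentProvider.
  public void invokeProcessElement(DoFnInvoker.ArgumentProvider<String, Integer> args) {
    fn.process(args.element(fn), args.outputReceiver(fn));
  }
}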


[ACTION REQUESTED] Help me draft the Beam Board Report for January 2024

2024-01-05 Thread Kenneth Knowles
Hi all,

The next Beam board report is due next Wednesday, January 10. Please help
me to draft it at https://s.apache.org/beam-draft-report-2024-01. The doc
is open for anyone to edit.

Ideas:

 - highlights from CHANGES.md
 - interesting technical discussions
 - integrations with other projects
 - community events
 - major user facing addition/deprecation

Past reports are at https://whimsy.apache.org/board/minutes/Beam.html for
examples.

Thanks,

Kenn


Re: Credentials Rotation Failure on Metrics cluster (2023-11-01)

2023-11-01 Thread Kenneth Knowles via dev
+Danny McCormick   is this the converse of the
other failure? (I didn't click through, I just read the other thread)

On Tue, Oct 31, 2023 at 10:10 PM gacti...@beam.apache.org <
beamacti...@gmail.com> wrote:

> Something went wrong during the automatic credentials rotation for Metrics
> Cluster, performed at 2023-11-01. It may be necessary to check the state of
> the cluster certificates. For further details refer to the following
> links:
>  * Failing job:
> https://github.com/apache/beam/actions/workflows/beam_MetricsCredentialsRotation.yml
>  * Job configuration:
> https://github.com/apache/beam/blob/master/.github/workflows/beam_MetricsCredentialsRotation.yml
>  * Cluster URL:
> https://pantheon.corp.google.com/kubernetes/clusters/details/us-central1-a/metrics/details?mods=dataflow_dev&project=apache-beam-testing
>


Re: [YAML] Aggregations

2023-10-30 Thread Kenneth Knowles
Automatically dereferencing, basically. It is nice. Especially for
many-to-many relationships like the example. I don't know if the
aggregation is any different though, is it?

Kenn

On Sun, Oct 29, 2023 at 1:12 PM Robert Burke  wrote:

> I came across EdgeDB, and it has a novel syntax moving away from SQL with
> their EdgeQL.
>
> https://www.edgedb.com/
>
> Eg. Here are two equivalent "nested" queries.
>
>
> # EdgeQL
>
> select Movie {
>   title,
>   actors: {
>name
>   },
>   rating := math::mean(.reviews.score)
> } filter "Zendaya" in .actors.name;
>
>
> # SQL
>
> SELECT
>   title,
>   Actors.name AS actor_name,
>   (SELECT avg(score)
> FROM Movie_Reviews
> WHERE movie_id = Movie.id) AS rating
> FROM
>   Movie
>   LEFT JOIN Movie_Actors ON
> Movie.id = Movie_Actors.movie_id
>   LEFT JOIN Person AS Actors ON
> Movie_Actors.person_id = Person.id
> WHERE
>   'Zendaya' IN (
> SELECT Person.name
> FROM
>   Movie_Actors
>   INNER JOIN Person
> ON Movie_Actors.person_id = Person.id
> WHERE
>   Movie_Actors.movie_id = Movie.id)
>
>
> The key observation here is that specifics around join kinds and such don't
> often need to be directly expressed in the query.
>
> I'd need to dig deeper around it (such as do they share... ) but it does
> make a nice first impression in demos.
>
>
> On Mon, Oct 23, 2023, 7:00 AM XQ Hu via dev  wrote:
>
>> +1 on your proposal.
>>
>> On Fri, Oct 20, 2023 at 4:59 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> On Fri, Oct 20, 2023 at 11:35 AM Kenneth Knowles 
>>> wrote:
>>> >
>>> > A couple other bits on having an expression language:
>>> >
>>> >  - You already have Python lambdas at places, right? so that's quite a
>>> lot more complex than SQL project/aggregate expressions
>>> >  - It really does save a lot of pain for users (at the cost of
>>> implementation complexity) when you need to "SUM(col1*col2)" where
>>> otherwise you have to Map first. This could be viewed as desirable as well,
>>> of course.
>>> >
>>> > Anyhow I'm pretty much in agreement with all your reasoning as to why
>>> *not* to use SQL-like expressions in strings. But it does seem odd when
>>> juxtaposed with Python snippets.
>>>
>>> Well, we say "here's a Python expression" when we're using a Python
>>> string. But "SUM(col1*col2)" isn't as transparent. (Agree about the
>>> niceties of being able to provide an expression rather than a column.)
>>>
>>> > On Thu, Oct 19, 2023 at 4:00 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>> >>
>>> >> On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax  wrote:
>>> >> >
>>> >> > Is the schema Group transform (in Java) something along these lines?
>>> >>
>>> >> Yes, for sure it is. It (and Python's and Typescript's equivalent) are
>>> >> linked in the original post. The open question is how to best express
>>> >> this in YAML.
>>> >>
>>> >> > On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>> >> >>
>>> >> >> Beam Yaml has good support for IOs and mappings, but one key
>>> missing
>>> >> >> feature for even writing a WordCount is the ability to do
>>> Aggregations
>>> >> >> [1]. While the traditional Beam primitive is GroupByKey (and
>>> >> >> CombineValues), we're eschewing KVs in the notion of more schema'd
>>> >> >> data (which has some precedence in our other languages, see the
>>> links
>>> >> >> below). The key components the user needs to specify are (1) the
>>> key
>>> >> >> fields on which the grouping will take place, (2) the fields
>>> >> >> (expressions?) involved in the aggregation, and (3) what
>>> aggregating
>>> >> >> fn to use.
>>> >> >>
>>> >> >> A straw-man example could be something like
>>> >> >>
>>> >> >> type: Aggregating
>>> >> >> config:
>>> >> >>   key: [field1, field2]
>>> >> >>   aggregating:
>>> >&
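
For comparison, the Java schema Group transform mentioned above expresses
the same three components (key fields, aggregated fields, and combine fns)
like this, with illustrative field names and assuming `input` is a
PCollection<Row> with a matching schema:

import org.apache.beam.sdk.schemas.transforms.Group;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

PCollection<Row> aggregated =
    input.apply(
        Group.<Row>byFieldNames("field1", "field2")
            .aggregateField("price", Sum.ofDoubles(), "total_price")
            .aggregateField("price", Count.combineFn(), "num_sales"));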

Re: Streaming update compatibility

2023-10-30 Thread Kenneth Knowles
+1 million to this.

I think this could be a real game-changer. I would even more forcefully say
update compatibility has pushed our development style into the "never make
significant changes" or "every significant change is wildly more complex
than it should be" territory. It forces our first draft to be our final
draft, much more so than abstraction-based backwards compatibility, because
it requires freezing many implementation details as well.

And just to put more non-subjective data behind my +1, I have used this
approach many times in situations where a new version of a service rolled
out while still serving older clients (using URL as the flag). It is a
tried-and-true technique and connecting it to Beam is like an epiphany.
Hooray!

The easiest way to ensure clean code is to make older versions more like
straight line code, flattening out cyclomatic complexity by forking
transforms at the top level. In other words FooIO.read() immediately
delegates to FooIO_2_48.read(). You shouldn't be checking this flag at a
bunch of separate places inside an IO. In fact I might say that should be
largely forbidden and it should only be used as a "routing" flag.

Kenn

On Wed, Oct 25, 2023 at 8:25 PM Robert Bradshaw via dev 
wrote:

> Dataflow (among other runners) has the ability to "upgrade" running
> pipelines with new code (e.g. capturing bug fixes, dependency updates,
> and limited topology changes). Unfortunately some improvements (e.g.
> new and improved ways of writing to BigQuery, optimized use of side
> inputs, a change in algorithm, sometimes completely internally and not
> visible to the user) are not sufficiently backwards compatible which
> causes us, with the motivation to not break users, to either not make
> these changes or guard them as a parallel opt-in mode which is a
> significant drain on both developer productivity and causes new
> pipelines to run in obsolete modes by default.
>
> I created https://github.com/apache/beam/pull/29140 which adds a new
> pipeline option, update_compatibility_version, that allows the SDK to
> move forward while letting users with pipelines launched previously to
> manually request the "old" way of doing things to preserve update
> compatibility. (We should still attempt backwards compatibility when
> it makes sense, and the old way would remain in code until such a time
> it's actually deprecated and removed, but this means we won't be
> constrained by it, especially when it comes to default settings.)
>
> Any objections or other thoughts on this approach?
>
> - Robert
>
> P.S. Separately I think it'd be valuable to elevate the vague notion
> of update compatibility to a first-class Beam concept and put it on
> firm footing, but that's a larger conversation outside the thread of
> this smaller (and I think still useful in such a future world) change.
>


Re: [Discuss] Idea to increase RC voting participation

2023-10-25 Thread Kenneth Knowles
Agree. As long as we are getting enough of them, then our records as well
as any automation depending on it are fine. One easy and standard way to
make it more resilient would be to make it idempotent instead of counting
on uptime or receiving any particular event.

Kenn

On Tue, Oct 24, 2023 at 2:58 PM Danny McCormick 
wrote:

> Looks like for some reason the workflow didn't trigger. This is running on
> GitHub's hosted runners, so my best guess is an outage.
>
> Looking at a more refined query, this year there have been 14 issues that
> were missed by the automation (3 had their milestone manually removed) -
> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01
>  out
> of 605 total -
> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01+
>  -
> as best I can tell there were a small number of workflow flakes and then
> GHA didn't correctly trigger a few.
>
> If we wanted, we could set up some recurring automation to go through and
> try to pick up the ones without milestones (or modify our existing
> automation to be more tolerant to failures), but it doesn't seem super
> urgent to me (feel free to disagree). I don't think this piece needs to be
> perfect.
>
> On Tue, Oct 24, 2023 at 2:40 PM Kenneth Knowles  wrote:
>
>> Just grabbing one at random for an example,
>> https://github.com/apache/beam/issues/28635 seems like it was closed as
>> completed but not tagged.
>>
>> I'm happy to see that the bot reads the version from the repo to find the
>> appropriate milestone, rather than using the nearest open one. Just
>> recording that for the thread since I first read the description as the
>> latter.
>>
>> Kenn
>>
>> On Tue, Oct 24, 2023 at 2:34 PM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> We do tag issues to milestones when the issue is marked as "completed"
>>> (as opposed to "not planned") -
>>> https://github.com/apache/beam/blob/master/.github/workflows/assign_milestone.yml.
>>> So I think using issues is probably about as accurate as using commits.
>>>
>>> > It looks like we have 820 with no milestone
>>> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>>>
>>> Most predate the automation, though maybe not all? Some of those may
>>> have been closed as "not planned".
>>>
>>> > This could (should) be automatically discoverable. A (closed) issues
>>> is associated with commits which are associated with a release.
>>>
>>> Today, we just tag issues to the upcoming milestone when they're closed.
>>> In theory you could do something more sophisticated using linked commits,
>>> but in practice people aren't clean enough about linking commits to issues.
>>> Again, this is fixable by automation/enforcement, but I don't think it
>>> actually gives us much value beyond what we have today.
>>>
>>> On Tue, Oct 24, 2023 at 1:54 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> On Tue, Oct 24, 2023 at 10:35 AM Kenneth Knowles 
>>>> wrote:
>>>>
>>>>> Tangentially related:
>>>>>
>>>>> Long ago, attaching an issue to a release was a mandatory step as part
>>>>> of closing. Now I think it is not. Is it automatically happening? It looks
>>>>> like we have 820 with no milestone
>>>>> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>>>>>
>>>>
>>>> This could (should) be automatically discoverable. A (closed) issue is
>>>> associated with commits which are associated with a release.
>>>>
>>>>
>>>>> On Tue, Oct 24, 2023 at 1:25 PM Chamikara Jayalath via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> +1 for going by the commits since this is what matters at the end of
>>>>>> the day. Also, many issues may not get tagged correctly for a given 
>>>>>> release
>>>>>> due to either the contributor not tagging the issue or due to commits for
>>>>>> the issue spanning multiple Beam releases.
>>>>>>
>>>>>> For example,
>>>>>>
>>>>>> For all commits in a given release RC:
>>>>>>   * If we find a Github issue for the commit: add a notice 

Re: [Discuss] Idea to increase RC voting participation

2023-10-24 Thread Kenneth Knowles
Just grabbing one at random for an example,
https://github.com/apache/beam/issues/28635 seems like it was closed as
completed but not tagged.

I'm happy to see that the bot reads the version from the repo to find the
appropriate milestone, rather than using the nearest open one. Just
recording that for the thread since I first read the description as the
latter.

Kenn

On Tue, Oct 24, 2023 at 2:34 PM Danny McCormick via dev 
wrote:

> We do tag issues to milestones when the issue is marked as "completed" (as
> opposed to "not planned") -
> https://github.com/apache/beam/blob/master/.github/workflows/assign_milestone.yml.
> So I think using issues is probably about as accurate as using commits.
>
> > It looks like we have 820 with no milestone
> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>
> Most predate the automation, though maybe not all? Some of those may have
> been closed as "not planned".
>
> > This could (should) be automatically discoverable. A (closed) issue is
> associated with commits which are associated with a release.
>
> Today, we just tag issues to the upcoming milestone when they're closed.
> In theory you could do something more sophisticated using linked commits,
> but in practice people aren't clean enough about linking commits to issues.
> Again, this is fixable by automation/enforcement, but I don't think it
> actually gives us much value beyond what we have today.
>
> On Tue, Oct 24, 2023 at 1:54 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> On Tue, Oct 24, 2023 at 10:35 AM Kenneth Knowles  wrote:
>>
>>> Tangentially related:
>>>
>>> Long ago, attaching an issue to a release was a mandatory step as part
>>> of closing. Now I think it is not. Is it automatically happening? It looks
>>> like we have 820 with no milestone
>>> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>>>
>>
>> This could (should) be automatically discoverable. A (closed) issue is
>> associated with commits which are associated with a release.
>>
>>
>>> On Tue, Oct 24, 2023 at 1:25 PM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> +1 for going by the commits since this is what matters at the end of
>>>> the day. Also, many issues may not get tagged correctly for a given release
>>>> due to either the contributor not tagging the issue or due to commits for
>>>> the issue spanning multiple Beam releases.
>>>>
>>>> For example,
>>>>
>>>> For all commits in a given release RC:
>>>>   * If we find a Github issue for the commit: add a notice to the
>>>> Github issue
>>>>   * Else: add the notice to a generic issue for the release including
>>>> tags for the commit ID, PR author, and the committer who merged the PR.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Oct 23, 2023 at 11:49 AM Danny McCormick via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> I'd probably vote to include both the issue filer and the contributor.
>>>>> It is pretty equally straightforward - one way to do this would be using
>>>>> all issues related to that release's milestone and extracting the issue
>>>>> author and the issue closer.
>>>>>
>>>>> This does leave out the (unfortunately sizable) set of contributions
>>>>> that don't have an associated issue; if we're worried about that, we could
>>>>> always fall back to anyone with a commit in the last release who doesn't
>>>>> have an associated issue (aka what I thought we were initially proposing
>>>>> and what I think Airflow does today).
>>>>>
>>>>> I'm pretty much +1 on any sort of automation here, and it
>>>>> certainly can come in stages :)
>>>>>
>>>>> On Mon, Oct 23, 2023 at 1:50 PM Johanna Öjeling via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> Yes that's a good point to include also those who created the issue.
>>>>>>
>>>>>> On Mon, Oct 23, 2023, 19:18 Robert Bradshaw via dev <
>>>>>> dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> On Mon, Oct 23, 2023 at 7:26 AM Danny McCormick via dev <
>>>>>>> dev@

Re: [Discuss] Idea to increase RC voting participation

2023-10-24 Thread Kenneth Knowles
Tangentially related:

Long ago, attaching an issue to a release was a mandatory step as part of
closing. Now I think it is not. Is it automatically happening? It looks
like we have 820 with no milestone
https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed

Kenn

On Tue, Oct 24, 2023 at 1:25 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> +1 for going by the commits since this is what matters at the end of the
> day. Also, many issues may not get tagged correctly for a given release due
> to either the contributor not tagging the issue or due to commits for the
> issue spanning multiple Beam releases.
>
> For example,
>
> For all commits in a given release RC:
>   * If we find a Github issue for the commit: add a notice to the Github
> issue
>   * Else: add the notice to a generic issue for the release including tags
> for the commit ID, PR author, and the committer who merged the PR.
>
> Thanks,
> Cham
>
>
>
>
> On Mon, Oct 23, 2023 at 11:49 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> I'd probably vote to include both the issue filer and the contributor. It
>> is pretty equally straightforward - one way to do this would be using all
>> issues related to that release's milestone and extracting the issue author
>> and the issue closer.
>>
>> This does leave out the (unfortunately sizable) set of contributions that
>> don't have an associated issue; if we're worried about that, we could
>> always fall back to anyone with a commit in the last release who doesn't
>> have an associated issue (aka what I thought we were initially proposing
>> and what I think Airflow does today).
>>
>> I'm pretty much +1 on any sort of automation here, and it certainly can
>> come in stages :)
>>
>> On Mon, Oct 23, 2023 at 1:50 PM Johanna Öjeling via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Yes that's a good point to include also those who created the issue.
>>>
>>> On Mon, Oct 23, 2023, 19:18 Robert Bradshaw via dev 
>>> wrote:
>>>
 On Mon, Oct 23, 2023 at 7:26 AM Danny McCormick via dev <
 dev@beam.apache.org> wrote:

> So to summarize, I think there's broad consensus (or at least lazy
> consensus) around the following:
>
> - (1) Updating our release email/guidelines to be more specific about
> what we mean by release validation/how to be helpful during this process.
> This includes both encouraging validation within each user's own code base
> and encouraging people to document/share their process of validation and
> link it in the release spreadsheet.
> - (2) Doing something like what Airflow does (#29424) and creating an
> issue asking people who have contributed to the current release to help
> validate their changes.
>
> I'm also +1 on doing both of these. The first bit (updating our
> guidelines) is relatively easy - it should just require updating
> https://github.com/apache/beam/blob/master/contributor-docs/release-guide.md#vote-and-validate-the-release-candidate
> .
>
> I took a look at the second piece (copying what Airflow does) to see
> if we could just copy their automation, but it looks like it's tied
> to airflow breeze (their repo-specific automation tooling), so we'd probably need to build
> the automation ourselves. It shouldn't be terrible, basically we'd want a
> GitHub Action that compares the current release tag with the last release
> tag, grabs all the commits in between, parses them to get the author, and
> creates an issue with that data, but it does represent more effort than
> just updating a markdown file. There might even be an existing Action that
> can help with this, I haven't looked too hard.
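(For concreteness, a minimal sketch of such an Action; the workflow name, tag pattern, and issue wording are assumptions, not an existing Beam workflow:)

name: Create release validation issue
on:
  push:
    tags: ['v*']
jobs:
  create-issue:
    runs-on: ubuntu-latest
    permissions:
      issues: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so the last two release tags can be diffed
      - name: Collect contributors between the last two release tags
        run: |
          PREV=$(git describe --tags --abbrev=0 "${GITHUB_REF_NAME}^")
          git log --format='%an' "${PREV}..${GITHUB_REF_NAME}" | sort -u > authors.txt
      - name: Open the validation issue
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh issue create --title "Help validate ${GITHUB_REF_NAME}" \
            --body "Your change is in this release; please help validate it: $(cat authors.txt | tr '\n' ' ')"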
>

 I was thinking along the lines of a script that would scrape the issues
 resolved in a given release and add a comment to them noting that the
 change is in release N and encouraging (with clear instructions) how this
 can be validated. Creating a "validate this release" issue with all
 "contributing" participants could be an interesting way to do this as well.
 (I think it'd be valuable to get those who filed the issue, not just those
 who fixed it, to validate.)


> As our next release manager, I'm happy to review PRs for either of
> these if anyone wants to volunteer to help out. If not, I'm happy to 
> update
> the guidelines, but I probably won't have time to add the commit 
> inspection
> tooling (I'm planning on throwing any extra time towards continuing to
> automate release candidate creation which is currently a more impactful
> problem IMO). I would very much like it if both of these things happened
> though :)
>
> Thanks,
> Danny
>
> On Mon, 

Re: [YAML] Aggregations

2023-10-20 Thread Kenneth Knowles
A couple other bits on having an expression language:

 - You already have Python lambdas in places, right? So that's quite a lot
more complex than SQL project/aggregate expressions
 - It really does save a lot of pain for users (at the cost of
implementation complexity) when you need to "SUM(col1*col2)" where
otherwise you have to Map first. This could be viewed as desirable as well,
of course.
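
(To make the second point concrete: a hypothetical sketch of the "Map first" workaround, reusing MapToFields and the Combine straw-man shape from this thread; illustrative syntax only, not something the SDK defines today:)

- type: MapToFields
  config:
    language: python
    append: true
    fields:
      weighted: "col1 * col2"
- type: Combine  # straw-man shape, per the Combine sketch elsewhere in this thread
  config:
    group_by: [field1]
    aggregates:
      total: "SUM(weighted)"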

Anyhow I'm pretty much in agreement with all your reasoning as to why *not*
to use SQL-like expressions in strings. But it does seem odd when
juxtaposed with Python snippets.

Kenn

On Thu, Oct 19, 2023 at 4:00 PM Robert Bradshaw via dev 
wrote:

> On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax  wrote:
> >
> > Is the schema Group transform (in Java) something along these lines?
>
> Yes, for sure it is. It (and Python's and Typescript's equivalent) are
> linked in the original post. The open question is how to best express
> this in YAML.
>
> > On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >>
> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> feature for even writing a WordCount is the ability to do Aggregations
> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> CombineValues), we're eschewing KVs in the notion of more schema'd
> >> data (which has some precedence in our other languages, see the links
> >> below). The key components the user needs to specify are (1) the key
> >> fields on which the grouping will take place, (2) the fields
> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> fn to use.
> >>
> >> A straw-man example could be something like
> >>
> >> type: Aggregating
> >> config:
> >>   key: [field1, field2]
> >>   aggregating:
> >> total_cost:
> >>   fn: sum
> >>   value: cost
> >> max_cost:
> >>   fn: max
> >>   value: cost
> >>
> >> This would basically correspond to the SQL expression
> >>
> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> >> from table GROUP BY field1, field2"
> >>
> >> (though I'm not requiring that we use this as an implementation
> >> strategy). I do not think we need a separate (non aggregating)
> >> Grouping operation, this can be accomplished by having a concat-style
> >> combiner.
> >>
> >> There are still some open questions here, notably around how to
> >> specify the aggregation fns themselves. We could of course provide a
> >> number of built-ins (like SQL does). This gets into the question of
> >> how and where to document this complete set, but some basics should
> >> take us pretty far. Many aggregators, however, are parameterized (e.g.
> >> quantiles); where do we put the parameters? We could go with something
> >> like
> >>
> >> fn:
> >>   type: ApproximateQuantiles
> >>   config:
> >> n: 10
> >>
> >> but others are even configured by functions themselves (e.g. LargestN
> >> that wants a comparator Fn). Maybe we decide not to support these
> >> (yet?)
> >>
> >> One thing I think we should support, however, is referencing custom
> >> CombineFns. We have some precedent for this with our Fns from
> >> MapToFields, where we accept things like inline lambdas and external
> >> references. Again the topic of how to configure them comes up, as
> >> these custom Fns are more likely to be parameterized than Map Fns
> >> (though, to be clear, perhaps it'd be good to allow parameterizatin of
> >> MapFns as well). Maybe we allow
> >>
> >> language: python  # like MapToFields (and here it'd be harder to mix
> >> and match per Fn)
> >> fn:
> >>   type: ???
> >>   # should these be nested as config?
> >>   name: fully.qualified.name
> >>   path: /path/to/defining/file
> >>   args: [...]
> >>   kwargs: {...}
> >>
> >> which would invoke the constructor.
> >>
> >> I'm also open to other ways of naming/structuring these essential
> >> parameters if it makes things more clear.
> >>
> >> - Robert
> >>
> >>
> >> Java:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> >> Python:
> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> >> Typescript:
> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
> >>
> >> [1] One can of course use SqlTransform for this, but I'm leaning
> >> towards offering something more native.
>


Re: Reshuffle PTransform Design Doc

2023-10-20 Thread Kenneth Knowles
On Fri, Oct 20, 2023 at 3:29 AM Jan Lukavský  wrote:

> Yes, I'm aware that Beam is not defined in terms of runner convenience,
> but in terms of data transform primitives. On the other hand - looking
> at that from specific perspective - even though stateless shuffle does
> not change the data itself, it changes distribution of data with
> relation to partitioning, which is a property of the data as well. Not
> only the data itself, but also any metadata about it might be viewed as
> something that characterizes a PCollection and as something that can be
> manipulated. Hence a transform can be given a proper data-related
> semantics, even though it is a nop with regards to actual _contents_ of
> a PCollection. Having said that, my point was that if Redistribute was a
> defined fundamental primitive, it would immediately follow that it
> should not be implemented as GBK-unGBK (at least not without extra care),
> because it leads to problems with the stateful GBK introducing
> unexpected side-effects, which from my understanding was the initial
> problem that started this thread.
>

Agree with all this. It would be interesting, and I think I've seen a
system that does it but I forget which one, to have a model whereby the
metadata of collections is explicitly treated but also held independent of
the contents of the collection. An interesting problem to allow authoring
well-defined computations in such a model. This really is a Friday
discussion :-)

Kenn


Best,
>
>   Jan
>
> On 10/19/23 20:26, Kenneth Knowles wrote:
> > Well I accidentally conflated "stateful" and "persisting", but anyhow
> > yea we aren't targeting to have one Beam primitive for each thing that
> > is probably a runner primitive.
> >
> > On Thu, Oct 19, 2023 at 2:25 PM Kenneth Knowles  wrote:
> >> On Fri, Oct 13, 2023 at 12:51 PM Jan Lukavský  wrote:
> >>> Hi,
> >>>
> >>> I think there's been already said nearly everything in this thread,
> but ... it is time for Friday discussions. :)
> >>>
> >>> Today I recalled of a discussion we've had long time ago, when we were
> designing Euphoria (btw, deprecating and removing it is still on my todo
> list, I should create a vote thread for that). We had 4 primitives:
> >>>
> >>>   a) non-shuffle, stateless ~ stateless ParDo
> >>>
> >>>   b) shuffle, stateful ~ stateful ParDo, with the ability (under the
> right circumstances,  i.e. defined event-time trigger, defined state merge
> function, ...) to be performed in a "combinable way".
> >>>
> >>>   c) shuffle, stateless ~ Reshuffle
> >>>
> >>>   d) non-shuffle, stateful - nope, makes no sense :) - part of the
> "combinable stateful shuffle operation"
> >>>
> >>>   e) union ~ Flatten
> >>>
> >>> Turns out you can build everything bottom up from these.
> >>>
> >>> Now, the not-so-well defined semantics of Reshuffle (Redistribute)
> might arise from the fact it is not a primitive. Stateless shuffling of
> data is definitely a primitive of all runners.
> >> Not Dataflow :-)
> >>
> >> But more importantly, Beam primitives are deliberately chosen to be
> >> fundamental data operations, not physical plan steps that a runner
> >> might use. In other words, Beam is decidedly _not_ a library for
> >> building composites that eventually are constructed from runner
> >> primitives. It is more like SQL in that it is a library for building
> >> composites that eventually are constructed from fundamental operations
> >> on data, that every engine (like every RDBMS) will be able to
> >> implement in its own way.
> >>
> >> Kenn
> >>
> >>> Therefore here goes the question - should Redistribute be a primitive
> and not be built up from other transforms?
> >>>
> >>> Best,
> >>>
> >>>   Jan
> >>>
> >>> On 10/6/23 21:12, Kenneth Knowles wrote:
> >>>
> >>>
> >>>
> >>> On Fri, Oct 6, 2023 at 3:07 PM Jan Lukavský  wrote:
> >>>>
> >>>> On 10/6/23 15:11, Kenneth Knowles wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský  wrote:
> >>>>> Hi,
> >>>>>
> >>>>> there is also one other thing to mention with relation to
> Reshuffle/RequiresStableinput and that is that our current implementation
> of RequiresStableInput can break without Reshuffle in s

Re: [NOTICE] Deprecation Avro classes in "core" and use "extensions/avro" instead for Java SDK

2023-10-19 Thread Kenneth Knowles
W

On Wed, Oct 18, 2023 at 4:19 PM Byron Ellis via dev 
wrote:

> Awesome!
>
> On Wed, Oct 18, 2023 at 1:14 PM Alexey Romanenko 
> wrote:
>
>> Heads up!
>>
>> Finally, all the Avro-related code and the Avro dependency that were
>> deprecated before (see the message above) have been removed from the Beam
>> Java SDK “core” module [1]. We believe that a sufficient number of Beam
>> releases (six!) have passed since this code was deprecated, and users have
>> had an opportunity to switch to the new Avro extension as recommended
>> before.
>>
>> We did our best to make this transition as smooth as possible but, please,
>> let me know if you find any failing tests or any other strange behavior
>> because of this change.
>>
>> Thanks,
>> Alexey
>>
>>
>> [1] https://github.com/apache/beam/pull/27851/
>>
>>
>> On 22 Feb 2023, at 20:21, Robert Bradshaw via dev 
>> wrote:
>>
>> Thanks for pushing this through!
>>
>> On Wed, Feb 22, 2023 at 10:38 AM Alexey Romanenko
>>  wrote:
>>
>>
>> Hi all,
>>
>> As a part of migrating the Avro-related classes from the Java SDK “core”
>> module to a dedicated extension [1] (as discussed here [2] and here [3]),
>> two important PRs have been merged [4][5]. Therefore, the old Avro-related
>> classes became deprecated in “core” (still possible to use but not
>> recommended) and all other Beam modules that depended on them switched to
>> using "extensions/avro” instead.
>>
>> We did our best to make this change smooth, compatible, and non-breaking
>> but, since this was one of the oldest parts of “core”, anything is,
>> unfortunately, possible and we could have missed something despite all our
>> efforts. So, considering that, I’d like to ask the community to run any
>> kind of tests or pipelines that use, for example, AvroCoder or AvroUtils
>> or any other related Avro classes, and check that the new changes don’t
>> break anything and everything works as expected.
>>
>> —
>> Alexey
>>
>> [1] https://github.com/apache/beam/issues/24292
>> [2] https://lists.apache.org/thread/mz8hvz8dwhd0tzmv2lyobhlz7gtg4gq7
>> [3] https://lists.apache.org/thread/47oz1mlwj0orvo1458v5pw5c20bwt08q
>> [4] https://github.com/apache/beam/pull/24992
>> [5] https://github.com/apache/beam/pull/25534
>>
>>
>>
>>


Re: [Discuss] Idea to increase RC voting participation

2023-10-19 Thread Kenneth Knowles
+1 to more helpful guide on "how to usefully participate in RC validation"
but also big +1 to Robert, Jack, Johanna.

TL;DR the RC validation is an opportunity for downstream testing.

Robert alluded to the origin of the spreadsheet: I created it long ago to
validate that the human language on our web page actually works. Maybe
someone should automate that with an LLM now.

Robert also alluded to clean environment: our gradle scripts and GHA
scripts and CI environment are heavily enough engineered that they don't
represent what a user will experience. We could potentially use our starter
repos for an adequate smoke test here.

Those are both ways that *we* can pretend to be users. But actual users
checking the RC to make sure they'll have a smooth upgrade is by far the
most impactful validation.

This thread honestly makes me want to delete the spreadsheet but maybe come
up with a guide for downstream projects to validate against an RC. Maybe
that's an extreme reaction...
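
For what it's worth, a minimal sketch of what such a downstream check could look like for a Python user (the checkout layout, test path, and input name are assumptions; it relies on RCs being published to PyPI as rcN versions):

name: Validate Beam RC
on:
  workflow_dispatch:
    inputs:
      beam_version:
        description: 'RC version, e.g. 2.51.0rc1'
        required: true
jobs:
  test-against-rc:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install "apache-beam[gcp]==${{ inputs.beam_version }}"
      - run: pytest tests/  # the downstream project's own pipeline tests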

Kenn

On Wed, Oct 18, 2023 at 2:32 PM Robert Bradshaw via dev 
wrote:

> +1 That's a great idea. They have incentive to make sure the issue was
> resolved for them, plus we get to ensure there were no other regressions.
>
> On Wed, Oct 18, 2023 at 11:30 AM Johanna Öjeling via dev <
> dev@beam.apache.org> wrote:
>
>> When I have contributed to Apache Airflow, they have tagged all
>> contributors concerned in a GitHub issue when the RC is available and asked
>> us to validate it. Example: #29424.
>>
>> I found that to be an effective way to notify contributors of the RC and
>> nudge them to help out. In the issue description there is a reference to
>> the guidelines on how to test the RC and a note that people are encouraged
>> to vote on the mailing list (which could admittedly be more highlighted
>> because I did not pay attention to it until now and was unaware that
>> contributors had a vote).
>>
>> It might be an idea to consider something similar here to increase the
>> participation?
>>
>> On Tue, Oct 17, 2023 at 7:01 PM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> I'm +1 on helping explain what we mean by "validate the RC" since we're
>>> really just asking users to see if their existing use cases work along with
>>> our typical slate of tests. I don't know if offloading that work to our
>>> active validators is the right approach though, documentation/screen share
>>> of their specific workflow is definitely less useful than having a more
>>> general outline of how to install the RC and things to look out for when
>>> testing.
>>>
>>> On Tue, Oct 17, 2023 at 12:55 PM Austin Bennett 
>>> wrote:
>>>
 Great effort.  I'm also interested in streamlining releases -- so if
 there are a lot of manual tests that could be automated, it would be great
 to discover and then look to address them.

 On Tue, Oct 17, 2023 at 8:47 AM Robert Bradshaw via dev <
 dev@beam.apache.org> wrote:

> +1
>
> I would also strongly suggest that people try out the release against
> their own codebases. This has the benefit of ensuring the release won't
> break your own code when it goes out, and stress-tests the new code against
> against
> real-world pipelines. (Ideally our own tests are all passing, and this
> validation is automated as much as possible (though ensuring it matches 
> our
> documentation and works in a clean environment still has value), but
> there's a lot of code and uses out there that we don't have access to
> during normal Beam development.)
>
> On Tue, Oct 17, 2023 at 8:21 AM Svetak Sundhar via dev <
> dev@beam.apache.org> wrote:
>
>> Hi all,
>>
>> I’ve participated in RC testing for a few releases and have observed
>> a bit of a knowledge gap in how releases can be tested. Given that Beam
>> encourages contributors to vote on RC’s regardless of tenure, and that
>> voting on an RC is a relatively low-effort, high leverage way to 
>> influence
>> the release of the library, I propose the following:
>>
>> During the vote for the next release, voters can document the process
>> they followed in a separate document, and add the link in column G
>> here.
>> One step further could be a screencast of running the test, and attaching
>> attaching
>> a link of that.
>>
>> We can keep repeating this through releases until we have
>> documentation for many of the different tests. We can then add these docs
>> into the repo.
>>
>> I’m proposing this because I’ve gathered the following feedback from
>> colleagues that are tangentially involved with Beam: They are interested 
>> in
>> participating in release validation, but don’t know how to get started.
>> Happy to hear other suggestions too, if there are a

Re: [DISCUSS] Drop Euphoria extension

2023-10-19 Thread Kenneth Knowles
Makes sense to me. Let's deprecate for the 2.52.0 release unless there is
some objection. You can also look at the maven central downloads (I believe
all PMC and maybe all committers can view this) compared to other Beam jars.

Kenn

On Mon, Oct 16, 2023 at 9:28 AM Jan Lukavský  wrote:

> Sure, that would probably be the preferred way to go. For now, I'm
> trying to get some feedback on whether there are real-world users who
> might miss the API. Currently, the only value I see is that Euphoria
> adds an additional level of indirection for user code. The expansion
> goes like this:
>
>   Euphoria Pipeline -> runtime provided translators -> vanilla Beam
> Pipeline -> runner
>
> Hence code written using the Euphoria extension can be modified at runtime
> (Pipeline construction time) using dependency injection, which brings
> the value that users can modify (typically optimize) Pipelines without
> actually modifying the business logic. On the other hand I'm not sure if
> this justifies the complexity of the extension. Were this the only
> value, it should be possible to implement such dynamic expansion either
> into Java SDK core or as a different light-weight extension.
>
>   Jan
>
> On 10/16/23 15:10, Alexey Romanenko wrote:
> > Can we just deprecate it for a while and then remove completely?
> >
> > —
> > Alexey
> >
> >> On 13 Oct 2023, at 18:59, Jan Lukavský  wrote:
> >>
> >> Hi,
> >>
> >> it has been some time since the Euphoria extension [1] was adopted by
> Beam as a possible "Java 8 API". Beam has evolved from that time a lot, the
> current API seems actually more elegant than the original Euphoria's and
> last but not least, it has no maintainers and no known users. If there are
> any users, please speak up!
> >>
> >> Otherwise I'd like to propose to drop it from the codebase; I'll start a
> vote thread during next week, if there are no objections.
> >>
> >> Best,
> >>
> >>   Jan
> >>
> >> [1] https://beam.apache.org/documentation/sdks/java/euphoria/
> >>
>


Re: Reshuffle PTransform Design Doc

2023-10-19 Thread Kenneth Knowles
Well I accidentally conflated "stateful" and "persisting", but anyhow
yea we aren't targeting to have one Beam primitive for each thing that
is probably a runner primitive.

On Thu, Oct 19, 2023 at 2:25 PM Kenneth Knowles  wrote:
>
> On Fri, Oct 13, 2023 at 12:51 PM Jan Lukavský  wrote:
> >
> > Hi,
> >
> > I think there's been already said nearly everything in this thread, but ... 
> > it is time for Friday discussions. :)
> >
> > Today I recalled of a discussion we've had long time ago, when we were 
> > designing Euphoria (btw, deprecating and removing it is still on my todo 
> > list, I should create a vote thread for that). We had 4 primitives:
> >
> >  a) non-shuffle, stateless ~ stateless ParDo
> >
> >  b) shuffle, stateful ~ stateful ParDo, with the ability (under the right 
> > circumstances,  i.e. defined event-time trigger, defined state merge 
> > function, ...) to be performed in a "combinable way".
> >
> >  c) shuffle, stateless ~ Reshuffle
> >
> >  d) non-shuffle, stateful - nope, makes no sense :) - part of the 
> > "combinable stateful shuffle operation"
> >
> >  e) union ~ Flatten
> >
> > Turns out you can build everything bottom up from these.
> >
> > Now, the not-so-well defined semantics of Reshuffle (Redistribute) might 
> > arise from the fact it is not a primitive. Stateless shuffling of data is 
> > definitely a primitive of all runners.
>
> Not Dataflow :-)
>
> But more importantly, Beam primitives are deliberately chosen to be
> fundamental data operations, not physical plan steps that a runner
> might use. In other words, Beam is decidedly _not_ a library for
> building composites that eventually are constructed from runner
> primitives. It is more like SQL in that it is a library for building
> composites that eventually are constructed from fundamental operations
> on data, that every engine (like every RDBMS) will be able to
> implement in its own way.
>
> Kenn
>
> >
> > Therefore here goes the question - should Redistribute be a primitive and 
> > not be built up from other transforms?
> >
> > Best,
> >
> >  Jan
> >
> > On 10/6/23 21:12, Kenneth Knowles wrote:
> >
> >
> >
> > On Fri, Oct 6, 2023 at 3:07 PM Jan Lukavský  wrote:
> >>
> >>
> >> On 10/6/23 15:11, Kenneth Knowles wrote:
> >>
> >>
> >>
> >> On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský  wrote:
> >>>
> >>> Hi,
> >>>
> >>> there is also one other thing to mention with relation to 
> >>> Reshuffle/RequiresStableinput and that is that our current implementation 
> >>> of RequiresStableInput can break without Reshuffle in some corner cases 
> >>> on most portable runners, at least with Java GreedyPipelineFuser, see 
> >>> [1]. The only way to work around this currently is inserting Reshuffle (or 
> >>> any other fusion-break transform) directly before the stable DoFn 
> >>> (Reshuffle is handy, because it does not change the data). I think we 
> >>> should either somehow fix the issue [1] or include fusion break as a 
> >>> mandatory requirement for the new Redistribute transform as well (at 
> >>> least with some variant) or possibly add a new "hint" for non-optional 
> >>> fusion breaking.
> >>
> >> This is actually the bug we have wanted to fix for years - redistribute 
> >> has nothing to do with checkpointing or stable input and Reshuffle 
> >> incorrectly merges the two concepts.
> >>
> >> I agree that we couldn't make any immediate change that will break a 
> >> runner. I believe runners that depend on Reshuffle to provide stable input 
> >> will also provide stable input after GroupByKey. Since the SDK expansion 
> >> of Reshuffle will still contain a GBK, those runners' functionality will 
> >> be unchanged.
> >>
> >> I don't yet have a firm opinion between these approaches:
> >>
> >> 1. Adjust the Java SDK implementation of Reshuffle (and maybe other SDKs 
> >> if needed). With some flag so that users can use the old wrong behavior 
> >> for update compatibility.
> >> 2. Add a Redistribute transform to the SDKs that has the right behavior 
> >> and leave Reshuffle as it is.
> >> 1+2. Add the Redistribute transform but also make Reshuffle call it, so 
> >> Reshuffle also gets the new behavior, with the same flag so that users can 
> >&

Re: Reshuffle PTransform Design Doc

2023-10-19 Thread Kenneth Knowles
On Fri, Oct 13, 2023 at 12:51 PM Jan Lukavský  wrote:
>
> Hi,
>
> I think there's been already said nearly everything in this thread, but ... 
> it is time for Friday discussions. :)
>
> Today I recalled of a discussion we've had long time ago, when we were 
> designing Euphoria (btw, deprecating and removing it is still on my todo 
> list, I should create a vote thread for that). We had 4 primitives:
>
>  a) non-shuffle, stateless ~ stateless ParDo
>
>  b) shuffle, stateful ~ stateful ParDo, with the ability (under the right 
> circumstances,  i.e. defined event-time trigger, defined state merge 
> function, ...) to be performed in a "combinable way".
>
>  c) shuffle, stateless ~ Reshuffle
>
>  d) non-shuffle, stateful - nope, makes no sense :) - part of the "combinable 
> stateful shuffle operation"
>
>  e) union ~ Flatten
>
> Turns out you can build everything bottom up from these.
>
> Now, the not-so-well defined semantics of Reshuffle (Redistribute) might 
> arise from the fact it is not a primitive. Stateless shuffling of data is 
> definitely a primitive of all runners.

Not Dataflow :-)

But more importantly, Beam primitives are deliberately chosen to be
fundamental data operations, not physical plan steps that a runner
might use. In other words, Beam is decidedly _not_ a library for
building composites that eventually are constructed from runner
primitives. It is more like SQL in that it is a library for building
composites that eventually are constructed from fundamental operations
on data, that every engine (like every RDBMS) will be able to
implement in its own way.

Kenn

>
> Therefore here goes the question - should Redistribute be a primitive and not 
> be built up from other transforms?
>
> Best,
>
>  Jan
>
> On 10/6/23 21:12, Kenneth Knowles wrote:
>
>
>
> On Fri, Oct 6, 2023 at 3:07 PM Jan Lukavský  wrote:
>>
>>
>> On 10/6/23 15:11, Kenneth Knowles wrote:
>>
>>
>>
>> On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský  wrote:
>>>
>>> Hi,
>>>
>>> there is also one other thing to mention with relation to 
>>> Reshuffle/RequiresStableinput and that is that our current implementation 
>>> of RequiresStableInput can break without Reshuffle in some corner cases on 
>>> most portable runners, at least with Java GreedyPipelineFuser, see [1]. The 
>>> only way to work around this currently is inserting Reshuffle (or any other 
>>> fusion-break transform) directly before the stable DoFn (Reshuffle is 
>>> handy, because it does not change the data). I think we should either 
>>> somehow fix the issue [1] or include fusion break as a mandatory 
>>> requirement for the new Redistribute transform as well (at least with some 
>>> variant) or possibly add a new "hint" for non-optional fusion breaking.
>>
>> This is actually the bug we have wanted to fix for years - redistribute has 
>> nothing to do with checkpointing or stable input and Reshuffle incorrectly 
>> merges the two concepts.
>>
>> I agree that we couldn't make any immediate change that will break a runner. 
>> I believe runners that depend on Reshuffle to provide stable input will also 
>> provide stable input after GroupByKey. Since the SDK expansion of Reshuffle 
>> will still contain a GBK, those runners' functionality will be unchanged.
>>
>> I don't yet have a firm opinion between these approaches:
>>
>> 1. Adjust the Java SDK implementation of Reshuffle (and maybe other SDKs if 
>> needed). With some flag so that users can use the old wrong behavior for 
>> update compatibility.
>> 2. Add a Redistribute transform to the SDKs that has the right behavior and 
>> leave Reshuffle as it is.
>> 1+2. Add the Redistribute transform but also make Reshuffle call it, so 
>> Reshuffle also gets the new behavior, with the same flag so that users can 
>> use the old wrong behavior for update compatibility.
>>
>> All of these will leave "Reshuffle for RequiresStableInput" alone for now. 
>> The options that include (2) will move us a little closer to migrating to a 
>> "better" future state.
>>
>> I might not have expressed it the right way. I understand that Reshuffle
>> having "stable input" functionality is a non-portable side effect. It would
>> be nice to get rid of it, and my impression from this thread was that we
>> would try to deprecate Reshuffle and introduce Redistribute, which will not
>> have such semantics. All of this is fine; the problem is that we currently
>> (in some corner cases) rely on Reshuffle *

Re: [YAML] Aggregations

2023-10-19 Thread Kenneth Knowles
Using SQL expressions in strings is maybe OK given we are all
relational all the time. Either way you have to define what the
universe of `fn` is. Here's a compact possibility:

type: Combine
config:
  group_by: [field1, field2]
  aggregates:
max_cost: "MAX(cost)"
total_cost: "SUM(cost)"

Just a thought to get it closer to SQL concision. I also used the word
"Combine" just to connect it to other Beam writings and whatnot.

Kenn

On Thu, Oct 19, 2023 at 1:41 PM Robert Bradshaw via dev
 wrote:
>
> On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský  wrote:
> >
> > On 10/19/23 18:28, Robert Bradshaw via dev wrote:
> > > On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis  wrote:
> > >> Rill is definitely SQL-oriented but I think that's going to be the most 
> > >> common. Dataframes are explicitly modeled on the relational approach so 
> > >> that's going to look a lot like SQL,
> > > I think pretty much any approach that fits here is going to be
> > > relational, meaning you choose a set of columns to group on, a set of
> > > columns to aggregate, and how to aggregate. The big open question is
> > > what syntax to use for the "how."
> > This might be already answered, if so, pardon my ignorance, but what is
> > the goal this declarative approach is trying to solve? Is it meant to be
> > more expressive or equally expressive than SQL? And if more, how much more?
>
> I'm not sure if you're asking about YAML in general, or the particular
> case of aggregation, but I can answer both.
>
> For the larger Beam YAML project, it's trying to solve the problem
> that SQL is (and I'll admit this is somewhat subjective here) good at
> expressing the T part of ETL, but not the other parts. For example,
> the simple data movement use case of (say) reading from PubSub and
> dumping into BigQuery is not well expressed in terms of SQL. SQL is
> also fairly awkward when it comes to defining UDFs and TDFs and
> non-linear pipelines (especially those with fanout). There are of
> course other tools in this space (dbt comes to mind, and there's been
> some investigation on how to make dbt play well with Beam). The other
> niche it is trying to solve is that installing and learning a full SDK
> is heavyweight and overkill for creating pipelines that are simply
> wiring together pre-defined transforms.
>
> As for the more narrow case of aggregations, I think being similarly
> expressive as SQL is fine, though it'd be good to make custom UDAFs
> more natural. Originally I was thinking that just having SqlTransform
> might be sufficient, but it feels like a big hammer to reach for every
> time I just want to sum over one or two columns.
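
(For reference, a sketch of the "big hammer" version, assuming SqlTransform is exposed to YAML as a Sql transform type; illustrative only:)

type: Sql
config:
  query: "SELECT key, SUM(cost) AS total_cost FROM PCOLLECTION GROUP BY key"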


[ANNOUNCE] Apache Beam 2.51.0 Released

2023-10-18 Thread Kenneth Knowles
The Apache Beam Team is pleased to announce the release of version 2.51.0.

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed
on the Beam Blog: https://beam.apache.org/blog/beam-2.51.0/ and the
Github release page
https://github.com/apache/beam/releases/tag/v2.51.0

Thanks to everyone who contributed to this release, and we hope you
enjoy using Beam 2.51.0.

Kenn, on behalf of the Apache Beam Team.


[ANNOUNCE] New Committer: Byron Ellis

2023-10-16 Thread Kenneth Knowles
Hi all,

Please join me and the rest of the Beam PMC in welcoming a new
committer: Byron Ellis (b...@apache.org).

Byron has been with Beam for over a year now. You may all know him as the
guy who just decided to write a Swift SDK :-). In addition to that big
contribution Byron has also fixed plenty of bugs, prototyped DBT-style
pipeline authoring, and participated in our collective decision-making
process.

Considering his contributions to the project over this timeframe, the
Beam PMC trusts Byron with the responsibilities of a Beam committer. [1]

Thank you Byron! And we are looking to see more of your contributions!

Kenn, on behalf of the Apache Beam PMC

[1]
https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


[ANNOUNCE] New Committer: Sam Whittle

2023-10-16 Thread Kenneth Knowles
Hi all,

Please join me and the rest of the Beam PMC in welcoming a new
committer: Sam Whittle (scwhit...@apache.org).

Sam has been contributing to Beam since 2016! In particular, he specializes
in streaming and the Dataflow Java worker but his contributions expand
naturally from there to the Java SDK, IOs, and even a bit of Python :-).
Sam has contributed a ton of code over the years and is generous in code
review and sharing his expertise.

Considering his contributions to the project over this timeframe, the
Beam PMC trusts Sam with the responsibilities of a Beam committer. [1]

Thank you Sam! And we are looking to see more of your contributions!

Kenn, on behalf of the Apache Beam PMC

[1]
https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


Re: [GitHub Actions] Requiring a test but not running it until requested

2023-10-11 Thread Kenneth Knowles
OK, just realizing a low-tech way this has been done in my experience: have
a workflow that goes red unless the commit / PR message says something
like "TESTED=link to workflow run". I don't love it, and it is easy to get
lazy and circumvent, but I can see why it is what people chose to do.

A variation on that could be an integration test workflow that runs
unconditionally based on the path, but immediately fails early unless the
PR contains some [recently added] magic phrase or comment like "Run large
tests". When the author is ready, they can lick in and re-run failed jobs
through the GHA UI. In theory the draft/ready status could be used for this
but I doubt that practice would catch on.
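
A rough sketch of that gate as a workflow (the path, gradle task, and magic
phrase are illustrative; the task name is borrowed from elsewhere in this
thread):

name: Heavyweight integration tests
on:
  pull_request:
    paths: ['runners/google-cloud-dataflow-java/**']
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - name: Fail early unless explicitly requested
        if: ${{ !contains(github.event.pull_request.body, 'Run large tests') }}
        run: |
          echo "Add 'Run large tests' to the PR description, then re-run the failed jobs."
          exit 1
  heavy:
    needs: gate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./gradlew :runners:google-cloud-dataflow-java:validatesRunnerV2Streaming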

Kenn

On Wed, Oct 11, 2023 at 12:07 PM Danny McCormick via dev <
dev@beam.apache.org> wrote:

> I actually don't think GitHub supports path filters for required checks,
> so you can't say something like "check X is only required if path Y or Z
> are modified",  you can only say "check X is required". I'm not 100% sure
> on this, but it matches my memory and I could neither find docs with that
> feature or get it to function like that on a personal repo. There are ways
> to work around this (e.g. always status "good" when not on the path), but
> they're messy. You also need a way of kicking off the job (today that is
> comment triggers which is probably fine if these are limited to a few
> checks per PR).
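(For reference, the usual shape of that workaround is two workflows exposing
the same job name, one real and one trivially green off-path; a sketch with
illustrative paths and task:)

# heavy-tests.yml
name: Heavy Tests
on:
  pull_request:
    paths: ['runners/google-cloud-dataflow-java/**']
jobs:
  heavy-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./gradlew :runners:google-cloud-dataflow-java:validatesRunnerV2Streaming

# heavy-tests-noop.yml: same check name, so branch protection is satisfied off-path
name: Heavy Tests
on:
  pull_request:
    paths-ignore: ['runners/google-cloud-dataflow-java/**']
jobs:
  heavy-tests:
    runs-on: ubuntu-latest
    steps:
      - run: echo "No relevant paths changed; reporting success."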
>
> Outside of the feasibility question, I'm at least theoretically
> interested. This could allow us to turn some of our postcommits into
> precommits without burning too much CI compute. I'm also generally +1 on
> requiring more checks to pass before merging, especially if we can do so
> through tooling.
>
> On Wed, Oct 11, 2023 at 10:42 AM Kenneth Knowles  wrote:
>
>> From our other thread I had a thought about our "only on request" tests.
>>
>> Today, in theory:
>>
>>  - The lightweight tests run automatically based on path matching. This
>> is an approximate implementation of the ideal of running based on whether
>> they could impact a test signal.
>>  - Heavyweight (and more flaky) tests run on request.
>>
>> A while ago, our "lightweight" tests were a huge monolith and very flaky
>> (Python still is in this state I think). While I was splitting up
>> monolithic "lightweight" Java SDK tests to make them run only on relevant
>> paths, some of the "heavyweight" tests became small enough that they run
>> automatically, so we also have:
>>
>>  - Smaller integration tests (like DirectRunner doing SplunkIO) run
>> automatically based on path matching.
>>
>> Danny mentioned the idea of changing the above to:
>>
>>  - Heavyweight tests run only if the lightweight tests are healthy.
>>
>> Here's an idea I had about a combination of these that I wanted to know
>> if anyone had seen it or thought of how it could happen or why it is a bad
>> idea:
>>
>>  - Heavyweight tests are *required but not automatically run* based on
>> path matching. A status would show up indicating that the PR is not green
>> until you request and pass the heavyweight tests.
>>  - When the PR is actually ready you request them.
>>
>> Then I would move even the small integration tests into that latter
>> category. Incidentally this also could easily apply to non-hermetic tests
>> that make our security posture more difficult, requiring a committer to
>> approve running them.
>>
>> Is this possible? Good? Bad?
>>
>> Kenn
>>
>


[RESULT] [VOTE] Release 2.51.0, release candidate #1

2023-10-11 Thread Kenneth Knowles
The vote has passed.

There are 5 +1 binding votes:

 - Robert Bradshaw
 - Jan Lukavský
 - Ahmet Altay
 - Jean-Baptiste Onofré
 - Alexey Romanenko

Additionally there are 5 non-binding +1 votes:

 - Danny McCormick
 - Svetak Sundhar
 - XQ Hu
 - Bruno Volpato
 - Yi Hu

There are no disapproving votes.

I will note the known issue(s) as I finalize the release.

Kenn

On Tue, Oct 10, 2023 at 4:30 PM Ahmet Altay via dev 
wrote:

> Thank you for the information.
>
> I agree with Kenn in that case. This could wait for the next release.
> Unless there is another reason to do the RC2.
>
> On Tue, Oct 10, 2023 at 12:30 PM Yi Hu  wrote:
>
>>
>> Would it impact all python users including breaking the new user, quick
>>> start experience? Or would it impact users of a specific IO or
>>> configuration?
>>>
>>
>> It is the latter. It will impact users of a specific IO (BigQueryIO read)
>> with a specific configuration (DIRECT_READ). Note that the default configuration
>> for BigQueryIO read is EXPORT. So this won't affect "quick-start" examples
>> having default settings.
>>
>> It also won't affect users using SDK docker containers (e.g. Dataflow
>> users and Flink/Spark users running on a remote cluster). It will affect
>> users running in direct runner, and local portable runners (e.g. Flink
>> local cluster) with LOOPBACK configuration, which is exactly what our
>> Python PostCommit is doing.
>>
>>
>


Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-11 Thread Kenneth Knowles
OK I'm ready.

+1 (binding)


On Tue, Oct 10, 2023 at 4:30 PM Ahmet Altay via dev 
wrote:

> Thank you for the information.
>
> I agree with Kenn in that case. This could wait for the next release.
> Unless there is another reason to do the RC2.
>
> On Tue, Oct 10, 2023 at 12:30 PM Yi Hu  wrote:
>
>>
>> Would it impact all python users including breaking the new user, quick
>>> start experience? Or would it impact users of a specific IO or
>>> configuration?
>>>
>>
>> It is the latter. It will impact users of a specific IO (BigQueryIO read)
>> with a specific configuration (DIRECT_READ). Note that the default configuration
>> for BigQueryIO read is EXPORT. So this won't affect "quick-start" examples
>> having default settings.
>>
>> It also won't affect users using SDK docker containers (e.g. Dataflow
>> users and Flink/Spark users running on a remote cluster). It will affect
>> users running in direct runner, and local portable runners (e.g. Flink
>> local cluster) with LOOPBACK configuration, which is exactly what our
>> Python PostCommit is doing.
>>
>>
>


[GitHub Actions] Requiring a test but not running it until requested

2023-10-11 Thread Kenneth Knowles
From our other thread I had a thought about our "only on request" tests.

Today, in theory:

 - The lightweight tests run automatically based on path matching. This is
an approximate implementation of the ideal of running based on whether they
could impact a test signal.
 - Heavyweight (and more flaky) tests run on request.

A while ago, our "lightweight" tests were a huge monolith and very flaky
(Python still is in this state I think). While I was splitting up
monolithic "lightweight" Java SDK tests to make them run only on relevant
paths, some of the "heavyweight" tests became small enough that they run
automatically, so we also have:

 - Smaller integration tests (like DirectRunner doing SplunkIO) run
automatically based on path matching.

Danny mentioned the idea of changing the above to:

 - Heavyweight tests run only if the lightweight tests are healthy.

Here's an idea I had about a combination of these that I wanted to know if
anyone had seen it or thought of how it could happen or why it is a bad
idea:

 - Heavyweight tests are *required but not automatically run* based on path
matching. A status would show up indicating that the PR is not green until
you request and pass the heavyweight tests.
 - When the PR is actually ready you request them.

Then I would move even the small integration tests into that latter
category. Incidentally this also could easily apply to non-hermetic tests
that make our security posture more difficult, requiring a committer to
approve running them.

Is this possible? Good? Bad?

Kenn


Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-11 Thread Kenneth Knowles
So, top-posting because the threading got to be a lot for me and I think it
forked a bit too... I may even be restating something someone said, so
apologies for that.

Very very good point about *required* parameters where if you don't use
them then you will end up with two writers writing to the same file. The
easiest example to work with might be if you omitted SHARD_NUM so all
shards end up clobbering the same file.

I think there's a unifying perspective between prefix/suffix and the need
to be sure to include critical sharding variables. Essentially it is my
point about it being a "big data fileset". It is perhaps unrealistic but
ideally the user names the big data fileset and then the mandatory other
pieces are added outside of their control. For example if I name my big
data fileset "foo" then that implicitly means that "foo" consists of all
the files named "foo/${SHARD_NUM}-of-${SHARD_TOTAL}". And yes now that I
re-read I see you basically said the same thing. In some cases the required
fields will include $WINDOW, $KEY, and $PANE_INDEX, yes? Even though the
user can think of it as a textual template, if we can use a library that
yields an abstract syntax tree for the expression we can easily check these
requirements in a robust way - or we could do it in a non-robust way be
string-scraping ourselves.

We actually are very close to this in FileIO. I think the interpretation of
"prefix" is that it is the filename "foo" as above, and "suffix" is really
something like ".txt" that you stick on the end of everything for whatever
reason.
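
Spelling the "big data fileset" framing out as a hypothetical straw-man
config (field names illustrative, not an agreed syntax):

sink:
  type: WriteToParquet
  config:
    path: /beam/filesystem/dest
    prefix: foo   # the user-chosen fileset name
    suffix: .txt

which would implicitly mean that the fileset "foo" consists of files like
/beam/filesystem/dest/foo-00001-of-00010.txt, with the mandatory
$SHARD_NUM/$SHARD_TOTAL pieces (and $WINDOW, $KEY, $PANE_INDEX where
required) added outside the user's control.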

Kenn

On Tue, Oct 10, 2023 at 7:12 PM Robert Bradshaw via dev 
wrote:

> On Tue, Oct 10, 2023 at 4:05 PM Chamikara Jayalath 
> wrote:
>
>>
>> On Tue, Oct 10, 2023 at 4:02 PM Robert Bradshaw 
>> wrote:
>>
>>> On Tue, Oct 10, 2023 at 3:53 PM Chamikara Jayalath 
>>> wrote:
>>>

 On Tue, Oct 10, 2023 at 3:41 PM Reuven Lax  wrote:

> I suspect some simple pattern templating would solve most use cases.
> We probably would want to support timestamp formatting (e.g. $Y $M $D)
> as well.
>
> On Tue, Oct 10, 2023 at 3:35 PM Robert Bradshaw 
> wrote:
>
>> On Mon, Oct 9, 2023 at 3:09 PM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>> I would say:
>>>
>>> sink:
>>>   type: WriteToParquet
>>>   config:
>>> path: /beam/filesystem/dest
>>> prefix: <prefix>
>>> suffix: <suffix>
>>>
>>> Underlying SDK will add the middle part of the file names to make
>>> sure that files generated by various bundles/windows/shards do not 
>>> conflict.
>>>
>>
>> What's the relationship between path and prefix? Is path the
>> directory part of the full path, or does prefix precede it?
>>
>
 prefix would be the first part of the file name, so each shard will be
 named <path>/<prefix>-<middle part>-<suffix>.

 This is similar to what we do in existing SDKs. For example, Java
 FileIO,


 https://github.com/apache/beam/blob/65eaf45026e9eeb61a9e05412488e5858faec6de/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L187

>>>
>>> Yeah, although there's no distinction between path and prefix.
>>>
>>
>> Ah, for FileIO, path comes from the "to" call.
>>
>>
>> https://github.com/apache/beam/blob/65eaf45026e9eeb61a9e05412488e5858faec6de/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1125
>>
>>
>
> Ah. I guess there's some inconsistency here, e.g. text files are written
> to a filenamePrefix that subsumes both:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/TextIO.Write.html#to-java.lang.String-
>
>
>>
>>>
>>
>>> This will satisfy the vast majority of use-cases I believe. Fully
>>> customizing the file pattern sounds like a more advanced use case that 
>>> can
>>> be left for "real" SDKs.
>>>
>>
>> Yea, we don't have to do everything.
>>
>>
>>> For dynamic destinations, I think just making the "path" component
>>> support  a lambda that is parameterized by the input should be adequate
>>> since this allows customers to direct files written to different
>>> destination directories.
>>>
>>> sink:
>>>   type: WriteToParquet
>>>   config:
>>> path: <lambda parameterized by the input record>
>>> prefix: <prefix>
>>> suffix: <suffix>
>>>
>>> I'm not sure what would be the best way to specify a lambda here
>>> though. Maybe a regex or the name of a Python callable ?
>>>
>>
>> I'd rather not require Python for a pure Java pipeline, but some kind
>> of a pattern template may be sufficient here.
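
(One hypothetical shape for such a template, assuming field references in
braces; purely illustrative, not a proposed syntax:)

sink:
  type: WriteToParquet
  config:
    path: "/beam/output/{region}"  # hypothetical: 'region' read from each record
    prefix: events
    suffix: .parquet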
>>
>>
>>> On Mon, Oct 9, 2023 at 2:06 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
 .On Mon, Oct 9, 2023 at 1:49 PM Reuven Lax 
 wrote:

> Just FYI - the reason why names (including prefixes) in
> DynamicDestinations were parameterized via a lambda ins

Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-10 Thread Kenneth Knowles
After thinking this through a bit more, I am inclined to release RC1 with
this noted as a known issue, unless there are other more compelling reasons
to issue a second RC.

Why?

 - It is more-or-less by design that end users of Beam Python have
dependencies shift under them; breakage and recovery (via pinning to known
good versions) must be part of that design.
 - In many contexts, users will already know this and will have pinned
dependencies which therefore won't be impacted.

So I am still working through the other failures on
https://github.com/apache/beam/pull/28663 to confirm if they are all benign
before closing the vote. If someone wants to actually -1 the RC they can do
that, but I won't (yet).

Kenn

On Mon, Oct 9, 2023 at 4:22 PM Kenneth Knowles  wrote:

> OK I can cherrypick it so they have an upgrade fix. But also we should
> instruct users to pin their fastavro version to a good version. That is
> probably safer and easier than upgrading Beam.
>
> Our containers that we build have the version pinned, right? So will this
> also cause all the prior containers to have slow start up?
>
> Kenn
>
> On Mon, Oct 9, 2023 at 4:13 PM Yi Hu via dev  wrote:
>
>> Yes, and moreover, this specific issue will break the user the same way
>> for *all* Beam versions (2.50.0, 2.49.0, etc) after Oct 3. That said the
>> issue is not limited to Beam 2.50.0 though.
>>
>> On Mon, Oct 9, 2023 at 4:08 PM Kenneth Knowles  wrote:
>>
>>> If we had closed the release today, this would still have broken all our
>>> users, correct?
>>>
>>> Kenn
>>>
>>> On Mon, Oct 9, 2023 at 3:37 PM Anand Inguva via dev 
>>> wrote:
>>>
>>>> There was a regression[1] on fastavro latest release 1.8.4. Fix was
>>>> merged at https://github.com/apache/beam/pull/28896. The RC1 includes
>>>> that version in the range for fastavro[2]. I think we need to CP
>>>> https://github.com/apache/beam/pull/28896 to solve the fastavro
>>>> regression.
>>>>
>>>> [1] https://github.com/apache/beam/issues/28811
>>>> [2]
>>>> https://github.com/apache/beam/blob/cd653e33b342bd09c76c2bbaca12597fec5b4a2c/sdks/python/setup.py#L245
>>>>
>>>>
>>>> On Mon, Oct 9, 2023 at 3:15 PM Kenneth Knowles  wrote:
>>>>
>>>>> Ran a couple of Java pipelines "as a newb user" to make sure our
>>>>> instructions weren't out of date. There are some errors in the 
>>>>> instructions
>>>>> but they don't have to do with this release.
>>>>>
>>>>> Re-ran mass_comment.py on https://github.com/apache/beam/pull/28663.
>>>>> There are enough red signals there that some triage is needed. Any help
>>>>> triaging would be appreciated.
>>>>>
>>>>> I'll close the vote once everything is run and examined.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Sat, Oct 7, 2023 at 9:58 AM Yi Hu via dev 
>>>>> wrote:
>>>>>
>>>>>> +1 (non-binding) Tested on Java IO load tests (
>>>>>> https://github.com/bvolpato/DataflowTemplates/tree/56d18a31c1c95e58543d7a1656bd83d7e859b482/it)
>>>>>> BigQueryIO, TextIO, BigtableIO, SpannerIO on Dataflow legacy runner and
>>>>>> runner v2
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 6, 2023 at 3:23 PM Kenneth Knowles 
>>>>>> wrote:
>>>>>>
>>>>>>> Additionally we need https://github.com/apache/beam/pull/28665/files
>>>>>>> in order to run GHA tests.
>>>>>>>
>>>>>>> On Fri, Oct 6, 2023 at 3:19 PM Kenneth Knowles 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> That PR was prior to many cherry-picks so it is not the signal we
>>>>>>>> need. I have updated it to the tip of the release-2.51.0 branch.
>>>>>>>>
>>>>>>>> There were some post-commit tests involving JPMS that I believe
>>>>>>>> need https://github.com/apache/beam/pull/28726 to pass.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Fri, Oct 6, 2023 at 2:53 PM Valentyn Tymofieiev via dev <
>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> > PR to run tests against release branch [12].
>>>>>>>>>
>

Re: [PROPOSAL] [Nice-to-have] CI job names and commands that match

2023-10-10 Thread Kenneth Knowles
Oh yes.


3) An expensive job that has one command to run a small set of tests and
>> then another to run a more expensive set only if the initial one passed.
>>
>
This makes sense conceptually, but I am not aware of a concrete instance of
it. In practice how it works for us is that lighter-weight hermetic tests
and sometimes highly targeted medium-weight integration tests (such as for
a specific IO being edited) will run on every change to a PR, while all
heavier-weight tests are by request. My proposal leaves this as-is. It
actually would be cool to have the heavyweight workflow show up as
"failed/required" but only run on demand. Because you want to wait until
"ready" to run the heavyweight thing. You don't want it to necessary kick
off just as soon as the hermetic tests pass. Anyhow, this is just
hypothetical.



> It's also worth mentioning that often the command to run a step is quite
>> long, especially for things like perf tests that have lots of flags to pass.
>>
>
Agree with this. Might be a case where you choose some more abbreviated
identifier.


All of these examples *could *be bundled into a single gradle command, but
>> they shouldn't have to be. Instead we should have a workflow interface that
>> is as independent of implementation as possible IMO and represents an
>> abstraction of what we actually want to test (e.g. "Run Dataflow Runner
>> Tests", or "Run Java Examples Tests"). This also avoids us taking a
>> dependency that could go out of date if we change our commands.
>>
>
I agree with the goal here. But I don't believe we should use CI / GHA to
build an abstraction of "tasks that you can request by name to be run". We
already have a dependency-driven build system with that same characteristic
:-). For CD stuff, sure. But for test signals they really ought to be basic
test suites that can be invoked on the command line, ideally without
parameters.



> > Mostly fixing stuff like how the status label "Google Cloud Dataflow
>> Runner V2 Java ValidatesRunner Tests (streaming)" is literally a less
>> useful way of writing
>> ":runners:google-cloud-dataflow-java:validatesRunnerV2Streaming".
>>
>> Last thing I'll add - this is true for you and probably many
>> contributors, but is less friendly for new folks who are less familiar with
>> the project IMO (especially when the filepath/command is less obvious).
>>
>
Agree to disagree :-). They are both the same words, just one is a soup and
one is a task name :-)

--end--


>
>> On Tue, Oct 10, 2023 at 12:29 PM Robert Burke  wrote:
>>
>>> +1 to the general proposal.
>>>
>>> I'm not bothered if something says a gradle command and in execution,
>>> that task ends up running multiple different commands. Arguably, if we're
>>> running a gradle task manualy to prepare for a subsequent task that latter
>>> task should be adding the former to it's dependencies.
>>>
>>> Also agree that this is a post Jenkins exit task. One migration in an
>>> area at a time please.
>>>
>>> On Tue, Oct 10, 2023, 8:07 AM Kenneth Knowles  wrote:
>>>
>>>> On Tue, Oct 10, 2023 at 10:21 AM Danny McCormick via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> I'm +1 on:
>>>>> - standardizing our naming
>>>>> - making job names match their commands exactly (though I'd still like
>>>>> the `Run` prefix so that you can do things like say "Suite XYZ failed"
>>>>> without triggering the automation)
>>>>> - removing pre/postcommit from the naming (we actually already run
>>>>> many precommits as postcommits as well)
>>>>
>>>>
>>>> Fully agree with your point of keeping "Run" as the magic word and the
>>>> way we have it today.
>>>>
>>>> I'm -0 on:
>>>>
>>>>> - Doing this immediately - I'd prefer we wait til the Jenkins to
>>>>> Actions migration is done and we can do this in bulk versus renaming 
>>>>> things
>>>>> as we go since we're so close to the finish line and exact parity makes
>>>>> reviews easier.
>>>>
>>>>
>>>> Cool. And indeed this is why I didn't "just do it" (aside from this
>>>> having enough impact on people's daily lives that I wanted to get
>>>> feedback from dev@). If we can fold it in as a last step to the
>>>> migration, that would be a nice-to-have. Otherwise ping back when ready
>>>> please :-)

Re: [PROPOSAL] [Nice-to-have] CI job names and commands that match

2023-10-10 Thread Kenneth Knowles
On Tue, Oct 10, 2023 at 10:21 AM Danny McCormick via dev <
dev@beam.apache.org> wrote:

> I'm +1 on:
> - standardizing our naming
> - making job names match their commands exactly (though I'd still like the
> `Run` prefix so that you can do things like say "Suite XYZ failed" without
> triggering the automation)
> - removing pre/postcommit from the naming (we actually already run many
> precommits as postcommits as well)


Fully agree with your point of keeping "Run" as the magic word and the way
we have it today.

I'm -0 on:

> - Doing this immediately - I'd prefer we wait til the Jenkins to Actions
> migration is done and we can do this in bulk versus renaming things as we
> go since we're so close to the finish line and exact parity makes reviews
> easier.


Cool. And indeed this is why I didn't "just do it" (aside from this having
enough impact on people's daily lives that I wanted to get feedback from
dev@). If we can fold it in as a last step to the migration, that would be
a nice-to-have. Otherwise ping back when ready please :-)

On Tue, Oct 10, 2023 at 11:15 AM Yi Hu via dev  wrote:

> Thanks for raising this. This generally works, though some jobs run more
> than one gradle task (e.g. some IO_Direct_PreCommit run both :build (which
> executes unit tests) and :integrationTest).
>

On Tue, Oct 10, 2023 at 10:21 AM Danny McCormick via dev <
dev@beam.apache.org> wrote:

> I'm -1 on:
> - Tying the naming to gradle - like Yi mentioned some workflows have
> multiple gradle tasks, some don't have any, and I think that's ok.
>

Just to clarify: I'm not proposing tying them to gradle tasks (I'm fine
with `go test` for example) or doing this in situations where it is
unnatural.

My example probably confused this because I left off the `./gradlew` just
to save space. I'm proposing naming them after their obvious repro command,
wherever applicable. Mostly fixing stuff like how the status label "Google
Cloud Dataflow Runner V2 Java ValidatesRunner Tests (streaming)" is
literally a less useful way of writing
":runners:google-cloud-dataflow-java:validatesRunnerV2Streaming".

FWIW I think Yi's example demonstrates an anti-pattern (mixing
hermetic/reliable and non-hermetic/possibly-flaky tests in one test signal)
but I'm sure there are indeed jobs where this doesn't make sense.

Kenn



>
> Thanks,
> Danny
>
>
>>
>> Another option is to normalize the naming of every job, saying the job
>> name is X, then workflow name is PreCommit_X or PostCommit_X, and the
>> phrase is Run X. Currently most PreCommit follow this pattern, but there
>> are also many outliers. A good start could be to clean up all jobs
>> to follow the same pattern.
>>
>>
>> On Tue, Oct 10, 2023 at 9:57 AM Kenneth Knowles  wrote:
>>
>>> FWIW I'm aware of the README in
>>> https://github.com/apache/beam/tree/master/.test-infra/jenkins that
>>> lists the phrases alongside the jobs. This is just wasted work to maintain
>>> IMO.
>>>
>>> Kenn
>>>
>>> On Tue, Oct 10, 2023 at 9:46 AM Kenneth Knowles  wrote:
>>>
>>>> *Proposal:* make all the job names exactly match the GH comment to run
>>>> them and make it also as close as possible to how to reproduce locally
>>>>
>>>> *Example problems*:
>>>>
>>>>  - We have really silly redundant jobs results like 'Chicago Taxi
>>>> Example on Dataflow ("Run Chicago Taxi on Dataflow")' and
>>>> 'Python_Xlang_IO_Dataflow ("Run Python_Xlang_IO_Dataflow PostCommit")'
>>>>
>>>>  - We have jobs that there's no way you could guess the command 'Google
>>>> Cloud Dataflow Runner V2 Java ValidatesRunner Tests (streaming)'
>>>>
>>>>  - (nit) We are weirdly inconsistent about using spaces vs underscores.
>>>> I don't think any of our infrastructure cares about this.
>>>>
>>>> *Extra proposal*: make the job name also the local command, where
>>>> possible
>>>>
>>>> *Example: *
>>>> https://github.com/apache/beam/blob/master/.github/workflows/beam_PostCommit_Java_ValidatesRunner_Dataflow.yml
>>>>
>>>>  - This runs :runners:google-cloud-dataflow-java:validatesRunner
>>>>  - So make the status label
>>>> ":runners:google-cloud-dataflow-java:validatesRunner"
>>>>  - "Run :runners:google-cloud-dataflow-java:validatesRunner" as comment
>>>>
>>>> If I want to run it locally, yes there are GCP things I have to set up,
>>>> but I know the gradle command now.

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Kenneth Knowles
Another perspective:

We should focus on the fact that FileIO writes what I would call a "big
file-based dataset" to a filesystem. The primary characteristic of a "big
file-based dataset" is that it is sharded and that the shards should not
have any individual distinctiveness. The dataset should be read and written
as a whole, and the shards are an implementation detail for performance.

This impacts dynamic destinations in two ways that I can think of right away:

 - It is critical by definition to be able to refer to a whole "big
file-based dataset" as a whole thing. The most obvious way would be for it
to exist as a folder or folder-like grouping of files, and you glob
everything underneath that. But the hard requirement is that there is
*some* way
to refer to the dataset as a single entity (via glob, basically).

 - When using "dynamic destinations" style functionality, each of the
dynamic destinations is a "big file-based dataset". The purpose is to route
a datum to the correct one of these, NOT to route the datum to a particular
file (which would be just some anonymous shard in the dataset).

So having really fine-grained control over filenames is likely to result in
anti-patterns of malformed datasets that cannot be easily globbed or, in
the converse case, well-formed datasets that have suboptimal sharding
because it was manually managed.

I know that "reality" is not this simple, because people have accumulated
files they have to work with as-is, where they probably didn't plan for
this way of thinking when they were gathering the files. We need to give
good options for everyone, but the golden path should be the simple and
good case.
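
(To illustrate, with purely made-up paths: a dataset written as

  gs://bucket/output/region=us/part-00000-of-00100.parquet
  gs://bucket/output/region=us/part-00001-of-00100.parquet
  ...

should be consumed as a whole via a glob like
gs://bucket/output/region=us/*.parquet. The shard names carry no individual
meaning, while each "region=..." prefix is a separate dynamic destination.)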

Kenn

On Tue, Oct 10, 2023 at 10:09 AM Kenneth Knowles  wrote:

> Since I've been in GHA files lately...
>
> I think they have a very useful pattern which we could borrow from or
> learn from, where setting up the variables happens separately, like
> https://github.com/apache/beam/blob/57821c191d322f9f21c01a34c55e0c40eda44f1e/.github/workflows/build_release_candidate.yml#L270
>
> If we called the section "vars" and then the config could use the vars in
> the destination. I'm making this example deliberately a little gross:
>
>  - vars:
> - USER_REGION: $.user.metadata.region
> - USER_GROUP: $.user.groups[0].name
>  - config:
> - path:
> gs://output-bucket-${vars.USER_REGION}/files/${vars.USER_GROUP}-${fileio.SHARD_NUM}-${fileio.WINDOW}
>
> I think it strikes a good balance between arbitrary lambdas and just a
> prefix/suffix control, giving a really easy place where we can say "the
> whole value of this YAML field is a path expression into the structured
> data"
>
> Kenn
>
> On Mon, Oct 9, 2023 at 6:09 PM Chamikara Jayalath via dev <
> dev@beam.apache.org> wrote:
>
>> I would say:
>>
>> sink:
>>   type: WriteToParquet
>>   config:
>> path: /beam/filesystem/dest
>> prefix: 
>> suffix: 
>>
>> Underlying SDK will add the middle part of the file names to make sure
>> that files generated by various bundles/windows/shards do not conflict.
>>
>> This will satisfy the vast majority of use-cases I believe. Fully
>> customizing the file pattern sounds like a more advanced use case that can
>> be left for "real" SDKs.
>>
>> For dynamic destinations, I think just making the "path" component
>> support  a lambda that is parameterized by the input should be adequate
>> since this allows customers to direct files written to different
>> destination directories.
>>
>> sink:
>>   type: WriteToParquet
>>   config:
>> path: 
>> prefix: 
>> suffix: 
>>
>> I'm not sure what would be the best way to specify a lambda here though.
>> Maybe a regex or the name of a Python callable ?
>>
>> Thanks,
>> Cham
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Oct 9, 2023 at 2:06 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> .On Mon, Oct 9, 2023 at 1:49 PM Reuven Lax  wrote:
>>>
>>>> Just FYI - the reason why names (including prefixes) in
>>>> DynamicDestinations were parameterized via a lambda instead of just having
>>>> the user add it via MapElements is performance. We discussed something
>>>> along the lines of what you are suggesting (essentially having the user
>>>> create a KV where the key contained the dynamic information). The problem
>>> was that the size of the generated filepath was often much larger
>>>> (

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Kenneth Knowles
Since I've been in GHA files lately...

I think they have a very useful pattern which we could borrow from or learn
from, where setting up the variables happens separately, like
https://github.com/apache/beam/blob/57821c191d322f9f21c01a34c55e0c40eda44f1e/.github/workflows/build_release_candidate.yml#L270

If we called the section "vars" and then the config could use the vars in
the destination. I'm making this example deliberately a little gross:

 - vars:
- USER_REGION: $.user.metadata.region
- USER_GROUP: $.user.groups[0].name
 - config:
- path:
gs://output-bucket-${vars.USER_REGION}/files/${vars.USER_GROUP}-${fileio.SHARD_NUM}-${fileio.WINDOW}

I think it strikes a good balance between arbitrary lambdas and just a
prefix/suffix control, giving a really easy place where we can say "the
whole value of this YAML field is a path expression into the structured
data"

Kenn

On Mon, Oct 9, 2023 at 6:09 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> I would say:
>
> sink:
>   type: WriteToParquet
>   config:
> path: /beam/filesystem/dest
> prefix: 
> suffix: 
>
> Underlying SDK will add the middle part of the file names to make sure
> that files generated by various bundles/windows/shards do not conflict.
>
> This will satisfy the vast majority of use-cases I believe. Fully
> customizing the file pattern sounds like a more advanced use case that can
> be left for "real" SDKs.
>
> For dynamic destinations, I think just making the "path" component
> support  a lambda that is parameterized by the input should be adequate
> since this allows customers to direct files written to different
> destination directories.
>
> sink:
>   type: WriteToParquet
>   config:
> path: 
> prefix: 
> suffix: 
>
> I'm not sure what would be the best way to specify a lambda here though.
> Maybe a regex or the name of a Python callable ?
>
> Thanks,
> Cham
>
>
>
>
>
>
>
>
>
>
> On Mon, Oct 9, 2023 at 2:06 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> .On Mon, Oct 9, 2023 at 1:49 PM Reuven Lax  wrote:
>>
>>> Just FYI - the reason why names (including prefixes) in
>>> DynamicDestinations were parameterized via a lambda instead of just having
>>> the user add it via MapElements is performance. We discussed something
>>> along the lines of what you are suggesting (essentially having the user
>>> create a KV where the key contained the dynamic information). The problem
>>> was that often the size of the generated filepath was often much larger
>>> (sometimes by 2 OOM) than the information in the record, and there was a
>>> desire to avoid record blowup. e.g. the record might contain a single
>>> integer userid, and the filepath prefix would then be
>>> /long/path/to/output/users/. This was especially bad in cases where the
>>> data had to be shuffled, and the existing dynamic destinations method
>>> allowed extracting the filepath only _after_  the shuffle.
>>>
>>
>> That is a consideration I hadn't thought much of, thanks for
>> bringing this up.
>>
>>
>>> Now there may not be any good way to keep this benefit in a
>>> declarative approach such as YAML (or at least a good easy way - we could
>>> always allow the user to pass in a SQL expression to extract the filename
>>> from the record!), but we should keep in mind that this might mean that
>>> YAML-generated pipelines will be less efficient for certain use cases.
>>>
>>
>> Yep, it's not as straightforward to do in a declarative way. I would like
>> to avoid mixing UDFs (with their associated languages and execution
>> environments) if possible. Though I'd like the performance of a
>> "straightforward" YAML pipeline to be that which one can get writing
>> straight-line Java (and possibly better, if we can leverage the structure
>> of schemas everywhere) this is not an absolute requirement for all
>> features.
>>
>> I wonder if separating out a constant prefix vs. the dynamic stuff could
>> be sufficient to mitigate the blow-up of pre-computing this in most cases
>> (especially in the context of a larger pipeline). Alternatively, rather
>> than just a sharding pattern, one could have a full filepattern that
>> includes format parameters for dynamically computed bits as well as the
>> shard number, windowing info, etc. (There are pros and cons to this.)
>>
>>
>>> On Mon, Oct 9, 2023 at 12:37 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Currently the various file writing configurations take a single
 parameter, path, which indicates where the (sharded) output should be
 placed. In other words, one can write something like

   pipeline:
 ...
 sink:
   type: WriteToParquet
   config:
 path: /beam/filesystem/dest

 and one gets files like "/beam/filesystem/dest-X-of-N"

 Of course, in practice file writing is often much more complicated than
 this (especially when it comes to S

Re: [PROPOSAL] [Nice-to-have] CI job names and commands that match

2023-10-10 Thread Kenneth Knowles
FWIW I'm aware of the README in
https://github.com/apache/beam/tree/master/.test-infra/jenkins that lists
the phrases alongside the jobs. This is just wasted work to maintain IMO.

Kenn

On Tue, Oct 10, 2023 at 9:46 AM Kenneth Knowles  wrote:

> *Proposal:* make all the job names exactly match the GH comment to run
> them and make it also as close as possible to how to reproduce locally
>
> *Example problems*:
>
>  - We have really silly redundant jobs results like 'Chicago Taxi Example
> on Dataflow ("Run Chicago Taxi on Dataflow")' and 'Python_Xlang_IO_Dataflow
> ("Run Python_Xlang_IO_Dataflow PostCommit")'
>
>  - We have jobs that there's no way you could guess the command 'Google
> Cloud Dataflow Runner V2 Java ValidatesRunner Tests (streaming)'
>
>  - (nit) We are weirdly inconsistent about using spaces vs underscores. I
> don't think any of our infrastructure cares about this.
>
> *Extra proposal*: make the job name also the local command, where possible
>
> *Example: *
> https://github.com/apache/beam/blob/master/.github/workflows/beam_PostCommit_Java_ValidatesRunner_Dataflow.yml
>
>  - This runs :runners:google-cloud-dataflow-java:validatesRunner
>  - So make the status label
> ":runners:google-cloud-dataflow-java:validatesRunner"
>  - "Run :runners:google-cloud-dataflow-java:validatesRunner" as comment
>
> If I want to run it locally, yes there are GCP things I have to set up,
> but I know the gradle command now.
>
> *Corollary*: remove "postcommit" and "precommit" from names, because
> whether a suite runs before merge or after merge is not a property of the
> suite.
>
> *Caveats*: I haven't been that involved. I didn't do this to Jenkins
> because they are going away. I didn't do anything to GHA because I don't
> know if they are ready or in flux.
>
> I know this is the sort of thing that invites bikeshedding. It just would
> save me a few minutes when puzzling out what to care about and how to kick
> jobs on the release branch validation PR.
>
> I'm happy to scrape through the existing stuff and align it. Perfect task
> for when my brain is too tired for other work.
>
> Kenn
>


[PROPOSAL] [Nice-to-have] CI job names and commands that match

2023-10-10 Thread Kenneth Knowles
*Proposal:* make all the job names exactly match the GH comment to run them
and make it also as close as possible to how to reproduce locally

*Example problems*:

 - We have really silly redundant jobs results like 'Chicago Taxi Example
on Dataflow ("Run Chicago Taxi on Dataflow")' and 'Python_Xlang_IO_Dataflow
("Run Python_Xlang_IO_Dataflow PostCommit")'

 - We have jobs that there's no way you could guess the command 'Google
Cloud Dataflow Runner V2 Java ValidatesRunner Tests (streaming)'

 - (nit) We are weirdly inconsistent about using spaces vs underscores. I
don't think any of our infrastructure cares about this.

*Extra proposal*: make the job name also the local command, where possible

*Example: *
https://github.com/apache/beam/blob/master/.github/workflows/beam_PostCommit_Java_ValidatesRunner_Dataflow.yml

 - This runs :runners:google-cloud-dataflow-java:validatesRunner
 - So make the status label
":runners:google-cloud-dataflow-java:validatesRunner"
 - "Run :runners:google-cloud-dataflow-java:validatesRunner" as comment

If I want to run it locally, yes there are GCP things I have to set up, but
I know the gradle command now.
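
A rough sketch of what the workflow header might then look like (this is not
the current file, and I'm eliding the comment-trigger plumbing our workflows
actually use):

  name: ":runners:google-cloud-dataflow-java:validatesRunner"
  on:
    issue_comment:
      types: [created]  # a PR comment of "Run :runners:..." would kick it off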

*Corollary*: remove "postcommit" and "precommit" from names, because
whether a suite runs before merge or after merge is not a property of the
suite.

*Caveats*: I haven't been that involved. I didn't do this to Jenkins
because they are going away. I didn't do anything to GHA because I don't
know if they are ready or in flux.

I know this is the sort of thing that invites bikeshedding. It just would
save me a few minutes when puzzling out what to care about and how to kick
jobs on the release branch validation PR.

I'm happy to scrape through the existing stuff and align it. Perfect task
for when my brain is too tired for other work.

Kenn


Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-09 Thread Kenneth Knowles
OK, I can cherry-pick it so they have an upgrade fix. But also we should
instruct users to pin their fastavro version to a good version. That is
probably safer and easier than upgrading Beam.
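
For example, assuming (per the linked issue) that 1.8.4 is the only
regressed release, a user's requirements.txt could exclude just that
version:

  fastavro!=1.8.4

or pin an exact known-good version outright.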

Our containers that we build have the version pinned, right? So will this
also cause all the prior containers to have slow startup?

Kenn

On Mon, Oct 9, 2023 at 4:13 PM Yi Hu via dev  wrote:

> Yes, and moreover, this specific issue will break the user the same way
> for *all* Beam versions (2.50.0, 2.49.0, etc) after Oct 3. That said the
> issue is not limited to Beam 2.50.0 though.
>
> On Mon, Oct 9, 2023 at 4:08 PM Kenneth Knowles  wrote:
>
>> If we had closed the release today, this would still have broken all our
>> users, correct?
>>
>> Kenn
>>
>> On Mon, Oct 9, 2023 at 3:37 PM Anand Inguva via dev 
>> wrote:
>>
>>> There was a regression[1] on fastavro latest release 1.8.4. Fix was
>>> merged at https://github.com/apache/beam/pull/28896. The RC1 includes
>>> that version in the range for fastavro[2]. I think we need to CP
>>> https://github.com/apache/beam/pull/28896 to solve the fastavro
>>> regression.
>>>
>>> [1] https://github.com/apache/beam/issues/28811
>>> [2]
>>> https://github.com/apache/beam/blob/cd653e33b342bd09c76c2bbaca12597fec5b4a2c/sdks/python/setup.py#L245
>>>
>>>
>>> On Mon, Oct 9, 2023 at 3:15 PM Kenneth Knowles  wrote:
>>>
>>>> Ran a couple of Java pipelines "as a newb user" to make sure our
>>>> instructions weren't out of date. There are some errors in the instructions
>>>> but they don't have to do with this release.
>>>>
>>>> Re-ran mass_comment.py on https://github.com/apache/beam/pull/28663.
>>>> There are enough red signals there that some triage is needed. Any help
>>>> triaging would be appreciated.
>>>>
>>>> I'll close the vote once everything is run and examined.
>>>>
>>>> Kenn
>>>>
>>>> On Sat, Oct 7, 2023 at 9:58 AM Yi Hu via dev 
>>>> wrote:
>>>>
>>>>> +1 (non-binding) Tested on Java IO load tests (
>>>>> https://github.com/bvolpato/DataflowTemplates/tree/56d18a31c1c95e58543d7a1656bd83d7e859b482/it)
>>>>> BigQueryIO, TextIO, BigtableIO, SpannerIO on Dataflow legacy runner and
>>>>> runner v2
>>>>>
>>>>>
>>>>> On Fri, Oct 6, 2023 at 3:23 PM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>> Additionally we need https://github.com/apache/beam/pull/28665/files
>>>>>> in order to run GHA tests.
>>>>>>
>>>>>> On Fri, Oct 6, 2023 at 3:19 PM Kenneth Knowles 
>>>>>> wrote:
>>>>>>
>>>>>>> That PR was prior to many cherry-picks so it is not the signal we
>>>>>>> need. I have updated it to the tip of the release-2.51.0 branch.
>>>>>>>
>>>>>>> There were some post-commit tests involving JPMS that I believe need
>>>>>>> https://github.com/apache/beam/pull/28726 to pass.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Fri, Oct 6, 2023 at 2:53 PM Valentyn Tymofieiev via dev <
>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> > PR to run tests against release branch [12].
>>>>>>>>
>>>>>>>>  https://github.com/apache/beam/pull/28663 is closed and test
>>>>>>>> signal is no longer available. did all the tests pass?
>>>>>>>>
>>>>>>>> On Fri, Oct 6, 2023 at 5:32 AM Alexey Romanenko <
>>>>>>>> aromanenko@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1 (binding)
>>>>>>>>>
>>>>>>>>> —
>>>>>>>>> Alexey
>>>>>>>>>
>>>>>>>>> > On 5 Oct 2023, at 18:38, Jean-Baptiste Onofré 
>>>>>>>>> wrote:
>>>>>>>>> >
>>>>>>>>> > +1 (binding)
>>>>>>>>> >
>>>>>>>>> > Thanks !
>>>>>>>>> > Regards
>>>>>>>>> > JB
>>>>>>>>> >
>>>>>>>>> > On Tue, Oct 3, 2023 at 

Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-09 Thread Kenneth Knowles
If we had closed the release today, this would still have broken all our
users, correct?

Kenn

On Mon, Oct 9, 2023 at 3:37 PM Anand Inguva via dev 
wrote:

> There was a regression[1] on fastavro latest release 1.8.4. Fix was merged
> at https://github.com/apache/beam/pull/28896. The RC1 includes that
> version in the range for fastavro[2]. I think we need to CP
> https://github.com/apache/beam/pull/28896 to solve the fastavro
> regression.
>
> [1] https://github.com/apache/beam/issues/28811
> [2]
> https://github.com/apache/beam/blob/cd653e33b342bd09c76c2bbaca12597fec5b4a2c/sdks/python/setup.py#L245
>
>
> On Mon, Oct 9, 2023 at 3:15 PM Kenneth Knowles  wrote:
>
>> Ran a couple of Java pipelines "as a newb user" to make sure our
>> instructions weren't out of date. There are some errors in the instructions
>> but they don't have to do with this release.
>>
>> Re-ran mass_comment.py on https://github.com/apache/beam/pull/28663.
>> There are enough red signals there that some triage is needed. Any help
>> triaging would be appreciated.
>>
>> I'll close the vote once everything is run and examined.
>>
>> Kenn
>>
>> On Sat, Oct 7, 2023 at 9:58 AM Yi Hu via dev  wrote:
>>
>>> +1 (non-binding) Tested on Java IO load tests (
>>> https://github.com/bvolpato/DataflowTemplates/tree/56d18a31c1c95e58543d7a1656bd83d7e859b482/it)
>>> BigQueryIO, TextIO, BigtableIO, SpannerIO on Dataflow legacy runner and
>>> runner v2
>>>
>>>
>>> On Fri, Oct 6, 2023 at 3:23 PM Kenneth Knowles  wrote:
>>>
>>>> Additionally we need https://github.com/apache/beam/pull/28665/files
>>>> in order to run GHA tests.
>>>>
>>>> On Fri, Oct 6, 2023 at 3:19 PM Kenneth Knowles  wrote:
>>>>
>>>>> That PR was prior to many cherry-picks so it is not the signal we
>>>>> need. I have updated it to the tip of the release-2.51.0 branch.
>>>>>
>>>>> There were some post-commit tests involving JPMS that I believe need
>>>>> https://github.com/apache/beam/pull/28726 to pass.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Fri, Oct 6, 2023 at 2:53 PM Valentyn Tymofieiev via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> > PR to run tests against release branch [12].
>>>>>>
>>>>>>  https://github.com/apache/beam/pull/28663 is closed and test signal
>>>>>> is no longer available. did all the tests pass?
>>>>>>
>>>>>> On Fri, Oct 6, 2023 at 5:32 AM Alexey Romanenko <
>>>>>> aromanenko@gmail.com> wrote:
>>>>>>
>>>>>>> +1 (binding)
>>>>>>>
>>>>>>> —
>>>>>>> Alexey
>>>>>>>
>>>>>>> > On 5 Oct 2023, at 18:38, Jean-Baptiste Onofré 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > +1 (binding)
>>>>>>> >
>>>>>>> > Thanks !
>>>>>>> > Regards
>>>>>>> > JB
>>>>>>> >
>>>>>>> > On Tue, Oct 3, 2023 at 7:58 PM Kenneth Knowles 
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> Hi everyone,
>>>>>>> >>
>>>>>>> >> Please review and vote on the release candidate #1 for the
>>>>>>> version 2.51.0, as follows:
>>>>>>> >>
>>>>>>> >> [ ] +1, Approve the release
>>>>>>> >> [ ] -1, Do not approve the release (please provide specific
>>>>>>> comments)
>>>>>>> >>
>>>>>>> >> Reviewers are encouraged to test their own use cases with the
>>>>>>> release candidate, and vote +1 if no issues are found. Only PMC member
>>>>>>> votes will count towards the final vote, but votes from all community
>>>>>>> members are encouraged and helpful for finding regressions; you can
>>>>>>> either
>>>>>>> test your own use cases or use cases from the validation sheet [10].
>>>>>>> >>
>>>>>>> >> The complete staging area is available for your review, which
>>>>>>> includes:
>>>>>>> >>

Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-09 Thread Kenneth Knowles
Ran a couple of Java pipelines "as a newb user" to make sure our
instructions weren't out of date. There are some errors in the instructions
but they don't have to do with this release.

Re-ran mass_comment.py on https://github.com/apache/beam/pull/28663. There
are enough red signals there that some triage is needed. Any help triaging
would be appreciated.

I'll close the vote once everything is run and examined.

Kenn

On Sat, Oct 7, 2023 at 9:58 AM Yi Hu via dev  wrote:

> +1 (non-binding) Tested on Java IO load tests (
> https://github.com/bvolpato/DataflowTemplates/tree/56d18a31c1c95e58543d7a1656bd83d7e859b482/it)
> BigQueryIO, TextIO, BigtableIO, SpannerIO on Dataflow legacy runner and
> runner v2
>
>
> On Fri, Oct 6, 2023 at 3:23 PM Kenneth Knowles  wrote:
>
>> Additionally we need https://github.com/apache/beam/pull/28665/files in
>> order to run GHA tests.
>>
>> On Fri, Oct 6, 2023 at 3:19 PM Kenneth Knowles  wrote:
>>
>>> That PR was prior to many cherry-picks so it is not the signal we need.
>>> I have updated it to the tip of the release-2.51.0 branch.
>>>
>>> There were some post-commit tests involving JPMS that I believe need
>>> https://github.com/apache/beam/pull/28726 to pass.
>>>
>>> Kenn
>>>
>>> On Fri, Oct 6, 2023 at 2:53 PM Valentyn Tymofieiev via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> > PR to run tests against release branch [12].
>>>>
>>>>  https://github.com/apache/beam/pull/28663 is closed and test signal
>>>> is no longer available. did all the tests pass?
>>>>
>>>> On Fri, Oct 6, 2023 at 5:32 AM Alexey Romanenko <
>>>> aromanenko....@gmail.com> wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>> —
>>>>> Alexey
>>>>>
>>>>> > On 5 Oct 2023, at 18:38, Jean-Baptiste Onofré 
>>>>> wrote:
>>>>> >
>>>>> > +1 (binding)
>>>>> >
>>>>> > Thanks !
>>>>> > Regards
>>>>> > JB
>>>>> >
>>>>> > On Tue, Oct 3, 2023 at 7:58 PM Kenneth Knowles 
>>>>> wrote:
>>>>> >>
>>>>> >> Hi everyone,
>>>>> >>
>>>>> >> Please review and vote on the release candidate #1 for the version
>>>>> 2.51.0, as follows:
>>>>> >>
>>>>> >> [ ] +1, Approve the release
>>>>> >> [ ] -1, Do not approve the release (please provide specific
>>>>> comments)
>>>>> >>
>>>>> >> Reviewers are encouraged to test their own use cases with the
>>>>> release candidate, and vote +1 if no issues are found. Only PMC member
>>>>> votes will count towards the final vote, but votes from all community
>>>>> members are encouraged and helpful for finding regressions; you can either
>>>>> test your own use cases or use cases from the validation sheet [10].
>>>>> >>
>>>>> >> The complete staging area is available for your review, which
>>>>> includes:
>>>>> >>
>>>>> >> GitHub Release notes [1],
>>>>> >> the official Apache source release to be deployed to
>>>>> dist.apache.org [2], which is signed with the key with fingerprint
>>>>>  [3],
>>>>> >> all artifacts to be deployed to the Maven Central Repository [4],
>>>>> >> source code tag "v1.2.3-RC3" [5],
>>>>> >> website pull request listing the release [6], the blog post [6],
>>>>> and publishing the API reference manual [7].
>>>>> >> Java artifacts were built with Gradle GRADLE_VERSION and
>>>>> OpenJDK/Oracle JDK JDK_VERSION.
>>>>> >> Python artifacts are deployed along with the source release to the
>>>>> dist.apache.org [2] and PyPI[8].
>>>>> >> Go artifacts and documentation are available at pkg.go.dev [9]
>>>>> >> Validation sheet with a tab for 1.2.3 release to help with
>>>>> validation [10].
>>>>> >> Docker images published to Docker Hub [11].
>>>>> >> PR to run tests against release branch [12].
>>>>> >>
>>>>> >> The vote will be open for at least 72 hours. It is adopted by
>>>>> majority approval, with at least 3 PMC affirmative votes.
>>>>> >>
>>>>> >> For guidelines on how to try the release in your projects, check
>>>>> out our blog post at
>>>>> https://beam.apache.org/blog/validate-beam-release/.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Kenn
>>>>> >>
>>>>> >> [1] https://github.com/apache/beam/milestone/15
>>>>> >> [2] https://dist.apache.org/repos/dist/dev/beam/2.51.0
>>>>> >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>>> >> [4]
>>>>> https://repository.apache.org/content/repositories/orgapachebeam-1356/
>>>>> >> [5] https://github.com/apache/beam/tree/v2.51.0-RC1
>>>>> >> [6] https://github.com/apache/beam/pull/28800
>>>>> >> [7] https://github.com/apache/beam-site/pull/649
>>>>> >> [8] https://pypi.org/project/apache-beam/2.51.0rc1/
>>>>> >> [9]
>>>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.51.0-RC1/go/pkg/beam
>>>>> >> [10]
>>>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=437054928
>>>>> >> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>>>> >> [12] https://github.com/apache/beam/pull/28663
>>>>>
>>>>>


Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-06 Thread Kenneth Knowles
Additionally we need https://github.com/apache/beam/pull/28665/files in
order to run GHA tests.

On Fri, Oct 6, 2023 at 3:19 PM Kenneth Knowles  wrote:

> That PR was prior to many cherry-picks so it is not the signal we need. I
> have updated it to the tip of the release-2.51.0 branch.
>
> There were some post-commit tests involving JPMS that I believe need
> https://github.com/apache/beam/pull/28726 to pass.
>
> Kenn
>
> On Fri, Oct 6, 2023 at 2:53 PM Valentyn Tymofieiev via dev <
> dev@beam.apache.org> wrote:
>
>> > PR to run tests against release branch [12].
>>
>>  https://github.com/apache/beam/pull/28663 is closed and test signal is
>> no longer available. did all the tests pass?
>>
>> On Fri, Oct 6, 2023 at 5:32 AM Alexey Romanenko 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> —
>>> Alexey
>>>
>>> > On 5 Oct 2023, at 18:38, Jean-Baptiste Onofré  wrote:
>>> >
>>> > +1 (binding)
>>> >
>>> > Thanks !
>>> > Regards
>>> > JB
>>> >
>>> > On Tue, Oct 3, 2023 at 7:58 PM Kenneth Knowles 
>>> wrote:
>>> >>
>>> >> Hi everyone,
>>> >>
>>> >> Please review and vote on the release candidate #1 for the version
>>> 2.51.0, as follows:
>>> >>
>>> >> [ ] +1, Approve the release
>>> >> [ ] -1, Do not approve the release (please provide specific comments)
>>> >>
>>> >> Reviewers are encouraged to test their own use cases with the release
>>> candidate, and vote +1 if no issues are found. Only PMC member votes will
>>> count towards the final vote, but votes from all community members are
>>> encouraged and helpful for finding regressions; you can either test your
>>> own use cases or use cases from the validation sheet [10].
>>> >>
>>> >> The complete staging area is available for your review, which
>>> includes:
>>> >>
>>> >> GitHub Release notes [1],
>>> >> the official Apache source release to be deployed to dist.apache.org
>>> [2], which is signed with the key with fingerprint  [3],
>>> >> all artifacts to be deployed to the Maven Central Repository [4],
>>> >> source code tag "v1.2.3-RC3" [5],
>>> >> website pull request listing the release [6], the blog post [6], and
>>> publishing the API reference manual [7].
>>> >> Java artifacts were built with Gradle GRADLE_VERSION and
>>> OpenJDK/Oracle JDK JDK_VERSION.
>>> >> Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2] and PyPI[8].
>>> >> Go artifacts and documentation are available at pkg.go.dev [9]
>>> >> Validation sheet with a tab for 1.2.3 release to help with validation
>>> [10].
>>> >> Docker images published to Docker Hub [11].
>>> >> PR to run tests against release branch [12].
>>> >>
>>> >> The vote will be open for at least 72 hours. It is adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>> >>
>>> >> For guidelines on how to try the release in your projects, check out
>>> our blog post at https://beam.apache.org/blog/validate-beam-release/.
>>> >>
>>> >> Thanks,
>>> >> Kenn
>>> >>
>>> >> [1] https://github.com/apache/beam/milestone/15
>>> >> [2] https://dist.apache.org/repos/dist/dev/beam/2.51.0
>>> >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> >> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1356/
>>> >> [5] https://github.com/apache/beam/tree/v2.51.0-RC1
>>> >> [6] https://github.com/apache/beam/pull/28800
>>> >> [7] https://github.com/apache/beam-site/pull/649
>>> >> [8] https://pypi.org/project/apache-beam/2.51.0rc1/
>>> >> [9]
>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.51.0-RC1/go/pkg/beam
>>> >> [10]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=437054928
>>> >> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>> >> [12] https://github.com/apache/beam/pull/28663
>>>
>>>


Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-06 Thread Kenneth Knowles
That PR was prior to many cherry-picks so it is not the signal we need. I
have updated it to the tip of the release-2.51.0 branch.

There were some post-commit tests involving JPMS that I believe need
https://github.com/apache/beam/pull/28726 to pass.

Kenn

On Fri, Oct 6, 2023 at 2:53 PM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> > PR to run tests against release branch [12].
>
>  https://github.com/apache/beam/pull/28663 is closed and test signal is
> no longer available. did all the tests pass?
>
> On Fri, Oct 6, 2023 at 5:32 AM Alexey Romanenko 
> wrote:
>
>> +1 (binding)
>>
>> —
>> Alexey
>>
>> > On 5 Oct 2023, at 18:38, Jean-Baptiste Onofré  wrote:
>> >
>> > +1 (binding)
>> >
>> > Thanks !
>> > Regards
>> > JB
>> >
>> > On Tue, Oct 3, 2023 at 7:58 PM Kenneth Knowles  wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> Please review and vote on the release candidate #1 for the version
>> 2.51.0, as follows:
>> >>
>> >> [ ] +1, Approve the release
>> >> [ ] -1, Do not approve the release (please provide specific comments)
>> >>
>> >> Reviewers are encouraged to test their own use cases with the release
>> candidate, and vote +1 if no issues are found. Only PMC member votes will
>> count towards the final vote, but votes from all community members are
>> encouraged and helpful for finding regressions; you can either test your
>> own use cases or use cases from the validation sheet [10].
>> >>
>> >> The complete staging area is available for your review, which includes:
>> >>
>> >> GitHub Release notes [1],
>> >> the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint  [3],
>> >> all artifacts to be deployed to the Maven Central Repository [4],
>> >> source code tag "v1.2.3-RC3" [5],
>> >> website pull request listing the release [6], the blog post [6], and
>> publishing the API reference manual [7].
>> >> Java artifacts were built with Gradle GRADLE_VERSION and
>> OpenJDK/Oracle JDK JDK_VERSION.
>> >> Python artifacts are deployed along with the source release to the
>> dist.apache.org [2] and PyPI[8].
>> >> Go artifacts and documentation are available at pkg.go.dev [9]
>> >> Validation sheet with a tab for 1.2.3 release to help with validation
>> [10].
>> >> Docker images published to Docker Hub [11].
>> >> PR to run tests against release branch [12].
>> >>
>> >> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>> >>
>> >> For guidelines on how to try the release in your projects, check out
>> our blog post at https://beam.apache.org/blog/validate-beam-release/.
>> >>
>> >> Thanks,
>> >> Kenn
>> >>
>> >> [1] https://github.com/apache/beam/milestone/15
>> >> [2] https://dist.apache.org/repos/dist/dev/beam/2.51.0
>> >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> >> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1356/
>> >> [5] https://github.com/apache/beam/tree/v2.51.0-RC1
>> >> [6] https://github.com/apache/beam/pull/28800
>> >> [7] https://github.com/apache/beam-site/pull/649
>> >> [8] https://pypi.org/project/apache-beam/2.51.0rc1/
>> >> [9]
>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.51.0-RC1/go/pkg/beam
>> >> [10]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=437054928
>> >> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>> >> [12] https://github.com/apache/beam/pull/28663
>>
>>


Re: Reshuffle PTransform Design Doc

2023-10-06 Thread Kenneth Knowles
On Fri, Oct 6, 2023 at 3:07 PM Jan Lukavský  wrote:

>
> On 10/6/23 15:11, Kenneth Knowles wrote:
>
>
>
> On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský  wrote:
>
>> Hi,
>>
>> there is also one other thing to mention with relation to
>> Reshuffle/RequiresStableinput and that is that our current implementation
>> of RequiresStableInput can break without Reshuffle in some corner cases on
>> most portable runners, at least with Java GreedyPipelineFuser, see [1]. The
>> only way to workaround this currently is inserting Reshuffle (or any other
>> fusion-break transform) directly before the stable DoFn (Reshuffle is
>> handy, because it does not change the data). I think we should either
>> somehow fix the issue [1] or include fusion break as a mandatory
>> requirement for the new Redistribute transform as well (at least with some
>> variant) or possibly add a new "hint" for non-optional fusion breaking.
>>
> This is actually the bug we have wanted to fix for years - redistribute
> has nothing to do with checkpointing or stable input and Reshuffle
> incorrectly merges the two concepts.
>
> I agree that we couldn't make any immediate change that will break a
> runner. I believe runners that depend on Reshuffle to provide stable input
> will also provide stable input after GroupByKey. Since the SDK expansion of
> Reshuffle will still contain a GBK, those runners' functionality will be
> unchanged.
>
> I don't yet have a firm opinion between these approaches:
>
> 1. Adjust the Java SDK implementation of Reshuffle (and maybe other SDKs
> if needed). With some flag so that users can use the old wrong behavior for
> update compatibility.
> 2. Add a Redistribute transform to the SDKs that has the right behavior
> and leave Reshuffle as it is.
> 1+2. Add the Redistribute transform but also make Reshuffle call it, so
> Reshuffle also gets the new behavior, with the same flag so that users can
> use the old wrong behavior for update compatibility.
>
> All of these will leave "Reshuffle for RequestStableInput" alone for now.
> The options that include (2) will move us a little closer to migrating to a
> "better" future state.
>
> I might not have expressed it the right way. I understand that Reshuffle
> having "stable input" functionality is non-portable side-effect. It would
> be nice to get rid of it and my impression from this thread was that we
> would try to deprecate Reshuffle and introduce Redistribute which will not
> have such semantics. All of this is fine, problem is that we currently (is
> some corner cases) rely on Reshuffle *even though* Pipeline uses
> @RequiresStableInput. That is due to the fact that Reshuffle also ensures
> fusion breaking. Fusing a non-deterministic DoFn with a stable DoFn breaks the
> stable input property, because runners can ensure stability only at the
> input of executable stage. Therefore we would either need to:
>
>  a) define Redistribute as being an unconditional fusion break boundary, or
>
>  b) define some other transform or hint to be able to enforce fusion
> breaking
>
> Otherwise I'd be in favor of 2 and deprecation of Reshuffle.
>

Just to be very clear - my goal right now is to just give Reshuffle a
consistent semantics. Even for the old "stable input + redistribute" use of
Reshuffle, the semantics are inconsistent/undefined and the Java SDK
expansion is wrong. Changing things having to do with stable input are not
part of what I am trying to change right now. But it is fine to do things
that prepare for that.
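
(For anyone hitting the fusion corner case today, a minimal sketch of the
workaround Jan describes, i.e. forcing a fusion break directly before the
stable DoFn; MyStableDoFn is hypothetical:

  input
      .apply("BreakFusion", Reshuffle.viaRandomKey())
      .apply("StableStep", ParDo.of(new MyStableDoFn()));  // has @RequiresStableInput

Whether every runner treats Reshuffle as a hard fusion break is exactly the
open question above.)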

Kenn


>  Jan
>
>
> Any votes? Any other options?
>
> Kenn
>
>  Jan
>>
>> [1] https://github.com/apache/beam/issues/24655
>> On 10/5/23 21:01, Robert Burke wrote:
>>
>> Reshuffle/redistribute being a transform has the benefit of allowing
>> existing runners that aren't updated to be aware of the new urns to rely on
>> an SDK side implementation, which may be more expensive than what the
>> runner is able to do with that awareness.
>>
>> Aka: it gives purpose to the fallback implementations.
>>
>> On Thu, Oct 5, 2023, 9:03 AM Kenneth Knowles  wrote:
>>
>>> Another perspective, ignoring runners custom implementations and
>>> non-Java SDKs could be that the semantics are perfectly well defined: it is
>>> a composite and its semantics are defined by its implementation in terms of
>>> primitives. It is just that this expansion is not what we want so we should
>>> not use it (and also we shouldn't use "whatever the implementation does" as
>>> a spec for anything we care about).
>>>
>>> On

Re: Reshuffle PTransform Design Doc

2023-10-06 Thread Kenneth Knowles
On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský  wrote:

> Hi,
>
> there is also one other thing to mention with relation to
> Reshuffle/RequiresStableinput and that is that our current implementation
> of RequiresStableInput can break without Reshuffle in some corner cases on
> most portable runners, at least with Java GreedyPipelineFuser, see [1]. The
> only way to workaround this currently is inserting Reshuffle (or any other
> fusion-break transform) directly before the stable DoFn (Reshuffle is
> handy, because it does not change the data). I think we should either
> somehow fix the issue [1] or include fusion break as a mandatory
> requirement for the new Redistribute transform as well (at least with some
> variant) or possibly add a new "hint" for non-optional fusion breaking.
>
> This is actually the bug we have wanted to fix for years - redistribute
has nothing to do with checkpointing or stable input and Reshuffle
incorrectly merges the two concepts.

I agree that we couldn't make any immediate change that will break a
runner. I believe runners that depend on Reshuffle to provide stable input
will also provide stable input after GroupByKey. Since the SDK expansion of
Reshuffle will still contain a GBK, those runners' functionality will be
unchanged.

I don't yet have a firm opinion between these approaches:

1. Adjust the Java SDK implementation of Reshuffle (and maybe other SDKs if
needed). With some flag so that users can use the old wrong behavior for
update compatibility.
2. Add a Redistribute transform to the SDKs that has the right behavior and
leave Reshuffle as it is.
1+2. Add the Redistribute transform but also make Reshuffle call it, so
Reshuffle also gets the new behavior, with the same flag so that users can
use the old wrong behavior for update compatibility.

All of these will leave "Reshuffle for RequestStableInput" alone for now.
The options that include (2) will move us a little closer to migrating to a
"better" future state.

Any votes? Any other options?

Kenn

 Jan
>
> [1] https://github.com/apache/beam/issues/24655
> On 10/5/23 21:01, Robert Burke wrote:
>
> Reshuffle/redistribute being a transform has the benefit of allowing
> existing runners that aren't updated to be aware of the new urns to rely on
> an SDK side implementation, which may be more expensive than what the
> runner is able to do with that awareness.
>
> Aka: it gives purpose to the fallback implementations.
>
> On Thu, Oct 5, 2023, 9:03 AM Kenneth Knowles  wrote:
>
>> Another perspective, ignoring runners custom implementations and non-Java
>> SDKs could be that the semantics are perfectly well defined: it is a
>> composite and its semantics are defined by its implementation in terms of
>> primitives. It is just that this expansion is not what we want so we should
>> not use it (and also we shouldn't use "whatever the implementation does" as
>> a spec for anything we care about).
>>
>> On Thu, Oct 5, 2023 at 11:56 AM Kenneth Knowles  wrote:
>>
>>> I totally agree. I am motivated right now by the fact that it is already
>>> used all over the place but with no consistent semantics. Maybe it is
>>> simpler to focus on just making the minimal change, which would basically
>>> be to update the expansion of the Reshuffle in the Java SDK.
>>>
>>> Kenn
>>>
>>> On Thu, Oct 5, 2023 at 11:39 AM John Casey 
>>> wrote:
>>>
>>>> Given that this is a hint, I'm not sure redistribute should be a
>>>> PTransform as opposed to some other way to hint to a runner.
>>>>
>>>> I'm not sure of what the syntax of that would be, but a semantic no-op
>>>> transform that the runner may or may not do anything with is odd.
>>>>
>>>> On Thu, Oct 5, 2023 at 11:30 AM Kenneth Knowles 
>>>> wrote:
>>>>
>>>>> So a high level suggestion from Robert that I want to highlight as a
>>>>> top-post:
>>>>>
>>>>> Instead of focusing on just fixing the SDKs and runners Reshuffle,
>>>>> this could be an opportunity to introduce Redistribute which was proposed
>>>>> in the long-ago thread. The semantics are identical but it is more clear
>>>>> that it *only* is a hint about redistributing data and there is no
>>>>> expectation of a checkpoint.
>>>>>
>>>>> This new name may also be an opportunity to maintain update
>>>>> compatibility (though this may actually be leaving unsafe code in user's
>>>>> hands) and/or separate @RequiresStableInput/checkpointing uses of

Re: Reshuffle PTransform Design Doc

2023-10-05 Thread Kenneth Knowles
Another perspective, ignoring runners custom implementations and non-Java
SDKs could be that the semantics are perfectly well defined: it is a
composite and its semantics are defined by its implementation in terms of
primitives. It is just that this expansion is not what we want so we should
not use it (and also we shouldn't use "whatever the implementation does" as
a spec for anything we care about).

On Thu, Oct 5, 2023 at 11:56 AM Kenneth Knowles  wrote:

> I totally agree. I am motivated right now by the fact that it is already
> used all over the place but with no consistent semantics. Maybe it is
> simpler to focus on just making the minimal change, which would basically
> be to update the expansion of the Reshuffle in the Java SDK.
>
> Kenn
>
> On Thu, Oct 5, 2023 at 11:39 AM John Casey 
> wrote:
>
>> Given that this is a hint, I'm not sure redistribute should be a
>> PTransform as opposed to some other way to hint to a runner.
>>
>> I'm not sure of what the syntax of that would be, but a semantic no-op
>> transform that the runner may or may not do anything with is odd.
>>
>> On Thu, Oct 5, 2023 at 11:30 AM Kenneth Knowles  wrote:
>>
>>> So a high level suggestion from Robert that I want to highlight as a
>>> top-post:
>>>
>>> Instead of focusing on just fixing the SDKs and runners Reshuffle, this
>>> could be an opportunity to introduce Redistribute which was proposed in the
>>> long-ago thread. The semantics are identical but it is more clear that it
>>> *only* is a hint about redistributing data and there is no expectation
>>> of a checkpoint.
>>>
>>> This new name may also be an opportunity to maintain update
>>> compatibility (though this may actually be leaving unsafe code in user's
>>> hands) and/or separate @RequiresStableInput/checkpointing uses of Reshuffle
>>> from redistribution-only uses of Reshuffle.
>>>
>>> Any other thoughts on this one high level bit?
>>>
>>> Kenn
>>>
>>> On Thu, Oct 5, 2023 at 11:15 AM Kenneth Knowles  wrote:
>>>
>>>>
>>>> On Wed, Oct 4, 2023 at 7:45 PM Robert Burke 
>>>> wrote:
>>>>
>>>>> LGTM.
>>>>>
>>>>> It looks like the Go SDK already adheres to these semantics as well for
>>>>> the reference impl (well, reshuffle/redistribute_randomly and _by_key
>>>>> aren't implemented in the Go SDK; it only uses the existing unqualified
>>>>> reshuffle URN [0]).
>>>>>
>>>>> The original strategy, and then for every element, the original
>>>>> Window, TS, and Pane are all serialized, shuffled, and then deserialized
>>>>> downstream.
>>>>>
>>>>>
>>>>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L65
>>>>>
>>>>>
>>>>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L145
>>>>>
>>>>> Prism at the moment vacuously implements reshuffle by omitting the
>>>>> node, and rewriting the inputs and outputs [1], as it's a local runner 
>>>>> with
>>>>> single transform per bundle execution, but I was intending to make it a
>>>>> fusion break regardless.  Ultimately prism's "test" variant will default 
>>>>> to
>>>>> executing the SDKs dictated reference implementation for the composite(s),
>>>>> and any "fast" or "prod" variant would simply do the current 
>>>>> implementation.
>>>>>
>>>>
>>>> Very nice!
>>>>
>>>> And of course I should have linked out to the existing reshuffle URN in
>>>> the proto.
>>>>
>>>> Kenn
>>>>
>>>>
>>>>
>>>>>
>>>>> Robert Burke
>>>>> Beam Go Busybody
>>>>>
>>>>> [0]:
>>>>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/graphx/translate.go#L46C3-L46C50
>>>>> [1]:
>>>>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/handlerunner.go#L82
>>>>>
>>>>>
>>>>>
>>>>> On 2023/09/26 15:43:53 Kenneth Knowles wrote:
>>>>> > Hi everyone,
>>>>> >
>>>>> > Recently there was a bug [1] caused by discrepancies between two of
>>>>> > Dataflow's reshuffle implementations. I think the reference
>>>>> implementation
>>>>> > in the Java SDK [2] also does not match. This all led to discussion
>>>>> on the
>>>>> > bug and the pull request [3] about what the actual semantics should
>>>>> be. I
>>>>> > got it wrong, maybe multiple times. So I wrote up a very short
>>>>> document to
>>>>> > finish the discussion:
>>>>> >
>>>>> > https://s.apache.org/beam-reshuffle
>>>>> >
>>>>> > This is also probably among the simplest imaginable use of
>>>>> > http://s.apache.org/ptransform-design-doc in case you want to see
>>>>> kind of
>>>>> > how I intended it to be used.
>>>>> >
>>>>> > Kenn
>>>>> >
>>>>> > [1] https://github.com/apache/beam/issues/28219
>>>>> > [2]
>>>>> >
>>>>> https://github.com/apache/beam/blob/d52b077ad505c8b50f10ec6a4eb83d385cdaf96a/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L84
>>>>> > [3] https://github.com/apache/beam/pull/28272
>>>>> >
>>>>>
>>>>


Re: Reshuffle PTransform Design Doc

2023-10-05 Thread Kenneth Knowles
I totally agree. I am motivated right now by the fact that it is already
used all over the place but with no consistent semantics. Maybe it is
simpler to focus on just making the minimal change, which would basically
be to update the expansion of the Reshuffle in the Java SDK.

Kenn

On Thu, Oct 5, 2023 at 11:39 AM John Casey  wrote:

> Given that this is a hint, I'm not sure redistribute should be a
> PTransform as opposed to some other way to hint to a runner.
>
> I'm not sure of what the syntax of that would be, but a semantic no-op
> transform that the runner may or may not do anything with is odd.
>
> On Thu, Oct 5, 2023 at 11:30 AM Kenneth Knowles  wrote:
>
>> So a high level suggestion from Robert that I want to highlight as a
>> top-post:
>>
>> Instead of focusing on just fixing the SDKs and runners Reshuffle, this
>> could be an opportunity to introduce Redistribute which was proposed in the
>> long-ago thread. The semantics are identical but it is more clear that it
>> *only* is a hint about redistributing data and there is no expectation
>> of a checkpoint.
>>
>> This new name may also be an opportunity to maintain update compatibility
>> (though this may actually be leaving unsafe code in user's hands) and/or
>> separate @RequiresStableInput/checkpointing uses of Reshuffle from
>> redistribution-only uses of Reshuffle.
>>
>> Any other thoughts on this one high level bit?
>>
>> Kenn
>>
>> On Thu, Oct 5, 2023 at 11:15 AM Kenneth Knowles  wrote:
>>
>>>
>>> On Wed, Oct 4, 2023 at 7:45 PM Robert Burke  wrote:
>>>
>>>> LGTM.
>>>>
>>>> It looks like the Go SDK already adheres to these semantics as well for
>>>> the reference impl (well, reshuffle/redistribute_randomly and _by_key
>>>> aren't implemented in the Go SDK; it only uses the existing unqualified
>>>> reshuffle URN [0]).
>>>>
>>>> The original strategy, and then for every element, the original Window,
>>>> TS, and Pane are all serialized, shuffled, and then deserialized 
>>>> downstream.
>>>>
>>>>
>>>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L65
>>>>
>>>>
>>>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L145
>>>>
>>>> Prism at the moment vacuously implements reshuffle by omitting the
>>>> node, and rewriting the inputs and outputs [1], as it's a local runner with
>>>> single transform per bundle execution, but I was intending to make it a
>>>> fusion break regardless.  Ultimately prism's "test" variant will default to
>>>> executing the SDKs dictated reference implementation for the composite(s),
>>>> and any "fast" or "prod" variant would simply do the current 
>>>> implementation.
>>>>
>>>
>>> Very nice!
>>>
>>> And of course I should have linked out to the existing reshuffle URN in
>>> the proto.
>>>
>>> Kenn
>>>
>>>
>>>
>>>>
>>>> Robert Burke
>>>> Beam Go Busybody
>>>>
>>>> [0]:
>>>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/graphx/translate.go#L46C3-L46C50
>>>> [1]:
>>>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/handlerunner.go#L82
>>>>
>>>>
>>>>
>>>> On 2023/09/26 15:43:53 Kenneth Knowles wrote:
>>>> > Hi everyone,
>>>> >
>>>> > Recently there was a bug [1] caused by discrepancies between two of
>>>> > Dataflow's reshuffle implementations. I think the reference
>>>> implementation
>>>> > in the Java SDK [2] also does not match. This all led to discussion
>>>> on the
>>>> > bug and the pull request [3] about what the actual semantics should
>>>> be. I
>>>> > got it wrong, maybe multiple times. So I wrote up a very short
>>>> document to
>>>> > finish the discussion:
>>>> >
>>>> > https://s.apache.org/beam-reshuffle
>>>> >
>>>> > This is also probably among the simplest imaginable use of
>>>> > http://s.apache.org/ptransform-design-doc in case you want to see
>>>> kind of
>>>> > how I intended it to be used.
>>>> >
>>>> > Kenn
>>>> >
>>>> > [1] https://github.com/apache/beam/issues/28219
>>>> > [2]
>>>> >
>>>> https://github.com/apache/beam/blob/d52b077ad505c8b50f10ec6a4eb83d385cdaf96a/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L84
>>>> > [3] https://github.com/apache/beam/pull/28272
>>>> >
>>>>
>>>


Re: Reshuffle PTransform Design Doc

2023-10-05 Thread Kenneth Knowles
So a high level suggestion from Robert that I want to highlight as a
top-post:

Instead of focusing on just fixing the SDKs' and runners' Reshuffle, this
could be an opportunity to introduce Redistribute, which was proposed in the
long-ago thread. The semantics are identical, but it is clearer that it is
*only* a hint about redistributing data and there is no expectation of a
checkpoint.
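
For concreteness, here is a minimal Java sketch of what such a transform
could look like. This is purely illustrative: the Redistribute class and its
arbitrarily() factory are hypothetical names (the transform is only a
proposal at this point), and the expansion just reuses the existing
Reshuffle.viaRandomKey() for the actual data movement.

    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.PCollection;

    // Hypothetical sketch, not the actual Beam API: a hint-only transform
    // whose name promises redistribution and nothing else (no checkpoint).
    public final class Redistribute<T>
        extends PTransform<PCollection<T>, PCollection<T>> {

      public static <T> Redistribute<T> arbitrarily() {
        return new Redistribute<>();
      }

      @Override
      public PCollection<T> expand(PCollection<T> input) {
        // Reuse the existing reference implementation for the shuffle; the
        // distinct transform (and eventually a distinct URN) is what lets
        // runners and users tell the two intents apart.
        return input.apply("RedistributeArbitrarily", Reshuffle.viaRandomKey());
      }
    }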

This new name may also be an opportunity to maintain update compatibility
(though this may actually be leaving unsafe code in users' hands) and/or
separate @RequiresStableInput/checkpointing uses of Reshuffle from
redistribution-only uses of Reshuffle.
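
For contrast, a sketch of the checkpointing use case that would stay on the
@RequiresStableInput side of such a split. Everything here is illustrative:
ExternalSink is a stand-in for any non-idempotent writer, and the point is
only that the annotation requests the stable-input guarantee from the runner
directly, instead of smuggling it in via an upstream Reshuffle that happens
to checkpoint on some runners.

    import java.io.Serializable;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // Hypothetical stand-in for any non-idempotent external writer.
    interface ExternalSink extends Serializable {
      void write(KV<String, Long> record);
    }

    class WriteToExternalSystem extends DoFn<KV<String, Long>, Void> {
      private final ExternalSink sink;

      WriteToExternalSystem(ExternalSink sink) {
        this.sink = sink;
      }

      // @RequiresStableInput asks the runner for replay-stable input, so
      // retried bundles see exactly the same elements.
      @RequiresStableInput
      @ProcessElement
      public void process(@Element KV<String, Long> record) {
        sink.write(record); // non-idempotent; stable input makes retries safe
      }
    }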

Any other thoughts on this one high level bit?

Kenn

On Thu, Oct 5, 2023 at 11:15 AM Kenneth Knowles  wrote:

>
> On Wed, Oct 4, 2023 at 7:45 PM Robert Burke  wrote:
>
>> LGTM.
>>
>> It looks like the Go SDK already adheres to these semantics as well for
>> the reference impl (well, reshuffle/redistribute_randomly; _by_key isn't
>> implemented in the Go SDK, and it only uses the existing unqualified
>> reshuffle URN [0]).
>>
>> The original windowing strategy is preserved, and for every element the
>> original Window, TS, and Pane are all serialized, shuffled, and then
>> deserialized downstream.
>>
>>
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L65
>>
>>
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L145
>>
>> Prism at the moment vacuously implements reshuffle by omitting the node
>> and rewriting the inputs and outputs [1], as it's a local runner with
>> single-transform-per-bundle execution, but I was intending to make it a
>> fusion break regardless. Ultimately prism's "test" variant will default to
>> executing the SDK's dictated reference implementation for the composite(s),
>> and any "fast" or "prod" variant would simply do the current implementation.
>>
>
> Very nice!
>
> And of course I should have linked out to the existing reshuffle URN in
> the proto.
>
> Kenn
>
>
>
>>
>> Robert Burke
>> Beam Go Busybody
>>
>> [0]:
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/graphx/translate.go#L46C3-L46C50
>> [1]:
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/handlerunner.go#L82
>>
>>
>>
>> On 2023/09/26 15:43:53 Kenneth Knowles wrote:
>> > Hi everyone,
>> >
>> > Recently there was a bug [1] caused by discrepancies between two of
>> > Dataflow's reshuffle implementations. I think the reference
>> > implementation in the Java SDK [2] also does not match. This all led to
>> > discussion on the bug and the pull request [3] about what the actual
>> > semantics should be. I got it wrong, maybe multiple times. So I wrote
>> > up a very short document to finish the discussion:
>> >
>> > https://s.apache.org/beam-reshuffle
>> >
>> > This is also probably among the simplest imaginable use of
>> > http://s.apache.org/ptransform-design-doc in case you want to see kind
>> > of how I intended it to be used.
>> >
>> > Kenn
>> >
>> > [1] https://github.com/apache/beam/issues/28219
>> > [2]
>> > https://github.com/apache/beam/blob/d52b077ad505c8b50f10ec6a4eb83d385cdaf96a/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L84
>> > [3] https://github.com/apache/beam/pull/28272
>> >
>>
>


Re: Reshuffle PTransform Design Doc

2023-10-05 Thread Kenneth Knowles
On Wed, Oct 4, 2023 at 7:45 PM Robert Burke  wrote:

> LGTM.
>
> It looks like the Go SDK already adheres to these semantics as well for
> the reference impl (well, reshuffle/redistribute_randomly; _by_key isn't
> implemented in the Go SDK, and it only uses the existing unqualified
> reshuffle URN [0]).
>
> The original windowing strategy is preserved, and for every element the
> original Window, TS, and Pane are all serialized, shuffled, and then
> deserialized downstream.
>
>
> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L65
>
>
> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L145
>
> Prism at the moment vacuously implements reshuffle by omitting the node
> and rewriting the inputs and outputs [1], as it's a local runner with
> single-transform-per-bundle execution, but I was intending to make it a
> fusion break regardless. Ultimately prism's "test" variant will default to
> executing the SDK's dictated reference implementation for the composite(s),
> and any "fast" or "prod" variant would simply do the current implementation.
>

Very nice!

And of course I should have linked out to the existing reshuffle URN in the
proto.

Kenn



>
> Robert Burke
> Beam Go Busybody
>
> [0]:
> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/graphx/translate.go#L46C3-L46C50
> [1]:
> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/handlerunner.go#L82
>
>
>
> On 2023/09/26 15:43:53 Kenneth Knowles wrote:
> > Hi everyone,
> >
> > Recently there was a bug [1] caused by discrepancies between two of
> > Dataflow's reshuffle implementations. I think the reference
> > implementation in the Java SDK [2] also does not match. This all led to
> > discussion on the bug and the pull request [3] about what the actual
> > semantics should be. I got it wrong, maybe multiple times. So I wrote up
> > a very short document to finish the discussion:
> >
> > https://s.apache.org/beam-reshuffle
> >
> > This is also probably among the simplest imaginable use of
> > http://s.apache.org/ptransform-design-doc in case you want to see kind
> > of how I intended it to be used.
> >
> > Kenn
> >
> > [1] https://github.com/apache/beam/issues/28219
> > [2]
> > https://github.com/apache/beam/blob/d52b077ad505c8b50f10ec6a4eb83d385cdaf96a/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L84
> > [3] https://github.com/apache/beam/pull/28272
> >
>
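
To make the "reify, shuffle, restore" shape described above concrete, here
is a rough Java-SDK analogue (a sketch only, not the linked Go code or the
literal Reshuffle source): the element metadata rides through the shuffle as
ordinary data. Restoring the original window assignment, which the real
reference implementation handles with an identity WindowFn, is elided here.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reify;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.ValueInSingleWindow;
    import org.joda.time.Duration;

    class ReifyShuffleRestore {
      static PCollection<Long> expand(PCollection<KV<String, Long>> keyed) {
        // 1) Reify: window, timestamp, and pane become part of the value,
        //    so the shuffle carries them and cannot lose them.
        PCollection<KV<String, ValueInSingleWindow<Long>>> reified =
            keyed.apply(Reify.windowsInValue());

        // 2) Shuffle: the GroupByKey is the actual redistribution point.
        PCollection<KV<String, Iterable<ValueInSingleWindow<Long>>>> grouped =
            reified.apply(GroupByKey.create());

        // 3) Restore: re-emit each element at its original timestamp.
        return grouped.apply(ParDo.of(
            new DoFn<KV<String, Iterable<ValueInSingleWindow<Long>>>, Long>() {
              @Override
              public Duration getAllowedTimestampSkew() {
                // Original timestamps may precede the grouped element's
                // timestamp, so allow unbounded skew for the re-emit.
                return Duration.millis(Long.MAX_VALUE);
              }

              @ProcessElement
              public void process(
                  @Element KV<String, Iterable<ValueInSingleWindow<Long>>> e,
                  OutputReceiver<Long> out) {
                for (ValueInSingleWindow<Long> v : e.getValue()) {
                  out.outputWithTimestamp(v.getValue(), v.getTimestamp());
                }
              }
            }));
      }
    }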


[ANNOUNCE] New PMC Member: Valentyn Tymofieiev

2023-10-03 Thread Kenneth Knowles
Hi all,

Please join me and the rest of the Beam PMC in welcoming Valentyn
Tymofieiev  as our newest PMC member.

Valentyn has been contributing to Beam since 2017. Notable highlights
include his work on the Python SDK and also in our container management.
Valentyn also is involved in many discussions around Beam's infrastructure
and community processes. If you look through Valentyn's history, you will
see an abundance of the most critical maintenance work that is the beating
heart of any project.

Congratulations Valentyn and thanks for being a part of Apache Beam!

Kenn, on behalf of the Beam PMC (which now includes Valentyn)


[ANNOUNCE] New PMC Member: Robert Burke

2023-10-03 Thread Kenneth Knowles
Hi all,

Please join me and the rest of the Beam PMC in welcoming Robert Burke <
lostl...@apache.org> as our newest PMC member.

Robert has been a part of the Beam community since 2017. He is our resident
Gopher, producing the Go SDK and most recently the local, portable, Prism
runner. Robert has presented on Beam many times, having written not just
core Beam code but quite interesting pipelines too :-)

Congratulations Robert and thanks for being a part of Apache Beam!

Kenn, on behalf of the Beam PMC (which now includes Robert)


[ANNOUNCE] New PMC Member: Alex Van Boxel

2023-10-03 Thread Kenneth Knowles
Hi all,

Please join me and the rest of the Beam PMC in welcoming Alex Van Boxel <
alexvanbo...@apache.org> as our newest PMC member.

Alex has been with Beam since 2016, very early in the life of the project.
Alex has contributed code, design ideas, and perhaps most importantly been
a huge part of organizing Beam Summits, and of course presenting at them as
well. Alex really brings the ASF community spirit to Beam.

Congratulations Alex and thanks for being a part of Apache Beam!

Kenn, on behalf of the Beam PMC (which now includes Alex)


[VOTE] Release 2.51.0, release candidate #1

2023-10-03 Thread Kenneth Knowles
Hi everyone,

Please review and vote on the release candidate #1 for the version 2.51.0,
as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

Reviewers are encouraged to test their own use cases with the release
candidate, and vote +1 if no issues are found. Only PMC member votes will
count towards the final vote, but votes from all community members are
encouraged and helpful for finding regressions; you can either test your
own use cases or use cases from the validation sheet [10].

The complete staging area is available for your review, which includes:

   - GitHub Release notes [1],
   - the official Apache source release to be deployed to dist.apache.org
   [2], which is signed with the key with fingerprint  [3],
   - all artifacts to be deployed to the Maven Central Repository [4],
   - source code tag "v2.51.0-RC1" [5],
   - website pull request listing the release [6], the blog post [6], and
   publishing the API reference manual [7].
   - Java artifacts were built with Gradle GRADLE_VERSION and
   OpenJDK/Oracle JDK JDK_VERSION.
   - Python artifacts are deployed along with the source release to the
   dist.apache.org [2] and PyPI[8].
   - Go artifacts and documentation are available at pkg.go.dev [9]
   - Validation sheet with a tab for the 2.51.0 release to help with validation
   [10].
   - Docker images published to Docker Hub [11].
   - PR to run tests against release branch [12].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

For guidelines on how to try the release in your projects, check out our
blog post at https://beam.apache.org/blog/validate-beam-release/.

Thanks,
Kenn

[1] https://github.com/apache/beam/milestone/15
[2] https://dist.apache.org/repos/dist/dev/beam/2.51.0
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1356/
[5] https://github.com/apache/beam/tree/v2.51.0-RC1
[6] https://github.com/apache/beam/pull/28800
[7] https://github.com/apache/beam-site/pull/649
[8] https://pypi.org/project/apache-beam/2.51.0rc1/
[9]
https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.51.0-RC1/go/pkg/beam
[10]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=437054928
[11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
[12] https://github.com/apache/beam/pull/28663


Re: [LAZY CONSENSUS] Create separate repository for Swift SDK

2023-09-29 Thread Kenneth Knowles
Hi all,

Thanks for your "approval" :-)

I have created https://github.com/apache/beam-swift

Kenn

On Mon, Sep 25, 2023 at 1:01 PM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> On Mon, Sep 25, 2023 at 9:03 AM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> I propose to unblock Byron's work by creating a new repository for the
>> Beam Swift SDK. This will be the first of its kind, and a break from the
>> tradition of having Beam be kind of a mini-mono-repo.
>>
>> Discussion of the Swift SDK and request for a separate repo is at
>> https://lists.apache.org/thread/25tp4yoptqxzty8t4fqznqxc3cwklpss
>>
>
> Additional context (since there was a branching between dev and user
> threads):
> https://lists.apache.org/thread/pc0s0953z6z09z597h1rwdskk2y00hmo . From
> the first message: *the "Swift Way" would be to have it in its own repo
> so that it can easily be used from the Swift Package Manager. *
>
>
>
>> I have created this thread to clearly separate this one issue, and
>> clearly record if we have consensus (or not).
>>
>> If no one has an objection or further discussion needed in 72 hours, it
>> can be considered approved and I will create the repository. See
>> https://community.apache.org/committers/lazyConsensus.html
>>
>> Kenn
>>
>

