Re: [Discuss] Idea to increase RC voting participation

2023-10-23 Thread Danny McCormick via dev
I'd probably vote to include both the issue filer and the contributor. Both
are about equally straightforward to collect - one way to do this would be to
use all issues related to that release's milestone and extract the issue
author and the issue closer.

This does leave out the (unfortunately sizable) set of contributions that
don't have an associated issue; if we're worried about that, we could
always fall back to anyone with a commit in the last release who doesn't
have an associated issue (aka what I thought we were initially proposing
and what I think Airflow does today).

I'm pretty much +1 on any sort of automation here, and it certainly can
come in stages :)

On Mon, Oct 23, 2023 at 1:50 PM Johanna Öjeling via dev 
wrote:

> Yes, that's a good point - we should also include those who created the issue.
>
> On Mon, Oct 23, 2023, 19:18 Robert Bradshaw via dev 
> wrote:
>
>> On Mon, Oct 23, 2023 at 7:26 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> So to summarize, I think there's broad consensus (or at least lazy
>>> consensus) around the following:
>>>
>>> - (1) Updating our release email/guidelines to be more specific about
>>> what we mean by release validation/how to be helpful during this process.
>>> This includes both encouraging validation within each user's own code base
>>> and encouraging people to document/share their process of validation and
>>> link it in the release spreadsheet.
>>> - (2) Doing something like what Airflow does (#29424
>>> ) and creating an issue
>>> asking people who have contributed to the current release to help validate
>>> their changes.
>>>
>>> I'm also +1 on doing both of these. The first bit (updating our
>>> guidelines) is relatively easy - it should just require updating
>>> https://github.com/apache/beam/blob/master/contributor-docs/release-guide.md#vote-and-validate-the-release-candidate
>>> .
>>>
>>> I took a look at the second piece (copying what Airflow does) to see if
>>> we could just copy their automation, but it looks like it's tied to
>>> airflow breeze
>>> 
>>> (their repo-specific automation tooling), so we'd probably need to build
>>> the automation ourselves. It shouldn't be terrible: basically, we'd want a
>>> GitHub Action that compares the current release tag with the last release
>>> tag, grabs all the commits in between, parses them to get the authors, and
>>> creates an issue with that data, but it does represent more effort than
>>> just updating a markdown file. There might even be an existing Action that
>>> can help with this; I haven't looked too hard.
>>>
>>
>> I was thinking along the lines of a script that would scrape the issues
>> resolved in a given release and add a comment to them noting that the
>> change is in release N and explaining (with clear instructions) how it
>> can be validated. Creating a "validate this release" issue with all
>> "contributing" participants could be an interesting way to do this as well.
>> (I think it'd be valuable to get those who filed the issue, not just those
>> who fixed it, to validate.)
>>
>>
>>> As our next release manager, I'm happy to review PRs for either of these
>>> if anyone wants to volunteer to help out. If not, I'm happy to update the
>>> guidelines, but I probably won't have time to add the commit inspection
>>> tooling (I'm planning on throwing any extra time towards continuing to
>>> automate release candidate creation, which is currently a more impactful
>>> problem IMO). I would very much like it if both of these things happened
>>> though :)
>>>
>>> Thanks,
>>> Danny
>>>
>>> On Mon, Oct 23, 2023 at 10:05 AM XQ Hu  wrote:
>>>
 +1. This is a great idea to try. @Danny McCormick
  FYI as our next release manager.

 On Wed, Oct 18, 2023 at 2:30 PM Johanna Öjeling via dev <
 dev@beam.apache.org> wrote:

> When I have contributed to Apache Airflow, they have tagged all
> contributors concerned in a GitHub issue when the RC is available and 
> asked
> us to validate it. Example: #29424
> .
>
> I found that to be an effective way to notify contributors of the RC
> and nudge them to help out. In the issue description there is a reference
> to the guidelines on how to test the RC and a note that people are
> encouraged to vote on the mailing list (which could admittedly be more
> highlighted because I did not pay attention to it until now and was 
> unaware
> that contributors had a vote).
>
> Might it be worth considering something similar here to increase
> participation?
>
> On Tue, Oct 17, 2023 at 7:01 PM Jack McCluskey via dev <
> dev@beam.apache.org> wrote:
>
>> I'm +1 on helping explain what we mean by "validate the RC" since
>> we're really just asking users 

Re: [Discuss] Idea to increase RC voting participation

2023-10-23 Thread Johanna Öjeling via dev
Yes, that's a good point - we should also include those who created the issue.

On Mon, Oct 23, 2023, 19:18 Robert Bradshaw via dev 
wrote:

> On Mon, Oct 23, 2023 at 7:26 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> So to summarize, I think there's broad consensus (or at least lazy
>> consensus) around the following:
>>
>> - (1) Updating our release email/guidelines to be more specific about
>> what we mean by release validation/how to be helpful during this process.
>> This includes both encouraging validation within each user's own code base
>> and encouraging people to document/share their process of validation and
>> link it in the release spreadsheet.
>> - (2) Doing something like what Airflow does (#29424
>> ) and creating an issue
>> asking people who have contributed to the current release to help validate
>> their changes.
>>
>> I'm also +1 on doing both of these. The first bit (updating our
>> guidelines) is relatively easy - it should just require updating
>> https://github.com/apache/beam/blob/master/contributor-docs/release-guide.md#vote-and-validate-the-release-candidate
>> .
>>
>> I took a look at the second piece (copying what Airflow does) to see if
>> we could just copy their automation, but it looks like it's tied to
>> airflow breeze
>> 
>> (their repo-specific automation tooling), so we'd probably need to build
>> the automation ourselves. It shouldn't be terrible: basically, we'd want a
>> GitHub Action that compares the current release tag with the last release
>> tag, grabs all the commits in between, parses them to get the authors, and
>> creates an issue with that data, but it does represent more effort than
>> just updating a markdown file. There might even be an existing Action that
>> can help with this; I haven't looked too hard.
>>
>
> I was thinking along the lines of a script that would scrape the issues
> resolved in a given release and add a comment to them noting that the
> change is in release N and explaining (with clear instructions) how it
> can be validated. Creating a "validate this release" issue with all
> "contributing" participants could be an interesting way to do this as well.
> (I think it'd be valuable to get those who filed the issue, not just those
> who fixed it, to validate.)
>
>
>> As our next release manager, I'm happy to review PRs for either of these
>> if anyone wants to volunteer to help out. If not, I'm happy to update the
>> guidelines, but I probably won't have time to add the commit inspection
>> tooling (I'm planning on throwing any extra time towards continuing to
>> automate release candidate creation, which is currently a more impactful
>> problem IMO). I would very much like it if both of these things happened
>> though :)
>>
>> Thanks,
>> Danny
>>
>> On Mon, Oct 23, 2023 at 10:05 AM XQ Hu  wrote:
>>
>>> +1. This is a great idea to try. @Danny McCormick
>>>  FYI as our next release manager.
>>>
>>> On Wed, Oct 18, 2023 at 2:30 PM Johanna Öjeling via dev <
>>> dev@beam.apache.org> wrote:
>>>
 When I have contributed to Apache Airflow, they have tagged all
 contributors concerned in a GitHub issue when the RC is available and asked
 us to validate it. Example: #29424
 .

 I found that to be an effective way to notify contributors of the RC
 and nudge them to help out. In the issue description there is a reference
 to the guidelines on how to test the RC and a note that people are
 encouraged to vote on the mailing list (which could admittedly be more
 highlighted because I did not pay attention to it until now and was unaware
 that contributors had a vote).

 Might it be worth considering something similar here to increase
 participation?

 On Tue, Oct 17, 2023 at 7:01 PM Jack McCluskey via dev <
 dev@beam.apache.org> wrote:

> I'm +1 on helping explain what we mean by "validate the RC" since
> we're really just asking users to see if their existing use cases work
> along with our typical slate of tests. I don't know if offloading that
> work to our active validators is the right approach, though;
> documentation or a screen share of their specific workflow is definitely
> less useful than having a more general outline of how to install the RC
> and things to look out for when testing.
>
> On Tue, Oct 17, 2023 at 12:55 PM Austin Bennett 
> wrote:
>
>> Great effort.  I'm also interested in streamlining releases -- so if
>> there are a lot of manual tests that could be automated, it would be great
>> to discover and then look to address them.
>>
>> On Tue, Oct 17, 2023 at 8:47 AM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1
>>>
>>> I would also 

Re: [Discuss] Idea to increase RC voting participation

2023-10-23 Thread Robert Bradshaw via dev
On Mon, Oct 23, 2023 at 7:26 AM Danny McCormick via dev 
wrote:

> So to summarize, I think there's broad consensus (or at least lazy
> consensus) around the following:
>
> - (1) Updating our release email/guidelines to be more specific about what
> we mean by release validation/how to be helpful during this process. This
> includes both encouraging validation within each user's own code base and
> encouraging people to document/share their process of validation and link
> it in the release spreadsheet.
> - (2) Doing something like what Airflow does (#29424
> ) and creating an issue
> asking people who have contributed to the current release to help validate
> their changes.
>
> I'm also +1 on doing both of these. The first bit (updating our
> guidelines) is relatively easy - it should just require updating
> https://github.com/apache/beam/blob/master/contributor-docs/release-guide.md#vote-and-validate-the-release-candidate
> .
>
> I took a look at the second piece (copying what Airflow does) to see if we
> could just copy their automation, but it looks like it's tied to airflow
> breeze
> 
> (their repo-specific automation tooling), so we'd probably need to build
> the automation ourselves. It shouldn't be terrible: basically, we'd want a
> GitHub Action that compares the current release tag with the last release
> tag, grabs all the commits in between, parses them to get the authors, and
> creates an issue with that data, but it does represent more effort than
> just updating a markdown file. There might even be an existing Action that
> can help with this; I haven't looked too hard.
>

I was thinking along the lines of a script that would scrape the issues
resolved in a given release and add a comment to them noting that the
change is in release N and explaining (with clear instructions) how it
can be validated. Creating a "validate this release" issue with all
"contributing" participants could be an interesting way to do this as well.
(I think it'd be valuable to get those who filed the issue, not just those
who fixed it, to validate.)
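
A minimal sketch of what such a script could look like against the GitHub
REST API (the milestone number, comment text, and token handling here are
illustrative assumptions, not an existing Beam tool):

import os
import requests

API = "https://api.github.com/repos/apache/beam"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def comment_on_release_issues(milestone_number, release):
    """Comment on every closed issue in the release's milestone."""
    page = 1
    while True:
        resp = requests.get(
            f"{API}/issues", headers=HEADERS,
            params={"milestone": milestone_number, "state": "closed",
                    "per_page": 100, "page": page})
        resp.raise_for_status()
        issues = resp.json()
        if not issues:
            break
        for issue in issues:
            if "pull_request" in issue:
                continue  # the /issues endpoint also returns PRs; skip them
            requests.post(
                f"{API}/issues/{issue['number']}/comments", headers=HEADERS,
                json={"body": f"This change is included in release {release}, "
                              "now under vote. Please help validate it, "
                              "following the release guide."},
            ).raise_for_status()
        page += 1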


> As our next release manager, I'm happy to review PRs for either of these
> if anyone wants to volunteer to help out. If not, I'm happy to update the
> guidelines, but I probably won't have time to add the commit inspection
> tooling (I'm planning on throwing any extra time towards continuing to
> automate release candidate creation, which is currently a more impactful
> problem IMO). I would very much like it if both of these things happened
> though :)
>
> Thanks,
> Danny
>
> On Mon, Oct 23, 2023 at 10:05 AM XQ Hu  wrote:
>
>> +1. This is a great idea to try. @Danny McCormick
>>  FYI as our next release manager.
>>
>> On Wed, Oct 18, 2023 at 2:30 PM Johanna Öjeling via dev <
>> dev@beam.apache.org> wrote:
>>
>>> When I have contributed to Apache Airflow, they have tagged all
>>> contributors concerned in a GitHub issue when the RC is available and asked
>>> us to validate it. Example: #29424
>>> .
>>>
>>> I found that to be an effective way to notify contributors of the RC and
>>> nudge them to help out. In the issue description there is a reference to
>>> the guidelines on how to test the RC and a note that people are encouraged
>>> to vote on the mailing list (which could admittedly be more highlighted
>>> because I did not pay attention to it until now and was unaware that
>>> contributors had a vote).
>>>
>>> Might it be worth considering something similar here to increase
>>> participation?
>>>
>>> On Tue, Oct 17, 2023 at 7:01 PM Jack McCluskey via dev <
>>> dev@beam.apache.org> wrote:
>>>
 I'm +1 on helping explain what we mean by "validate the RC" since we're
 really just asking users to see if their existing use cases work along with
 our typical slate of tests. I don't know if offloading that work to our
 active validators is the right approach, though; documentation or a screen
 share of their specific workflow is definitely less useful than having a
 more general outline of how to install the RC and things to look out for
 when testing.

 On Tue, Oct 17, 2023 at 12:55 PM Austin Bennett 
 wrote:

> Great effort.  I'm also interested in streamlining releases -- so if
> there are a lot of manual tests that could be automated, it would be great
> to discover and then look to address them.
>
> On Tue, Oct 17, 2023 at 8:47 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> +1
>>
>> I would also strongly suggest that people try out the release against
>> their own codebases. This has the benefit of ensuring the release won't
>> break your own code when it goes out, and stress-tests the new code against
>> real-world pipelines. (Ideally our own tests are 

Re: [Discuss] Idea to increase RC voting participation

2023-10-23 Thread Svetak Sundhar via dev
Thanks for summarizing.

For (1), I will send out this Google doc

around the time of the next release where we can crowdsource ways to test
the RC. I think it'd be valuable for a guide like this to be organized from
a Beam user's perspective, with headers such as "If your workflow utilizes
the Python SDK, change X to test with the newest RC".

As Danny mentioned, we can update our release guidelines after we have the
info, but I think it could also make for a nice blog post to get more
traction :)

As for (2), I likely won't have time to commit to it by the next release
either. Happy to take a look after the next release, though.

Thanks,



Svetak Sundhar

  Data Engineer
svetaksund...@google.com



On Mon, Oct 23, 2023 at 10:26 AM Danny McCormick via dev <
dev@beam.apache.org> wrote:

> So to summarize, I think there's broad consensus (or at least lazy
> consensus) around the following:
>
> - (1) Updating our release email/guidelines to be more specific about what
> we mean by release validation/how to be helpful during this process. This
> includes both encouraging validation within each user's own code base and
> encouraging people to document/share their process of validation and link
> it in the release spreadsheet.
> - (2) Doing something like what Airflow does (#29424
> ) and creating an issue
> asking people who have contributed to the current release to help validate
> their changes.
>
> I'm also +1 on doing both of these. The first bit (updating our
> guidelines) is relatively easy - it should just require updating
> https://github.com/apache/beam/blob/master/contributor-docs/release-guide.md#vote-and-validate-the-release-candidate
> .
>
> I took a look at the second piece (copying what Airflow does) to see if we
> could just copy their automation, but it looks like it's tied to airflow
> breeze
> 
> (their repo-specific automation tooling), so we'd probably need to build
> the automation ourselves. It shouldn't be terrible: basically, we'd want a
> GitHub Action that compares the current release tag with the last release
> tag, grabs all the commits in between, parses them to get the authors, and
> creates an issue with that data, but it does represent more effort than
> just updating a markdown file. There might even be an existing Action that
> can help with this; I haven't looked too hard.
>
> As our next release manager, I'm happy to review PRs for either of these
> if anyone wants to volunteer to help out. If not, I'm happy to update the
> guidelines, but I probably won't have time to add the commit inspection
> tooling (I'm planning on throwing any extra time towards continuing to
> automate release candidate creation, which is currently a more impactful
> problem IMO). I would very much like it if both of these things happened
> though :)
>
> Thanks,
> Danny
>
> On Mon, Oct 23, 2023 at 10:05 AM XQ Hu  wrote:
>
>> +1. This is a great idea to try. @Danny McCormick
>>  FYI as our next release manager.
>>
>> On Wed, Oct 18, 2023 at 2:30 PM Johanna Öjeling via dev <
>> dev@beam.apache.org> wrote:
>>
>>> When I have contributed to Apache Airflow, they have tagged all
>>> contributors concerned in a GitHub issue when the RC is available and asked
>>> us to validate it. Example: #29424
>>> .
>>>
>>> I found that to be an effective way to notify contributors of the RC and
>>> nudge them to help out. In the issue description there is a reference to
>>> the guidelines on how to test the RC and a note that people are encouraged
>>> to vote on the mailing list (which could admittedly be more highlighted
>>> because I did not pay attention to it until now and was unaware that
>>> contributors had a vote).
>>>
>>> Might it be worth considering something similar here to increase
>>> participation?
>>>
>>> On Tue, Oct 17, 2023 at 7:01 PM Jack McCluskey via dev <
>>> dev@beam.apache.org> wrote:
>>>
 I'm +1 on helping explain what we mean by "validate the RC" since we're
 really just asking users to see if their existing use cases work along with
 our typical slate of tests. I don't know if offloading that work to our
 active validators is the right approach, though; documentation or a screen
 share of their specific workflow is definitely less useful than having a
 more general outline of how to install the RC and things to look out for
 when testing.

 On Tue, Oct 17, 2023 at 12:55 PM Austin Bennett 
 wrote:

> Great effort.  I'm also interested in streamlining releases -- so if
> there are a lot of manual tests that could be automated, it would be great
> to discover and then look to address them.
>
> On Tue, Oct 17, 2023 at 8:47 AM Robert Bradshaw via dev <

Re: [PYTHON] partitioner utilities?

2023-10-23 Thread Joey Tran
PR for top: https://github.com/apache/beam/pull/29106

On Mon, Oct 23, 2023 at 10:11 AM XQ Hu via dev  wrote:

> +1 on this idea. Thanks!
>
> On Thu, Oct 19, 2023 at 3:40 PM Joey Tran 
> wrote:
>
>> Yeah, I already implemented these partitioners for my use case (I just
>> pasted the classnames/docstrings for them) and I used both combiners.Top
>> and combiners.Sample.
>>
>> In fact, before writing these partitioners I had misunderstood those
>> combiners and thought they would partition my PCollections. Not sure if
>> that might be a common pitfall.
>>
>> On Thu, Oct 19, 2023 at 3:32 PM Anand Inguva via dev 
>> wrote:
>>
>>> FYI, there is a Top transform[1] that will fetch the greatest n elements
>>> in the Python SDK. It is not a partitioner, but it may be useful as a
>>> reference.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/68e9c997a9085b0cb045238ae406d534011e7c21/sdks/python/apache_beam/transforms/combiners.py#L191
>>>
>>> On Thu, Oct 19, 2023 at 3:21 PM Joey Tran 
>>> wrote:
>>>
 Yes, both need to be small enough to fit into state.

 Yeah a percentage sampler would also be great, we have a bunch of use
 cases for that ourselves. Not sure if it'd be too clever, but I was
 imagining three public sampling partitioners: FixedSample,
 PercentageSample, and Sample. Sample could automatically choose between
 FixedSample and PercentageSample based on whether a percentage is given or
 a large `n` is given.

 For `PercentageSample`, I was imagining we'd just take a count of the
 number of elements, assign every element a `rand`, and keep the ones whose
 `rand` is less than `n / Count(inputs)` (or the percentage). For runners that
 have fast counting, it should perform quickly. Open to other ideas though.

 Cheers,
 Joey
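
A minimal sketch of that counting approach, assuming the `PercentageSample`
name from this thread (a single random draw per element keeps the two
outputs disjoint):

import random

import apache_beam as beam
from apache_beam.transforms import combiners

class PercentageSample(beam.PTransform):
    """Splits a PCollection into (~n randomly sampled elements, the rest)."""

    def __init__(self, n):
        self._n = n

    def expand(self, pcoll):
        # Count the input once; runners with fast counting make this cheap.
        total = beam.pvalue.AsSingleton(
            pcoll | 'Count' >> combiners.Count.Globally())

        def split(element, num_partitions, total, n):
            # One draw per element, so sample and remaining are disjoint.
            return 0 if random.random() < n / total else 1

        return pcoll | 'Split' >> beam.Partition(
            split, 2, total=total, n=self._n)

# e.g.: sample, remaining = pcoll | PercentageSample(10)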



 On Thu, Oct 19, 2023 at 3:10 PM Danny McCormick via dev <
 dev@beam.apache.org> wrote:

> I'm interested in adding something like this; I could see these being
> generally useful for a number of cases (one that immediately comes to mind
> is partitioning datasets into train/test/validation sets and writing each
> to a different place).
>
> I'm assuming Top (or FixedSample) needs to be small enough to fit into
> state? I would also be interested in being able to do percentages
> (something like partitioners.Sample(percent=10)), though that might be 
> much
> more challenging for an unbounded data set (maybe we could do something as
> simple as a probabilistic target_percentage).
>
> Happy to help review a design doc or PR.
>
> Thanks,
> Danny
>
> On Thu, Oct 19, 2023 at 10:06 AM Joey Tran 
> wrote:
>
>> Hey all,
>>
>> While writing a few pipelines, I was surprised by how few
>> partitioners there were in the Python SDK. I wrote a couple that are
>> pretty generic and possibly generally useful. Just wanted to do a quick
>> poll to see if they seem useful enough to be in the SDK's library of
>> transforms. If so, I can put together a PTransform Design Doc[1] for them.
>> Just wanted to confirm before spending time on the doc.
>>
>> Here are the two that I wrote, I'll just paste the class names and
>> docstrings:
>>
>> class FixedSample(beam.PTransform):
>>     """
>>     A PTransform that takes a PCollection and partitions it into two
>>     PCollections. The first PCollection is a random sample of the input
>>     PCollection, and the second PCollection is the remaining elements of
>>     the input PCollection.
>>
>>     This is useful for creating holdout / test sets in machine learning.
>>
>>     Example usage:
>>
>>     >>> with beam.Pipeline() as p:
>>     ...     sample, remaining = (p
>>     ...         | beam.Create(list(range(10)))
>>     ...         | partitioners.FixedSample(3))
>>     ...     # sample will contain three randomly selected elements
>>     ...     # from the input PCollection
>>     ...     # remaining will contain the remaining seven elements
>>     """
>>
>> class Top(beam.PTransform):
>>     """
>>     A PTransform that takes a PCollection and partitions it into two
>>     PCollections. The first PCollection contains the largest n elements
>>     of the input PCollection, and the second PCollection contains the
>>     remaining elements of the input PCollection.
>>
>>     Parameters:
>>         n: The number of elements to take from the input PCollection.
>>         key: A function that takes an element of the input PCollection
>>             and returns a value to compare for the purpose of determining
>>             the top n elements, similar to Python's built-in sorted
>>             function.
>> 
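
For reference, one way the FixedSample partitioner described above might be
implemented (a sketch; the uuid tagging is an illustrative device to keep
duplicate elements distinct, not part of the proposal):

import uuid

import apache_beam as beam
from apache_beam.transforms import combiners

class FixedSample(beam.PTransform):
    """Splits a PCollection into (a random sample of size n, the rest)."""

    def __init__(self, n):
        self._n = n

    def expand(self, pcoll):
        # Tag each element with a unique id so duplicate values can be
        # told apart when splitting.
        tagged = pcoll | 'Tag' >> beam.Map(lambda x: (uuid.uuid4().hex, x))
        # Reservoir-sample n tagged elements and pass the set of sampled
        # ids around as a singleton side input.
        sampled_ids = (
            tagged
            | 'SampleN' >> combiners.Sample.FixedSizeGlobally(self._n)
            | 'ToIdSet' >> beam.Map(lambda kvs: {k for k, _ in kvs}))
        ids = beam.pvalue.AsSingleton(sampled_ids)
        sample, remaining = (
            tagged
            | 'Split' >> beam.Partition(
                lambda kv, _, ids: 0 if kv[0] in ids else 1, 2, ids=ids))
        return (sample | 'UntagSample' >> beam.Map(lambda kv: kv[1]),
                remaining | 'UntagRest' >> beam.Map(lambda kv: kv[1]))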

Re: [Discuss] Idea to increase RC voting participation

2023-10-23 Thread Danny McCormick via dev
So to summarize, I think there's broad consensus (or at least lazy
consensus) around the following:

- (1) Updating our release email/guidelines to be more specific about what
we mean by release validation/how to be helpful during this process. This
includes both encouraging validation within each user's own code base and
encouraging people to document/share their process of validation and link
it in the release spreadsheet.
- (2) Doing something like what Airflow does (#29424
) and creating an issue
asking people who have contributed to the current release to help validate
their changes.

I'm also +1 on doing both of these. The first bit (updating our guidelines)
is relatively easy - it should just require updating
https://github.com/apache/beam/blob/master/contributor-docs/release-guide.md#vote-and-validate-the-release-candidate
.

I took a look at the second piece (copying what Airflow does) to see if we
could just copy their automation, but it looks like it's tied to airflow
breeze

(their repo-specific automation tooling), so we'd probably need to build
the automation ourselves. It shouldn't be terrible: basically, we'd want a
GitHub Action that compares the current release tag with the last release
tag, grabs all the commits in between, parses them to get the authors, and
creates an issue with that data, but it does represent more effort than
just updating a markdown file. There might even be an existing Action that
can help with this; I haven't looked too hard.
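
To make the shape of this concrete, here is a rough sketch of the core logic
such an Action could run (the tag names, issue title, and token handling are
illustrative assumptions):

import os
import subprocess
import requests

def authors_between(prev_tag, curr_tag):
    # %an = commit author name; prev_tag..curr_tag lists every commit that
    # is in this release but not the previous one.
    out = subprocess.check_output(
        ['git', 'log', f'{prev_tag}..{curr_tag}', '--format=%an'], text=True)
    return sorted(set(out.splitlines()))

def create_validation_issue(prev_tag, curr_tag):
    authors = authors_between(prev_tag, curr_tag)
    body = (f'{curr_tag} is ready for validation. If you contributed to this '
            'release, please help validate your changes:\n'
            + '\n'.join(f'- {a}' for a in authors))
    resp = requests.post(
        'https://api.github.com/repos/apache/beam/issues',
        headers={'Authorization': f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={'title': f'Help validate release {curr_tag}', 'body': body})
    resp.raise_for_status()

# e.g.: create_validation_issue('v2.50.0', 'v2.51.0-RC1')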

As our next release manager, I'm happy to review PRs for either of these if
anyone wants to volunteer to help out. If not, I'm happy to update the
guidelines, but I probably won't have time to add the commit inspection
tooling (I'm planning on throwing any extra time towards continuing to
automate release candidate creation, which is currently a more impactful
problem IMO). I would very much like it if both of these things happened
though :)

Thanks,
Danny

On Mon, Oct 23, 2023 at 10:05 AM XQ Hu  wrote:

> +1. This is a great idea to try. @Danny McCormick
>  FYI as our next release manager.
>
> On Wed, Oct 18, 2023 at 2:30 PM Johanna Öjeling via dev <
> dev@beam.apache.org> wrote:
>
>> When I have contributed to Apache Airflow, they have tagged all
>> contributors concerned in a GitHub issue when the RC is available and asked
>> us to validate it. Example: #29424
>> .
>>
>> I found that to be an effective way to notify contributors of the RC and
>> nudge them to help out. In the issue description there is a reference to
>> the guidelines on how to test the RC and a note that people are encouraged
>> to vote on the mailing list (which could admittedly be more highlighted
>> because I did not pay attention to it until now and was unaware that
>> contributors had a vote).
>>
>> Might it be worth considering something similar here to increase
>> participation?
>>
>> On Tue, Oct 17, 2023 at 7:01 PM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> I'm +1 on helping explain what we mean by "validate the RC" since we're
>>> really just asking users to see if their existing use cases work along with
>>> our typical slate of tests. I don't know if offloading that work to our
>>> active validators is the right approach, though; documentation or a screen
>>> share of their specific workflow is definitely less useful than having a
>>> more general outline of how to install the RC and things to look out for
>>> when testing.
>>>
>>> On Tue, Oct 17, 2023 at 12:55 PM Austin Bennett 
>>> wrote:
>>>
 Great effort.  I'm also interested in streamlining releases -- so if
 there are a lot of manual tests that could be automated, it would be great
 to discover and then look to address them.

 On Tue, Oct 17, 2023 at 8:47 AM Robert Bradshaw via dev <
 dev@beam.apache.org> wrote:

> +1
>
> I would also strongly suggest that people try out the release against
> their own codebases. This has the benefit of ensuring the release won't
 break your own code when it goes out, and stress-tests the new code against
> against
> real-world pipelines. (Ideally our own tests are all passing, and this
> validation is automated as much as possible (though ensuring it matches 
> our
> documentation and works in a clean environment still has value), but
> there's a lot of code and uses out there that we don't have access to
> during normal Beam development.)
>
> On Tue, Oct 17, 2023 at 8:21 AM Svetak Sundhar via dev <
> dev@beam.apache.org> wrote:
>
>> Hi all,
>>
>> I’ve participated in RC testing for a few releases and have observed
>> a bit of a knowledge gap in how releases can be tested. Given that Beam
>> encourages contributors to vote on RC’s regardless of tenure, and 

Re: [PYTHON] partitioner utilities?

2023-10-23 Thread XQ Hu via dev
+1 on this idea. Thanks!

On Thu, Oct 19, 2023 at 3:40 PM Joey Tran  wrote:

> Yeah, I already implemented these partitioners for my use case (I just
> pasted the classnames/docstrings for them) and I used both combiners.Top
> and combiners.Sample.
>
> In fact, before writing these partitioners I had misunderstood those
> combiners and thought they would partition my PCollections. Not sure if
> that might be a common pitfall.
>
> On Thu, Oct 19, 2023 at 3:32 PM Anand Inguva via dev 
> wrote:
>
>> FYI, there is a Top transform[1] that will fetch the greatest n elements
>> in the Python SDK. It is not a partitioner, but it may be useful as a
>> reference.
>>
>> [1]
>> https://github.com/apache/beam/blob/68e9c997a9085b0cb045238ae406d534011e7c21/sdks/python/apache_beam/transforms/combiners.py#L191
>>
>> On Thu, Oct 19, 2023 at 3:21 PM Joey Tran 
>> wrote:
>>
>>> Yes, both need to be small enough to fit into state.
>>>
>>> Yeah a percentage sampler would also be great, we have a bunch of use
>>> cases for that ourselves. Not sure if it'd be too clever, but I was
>>> imagining three public sampling partitioners: FixedSample,
>>> PercentageSample, and Sample. Sample could automatically choose between
>>> FixedSample and PercentageSample based on whether a percentage is given or
>>> a large `n` is given.
>>>
>>> For `PercentageSample`, I was imagining we'd just take a count of the
>>> number of elements, assign every element a `rand`, and keep the ones whose
>>> `rand` is less than `n / Count(inputs)` (or the percentage). For runners that
>>> have fast counting, it should perform quickly. Open to other ideas though.
>>>
>>> Cheers,
>>> Joey
>>>
>>>
>>>
>>> On Thu, Oct 19, 2023 at 3:10 PM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 I'm interested in adding something like this; I could see these being
 generally useful for a number of cases (one that immediately comes to mind
 is partitioning datasets into train/test/validation sets and writing each
 to a different place).

 I'm assuming Top (or FixedSample) needs to be small enough to fit into
 state? I would also be interested in being able to do percentages
 (something like partitioners.Sample(percent=10)), though that might be much
 more challenging for an unbounded data set (maybe we could do something as
 simple as a probabilistic target_percentage).

 Happy to help review a design doc or PR.

 Thanks,
 Danny

 On Thu, Oct 19, 2023 at 10:06 AM Joey Tran 
 wrote:

> Hey all,
>
> While writing a few pipelines, I was surprised by how few partitioners
> there were in the Python SDK. I wrote a couple that are pretty generic and
> possibly generally useful. Just wanted to do a quick poll to see if they
> seem useful enough to be in the SDK's library of transforms. If so, I can
> put together a PTransform Design Doc[1] for them. Just wanted to confirm
> before spending time on the doc.
>
> Here are the two that I wrote, I'll just paste the class names and
> docstrings:
>
> class FixedSample(beam.PTransform):
>     """
>     A PTransform that takes a PCollection and partitions it into two
>     PCollections. The first PCollection is a random sample of the input
>     PCollection, and the second PCollection is the remaining elements of
>     the input PCollection.
>
>     This is useful for creating holdout / test sets in machine learning.
>
>     Example usage:
>
>     >>> with beam.Pipeline() as p:
>     ...     sample, remaining = (p
>     ...         | beam.Create(list(range(10)))
>     ...         | partitioners.FixedSample(3))
>     ...     # sample will contain three randomly selected elements
>     ...     # from the input PCollection
>     ...     # remaining will contain the remaining seven elements
>     """
>
> class Top(beam.PTransform):
>     """
>     A PTransform that takes a PCollection and partitions it into two
>     PCollections. The first PCollection contains the largest n elements
>     of the input PCollection, and the second PCollection contains the
>     remaining elements of the input PCollection.
>
>     Parameters:
>         n: The number of elements to take from the input PCollection.
>         key: A function that takes an element of the input PCollection
>             and returns a value to compare for the purpose of determining
>             the top n elements, similar to Python's built-in sorted
>             function.
>         reverse: If True, the top n elements will be the n smallest
>             elements of the input PCollection.
>
>     Example usage:
>
>     >>> with beam.Pipeline() as p:
>     ...     top, remaining = (p
>     ...         | 

Re: [Discuss] Idea to increase RC voting participation

2023-10-23 Thread XQ Hu via dev
+1. This is a great idea to try. @Danny McCormick
 FYI as our next release manager.

On Wed, Oct 18, 2023 at 2:30 PM Johanna Öjeling via dev 
wrote:

> When I have contributed to Apache Airflow, they have tagged all
> contributors concerned in a GitHub issue when the RC is available and asked
> us to validate it. Example: #29424
> .
>
> I found that to be an effective way to notify contributors of the RC and
> nudge them to help out. In the issue description there is a reference to
> the guidelines on how to test the RC and a note that people are encouraged
> to vote on the mailing list (which could admittedly be more highlighted
> because I did not pay attention to it until now and was unaware that
> contributors had a vote).
>
> Might it be worth considering something similar here to increase
> participation?
>
> On Tue, Oct 17, 2023 at 7:01 PM Jack McCluskey via dev <
> dev@beam.apache.org> wrote:
>
>> I'm +1 on helping explain what we mean by "validate the RC" since we're
>> really just asking users to see if their existing use cases work along with
>> our typical slate of tests. I don't know if offloading that work to our
>> active validators is the right approach, though; documentation or a screen
>> share of their specific workflow is definitely less useful than having a
>> more general outline of how to install the RC and things to look out for
>> when testing.
>>
>> On Tue, Oct 17, 2023 at 12:55 PM Austin Bennett 
>> wrote:
>>
>>> Great effort.  I'm also interested in streamlining releases -- so if
>>> there are a lot of manual tests that could be automated, it would be great
>>> to discover and then look to address them.
>>>
>>> On Tue, Oct 17, 2023 at 8:47 AM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
 +1

 I would also strongly suggest that people try out the release against
 their own codebases. This has the benefit of ensuring the release won't
 break your own code when it goes out, and stress-tests the new code against
 real-world pipelines. (Ideally our own tests are all passing, and this
 validation is automated as much as possible (though ensuring it matches our
 documentation and works in a clean environment still has value), but
 there's a lot of code and uses out there that we don't have access to
 during normal Beam development.)

 On Tue, Oct 17, 2023 at 8:21 AM Svetak Sundhar via dev <
 dev@beam.apache.org> wrote:

> Hi all,
>
> I’ve participated in RC testing for a few releases and have observed a
> bit of a knowledge gap in how releases can be tested. Given that Beam
> encourages contributors to vote on RC’s regardless of tenure, and that
> voting on an RC is a relatively low-effort, high-leverage way to influence
> the release of the library, I propose the following:
>
> During the vote for the next release, voters can document the process
> they followed in a separate document, and add the link in column G
> here
> .
> One step further could be a screencast of running the test, and attaching
> a link to that.
>
> We can keep repeating this through releases until we have
> documentation for many of the different tests. We can then add these docs
> to the repo.
>
> I’m proposing this because I’ve gathered the following feedback from
> colleagues who are tangentially involved with Beam: they are interested in
> participating in release validation, but don’t know how to get started.
> Happy to hear other suggestions too, if there are any to address the
> above.
>
> Thanks,
>
>
> Svetak Sundhar
>
>   Data Engineer
> svetaksund...@google.com
>
>


Re: [YAML] Aggregations

2023-10-23 Thread XQ Hu via dev
+1 on your proposal.

On Fri, Oct 20, 2023 at 4:59 PM Robert Bradshaw via dev 
wrote:

> On Fri, Oct 20, 2023 at 11:35 AM Kenneth Knowles  wrote:
> >
> > A couple other bits on having an expression language:
> >
> >  - You already have Python lambdas in places, right? So that's quite a
> lot more complex than SQL project/aggregate expressions
> >  - It really does save a lot of pain for users (at the cost of
> implementation complexity) when you need to "SUM(col1*col2)" where
> otherwise you have to Map first. This could be viewed as desirable as well,
> of course.
> >
> > Anyhow I'm pretty much in agreement with all your reasoning as to why
> *not* to use SQL-like expressions in strings. But it does seem odd when
> juxtaposed with Python snippets.
>
> Well, we say "here's a Python expression" when we're using a Python
> string. But "SUM(col1*col2)" isn't as transparent. (Agree about the
> niceties of being able to provide an expression rather than a column.)
>
> > On Thu, Oct 19, 2023 at 4:00 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >>
> >> On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax  wrote:
> >> >
> >> > Is the schema Group transform (in Java) something along these lines?
> >>
> >> Yes, for sure it is. It (and Python's and Typescript's equivalent) are
> >> linked in the original post. The open question is how to best express
> >> this in YAML.
> >>
> >> > On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >> >>
> >> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> >> feature for even writing a WordCount is the ability to do
> Aggregations
> >> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> >> CombineValues), we're eschewing KVs in favor of the notion of more schema'd
> >> >> data (which has some precedent in our other languages; see the links
> >> >> below). The key components the user needs to specify are (1) the key
> >> >> fields on which the grouping will take place, (2) the fields
> >> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> >> fn to use.
> >> >>
> >> >> A straw-man example could be something like
> >> >>
> >> >> type: Aggregating
> >> >> config:
> >> >>   key: [field1, field2]
> >> >>   aggregating:
> >> >>     total_cost:
> >> >>       fn: sum
> >> >>       value: cost
> >> >>     max_cost:
> >> >>       fn: max
> >> >>       value: cost
> >> >>
> >> >> This would basically correspond to the SQL expression
> >> >>
> >> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as
> max_cost
> >> >> from table GROUP BY field1, field2"
> >> >>
> >> >> (though I'm not requiring that we use this as an implementation
> >> >> strategy). I do not think we need a separate (non-aggregating)
> >> >> Grouping operation; this can be accomplished by having a concat-style
> >> >> combiner.
> >> >>
> >> >> There are still some open questions here, notably around how to
> >> >> specify the aggregation fns themselves. We could of course provide a
> >> >> number of built-ins (like SQL does). This gets into the question of
> >> >> how and where to document this complete set, but some basics should
> >> >> take us pretty far. Many aggregators, however, are parameterized
> (e.g.
> >> >> quantiles); where do we put the parameters? We could go with
> something
> >> >> like
> >> >>
> >> >> fn:
> >> >>   type: ApproximateQuantiles
> >> >>   config:
> >> >>     n: 10
> >> >>
> >> >> but others are even configured by functions themselves (e.g. LargestN
> >> >> that wants a comparator Fn). Maybe we decide not to support these
> >> >> (yet?)
> >> >>
> >> >> One thing I think we should support, however, is referencing custom
> >> >> CombineFns. We have some precedent for this with our Fns from
> >> >> MapToFields, where we accept things like inline lambdas and external
> >> >> references. Again the topic of how to configure them comes up, as
> >> >> these custom Fns are more likely to be parameterized than Map Fns
> >> >> (though, to be clear, perhaps it'd be good to allow parameterization
> >> >> of MapFns as well). Maybe we allow
> >> >>
> >> >> language: python  # like MapToFields (and here it'd be harder to mix
> >> >> and match per Fn)
> >> >> fn:
> >> >>   type: ???
> >> >>   # should these be nested as config?
> >> >>   name: fully.qualified.name
> >> >>   path: /path/to/defining/file
> >> >>   args: [...]
> >> >>   kwargs: {...}
> >> >>
> >> >> which would invoke the constructor.
> >> >>
> >> >> I'm also open to other ways of naming/structuring these essential
> >> >> parameters if it makes things more clear.
> >> >>
> >> >> - Robert
> >> >>
> >> >>
> >> >> Java:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> >> >> Python:
> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> >> >> Typescript:
> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
> >> 
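
For reference, the kind of parameterized custom CombineFn such a YAML
binding would need to construct might look like this in Python (a sketch;
the class name and parameters are illustrative):

import apache_beam as beam

class LargestNFn(beam.CombineFn):  # illustrative name, not an existing API
    """Keeps the n largest values; n is the kind of constructor parameter
    that a YAML `args:`/`kwargs:` binding would have to pass through."""

    def __init__(self, n):
        self._n = n

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        accumulator.sort(reverse=True)
        return accumulator[:self._n]

    def merge_accumulators(self, accumulators):
        merged = [x for acc in accumulators for x in acc]
        merged.sort(reverse=True)
        return merged[:self._n]

    def extract_output(self, accumulator):
        return accumulator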

Beam High Priority Issue Report (46)

2023-10-23 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29099 [Bug]: FnAPI Java SDK Harness 
doesn't update user counters in OnTimer callback functions
https://github.com/apache/beam/issues/29076 [Failing Test]: Python ARM 
PostCommit failing after #28385
https://github.com/apache/beam/issues/29022 [Failing Test]: Python Github 
actions tests are failing due to update of pip 
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader 
provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28703 [Failing Test]: Building a wheel 
for integration tests sometimes times out
https://github.com/apache/beam/issues/28383 [Failing Test]: 
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing 
"beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: 
apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be 
leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not 
working when using CreateDisposition.CREATE_IF_NEEDED 
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. 
PeriodicImpulse) running in Flink and polling using tracker.defer_remainder 
have checkpoint size growing indefinitely 
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use 
applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with 
inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: 
bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when 
using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26981 [Bug]: Getting an error related to 
SchemaCoder after upgrading to 2.48
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121