[Python SDK] Use pre-released dependencies for Beam python unit testing

2023-04-11 Thread Anand Inguva via dev
Hi all,

For Apache Beam Python, we are considering running unit tests against
pre-released dependencies by using pip's --pre flag, which allows pre-release
versions of packages to be installed.
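
As a rough illustration only (the exact commands and extras below are my
assumptions, not a settled configuration), this would amount to letting pip
resolve pre-release versions while installing the SDK and its test
dependencies:

  # Sketch: --pre lets pip pick pre-release versions of dependencies.
  pip install --pre --upgrade pip setuptools wheel
  pip install --pre -e ".[gcp,test]"   # e.g. from sdks/python; extras illustrative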

We believe that using pre-released dependencies may help us to identify and
resolve bugs more quickly, and to take advantage of new features or bug
fixes that are not yet available in stable releases. However, we also
understand that using pre-released dependencies may introduce new risks and
challenges, including potential code duplication and stability issues.

Before proceeding, we wanted to get your feedback on this approach.

1. Create a new PreCommit test suite and a PostCommit test suite that run the
tests with pre-released dependencies installed.

Pros:

   - The stable and pre-release test suites stay separate, so it will be
   easier to debug failures in the pre-release suite.

Cons:

   - More test infra code to maintain. More tests to monitor.


2. Modify the current PreCommit and PostCommit test suites so that they
install pre-released dependencies.

Pros:

   - Less infra code and fewer tests to monitor.

Cons:

   - Leads to noisy test signals if the pre-release candidate is unstable.

I am in favor of approach 1, since it would ensure that any issues
encountered during pre-release testing do not impact the stable-release test
environment, and vice versa.

If you have experience testing with pre-released dependencies, please let me
know whether you took a different approach; it would be really helpful.

Thanks,
Anand


Re: Jenkins Flakes

2023-04-11 Thread Danny McCormick via dev
This seems to have fixed the problem, please let me know if you see any
further issues.

On Tue, Apr 11, 2023 at 3:51 PM Danny McCormick 
wrote:

> I went ahead and made the limit 40 runs on the following jobs (PR
> ):
>
> beam_PostCommit_Go_VR_Flink
> beam_PostCommit_Java_Nexmark_Flink
> beam_PostCommit_Python_Examples_Flink
> beam_PreCommit_Java_*
> beam_PreCommit_Python_*
> beam_PreCommit_SQL_*
>
> It doesn't quite stick to my proposed 5.0 GB limit, but all of these are >2.5GB.
>
> I'm not sure how long it will take for this to take effect (my guess is it
> will happen lazily as jobs are run).
>
> Thanks,
> Danny
>
> On Tue, Apr 11, 2023 at 11:49 AM Danny McCormick <
> dannymccorm...@google.com> wrote:
>
>> > Regarding the "(and not guaranteed to work)" part, is the resolution
>> that the memory issues may still persist and we restore the normal
>> retention limit (and we look for another fix), or that we never restore
>> back to the normal retention limit?
>>
>> Mostly, I'm just not 100% certain that this is the only source of disk
>> space pressure. I think it should work, but I have no way of testing that
>> hypothesis (other than doing it).
>>
>> > Also, considering the number of flaky tests in general [1], code
>> coverage might not be the pressing issue. Should it be disabled everywhere
>> in favor of more reliable / faster builds? Unless Devs here are willing to
>> commit on taking actions, it doesn’t seem to provide too much value
>> recording these numbers as part of the normal pre commit jobs?
>>
>> I think most flakes are unrelated to this issue, so I don't think
>> removing code coverage is going to solve our problems here. If we need to
>> remove all code coverage to fix the issues we're currently experiencing,
>> then I think that is definitely worth it (at least until we can find a
>> better way to do coverage). But I'm not sure if that will be necessary yet.
>>
>> > Is there a technical reason we can't migrate Java code coverage over
>> to the Codecov tool/Actions like we have with Go and Python?
>>
>> I have no context on this and will defer to others.
>>
>> On Tue, Apr 11, 2023 at 11:27 AM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Is there a technical reason we can't migrate Java code coverage over to
>>> the Codecov tool/Actions like we have with Go and Python?
>>>
>>> On Tue, Apr 11, 2023 at 11:25 AM Moritz Mack  wrote:
>>>
 Yes, sorry Robert for being so unspecific. With everywhere I meant Java
 only, my bad!



 On 11.04.23, 17:17, "Robert Burke"  wrote:




 The coverage issue is only with the Java builds in specific.



 Go and Python have their coverage numbers codecov uploads done in
 GitHub Actions instead.



 On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:

 Thanks so much for looking into this!

 I’m absolutely +1 for removing Jenkins related friction and the
 proposed changes sound legitimate.



 Also, considering the number of flaky tests in general [1], code
 coverage might not be the pressing issue. Should it be disabled everywhere
 in favor of more reliable / faster builds? Unless Devs here are willing to
 commit on taking actions, it doesn’t seem to provide too much value
 recording these numbers as part of the normal pre commit jobs?



 Kind regards,

 Moritz



 [1]
 https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
 



 On 11.04.23, 16:24, "Danny McCormick via dev" 
 wrote:




 *tl;dr - I want to temporarily reduce the number of builds that we
 retain to reduce pressure on Jenkins*



 Hey everyone, over the past few days our Jenkins runs have been
 particularly flaky across the board, with errors like the following showing
 up all over the place [1]:



 java.nio.file.FileSystemException: 
 /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
  No space left on device [2]



 These errors indicate that we're out of space on the Jenkins master
 node. After som

Re: Jenkins Flakes

2023-04-11 Thread Danny McCormick via dev
I went ahead and made the limit 40 runs on the following jobs (PR
):

beam_PostCommit_Go_VR_Flink
beam_PostCommit_Java_Nexmark_Flink
beam_PostCommit_Python_Examples_Flink
beam_PreCommit_Java_*
beam_PreCommit_Python_*
beam_PreCommit_SQL_*

It doesn't quite stick to my proposed 5.0 GB limit, but all of these are >2.5GB.

I'm not sure how long it will take for this to take effect (my guess is it
will happen lazily as jobs are run).

Thanks,
Danny

On Tue, Apr 11, 2023 at 11:49 AM Danny McCormick 
wrote:

> > Regarding the "(and not guaranteed to work)" part, is the resolution
> that the memory issues may still persist and we restore the normal
> retention limit (and we look for another fix), or that we never restore
> back to the normal retention limit?
>
> Mostly, I'm just not 100% certain that this is the only source of disk
> space pressure. I think it should work, but I have no way of testing that
> hypothesis (other than doing it).
>
> > Also, considering the number of flaky tests in general [1], code
> coverage might not be the pressing issue. Should it be disabled everywhere
> in favor of more reliable / faster builds? Unless Devs here are willing to
> commit on taking actions, it doesn’t seem to provide too much value
> recording these numbers as part of the normal pre commit jobs?
>
> I think most flakes are unrelated to this issue, so I don't think removing
> code coverage is going to solve our problems here. If we need to remove all
> code coverage to fix the issues we're currently experiencing, then I think
> that is definitely worth it (at least until we can find a better way to do
> coverage). But I'm not sure if that will be necessary yet.
>
> > Is there a technical reason we can't migrate Java code coverage over to
> the Codecov tool/Actions like we have with Go and Python?
>
> I have no context on this and will defer to others.
>
> On Tue, Apr 11, 2023 at 11:27 AM Jack McCluskey via dev <
> dev@beam.apache.org> wrote:
>
>> Is there a technical reason we can't migrate Java code coverage over to
>> the Codecov tool/Actions like we have with Go and Python?
>>
>> On Tue, Apr 11, 2023 at 11:25 AM Moritz Mack  wrote:
>>
>>> Yes, sorry Robert for being so unspecific. With everywhere I meant Java
>>> only, my bad!
>>>
>>>
>>>
>>> On 11.04.23, 17:17, "Robert Burke"  wrote:
>>>
>>>
>>>
>>>
>>> The coverage issue is only with the Java builds in specific.
>>>
>>>
>>>
>>> Go and Python have their coverage numbers codecov uploads done in GitHub
>>> Actions instead.
>>>
>>>
>>>
>>> On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:
>>>
>>> Thanks so much for looking into this!
>>>
>>> I’m absolutely +1 for removing Jenkins related friction and the proposed
>>> changes sound legitimate.
>>>
>>>
>>>
>>> Also, considering the number of flaky tests in general [1], code
>>> coverage might not be the pressing issue. Should it be disabled everywhere
>>> in favor of more reliable / faster builds? Unless Devs here are willing to
>>> commit on taking actions, it doesn’t seem to provide too much value
>>> recording these numbers as part of the normal pre commit jobs?
>>>
>>>
>>>
>>> Kind regards,
>>>
>>> Moritz
>>>
>>>
>>>
>>> [1]
>>> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
>>> 
>>>
>>>
>>>
>>> On 11.04.23, 16:24, "Danny McCormick via dev" 
>>> wrote:
>>>
>>>
>>>
>>>
>>> *tl;dr - I want to temporarily reduce the number of builds that we
>>> retain to reduce pressure on Jenkins*
>>>
>>>
>>>
>>> Hey everyone, over the past few days our Jenkins runs have been
>>> particularly flaky across the board, with errors like the following showing
>>> up all over the place [1]:
>>>
>>>
>>>
>>> java.nio.file.FileSystemException: 
>>> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>>>  No space left on device [2]
>>>
>>>
>>>
>>> These errors indicate that we're out of space on the Jenkins master
>>> node. After some digging (thanks @Yi Hu  @Ahmet Altay
>>>  and @Bruno Volpato  for
>>> contributing), we've determined that at least one large contributing issue
>>> is that some of our builds are eating up too much space. For example, our
>>> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
>>> is just one exampl

Re: Jenkins Flakes

2023-04-11 Thread Danny McCormick via dev
> Regarding the "(and not guaranteed to work)" part, is the resolution that
the memory issues may still persist and we restore the normal retention
limit (and we look for another fix), or that we never restore back to the
normal retention limit?

Mostly, I'm just not 100% certain that this is the only source of disk
space pressure. I think it should work, but I have no way of testing that
hypothesis (other than doing it).

> Also, considering the number of flaky tests in general [1], code coverage
might not be the pressing issue. Should it be disabled everywhere in favor
of more reliable / faster builds? Unless Devs here are willing to commit on
taking actions, it doesn’t seem to provide too much value recording these
numbers as part of the normal pre commit jobs?

I think most flakes are unrelated to this issue, so I don't think removing
code coverage is going to solve our problems here. If we need to remove all
code coverage to fix the issues we're currently experiencing, then I think
that is definitely worth it (at least until we can find a better way to do
coverage). But I'm not sure if that will be necessary yet.

> Is there a technical reason we can't migrate Java code coverage over to
the Codecov tool/Actions like we have with Go and Python?

I have no context on this and will defer to others.

On Tue, Apr 11, 2023 at 11:27 AM Jack McCluskey via dev 
wrote:

> Is there a technical reason we can't migrate Java code coverage over to
> the Codecov tool/Actions like we have with Go and Python?
>
> On Tue, Apr 11, 2023 at 11:25 AM Moritz Mack  wrote:
>
>> Yes, sorry Robert for being so unspecific. With everywhere I meant Java
>> only, my bad!
>>
>>
>>
>> On 11.04.23, 17:17, "Robert Burke"  wrote:
>>
>>
>>
>>
>> The coverage issue is only with the Java builds in specific.
>>
>>
>>
>> Go and Python have their coverage numbers codecov uploads done in GitHub
>> Actions instead.
>>
>>
>>
>> On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:
>>
>> Thanks so much for looking into this!
>>
>> I’m absolutely +1 for removing Jenkins related friction and the proposed
>> changes sound legitimate.
>>
>>
>>
>> Also, considering the number of flaky tests in general [1], code coverage
>> might not be the pressing issue. Should it be disabled everywhere in favor
>> of more reliable / faster builds? Unless Devs here are willing to commit on
>> taking actions, it doesn’t seem to provide too much value recording these
>> numbers as part of the normal pre commit jobs?
>>
>>
>>
>> Kind regards,
>>
>> Moritz
>>
>>
>>
>> [1]
>> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
>> 
>>
>>
>>
>> On 11.04.23, 16:24, "Danny McCormick via dev" 
>> wrote:
>>
>>
>>
>>
>> *tl;dr - I want to temporarily reduce the number of builds that we retain
>> to reduce pressure on Jenkins*
>>
>>
>>
>> Hey everyone, over the past few days our Jenkins runs have been
>> particularly flaky across the board, with errors like the following showing
>> up all over the place [1]:
>>
>>
>>
>> java.nio.file.FileSystemException: 
>> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>>  No space left on device [2]
>>
>>
>>
>> These errors indicate that we're out of space on the Jenkins master node.
>> After some digging (thanks @Yi Hu  @Ahmet Altay
>>  and @Bruno Volpato  for
>> contributing), we've determined that at least one large contributing issue
>> is that some of our builds are eating up too much space. For example, our
>> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
>> is just one example).
>>
>>
>>
>> @Yi Hu  found one change around code coverage that is
>> likely heavily contributing to the problem and rolled that back [3]. We can
>> continue to find other contributing factors here.
>>
>>
>>
>> In the meantime, to get us back to healthy *I propose that we reduce the
>> number of builds that we are retaining to 40 for all jobs that are using a
>> large amount of storage (>5GB)*. This will hopefully allow us to return
>> Jenkins to a normal functioning state, though it will do so at the cost of
>> a significant amount of build history (right now, for example,
>> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
>> normal retention limit once the underlying problem 

Re: Jenkins Flakes

2023-04-11 Thread Jack McCluskey via dev
Is there a technical reason we can't migrate Java code coverage over to the
Codecov tool/Actions like we have with Go and Python?

On Tue, Apr 11, 2023 at 11:25 AM Moritz Mack  wrote:

> Yes, sorry Robert for being so unspecific. With everywhere I meant Java
> only, my bad!
>
>
>
> On 11.04.23, 17:17, "Robert Burke"  wrote:
>
>
>
>
> The coverage issue is only with the Java builds in specific.
>
>
>
> Go and Python have their coverage numbers codecov uploads done in GitHub
> Actions instead.
>
>
>
> On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:
>
> Thanks so much for looking into this!
>
> I’m absolutely +1 for removing Jenkins related friction and the proposed
> changes sound legitimate.
>
>
>
> Also, considering the number of flaky tests in general [1], code coverage
> might not be the pressing issue. Should it be disabled everywhere in favor
> of more reliable / faster builds? Unless Devs here are willing to commit on
> taking actions, it doesn’t seem to provide too much value recording these
> numbers as part of the normal pre commit jobs?
>
>
>
> Kind regards,
>
> Moritz
>
>
>
> [1]
> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
> 
>
>
>
> On 11.04.23, 16:24, "Danny McCormick via dev"  wrote:
>
>
>
>
> *tl;dr - I want to temporarily reduce the number of builds that we retain
> to reduce pressure on Jenkins*
>
>
>
> Hey everyone, over the past few days our Jenkins runs have been
> particularly flaky across the board, with errors like the following showing
> up all over the place [1]:
>
>
>
> java.nio.file.FileSystemException: 
> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>  No space left on device [2]
>
>
>
> These errors indicate that we're out of space on the Jenkins master node.
> After some digging (thanks @Yi Hu  @Ahmet Altay
>  and @Bruno Volpato  for
> contributing), we've determined that at least one large contributing issue
> is that some of our builds are eating up too much space. For example, our
> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
> is just one example).
>
>
>
> @Yi Hu  found one change around code coverage that is
> likely heavily contributing to the problem and rolled that back [3]. We can
> continue to find other contributing factors here.
>
>
>
> In the meantime, to get us back to healthy *I propose that we reduce the
> number of builds that we are retaining to 40 for all jobs that are using a
> large amount of storage (>5GB)*. This will hopefully allow us to return
> Jenkins to a normal functioning state, though it will do so at the cost of
> a significant amount of build history (right now, for example,
> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
> normal retention limit once the underlying problem is resolved. Given that
> this is irreversible (and not guaranteed to work), I wanted to gather
> feedback before doing this. Personally, I rarely use builds that old, but
> others may feel differently.
>
>
>
> Please let me know if you have any objections or support for this proposal.
>
>
>
> Thanks,
>
> Danny
>
>
>
> [1] Tracking issue: https://github.com/apache/beam/issues/26197
> 
>
> [2] Example run with this error:
> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
> 
>
> [3] Rollback PR: https://github.com/apache/beam/pull/26199
> 
>

Re: Jenkins Flakes

2023-04-11 Thread Moritz Mack
Yes, sorry Robert for being so unspecific. By "everywhere" I meant Java only,
my bad!

On 11.04.23, 17:17, "Robert Burke"  wrote:


The coverage issue is only with the Java builds in specific.

Go and Python have their coverage numbers codecov uploads done in GitHub 
Actions instead.

On Tue, Apr 11, 2023, 8:14 AM Moritz Mack <mm...@talend.com> wrote:
Thanks so much for looking into this!
I’m absolutely +1 for removing Jenkins related friction and the proposed 
changes sound legitimate.

Also, considering the number of flaky tests in general [1], code coverage might 
not be the pressing issue. Should it be disabled everywhere in favor of more 
reliable / faster builds? Unless Devs here are willing to commit on taking 
actions, it doesn’t seem to provide too much value recording these numbers as 
part of the normal pre commit jobs?

Kind regards,
Moritz

[1] 
https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake

On 11.04.23, 16:24, "Danny McCormick via dev" <dev@beam.apache.org> wrote:

tl;dr - I want to temporarily reduce the number of builds that we retain to 
reduce pressure on Jenkins

Hey everyone, over the past few days our Jenkins runs have been particularly 
flaky across the board, with errors like the following showing up all over the 
place [1]:


java.nio.file.FileSystemException: 
/home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
 No space left on device [2]

These errors indicate that we're out of space on the Jenkins master node. After 
some digging (thanks @Yi Hu @Ahmet 
Altay and @Bruno Volpato 
for contributing), we've determined that at least one large contributing issue 
is that some of our builds are eating up too much space. For example, our 
beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this is 
just one example).

@Yi Hu found one change around code coverage that is 
likely heavily contributing to the problem and rolled that back [3]. We can 
continue to find other contributing factors here.

In the meantime, to get us back to healthy I propose that we reduce the number 
of builds that we are retaining to 40 for all jobs that are using a large 
amount of storage (>5GB). This will hopefully allow us to return Jenkins to a 
normal functioning state, though it will do so at the cost of a significant 
amount of build history (right now, for example, beam_PreCommit_Java_Commit is 
at 400 retained builds). We could restore the normal retention limit once the 
underlying problem is resolved. Given that this is irreversible (and not 
guaranteed to work), I wanted to gather feedback before doing this. Personally, 
I rarely use builds that old, but others may feel differently.

Please let me know if you have any objections or support for this proposal.

Thanks,
Danny

[1] Tracking issue: 
https://github.com/apache/beam/issues/26197
[2] Example run with this error: 
https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
[3] Rollback PR: 
https://github.com/apache/beam/pull/26199


Re: Jenkins Flakes

2023-04-11 Thread Robert Burke
The coverage issue is specific to the Java builds.

Go and Python have their coverage numbers uploaded to Codecov via GitHub
Actions instead.
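
(Roughly, and only as a sketch: the commands below are illustrative, not
Beam's actual workflow steps. The coverage profile is produced by the test
run itself and is then handed to the Codecov uploader step in the same
GitHub Actions workflow, so nothing needs to be archived on Jenkins.)

  go test -coverprofile=coverage.out ./...   # produce a Go coverage profile in CI
  # the resulting coverage.out is then uploaded by a Codecov step in the workflow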

On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:

> Thanks so much for looking into this!
>
> I’m absolutely +1 for removing Jenkins related friction and the proposed
> changes sound legitimate.
>
>
>
> Also, considering the number of flaky tests in general [1], code coverage
> might not be the pressing issue. Should it be disabled everywhere in favor
> of more reliable / faster builds? Unless Devs here are willing to commit on
> taking actions, it doesn’t seem to provide too much value recording these
> numbers as part of the normal pre commit jobs?
>
>
>
> Kind regards,
>
> Moritz
>
>
>
> [1]
> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
>
>
>
> On 11.04.23, 16:24, "Danny McCormick via dev"  wrote:
>
>
>
>
> *tl;dr - I want to temporarily reduce the number of builds that we retain
> to reduce pressure on Jenkins*
>
>
>
> Hey everyone, over the past few days our Jenkins runs have been
> particularly flaky across the board, with errors like the following showing
> up all over the place [1]:
>
>
>
> java.nio.file.FileSystemException: 
> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>  No space left on device [2]
>
>
>
> These errors indicate that we're out of space on the Jenkins master node.
> After some digging (thanks @Yi Hu  @Ahmet Altay
>  and @Bruno Volpato  for
> contributing), we've determined that at least one large contributing issue
> is that some of our builds are eating up too much space. For example, our
> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
> is just one example).
>
>
>
> @Yi Hu  found one change around code coverage that is
> likely heavily contributing to the problem and rolled that back [3]. We can
> continue to find other contributing factors here.
>
>
>
> In the meantime, to get us back to healthy *I propose that we reduce the
> number of builds that we are retaining to 40 for all jobs that are using a
> large amount of storage (>5GB)*. This will hopefully allow us to return
> Jenkins to a normal functioning state, though it will do so at the cost of
> a significant amount of build history (right now, for example,
> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
> normal retention limit once the underlying problem is resolved. Given that
> this is irreversible (and not guaranteed to work), I wanted to gather
> feedback before doing this. Personally, I rarely use builds that old, but
> others may feel differently.
>
>
>
> Please let me know if you have any objections or support for this proposal.
>
>
>
> Thanks,
>
> Danny
>
>
>
> [1] Tracking issue: https://github.com/apache/beam/issues/26197
> 
>
> [2] Example run with this error:
> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
> 
>
> [3] Rollback PR: https://github.com/apache/beam/pull/26199
> 
>


Re: Jenkins Flakes

2023-04-11 Thread Moritz Mack
Thanks so much for looking into this!
I’m absolutely +1 for removing Jenkins related friction and the proposed 
changes sound legitimate.

Also, considering the number of flaky tests in general [1], code coverage might 
not be the pressing issue. Should it be disabled everywhere in favor of more 
reliable / faster builds? Unless devs here are willing to commit to taking 
action, does recording these numbers as part of the normal pre-commit jobs 
really provide much value?

Kind regards,
Moritz

[1] https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake

On 11.04.23, 16:24, "Danny McCormick via dev"  wrote:


tl;dr - I want to temporarily reduce the number of builds that we retain to 
reduce pressure on Jenkins

Hey everyone, over the past few days our Jenkins runs have been particularly 
flaky across the board, with errors like the following showing up all over the 
place [1]:


java.nio.file.FileSystemException: 
/home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
 No space left on device [2]

These errors indicate that we're out of space on the Jenkins master node. After 
some digging (thanks @Yi Hu @Ahmet 
Altay and @Bruno Volpato 
for contributing), we've determined that at least one large contributing issue 
is that some of our builds are eating up too much space. For example, our 
beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this is 
just one example).

@Yi Hu found one change around code coverage that is 
likely heavily contributing to the problem and rolled that back [3]. We can 
continue to find other contributing factors here.

In the meantime, to get us back to healthy I propose that we reduce the number 
of builds that we are retaining to 40 for all jobs that are using a large 
amount of storage (>5GB). This will hopefully allow us to return Jenkins to a 
normal functioning state, though it will do so at the cost of a significant 
amount of build history (right now, for example, beam_PreCommit_Java_Commit is 
at 400 retained builds). We could restore the normal retention limit once the 
underlying problem is resolved. Given that this is irreversible (and not 
guaranteed to work), I wanted to gather feedback before doing this. Personally, 
I rarely use builds that old, but others may feel differently.

Please let me know if you have any objections or support for this proposal.

Thanks,
Danny

[1] Tracking issue: 
https://github.com/apache/beam/issues/26197
[2] Example run with this error: 
https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
[3] Rollback PR: 
https://github.com/apache/beam/pull/26199



Re: Jenkins Flakes

2023-04-11 Thread Robert Burke
+1

SGTM

Remember, if an issue is being investigated, a committer can always mark a
build to be retained longer in the Jenkins UI. Just be sure to clean it up
once it's resolved though.

(TBH there may also be some old retained builds like that, but I doubt
there's a good way to see which are still relevant.)

On Tue, Apr 11, 2023, 8:03 AM Yi Hu via dev  wrote:

> +1 Thanks Danny for figuring out a solution.
>
> Best,
> Yi
>
> On Tue, Apr 11, 2023 at 10:56 AM Svetak Sundhar via dev <
> dev@beam.apache.org> wrote:
>
>> +1 to the proposal.
>>
>> Regarding the "(and not guaranteed to work)" part, is the resolution that
>> the memory issues may still persist and we restore the normal retention
>> limit (and we look for another fix), or that we never restore back to the
>> normal retention limit?
>>
>>
>> Svetak Sundhar
>>
>>   Technical Solutions Engineer, Data
>> svetaksund...@google.com
>>
>>
>>
>> On Tue, Apr 11, 2023 at 10:34 AM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 for getting Jenkins back into a happier state, getting release
>>> blockers resolved ahead of building an RC has been severely hindered by
>>> Jenkins not picking up tests or running them properly.
>>>
>>> On Tue, Apr 11, 2023 at 10:24 AM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 *tl;dr - I want to temporarily reduce the number of builds that we
 retain to reduce pressure on Jenkins*

 Hey everyone, over the past few days our Jenkins runs have been
 particularly flaky across the board, with errors like the following showing
 up all over the place [1]:

 java.nio.file.FileSystemException: 
 /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
  No space left on device [2]


 These errors indicate that we're out of space on the Jenkins master
 node. After some digging (thanks @Yi Hu  @Ahmet Altay
  and @Bruno Volpato  for
 contributing), we've determined that at least one large contributing issue
 is that some of our builds are eating up too much space. For example, our
 beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
 is just one example).

 @Yi Hu  found one change around code coverage that
 is likely heavily contributing to the problem and rolled that back [3]. We
 can continue to find other contributing factors here.

 In the meantime, to get us back to healthy *I propose that we reduce
 the number of builds that we are retaining to 40 for all jobs that are
 using a large amount of storage (>5GB)*. This will hopefully allow us
 to return Jenkins to a normal functioning state, though it will do so at
 the cost of a significant amount of build history (right now, for example,
 beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
 normal retention limit once the underlying problem is resolved. Given that
 this is irreversible (and not guaranteed to work), I wanted to gather
 feedback before doing this. Personally, I rarely use builds that old, but
 others may feel differently.

 Please let me know if you have any objections or support for this
 proposal.

 Thanks,
 Danny

 [1] Tracking issue: https://github.com/apache/beam/issues/26197
 [2] Example run with this error:
 https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
 [3] Rollback PR: https://github.com/apache/beam/pull/26199

>>>


Re: Jenkins Flakes

2023-04-11 Thread Yi Hu via dev
+1 Thanks Danny for figuring out a solution.

Best,
Yi

On Tue, Apr 11, 2023 at 10:56 AM Svetak Sundhar via dev 
wrote:

> +1 to the proposal.
>
> Regarding the "(and not guaranteed to work)" part, is the resolution that
> the memory issues may still persist and we restore the normal retention
> limit (and we look for another fix), or that we never restore back to the
> normal retention limit?
>
>
> Svetak Sundhar
>
>   Technical Solutions Engineer, Data
> svetaksund...@google.com
>
>
>
> On Tue, Apr 11, 2023 at 10:34 AM Jack McCluskey via dev <
> dev@beam.apache.org> wrote:
>
>> +1 for getting Jenkins back into a happier state, getting release
>> blockers resolved ahead of building an RC has been severely hindered by
>> Jenkins not picking up tests or running them properly.
>>
>> On Tue, Apr 11, 2023 at 10:24 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> *tl;dr - I want to temporarily reduce the number of builds that we
>>> retain to reduce pressure on Jenkins*
>>>
>>> Hey everyone, over the past few days our Jenkins runs have been
>>> particularly flaky across the board, with errors like the following showing
>>> up all over the place [1]:
>>>
>>> java.nio.file.FileSystemException: 
>>> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>>>  No space left on device [2]
>>>
>>>
>>> These errors indicate that we're out of space on the Jenkins master
>>> node. After some digging (thanks @Yi Hu  @Ahmet Altay
>>>  and @Bruno Volpato  for
>>> contributing), we've determined that at least one large contributing issue
>>> is that some of our builds are eating up too much space. For example, our
>>> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
>>> is just one example).
>>>
>>> @Yi Hu  found one change around code coverage that is
>>> likely heavily contributing to the problem and rolled that back [3]. We can
>>> continue to find other contributing factors here.
>>>
>>> In the meantime, to get us back to healthy *I propose that we reduce
>>> the number of builds that we are retaining to 40 for all jobs that are
>>> using a large amount of storage (>5GB)*. This will hopefully allow us
>>> to return Jenkins to a normal functioning state, though it will do so at
>>> the cost of a significant amount of build history (right now, for example,
>>> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
>>> normal retention limit once the underlying problem is resolved. Given that
>>> this is irreversible (and not guaranteed to work), I wanted to gather
>>> feedback before doing this. Personally, I rarely use builds that old, but
>>> others may feel differently.
>>>
>>> Please let me know if you have any objections or support for this
>>> proposal.
>>>
>>> Thanks,
>>> Danny
>>>
>>> [1] Tracking issue: https://github.com/apache/beam/issues/26197
>>> [2] Example run with this error:
>>> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
>>> [3] Rollback PR: https://github.com/apache/beam/pull/26199
>>>
>>


Re: Jenkins Flakes

2023-04-11 Thread Svetak Sundhar via dev
+1 to the proposal.

Regarding the "(and not guaranteed to work)" part, is the resolution that
the memory issues may still persist and we restore the normal retention
limit (and we look for another fix), or that we never restore back to the
normal retention limit?


Svetak Sundhar

  Technical Solutions Engineer, Data
svetaksund...@google.com



On Tue, Apr 11, 2023 at 10:34 AM Jack McCluskey via dev 
wrote:

> +1 for getting Jenkins back into a happier state, getting release blockers
> resolved ahead of building an RC has been severely hindered by Jenkins not
> picking up tests or running them properly.
>
> On Tue, Apr 11, 2023 at 10:24 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> *tl;dr - I want to temporarily reduce the number of builds that we retain
>> to reduce pressure on Jenkins*
>>
>> Hey everyone, over the past few days our Jenkins runs have been
>> particularly flaky across the board, with errors like the following showing
>> up all over the place [1]:
>>
>> java.nio.file.FileSystemException: 
>> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>>  No space left on device [2]
>>
>>
>> These errors indicate that we're out of space on the Jenkins master node.
>> After some digging (thanks @Yi Hu  @Ahmet Altay
>>  and @Bruno Volpato  for
>> contributing), we've determined that at least one large contributing issue
>> is that some of our builds are eating up too much space. For example, our
>> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
>> is just one example).
>>
>> @Yi Hu  found one change around code coverage that is
>> likely heavily contributing to the problem and rolled that back [3]. We can
>> continue to find other contributing factors here.
>>
>> In the meantime, to get us back to healthy *I propose that we reduce the
>> number of builds that we are retaining to 40 for all jobs that are using a
>> large amount of storage (>5GB)*. This will hopefully allow us to return
>> Jenkins to a normal functioning state, though it will do so at the cost of
>> a significant amount of build history (right now, for example,
>> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
>> normal retention limit once the underlying problem is resolved. Given that
>> this is irreversible (and not guaranteed to work), I wanted to gather
>> feedback before doing this. Personally, I rarely use builds that old, but
>> others may feel differently.
>>
>> Please let me know if you have any objections or support for this
>> proposal.
>>
>> Thanks,
>> Danny
>>
>> [1] Tracking issue: https://github.com/apache/beam/issues/26197
>> [2] Example run with this error:
>> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
>> [3] Rollback PR: https://github.com/apache/beam/pull/26199
>>
>


Re: Jenkins Flakes

2023-04-11 Thread Jack McCluskey via dev
+1 for getting Jenkins back into a happier state; getting release blockers
resolved ahead of building an RC has been severely hindered by Jenkins not
picking up tests or running them properly.

On Tue, Apr 11, 2023 at 10:24 AM Danny McCormick via dev <
dev@beam.apache.org> wrote:

> *tl;dr - I want to temporarily reduce the number of builds that we retain
> to reduce pressure on Jenkins*
>
> Hey everyone, over the past few days our Jenkins runs have been
> particularly flaky across the board, with errors like the following showing
> up all over the place [1]:
>
> java.nio.file.FileSystemException: 
> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>  No space left on device [2]
>
>
> These errors indicate that we're out of space on the Jenkins master node.
> After some digging (thanks @Yi Hu  @Ahmet Altay
>  and @Bruno Volpato  for
> contributing), we've determined that at least one large contributing issue
> is that some of our builds are eating up too much space. For example, our
> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
> is just one example).
>
> @Yi Hu  found one change around code coverage that is
> likely heavily contributing to the problem and rolled that back [3]. We can
> continue to find other contributing factors here.
>
> In the meantime, to get us back to healthy *I propose that we reduce the
> number of builds that we are retaining to 40 for all jobs that are using a
> large amount of storage (>5GB)*. This will hopefully allow us to return
> Jenkins to a normal functioning state, though it will do so at the cost of
> a significant amount of build history (right now, for example,
> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
> normal retention limit once the underlying problem is resolved. Given that
> this is irreversible (and not guaranteed to work), I wanted to gather
> feedback before doing this. Personally, I rarely use builds that old, but
> others may feel differently.
>
> Please let me know if you have any objections or support for this proposal.
>
> Thanks,
> Danny
>
> [1] Tracking issue: https://github.com/apache/beam/issues/26197
> [2] Example run with this error:
> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
> [3] Rollback PR: https://github.com/apache/beam/pull/26199
>


Jenkins Flakes

2023-04-11 Thread Danny McCormick via dev
*tl;dr - I want to temporarily reduce the number of builds that we retain
to reduce pressure on Jenkins*

Hey everyone, over the past few days our Jenkins runs have been
particularly flaky across the board, with errors like the following showing
up all over the place [1]:

java.nio.file.FileSystemException:
/home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
No space left on device [2]


These errors indicate that we're out of space on the Jenkins master node.
After some digging (thanks @Yi Hu  @Ahmet Altay
 and @Bruno Volpato  for
contributing), we've determined that at least one large contributing issue
is that some of our builds are eating up too much space. For example, our
beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
is just one example).
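
(For anyone who wants to check other jobs: a per-job disk-usage summary along
these lines is roughly how footprints like that 28GB figure can be spotted.
The path comes from the error above; the exact command is only a sketch.)

  # Sketch: summarize per-job build storage on the Jenkins master, largest last.
  du -sh /home/jenkins/jenkins-home/jobs/* 2>/dev/null | sort -h | tail -n 20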

@Yi Hu  found one change around code coverage that is
likely heavily contributing to the problem and rolled that back [3]. We can
continue to find other contributing factors here.

In the meantime, to get us back to healthy *I propose that we reduce the
number of builds that we are retaining to 40 for all jobs that are using a
large amount of storage (>5GB)*. This will hopefully allow us to return
Jenkins to a normal functioning state, though it will do so at the cost of
a significant amount of build history (right now, for example,
beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
normal retention limit once the underlying problem is resolved. Given that
this is irreversible (and not guaranteed to work), I wanted to gather
feedback before doing this. Personally, I rarely use builds that old, but
others may feel differently.

Please let me know if you have any objections or support for this proposal.

Thanks,
Danny

[1] Tracking issue: https://github.com/apache/beam/issues/26197
[2] Example run with this error:
https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
[3] Rollback PR: https://github.com/apache/beam/pull/26199


Beam High Priority Issue Report (26)

2023-04-11 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/26126 [Failing Test]: 
beam_PostCommit_XVR_Samza permared validatesCrossLanguageRunnerGoUsingJava 
TestDebeziumIO_BasicRead
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21645 
beam_PostCommit_XVR_GoUsingJava_Dataflow fails on some test transforms
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21104 Flaky: 
apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky
https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake with 
grpc.FutureTimeoutError on SDK harness startup
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit 
empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19465 Explore possibilities to lower 
in-use IP address quota footprint.


P1 Issues with no update in the last week:

https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, 
testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId