Re: Jenkins Flakes

2023-04-11 Thread Danny McCormick via dev
This seems to have fixed the problem; please let me know if you see any further issues.

On Tue, Apr 11, 2023 at 3:51 PM Danny McCormick 
wrote:

> I went ahead and made the limit 40 runs on the following jobs (PR):
>
> beam_PostCommit_Go_VR_Flink
> beam_PostCommit_Java_Nexmark_Flink
> beam_PostCommit_Python_Examples_Flink
> beam_PreCommit_Java_*
> beam_PreCommit_Python_*
> beam_PreCommit_SQL_*
>
> It doesn't quite stick to my proposed 5.0 GB limit, but all of these are >2.5GB.
>
> I'm not sure how long it will take for this to take effect (my guess is it
> will happen lazily as jobs are run).
>
> Thanks,
> Danny
>
> On Tue, Apr 11, 2023 at 11:49 AM Danny McCormick <
> dannymccorm...@google.com> wrote:
>
>> > Regarding the "(and not guaranteed to work)" part, is the resolution
>> that the memory issues may still persist and we restore the normal
>> retention limit (and we look for another fix), or that we never restore
>> back to the normal retention limit?
>>
>> Mostly, I'm just not 100% certain that this is the only source of disk
>> space pressure. I think it should work, but I have no way of testing that
>> hypothesis (other than doing it).
>>
>> > Also, considering the number of flaky tests in general [1], code
>> coverage might not be the most pressing issue. Should it be disabled
>> everywhere in favor of more reliable / faster builds? Unless devs here are
>> willing to commit to taking action, is there much value in recording these
>> numbers as part of the normal precommit jobs?
>>
>> I think most flakes are unrelated to this issue, so I don't think
>> removing code coverage is going to solve our problems here. If we need to
>> remove all code coverage to fix the issues we're currently experiencing,
>> then I think that is definitely worth it (at least until we can find a
>> better way to do coverage). But I'm not sure if that will be necessary yet.
>>
>> > Is there a technical reason we can't migrate Java code coverage over
>> to the Codecov tool/Actions like we have with Go and Python?
>>
>> I have no context on this and will defer to others.
>>
>> On Tue, Apr 11, 2023 at 11:27 AM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Is there a technical reason we can't migrate Java code coverage over to
>>> the Codecov tool/Actions like we have with Go and Python?
>>>
>>> On Tue, Apr 11, 2023 at 11:25 AM Moritz Mack  wrote:
>>>
 Yes, sorry Robert for being so unspecific. With everywhere I meant Java
 only, my bad!



 On 11.04.23, 17:17, "Robert Burke"  wrote:



 The coverage issue is specific to the Java builds.



 Go and Python have their codecov coverage uploads done in GitHub Actions
 instead.



 On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:

 Thanks so much for looking into this!

 I’m absolutely +1 for removing Jenkins related friction and the
 proposed changes sound legitimate.



 Also, considering the number of flaky tests in general [1], code coverage
 might not be the most pressing issue. Should it be disabled everywhere in
 favor of more reliable / faster builds? Unless devs here are willing to
 commit to taking action, is there much value in recording these numbers as
 part of the normal precommit jobs?



 Kind regards,

 Moritz



 [1]
 https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
 



 On 11.04.23, 16:24, "Danny McCormick via dev" 
 wrote:



 *tl;dr - I want to temporarily reduce the number of builds that we
 retain to reduce pressure on Jenkins*



 Hey everyone, over the past few days our Jenkins runs have been
 particularly flaky across the board, with errors like the following showing
 up all over the place [1]:



 java.nio.file.FileSystemException: 
 /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
  No space left on device [2]



 These errors indicate that we're out of space on the Jenkins master
 node. After 

Re: Jenkins Flakes

2023-04-11 Thread Danny McCormick via dev
I went ahead and made the limit 40 runs on the following jobs (PR):

beam_PostCommit_Go_VR_Flink
beam_PostCommit_Java_Nexmark_Flink
beam_PostCommit_Python_Examples_Flink
beam_PreCommit_Java_*
beam_PreCommit_Python_*
beam_PreCommit_SQL_*

It doesn't quite stick to my proposed 5.0 GB limit, but all of these are >2.5GB.

I'm not sure how long it will take for this to take effect (my guess is it
will happen lazily as jobs are run).
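
For anyone curious how such a cap is expressed, below is a minimal Jenkins Job
DSL sketch of limiting a job to its 40 most recent builds. It is illustrative
only: the job name is just an example, and the actual PR may well set this
through a shared helper rather than inline per job.

    // Hypothetical Job DSL snippet: keep only the most recent 40 builds,
    // regardless of age, for one disk-heavy job.
    freeStyleJob('beam_PreCommit_Java_Commit') {
        logRotator {
            daysToKeep(-1)   // no age-based expiry
            numToKeep(40)    // cap the number of retained builds
        }
        // ... the rest of the job definition stays unchanged ...
    }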

Thanks,
Danny

On Tue, Apr 11, 2023 at 11:49 AM Danny McCormick 
wrote:

> > Regarding the "(and not guaranteed to work)" part, is the resolution
> that the memory issues may still persist and we restore the normal
> retention limit (and we look for another fix), or that we never restore
> back to the normal retention limit?
>
> Mostly, I'm just not 100% certain that this is the only source of disk
> space pressure. I think it should work, but I have no way of testing that
> hypothesis (other than doing it).
>
> > Also, considering the number of flaky tests in general [1], code
> coverage might not be the most pressing issue. Should it be disabled
> everywhere in favor of more reliable / faster builds? Unless devs here are
> willing to commit to taking action, is there much value in recording these
> numbers as part of the normal precommit jobs?
>
> I think most flakes are unrelated to this issue, so I don't think removing
> code coverage is going to solve our problems here. If we need to remove all
> code coverage to fix the issues we're currently experiencing, then I think
> that is definitely worth it (at least until we can find a better way to do
> coverage). But I'm not sure if that will be necessary yet.
>
> > Is there a technical reason we can't migrate Java code coverage over to
> the Codecov tool/Actions like we have with Go and Python?
>
> I have no context on this and will defer to others.
>
> On Tue, Apr 11, 2023 at 11:27 AM Jack McCluskey via dev <
> dev@beam.apache.org> wrote:
>
>> Is there a technical reason we can't migrate Java code coverage over to
>> the Codecov tool/Actions like we have with Go and Python?
>>
>> On Tue, Apr 11, 2023 at 11:25 AM Moritz Mack  wrote:
>>
>>> Yes, sorry Robert for being so unspecific. With everywhere I meant Java
>>> only, my bad!
>>>
>>>
>>>
>>> On 11.04.23, 17:17, "Robert Burke"  wrote:
>>>
>>>
>>>
>>> The coverage issue is specific to the Java builds.
>>>
>>>
>>>
>>> Go and Python have their codecov coverage uploads done in GitHub
>>> Actions instead.
>>>
>>>
>>>
>>> On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:
>>>
>>> Thanks so much for looking into this!
>>>
>>> I’m absolutely +1 for removing Jenkins related friction and the proposed
>>> changes sound legitimate.
>>>
>>>
>>>
>>> Also, considering the number of flaky tests in general [1], code coverage
>>> might not be the most pressing issue. Should it be disabled everywhere in
>>> favor of more reliable / faster builds? Unless devs here are willing to
>>> commit to taking action, is there much value in recording these numbers as
>>> part of the normal precommit jobs?
>>>
>>>
>>>
>>> Kind regards,
>>>
>>> Moritz
>>>
>>>
>>>
>>> [1]
>>> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
>>> 
>>>
>>>
>>>
>>> On 11.04.23, 16:24, "Danny McCormick via dev" 
>>> wrote:
>>>
>>>
>>>
>>> *tl;dr - I want to temporarily reduce the number of builds that we
>>> retain to reduce pressure on Jenkins*
>>>
>>>
>>>
>>> Hey everyone, over the past few days our Jenkins runs have been
>>> particularly flaky across the board, with errors like the following showing
>>> up all over the place [1]:
>>>
>>>
>>>
>>> java.nio.file.FileSystemException: 
>>> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>>>  No space left on device [2]
>>>
>>>
>>>
>>> These errors indicate that we're out of space on the Jenkins master
>>> node. After some digging (thanks @Yi Hu  @Ahmet Altay
>>>  and @Bruno Volpato  for
>>> contributing), we've determined that at least one large contributing issue
>>> is that some of our builds are eating up too much space. For example, our
>>> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
>>> is just one 

Re: Jenkins Flakes

2023-04-11 Thread Danny McCormick via dev
> Regarding the "(and not guaranteed to work)" part, is the resolution that
the memory issues may still persist and we restore the normal retention
limit (and we look for another fix), or that we never restore back to the
normal retention limit?

Mostly, I'm just not 100% certain that this is the only source of disk
space pressure. I think it should work, but I have no way of testing that
hypothesis (other than doing it).

> Also, considering the number of flaky tests in general [1], code coverage
might not be the most pressing issue. Should it be disabled everywhere in
favor of more reliable / faster builds? Unless devs here are willing to
commit to taking action, is there much value in recording these numbers as
part of the normal precommit jobs?

I think most flakes are unrelated to this issue, so I don't think removing
code coverage is going to solve our problems here. If we need to remove all
code coverage to fix the issues we're currently experiencing, then I think
that is definitely worth it (at least until we can find a better way to do
coverage). But I'm not sure if that will be necessary yet.

> Is there a technical reason we can't migrate Java code coverage over to
the Codecov tool/Actions like we have with Go and Python?

I have no context on this and will defer to others.

On Tue, Apr 11, 2023 at 11:27 AM Jack McCluskey via dev 
wrote:

> Is there a technical reason we can't migrate Java code coverage over to
> the Codecov tool/Actions like we have with Go and Python?
>
> On Tue, Apr 11, 2023 at 11:25 AM Moritz Mack  wrote:
>
>> Yes, sorry Robert for being so unspecific. With everywhere I meant Java
>> only, my bad!
>>
>>
>>
>> On 11.04.23, 17:17, "Robert Burke"  wrote:
>>
>>
>>
>> The coverage issue is specific to the Java builds.
>>
>>
>>
>> Go and Python have their codecov coverage uploads done in GitHub
>> Actions instead.
>>
>>
>>
>> On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:
>>
>> Thanks so much for looking into this!
>>
>> I’m absolutely +1 for removing Jenkins related friction and the proposed
>> changes sound legitimate.
>>
>>
>>
>> Also, considering the number of flaky tests in general [1], code coverage
>> might not be the most pressing issue. Should it be disabled everywhere in
>> favor of more reliable / faster builds? Unless devs here are willing to
>> commit to taking action, is there much value in recording these numbers as
>> part of the normal precommit jobs?
>>
>>
>>
>> Kind regards,
>>
>> Moritz
>>
>>
>>
>> [1]
>> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
>> 
>>
>>
>>
>> On 11.04.23, 16:24, "Danny McCormick via dev" 
>> wrote:
>>
>>
>>
>> *tl;dr - I want to temporarily reduce the number of builds that we retain
>> to reduce pressure on Jenkins*
>>
>>
>>
>> Hey everyone, over the past few days our Jenkins runs have been
>> particularly flaky across the board, with errors like the following showing
>> up all over the place [1]:
>>
>>
>>
>> java.nio.file.FileSystemException: 
>> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>>  No space left on device [2]
>>
>>
>>
>> These errors indicate that we're out of space on the Jenkins master node.
>> After some digging (thanks @Yi Hu  @Ahmet Altay
>>  and @Bruno Volpato  for
>> contributing), we've determined that at least one large contributing issue
>> is that some of our builds are eating up too much space. For example, our
>> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
>> is just one example).
>>
>>
>>
>> @Yi Hu  found one change around code coverage that is
>> likely heavily contributing to the problem and rolled that back [3]. We can
>> continue to find other contributing factors here.
>>
>>
>>
>> In the meantime, to get us back to healthy *I propose that we reduce the
>> number of builds that we are retaining to 40 for all jobs that are using a
>> large amount of storage (>5GB)*. This will hopefully allow us to return
>> Jenkins to a normal functioning state, though it will do so at the cost of
>> a significant amount of build history (right now, for example,
>> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
>> normal retention limit once the underlying problem 

Re: Jenkins Flakes

2023-04-11 Thread Jack McCluskey via dev
Is there a technical reason we can't migrate Java code coverage over to the
Codecov tool/Actions like we have with Go and Python?

On Tue, Apr 11, 2023 at 11:25 AM Moritz Mack  wrote:

> Yes, sorry Robert for being so unspecific. With everywhere I meant Java
> only, my bad!
>
>
>
> On 11.04.23, 17:17, "Robert Burke"  wrote:
>
>
>
> The coverage issue is specific to the Java builds.
>
>
>
> Go and Python have their codecov coverage uploads done in GitHub
> Actions instead.
>
>
>
> On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:
>
> Thanks so much for looking into this!
>
> I’m absolutely +1 for removing Jenkins related friction and the proposed
> changes sound legitimate.
>
>
>
> Also, considering the number of flaky tests in general [1], code coverage
> might not be the most pressing issue. Should it be disabled everywhere in
> favor of more reliable / faster builds? Unless devs here are willing to
> commit to taking action, is there much value in recording these numbers as
> part of the normal precommit jobs?
>
>
>
> Kind regards,
>
> Moritz
>
>
>
> [1]
> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
> 
>
>
>
> On 11.04.23, 16:24, "Danny McCormick via dev"  wrote:
>
>
>
> *tl;dr - I want to temporarily reduce the number of builds that we retain
> to reduce pressure on Jenkins*
>
>
>
> Hey everyone, over the past few days our Jenkins runs have been
> particularly flaky across the board, with errors like the following showing
> up all over the place [1]:
>
>
>
> java.nio.file.FileSystemException: 
> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>  No space left on device [2]
>
>
>
> These errors indicate that we're out of space on the Jenkins master node.
> After some digging (thanks @Yi Hu  @Ahmet Altay
>  and @Bruno Volpato  for
> contributing), we've determined that at least one large contributing issue
> is that some of our builds are eating up too much space. For example, our
> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
> is just one example).
>
>
>
> @Yi Hu  found one change around code coverage that is
> likely heavily contributing to the problem and rolled that back [3]. We can
> continue to find other contributing factors here.
>
>
>
> In the meantime, to get us back to healthy *I propose that we reduce the
> number of builds that we are retaining to 40 for all jobs that are using a
> large amount of storage (>5GB)*. This will hopefully allow us to return
> Jenkins to a normal functioning state, though it will do so at the cost of
> a significant amount of build history (right now, for example,
> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
> normal retention limit once the underlying problem is resolved. Given that
> this is irreversible (and not guaranteed to work), I wanted to gather
> feedback before doing this. Personally, I rarely use builds that old, but
> others may feel differently.
>
>
>
> Please let me know if you have any objections or support for this proposal.
>
>
>
> Thanks,
>
> Danny
>
>
>
> [1] Tracking issue: https://github.com/apache/beam/issues/26197
> 
>
> [2] Example run with this error:
> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
> 
>
> [3] Rollback PR: https://github.com/apache/beam/pull/26199
> 
>

Re: Jenkins Flakes

2023-04-11 Thread Moritz Mack
Yes, sorry Robert for being so unspecific. With everywhere I meant Java only, 
my bad!

On 11.04.23, 17:17, "Robert Burke"  wrote:

The coverage issue is specific to the Java builds.

Go and Python have their codecov coverage uploads done in GitHub Actions
instead.

On Tue, Apr 11, 2023, 8:14 AM Moritz Mack <mm...@talend.com> wrote:
Thanks so much for looking into this!
I’m absolutely +1 for removing Jenkins related friction and the proposed 
changes sound legitimate.

Also, considering the number of flaky tests in general [1], code coverage might
not be the most pressing issue. Should it be disabled everywhere in favor of
more reliable / faster builds? Unless devs here are willing to commit to taking
action, is there much value in recording these numbers as part of the normal
precommit jobs?

Kind regards,
Moritz

[1] 
https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake

On 11.04.23, 16:24, "Danny McCormick via dev" <dev@beam.apache.org> wrote:

tl;dr - I want to temporarily reduce the number of builds that we retain to 
reduce pressure on Jenkins

Hey everyone, over the past few days our Jenkins runs have been particularly 
flaky across the board, with errors like the following showing up all over the 
place [1]:


java.nio.file.FileSystemException: 
/home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
 No space left on device [2]

These errors indicate that we're out of space on the Jenkins master node. After 
some digging (thanks @Yi Hu @Ahmet 
Altay and @Bruno Volpato 
for contributing), we've determined that at least one large contributing issue 
is that some of our builds are eating up too much space. For example, our 
beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this is 
just one example).
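
(As an aside, numbers like that 28GB figure can be reproduced with a quick
Jenkins Script Console check. The snippet below is a hypothetical diagnostic,
not something from the Beam repo: it walks every job's build directory on the
controller and prints the largest ones.)

    import jenkins.model.Jenkins

    // Recursively sum the size of a file tree (can be slow on a big JENKINS_HOME).
    def sizeOf
    sizeOf = { File f ->
        if (f.isDirectory()) {
            return (f.listFiles() ?: []).collect { sizeOf(it) }.sum() ?: 0L
        }
        return f.length()
    }

    // Print the 20 jobs whose retained builds use the most disk.
    Jenkins.instance.getAllItems(hudson.model.Job).collect { job ->
        [name: job.fullName, bytes: sizeOf(job.buildDir)]
    }.sort { -it.bytes }.take(20).each {
        println String.format('%8.2f GB  %s', it.bytes / (1024 * 1024 * 1024), it.name)
    }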

@Yi Hu found one change around code coverage that is 
likely heavily contributing to the problem and rolled that back [3]. We can 
continue to find other contributing factors here.

In the meantime, to get us back to healthy I propose that we reduce the number 
of builds that we are retaining to 40 for all jobs that are using a large 
amount of storage (>5GB). This will hopefully allow us to return Jenkins to a 
normal functioning state, though it will do so at the cost of a significant 
amount of build history (right now, for example, beam_PreCommit_Java_Commit is 
at 400 retained builds). We could restore the normal retention limit once the 
underlying problem is resolved. Given that this is irreversible (and not 
guaranteed to work), I wanted to gather feedback before doing this. Personally, 
I rarely use builds that old, but others may feel differently.

Please let me know if you have any objections or support for this proposal.

Thanks,
Danny

[1] Tracking issue: 
https://github.com/apache/beam/issues/26197
[2] Example run with this error: 
https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
[3] Rollback PR: 
https://github.com/apache/beam/pull/26199


Re: Jenkins Flakes

2023-04-11 Thread Robert Burke
The coverage issue is specific to the Java builds.

Go and Python have their codecov coverage uploads done in GitHub
Actions instead.

On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:

> Thanks so much for looking into this!
>
> I’m absolutely +1 for removing Jenkins related friction and the proposed
> changes sound legitimate.
>
>
>
> Also, considering the number of flaky tests in general [1], code coverage
> might not be the most pressing issue. Should it be disabled everywhere in
> favor of more reliable / faster builds? Unless devs here are willing to
> commit to taking action, is there much value in recording these numbers as
> part of the normal precommit jobs?
>
>
>
> Kind regards,
>
> Moritz
>
>
>
> [1]
> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
>
>
>
> On 11.04.23, 16:24, "Danny McCormick via dev"  wrote:
>
>
>
> *tl;dr - I want to temporarily reduce the number of builds that we retain
> to reduce pressure on Jenkins*
>
>
>
> Hey everyone, over the past few days our Jenkins runs have been
> particularly flaky across the board, with errors like the following showing
> up all over the place [1]:
>
>
>
> java.nio.file.FileSystemException: 
> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>  No space left on device [2]
>
>
>
> These errors indicate that we're out of space on the Jenkins master node.
> After some digging (thanks @Yi Hu  @Ahmet Altay
>  and @Bruno Volpato  for
> contributing), we've determined that at least one large contributing issue
> is that some of our builds are eating up too much space. For example, our
> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
> is just one example).
>
>
>
> @Yi Hu  found one change around code coverage that is
> likely heavily contributing to the problem and rolled that back [3]. We can
> continue to find other contributing factors here.
>
>
>
> In the meantime, to get us back to healthy *I propose that we reduce the
> number of builds that we are retaining to 40 for all jobs that are using a
> large amount of storage (>5GB)*. This will hopefully allow us to return
> Jenkins to a normal functioning state, though it will do so at the cost of
> a significant amount of build history (right now, for example,
> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
> normal retention limit once the underlying problem is resolved. Given that
> this is irreversible (and not guaranteed to work), I wanted to gather
> feedback before doing this. Personally, I rarely use builds that old, but
> others may feel differently.
>
>
>
> Please let me know if you have any objections or support for this proposal.
>
>
>
> Thanks,
>
> Danny
>
>
>
> [1] Tracking issue: https://github.com/apache/beam/issues/26197
> 
>
> [2] Example run with this error:
> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
> 
>
> [3] Rollback PR: https://github.com/apache/beam/pull/26199
> 
>
>


Re: Jenkins Flakes

2023-04-11 Thread Moritz Mack
Thanks so much for looking into this!
I’m absolutely +1 for removing Jenkins related friction and the proposed 
changes sound legitimate.

Also, considering the number of flaky tests in general [1], code coverage might
not be the most pressing issue. Should it be disabled everywhere in favor of
more reliable / faster builds? Unless devs here are willing to commit to taking
action, is there much value in recording these numbers as part of the normal
precommit jobs?

Kind regards,
Moritz

[1] https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake

On 11.04.23, 16:24, "Danny McCormick via dev"  wrote:

tl;dr - I want to temporarily reduce the number of builds that we retain to 
reduce pressure on Jenkins

Hey everyone, over the past few days our Jenkins runs have been particularly 
flaky across the board, with errors like the following showing up all over the 
place [1]:


java.nio.file.FileSystemException: 
/home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
 No space left on device [2]

These errors indicate that we're out of space on the Jenkins master node. After 
some digging (thanks @Yi Hu @Ahmet 
Altay and @Bruno Volpato 
for contributing), we've determined that at least one large contributing issue 
is that some of our builds are eating up too much space. For example, our 
beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this is 
just one example).

@Yi Hu found one change around code coverage that is 
likely heavily contributing to the problem and rolled that back [3]. We can 
continue to find other contributing factors here.

In the meantime, to get us back to healthy I propose that we reduce the number 
of builds that we are retaining to 40 for all jobs that are using a large 
amount of storage (>5GB). This will hopefully allow us to return Jenkins to a 
normal functioning state, though it will do so at the cost of a significant 
amount of build history (right now, for example, beam_PreCommit_Java_Commit is 
at 400 retained builds). We could restore the normal retention limit once the 
underlying problem is resolved. Given that this is irreversible (and not 
guaranteed to work), I wanted to gather feedback before doing this. Personally, 
I rarely use builds that old, but others may feel differently.

Please let me know if you have any objections or support for this proposal.

Thanks,
Danny

[1] Tracking issue: 
https://github.com/apache/beam/issues/26197
[2] Example run with this error: 
https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
[3] Rollback PR: 
https://github.com/apache/beam/pull/26199



Re: Jenkins Flakes

2023-04-11 Thread Robert Burke
+1

SGTM

Remember, if an issue is being investigated, a committer can always mark a
build to be retained longer in the Jenkins UI. Just be sure to clean it up
once it's resolved though.

(TBH there may also be some old retained builds like that, but I doubt
there's a good way to see which are still relevant.)
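
(For what it's worth, the same "Keep this build forever" toggle is scriptable
from the Script Console, and a short loop can at least list which builds are
currently pinned. A hypothetical sketch follows; the job name and build number
are made up for illustration.)

    import jenkins.model.Jenkins

    // Pin one build while an investigation is open; unpin it later so the
    // retention policy can reclaim the space.
    def job = Jenkins.instance.getItemByFullName('beam_PreCommit_Java_Commit',
                                                 hudson.model.Job)
    job?.getBuildByNumber(1234)?.keepLog(true)      // pin while investigating
    // later, once the linked issue is resolved:
    // job?.getBuildByNumber(1234)?.keepLog(false)  // unpin again

    // List every build currently marked "keep this build forever".
    Jenkins.instance.getAllItems(hudson.model.Job).each { j ->
        j.builds.findAll { it.isKeepLog() }.each { b ->
            println "${j.fullName} #${b.number}"
        }
    }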

On Tue, Apr 11, 2023, 8:03 AM Yi Hu via dev  wrote:

> +1 Thanks Danny for figuring out a solution.
>
> Best,
> Yi
>
> On Tue, Apr 11, 2023 at 10:56 AM Svetak Sundhar via dev <
> dev@beam.apache.org> wrote:
>
>> +1 to the proposal.
>>
>> Regarding the "(and not guaranteed to work)" part, is the resolution that
>> the memory issues may still persist and we restore the normal retention
>> limit (and we look for another fix), or that we never restore back to the
>> normal retention limit?
>>
>>
>> Svetak Sundhar
>>
>>   Technical Solutions Engineer, Data
>> svetaksund...@google.com
>>
>>
>>
>> On Tue, Apr 11, 2023 at 10:34 AM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 for getting Jenkins back into a happier state, getting release
>>> blockers resolved ahead of building an RC has been severely hindered by
>>> Jenkins not picking up tests or running them properly.
>>>
>>> On Tue, Apr 11, 2023 at 10:24 AM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 *tl;dr - I want to temporarily reduce the number of builds that we
 retain to reduce pressure on Jenkins*

 Hey everyone, over the past few days our Jenkins runs have been
 particularly flaky across the board, with errors like the following showing
 up all over the place [1]:

 java.nio.file.FileSystemException: 
 /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
  No space left on device [2]


 These errors indicate that we're out of space on the Jenkins master
 node. After some digging (thanks @Yi Hu  @Ahmet Altay
  and @Bruno Volpato  for
 contributing), we've determined that at least one large contributing issue
 is that some of our builds are eating up too much space. For example, our
 beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
 is just one example).

 @Yi Hu  found one change around code coverage that
 is likely heavily contributing to the problem and rolled that back [3]. We
 can continue to find other contributing factors here.

 In the meantime, to get us back to healthy *I propose that we reduce
 the number of builds that we are retaining to 40 for all jobs that are
 using a large amount of storage (>5GB)*. This will hopefully allow us
 to return Jenkins to a normal functioning state, though it will do so at
 the cost of a significant amount of build history (right now, for example,
 beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
 normal retention limit once the underlying problem is resolved. Given that
 this is irreversible (and not guaranteed to work), I wanted to gather
 feedback before doing this. Personally, I rarely use builds that old, but
 others may feel differently.

 Please let me know if you have any objections or support for this
 proposal.

 Thanks,
 Danny

 [1] Tracking issue: https://github.com/apache/beam/issues/26197
 [2] Example run with this error:
 https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
 [3] Rollback PR: https://github.com/apache/beam/pull/26199

>>>


Re: Jenkins Flakes

2023-04-11 Thread Yi Hu via dev
+1 Thanks Danny for figuring out a solution.

Best,
Yi

On Tue, Apr 11, 2023 at 10:56 AM Svetak Sundhar via dev 
wrote:

> +1 to the proposal.
>
> Regarding the "(and not guaranteed to work)" part, is the resolution that
> the memory issues may still persist and we restore the normal retention
> limit (and we look for another fix), or that we never restore back to the
> normal retention limit?
>
>
> Svetak Sundhar
>
>   Technical Solutions Engineer, Data
> svetaksund...@google.com
>
>
>
> On Tue, Apr 11, 2023 at 10:34 AM Jack McCluskey via dev <
> dev@beam.apache.org> wrote:
>
>> +1 for getting Jenkins back into a happier state, getting release
>> blockers resolved ahead of building an RC has been severely hindered by
>> Jenkins not picking up tests or running them properly.
>>
>> On Tue, Apr 11, 2023 at 10:24 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> *tl;dr - I want to temporarily reduce the number of builds that we
>>> retain to reduce pressure on Jenkins*
>>>
>>> Hey everyone, over the past few days our Jenkins runs have been
>>> particularly flaky across the board, with errors like the following showing
>>> up all over the place [1]:
>>>
>>> java.nio.file.FileSystemException: 
>>> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>>>  No space left on device [2]
>>>
>>>
>>> These errors indicate that we're out of space on the Jenkins master
>>> node. After some digging (thanks @Yi Hu  @Ahmet Altay
>>>  and @Bruno Volpato  for
>>> contributing), we've determined that at least one large contributing issue
>>> is that some of our builds are eating up too much space. For example, our
>>> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
>>> is just one example).
>>>
>>> @Yi Hu  found one change around code coverage that is
>>> likely heavily contributing to the problem and rolled that back [3]. We can
>>> continue to find other contributing factors here.
>>>
>>> In the meantime, to get us back to healthy *I propose that we reduce
>>> the number of builds that we are retaining to 40 for all jobs that are
>>> using a large amount of storage (>5GB)*. This will hopefully allow us
>>> to return Jenkins to a normal functioning state, though it will do so at
>>> the cost of a significant amount of build history (right now, for example,
>>> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
>>> normal retention limit once the underlying problem is resolved. Given that
>>> this is irreversible (and not guaranteed to work), I wanted to gather
>>> feedback before doing this. Personally, I rarely use builds that old, but
>>> others may feel differently.
>>>
>>> Please let me know if you have any objections or support for this
>>> proposal.
>>>
>>> Thanks,
>>> Danny
>>>
>>> [1] Tracking issue: https://github.com/apache/beam/issues/26197
>>> [2] Example run with this error:
>>> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
>>> [3] Rollback PR: https://github.com/apache/beam/pull/26199
>>>
>>


Re: Jenkins Flakes

2023-04-11 Thread Svetak Sundhar via dev
+1 to the proposal.

Regarding the "(and not guaranteed to work)" part, is the resolution that
the memory issues may still persist and we restore the normal retention
limit (and we look for another fix), or that we never restore back to the
normal retention limit?


Svetak Sundhar

  Technical Solutions Engineer, Data
svetaksund...@google.com



On Tue, Apr 11, 2023 at 10:34 AM Jack McCluskey via dev 
wrote:

> +1 for getting Jenkins back into a happier state, getting release blockers
> resolved ahead of building an RC has been severely hindered by Jenkins not
> picking up tests or running them properly.
>
> On Tue, Apr 11, 2023 at 10:24 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> *tl;dr - I want to temporarily reduce the number of builds that we retain
>> to reduce pressure on Jenkins*
>>
>> Hey everyone, over the past few days our Jenkins runs have been
>> particularly flaky across the board, with errors like the following showing
>> up all over the place [1]:
>>
>> java.nio.file.FileSystemException: 
>> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>>  No space left on device [2]
>>
>>
>> These errors indicate that we're out of space on the Jenkins master node.
>> After some digging (thanks @Yi Hu  @Ahmet Altay
>>  and @Bruno Volpato  for
>> contributing), we've determined that at least one large contributing issue
>> is that some of our builds are eating up too much space. For example, our
>> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
>> is just one example).
>>
>> @Yi Hu  found one change around code coverage that is
>> likely heavily contributing to the problem and rolled that back [3]. We can
>> continue to find other contributing factors here.
>>
>> In the meantime, to get us back to healthy *I propose that we reduce the
>> number of builds that we are retaining to 40 for all jobs that are using a
>> large amount of storage (>5GB)*. This will hopefully allow us to return
>> Jenkins to a normal functioning state, though it will do so at the cost of
>> a significant amount of build history (right now, for example,
>> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
>> normal retention limit once the underlying problem is resolved. Given that
>> this is irreversible (and not guaranteed to work), I wanted to gather
>> feedback before doing this. Personally, I rarely use builds that old, but
>> others may feel differently.
>>
>> Please let me know if you have any objections or support for this
>> proposal.
>>
>> Thanks,
>> Danny
>>
>> [1] Tracking issue: https://github.com/apache/beam/issues/26197
>> [2] Example run with this error:
>> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
>> [3] Rollback PR: https://github.com/apache/beam/pull/26199
>>
>


Re: Jenkins Flakes

2023-04-11 Thread Jack McCluskey via dev
+1 for getting Jenkins back into a happier state, getting release blockers
resolved ahead of building an RC has been severely hindered by Jenkins not
picking up tests or running them properly.

On Tue, Apr 11, 2023 at 10:24 AM Danny McCormick via dev <
dev@beam.apache.org> wrote:

> *tl;dr - I want to temporarily reduce the number of builds that we retain
> to reduce pressure on Jenkins*
>
> Hey everyone, over the past few days our Jenkins runs have been
> particularly flaky across the board, with errors like the following showing
> up all over the place [1]:
>
> java.nio.file.FileSystemException: 
> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>  No space left on device [2]
>
>
> These errors indicate that we're out of space on the Jenkins master node.
> After some digging (thanks @Yi Hu  @Ahmet Altay
>  and @Bruno Volpato  for
> contributing), we've determined that at least one large contributing issue
> is that some of our builds are eating up too much space. For example, our
> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
> is just one example).
>
> @Yi Hu  found one change around code coverage that is
> likely heavily contributing to the problem and rolled that back [3]. We can
> continue to find other contributing factors here.
>
> In the meantime, to get us back to healthy *I propose that we reduce the
> number of builds that we are retaining to 40 for all jobs that are using a
> large amount of storage (>5GB)*. This will hopefully allow us to return
> Jenkins to a normal functioning state, though it will do so at the cost of
> a significant amount of build history (right now, for example,
> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
> normal retention limit once the underlying problem is resolved. Given that
> this is irreversible (and not guaranteed to work), I wanted to gather
> feedback before doing this. Personally, I rarely use builds that old, but
> others may feel differently.
>
> Please let me know if you have any objections or support for this proposal.
>
> Thanks,
> Danny
>
> [1] Tracking issue: https://github.com/apache/beam/issues/26197
> [2] Example run with this error:
> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
> [3] Rollback PR: https://github.com/apache/beam/pull/26199
>