Re: s390x KOJI builders issue

2022-03-04 Thread Colin Walters


On Thu, Mar 3, 2022, at 4:25 PM, Colin Walters wrote:
> On Wed, Mar 2, 2022, at 7:04 PM, Kevin Fenzi wrote:
>
>> * OOM killer looks and says... oh hey, I need to kill something. This
>> kojid process/slice is taking up all the memory.
>> * kojid is killed.
>
> If we replaced Koji's backend with Kubernetes (at least my employer's 
> production way to run Linux containers), and mock with scheduled pods 
> that just run `yum builddep $package && rpmbuild` inside them etc. 

Filed https://pagure.io/koji/issue/3273 to centralize this and gave it a catchy 
name too!
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam on the list, report it: 
https://pagure.io/fedora-infrastructure


Re: s390x KOJI builders issue

2022-03-03 Thread Kevin Fenzi
...snip...
> 
> If this is just s390x builders, I'd prefer to see if we cannot rebalance
> them to just pass your builds. So, looking at it, we have 20 buildvm's
> on a host with 256gb mem. I could bump them all from 10 to 12 without
> overcommitting. I don't know if 2gb would help enough tho? Is that worth
> trying before anything else? If that doesn't work, we could reduce and
> consolidate builders. ;( 

I went ahead and moved the builders to 13GB of memory instead of 10GB.

I then forced the mariadb build onto one of those and it finished.

So, perhaps things are better now?

But do keep an eye on it, and if you see it again let me know.

kevin




Re: s390x KOJI builders issue

2022-03-03 Thread Colin Walters
On Wed, Mar 2, 2022, at 7:04 PM, Kevin Fenzi wrote:

> * OOM killer looks and says... oh hey, I need to kill something. This
> kojid process/slice is taking up all the memory.
> * kojid is killed.

If we replaced Koji's backend with Kubernetes (at least my employer's 
production way to run Linux containers), and mock with scheduled pods that just 
run `yum builddep $package && rpmbuild` inside them etc. then this would be 
fixed for free because significant work has gone in to protecting the kubelet 
(equivalent of kojid) from workloads.  See e.g.

https://docs.openshift.com/container-platform/4.9/nodes/nodes/nodes-nodes-resources-configuring.html


Re: s390x KOJI builders issue

2022-03-03 Thread Kevin Fenzi
On Thu, Mar 03, 2022 at 02:32:50AM +0100, Michal Schorm wrote:
> In many cases, the build is killed during compilation itself.
> I'd understand the situation, if it would consistently fail somewhere
> during the testsuite on OOM errors, but it's weirder than that.
> 
> Until now, I didn't have this issue. Why now?

In January we got more s390x resources and rebalanced things.
Before January 18th the builders had 20GB mem and 4 cpus.
So if this started happening after that, I suspect that's the cause?

> The tests are still important.

Agreed completely. 

> Through the years I took several steps to reduce the resource usage
> for the testsuite.
> The most significant is that I ran the full testsuite only once or few
> times in scratch builds, and when I didn't find any issues worth
> investigating, I switch the testsuite to a minimal mode for every
> other build of the same minor versions.
> So e.g. mass rebuilds which only bump patch numbers in the NVR run
> only the 'main' suite. As well as other small patches during the life
> of that particular upstream release.
> 
> The issue in general is:
> We have the majority of packages which are small and quick to build.
> Then we have a minority of insanely huge projects, whose resource
> thirst can never be quenched. :)
> 
> Could we somehow just identify the huge packages, mark them in a
> special way, and when KOJI would pick up such marked packages, it
> would give it much more resources ?
> At the same time, the average amount of resources given should be
> lowered to only what most packages need.
> I believe all could benefit from this.

Yes, but it gets complex. 

Koji has the ability to set policy and send builds matching some
expression to a specific Koji 'channel' (i.e., a group of builders).

I had to do this for chromium a while back. It was never finishing on
aarch64 on normal builders. We have 2 buildhw's that are bare metal and
have a lot of memory/cpus, so I set those into a heavybuilder channel.
But a channel cannot be per-arch, so I had to add a bunch of x86 builders
as well for the x86_64 build. Sounds great, right?

But... if I just add more packages to that channel, there's only 2
aarch64 builders. So, when Tom submits say 4 chromiums, any other
packages that are submitted will just wait until those all finish before
even starting. :( 

i.e., if we have a heavybuild channel, it needs enough builders in it to
build as many of the big heavy packages at once as people normally submit,
or else it's going to serialize builds badly behind the arch with the
fewest builders.
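
To put rough numbers on that serialization effect, here is a toy simulation in Python. It assumes jobs are handed out first-come, first-served to whichever builder frees up first, which is a simplification and not a model of Koji's real scheduler. The function name and the build durations (10 hours for a big build, 2 hours for a smaller one) are made up for illustration.

```python
import heapq

def schedule(durations, builders):
    """Return the start time of each job, in submission order.

    Assumption: jobs are taken strictly FIFO by the first builder
    that becomes idle. Times are in hours.
    """
    free_at = [0] * builders      # time at which each builder is next idle
    heapq.heapify(free_at)
    starts = []
    for duration in durations:
        start = heapq.heappop(free_at)          # earliest idle builder
        starts.append(start)
        heapq.heappush(free_at, start + duration)
    return starts

# Hypothetical queue: four 10-hour builds submitted first, then one 2-hour build.
queue = [10, 10, 10, 10, 2]

for n in (2, 4, 5):
    wait = schedule(queue, builders=n)[-1]
    print(f"{n} builders: the last job waits {wait} hours before starting")
```

With only 2 builders in the channel the short job sits behind two full rounds of big builds; it only starts immediately once the channel has as many builders as there are queued jobs.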

So, I'm open to setting mariadb into a channel with bigger builders, but
realize that may mean there are fewer of them and your builds may
sometimes have to wait for a builder. ;( 

If this is just s390x builders, I'd prefer to see if we cannot rebalance
them to just pass your builds. So, looking at it, we have 20 buildvm's
on a host with 256GB mem. I could bump them all from 10GB to 12GB without
overcommitting. I don't know if 2GB would help enough tho? Is that worth
trying before anything else? If that doesn't work, we could reduce and
consolidate builders. ;( 
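
A quick check of that arithmetic in Python, using only the figures given above (20 buildvms, 256GB host). It assumes every VM gets the same allocation and ignores whatever the host itself needs, so treat the headroom as an upper bound. The function names are just for illustration.

```python
HOST_MEMORY_GB = 256   # host memory, from the message above
BUILDVM_COUNT = 20     # number of buildvms on that host, from the message above

def total_allocated(per_vm_gb, vm_count=BUILDVM_COUNT):
    """Memory handed out if every VM gets per_vm_gb."""
    return per_vm_gb * vm_count

def max_per_vm(host_gb=HOST_MEMORY_GB, vm_count=BUILDVM_COUNT):
    """Largest whole-GB allocation per VM that does not overcommit the host."""
    return host_gb // vm_count

for per_vm in (10, 12):
    total = total_allocated(per_vm)
    print(f"{per_vm}GB x {BUILDVM_COUNT} = {total}GB, "
          f"headroom {HOST_MEMORY_GB - total}GB")

print("largest whole-GB size without overcommit:", max_per_vm(), "GB")
```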

Thoughts?

kevin




Re: s390x KOJI builders issue

2022-03-03 Thread Florian Weimer
* Kevin Fenzi:

> Perhaps there's some way to adjust the oom killer to kill the build
> instead of kojid?

There is /proc/PID/oom_score_adj.  I don't know how to use it, sorry.
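
For reference, a minimal sketch in Python of how oom_score_adj could be used for this. The kernel accepts values from -1000 (never pick this process) to 1000 (pick it first) in /proc/PID/oom_score_adj, and a process may raise its own value without extra privileges. The idea is to mark the build process as the preferred OOM victim so that the daemon which launched it survives. The function names are hypothetical, and this is not how kojid launches builds.

```python
import subprocess

OOM_SCORE_ADJ_MIN = -1000   # kernel-defined lower bound
OOM_SCORE_ADJ_MAX = 1000    # kernel-defined upper bound

def clamp_oom_score_adj(value):
    """Clamp a requested adjustment to the range the kernel accepts."""
    return max(OOM_SCORE_ADJ_MIN, min(OOM_SCORE_ADJ_MAX, int(value)))

def run_as_preferred_oom_victim(cmd, adj=OOM_SCORE_ADJ_MAX):
    """Run cmd with a raised oom_score_adj (hypothetical helper).

    The value is written in the child between fork() and exec(), so only
    the launched command (and anything it spawns) is affected; the parent
    keeps its own score.
    """
    adj = clamp_oom_score_adj(adj)

    def raise_own_score():
        with open("/proc/self/oom_score_adj", "w") as f:
            f.write(str(adj))

    return subprocess.run(cmd, preexec_fn=raise_own_score,
                          capture_output=True, text=True)
```

On a Linux box, `run_as_preferred_oom_victim(["cat", "/proc/self/oom_score_adj"])` lets the child report its own adjusted value. Lowering the score of the daemon instead (a negative value) needs CAP_SYS_RESOURCE.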

Thanks,
Florian


Re: s390x KOJI builders issue

2022-03-02 Thread Michal Schorm
In many cases, the build is killed during compilation itself.
I'd understand the situation, if it would consistently fail somewhere
during the testsuite on OOM errors, but it's weirder than that.

Until now, I didn't have this issue. Why now?

The tests are still important.
Through the years I took several steps to reduce the resource usage
for the testsuite.
The most significant is that I run the full testsuite only once or a few
times in scratch builds, and when I don't find any issues worth
investigating, I switch the testsuite to a minimal mode for every
other build of the same minor version.
So e.g. mass rebuilds, which only bump patch numbers in the NVR, run
only the 'main' suite, as do other small patches during the life
of that particular upstream release.

The issue in general is:
We have the majority of packages which are small and quick to build.
Then we have a minority of insanely huge projects, whose resource
thirst can never be quenched. :)

Could we somehow just identify the huge packages, mark them in a
special way, and when Koji picks up such marked packages, it
would give them much more resources?
At the same time, the average amount of resources given should be
lowered to only what most packages need.
I believe all could benefit from this.

Michal
--

Michal Schorm
Software Engineer
Core Services - Databases Team
Red Hat

--

On Thu, Mar 3, 2022 at 1:05 AM Kevin Fenzi  wrote:
>
> On Wed, Mar 02, 2022 at 03:54:32PM +0100, Florian Weimer wrote:
> > * Michael Catanzaro:
> >
> > > On Wed, Mar 2 2022 at 02:21:22 PM +0100, Dan Horák 
> > > wrote:
> > >> those are weird, the build tasks have been restarted many times by the
> > >> builder daemon, after something crashed there (OOM?) ...
> > >
> > > This was happening to me on armv7hl a few weeks ago. Kevin Fenzi
> > > investigated and discovered that the builds kept hitting an OOM
> > > condition and then restarting, which triggered an infinite loop. Each
> > > build would work for 3-5 hours before failing, then it would start
> > > over, then again, then again
> > >
> > > I think some configuration changed recently on the builders, because I
> > > had never seen this happen before last month. If a build hits OOM, it
> > > really needs to fail immediately. It should not restart, because it's
> > > likely to fail again the same way. My builds had restarted four or
> > > five times before Kevin manually handled them.
> >
> > Maybe Koji restarts the build because the builder has rebooted?
>
> Nope.
>
> What happens is:
>
> * 10: Build is taken by builder and starts building.
> * Build takes up more than 90% of memory+swap
> * OOM killer looks and says... oh hey, I need to kill something. This
> kojid process/slice is taking up all the memory.
> * kojid is killed.
> * kojid is restarted (we have it set to restart in unit)
> * builder checks into hub
> * hub says, hey you are doing task X right?
> * builder says... oh, yes, let me start that.
> * goto 10
>
> So in this case it seems like it's the tests that are causing this.
> The s390x kvm builders have 2cpus and 10gb of memory.
>
> So, is there any way to decrease memory usage there?
> I see the tests have -parallel=auto perhaps that could be set to 1 or 2?
>
> Perhaps there's some way to adjust the oom killer to kill the build
> instead of kojid? I would prefer that because then the build would
> quickly fail and you could see it was killed and need to reduce memory
> consumption somehow.
>
> I suppose we could look at reducing the number of builders and
> increasing memory on fewer of them, but it's hard to know what the right
> value is there. it's definitely better for mass rebuilds to have more
> smaller builders.
>
> kevin


Re: s390x KOJI builders issue

2022-03-02 Thread Kevin Fenzi
On Wed, Mar 02, 2022 at 03:54:32PM +0100, Florian Weimer wrote:
> * Michael Catanzaro:
> 
> > On Wed, Mar 2 2022 at 02:21:22 PM +0100, Dan Horák 
> > wrote:
> >> those are weird, the build tasks have been restarted many times by the
> >> builder daemon, after something crashed there (OOM?) ...
> >
> > This was happening to me on armv7hl a few weeks ago. Kevin Fenzi
> > investigated and discovered that the builds kept hitting an OOM 
> > condition and then restarting, which triggered an infinite loop. Each
> > build would work for 3-5 hours before failing, then it would start 
> > over, then again, then again
> >
> > I think some configuration changed recently on the builders, because I
> > had never seen this happen before last month. If a build hits OOM, it 
> > really needs to fail immediately. It should not restart, because it's
> > likely to fail again the same way. My builds had restarted four or
> > five times before Kevin manually handled them.
> 
> Maybe Koji restarts the build because the builder has rebooted?

Nope.

What happens is:

* 10: Build is taken by builder and starts building.
* Build takes up more than 90% of memory+swap
* OOM killer looks and says... oh hey, I need to kill something. This
kojid process/slice is taking up all the memory.
* kojid is killed.
* kojid is restarted (we have it set to restart in unit)
* builder checks into hub
* hub says, hey you are doing task X right?
* builder says... oh, yes, let me start that.
* goto 10
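
The loop above can be sketched as a toy model in Python. This is not Koji's code: the function, its parameters, and the `fail_on_oom` switch are hypothetical, the memory figures in the usage are invented, and the 90% memory+swap threshold is simplified to a single limit. It only shows why the task never terminates when the daemon, rather than the build, is the OOM victim.

```python
def run_task(peak_gb, limit_gb, fail_on_oom, max_cycles=5):
    """Return (final_state, attempts) for one task on one builder.

    max_cycles exists only so the simulation terminates; the real loop
    has no such bound.
    """
    attempts = 0
    while attempts < max_cycles:
        attempts += 1                      # 10: builder starts (or restarts) the build
        if peak_gb <= limit_gb:
            return "closed", attempts      # build fits in memory and finishes
        # The build exceeds available memory, so the OOM killer fires.
        if fail_on_oom:
            # Only the build is killed: the task fails once and stays failed.
            return "failed", attempts
        # Otherwise the daemon is killed and restarted, checks in with the
        # hub, is told it still owns the task, and starts over (goto 10).
    return "still looping", attempts

print(run_task(peak_gb=14, limit_gb=10, fail_on_oom=False))
print(run_task(peak_gb=14, limit_gb=10, fail_on_oom=True))
```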

So in this case it seems like it's the tests that are causing this.
The s390x kvm builders have 2 cpus and 10GB of memory.

So, is there any way to decrease memory usage there?
I see the tests have -parallel=auto; perhaps that could be set to 1 or 2?

Perhaps there's some way to adjust the oom killer to kill the build
instead of kojid? I would prefer that because then the build would
quickly fail and you could see it was killed and need to reduce memory
consumption somehow.

I suppose we could look at reducing the number of builders and
increasing memory on fewer of them, but it's hard to know what the right
value is there. It's definitely better for mass rebuilds to have more,
smaller builders.

kevin




Re: s390x KOJI builders issue

2022-03-02 Thread Florian Weimer
* Michael Catanzaro:

> On Wed, Mar 2 2022 at 02:21:22 PM +0100, Dan Horák 
> wrote:
>> those are weird, the build tasks have been restarted many times by the
>> builder daemon, after something crashed there (OOM?) ...
>
> This was happening to me on armv7hl a few weeks ago. Kevin Fenzi
> investigated and discovered that the builds kept hitting an OOM 
> condition and then restarting, which triggered an infinite loop. Each
> build would work for 3-5 hours before failing, then it would start 
> over, then again, then again
>
> I think some configuration changed recently on the builders, because I
> had never seen this happen before last month. If a build hits OOM, it 
> really needs to fail immediately. It should not restart, because it's
> likely to fail again the same way. My builds had restarted four or
> five times before Kevin manually handled them.

Maybe Koji restarts the build because the builder has rebooted?

Thanks,
Florian


Re: s390x KOJI builders issue

2022-03-02 Thread Michael Catanzaro
On Wed, Mar 2 2022 at 02:21:22 PM +0100, Dan Horák wrote:

> those are weird, the build tasks have been restarted many times by the
> builder daemon, after something crashed there (OOM?) ...


This was happening to me on armv7hl a few weeks ago. Kevin Fenzi 
investigated and discovered that the builds kept hitting an OOM 
condition and then restarting, which triggered an infinite loop. Each 
build would work for 3-5 hours before failing, then it would start 
over, then again, then again


I think some configuration changed recently on the builders, because I 
had never seen this happen before last month. If a build hits OOM, it 
really needs to fail immediately. It should not restart, because it's 
likely to fail again the same way. My builds had restarted four or five 
times before Kevin manually handled them.


Michael



Re: s390x KOJI builders issue

2022-03-02 Thread Dan Horák
On Wed, 2 Mar 2022 14:08:23 +0100
Michal Schorm  wrote:

> Hello,
> for the last few days, I'm not able to finish my builds of the
> 'mariadb' package on s390x architecture.
> 
> Those builds freeze, e.g. several of these:
>   https://koji.fedoraproject.org/koji/taskinfo?taskID=83297826
>   https://koji.fedoraproject.org/koji/taskinfo?taskID=83296979
>   https://koji.fedoraproject.org/koji/taskinfo?taskID=83292553
>   https://koji.fedoraproject.org/koji/taskinfo?taskID=83290439
>   https://koji.fedoraproject.org/koji/taskinfo?taskID=83295670
>   https://koji.fedoraproject.org/koji/taskinfo?taskID=83293919
> have >100 hours total time before finally failing.

those are weird, the build tasks have been restarted many times by the
builder daemon, after something crashed there (OOM?) ...


Dan

> 
> Even the new one I submitted already has a > 22 hour count:
>   https://koji.fedoraproject.org/koji/taskinfo?taskID=83514145
> 
> The MariaDB build - especially with the full testsuite on - is very
> resource hungry.
> However the freezes are strange, since sometimes it freezes randomly
> even during compilation, while sometimes %check phase ...
> 
> --
> 
> Michal Schorm
> Software Engineer
> Core Services - Databases Team
> Red Hat
> 
> --


s390x KOJI builders issue

2022-03-02 Thread Michal Schorm
Hello,
for the last few days, I haven't been able to finish my builds of the
'mariadb' package on the s390x architecture.

Those builds freeze, e.g. several of these:
  https://koji.fedoraproject.org/koji/taskinfo?taskID=83297826
  https://koji.fedoraproject.org/koji/taskinfo?taskID=83296979
  https://koji.fedoraproject.org/koji/taskinfo?taskID=83292553
  https://koji.fedoraproject.org/koji/taskinfo?taskID=83290439
  https://koji.fedoraproject.org/koji/taskinfo?taskID=83295670
  https://koji.fedoraproject.org/koji/taskinfo?taskID=83293919
have >100 hours total time before finally failing.

Even the new one I submitted already has a > 22 hour count:
  https://koji.fedoraproject.org/koji/taskinfo?taskID=83514145

The MariaDB build - especially with the full testsuite on - is very
resource hungry.
However, the freezes are strange, since sometimes it freezes randomly
even during compilation, and sometimes during the %check phase ...

--

Michal Schorm
Software Engineer
Core Services - Databases Team
Red Hat

--