Re: s390x KOJI builders issue
On Thu, Mar 3, 2022, at 4:25 PM, Colin Walters wrote:
> On Wed, Mar 2, 2022, at 7:04 PM, Kevin Fenzi wrote:
>
>> * OOm killer looks and says... oh hey, I need to kill something. This
>> kojid process/slice is taking up all the memory.
>> * kojid is killed.
>
> If we replaced Koji's backend with Kubernetes (at least my employer's
> production way to run Linux containers), and mock with scheduled pods
> that just run `yum builddep $package && rpmbuild` inside them etc.

Filed https://pagure.io/koji/issue/3273 to centralize this and gave it a catchy name too!

___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
Re: s390x KOJI builders issue
...snip...
>
> If this is just s390x builders, I'd prefer to see if we cannot rebalance
> them to just pass your builds. So, looking at it, we have 20 buildvm's
> on a host with 256gb mem. I could bump them all from 10 to 12 without
> overcommitting. I don't know if 2gb would help enough tho? Is that worth
> trying before anything else? If that doesn't work, we could reduce and
> consolidate builders. ;(

I went ahead and moved the builders to 13GB memory instead of 10GB.
I then forced the mariadb build on one of those and it finished.

So perhaps things are better now? But do look, and if you see it again, let me know.

kevin
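The memory figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes all 20 VMs still sit on the single 256GB host and ignores any memory the host itself needs, neither of which the thread confirms for the final 13GB setting. The function names are made up for illustration.

```python
def total_guest_memory_gb(vm_count: int, per_vm_gb: int) -> int:
    """Memory handed out to guests if every VM gets the same amount."""
    return vm_count * per_vm_gb


def is_overcommitted(vm_count: int, per_vm_gb: int, host_gb: int) -> bool:
    """True when the guests together are promised more than the host has."""
    return total_guest_memory_gb(vm_count, per_vm_gb) > host_gb


HOST_GB = 256   # from the message: "a host with 256gb mem"
VM_COUNT = 20   # from the message: "20 buildvm's"

for per_vm_gb in (10, 12, 13):
    total = total_guest_memory_gb(VM_COUNT, per_vm_gb)
    state = "overcommitted" if is_overcommitted(VM_COUNT, per_vm_gb, HOST_GB) else "fits"
    print(f"{VM_COUNT} x {per_vm_gb}GB = {total}GB of {HOST_GB}GB: {state}")
```

Under these assumptions 12GB per VM fits (240GB) while 13GB does not (260GB), so the 13GB setting presumably relies on a different VM count, a different host, or some overcommit; the thread does not say which.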
Re: s390x KOJI builders issue
On Wed, Mar 2, 2022, at 7:04 PM, Kevin Fenzi wrote:

> * OOM killer looks and says... oh hey, I need to kill something. This
> kojid process/slice is taking up all the memory.
> * kojid is killed.

If we replaced Koji's backend with Kubernetes (at least my employer's production way to run Linux containers), and mock with scheduled pods that just run `yum builddep $package && rpmbuild` inside them etc., then this would be fixed for free because significant work has gone into protecting the kubelet (equivalent of kojid) from workloads.

See e.g. https://docs.openshift.com/container-platform/4.9/nodes/nodes/nodes-nodes-resources-configuring.html
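To make the idea above a little more concrete, here is a rough sketch of what such a build pod could look like, written as a Python dict that serializes to a Pod manifest. Everything specific in it is an assumption for illustration: the image name, the pod naming scheme and the 8Gi limit are invented, and the command is just the placeholder from the message, not a working build invocation. The point is the `resources.limits.memory` field: a container that exceeds it is the thing that gets OOM-killed, rather than the node agent.

```python
import json


def build_pod_manifest(package: str, memory_limit: str = "8Gi") -> dict:
    """Return a one-shot build Pod manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"build-{package}"},  # hypothetical naming scheme
        "spec": {
            # Never restart: a build that dies should fail, not loop.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "rpmbuild",
                    # Hypothetical image; no such image is named in the thread.
                    "image": "registry.example.invalid/fedora-builder:latest",
                    # Placeholder command taken verbatim from the message.
                    "command": ["sh", "-c", "yum builddep $package && rpmbuild"],
                    "env": [{"name": "package", "value": package}],
                    "resources": {
                        "requests": {"memory": memory_limit},
                        # Exceeding this limit gets the container killed.
                        "limits": {"memory": memory_limit},
                    },
                }
            ],
        },
    }


print(json.dumps(build_pod_manifest("mariadb"), indent=2))
```

Setting the request equal to the limit is one possible choice so that the scheduler only places the build where that much memory is actually available; other policies are possible.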
Re: s390x KOJI builders issue
On Thu, Mar 03, 2022 at 02:32:50AM +0100, Michal Schorm wrote:
> In many cases, the build is killed during compilation itself.
> I'd understand the situation if it consistently failed somewhere
> during the testsuite on OOM errors, but it's weirder than that.
>
> Until now, I didn't have this issue. Why now?

In January we got more s390x resources and rebalanced things. Before Jan 18th the builders had 20GB mem and 4 cpus. So I suspect if this started happening after that, that's the cause?

> The tests are still important.

Agreed completely.

> Through the years I took several steps to reduce the resource usage
> for the testsuite.
> The most significant is that I run the full testsuite only once or a few
> times in scratch builds, and when I don't find any issues worth
> investigating, I switch the testsuite to a minimal mode for every
> other build of the same minor version.
> So e.g. mass rebuilds which only bump patch numbers in the NVR run
> only the 'main' suite. The same goes for other small patches during the
> life of that particular upstream release.
>
> The issue in general is:
> We have the majority of packages which are small and quick to build.
> Then we have a minority of insanely huge projects, whose resource
> thirst can never be quenched. :)
>
> Could we somehow just identify the huge packages, mark them in a
> special way, and when KOJI picks up such a marked package, it
> would give it much more resources?
> At the same time, the average amount of resources given should be
> lowered to only what most packages need.
> I believe all could benefit from this.

Yes, but it gets complex.

koji has the ability to set policy and send builds matching some expression to a specific koji 'channel' (ie, group of builders). I had to do this for chromium a while back. It was never finishing on aarch64 on normal builders. We have 2 buildhw's that are bare metal and have a lot of memory/cpus, so I set those into a heavybuilder channel. But a channel cannot be per arch, so I had to add a bunch of x86 builders also for the x86_64 build.

Sounds great, right? But... if I just add more packages to that channel, there are only 2 aarch64 builders. So, when Tom submits say 4 chromiums, any other packages that are submitted will just wait until those all finish before even starting. :(

ie, if we have a heavybuild channel, it needs enough builders in it to build as many of the big heavy packages at once as people normally do, or else it's going to serialize builds badly behind the fewest ones.

So, I'm open to setting mariadb into a channel with bigger builders, but realize that may mean that there are fewer of them and they may sometimes have to wait for a builder. ;(

If this is just s390x builders, I'd prefer to see if we cannot rebalance them to just pass your builds. So, looking at it, we have 20 buildvm's on a host with 256gb mem. I could bump them all from 10 to 12 without overcommitting. I don't know if 2gb would help enough tho? Is that worth trying before anything else? If that doesn't work, we could reduce and consolidate builders. ;(

Thoughts?

kevin
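The serialization problem described above can be illustrated with a toy first-come-first-served scheduler. This is not how Koji assigns tasks internally; it is a minimal sketch, and the build durations (10 hours per chromium, 3 hours for mariadb) are invented numbers chosen only to show the effect of a small channel.

```python
import heapq


def schedule(num_builders: int, jobs: list[tuple[str, float]]) -> dict[str, float]:
    """Start each job, in submission order, on whichever builder frees up first.

    Returns the start time (in hours) of every job.
    """
    free_at = [0.0] * num_builders  # time at which each builder becomes idle
    heapq.heapify(free_at)
    starts: dict[str, float] = {}
    for name, hours in jobs:
        start = heapq.heappop(free_at)  # earliest available builder
        starts[name] = start
        heapq.heappush(free_at, start + hours)
    return starts


# Hypothetical workload: four chromium builds submitted just before mariadb.
JOBS = [(f"chromium-{i}", 10.0) for i in range(1, 5)] + [("mariadb", 3.0)]

for builders in (2, 4, 5):
    wait = schedule(builders, JOBS)["mariadb"]
    print(f"{builders} heavy builders: mariadb waits {wait:.0f}h before starting")
```

With these made-up durations, two builders leave mariadb queued until every chromium build has finished, and the wait only disappears once the channel has as many builders as there are heavy builds in flight.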
Re: s390x KOJI builders issue
* Kevin Fenzi:

> Perhaps there's some way to adjust the oom killer to kill the build
> instead of kojid?

There is /proc/PID/oom_score_adj. I don't know how to use it, sorry.

Thanks,
Florian
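For reference, a minimal sketch of how that file is typically used. The kernel accepts a value from -1000 to 1000 in /proc/PID/oom_score_adj: higher values make the process a more likely OOM victim, -1000 exempts it, and the value is inherited by child processes. The helper name and the `proc_root` parameter (there only so the function can be tried against a scratch directory) are assumptions for illustration, and nothing here reflects what kojid or mock actually do.

```python
from pathlib import Path

OOM_SCORE_ADJ_MIN = -1000  # process is never picked by the OOM killer
OOM_SCORE_ADJ_MAX = 1000   # process is the preferred victim


def set_oom_score_adj(pid, value: int, proc_root: str = "/proc") -> Path:
    """Write `value` to <proc_root>/<pid>/oom_score_adj and return the path.

    `pid` may be an integer or the string "self". Raising a process's own
    score needs no privilege; lowering it generally requires CAP_SYS_RESOURCE.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    path.write_text(f"{value}\n")
    return path


# Example (not executed here): a build wrapper could make itself, and every
# compiler or test process it spawns afterwards, the preferred OOM victim:
#     set_oom_score_adj("self", 500)
```

For the daemon side, systemd has an `OOMScoreAdjust=` unit setting that applies the same knob to a service. Because children inherit the value, lowering it for the daemon alone would also protect the builds it spawns unless they raise it again, so the two would likely need to be combined.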
Re: s390x KOJI builders issue
In many cases, the build is killed during compilation itself.
I'd understand the situation if it consistently failed somewhere during the testsuite on OOM errors, but it's weirder than that.

Until now, I didn't have this issue. Why now?

The tests are still important.
Through the years I took several steps to reduce the resource usage for the testsuite.
The most significant is that I run the full testsuite only once or a few times in scratch builds, and when I don't find any issues worth investigating, I switch the testsuite to a minimal mode for every other build of the same minor version.
So e.g. mass rebuilds which only bump patch numbers in the NVR run only the 'main' suite. The same goes for other small patches during the life of that particular upstream release.

The issue in general is:
We have the majority of packages which are small and quick to build.
Then we have a minority of insanely huge projects, whose resource thirst can never be quenched. :)

Could we somehow just identify the huge packages, mark them in a special way, and when KOJI picks up such a marked package, it would give it much more resources?
At the same time, the average amount of resources given should be lowered to only what most packages need.
I believe all could benefit from this.

Michal

--

Michal Schorm
Software Engineer
Core Services - Databases Team
Red Hat

--

On Thu, Mar 3, 2022 at 1:05 AM Kevin Fenzi wrote:
> ...snip...
Re: s390x KOJI builders issue
On Wed, Mar 02, 2022 at 03:54:32PM +0100, Florian Weimer wrote:
> * Michael Catanzaro:
>
> > On Wed, Mar 2 2022 at 02:21:22 PM +0100, Dan Horák
> > wrote:
> >> those are weird, the build tasks have been restarted many times by the
> >> builder daemon, after something crashed there (OOM?) ...
> >
> > This was happening to me on armv7hl a few weeks ago. Kevin Fenzi
> > investigated and discovered that the builds kept hitting an OOM
> > condition and then restarting, which triggered an infinite loop. Each
> > build would work for 3-5 hours before failing, then it would start
> > over, then again, then again
> >
> > I think some configuration changed recently on the builders, because I
> > had never seen this happen before last month. If a build hits OOM, it
> > really needs to fail immediately. It should not restart, because it's
> > likely to fail again the same way. My builds had restarted four or
> > five times before Kevin manually handled them.
>
> Maybe Koji restarts the build because the builder has rebooted?

Nope.

What happens is:

* 10: Build is taken by builder and starts building.
* Build takes up more than 90% of memory+swap
* OOM killer looks and says... oh hey, I need to kill something. This
  kojid process/slice is taking up all the memory.
* kojid is killed.
* kojid is restarted (we have it set to restart in unit)
* builder checks into hub
* hub says, hey you are doing task X right?
* builder says... oh, yes, let me start that.
* goto 10

So in this case it seems like it's the tests that are causing this.
The s390x kvm builders have 2 cpus and 10GB of memory.

So, is there any way to decrease memory usage there?
I see the tests have -parallel=auto; perhaps that could be set to 1 or 2?

Perhaps there's some way to adjust the oom killer to kill the build instead of kojid? I would prefer that because then the build would quickly fail and you could see it was killed and need to reduce memory consumption somehow.

I suppose we could look at reducing the number of builders and increasing memory on fewer of them, but it's hard to know what the right value is there. It's definitely better for mass rebuilds to have more smaller builders.

kevin
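The "goto 10" loop described above can be modelled in a few lines. This is a toy simulation, not Koji code: the function, its parameters and the idea of a restart cap are all hypothetical, and the 12GB memory need is an invented figure. It only shows why a task that does not fit never terminates unless something bounds the restarts.

```python
def run_task(memory_needed_gb, builder_memory_gb, max_restarts=None, sim_cap=1000):
    """Simulate the take-task / OOM / restart / resume cycle.

    Returns (outcome, attempts). `max_restarts=None` models the behaviour in
    the message (the hub always hands the same task back); `sim_cap` merely
    stops the simulation so that it terminates.
    """
    attempts = 0
    while True:
        attempts += 1                       # "10": builder starts the build
        if memory_needed_gb <= builder_memory_gb:
            return ("finished", attempts)
        # Build exhausts memory: the daemon is killed, restarted by its unit,
        # checks into the hub and is told it still owns the task.
        restarts_so_far = attempts - 1
        if max_restarts is not None and restarts_so_far >= max_restarts:
            return ("failed", attempts)     # hypothetical guard: give up
        if attempts >= sim_cap:
            return ("looping", attempts)    # would go on forever


print(run_task(12, 10, sim_cap=5))          # no guard: keeps looping
print(run_task(12, 10, max_restarts=0))     # fail on the first OOM
print(run_task(12, 13))                     # enough memory: finishes
```

A cap of zero restarts corresponds to the "fail immediately" behaviour asked for earlier in the thread; whether such a cap belongs in the daemon, the hub, or the OOM configuration is a design question this sketch does not settle.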
Re: s390x KOJI builders issue
* Michael Catanzaro:

> On Wed, Mar 2 2022 at 02:21:22 PM +0100, Dan Horák
> wrote:
>> those are weird, the build tasks have been restarted many times by the
>> builder daemon, after something crashed there (OOM?) ...
>
> This was happening to me on armv7hl a few weeks ago. Kevin Fenzi
> investigated and discovered that the builds kept hitting an OOM
> condition and then restarting, which triggered an infinite loop. Each
> build would work for 3-5 hours before failing, then it would start
> over, then again, then again
>
> I think some configuration changed recently on the builders, because I
> had never seen this happen before last month. If a build hits OOM, it
> really needs to fail immediately. It should not restart, because it's
> likely to fail again the same way. My builds had restarted four or
> five times before Kevin manually handled them.

Maybe Koji restarts the build because the builder has rebooted?

Thanks,
Florian
Re: s390x KOJI builders issue
On Wed, Mar 2 2022 at 02:21:22 PM +0100, Dan Horák wrote:
> those are weird, the build tasks have been restarted many times by the
> builder daemon, after something crashed there (OOM?) ...

This was happening to me on armv7hl a few weeks ago. Kevin Fenzi investigated and discovered that the builds kept hitting an OOM condition and then restarting, which triggered an infinite loop. Each build would work for 3-5 hours before failing, then it would start over, then again, then again

I think some configuration changed recently on the builders, because I had never seen this happen before last month. If a build hits OOM, it really needs to fail immediately. It should not restart, because it's likely to fail again the same way. My builds had restarted four or five times before Kevin manually handled them.

Michael
Re: s390x KOJI builders issue
On Wed, 2 Mar 2022 14:08:23 +0100
Michal Schorm wrote:

> Hello,
> for the last few days, I'm not able to finish my builds of the
> 'mariadb' package on s390x architecture.
>
> Those builds freeze, e.g. several of these:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=83297826
> https://koji.fedoraproject.org/koji/taskinfo?taskID=83296979
> https://koji.fedoraproject.org/koji/taskinfo?taskID=83292553
> https://koji.fedoraproject.org/koji/taskinfo?taskID=83290439
> https://koji.fedoraproject.org/koji/taskinfo?taskID=83295670
> https://koji.fedoraproject.org/koji/taskinfo?taskID=83293919
> have >100 hours total time before finally failing.

those are weird, the build tasks have been restarted many times by the builder daemon, after something crashed there (OOM?) ...

Dan

>
> Even the new one I submitted already has a > 22 hours count:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=83514145
>
> The MariaDB build - especially with the full testsuite on - is very
> resource hungry.
> However the freezes are strange, since sometimes it freezes randomly
> even during compilation, while sometimes during the %check phase ...
s390x KOJI builders issue
Hello,
for the last few days, I'm not able to finish my builds of the 'mariadb' package on s390x architecture.

Those builds freeze, e.g. several of these:
https://koji.fedoraproject.org/koji/taskinfo?taskID=83297826
https://koji.fedoraproject.org/koji/taskinfo?taskID=83296979
https://koji.fedoraproject.org/koji/taskinfo?taskID=83292553
https://koji.fedoraproject.org/koji/taskinfo?taskID=83290439
https://koji.fedoraproject.org/koji/taskinfo?taskID=83295670
https://koji.fedoraproject.org/koji/taskinfo?taskID=83293919
have >100 hours total time before finally failing.

Even the new one I submitted already has a > 22 hours count:
https://koji.fedoraproject.org/koji/taskinfo?taskID=83514145

The MariaDB build - especially with the full testsuite on - is very resource hungry.
However the freezes are strange, since sometimes it freezes randomly even during compilation, while sometimes during the %check phase ...

--

Michal Schorm
Software Engineer
Core Services - Databases Team
Red Hat

--