Re: s390 private runner CI job timing out
Am 06.04.23 um 14:39 schrieb Peter Maydell: On Thu, 6 Apr 2023 at 13:30, Thomas Huth wrote: The thing is: it shouldn't take that long to build QEMU and run the tests here, theoretically. Some days ago, the job was finishing in 39 minutes: https://gitlab.com/qemu-project/qemu/-/jobs/3973481571 The recent run took 74 minutes: https://gitlab.com/qemu-project/qemu/-/jobs/4066136770 That's almost a factor of two! So there is definitely something strange going on. So that 39 minute run was about 22 minutes compile, 17 minutes test. The 74 minute run was 45 minutes compile, 30 minutes test. The number of compile steps in meson was pretty much the same (10379 vs 10384) in each case. So the most plausible conclusion seems like "the VM mysteriously got slower by nearly a factor of 2", given that the slowdown seems to affect the compile and test stages about equally. The VM has been up for 44 days, so we can rule out "rebooted into a new kernel with some kind of perf bug". Looks like the team is looking into some storage delays at the moment. So it would be good to get some numbers for io wait and steal on the next run to see if this is related to storage or CPU consumption. Maybe collect /proc/stat before and after a run. Or login and use top.
Re: s390 private runner CI job timing out
Am 06.04.23 um 14:39 schrieb Peter Maydell: On Thu, 6 Apr 2023 at 13:30, Thomas Huth wrote: The thing is: it shouldn't take that long to build QEMU and run the tests here, theoretically. Some days ago, the job was finishing in 39 minutes: https://gitlab.com/qemu-project/qemu/-/jobs/3973481571 The recent run took 74 minutes: https://gitlab.com/qemu-project/qemu/-/jobs/4066136770 That's almost a factor of two! So there is definitely something strange going on. So that 39 minute run was about 22 minutes compile, 17 minutes test. The 74 minute run was 45 minutes compile, 30 minutes test. The number of compile steps in meson was pretty much the same (10379 vs 10384) in each case. So the most plausible conclusion seems like "the VM mysteriously got slower by nearly a factor of 2", given that the slowdown seems to affect the compile and test stages about equally. The VM has been up for 44 days, so we can rule out "rebooted into a new kernel with some kind of perf bug". I will ask around if we can get a higher share of the system.
Re: s390 private runner CI job timing out
Am 06.04.23 um 14:30 schrieb Thomas Huth: On 06/04/2023 14.13, Christian Borntraeger wrote: Am 06.04.23 um 14:05 schrieb Peter Maydell: On Thu, 6 Apr 2023 at 12:17, Christian Borntraeger wrote: Am 06.04.23 um 12:44 schrieb Peter Maydell: On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger wrote: Am 06.04.23 um 11:21 schrieb Peter Maydell: Christian, does our S390X machine get a guaranteed amount of CPU, or does it depend on what else is running on the hardware? I think its a shared system with shared CPUs. Can you check the steal time in top or proc? If this is far too high we could ask to give you more weight for that VM. It's idle at the moment and steal time seems to be low (0.0 .. 0.3); I'll try to remember to check next time it's running a job. Do you have /proc/stat ? Yes; hopefully it means more to you than it does to me :-) linux1@qemu01:~$ cat /proc/stat cpu 60904459 604975 15052194 1435958176 17128179 351949 758578 22218760 0 0 cpu0 15022535 146734 3786909 358774818 4283172 98313 237156 5894809 0 0 cpu1 15306890 151164 3746024 358968957 4378864 85629 172492 5434255 0 0 cpu2 15307709 157180 3762691 359141276 4138714 85736 176367 5474594 0 0 cpu3 15267324 149895 3756569 359073124 4327428 82269 172562 5415101 0 0 This is user,nice,system,idle,iowait,irq,softirq,steal,guest,guest_nice So overall there is some (20-30% since the last reboot) steal going on. Not sure if this is the real problem since it is only Ubuntu 20.04. Does a higher timeout make the problem go away? The thing is: it shouldn't take that long to build QEMU and run the tests here, theoretically. Some days ago, the job was finishing in 39 minutes: https://gitlab.com/qemu-project/qemu/-/jobs/3973481571 The recent run took 74 minutes: https://gitlab.com/qemu-project/qemu/-/jobs/4066136770 That's almost a factor of two! So there is definitely something strange going on. But this has 786 instead of 659 tests, no?
Re: s390 private runner CI job timing out
On Thu, 6 Apr 2023 at 13:30, Thomas Huth wrote: > The thing is: it shouldn't take that long to build QEMU and run the tests > here, theoretically. Some days ago, the job was finishing in 39 minutes: > > https://gitlab.com/qemu-project/qemu/-/jobs/3973481571 > > The recent run took 74 minutes: > > https://gitlab.com/qemu-project/qemu/-/jobs/4066136770 > > That's almost a factor of two! So there is definitely something strange > going on. So that 39 minute run was about 22 minutes compile, 17 minutes test. The 74 minute run was 45 minutes compile, 30 minutes test. The number of compile steps in meson was pretty much the same (10379 vs 10384) in each case. So the most plausible conclusion seems like "the VM mysteriously got slower by nearly a factor of 2", given that the slowdown seems to affect the compile and test stages about equally. The VM has been up for 44 days, so we can rule out "rebooted into a new kernel with some kind of perf bug". thanks -- PMM
Re: s390 private runner CI job timing out
On 06/04/2023 14.13, Christian Borntraeger wrote: Am 06.04.23 um 14:05 schrieb Peter Maydell: On Thu, 6 Apr 2023 at 12:17, Christian Borntraeger wrote: Am 06.04.23 um 12:44 schrieb Peter Maydell: On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger wrote: Am 06.04.23 um 11:21 schrieb Peter Maydell: Christian, does our S390X machine get a guaranteed amount of CPU, or does it depend on what else is running on the hardware? I think its a shared system with shared CPUs. Can you check the steal time in top or proc? If this is far too high we could ask to give you more weight for that VM. It's idle at the moment and steal time seems to be low (0.0 .. 0.3); I'll try to remember to check next time it's running a job. Do you have /proc/stat ? Yes; hopefully it means more to you than it does to me :-) linux1@qemu01:~$ cat /proc/stat cpu 60904459 604975 15052194 1435958176 17128179 351949 758578 22218760 0 0 cpu0 15022535 146734 3786909 358774818 4283172 98313 237156 5894809 0 0 cpu1 15306890 151164 3746024 358968957 4378864 85629 172492 5434255 0 0 cpu2 15307709 157180 3762691 359141276 4138714 85736 176367 5474594 0 0 cpu3 15267324 149895 3756569 359073124 4327428 82269 172562 5415101 0 0 This is user,nice,system,idle,iowait,irq,softirq,steal,guest,guest_nice So overall there is some (20-30% since the last reboot) steal going on. Not sure if this is the real problem since it is only Ubuntu 20.04. Does a higher timeout make the problem go away? The thing is: it shouldn't take that long to build QEMU and run the tests here, theoretically. Some days ago, the job was finishing in 39 minutes: https://gitlab.com/qemu-project/qemu/-/jobs/3973481571 The recent run took 74 minutes: https://gitlab.com/qemu-project/qemu/-/jobs/4066136770 That's almost a factor of two! So there is definitely something strange going on. Thomas
Re: s390 private runner CI job timing out
On Thu, 6 Apr 2023 at 13:13, Christian Borntraeger wrote: > > > > Am 06.04.23 um 14:05 schrieb Peter Maydell: > > On Thu, 6 Apr 2023 at 12:17, Christian Borntraeger > > wrote: > >> > >> Am 06.04.23 um 12:44 schrieb Peter Maydell: > >>> On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger > >>> wrote: > Am 06.04.23 um 11:21 schrieb Peter Maydell: > > Christian, does our S390X machine get a guaranteed amount of CPU, > > or does it depend on what else is running on the hardware? > > I think its a shared system with shared CPUs. Can you check the steal > time in top or proc? If this is far too high we could ask to give you > more weight for that VM. > >>> > >>> It's idle at the moment and steal time seems to be low (0.0 .. 0.3); > >>> I'll try to remember to check next time it's running a job. > >>> > >> > >> Do you have /proc/stat ? > > > > Yes; hopefully it means more to you than it does to me :-) > > > > linux1@qemu01:~$ cat /proc/stat > > cpu 60904459 604975 15052194 1435958176 17128179 351949 758578 22218760 0 0 > > cpu0 15022535 146734 3786909 358774818 4283172 98313 237156 5894809 0 0 > > cpu1 15306890 151164 3746024 358968957 4378864 85629 172492 5434255 0 0 > > cpu2 15307709 157180 3762691 359141276 4138714 85736 176367 5474594 0 0 > > cpu3 15267324 149895 3756569 359073124 4327428 82269 172562 5415101 0 0 > > This is > user,nice,system,idle,iowait,irq,softirq,steal,guest,guest_nice > So overall there is some (20-30% since the last reboot) steal going on. > Not sure if this is the real problem since it is only Ubuntu 20.04. > Does a higher timeout make the problem go away? I would expect that raising the timeout would improve this (it seems to be on the borderline where some jobs make it in in time and some don't), but on the other hand 75 mins is already a pretty high timeout. I would rather we looked at whether we could cut down the amount we are compiling/testing on this box instead. thanks -- PMM
Re: s390 private runner CI job timing out
Am 06.04.23 um 14:05 schrieb Peter Maydell: On Thu, 6 Apr 2023 at 12:17, Christian Borntraeger wrote: Am 06.04.23 um 12:44 schrieb Peter Maydell: On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger wrote: Am 06.04.23 um 11:21 schrieb Peter Maydell: Christian, does our S390X machine get a guaranteed amount of CPU, or does it depend on what else is running on the hardware? I think its a shared system with shared CPUs. Can you check the steal time in top or proc? If this is far too high we could ask to give you more weight for that VM. It's idle at the moment and steal time seems to be low (0.0 .. 0.3); I'll try to remember to check next time it's running a job. Do you have /proc/stat ? Yes; hopefully it means more to you than it does to me :-) linux1@qemu01:~$ cat /proc/stat cpu 60904459 604975 15052194 1435958176 17128179 351949 758578 22218760 0 0 cpu0 15022535 146734 3786909 358774818 4283172 98313 237156 5894809 0 0 cpu1 15306890 151164 3746024 358968957 4378864 85629 172492 5434255 0 0 cpu2 15307709 157180 3762691 359141276 4138714 85736 176367 5474594 0 0 cpu3 15267324 149895 3756569 359073124 4327428 82269 172562 5415101 0 0 This is user,nice,system,idle,iowait,irq,softirq,steal,guest,guest_nice So overall there is some (20-30% since the last reboot) steal going on. Not sure if this is the real problem since it is only Ubuntu 20.04. Does a higher timeout make the problem go away?
Re: s390 private runner CI job timing out
On Thu, 6 Apr 2023 at 12:17, Christian Borntraeger wrote: > > Am 06.04.23 um 12:44 schrieb Peter Maydell: > > On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger > > wrote: > >> Am 06.04.23 um 11:21 schrieb Peter Maydell: > >>> Christian, does our S390X machine get a guaranteed amount of CPU, > >>> or does it depend on what else is running on the hardware? > >> > >> I think its a shared system with shared CPUs. Can you check the steal > >> time in top or proc? If this is far too high we could ask to give you > >> more weight for that VM. > > > > It's idle at the moment and steal time seems to be low (0.0 .. 0.3); > > I'll try to remember to check next time it's running a job. > > > > Do you have /proc/stat ? Yes; hopefully it means more to you than it does to me :-) linux1@qemu01:~$ cat /proc/stat cpu 60904459 604975 15052194 1435958176 17128179 351949 758578 22218760 0 0 cpu0 15022535 146734 3786909 358774818 4283172 98313 237156 5894809 0 0 cpu1 15306890 151164 3746024 358968957 4378864 85629 172492 5434255 0 0 cpu2 15307709 157180 3762691 359141276 4138714 85736 176367 5474594 0 0 cpu3 15267324 149895 3756569 359073124 4327428 82269 172562 5415101 0 0 intr 3015682291 0 1 0 ctxt 9482595181 btime 1676900212 processes 18915201 procs_running 2 procs_blocked 0 softirq 15948680473 1369597456 327774195 41134 114090717 19748 0 465505237 76101069 4250 710644779 -- PMM
Re: s390 private runner CI job timing out
Am 06.04.23 um 12:44 schrieb Peter Maydell: On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger wrote: Am 06.04.23 um 11:21 schrieb Peter Maydell: Christian, does our S390X machine get a guaranteed amount of CPU, or does it depend on what else is running on the hardware? I think its a shared system with shared CPUs. Can you check the steal time in top or proc? If this is far too high we could ask to give you more weight for that VM. It's idle at the moment and steal time seems to be low (0.0 .. 0.3); I'll try to remember to check next time it's running a job. Do you have /proc/stat ?
Re: s390 private runner CI job timing out
On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger wrote: > Am 06.04.23 um 11:21 schrieb Peter Maydell: > > Christian, does our S390X machine get a guaranteed amount of CPU, > > or does it depend on what else is running on the hardware? > > I think its a shared system with shared CPUs. Can you check the steal > time in top or proc? If this is far too high we could ask to give you > more weight for that VM. It's idle at the moment and steal time seems to be low (0.0 .. 0.3); I'll try to remember to check next time it's running a job. thanks -- PMM
Re: s390 private runner CI job timing out
Am 06.04.23 um 11:21 schrieb Peter Maydell: On Thu, 6 Apr 2023 at 07:57, Thomas Huth wrote: On 05/04/2023 17.15, Peter Maydell wrote: The s390 private runner CI job ubuntu-20.04-s390x-all seems to have started timing out a lot recently. Here's an example where it passed, but with only 53 seconds left on the clock before it would have been killed: https://gitlab.com/qemu-project/qemu/-/jobs/4066136770 It looks like 'make check' was about 30 minutes of the 75 minutes total, and compilation was 45 minutes. Any suggestions for how we can trim this down? (Presumably we could also raise the time limit given that this is a private runner job...) I don't have access to that system, so I can only guess: Did you check whether there are any other processes still running (leftovers from earlier test runs)? I did check for that, as it's been a problem in the past, but no, in this case no other jobs are running in the VM. If not, it's maybe because it is a VM that is running with other VMs in parallel that hog the CPU? In that case, you could contact the owner of the machine and ask whether there is anything that could be done about it. Or simply increase the timeout a little bit more... (our highest timeout in another job is 90 minutes, so I think that would still be OK?). Christian, does our S390X machine get a guaranteed amount of CPU, or does it depend on what else is running on the hardware? I think its a shared system with shared CPUs. Can you check the steal time in top or proc? If this is far too high we could ask to give you more weight for that VM.
Re: s390 private runner CI job timing out
On Thu, 6 Apr 2023 at 07:57, Thomas Huth wrote: > > On 05/04/2023 17.15, Peter Maydell wrote: > > The s390 private runner CI job ubuntu-20.04-s390x-all seems to have > > started timing out a lot recently. Here's an example where it passed, > > but with only 53 seconds left on the clock before it would have been > > killed: > > > > https://gitlab.com/qemu-project/qemu/-/jobs/4066136770 > > > > It looks like 'make check' was about 30 minutes of the 75 minutes > > total, and compilation was 45 minutes. > > > > Any suggestions for how we can trim this down? (Presumably we > > could also raise the time limit given that this is a private > > runner job...) > > I don't have access to that system, so I can only guess: Did you check > whether there are any other processes still running (leftovers from earlier > test runs)? I did check for that, as it's been a problem in the past, but no, in this case no other jobs are running in the VM. > If not, it's maybe because it is a VM that is running with other > VMs in parallel that hog the CPU? In that case, you could contact the owner > of the machine and ask whether there is anything that could be done about > it. Or simply increase the timeout a little bit more... (our highest timeout > in another job is 90 minutes, so I think that would still be OK?). Christian, does our S390X machine get a guaranteed amount of CPU, or does it depend on what else is running on the hardware? thanks -- PMM
Re: s390 private runner CI job timing out
On 05/04/2023 17.15, Peter Maydell wrote: The s390 private runner CI job ubuntu-20.04-s390x-all seems to have started timing out a lot recently. Here's an example where it passed, but with only 53 seconds left on the clock before it would have been killed: https://gitlab.com/qemu-project/qemu/-/jobs/4066136770 It looks like 'make check' was about 30 minutes of the 75 minutes total, and compilation was 45 minutes. Any suggestions for how we can trim this down? (Presumably we could also raise the time limit given that this is a private runner job...) I don't have access to that system, so I can only guess: Did you check whether there are any other processes still running (leftovers from earlier test runs)? If not, it's maybe because it is a VM that is running with other VMs in parallel that hog the CPU? In that case, you could contact the owner of the machine and ask whether there is anything that could be done about it. Or simply increase the timeout a little bit more... (our highest timeout in another job is 90 minutes, so I think that would still be OK?). Thomas