Re: s390 private runner CI job timing out

2023-04-06 Thread Christian Borntraeger




Am 06.04.23 um 14:39 schrieb Peter Maydell:

On Thu, 6 Apr 2023 at 13:30, Thomas Huth  wrote:

The thing is: it shouldn't take that long to build QEMU and run the tests
here, theoretically. Some days ago, the job was finishing in 39 minutes:

   https://gitlab.com/qemu-project/qemu/-/jobs/3973481571

The recent run took 74 minutes:

   https://gitlab.com/qemu-project/qemu/-/jobs/4066136770

That's almost a factor of two! So there is definitely something strange
going on.


So that 39 minute run was about 22 minutes compile, 17 minutes test.
The 74 minute run was 45 minutes compile, 30 minutes test.
The number of compile steps in meson was pretty much the same
(10379 vs 10384) in each case. So the most plausible conclusion
seems like "the VM mysteriously got slower by nearly a factor of 2",
given that the slowdown seems to affect the compile and test
stages about equally.

The VM has been up for 44 days, so we can rule out "rebooted into
a new kernel with some kind of perf bug".


Looks like the team is looking into some storage delays at the moment.
So it would be good to get some numbers for io wait and steal on the
next run to see if this is related to storage or CPU consumption.
Maybe collect /proc/stat before and after a run. Or login and use top.



Re: s390 private runner CI job timing out

2023-04-06 Thread Christian Borntraeger




Am 06.04.23 um 14:39 schrieb Peter Maydell:

On Thu, 6 Apr 2023 at 13:30, Thomas Huth  wrote:

The thing is: it shouldn't take that long to build QEMU and run the tests
here, theoretically. Some days ago, the job was finishing in 39 minutes:

   https://gitlab.com/qemu-project/qemu/-/jobs/3973481571

The recent run took 74 minutes:

   https://gitlab.com/qemu-project/qemu/-/jobs/4066136770

That's almost a factor of two! So there is definitely something strange
going on.


So that 39 minute run was about 22 minutes compile, 17 minutes test.
The 74 minute run was 45 minutes compile, 30 minutes test.
The number of compile steps in meson was pretty much the same
(10379 vs 10384) in each case. So the most plausible conclusion
seems like "the VM mysteriously got slower by nearly a factor of 2",
given that the slowdown seems to affect the compile and test
stages about equally.

The VM has been up for 44 days, so we can rule out "rebooted into
a new kernel with some kind of perf bug".


I will ask around if we can get a higher share of the system.



Re: s390 private runner CI job timing out

2023-04-06 Thread Christian Borntraeger




Am 06.04.23 um 14:30 schrieb Thomas Huth:

On 06/04/2023 14.13, Christian Borntraeger wrote:



Am 06.04.23 um 14:05 schrieb Peter Maydell:

On Thu, 6 Apr 2023 at 12:17, Christian Borntraeger
 wrote:


Am 06.04.23 um 12:44 schrieb Peter Maydell:

On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger
 wrote:

Am 06.04.23 um 11:21 schrieb Peter Maydell:

Christian, does our S390X machine get a guaranteed amount of CPU,
or does it depend on what else is running on the hardware?


I think its a shared system with shared CPUs. Can you check the steal
time in top or proc? If this is far too high we could ask to give you
more weight for that VM.


It's idle at the moment and steal time seems to be low (0.0 .. 0.3);
I'll try to remember to check next time it's running a job.



Do you have /proc/stat ?


Yes; hopefully it means more to you than it does to me :-)

linux1@qemu01:~$ cat /proc/stat
cpu  60904459 604975 15052194 1435958176 17128179 351949 758578 22218760 0 0
cpu0 15022535 146734 3786909 358774818 4283172 98313 237156 5894809 0 0
cpu1 15306890 151164 3746024 358968957 4378864 85629 172492 5434255 0 0
cpu2 15307709 157180 3762691 359141276 4138714 85736 176367 5474594 0 0
cpu3 15267324 149895 3756569 359073124 4327428 82269 172562 5415101 0 0


This is
user,nice,system,idle,iowait,irq,softirq,steal,guest,guest_nice
So overall there is some (20-30% since the last reboot) steal going on.
Not sure if this is the real problem since it is only Ubuntu 20.04.
Does a higher timeout make the problem go away?


The thing is: it shouldn't take that long to build QEMU and run the tests here, 
theoretically. Some days ago, the job was finishing in 39 minutes:

  https://gitlab.com/qemu-project/qemu/-/jobs/3973481571

The recent run took 74 minutes:

  https://gitlab.com/qemu-project/qemu/-/jobs/4066136770

That's almost a factor of two! So there is definitely something strange going 
on.


But this has 786 instead of 659 tests, no?



Re: s390 private runner CI job timing out

2023-04-06 Thread Peter Maydell
On Thu, 6 Apr 2023 at 13:30, Thomas Huth  wrote:
> The thing is: it shouldn't take that long to build QEMU and run the tests
> here, theoretically. Some days ago, the job was finishing in 39 minutes:
>
>   https://gitlab.com/qemu-project/qemu/-/jobs/3973481571
>
> The recent run took 74 minutes:
>
>   https://gitlab.com/qemu-project/qemu/-/jobs/4066136770
>
> That's almost a factor of two! So there is definitely something strange
> going on.

So that 39 minute run was about 22 minutes compile, 17 minutes test.
The 74 minute run was 45 minutes compile, 30 minutes test.
The number of compile steps in meson was pretty much the same
(10379 vs 10384) in each case. So the most plausible conclusion
seems like "the VM mysteriously got slower by nearly a factor of 2",
given that the slowdown seems to affect the compile and test
stages about equally.

The VM has been up for 44 days, so we can rule out "rebooted into
a new kernel with some kind of perf bug".

thanks
-- PMM



Re: s390 private runner CI job timing out

2023-04-06 Thread Thomas Huth

On 06/04/2023 14.13, Christian Borntraeger wrote:



Am 06.04.23 um 14:05 schrieb Peter Maydell:

On Thu, 6 Apr 2023 at 12:17, Christian Borntraeger
 wrote:


Am 06.04.23 um 12:44 schrieb Peter Maydell:

On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger
 wrote:

Am 06.04.23 um 11:21 schrieb Peter Maydell:

Christian, does our S390X machine get a guaranteed amount of CPU,
or does it depend on what else is running on the hardware?


I think its a shared system with shared CPUs. Can you check the steal
time in top or proc? If this is far too high we could ask to give you
more weight for that VM.


It's idle at the moment and steal time seems to be low (0.0 .. 0.3);
I'll try to remember to check next time it's running a job.



Do you have /proc/stat ?


Yes; hopefully it means more to you than it does to me :-)

linux1@qemu01:~$ cat /proc/stat
cpu  60904459 604975 15052194 1435958176 17128179 351949 758578 22218760 0 0
cpu0 15022535 146734 3786909 358774818 4283172 98313 237156 5894809 0 0
cpu1 15306890 151164 3746024 358968957 4378864 85629 172492 5434255 0 0
cpu2 15307709 157180 3762691 359141276 4138714 85736 176367 5474594 0 0
cpu3 15267324 149895 3756569 359073124 4327428 82269 172562 5415101 0 0


This is
user,nice,system,idle,iowait,irq,softirq,steal,guest,guest_nice
So overall there is some (20-30% since the last reboot) steal going on.
Not sure if this is the real problem since it is only Ubuntu 20.04.
Does a higher timeout make the problem go away?


The thing is: it shouldn't take that long to build QEMU and run the tests 
here, theoretically. Some days ago, the job was finishing in 39 minutes:


 https://gitlab.com/qemu-project/qemu/-/jobs/3973481571

The recent run took 74 minutes:

 https://gitlab.com/qemu-project/qemu/-/jobs/4066136770

That's almost a factor of two! So there is definitely something strange 
going on.


 Thomas




Re: s390 private runner CI job timing out

2023-04-06 Thread Peter Maydell
On Thu, 6 Apr 2023 at 13:13, Christian Borntraeger
 wrote:
>
>
>
> Am 06.04.23 um 14:05 schrieb Peter Maydell:
> > On Thu, 6 Apr 2023 at 12:17, Christian Borntraeger
> >  wrote:
> >>
> >> Am 06.04.23 um 12:44 schrieb Peter Maydell:
> >>> On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger
> >>>  wrote:
>  Am 06.04.23 um 11:21 schrieb Peter Maydell:
> > Christian, does our S390X machine get a guaranteed amount of CPU,
> > or does it depend on what else is running on the hardware?
> 
>  I think its a shared system with shared CPUs. Can you check the steal
>  time in top or proc? If this is far too high we could ask to give you
>  more weight for that VM.
> >>>
> >>> It's idle at the moment and steal time seems to be low (0.0 .. 0.3);
> >>> I'll try to remember to check next time it's running a job.
> >>>
> >>
> >> Do you have /proc/stat ?
> >
> > Yes; hopefully it means more to you than it does to me :-)
> >
> > linux1@qemu01:~$ cat /proc/stat
> > cpu  60904459 604975 15052194 1435958176 17128179 351949 758578 22218760 0 0
> > cpu0 15022535 146734 3786909 358774818 4283172 98313 237156 5894809 0 0
> > cpu1 15306890 151164 3746024 358968957 4378864 85629 172492 5434255 0 0
> > cpu2 15307709 157180 3762691 359141276 4138714 85736 176367 5474594 0 0
> > cpu3 15267324 149895 3756569 359073124 4327428 82269 172562 5415101 0 0
>
> This is
> user,nice,system,idle,iowait,irq,softirq,steal,guest,guest_nice
> So overall there is some (20-30% since the last reboot) steal going on.
> Not sure if this is the real problem since it is only Ubuntu 20.04.
> Does a higher timeout make the problem go away?

I would expect that raising the timeout would improve this (it seems
to be on the borderline where some jobs make it in in time and some don't),
but on the other hand 75 mins is already a pretty high timeout.
I would rather we looked at whether we could cut down the amount
we are compiling/testing on this box instead.

thanks
-- PMM



Re: s390 private runner CI job timing out

2023-04-06 Thread Christian Borntraeger




Am 06.04.23 um 14:05 schrieb Peter Maydell:

On Thu, 6 Apr 2023 at 12:17, Christian Borntraeger
 wrote:


Am 06.04.23 um 12:44 schrieb Peter Maydell:

On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger
 wrote:

Am 06.04.23 um 11:21 schrieb Peter Maydell:

Christian, does our S390X machine get a guaranteed amount of CPU,
or does it depend on what else is running on the hardware?


I think its a shared system with shared CPUs. Can you check the steal
time in top or proc? If this is far too high we could ask to give you
more weight for that VM.


It's idle at the moment and steal time seems to be low (0.0 .. 0.3);
I'll try to remember to check next time it's running a job.



Do you have /proc/stat ?


Yes; hopefully it means more to you than it does to me :-)

linux1@qemu01:~$ cat /proc/stat
cpu  60904459 604975 15052194 1435958176 17128179 351949 758578 22218760 0 0
cpu0 15022535 146734 3786909 358774818 4283172 98313 237156 5894809 0 0
cpu1 15306890 151164 3746024 358968957 4378864 85629 172492 5434255 0 0
cpu2 15307709 157180 3762691 359141276 4138714 85736 176367 5474594 0 0
cpu3 15267324 149895 3756569 359073124 4327428 82269 172562 5415101 0 0


This is
user,nice,system,idle,iowait,irq,softirq,steal,guest,guest_nice
So overall there is some (20-30% since the last reboot) steal going on.
Not sure if this is the real problem since it is only Ubuntu 20.04.
Does a higher timeout make the problem go away?



Re: s390 private runner CI job timing out

2023-04-06 Thread Peter Maydell
On Thu, 6 Apr 2023 at 12:17, Christian Borntraeger
 wrote:
>
> Am 06.04.23 um 12:44 schrieb Peter Maydell:
> > On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger
> >  wrote:
> >> Am 06.04.23 um 11:21 schrieb Peter Maydell:
> >>> Christian, does our S390X machine get a guaranteed amount of CPU,
> >>> or does it depend on what else is running on the hardware?
> >>
> >> I think its a shared system with shared CPUs. Can you check the steal
> >> time in top or proc? If this is far too high we could ask to give you
> >> more weight for that VM.
> >
> > It's idle at the moment and steal time seems to be low (0.0 .. 0.3);
> > I'll try to remember to check next time it's running a job.
> >
>
> Do you have /proc/stat ?

Yes; hopefully it means more to you than it does to me :-)

linux1@qemu01:~$ cat /proc/stat
cpu  60904459 604975 15052194 1435958176 17128179 351949 758578 22218760 0 0
cpu0 15022535 146734 3786909 358774818 4283172 98313 237156 5894809 0 0
cpu1 15306890 151164 3746024 358968957 4378864 85629 172492 5434255 0 0
cpu2 15307709 157180 3762691 359141276 4138714 85736 176367 5474594 0 0
cpu3 15267324 149895 3756569 359073124 4327428 82269 172562 5415101 0 0
intr 3015682291 0 1 0
ctxt 9482595181
btime 1676900212
processes 18915201
procs_running 2
procs_blocked 0
softirq 15948680473 1369597456 327774195 41134 114090717 19748 0
465505237 76101069 4250 710644779

-- PMM



Re: s390 private runner CI job timing out

2023-04-06 Thread Christian Borntraeger

Am 06.04.23 um 12:44 schrieb Peter Maydell:

On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger
 wrote:

Am 06.04.23 um 11:21 schrieb Peter Maydell:

Christian, does our S390X machine get a guaranteed amount of CPU,
or does it depend on what else is running on the hardware?


I think its a shared system with shared CPUs. Can you check the steal
time in top or proc? If this is far too high we could ask to give you
more weight for that VM.


It's idle at the moment and steal time seems to be low (0.0 .. 0.3);
I'll try to remember to check next time it's running a job.



Do you have /proc/stat ?



Re: s390 private runner CI job timing out

2023-04-06 Thread Peter Maydell
On Thu, 6 Apr 2023 at 11:40, Christian Borntraeger
 wrote:
> Am 06.04.23 um 11:21 schrieb Peter Maydell:
> > Christian, does our S390X machine get a guaranteed amount of CPU,
> > or does it depend on what else is running on the hardware?
>
> I think its a shared system with shared CPUs. Can you check the steal
> time in top or proc? If this is far too high we could ask to give you
> more weight for that VM.

It's idle at the moment and steal time seems to be low (0.0 .. 0.3);
I'll try to remember to check next time it's running a job.

thanks
-- PMM



Re: s390 private runner CI job timing out

2023-04-06 Thread Christian Borntraeger




Am 06.04.23 um 11:21 schrieb Peter Maydell:

On Thu, 6 Apr 2023 at 07:57, Thomas Huth  wrote:


On 05/04/2023 17.15, Peter Maydell wrote:

The s390 private runner CI job ubuntu-20.04-s390x-all seems to have
started timing out a lot recently. Here's an example where it passed,
but with only 53 seconds left on the clock before it would have been
killed:

https://gitlab.com/qemu-project/qemu/-/jobs/4066136770

It looks like 'make check' was about 30 minutes of the 75 minutes
total, and compilation was 45 minutes.

Any suggestions for how we can trim this down? (Presumably we
could also raise the time limit given that this is a private
runner job...)


I don't have access to that system, so I can only guess: Did you check
whether there are any other processes still running (leftovers from earlier
test runs)?


I did check for that, as it's been a problem in the past, but
no, in this case no other jobs are running in the VM.


If not, it's maybe because it is a VM that is running with other
VMs in parallel that hog the CPU? In that case, you could contact the owner
of the machine and ask whether there  is anything that could be done about
it. Or simply increase the timeout a little bit more... (our highest timeout
in another job is 90 minutes, so I think that would still be OK?).


Christian, does our S390X machine get a guaranteed amount of CPU,
or does it depend on what else is running on the hardware?


I think its a shared system with shared CPUs. Can you check the steal
time in top or proc? If this is far too high we could ask to give you
more weight for that VM.



Re: s390 private runner CI job timing out

2023-04-06 Thread Peter Maydell
On Thu, 6 Apr 2023 at 07:57, Thomas Huth  wrote:
>
> On 05/04/2023 17.15, Peter Maydell wrote:
> > The s390 private runner CI job ubuntu-20.04-s390x-all seems to have
> > started timing out a lot recently. Here's an example where it passed,
> > but with only 53 seconds left on the clock before it would have been
> > killed:
> >
> > https://gitlab.com/qemu-project/qemu/-/jobs/4066136770
> >
> > It looks like 'make check' was about 30 minutes of the 75 minutes
> > total, and compilation was 45 minutes.
> >
> > Any suggestions for how we can trim this down? (Presumably we
> > could also raise the time limit given that this is a private
> > runner job...)
>
> I don't have access to that system, so I can only guess: Did you check
> whether there are any other processes still running (leftovers from earlier
> test runs)?

I did check for that, as it's been a problem in the past, but
no, in this case no other jobs are running in the VM.

> If not, it's maybe because it is a VM that is running with other
> VMs in parallel that hog the CPU? In that case, you could contact the owner
> of the machine and ask whether there  is anything that could be done about
> it. Or simply increase the timeout a little bit more... (our highest timeout
> in another job is 90 minutes, so I think that would still be OK?).

Christian, does our S390X machine get a guaranteed amount of CPU,
or does it depend on what else is running on the hardware?

thanks
-- PMM



Re: s390 private runner CI job timing out

2023-04-06 Thread Thomas Huth

On 05/04/2023 17.15, Peter Maydell wrote:

The s390 private runner CI job ubuntu-20.04-s390x-all seems to have
started timing out a lot recently. Here's an example where it passed,
but with only 53 seconds left on the clock before it would have been
killed:

https://gitlab.com/qemu-project/qemu/-/jobs/4066136770

It looks like 'make check' was about 30 minutes of the 75 minutes
total, and compilation was 45 minutes.

Any suggestions for how we can trim this down? (Presumably we
could also raise the time limit given that this is a private
runner job...)


I don't have access to that system, so I can only guess: Did you check 
whether there are any other processes still running (leftovers from earlier 
test runs)? If not, it's maybe because it is a VM that is running with other 
VMs in parallel that hog the CPU? In that case, you could contact the owner 
of the machine and ask whether there  is anything that could be done about 
it. Or simply increase the timeout a little bit more... (our highest timeout 
in another job is 90 minutes, so I think that would still be OK?).


 Thomas