YP AB Intermittent failures meeting
===================================
https://windriver.zoom.us/j/3696693975
Attendees: Richard, Trevor, Randy, Saul, AlexB
Summary:
========
--------------------------------------------------------------
People have been busy with other work so this is mostly a duplicate
of the previous minutes with some cryptic IRC chat logs added below.
--------------------------------------------------------------
Ptest results continue to improve but there's still room
for more improvement.
Alex made a graph of the number of AB INT issues per week:
https://bootlin.com/~alexandre/SWAT_stats.png
We assume that weeks 15 and 16 were when the RCU bug in the kernel
started being a problem and week 29 was when it got fixed but
more careful analysis is required.
The make/ninja load average limit is in but it's not clear
if it's effective yet and it breaks dunfell.
Trevor has a build of dunfell that, with some patches, appears to work.
If anyone wants to help, we could use more eyes on the logs,
particularly the summary logs and understanding the iostat numbers
when the dd test times out.
Plans for the week:
===================
Richard: M1
Alex: look into REST API for BZ as part of Triage.
Sakib: hook more responsive load average into latency test. (v3)
: Add PSI (/proc/pressure/*) when available
Trevor: No AB work
Saul: No AB work
Randy: PSI, simple experiments to learn what's 'normal',
make some PSI graphs!
Idea: bitbake/make/ninja use /proc/pressure to throttle builds?
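A minimal sketch of what the throttling idea could look like from a wrapper
script, assuming the "some avg10" value is the signal to watch. The function
name psi_over_limit, the default file (/proc/pressure/io) and the 50% limit
in the usage comment are all hypothetical; nothing like this exists in
bitbake, make or ninja today.

```shell
# psi_over_limit LIMIT [FILE]
# Succeeds (exit 0) when the "some avg10" percentage in FILE is above LIMIT.
# FILE defaults to /proc/pressure/io (assumption: IO is the resource to watch).
psi_over_limit() {
    awk -v limit="$1" '
        $1 == "some" {
            split($2, kv, "=")              # $2 looks like "avg10=65.22"
            over = (kv[2] + 0 > limit + 0)
        }
        END { exit (over ? 0 : 1) }
    ' "${2:-/proc/pressure/io}"
}

# Hypothetical usage: hold off starting more work while IO pressure is high.
#   while psi_over_limit 50; do sleep 5; done
#   bitbake core-image-minimal
```

A missing or unreadable file makes awk fail, which this sketch treats the same
as "not over the limit", so hosts without PSI are never throttled.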
../Randy
Meeting Notes:
==============
Cryptic notes this week since I don't have time for a proper summary!
[09:01] <vmeson> Agenda Items?
[09:01] <vmeson> 1. PSI /proc/pressure
see: https://www.kernel.org/doc/html/latest/accounting/psi.html
[09:01] <vmeson> 2. Status of AB
[09:02] <vmeson> 3. Randy promises valgrind patches (again!).
Discussion:
Randy explained what the PSI proc data was and
how overloaded ubuntu2004-ty-2.yocto.io was for most of a 25 hour
logging window. Typical high load data is:
pressure.cpu
some avg10=0.00 avg60=0.00 avg300=1.12 total=8891584119
some avg10=0.00 avg60=0.00 avg300=1.00 total=8891596517
some avg10=0.00 avg60=0.00 avg300=0.90 total=8891613258
some avg10=0.00 avg60=0.00 avg300=0.81 total=8891631326
pressure.memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=237120503
full avg10=0.00 avg60=0.00 avg300=0.00 total=194500170
some avg10=0.00 avg60=0.00 avg300=0.00 total=237121370
full avg10=0.00 avg60=0.00 avg300=0.00 total=194500170
pressure.io
some avg10=65.22 avg60=56.24 avg300=40.82 total=142092410628
full avg10=65.22 avg60=56.23 avg300=40.28 total=134090291475
some avg10=48.50 avg60=52.56 avg300=41.42 total=142106526703
full avg10=48.50 avg60=52.56 avg300=40.93 total=134104406782
This is just:
for i in pressure.cpu pressure.memory pressure.io; do \
echo $i; tail -4 $i; \
done
There are two lines per sample (some and full) for the io and memory
files and one line per sample for cpu:
$ wc -l pressure*
3000 pressure.cpu
6000 pressure.io
6000 pressure.memory
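For anyone who wants to gather similar logs, here is a rough sketch of a
collector. The function name psi_sample and its arguments are made up for
illustration, and the 30 second interval in the usage comment is only
inferred from 3000 samples in a ~25 hour window; it is not the script that
produced the files above.

```shell
# psi_sample [OUTDIR] [PSIDIR]
# Append the current contents of each PSI file to OUTDIR/pressure.<name>.
# PSIDIR defaults to /proc/pressure and is only a parameter to ease testing.
psi_sample() {
    outdir=${1:-.}
    psidir=${2:-/proc/pressure}
    for res in cpu memory io; do
        # Skip quietly on kernels without PSI support for this resource.
        [ -r "$psidir/$res" ] || continue
        cat "$psidir/$res" >> "$outdir/pressure.$res"
    done
}

# Hypothetical usage (runs until interrupted):
#   while :; do psi_sample /tmp/psi-logs; sleep 30; done
```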
Looking at the peak IO pressure, using the 'full' lines (all non-idle
tasks stalled at once):
$ grep full pressure.io | cat -n > pressure.io-full.numbered
$ sed -e 's/=/ /g' pressure.io-full.numbered | sort -k 4 -n | tail -2
1710 full avg10 97.52 avg60 93.85 avg300 92.29 total 124322928244
1699 full avg10 97.57 avg60 93.16 avg300 91.36 total 124015261496
# make a link so we have *full.numbered files for each subsystem:
$ ln -s pressure.cpu.numbered pressure.cpu-full.numbered
$ for i in pressure.cpu pressure.memory pressure.io; do \
echo $i; grep -E ' 1710| 1699' $i-full.numbered; \
done
pressure.cpu
1699 some avg10=0.00 avg60=0.00 avg300=0.00 total=7340981927
1710 some avg10=0.00 avg60=0.00 avg300=0.00 total=7341522636
pressure.memory
1699 full avg10=0.00 avg60=0.00 avg300=0.00 total=131800420
1710 full avg10=0.00 avg60=0.00 avg300=0.00 total=131800502
pressure.io
1699 full avg10=97.57 avg60=93.16 avg300=91.36 total=124015261496
1710 full avg10=97.52 avg60=93.85 avg300=92.29 total=124322928244
Conclusion: no CPU or memory contention, lots of IO -- clean-up or ???
[09:22] <vmeson> read-only YP BZ database query to find "AB-INT" build
failures, find build and highlight in "non-release" AB build summary.
[09:23] <vmeson> add /proc/pressure info to test results logs.
Sakib and I may do this, else Alex in January
[09:24] <vmeson> network access: weird AB bugs: DNS or connection
failure, at times even qemu net connection fails!
[09:26] <vmeson> systemd-networkd at times monitors and maybe even
changes (if up/down?) tap interfaces
This seems to be hard-coded, mandatory behaviour in systemd managed
distros.
[09:26] <vmeson> Test system, start / stop qemu instances in a loop.
[09:29] <vmeson> say: 16 qemus booting core-image-minimal and doing a
simple test such as a network connection (download curl), along with
stress --mumble-args to load the host system to simulate a build.
[09:30] <abelloni> vmeson:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14467
[09:32] <vmeson> use rest API to query the bugzilla -- where's the API?
[09:33] <abelloni> https://wiki.mozilla.org/Bugzilla:REST_API
[09:33] <abelloni> I'll have a look at that
[09:37] <RP> "BzAPI is an alternate, deprecated REST API" - that will
be ours as we're not on bz5 yet
Notes from previous meeting which are largely still relevant.
1. job server
- ninja could be patched with make's more responsive algorithm
next or is this good enough?
Aug 26:
Randy made some graphs that show that -l NUM results
in the number of compile jobs oscillating *wildly* between 0 and 200
on a 192 core builder compiling chromium. What I did was:
$ bitbake -c cleansstate chromium-x11
$ bitbake -c configure chromium-x11
$ bitbake -c compile chromium-x11
and while that compile was running:
$ while [ ! -f /tmp/compiling-chromium-is-done ]; do \
cat /proc/loadavg >> procs-load.log ; sleep 0.5 ;
done
Results so far:
https://postimg.cc/gallery/3hjfYfG/f8f46c97
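A quick way to put numbers on the oscillation without graphing is to
summarise the fourth field of each /proc/loadavg sample, which is
"runnable/total" scheduling entities. This is only a sketch; the function
name loadavg_summary is hypothetical and it was not used to produce the
graphs above.

```shell
# loadavg_summary FILE
# FILE holds one /proc/loadavg sample per line, e.g. procs-load.log.
# Prints the sample count and the min, max and mean of the runnable count.
loadavg_summary() {
    awk '
        {
            split($4, rt, "/")          # $4 looks like "57/2345"
            r = rt[1] + 0
            if (NR == 1 || r < min) min = r
            if (NR == 1 || r > max) max = r
            sum += r
        }
        END {
            if (NR)
                printf "samples=%d min=%d max=%d mean=%.1f\n", NR, min, max, sum / NR
        }
    ' "$1"
}

# Hypothetical usage:
#   loadavg_summary procs-load.log
```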
Next step is either:
a. collect data as above for an image build and see if the sub-optimal
ninja behaviour makes a difference
and/or
b. patch ninja with make's more responsive load avg
algorithm:
https://git.savannah.gnu.org/cgit/make.git/commit/?id=d8728efc8
- Richard suggested that we extract make's code for measuring the load
average to a separate binary and run it in the periodic io latency
test. Also can we translate it to python?
- Trevor is working on this and had some problems so next week.
(Aug 19 - Trevor is back from vacation so maybe next week.)
- Trevor to see if the load average change really did reduce load
on WR build systems. (Aug 19)
2. AB status
Trevor is learning about buildbot and working on a scheduling bug
(CentOS worker?)
bitbake layer setup tool should allow multiple backends:
eg: kas, a y-a-helper.
ptest cases are improving, we may be close to done!
Let's wait a week to see how things go.
(July 29, Aug 5, Aug 19, we're not done...)
- lttng-tools ptest is failing. RP is working on it with upstream.
The timeout increase (done on Aug 5) hasn't helped.
3. Sakib's improvements to the logging are merged.
Sakib generated a summary of all high latency 'top' logs from
~July 23->July 29 by just running his summary script on the
merged raw top logs.
More analysis required....
Still relevant parts of
Previous Meeting Notes:
=======================
4. bitbake server timeout (no change July 29, Aug 19, Oct 7)
"Timeout while waiting for a reply from the bitbake server (60s)"
5. io stalls (no update: July 29, Oct 7)
Richard said that it would make sense to write an ftrace utility
/ script to monitor io latency and we could install it with sudo
Ch^W mentioned ftrace on IRC.
Sakib and Randy will work on that but not for a week or two
or longer! (Aug 19).
Randy collected iostat data on 3 build servers:
https://postimg.cc/gallery/8cN6LYB
We agreed that having -ty-2 be at ~100% utilization for many hours
in a row is not acceptable and that a threshold of ~10 minutes
at 100% utilization may be a reasonable limit. I need to figure out
if I can get data on the fraction of IO done per IO class since
we do use ionice to do clean-up and other activities.
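A sketch of how the ~10 minute limit could be checked against logged data.
It assumes the %util values have already been extracted from the iostat
output into a file with one value per line, taken at a fixed interval. The
function name longest_busy_run, the 99% threshold and the 60 second
interval are all assumptions for illustration.

```shell
# longest_busy_run FILE [THRESHOLD] [INTERVAL]
# FILE: one %util value per line, sampled every INTERVAL seconds.
# Prints the length in seconds of the longest run of consecutive samples
# at or above THRESHOLD percent (0 if there is none).
longest_busy_run() {
    awk -v thr="${2:-99}" -v step="${3:-60}" '
        $1 + 0 >= thr + 0 { run++; if (run > best) best = run; next }
        { run = 0 }
        END { print best * step }
    ' "$1"
}

# Hypothetical usage: flag a disk that was saturated for more than 10 minutes.
#   [ "$(longest_busy_run util.log 99 60)" -gt 600 ] && echo "over the limit"
```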
../Randy