Jan asked me to investigate why we weren't getting a push on Xen 4.8.
I have investigated the failures in the osstest report quoted below.
I think we have a combination of:

 * Flaky armhf hardware.

 * A real Xen-related heisenbug, which is however not a regression:
   the EFAULT libxc/linux bug (CC to Juergen).

 * Mystery failures to make progress during local computation and
   I/O, which look like Linux kernel bugs.

 * Incompatibility between Xen 4.8 and osstest's approach to UEFI
   booting, now fixed.

 * A mystery libvirt heisenbug (hence the CC to Jim).

Jan: I would be inclined to force push this.  OTOH, if we wait,
eventually osstest will collect a set of flights which osstest's
archaeologist can see justifies a push.

Jim: please read down to where I discuss test-amd64-amd64-libvirt-pair.
If you have any insight I'd appreciate it.  Let me know if you want
me to preserve the logs, which will otherwise expire in a few weeks.

Juergen: this is just FYI.  HTH.

osstest service owner writes ("[xen-4.8-testing test] 124100: regressions - FAIL"):
> flight 124100 xen-4.8-testing real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/124100/
>
> Regressions :-(
>
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  build-armhf-xsm              <job status>       broken  in 124070
>  build-armhf-xsm            5 host-build-prep    fail in 124070 REGR. vs. 123091
>  build-armhf                6 xen-build          fail in 123844 REGR. vs. 123091

I haven't looked, but I think these are the arndale bug.  (osstest
looked at 124070 and 123844 because it was trying to justify other
failures as heisenbugs.)

>  build-i386-pvops           6 kernel-build       fail in 123844 REGR. vs. 123091

This is "git checkout 57a3ca7835962109d94533465a75e8c716b26845"
taking more than 4000 seconds (!) on albana1.

I have looked at the host logs for albana1 and there seem to be no
other build failures, except for libvirt ones (which is expected,
because there is a race in the libvirt makefiles).

I looked at the logs for this particular failure.  osstest collected
ps output, which shows this:

 14583 ?        D    00:00:03     0     3  0.6  1.7     1 balance_dirty_pages_ratel  \_ git checkout 57a3ca7835962109d94533465a75e8c716b26845

There is nothing unexpected or interesting in any of the logfiles.
Note that this host was not running Xen.  The kernel was the default
Debian jessie i386 kernel.

I have no real explanation.  This seems like it must be a bug in the
kernel.
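(For future failures like this, it might be worth the log capture
also recording where a stuck process is blocked in the kernel.  A
rough sketch of what I mean, not something osstest runs today:

  # list tasks in uninterruptible sleep, with their kernel wait channel
  ps -eo pid,stat,wchan:32,etime,args | awk '$2 ~ /^D/'

  # dump kernel stacks of all blocked tasks into the kernel log
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 100

That would tell us whether git really did spend the whole 4000
seconds sitting in balance_dirty_pages_ratelimited().)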
> Tests which are failing intermittently (not blocking):
>  test-amd64-amd64-xl-credit2                    7 xen-boot           fail in 123701 pass in 124100
>  test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm  7 xen-boot           fail in 123701 pass in 124100
>  test-amd64-amd64-livepatch                     7 xen-boot           fail in 123701 pass in 124100
>  test-amd64-amd64-pair                         10 xen-boot/src_host  fail in 123701 pass in 124100
>  test-amd64-amd64-pair                         11 xen-boot/dst_host  fail in 123701 pass in 124100
>  test-amd64-i386-qemuu-rhel6hvm-intel           7 xen-boot           fail in 123701 pass in 124100
>  test-amd64-i386-rumprun-i386                   7 xen-boot           fail in 123701 pass in 124100
>  test-amd64-i386-xl-qemuu-debianhvm-amd64       7 xen-boot           fail in 123701 pass in 124100
>  test-amd64-i386-qemut-rhel6hvm-intel           7 xen-boot           fail in 123701 pass in 124100
>  test-amd64-i386-libvirt-xsm                    7 xen-boot           fail in 123701 pass in 124100
>  test-amd64-i386-migrupgrade                   10 xen-boot/src_host  fail in 123701 pass in 124100
>  test-amd64-i386-migrupgrade                   11 xen-boot/dst_host  fail in 123701 pass in 124100
>  test-amd64-amd64-xl-multivcpu                  7 xen-boot           fail in 123701 pass in 124100
>  test-amd64-amd64-xl-qemuu-ovmf-amd64           7 xen-boot           fail in 123701 pass in 124100
>  test-xtf-amd64-amd64-3                         7 xen-boot           fail in 123844 pass in 124100

I haven't looked at all of these, but I have looked at a few,
including the xtf test in 123844.

The jobs I looked at ran on one of the albanas (the new UEFI hosts).
These flights were after albana* were put into service but before I
taught osstest to avoid trying to boot xen.gz from 4.9 and earlier
on UEFI hosts (by not running tests of those Xen versions on such
hosts at all).

So I think these failures are all understood and expected.  osstest
is fixed now, so they will not occur in new runs.  osstest is trying
to justify them as heisenbugs, by observing that they passed in
124100.  The wide range of affected tests means that osstest ends up
looking for a lot of other passes to justify these, and I think that
is a big part of the reason why the push is taking so long.

>  test-amd64-amd64-libvirt-pair 22 guest-migrate/src_host/dst_host  fail  pass in 123701

From the log:

  2018-06-12 20:59:40 Z executing ssh ... root@172.16.144.61 virsh migrate --live debian.guest.osstest xen+ssh://joubertin0
  error: Timed out during operation: cannot acquire state change lock
  2018-06-12 21:00:16 Z command nonzero waitstatus 256: [..]

The libvirt libxl logs seem to show libxl doing a successful
migration.  Looking at the logs I see this:

  2018-06-12 21:00:16.784+0000: 3507: warning : libxlDomainObjBeginJob:151 : Cannot start job (modify) for domain debian.guest.osstest; current job is (modify) owned by (24947)

That job number looks about right for a pid, but I think it must be
a thread because it doesn't show up in the ps output.

I did see this:

  Jun 12 21:00:20 joubertin0 logger: /etc/xen/scripts/vif-bridge: iptables setup failed. This may affect guest networking.

but that seems to be after the failure.

I don't have an explanation.  I don't really know what this lock is.
I suspect a bug in libvirt/libxl's locking arrangements, or some
lower layer failure which is being obscured by the overly deep stack
of logging arrangements.

Whatever this bug is, it is a low-probability heisenbug.

>  test-amd64-i386-libvirt-pair 22 guest-migrate/src_host/dst_host  fail  pass in 124070

This has the same proximate symptom as the failure I discuss just
above.  I haven't investigated it in detail.
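(In case it is useful for chasing this: the failing step boils down
to the virsh invocation below, taken from the log.  If anyone catches
this in the act, I would start by asking libvirt whether it still
thinks a job is active on the domain, and by grepping libvirtd's own
log for the BeginJob warning.  This is only a sketch; domjobinfo may
or may not be wired up in the libxl driver, and the log path depends
on how libvirtd's logging is configured on these hosts:

  # what the harness effectively runs on the source host
  virsh migrate --live debian.guest.osstest xen+ssh://joubertin0

  # does libvirt think a job is still running against the domain?
  virsh domjobinfo debian.guest.osstest

  # the warning we saw, in libvirtd's log
  grep -n 'Cannot start job' /var/log/libvirt/libvirtd.log
)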
>  test-amd64-amd64-xl-rtds     10 debian-install   fail  pass in 123844

Here, "xen-create-image" (which calls debootstrap) took longer than
2500 seconds.  There is nothing in the Xen serial console log.

I have checked the apt and squid proxy cache logs, and between
2018-06-12 22:28:10 Z, when it starts xen-create-image, and 23:09:50,
when it times it out, there are only these entries (apt-cacher;
nothing for squid):

  1528842567|I|325|172.16.144.61|debrep/dists/jessie/Release
  1528842567|O|148696|172.16.144.61|debrep/dists/jessie/Release
  1528842568|I|316|172.16.144.61|debrep/dists/jessie/Release.gpg
  1528842568|O|2726|172.16.144.61|debrep/dists/jessie/Release.gpg
  1528842620|I|363|172.16.144.61|debrep/dists/jessie/main/binary-amd64/Packages.gz
  1528842620|O|9064332|172.16.144.61|debrep/dists/jessie/main/binary-amd64/Packages.gz

The last thing in the xen-tools log is:

  I: Resolving dependencies of base packages...

ps output shows a perl rune run by debootstrap, piped into sort and
uniq:

  8640 ?        R    00:01:29     0    10  101  0.9    -  -           \_ perl -e ...

where the rune (reflowing the whitespace, which ps mangles) is:

    $prevpkg = "";
    @d = ();
    while (<STDIN>) {
        chomp;
        if (/^Package: (.*)$/) {
            $pkg = $1;
            if ($pkg ne $prevpkg) {
                for my $d (@d) {
                    print "$d\n";
                }
            }
            $prevpkg = $1;
            @d = ();
        }
        $in = 1 if (grep {$_ eq $pkg} @ARGV);
        $in = 0 if (/^$/);
        if ($in and (/^Depends: (.*)$/ or /^Pre-Depends: (.*)$/)) {
            for $d (split /\s*,\s*/, $1) {
                $d =~ s/\s*[|].*$//;
                $d =~ s/\s*[(].*[)]\s*//;
                push @d, $d;
            }
        }
    }
    for my $d (@d) {
        print "$d\n";
    }

and its arguments are the base packages:

    acl base-passwd bsdutils debconf debianutils dpkg gcc-4.9-base
    initscripts libacl1 libaudit1 libblkid1 libc6 libcap2 libcap2-bin
    libcryptsetup4 libdb5.3 libdns-export100 libgcc1 libgcrypt20
    libgnutls-deb0-28 libgnutls-openssl27 libidn11 libirs-export91
    libisc-export95 liblocale-gettext-perl liblzma5 libmnl0 libncurses5
    libnetfilter-acct1 libnettle4 libnfnetlink0 libpam-runtime libpam0g
    libpcre3 libpsl0 libselinux1 libsystemd0 libtinfo5 libuuid1 lsb-base
    mount multiarch-support passwd perl-base sysv-rc util-linux zlib1g

Note that this process has not used much cpu time, so it is not
stuck in an infinite loop.  This suggests that joubertin1 was, at
this point, trying to do some kind of computation, rather than
network access.  For some reason this computation wasn't progressing.

I looked through all the rest of the logs and there was nothing
illuminating, interesting or unexpected, except for this during log
capture:

  2018-06-12 23:12:57 Z executing ssh ... root@172.16.144.61 lspci -vvv
  pcilib: sysfs_read_vpd: read failed: Connection timed out

I searched the logs area generally for this message and it seems
very common, so I think it is a red herring.

Overall I think this is probably an actual bug, in the kernel (4.14
in this test), Xen (staging-4.8), or perhaps the firmware or
hardware.  It is obviously some kind of very low probability
heisenbug.

>  test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm 13 guest-saverestore  fail  pass in 124070

This shows:

  xc: error: Failed to get types for pfn batch (14 = Bad address): Internal error

Ie, this is almost certainly the libxc/linux EFAULT bug which Juergen
is investigating, rather than a true regression in Xen 4.8.  I think
it likely that the EFAULT bug's probability is higher in Linux 4.14
than in 4.9; I recently switched osstest to 4.14.

Ian.