Bug#1060251: Bug#1059352: src:apt: fails to migrate to testing for too long: autopkgtest regression on armhf
Hi Julian! On 2024-01-12 01:47, Julian Andres Klode wrote: > Either way, these are harmless I'm not saying they're harmful, what I'm saying is: 1) the errors you see on armhf when building apt without stack-clash-protection are the same valgrind reports on amd64 as well. Hence, you could consider rebuilding apt without stack-clash-protection on armhf and dropping the armhf conditional in test/integration/test-apt-update-simple 2) if you run valgrind in CI, I think you should probably use --error-exitcode. Otherwise, what's the point of using valgrind at all, if you don't fail when it reports errors? :-) This is not directly relevant to #1060251 of course, I'm happy to file a separate bug against apt if you find that useful.
Bug#1060251: Bug#1059352: src:apt: fails to migrate to testing for too long: autopkgtest regression on armhf
On Thu, Jan 11, 2024 at 10:18:58PM +0100, Emanuele Rocca wrote: > Hi Julian, > > On 2024-01-11 05:46, Julian Andres Klode wrote: > > And there aren't any hard errors. We could zero initialize > > those or add supressions to make things look nicer I suppose. > > Mmmh no, they are all actual errors as far as valgrind is concerned. > > The thing is, you're running valgrind without --error-exitcode. By doing > so, the exit code of your tests is the exit code of apt-get, not of > valgrind. > > Try this instead: > > (sid-amd64)root@ariel:~# valgrind --error-exitcode=1 apt-get update > [...] > ==308534== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 0 from 0) > (sid-amd64)root@ariel:~# echo $? > 1 That's all just strawman arguments. As said before, these aren't the issue. These all occur in locations where we read the entire memory arena of our allocator, either for hashing or writing it to a file. While they _should_ be zero-initialized if you look at the code, we actually initialize things with a header object. The problem is the header object looks like: uint32_t magic; uint16_t major; uint16_t minor; bool dirty; uint16_t boo; And the padding bytes are never initialized but will inevitably be read by the hashing and write back of the entire arena. We can fix these errors by either zero initializing the padding in the object constructor: diff --git a/apt-pkg/pkgcache.cc b/apt-pkg/pkgcache.cc index 76336b9b3..b845cb13c 100644 --- a/apt-pkg/pkgcache.cc +++ b/apt-pkg/pkgcache.cc @@ -52,6 +52,7 @@ using APT::StringView; /* Simply initialize the header */ pkgCache::Header::Header() { + memset(this, 0, sizeof(*this)); #define APT_HEADER_SET(X,Y) X = Y; static_assert(std::numeric_limits::max() > Y, "Size violation detected in pkgCache::Header") APT_HEADER_SET(Signature, 0x98FE76DC); or by not creating a header on the stack (with uninitialized padding) and then copying it in, by using placement new operator: diff --git a/apt-pkg/pkgcachegen.cc b/apt-pkg/pkgcachegen.cc index 9e47ef369..3a85a9585 100644 --- a/apt-pkg/pkgcachegen.cc +++ b/apt-pkg/pkgcachegen.cc @@ -89,7 +89,7 @@ bool pkgCacheGenerator::Start() Map.UsePools(*Cache.HeaderP->Pools,sizeof(Cache.HeaderP->Pools)/sizeof(Cache.HeaderP->Pools[0])); // Starting header - *Cache.HeaderP = pkgCache::Header(); + new (Cache.HeaderP) pkgCache::Header(); // make room for the hashtables for packages and groups if (Map.RawAllocate(2 * (Cache.HeaderP->GetHashTableSize() * sizeof(map_pointer))) == 0) Either way, these are harmless -- debian developer - deb.li/jak | jak-linux.org - free software dev ubuntu core developer i speak de, en
Bug#1060251: Bug#1059352: src:apt: fails to migrate to testing for too long: autopkgtest regression on armhf
Hi Julian, On 2024-01-11 05:46, Julian Andres Klode wrote: > And there aren't any hard errors. We could zero initialize > those or add supressions to make things look nicer I suppose. Mmmh no, they are all actual errors as far as valgrind is concerned. The thing is, you're running valgrind without --error-exitcode. By doing so, the exit code of your tests is the exit code of apt-get, not of valgrind. Try this instead: (sid-amd64)root@ariel:~# valgrind --error-exitcode=1 apt-get update [...] ==308534== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 0 from 0) (sid-amd64)root@ariel:~# echo $? 1
Bug#1060251: Bug#1059352: src:apt: fails to migrate to testing for too long: autopkgtest regression on armhf
On Thu, Jan 11, 2024 at 03:53:17PM +0100, Emanuele Rocca wrote: > Hi Julian, > > On 2024-01-08 10:28, Julian Andres Klode wrote: > > (in Ubuntu we have partially recovered by disabling stack clash > > protection but it crashes on invalid writes there, I suppose we need > > to rebuild some more apt dependencies without the flag...). > > The 'invalid writes' issue seems unrelated to armhf and > stack-clash-protection, > I can reproduce it on my x86 workstation. It would be interesting to see if > once these problems are fixed valgrind on armhf still segfaults. > > (sid-amd64)root@ariel:~# valgrind apt-get update > ==194196== Memcheck, a memory error detector > ==194196== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. > ==194196== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info > ==194196== Command: apt-get update > ==194196== > Hit:1 http://127.0.0.1:3142/debian sid InRelease > ==194196== Conditional jump or move depends on uninitialised value(s) > ==194196==at 0x4A89B3B: pkgCache::ReMap(bool const&) (in > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.6.0.0) > [... more errors follow] The uninitialized values in ReMap are actually normal and correct behavior, not errors. It happens because we need to grow the array/map without having written all bytes of it first. The same applies to uninitalized bytes passed to write from pkgCacheFile::BuildCaches(), it's writing the partially initialized memory pool to the file. And there aren't any hard errors. We could zero initialize those or add supressions to make things look nicer I suppose. -- debian developer - deb.li/jak | jak-linux.org - free software dev ubuntu core developer i speak de, en
Bug#1060251: Bug#1059352: src:apt: fails to migrate to testing for too long: autopkgtest regression on armhf
Hi Julian, On 2024-01-08 10:28, Julian Andres Klode wrote: > (in Ubuntu we have partially recovered by disabling stack clash > protection but it crashes on invalid writes there, I suppose we need > to rebuild some more apt dependencies without the flag...). The 'invalid writes' issue seems unrelated to armhf and stack-clash-protection, I can reproduce it on my x86 workstation. It would be interesting to see if once these problems are fixed valgrind on armhf still segfaults. (sid-amd64)root@ariel:~# valgrind apt-get update ==194196== Memcheck, a memory error detector ==194196== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. ==194196== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info ==194196== Command: apt-get update ==194196== Hit:1 http://127.0.0.1:3142/debian sid InRelease ==194196== Conditional jump or move depends on uninitialised value(s) ==194196==at 0x4A89B3B: pkgCache::ReMap(bool const&) (in /usr/lib/x86_64-linux-gnu/libapt-pkg.so.6.0.0) [... more errors follow]
Bug#1059352: src:apt: fails to migrate to testing for too long: autopkgtest regression on armhf
Control: retitle -1 valgrind: Access not within mapped region on armhf Hello Paul and Julian, On 2023-12-24 07:31, Paul Gevers wrote: > On 23-12-2023 20:40, Julian Andres Klode wrote: > > the logs for well known bugs so that you don't end up filing bugs > > against every package broken by valgrind's missing support for the new > > NOP sequences in binutils. > > I do *some* minor triaging, but I didn't know this to be a common problem. Note that there are three distinct valgrind problems mentioned here: - on armhf, flagging accesses below the stack pointer as illegal. This came up after enabling stack-clash-protection on armhf. I added a suppression for this problem in valgrind 1:3.20.0-2, so we shouldn't be seeing those anymore. For reference: https://lists.debian.org/debian-arm/2023/12/msg7.html - on i386, valgrind was failing due to new nop patterns that it didn't know about. This is https://bugs.debian.org/1057693 and was fixed by Aurelien in 1:3.20.0-2.1 - on armhf, a segfault when testing apt. I don't think this issue is widespread at all: in fact I'm not aware of any other instances except for apt. After going through all armhf CI failures caused by segfaults, apt seems to be the only one with valgrind in the picture. Just to be clear: I'm not saying that the problem is trivial, just that AFAIK it's not widespread. I'm including the relevant CI logs below for reference. Also I'm attaching the horrible script I've written to fetch all armhf CI failures in case anyone wants to poke around. 722s ==221367== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. 722s ==221367== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info 722s ==221367== Command: /usr/bin/apt-get update -o Acquire::Queue-Mode=access 722s ==221367== 722s ==221367== 722s ==221367== Process terminating with default action of signal 11 (SIGSEGV) 722s ==221367== Access not within mapped region at address 0xFE7F2A9C 722s ==221367==at 0x4938838: ReadMessages(int, std::vector, std::allocator >, std::allocator, std::allocator > > >&) (in /usr/lib/arm-linux-gnueabihf/libapt-pkg.so.6.0.0) 722s ==221367== If you believe this happened as a result of a stack 722s ==221367== overflow in your program's main thread (unlikely but 722s ==221367== possible), you can try to increase the size of the 722s ==221367== main thread stack using the --main-stacksize= flag. 722s ==221367== The main thread stack size used in this run was 8388608. 722s ==221367== 722s ==221367== HEAP SUMMARY: 722s ==221367== in use at exit: 119,915 bytes in 1,882 blocks 722s ==221367== total heap usage: 11,673 allocs, 9,791 frees, 868,849 bytes allocated 722s ==221367== 722s ==221367== LEAK SUMMARY: 722s ==221367==definitely lost: 0 bytes in 0 blocks 722s ==221367==indirectly lost: 0 bytes in 0 blocks 722s ==221367== possibly lost: 0 bytes in 0 blocks 722s ==221367==still reachable: 119,915 bytes in 1,882 blocks 722s ==221367== suppressed: 0 bytes in 0 blocks 722s ==221367== Rerun with --leak-check=full to see details of leaked memory 722s ==221367== 722s ==221367== For lists of detected and suppressed errors, rerun with: -s 722s ==221367== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 42821 from 965) 722s Segmentation fault 722s FAIL: exitcode 139 armhf-ci-segfaults.sh Description: Bourne shell script
Bug#1059352: src:apt: fails to migrate to testing for too long: autopkgtest regression on armhf
Control: clone -1 -2 Control: reassign -2 valgrind Control: retitle -2 valgrind: armhf is broken On Sat, Dec 23, 2023 at 08:46:27AM +0100, Paul Gevers wrote: > Source: apt > Version: 2.7.6 > Severity: serious > Control: close -1 2.7.7 > Tags: sid trixie > User: release.debian@packages.debian.org > Usertags: out-of-sync > > Dear maintainer(s), > > The Release Team considers packages that are out-of-sync between testing and > unstable for more than 30 days as having a Release Critical bug in testing > [1]. Your package src:apt has been trying to migrate for 31 days [2]. Hence, > I am filing this bug. The version in unstable fails its own autopkgtest on > armhf. > > If a package is out of sync between unstable and testing for a longer > period, this usually means that bugs in the package in testing cannot be > fixed via unstable. Additionally, blocked packages can have impact on other > packages, which makes preparing for the release more difficult. Finally, it > often exposes issues with the package and/or > its (reverse-)dependencies. We expect maintainers to fix issues that hamper > the migration of their package in a timely manner. > > This bug will trigger auto-removal when appropriate. As with all new bugs, > there will be at least 30 days before the package is auto-removed. > > I have immediately closed this bug with the version in unstable, so if that > version or a later version migrates, this bug will no longer affect testing. > I have also tagged this bug to only affect sid and trixie, so it doesn't > affect (old-)stable. > > If you believe your package is unable to migrate to testing due to issues > beyond your control, don't hesitate to contact the Release Team. > > Paul > > [1] https://lists.debian.org/debian-devel-announce/2023/06/msg1.html > [2] https://qa.debian.org/excuses.php?package=apt > This is a release critical bug in valgrind. valgrind does not understand the new toolchain defaults like stack clash protection on armhf and just segfaults (in Ubuntu we have partially recovered by disabling stack clash protection but it crashes on invalid writes there, I suppose we need to rebuild some more apt dependencies without the flag...). As there doesn't appear to be any progress on the valgrind or toolchain sides to fix these issues, I'll go and disable valgrind on armhf in the next apt upload, but I am cloning this to valgrind to make it clear that it's horribly borked. -- debian developer - deb.li/jak | jak-linux.org - free software dev ubuntu core developer i speak de, en signature.asc Description: PGP signature
Bug#1059352: src:apt: fails to migrate to testing for too long: autopkgtest regression on armhf
Hi Julian, On 23-12-2023 20:40, Julian Andres Klode wrote: I know this is automated but I feel lije it would be sensible to scan It's *mostly* automated. the logs for well known bugs so that you don't end up filing bugs against every package broken by valgrind's missing support for the new NOP sequences in binutils. I do *some* minor triaging, but I didn't know this to be a common problem. Where do you think I should have picked this up? Anyways, this bug is already closed (when I submitted) and because apt is a key package, it doesn't even do anything against apt, so the bug serves mostly for tracking purposes. Paul OpenPGP_signature.asc Description: OpenPGP digital signature
Bug#1059352: src:apt: fails to migrate to testing for too long: autopkgtest regression on armhf
On Sat, Dec 23, 2023 at 08:46:27AM +0100, Paul Gevers wrote: > Source: apt > Version: 2.7.6 > Severity: serious > Control: close -1 2.7.7 > Tags: sid trixie > User: release.debian@packages.debian.org > Usertags: out-of-sync > > Dear maintainer(s), > > The Release Team considers packages that are out-of-sync between testing and > unstable for more than 30 days as having a Release Critical bug in testing > [1]. Your package src:apt has been trying to migrate for 31 days [2]. Hence, > I am filing this bug. The version in unstable fails its own autopkgtest on > armhf. > > If a package is out of sync between unstable and testing for a longer > period, this usually means that bugs in the package in testing cannot be > fixed via unstable. Additionally, blocked packages can have impact on other > packages, which makes preparing for the release more difficult. Finally, it > often exposes issues with the package and/or > its (reverse-)dependencies. We expect maintainers to fix issues that hamper > the migration of their package in a timely manner. > > This bug will trigger auto-removal when appropriate. As with all new bugs, > there will be at least 30 days before the package is auto-removed. > > I have immediately closed this bug with the version in unstable, so if that > version or a later version migrates, this bug will no longer affect testing. > I have also tagged this bug to only affect sid and trixie, so it doesn't > affect (old-)stable. > > If you believe your package is unable to migrate to testing due to issues > beyond your control, don't hesitate to contact the Release Team. I know this is automated but I feel lije it would be sensible to scan the logs for well known bugs so that you don't end up filing bugs against every package broken by valgrind's missing support for the new NOP sequences in binutils. Like this work is tracked in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1057693 and normally we would forcemerge this into it. -- debian developer - deb.li/jak | jak-linux.org - free software dev ubuntu core developer i speak de, en
Bug#1059352: src:apt: fails to migrate to testing for too long: autopkgtest regression on armhf
Source: apt Version: 2.7.6 Severity: serious Control: close -1 2.7.7 Tags: sid trixie User: release.debian@packages.debian.org Usertags: out-of-sync Dear maintainer(s), The Release Team considers packages that are out-of-sync between testing and unstable for more than 30 days as having a Release Critical bug in testing [1]. Your package src:apt has been trying to migrate for 31 days [2]. Hence, I am filing this bug. The version in unstable fails its own autopkgtest on armhf. If a package is out of sync between unstable and testing for a longer period, this usually means that bugs in the package in testing cannot be fixed via unstable. Additionally, blocked packages can have impact on other packages, which makes preparing for the release more difficult. Finally, it often exposes issues with the package and/or its (reverse-)dependencies. We expect maintainers to fix issues that hamper the migration of their package in a timely manner. This bug will trigger auto-removal when appropriate. As with all new bugs, there will be at least 30 days before the package is auto-removed. I have immediately closed this bug with the version in unstable, so if that version or a later version migrates, this bug will no longer affect testing. I have also tagged this bug to only affect sid and trixie, so it doesn't affect (old-)stable. If you believe your package is unable to migrate to testing due to issues beyond your control, don't hesitate to contact the Release Team. Paul [1] https://lists.debian.org/debian-devel-announce/2023/06/msg1.html [2] https://qa.debian.org/excuses.php?package=apt OpenPGP_signature.asc Description: OpenPGP digital signature