[valgrind] [Bug 484742] unhandled instruction 0x4E9096B7
https://bugs.kde.org/show_bug.cgi?id=484742 --- Comment #2 from Joost VandeVondele --- OK, this seems to be related to the +dotprod part of the ISA (which we detect based on the asimddp flag). The offending code is probably `acc = vdotq_s32(acc, a0, b0);` or `output[i] = vaddvq_s32(sum);`. -- You are receiving this mail because: You are watching all bug changes.
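A small reproducer built around the two intrinsics named in this comment might look like the sketch below. The function name `dot_i8` and its loop are hypothetical and not taken from Stockfish; the scalar branch exists only so the file also builds on targets without the dot-product extension.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#endif

/* Dot product of two int8 arrays. n must be a multiple of 16.
 * Hypothetical helper for illustration, not Stockfish code. */
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
    int32x4_t acc = vdupq_n_s32(0);
    for (size_t i = 0; i < n; i += 16) {
        int8x16_t a0 = vld1q_s8(a + i);
        int8x16_t b0 = vld1q_s8(b + i);
        acc = vdotq_s32(acc, a0, b0);  /* 4-way int8 dot product per lane */
    }
    return vaddvq_s32(acc);            /* horizontal add of the 4 lanes */
#else
    /* Scalar fallback when the dot-product extension is not enabled. */
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
#endif
}
```

Built with the target from the report (`-march=armv8.2-a+dotprod`), the NEON branch should be selected. Whether such a reduced program triggers the same failure under valgrind has not been checked here.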
[valgrind] [Bug 484742] unhandled instruction 0x4E9096B7
https://bugs.kde.org/show_bug.cgi?id=484742 --- Comment #1 from Joost VandeVondele --- Also reproduces on a Raspberry Pi 5:

Raspberry Pi 5 $ cat /proc/cpuinfo
processor : 0
BogoMIPS: 108.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x4
CPU part: 0xd0b
CPU revision: 1
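Since the dot-product path is selected from the asimddp entry in the Features line, a check for that flag has to match whole tokens: `asimd` is a prefix of `asimddp`. A minimal sketch, assuming the Features line has already been read into a string (the function name `has_cpu_feature` is hypothetical):

```c
#include <stdbool.h>
#include <string.h>

/* Return true if `flag` occurs as a whole whitespace-separated token in a
 * "Features:" line copied from /proc/cpuinfo. Hypothetical helper. */
bool has_cpu_feature(const char *features, const char *flag)
{
    size_t len = strlen(flag);
    if (len == 0)
        return false;

    const char *p = features;
    while ((p = strstr(p, flag)) != NULL) {
        /* The token must start at the line start or after a separator. */
        bool starts = (p == features) ||
                      p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        /* ... and must end at a separator or at the end of the string. */
        char end = p[len];
        bool ends = end == '\0' || end == ' ' || end == '\t' || end == '\n';
        if (starts && ends)
            return true;
        p += 1;  /* keep searching after a partial match such as "asimdhp" */
    }
    return false;
}
```

With the Raspberry Pi 5 line above, `has_cpu_feature(line, "asimddp")` is true while `has_cpu_feature(line, "sve")` is false.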
[valgrind] [Bug 484742] New: unhandled instruction 0x4E9096B7
https://bugs.kde.org/show_bug.cgi?id=484742

Bug ID: 484742
Summary: unhandled instruction 0x4E9096B7
Classification: Developer tools
Product: valgrind
Version: 3.22.0
Platform: Other
OS: Linux
Status: REPORTED
Severity: normal
Priority: NOR
Component: general
Assignee: jsew...@acm.org
Reporter: joost.vandevond...@gmail.com
Target Milestone: ---

On NVIDIA's Grace CPU, valgrind fails to run a binary (that otherwise runs fine) because it cannot handle an instruction:

```
==45347== Memcheck, a memory error detector
==45347== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==45347== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==45347== Command: ./stockfish bench
==45347==
Stockfish dev-20240329-ec598b38 by the Stockfish developers (see AUTHORS file)
Position: 1/48 (rnbqkbnr//8/8/8/8//RNBQKBNR w KQkq - 0 1)
info string NNUE evaluation using nn-1ceb1ade0001.nnue
info string NNUE evaluation using nn-baff1ede1f90.nnue
disInstr(arm64): unhandled instruction 0x4E9096B7
disInstr(arm64): 0100'1110 1001'0000 1001'0110 1011'0111
==45347== valgrind: Unrecognised instruction at address 0x40f684.
==45347==    at 0x40F684: Stockfish::Eval::NNUE::Network, Stockfish::Eval::NNUE::FeatureTransformer<2560u, &Stockfish::StateInfo::accumulatorBig> >::evaluate(Stockfish::Position const&, bool, int*, bool) const [clone .constprop.0] (in /users/vjoost/Stockfish/src/stockfish)
==45347==    by 0x40E667: Stockfish::Eval::evaluate(Stockfish::Eval::NNUE::Networks const&, Stockfish::Position const&, int) (in /users/vjoost/Stockfish/src/stockfish)
==45347==    by 0x42A3F7: Stockfish::Search::Worker::iterative_deepening() (in /users/vjoost/Stockfish/src/stockfish)
==45347==    by 0x4280E7: Stockfish::Search::Worker::start_searching() (in /users/vjoost/Stockfish/src/stockfish)
==45347==    by 0x4210EB: Stockfish::Thread::idle_loop() (in /users/vjoost/Stockfish/src/stockfish)
==45347==    by 0x42103F: Stockfish::NativeThread::NativeThread(void (Stockfish::Thread::*&&)(), Stockfish::Thread*&&)::{lambda(void*)#1}::_FUN(void*) (in /users/vjoost/Stockfish/src/stockfish)
==45347==    by 0x507875B: start_thread (in /lib64/libpthread-2.31.so)
==45347==    by 0x54BFEEB: thread_start (in /lib64/libc-2.31.so)
```

Linux OS. The program is compiled using `gcc version 12.3.0`, with target: `-march=armv8.2-a+dotprod`. /proc/cpuinfo gives:

```
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part: 0xd4f
CPU revision: 0
```
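The reported word 0x4E9096B7 can be checked by hand against the AArch64 SDOT/UDOT (vector) encoding. The sketch below assumes the layout `0 Q U 01110 size=10 0 Rm 100101 Rn Rd` as read from the Arm architecture reference; the struct and function names are hypothetical, and the mask should be verified against the manual before relying on it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields of a candidate SDOT/UDOT (vector) instruction.
 * Hypothetical type for illustration. */
typedef struct {
    bool     is_dot;       /* fixed bits match the assumed SDOT/UDOT pattern */
    bool     is_unsigned;  /* U bit (29): 0 = SDOT, 1 = UDOT */
    bool     q;            /* Q bit (30): 1 = 128-bit vectors */
    unsigned rm, rn, rd;   /* register numbers */
} DotInsn;

DotInsn decode_dot(uint32_t insn)
{
    DotInsn d;
    /* Compare only the fixed bits: bit 31, bits 28-21 and bits 15-10.
     * Q, U, Rm, Rn and Rd are masked out. */
    d.is_dot      = (insn & 0x9FE0FC00u) == 0x0E809400u;
    d.q           = (insn >> 30) & 1u;
    d.is_unsigned = (insn >> 29) & 1u;
    d.rm          = (insn >> 16) & 0x1Fu;
    d.rn          = (insn >> 5) & 0x1Fu;
    d.rd          = insn & 0x1Fu;
    return d;
}
```

Under that assumption, `decode_dot(0x4E9096B7)` matches the pattern with Q = 1, U = 0, Rm = 16, Rn = 21 and Rd = 23, which would read as `sdot v23.4s, v21.16b, v16.16b` and is consistent with the binary having been built with `+dotprod`.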
[valgrind] [Bug 356457] valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
https://bugs.kde.org/show_bug.cgi?id=356457 --- Comment #22 from Joost VandeVondele --- In my case, the issue has disappeared, and the only thing that changed is that the server has been updated and now runs Red Hat Enterprise Linux Server release 7.2, which includes, for example, a newer kernel (3.10.0-327.13.1.el7.x86_64). Valgrind, gcc, etc. are still the same versions. So I suspect the cause is some interaction with the OS.
[valgrind] [Bug 356457] valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
https://bugs.kde.org/show_bug.cgi?id=356457 --- Comment #14 from Joost VandeVondele --- Also no luck with --sanity-level=4. The fact that the failure cannot be reproduced on demand certainly does not make this easier. I wonder whether something external to valgrind could be triggering it.
[valgrind] [Bug 356457] valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
https://bugs.kde.org/show_bug.cgi?id=356457 --- Comment #12 from Joost VandeVondele --- Since the error is recurring, I have now tried self-hosting. Running:

/data/vjoost/test/outer/install/bin/valgrind --sim-hints=enable-outer --trace-children=yes --smc-check=all-non-file --run-libc-freeres=no --tool=memcheck -v /data/vjoost/test/inner/install/bin/valgrind --suppressions=/data/vjoost/toolchain-r16494/install/valgrind.supp --max-stackframe=2168152 --error-exitcode=42 --vgdb-prefix=./inner --core-redzone-size=1000 --tool=memcheck -v /data/schuetto/auto_regtesting/regtests/cp2k/exe/local_valgrind/cp2k.sdbg ethanol_both_rcut10.0_e1-1_v1-4_RSR.inp

(i.e. self-hosting with an added redzone, on our executable corresponding to a failed run, with its arguments and parameters), I get a seemingly correct run. The output will be attached as out.innerouter.2. Maybe it is worth a look by expert eyes.

However, after noticing a warning about stack switching in that output, I added --max-stackframe=68009224472 (as suggested; it seems a bit large ;-), and that led to a run with a different error (Memcheck: the 'impossible' happened: create_MC_Chunk: shadow area is accessible).
[valgrind] [Bug 356457] valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
https://bugs.kde.org/show_bug.cgi?id=356457 --- Comment #11 from Joost VandeVondele --- Created attachment 96783 --> https://bugs.kde.org/attachment.cgi?id=96783&action=edit self hosting output 3
[valgrind] [Bug 356457] valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
https://bugs.kde.org/show_bug.cgi?id=356457 --- Comment #10 from Joost VandeVondele --- Created attachment 96782 --> https://bugs.kde.org/attachment.cgi?id=96782&action=edit self-hosting output 2
[valgrind] [Bug 356457] valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
https://bugs.kde.org/show_bug.cgi?id=356457 --- Comment #8 from Joost VandeVondele --- The failures are observed on RedHatEnterpriseServer Release:6.7.

Over the weekend, I ran valgrind on essentially just the startup of the relevant binary (having it print the version number). With >20 runs (10s each) I had no failure. That was on a different machine, running Red Hat 7.2. I'll try something similar on the other machine, but the failure does not seem easy to trigger.

The dynamic libraries in my case are few, and standard I suppose:

> ldd /data/vjoost/clean/cp2k/cp2k/exe/local_valgrind/cp2k.sdbg
linux-vdso.so.1 => (0x7ffe09f0d000)
libstdc++.so.6 => /data/vjoost/toolchain-r16447/install/lib64/libstdc++.so.6 (0x7f7d5a89)
libgfortran.so.3 => /data/vjoost/toolchain-r16447/install/lib64/libgfortran.so.3 (0x7f7d5a56f000)
libm.so.6 => /lib64/libm.so.6 (0x00323320)
libgcc_s.so.1 => /data/vjoost/toolchain-r16447/install/lib64/libgcc_s.so.1 (0x7f7d5a33c000)
libquadmath.so.0 => /data/vjoost/toolchain-r16447/install/lib64/libquadmath.so.0 (0x7f7d5a0fd000)
libc.so.6 => /lib64/libc.so.6 (0x003232e0)
/lib64/ld-linux-x86-64.so.2 (0x003232a0)

Many more static libraries are involved, and all are compiled with debug info. The binary is also large (~142 MB).
[valgrind] [Bug 356457] valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
https://bugs.kde.org/show_bug.cgi?id=356457 --- Comment #3 from Joost VandeVondele --- This just happened again, but it is really rare (this is a 12-core server running valgrind roughly 12h a day, and the failure seems to occur roughly every 10 days). Are any of the suggestions mentioned above possible without runtime overhead and excessive I/O?

==25277== Memcheck, a memory error detector
==25277== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==25277== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==25277== Command: /data/schuetto/auto_regtesting/regtests/cp2k/exe/local_valgrind/cp2k.pdbg Pa.inp
==25277==
blockSane: fail -- redzone-hi

valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
[valgrind] [Bug 356457] New: valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
https://bugs.kde.org/show_bug.cgi?id=356457

Bug ID: 356457
Summary: valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
Product: valgrind
Version: unspecified
Platform: Compiled Sources
OS: Linux
Status: UNCONFIRMED
Severity: normal
Priority: NOR
Component: general
Assignee: jsew...@acm.org
Reporter: joost.vandevond...@mat.ethz.ch

Our nightly tester runs a few thousand testcases, and usually this all goes fine. However, some rare condition can seemingly trigger the following assert:

==10071== Memcheck, a memory error detector
==10071== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==10071== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==10071== Command: /data/schuetto/auto_regtesting/regtests/cp2k/exe/local_valgrind/cp2k.sdbg O-B97-q6.inp
==10071==
blockSane: fail -- redzone-hi

valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.

host stacktrace:
==10071==    at 0x38083F48: show_sched_status_wrk (m_libcassert.c:343)
==10071==    by 0x38084064: report_and_quit (m_libcassert.c:415)
==10071==    by 0x380841F1: vgPlain_assert_fail (m_libcassert.c:481)
==10071==    by 0x380925E6: vgPlain_arena_free (m_mallocfree.c:2042)
==10071==    by 0x3811B51E: vgModuleLocal_img_done (image.c:778)
==10071==    by 0x380BAFF0: vgModuleLocal_read_elf_debug_info (readelf.c:3027)
==10071==    by 0x380B343A: di_notify_ACHIEVE_ACCEPT_STATE (debuginfo.c:749)
==10071==    by 0x380B343A: vgPlain_di_notify_mmap (debuginfo.c:1067)
==10071==    by 0x380D963D: vgModuleLocal_generic_PRE_sys_mmap (syswrap-generic.c:2367)
==10071==    by 0x3810D6A1: vgSysWrap_amd64_linux_sys_mmap_before (syswrap-amd64-linux.c:637)
==10071==    by 0x380D60A4: vgPlain_client_syscall (syswrap-main.c:1905)
==10071==    by 0x380D2B9A: handle_syscall (scheduler.c:1118)
==10071==    by 0x380D424E: vgPlain_scheduler (scheduler.c:1435)
==10071==    by 0x380E37B6: thread_wrapper (syswrap-linux.c:102)
==10071==    by 0x380E37B6: run_a_thread_NORETURN (syswrap-linux.c:155)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 10071)
==10071==    at 0x3232A1761A: mmap (in /lib64/ld-2.12.so)
==10071==    by 0x3232A076B9: _dl_map_object_from_fd (in /lib64/ld-2.12.so)
==10071==    by 0x3232A08399: _dl_map_object (in /lib64/ld-2.12.so)
==10071==    by 0x3232A0C3A1: openaux (in /lib64/ld-2.12.so)
==10071==    by 0x3232A0E285: _dl_catch_error (in /lib64/ld-2.12.so)
==10071==    by 0x3232A0CA84: _dl_map_object_deps (in /lib64/ld-2.12.so)
==10071==    by 0x3232A0330F: dl_main (in /lib64/ld-2.12.so)
==10071==    by 0x3232A160AD: _dl_sysdep_start (in /lib64/ld-2.12.so)
==10071==    by 0x3232A014A3: _dl_start (in /lib64/ld-2.12.so)
==10071==    by 0x3232A00B07: ??? (in /lib64/ld-2.12.so)
==10071==    by 0x1: ???
==10071==    by 0xFFF000A32: ???
==10071==    by 0xFFF000A7C: ???

Note: see also the FAQ in the source distribution.

I have attempted to reproduce this (running exactly the same commands, binary, etc.), but it did not fail. The error appears to happen before our executable is really running, so maybe something goes wrong on startup?

Reproducible: Couldn't Reproduce