[valgrind] [Bug 327548] false positive while destroying mutex
https://bugs.kde.org/show_bug.cgi?id=327548

Paul Floyd changed:

           What     |Removed  |Added
         Resolution |---      |FIXED
             Status |REOPENED |RESOLVED

--- Comment #20 from Paul Floyd ---
Now fixed:

commit 74e20007467b930fb14264519f74a45179b5f740 (HEAD -> master, origin/master, origin/HEAD)
Author: Paul Floyd
Date:   Tue Jan 3 21:28:42 2023 +0100

    Bug 327548 - false positive while destroying mutex

-- 
You are receiving this mail because:
You are watching all bug changes.
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #19 from Samuel Thibault ---
(ran it for an hour)
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #18 from Samuel Thibault ---
I have run the reproducer on all the various systems previously mentioned,
with no issue so far, so it seems we're good with that fix.
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #17 from Samuel Thibault ---
Ah, configure.ac already detects which case that is. So we need a valgrind
built against the proper glibc; that's why I didn't have the proper name.
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #16 from Samuel Thibault ---
> I would expect these to be covered by the default suppressions:
> {
>    helgrind-glibc2X-004
>    Helgrind:Race
>    obj:*/lib*/libc-2.*so*
> }

Ah, but glibc renamed libc-2.*so* into just libc.so.6, so that suppression
won't take effect. Duplicating the entry with libc.so.* does indeed seem to
avoid the warnings. I'll try to test on more machines and archs.
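For context, a duplicated suppression entry covering the new unversioned glibc soname would look roughly like the following. This is a sketch of what comment #16 describes, not the exact entry later committed upstream; the suppression name is illustrative, and the object glob assumes glibc's current `libc.so.6` naming:

```
{
   helgrind-glibc2X-004a
   Helgrind:Race
   obj:*/lib*/libc.so.6*
}
```

Such entries can be tested without rebuilding valgrind by putting them in a file and passing `--suppressions=file` on the command line.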
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #15 from Paul Floyd ---
(In reply to Samuel Thibault from comment #12)
> Yes that avoids that exact issue, but then a flurry of new ones pop up, I
> was not getting them before:
>
> ==2511392== Possible data race during read of size 4 at 0x10C128 by thread #2
> ==2511392== Locks held: none
> ==2511392==    at 0x48F31A2: pthread_mutex_lock@@GLIBC_2.2.5 (pthread_mutex_lock.c:94)

[snip]

I would expect these to be covered by the default suppressions:

{
   helgrind-glibc2X-004
   Helgrind:Race
   obj:*/lib*/libc-2.*so*
}

and

{
   helgrind-glibc2X-005
   Helgrind:Race
   obj:*/lib*/libpthread-2.*so*
}

which are very broad and cover most of the pthread internals.

I'll try to see if I can test on 2.36
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #14 from Samuel Thibault ---
FYI, it's glibc 2.36:

- pthread_mutex_lock.c:94 is where it checks assert (mutex->__data.__owner == 0); after obtaining the lock
- pthread_mutex_lock.c:182 is where it sets mutex->__data.__owner after obtaining the lock
- pthread_mutex_unlock.c:62 is where it clears mutex->__data.__owner before releasing the lock
- pthread_mutex_unlock.c:65 is where it does --mutex->__data.__nusers; before releasing the lock

The other cond-related lines are more complex to describe, but they'll
probably get fixed the same way as the mutex ones.

I guess the disabling of checking somehow perturbs helgrind's history record
(I guess that's part of the "This puts it in the same state as new memory
allocated by this thread -- that is, basically owned exclusively by this
thread." comment for the VALGRIND_HG_ENABLE_CHECKING macro).
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #13 from Paul Floyd ---
Looking at the glibc source, pthread_mutex_destroy is fairly simple. It does
an atomic read of the kind and a plain read of nusers, and returns EBUSY if
nusers is not 0. It then sets kind to an invalid value, -1.

pthread_mutex_unlock is a lot more complicated. It also does atomic reads of
the kind, decrements nusers and checks the lock, then does a cas on the lock.

If I run with --default-suppressions=no then there are huge numbers of
hazards, but none related to pthread_mutex_destroy. I think that is because
Helgrind sees the lock order is such that pthread_mutex_destroy is called in
a serial section.
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #12 from Samuel Thibault ---
Yes that avoids that exact issue, but then a flurry of new ones pop up, I was
not getting them before:

==2511392== Possible data race during read of size 4 at 0x10C128 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F31A2: pthread_mutex_lock@@GLIBC_2.2.5 (pthread_mutex_lock.c:94)
==2511392==    by 0x48487F8: mutex_lock_WRK (hg_intercepts.c:937)
==2511392==    by 0x109272: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392==
==2511392== This conflicts with a previous write of size 4 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x48EF26D: __pthread_cond_wait_common (pthread_cond_wait.c:419)
==2511392==    by 0x48EF26D: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:618)
==2511392==    by 0x484BB03: pthread_cond_wait_WRK (hg_intercepts.c:1291)
==2511392==    by 0x10933B: main (in /net/inria/home/sthibault/test)
==2511392==  Address 0x10c128 is 8 bytes inside data symbol "mutex"
==2511392==
==2511392==
==2511392== Possible data race during write of size 4 at 0x10C128 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F31B9: pthread_mutex_lock@@GLIBC_2.2.5 (pthread_mutex_lock.c:182)
==2511392==    by 0x48487F8: mutex_lock_WRK (hg_intercepts.c:937)
==2511392==    by 0x109272: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392==
==2511392== This conflicts with a previous write of size 4 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x48EF26D: __pthread_cond_wait_common (pthread_cond_wait.c:419)
==2511392==    by 0x48EF26D: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:618)
==2511392==    by 0x484BB03: pthread_cond_wait_WRK (hg_intercepts.c:1291)
==2511392==    by 0x10933B: main (in /net/inria/home/sthibault/test)
==2511392==  Address 0x10c128 is 8 bytes inside data symbol "mutex"
==2511392==
==2511392==
==2511392== Lock at 0x10C120 was first observed
==2511392==    at 0x484CFE9: pthread_mutex_init (hg_intercepts.c:818)
==2511392==    by 0x1092EE: main (in /net/inria/home/sthibault/test)
==2511392==  Address 0x10c120 is 0 bytes inside data symbol "mutex"
==2511392==
==2511392== Possible data race during write of size 8 at 0x10C0E8 by thread #2
==2511392== Locks held: 1, at address 0x10C120
==2511392==    at 0x48EEE29: __atomic_wide_counter_add_relaxed (atomic_wide_counter.h:57)
==2511392==    by 0x48EEE29: __condvar_add_g1_start_relaxed (pthread_cond_common.c:52)
==2511392==    by 0x48EEE29: __condvar_quiesce_and_switch_g1 (pthread_cond_common.c:294)
==2511392==    by 0x48EEE29: pthread_cond_signal@@GLIBC_2.3.2 (pthread_cond_signal.c:77)
==2511392==    by 0x4848FB8: pthread_cond_signal_WRK (hg_intercepts.c:1570)
==2511392==    by 0x10928B: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392==
==2511392== This conflicts with a previous read of size 8 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48EF39E: __atomic_wide_counter_load_relaxed (atomic_wide_counter.h:30)
==2511392==    by 0x48EF39E: __condvar_load_g1_start_relaxed (pthread_cond_common.c:46)
==2511392==    by 0x48EF39E: __pthread_cond_wait_common (pthread_cond_wait.c:486)
==2511392==    by 0x48EF39E: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:618)
==2511392==    by 0x484BB03: pthread_cond_wait_WRK (hg_intercepts.c:1291)
==2511392==    by 0x10933B: main (in /net/inria/home/sthibault/test)
==2511392==  Address 0x10c0e8 is 8 bytes inside data symbol "cond"
==2511392==
==2511392==
==2511392== Possible data race during write of size 4 at 0x10C128 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x4848DD8: mutex_unlock_WRK (hg_intercepts.c:1184)
==2511392==    by 0x10929A: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392==
==2511392== This conflicts with a previous write
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #11 from Paul Floyd ---
This is the same as what Philippe suggested in the e-mail thread, but with
similar protection for cond vars.
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #10 from Paul Floyd ---
Can you try this patch please?

diff --git a/helgrind/hg_intercepts.c b/helgrind/hg_intercepts.c
index 5a83996..8c98e4e 100644
--- a/helgrind/hg_intercepts.c
+++ b/helgrind/hg_intercepts.c
@@ -879,7 +879,9 @@ static int mutex_destroy_WRK(pthread_mutex_t *mutex)
    if (mutex != NULL) {
       static const pthread_mutex_t mutex_init = PTHREAD_MUTEX_INITIALIZER;
+      VALGRIND_HG_DISABLE_CHECKING(mutex, sizeof(*mutex));
       mutex_is_init = my_memcmp(mutex, &mutex_init, sizeof(*mutex)) == 0;
+      VALGRIND_HG_ENABLE_CHECKING(mutex, sizeof(*mutex));
    } else {
       mutex_is_init = 0;
    }
@@ -1785,7 +1787,9 @@ static int pthread_cond_destroy_WRK(pthread_cond_t* cond)
    if (cond != NULL) {
       const pthread_cond_t cond_init = PTHREAD_COND_INITIALIZER;
+      VALGRIND_HG_DISABLE_CHECKING(cond, sizeof(*cond));
       cond_is_init = my_memcmp(cond, &cond_init, sizeof(*cond)) == 0;
+      VALGRIND_HG_ENABLE_CHECKING(cond, sizeof(*cond));
    } else {
       cond_is_init = 0;
    }
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #9 from Paul Floyd ---
I can reproduce this on a RHEL 7.6 machine.
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #8 from Samuel Thibault ---
(with the various versions of valgrind between 3.9 and 3.19)
https://bugs.kde.org/show_bug.cgi?id=327548

Samuel Thibault changed:

           What     |Removed    |Added
         Resolution |WORKSFORME |---
     Ever confirmed |0          |1
             Status |RESOLVED   |REOPENED

--- Comment #7 from Samuel Thibault ---
To provide various datapoints, I have tried

    valgrind --tool=helgrind ./test

on:

- Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
- Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
- Intel(R) Xeon(R) CPU E5-2650L v4 @ 1.70GHz
- AMD EPYC 7352 24-Core Processor
- AMD EPYC 7502 32-Core Processor

with glibc 2.35. I got the hit

==420467== Helgrind, a thread error detector
==420467== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.
==420467== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==420467== Command: ./test
==420467==
==420467== ---Thread-Announcement--
==420467==
==420467== Thread #1 is the program's root thread
==420467==
==420467== ---Thread-Announcement--
==420467==
==420467== Thread #2 was created
==420467==    at 0x49792FF: clone (clone.S:76)
==420467==    by 0x497A146: __clone_internal (clone-internal.c:83)
==420467==    by 0x48F6484: create_thread (pthread_create.c:295)
==420467==    by 0x48F6F78: pthread_create@@GLIBC_2.34 (pthread_create.c:828)
==420467==    by 0x484E5D7: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==420467==    by 0x1092DA: main (in /net/inria/home/sthibault/test)
==420467==
==420467==
==420467== Possible data race during read of size 1 at 0x10C128 by thread #1
==420467== Locks held: none
==420467==    at 0x484B225: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==420467==    by 0x484B4CA: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==420467==    by 0x10937C: main (in /net/inria/home/sthibault/test)
==420467==
==420467== This conflicts with a previous write of size 4 by thread #2
==420467== Locks held: none
==420467==    at 0x48FB4D8: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==420467==    by 0x484BCE8: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==420467==    by 0x10929A: f (in /net/inria/home/sthibault/test)
==420467==    by 0x484E7D6: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==420467==    by 0x48F6849: start_thread (pthread_create.c:442)
==420467==    by 0x497930F: clone (clone.S:100)
==420467==  Address 0x10c128 is 8 bytes inside data symbol "mutex"
==420467==

immediately on almost all of them; only the AMD EPYC 7352 24-Core Processor
took 1m to reproduce.

On an AMD Opteron(tm) Processor 6174 with glibc 2.28 I couldn't reproduce it
within an hour.

On an Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz, I tried with various glibcs:
2.19, 2.24, 2.28, 2.31, 2.36; with all versions it happens within half an
hour, sometimes within a minute.

On an arm64 with glibc 2.36 it appeared immediately.

On a mips64el Cavium Octeon III V0.2 FPU V0.0 with glibc 2.36 it appeared
immediately.

On a ppc64el POWER8 with glibc 2.36 it appeared within a minute.

On an s390x it appeared in 8 minutes.
https://bugs.kde.org/show_bug.cgi?id=327548

Paul Floyd changed:

           What     |Removed  |Added
             Status |REPORTED |RESOLVED
         Resolution |---      |WORKSFORME

--- Comment #6 from Paul Floyd ---
Reopen this if anyone else can reproduce the problem.
https://bugs.kde.org/show_bug.cgi?id=327548

--- Comment #5 from Paul Floyd ---
I ran this for a couple of hours without any errors. It could be related to
the hardware.
https://bugs.kde.org/show_bug.cgi?id=327548

Paul Floyd changed:

           What |Removed |Added
             CC |        |pjfl...@wanadoo.fr

--- Comment #4 from Paul Floyd ---
I don't get an error on FreeBSD. I'll check if the problem still exists on
Linux.