Hi,

On 2015-12-07 23:26, Andreas Beckmann wrote:
> Dear libc maintainers,
>
> we recently got a bug report regarding the TSX-NI / lock elision bug in
> combination with the non-free nvidia driver (#807244). Since that is
> supposed to be fixed with the libc in experimental (and now sid as
> well), perhaps you could take a look why this still happens.
> Several forum posts denote that "compiling glibc without
> --enable-lock-elision" works around that issue.

I disagree that it is supposed to be fixed. Intel had a few bugs in their
TSX-NI implementation for Haswell, Broadwell and possibly early versions
of Skylake, and to avoid data loss we have therefore disabled lock
elision for some CPU revisions.

That said, the bugs in the Intel implementation are corner cases, and it
took quite some time for them to be discovered. If your program crashes
reproducibly, it is definitely not an issue with the TSX-NI
implementation. Building glibc without --enable-lock-elision is just a
workaround for the real issue. People are now starting to have CPUs with
a working TSX-NI implementation, which are therefore not blacklisted, and
thus the problem is appearing again.

> A few ideas from my side, but since I don't have the hardware to test, I
> cannot check anything:
> * that specific CPU needs to be blacklisted / is incorrectly whitelisted

As said above, it can't be that.

> * nvidia utilizes a code path in libc that is not covered by the current
> patch (and that code path is not used by any other application)
> * nvidia does call something it shouldn't call directly ... thus
> circumventing the runtime-disabling of the specific routines in libc6

According to the backtrace, the problem is typical of a call to
pthread_mutex_unlock() on a mutex which hasn't been locked with
pthread_mutex_lock() before. Nvidia should fix the bug there.

> * nvidia code does issue the problematic instructions itself (but the
> backtrace points to libc, so this sounds unlikely)
>
> Is there some way to check at runtime how lock elision is handled by
> libc (on a concrete system)?

What do you mean by "how is it handled"?

I have attached a small program which demonstrates the issue. You can use
it to check whether your system is using lock elision or not. Running
this program with ltrace, it is quite easy to see the unlock call on an
already unlocked mutex. I wonder if it's doable to do the same with the
whole Nvidia stack.

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurel...@aurel32.net                 http://www.aurel32.net

/* compile with gcc -o mutex_crash_tsx mutex_crash_tsx.c -lpthread */
#include <pthread.h>

int main()
{
	pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

	pthread_mutex_lock(&m);
	pthread_mutex_unlock(&m);
	/* Deliberate bug: the mutex is no longer locked at this point. */
	pthread_mutex_unlock(&m);
}