On Sat, 05 Nov 2016, Ian Jackson wrote:
> Looking at the code, I think that gs in jessie is plainly violating
> the rules about the use of pthread locks.  On my partner's machine,

Per logs from message #15 on bug #842796:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15

SIGSEGV on __lll_unlock_elision is a signature (IME with very high
confidence) of an attempt to unlock an already unlocked lock while
running under hardware lock elision.

Well, unlocking an already unlocked lock is a pthreads API rule
violation, and it is going to crash the process on anything that
implements hardware lock elision.

These would be Intel x86 processors with TSX enabled[1] for Debian
8/jessie.  For Debian 9/stretch and for unstable, I believe it also
includes IBM Power8 and s390x systems -- AFAIK they won't forgive an
attempt to unlock an unlocked lock any more than Intel TSX does.

[1] Broadwell-E, Skylake, and later processors, as well as Xeon *v5
    processors.  I am not sure if we blacklisted any of the Xeon *v4
    or not, and I am too tired to look their model numbers up right
    now.

Unfortunately, when hardware lock elision support was added to glibc
upstream, libpthread was *not* changed to properly assert() this
forbidden condition on the non-hardware-elision codepaths.

Such an assert() would have given us consistent behavior, thus
flushing the bugs out in the open... at the cost of a performance hit
(I have no idea how severe), and much screaming.

To be fair: it is likely nobody upstream had any idea of just how much
code got libpthread usage wrong... and we certainly didn't know better
in Debian, either.  Well, now we're going to find out :-(

BTW, AFAIK libpthread still doesn't have any such assert(), so there's
likely a lot of such buggy code in unstable still.  This is going to
cause trouble for Debian stretch, too.

> Has something changed in jessie's libc recently ?  I find it difficult
> to imagine that these bugs would have been missed earlier during the
> life of jessie.

The required hardware was not widely available at the time, and the
knowledge of how hardware lock elision would really behave was sparse
outside of Intel and IBM -- so people either didn't know, or did not
grasp the importance of the fact that the hardware would be utterly
intolerant of something that the old code was too lenient about -- and
libpthread was not instrumented to compensate for that.

I actually recommended that it would be safer to disable lock elision
for jessie[2]: the sharp-corners nature of the code in glibc 2.19
scared me, as did just how messed up the implementation on Intel
processors was at the time.  Unfortunately, I didn't push for it at
all: I didn't know how correct I was at the time[3].

[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195#50

The hard truth is that nobody in Debian knew how deep those murky
waters were at the time[3], and I don't think glibc upstream
developers did either.  So, we limited ourselves in Debian to
blacklisting the processors where Intel (either for sure, or highly
likely) screwed it up beyond repair.

[3] A number of subtle Intel TSX errata were fixed by Skylake and
    Broadwell microcode updates, and the latest ones are quite recent.
    The until-then latent (or subtle) broken-locking bugs in
    applications/libs are now becoming high-hitter crashers as more
    users get newer computers, etc.

Anyway, any library or application that hits this issue has broken
locking, plain and simple.

A package crashing from this issue very likely requires a stable
update to fix the locking (which won't always be a trivial fix,
either).  That holds even if we changed libpthread to disable lock
elision support and it stopped the crashes: the locking would still be
broken, and therefore suspect of not being as effective as it has to
be to ensure correct operation at all times.

> I will try to make a patch to fix ghostscript, or at least file a
> proper bug.
> But, if there was a libc change, would it be possible to
> revert it or make some kind of workaround ?

If the problem is too widespread and too hard to fix on a large number
of packages, I suppose we could ask the glibc maintainers to consider
disabling hardware lock elision support in stable through a stable
update.

Such a change to glibc would likely require some patches to ensure it
*really* disabled Intel TSX opcode/instruction insertion, but I think
we already ship all of them as part of the Intel TSX blacklist.  The
result would need real-world testing on an up-to-date Skylake box, as
well as objdump inspection to ensure *no* TSX-related instructions
leaked into the binaries.

And what should we do about Debian stretch, then?

Some references:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=824191
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800574
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195

-- 
Henrique Holschuh