Re: libc recently more aggressive about pthread locks in stable ?

2016-11-05 Thread Christian Seiler
On 11/05/2016 08:13 PM, Ian Jackson wrote:
> I have just been debugging a ghostscript segfault on jessie amd64.
> 
> Looking at the code, I think that gs in jessie is plainly violating
> the rules about the use of pthread locks.  On my partner's machine,
> this makes it segfault on termination (with some input files, at
> least).  On my machine it works just fine.  The code in sid is better.
> 
> I recently encountered what seems to be a similar bug in ogg123 in
> stable.  #842796.
> 
> Has something changed in jessie's libc recently ?  I find it difficult
> to imagine that these bugs would have been missed earlier during the
> life of jessie.

Recently Frank Fegert discovered a problem with locking in open-iscsi
that only occurs on new hardware. The code previously was wrong, but
earlier CPUs were more forgiving when it came to this error and it
couldn't be triggered.

Frank wrote about the problem in his blog in great detail:
http://www.bityard.org/blog/2016/08/05/debugging_segfaults_open-iscsi_iscsiuio_intel_broadwell

I haven't looked in detail at your problem, but I could easily
imagine that the problem you're experiencing with other packages is
similar, especially since you mentioned migrating to new hardware.

Hope that helps.

Regards,
Christian

PS: In case someone was wondering: the specific problem with
open-iscsi is now fixed in sid, testing and jessie-backports; jessie
is not affected because we didn't yet build the component with the
issue there.



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-05 Thread Ian Jackson
Ian Jackson writes ("libc recently more aggressive about pthread locks in 
stable ?"):
> I have just been debugging a ghostscript segfault on jessie amd64.
...
> I recently encountered what seems to be a similar bug in ogg123 in
> stable.  #842796.
> 
> Has something changed in jessie's libc recently ?  I find it difficult
> to imagine that these bugs would have been missed earlier during the
> life of jessie.
> 
> I will try to make a patch to fix ghostscript, or at least file a
> proper bug.  But, if there was a libc change, would it be possible to
> revert it or make some kind of workaround ?

FYI, the ghostscript bug, with patch for jessie, is #843324.
sid's ghostscript is fine and I think stretch's is too.

Ian.

-- 
Ian Jackson   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-05 Thread Aurelien Jarno
On 2016-11-05 19:13, Ian Jackson wrote:
> I have just been debugging a ghostscript segfault on jessie amd64.
> 
> Looking at the code, I think that gs in jessie is plainly violating
> the rules about the use of pthread locks.  On my partner's machine,
> this makes it segfault on termination (with some input files, at
> least).  On my machine it works just fine.  The code in sid is better.
> 
> I recently encountered what seems to be a similar bug in ogg123 in
> stable.  #842796.
> 
> Has something changed in jessie's libc recently ?  I find it difficult
> to imagine that these bugs would have been missed earlier during the
> life of jessie.

I think you just got a new machine with a CPU supporting the TSX
instructions, which are more picky about following the pthreads
semantics.

Unfortunately, given Intel's fuck-up in the TSX implementation in Haswell
and some Broadwell CPUs, they had to disable the TSX instructions through
firmware updates, which in turn means that not all packages in jessie have
been tested on TSX-enabled hardware by a wide set of people.

Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-05 Thread Henrique de Moraes Holschuh
On Sat, 05 Nov 2016, Ian Jackson wrote:
> Looking at the code, I think that gs in jessie is plainly violating
> the rules about the use of pthread locks.  On my partner's machine,

Per logs from message #15 on bug #842796:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15

SIGSEGV on __lll_unlock_elision is a signature (IME with very high
confidence) of an attempt to unlock an already unlocked lock while
running under hardware lock elision.


Well, unlocking an already unlocked lock is a pthreads API rule
violation, and it is going to crash the process on something that
implements hardware lock elision.

These would be Intel x86 processors with TSX enabled[1] for Debian
8/jessie.  For Debian 9/stretch and for unstable, I believe it also
includes IBM Power8, and s390x systems -- AFAIK they won't forgive an
attempt to unlock an unlocked lock any more than Intel TSX does.

[1] Broadwell-E, Skylake, and later processors, as well as Xeon *v5
processors.  I am not sure if we blacklisted any of the Xeon *v4
or not, and too tired to look their model numbers up right now.

Unfortunately, when hardware lock elision support was added to glibc
upstream, libpthreads was *not* changed to properly assert() this
forbidden condition on the non-hardware-elision codepaths.  Such an
assert() would have given us consistent behavior, thus flushing the bugs
out in the open... at the cost of a performance hit (I have no idea how
severe), and much screaming.

To be fair: it is likely nobody upstream had any idea of just how much
code got libpthreads usage wrong... and we certainly didn't know better
in Debian, either.  Well, now we're going to find out :-(

BTW, AFAIK libpthreads still doesn't have any such assert(), so there's
likely a lot of such buggy code in unstable still.  This is going to
cause trouble for Debian stretch, too.

> Has something changed in jessie's libc recently ?  I find it difficult
> to imagine that these bugs would have been missed earlier during the
> life of jessie.

The required hardware was not widely available at the time, the
knowledge of how hardware lock elision would really behave was sparse
outside of Intel and IBM -- so people either didn't know, or did not
grasp the importance of the fact that the hardware would be utterly
intolerant to something that the old code was too lenient about -- and
libpthreads was not instrumented to compensate for that.

I actually recommended that it would be safer to disable lock elision
for jessie[2]: the sharp-cornered nature of the code in glibc 2.19 scared
me, as did just how messed up the implementation on Intel processors
was at the time.  Unfortunately, I didn't push for it at all: I didn't
know how correct I was at the time[3].

[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195#50

The hard truth is that nobody in Debian knew how deep those murky waters
were at the time[3], and I don't think glibc upstream developers did
either.  So, we limited ourselves in Debian to blacklisting the
processors where Intel (either for sure, or highly likely) screwed it up
beyond repair.

[3] A number of subtle Intel TSX errata were fixed by Skylake and
Broadwell microcode updates, and the latest ones are quite recent.
Meanwhile, the until-then latent (or subtle) broken-locking bugs in
applications/libs are becoming high-hitter crashers as more users get
newer computers, etc.

Anyway, any library or application that hits this issue has broken
locking, plain and simple.

A package crashing from this issue very likely requires a stable update
to fix the locking (which won't always be a trivial fix, either).  That
holds even if we changed libpthreads to disable lock elision support and
the crashes stopped: the locking would still be broken, and therefore
could not be trusted to ensure correct operation at all times.

> I will try to make a patch to fix ghostscript, or at least file a
> proper bug.  But, if there was a libc change, would it be possible to
> revert it or make some kind of workaround ?

If the problem is too widespread and too hard to fix on a large number
of packages, I suppose we could ask the glibc maintainers to consider
disabling hardware lock elision support in stable through a stable
update.

Such a change to glibc would likely require some patches to ensure it
*really* disabled Intel TSX opcode/instruction insertion, but I think we
already ship all of them as part of the Intel TSX blacklist.  The result
would need real-world testing on an up-to-date Skylake box as well as
objdump inspection to ensure *no* TSX-related instructions leaked into
the binaries.

And what should we do about Debian stretch, then?

Some references:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=824191
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800574
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195

-- 
  Henrique Holschuh



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-06 Thread Ben Hutchings
On Sat, 2016-11-05 at 20:32 +0100, Christian Seiler wrote:
> On 11/05/2016 08:13 PM, Ian Jackson wrote:
> > I have just been debugging a ghostscript segfault on jessie amd64.
> > 
> > Looking at the code, I think that gs in jessie is plainly violating
> > the rules about the use of pthread locks.  On my partner's machine,
> > this makes it segfault on termination (with some input files, at
> > least).  On my machine it works just fine.  The code in sid is better.
> > 
> > I recently encountered what seems to be a similar bug in ogg123 in
> > stable.  #842796.
> > 
> > Has something changed in jessie's libc recently ?  I find it difficult
> > to imagine that these bugs would have been missed earlier during the
> > life of jessie.
> 
> Recently Frank Fegert discovered a problem with locking in open-iscsi
> that only occurs on new hardware. The code previously was wrong, but
> earlier CPUs were more forgiving when it came to this error and it
> couldn't be triggered.
> 
> Frank wrote about the problem in his blog in great detail:
> http://www.bityard.org/blog/2016/08/05/debugging_segfaults_open-iscsi_iscsiuio_intel_broadwell
[...]

This is not really a case of older CPUs being 'more forgiving'; they
had no locking operations[*] and nothing to forgive.  However, glibc
uses transactional memory (TSX) on the newer CPUs that implement it,
and that new code does result in the CPU detecting some locking errors.

It's worth noting that TSX is broken in 'Haswell' processors and is
supposed to be disabled via a microcode update.  I don't know whether
glibc avoids using it on these processors if the microcode update is
not applied.  (Linux doesn't appear to hide the feature flags.)

* The LOCK prefix is for 'bus locking' during a single instruction,
i.e. making it atomic.  The CPU can't know what higher-level operation
it's being used for.

Ben.

-- 
Ben Hutchings
The world is coming to an end.  Please log off.





Re: libc recently more aggressive about pthread locks in stable ?

2016-11-06 Thread Henrique de Moraes Holschuh
On Sun, 06 Nov 2016, Ben Hutchings wrote:
> It's worth noting that TSX is broken in 'Haswell' processors and is
> supposed to be disabled via a microcode update.  I don't know whether
> glibc avoids using it on these processors if the microcode update is
> not applied.  (Linux doesn't appear to hide the feature flags.)

It does avoid it.  For glibc libpthreads, Debian has blacklisted Intel
TSX use [in libpthreads] on all of Haswell and much of Broadwell.

But anything else *will* attempt to use it: people query cpuid directly
for these things.  To prevent that, you need a hypervisor that filters
cpuid().
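
As a rough illustration of what "querying cpuid directly" looks like, here is a minimal sketch in C. It assumes GCC or Clang on x86 (for <cpuid.h>); the names tsx_flags, tsx_from_ebx and query_tsx are made up for this example. It only reports what the CPU advertises (HLE is leaf 7 EBX bit 4, RTM is bit 11) and knows nothing about the glibc blacklist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper macros around the CPUID instruction */
#endif

/* Which TSX features the CPU itself advertises (hypothetical helper type). */
struct tsx_flags {
    bool hle;   /* Hardware Lock Elision:           leaf 7, EBX bit 4  */
    bool rtm;   /* Restricted Transactional Memory: leaf 7, EBX bit 11 */
};

/* Pure helper: pick the two TSX bits out of a leaf-7 EBX value. */
static struct tsx_flags tsx_from_ebx(unsigned int ebx)
{
    struct tsx_flags f;
    f.hle = ((ebx >> 4) & 1u) != 0;
    f.rtm = ((ebx >> 11) & 1u) != 0;
    return f;
}

/* Fill *out from CPUID.  Returns false if leaf 7 is unavailable (old CPU,
 * or not an x86 build); in that case both flags are left false. */
static bool query_tsx(struct tsx_flags *out)
{
    assert(out != NULL);
    *out = tsx_from_ebx(0);
#if defined(__x86_64__) || defined(__i386__)
    if (__get_cpuid_max(0, NULL) >= 7) {
        unsigned int eax, ebx, ecx, edx;
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        (void)eax; (void)ecx; (void)edx;
        *out = tsx_from_ebx(ebx);
        return true;
    }
#endif
    return false;
}
```

A caller would declare a struct tsx_flags, pass its address to query_tsx(), and inspect the rtm and hle members. Software doing this bypasses any per-library blacklist, which is the point made above.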

-- 
  Henrique Holschuh



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-06 Thread Jeff Epler
[resending with correct Cc:]

I believe that similar bugs have been afflicting hurd and kfreebsd debian ports
for some time.  In retrospect, it's too bad these reports weren't given more
attention, because it could have made things better for Linux platforms as well.
:-/

see e.g.,
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=671785#48

Jeff



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-06 Thread Ian Jackson
Henrique de Moraes Holschuh writes ("Re: libc recently more aggressive about 
pthread locks in stable ?"):
> Per logs from message #15 on bug #842796:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15
> 
> SIGSEGV on __lll_unlock_elision is a signature (IME with very high
> confidence) of an attempt to unlock an already unlocked lock while
> running under hardware lock elision.

I don't know anything about hardware lock elision...

> Well, unlocking an already unlocked lock is a pthreads API rule
> violation, and it is going to crash the process on something that
> implements hardware lock elision.

... but you are of course correct about this.  I debugged the problem
with ghostscript, and it was indeed violating the pthreads rules.  I
have filed #843324 with a patch for Debian to backport the
corresponding upstream fix.  I don't understand the wider logic in
ghostscript; the bug was in the colour space management code and
occurred when a function was called with two pointer arguments which
were actually aliases of the same colourspace-related data structure.
Converting ghostscript to use recursive mutexes was IMO clearly
correct and fixed the bug.
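
The aliasing situation described above can be sketched roughly as follows. This is not ghostscript code: struct profile, profile_init and profile_link are hypothetical names, and the sketch only shows why a recursive mutex tolerates a function locking "both" arguments when they are the same object. With a default mutex, relocking from the owning thread is undefined behaviour under POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared, lock-protected data structure. */
struct profile {
    pthread_mutex_t lock;
    int refs;
};

/* Initialise the lock as recursive so that its owner may lock it again. */
static int profile_init(struct profile *p)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(p != NULL);
    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0)
        rc = pthread_mutex_init(&p->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    p->refs = 0;
    return rc;
}

/* Takes both locks; a and b may alias.  Real multithreaded code would
 * also need a consistent lock order to avoid AB/BA deadlocks. */
static int profile_link(struct profile *a, struct profile *b)
{
    int rc;

    assert(a != NULL && b != NULL);
    rc = pthread_mutex_lock(&a->lock);
    if (rc != 0)
        return rc;
    rc = pthread_mutex_lock(&b->lock);   /* second lock on the same mutex if a == b */
    if (rc != 0) {
        pthread_mutex_unlock(&a->lock);
        return rc;
    }
    a->refs++;
    b->refs++;
    pthread_mutex_unlock(&b->lock);      /* each lock is matched by one unlock */
    pthread_mutex_unlock(&a->lock);
    return 0;
}
```

Calling profile_link(&p, &p) on an initialised profile locks the recursive mutex twice and unlocks it twice, leaving it free again. Whether recursive mutexes are the right design for a given code base is a separate question.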

> If the problem is too widespread and too hard to fix on a large number
> of packages, I suppose we could ask the glibc maintainers to consider
> disabling hardware lock elision support in stable through a stable
> update.

I think this would be a good idea.

ogg123 and ghostscript are hardly obscure programs.  It's difficult to
know how bad this problem is, but we would like stable to be useful
even on recent hardware.

> And what should we do about Debian stretch, then?

Perhaps we could add the assert you suggest, on non-lock-elision
hardware.  Whether to do that would depend on its performance impact.

TBH I wonder whether we really want to be giving an evidently shonky
codebase boobytrapped mutexes by default.  We could change the default
mutex type to recursive and make all of these bugs go away.

Ian.

-- 
Ian JacksonThese opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-06 Thread Adrian Bunk
On Sun, Nov 06, 2016 at 05:41:34PM -0200, Henrique de Moraes Holschuh wrote:
> On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > It's worth noting that TSX is broken in 'Haswell' processors and is
> > supposed to be disabled via a microcode update.  I don't know whether
> > glibc avoids using it on these processors if the microcode update is
> > not applied.  (Linux doesn't appear to hide the feature flags.)
> 
> It does avoid it.  For glibc libpthreads, Debian has blacklisted Intel
> TSX use [in libpthreads] on all of Haswell and much of Broadwell.
> 
> But anything else *will* attempt to use it, people query cpuid directly
> for these things.  You need a hypervisor that filters cpuid().

All users who are using intel-microcode from non-free instead of running 
outdated microcode with known errata should be OK here?

Running outdated microcode is a bad idea, and no one is making 
Debian-specific workarounds for all the other CPU errata.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-06 Thread Aurelien Jarno
On 2016-11-06 01:12, Henrique de Moraes Holschuh wrote:
> On Sat, 05 Nov 2016, Ian Jackson wrote:
> > Looking at the code, I think that gs in jessie is plainly violating
> > the rules about the use of pthread locks.  On my partner's machine,
> 
> Per logs from message #15 on bug #842796:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15
> 
> SIGSEGV on __lll_unlock_elision is a signature (IME with very high
> confidence) of an attempt to unlock an already unlocked lock while
> running under hardware lock elision.
> 
> 
> Well, unlocking an already unlocked lock is a pthreads API rule
> violation, and it is going to crash the process on something that
> implements hardware lock elision.
> 
> These would be Intel x86 processors with TSX enabled[1] for Debian
> 8/jessie.  For Debian 9/stretch and for unstable, I believe it also
> includes IBM Power8, and s390x systems -- AFAIK they won't forgive an
> attempt to unlock an unlocked lock any more than Intel TSX does.
> 
> [1] Broadwell-E, Skylake, and later processors, as well as Xeon *v5
> processors.  I am not sure if we blacklisted any of the Xeon *v4
> or not, and too tired to look their model numbers up right now.
> 
> Unfortunately, when hardware lock elision support was added to glibc
> upstream, libpthreads was *not* changed to properly assert() this
> forbidden condition on the non-hardware-elision codepaths.  Such an
> assert() would have given us consistent behavior, thus flushing the bugs
> out in the open... at the cost of a performance hit (I have no idea how
> severe), and much screaming.

This has not been done, as it would have a severe performance hit. That
said, error-checking mutexes also exist in glibc, and have been designed
exactly for that, i.e. they trade performance for correctness.
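
For readers unfamiliar with error-checking mutexes, a minimal sketch of how they report the violations discussed in this thread is given below. The helper names are invented for the example; PTHREAD_MUTEX_ERRORCHECK itself is the standard POSIX mutex type. With this type, unlocking a mutex that is not locked returns EPERM, and relocking from the owning thread returns EDEADLK, instead of the behaviour being undefined.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Initialise *m as an error-checking mutex; returns 0 or an errno value. */
static int errorcheck_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(m != NULL);
    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Lock once, unlock twice; returns what the second unlock reported. */
static int second_unlock_result(void)
{
    pthread_mutex_t m;
    int rc = errorcheck_mutex_init(&m);

    if (rc != 0)
        return rc;
    rc = pthread_mutex_lock(&m);
    if (rc == 0)
        rc = pthread_mutex_unlock(&m);
    if (rc == 0)
        rc = pthread_mutex_unlock(&m);   /* the API violation */
    pthread_mutex_destroy(&m);
    return rc;
}

/* Lock twice from the same thread; returns what the second lock reported. */
static int second_lock_result(void)
{
    pthread_mutex_t m;
    int rc = errorcheck_mutex_init(&m);

    if (rc != 0)
        return rc;
    rc = pthread_mutex_lock(&m);
    if (rc == 0) {
        rc = pthread_mutex_lock(&m);     /* self-deadlock attempt */
        pthread_mutex_unlock(&m);        /* release the one lock we do hold */
    }
    pthread_mutex_destroy(&m);
    return rc;
}
```

On glibc, second_unlock_result() returns EPERM and second_lock_result() returns EDEADLK. The cost is that the application has to opt in per mutex and check the return values, which is why this does not help with existing code that uses default mutexes.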

> To be fair: it is likely nobody upstream had any idea of just how much
> code got libpthreads usage wrong... and we certainly didn't know better
> in Debian, either.  Well, now we're going to find out :-(
> 
> BTW, AFAIK libpthreads still doesn't have any such assert(), so there's
> likely a lot of such buggy code in unstable still.  This is going to
> cause trouble for Debian stretch, too.

I don't expect it to be worse than jessie, actually probably better as
some of the bugs have been fixed by the various upstreams in the
meantime. Also remember that TSX is just making the bug more visible. It
means that users without TSX might experience hangs instead. There are
actually two "hang bugs" reported against ghostscript that could be
fixed by fixing the TSX bug.

[...]

> If the problem is too widespread and too hard to fix on a large number
> of packages, I suppose we could ask the glibc maintainers to consider
> disabling hardware lock elision support in stable through a stable
> update.
> 
> Such a change to glibc would likely require some patches to ensure it
> *really* disabled Intel TSX opcode/instruction insertion, but I think we
> already ship all of them as part of the Intel TSX blacklist.  The result
> would need real-world testing on an up-to-date Skylake box as well as
> objdump inspection to ensure *no* TSX-related instructions leaked into
> the binaries.

We can disable lock elision by building glibc without
"--enable-lock-elision". There is no risk that the instructions leak
into the binaries, except of course for static binaries. That said, so
far we are talking about only a few packages. A lot of bugs have already
been fixed during the jessie release cycle; I remember sending patches
for that.

> And what should we do about Debian stretch, then?

As said above, disabling TSX in glibc just hides issues from users. We
should instead try to detect as many bugs as possible (possibly fixing
the corresponding bugs in jessie). One way would be to get a box with
TSX instructions and use it for the reproducible builds and/or the
autopkgtests.

Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net




Re: libc recently more aggressive about pthread locks in stable ?

2016-11-06 Thread Adam Borowski
On Sun, Nov 06, 2016 at 08:02:56PM +, Ian Jackson wrote:
> Henrique de Moraes Holschuh writes ("Re: libc recently more aggressive about 
> pthread locks in stable ?"):
> > And what should we do about Debian stretch, then?
> 
> Perhaps we could add the assert you suggest, on non-lock-elision
> hardware.  Whether to do that would depend on its performance impact.

Idea: what if we added the assert now, for a few months of testing on all
hardware, then drop it during the freeze?

-- 
A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol, 1kg
raspberries, 0.4kg sugar; put into a big jar for 1 month.  Filter out and
throw away the fruits (can dump them into a cake, etc), let the drink age
at least 3-6 months.



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-06 Thread Henrique de Moraes Holschuh
On Sun, 06 Nov 2016, Adrian Bunk wrote:
> On Sun, Nov 06, 2016 at 05:41:34PM -0200, Henrique de Moraes Holschuh wrote:
> > On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > > It's worth noting that TSX is broken in 'Haswell' processors and is
> > > supposed to be disabled via a microcode update.  I don't know whether
> > > glibc avoids using it on these processors if the microcode update is
> > > not applied.  (Linux doesn't appear to hide the feature flags.)
> > 
> > It does avoid it.  For glibc libpthreads, Debian has blacklisted Intel
> > TSX use [in libpthreads] on all of Haswell and much of Broadwell.
> > 
> > But anything else *will* attempt to use it, people query cpuid directly
> > for these things.  You need a hypervisor that filters cpuid().
> 
> All users who are using intel-microcode from non-free instead of running 
> outdated microcode with known errata should be OK here?

Last time I checked, it looked like a yes for Skylake as far as Intel
TSX is concerned.

I don't know about the other processors, such as Broadwell-E.

-- 
  Henrique Holschuh



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-07 Thread Henrique de Moraes Holschuh
On Sun, 06 Nov 2016, Adam Borowski wrote:
> On Sun, Nov 06, 2016 at 08:02:56PM +, Ian Jackson wrote:
> > Henrique de Moraes Holschuh writes ("Re: libc recently more aggressive 
> > about pthread locks in stable ?"):
> > > And what should we do about Debian stretch, then?
> > 
> > Perhaps we could add the assert you suggest, on non-lock-elision
> > hardware.  Whether to do that would depend on its performance impact.
> 
> Idea: what if we added the assert now, for a few months of testing on all
> hardware, then drop it during the freeze?

It doesn't even need to be added to production libc.  A debugger's
version of the package would be enough.

And we only need to instrument x86-64 (amd64), really.  It would be a
test corpus more than large enough, and typically fast enough to handle
whatever performance loss the extra check will cause.

Since this doesn't change lock type, it is far less intrusive than the
current options to root out such bugs.

-- 
  Henrique Holschuh



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-07 Thread Lucas Nussbaum
Hi,

On 06/11/16 at 17:41 -0200, Henrique de Moraes Holschuh wrote:
> On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > It's worth noting that TSX is broken in 'Haswell' processors and is
> > supposed to be disabled via a microcode update.  I don't know whether
> > glibc avoids using it on these processors if the microcode update is
> > not applied.  (Linux doesn't appear to hide the feature flags.)
> 
> It does avoid it.  For glibc libpthreads, Debian has blacklisted Intel
> TSX use [in libpthreads] on all of Haswell and much of Broadwell.
> 
> But anything else *will* attempt to use it, people query cpuid directly
> for these things.  You need a hypervisor that filters cpuid().

How can one know what glibc does on a given CPU? (preferably without
access to the hardware)

I could try to run an archive rebuild on hardware where glibc leverages
TSX to see what happens.

Lucas



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-08 Thread Lucas Nussbaum
On 07/11/16 at 21:52 +0100, Lucas Nussbaum wrote:
> Hi,
> 
> On 06/11/16 at 17:41 -0200, Henrique de Moraes Holschuh wrote:
> > On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > > It's worth noting that TSX is broken in 'Haswell' processors and is
> > > supposed to be disabled via a microcode update.  I don't know whether
> > > glibc avoids using it on these processors if the microcode update is
> > > not applied.  (Linux doesn't appear to hide the feature flags.)
> > 
> > It does avoid it.  For glibc libpthreads, Debian has blacklisted Intel
> > TSX use [in libpthreads] on all of Haswell and much of Broadwell.
> > 
> > But anything else *will* attempt to use it, people query cpuid directly
> > for these things.  You need a hypervisor that filters cpuid().
> 
> How can one know what glibc does on a given CPU? (preferably without
> access to the hardware)

Answering myself, the relevant patch is
https://sources.debian.net/src/glibc/2.24-5/debian/patches/amd64/local-blacklist-for-Intel-TSX.diff/

Lucas



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-08 Thread Henrique de Moraes Holschuh
On Mon, 07 Nov 2016, Lucas Nussbaum wrote:
> On 06/11/16 at 17:41 -0200, Henrique de Moraes Holschuh wrote:
> > On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > > It's worth noting that TSX is broken in 'Haswell' processors and is
> > > supposed to be disabled via a microcode update.  I don't know whether
> > > glibc avoids using it on these processors if the microcode update is
> > > not applied.  (Linux doesn't appear to hide the feature flags.)
> > 
> > It does avoid it.  For glibc libpthreads, Debian has blacklisted Intel
> > TSX use [in libpthreads] on all of Haswell and much of Broadwell.
> > 
> > But anything else *will* attempt to use it, people query cpuid directly
> > for these things.  You need a hypervisor that filters cpuid().
> 
> How can one know what glibc does on a given CPU? (preferably without
> access to the hardware)
> 
> I could try to run an archive rebuild on hardware where glibc leverages
> TSX to see what happens.

IMHO it would be better to instrument the locks in glibc with asserts,
instead.  You could use anything to test for pthread API violations,
then.

That said, if you are going to test Intel TSX for real, you need a
Desktop Skylake-based Core i5/i7 or Xeon E3v5 that reports "RTM" in
/proc/cpuinfo.  Some won't.
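
A quick way to check for the "rtm" flag programmatically is sketched below in C. The function names are invented for this example, and it assumes the x86 Linux layout of /proc/cpuinfo, where each CPU has a line starting with "flags" followed by space-separated feature names. It matches whole words only, so a flag that merely contains "rtm" as a substring is not counted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated word in `line`. */
static int line_has_flag(const char *line, const char *flag)
{
    size_t n;
    const char *p = line;

    assert(line != NULL && flag != NULL);
    n = strlen(flag);
    if (n == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p += n;
    }
    return 0;
}

/* 1 = some CPU lists the flag, 0 = none does, -1 = file could not be read.
 * Assumes a "flags" line fits in the buffer, which holds on current systems. */
static int cpuinfo_has_flag(const char *path, const char *flag)
{
    char buf[8192];
    FILE *f = fopen(path, "r");
    int found = 0;

    if (f == NULL)
        return -1;
    while (!found && fgets(buf, sizeof buf, f) != NULL) {
        if (strncmp(buf, "flags", 5) == 0)
            found = line_has_flag(buf, flag);
    }
    fclose(f);
    return found;
}
```

On a test box one would call cpuinfo_has_flag("/proc/cpuinfo", "rtm"). A result of 1 only says the kernel sees the feature; it does not say whether glibc's blacklist lets libpthread use it on that model.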

Not every Skylake model will have it enabled in the first place, and
apparently the firmware can (and some _do_) disable it, especially on
the mobile side.

Please ensure the Skylake firmware has microcode 0x9d/0x9e or later, or
install the latest version of the non-free intel-microcode package.  The
risk of unpredictable behaviour is quite real otherwise, and could mess
up the test results (and corrupt data).

Skylake errata are a nightmare. Note the AVX, AVX2, eDRAM (L4?), and TSX
ones, as well as the power-management ones:

http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v5-spec-update.pdf
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf

Don't attempt to test TSX with perf or Intel PT running.  perf is likely
to cause too many aborts, and Intel PT is an errata hell.

As for Broadwell, I don't know which processors would still have TSX
enabled in the first place when running the latest microcode, and we
blacklist most of them in glibc anyway (because almost all Broadwell-*
specification updates list it as either unavailable or unusable), so
they're not a very viable option to test this.

-- 
  Henrique Holschuh



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-08 Thread Ian Jackson
Henrique de Moraes Holschuh writes ("Re: libc recently more aggressive about 
pthread locks in stable ?"):
> That said, I am not going to propose any changes to the glibc blacklist
> at this time, unless new information about how well Intel TSX really
> works in Broadwell becomes available.

So you think that the situation with jessie is OK ?

To be fair the only things which I have noticed broken so far are
ghostscript and ogg123 but I'm hardly a typical user.  My footprint
probably involves many fewer multithreaded programs.

Ian.

-- 
Ian Jackson   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-08 Thread Henrique de Moraes Holschuh
On Tue, Nov 8, 2016, at 15:39, Ian Jackson wrote:
> Henrique de Moraes Holschuh writes ("Re: libc recently more aggressive
> about pthread locks in stable ?"):
> > That said, I am not going to propose any changes to the glibc blacklist
> > at this time, unless new information about how well Intel TSX really
> > works in Broadwell becomes available.
> 
> So you think that the situation with jessie is OK ?

I better clarify what I meant, sorry about that.

Please take my comment above as in "I am not going to request that the
Broadwell blacklist in glibc be reduced at this time" just because we
got a microcode update that likely fixes any lingering defects in
Broadwell Intel TSX or properly disables it.  

I.e. my comment was about the correctness of the Intel TSX
implementation in up-to-date Broadwell, and not about the misuse of the
libpthreads API that causes programs with shoddy/broken locking to crash
with a SIGSEGV.

I don't think the situation in jessie is OK in the "things that misuse
libpthreads will crash" sense.  However, without some sort of test run,
we don't know how bad it really is, either.  I fear it might be bad, but
I would love to be pleasantly surprised that people did get libpthreads
locking right most of the time...

-- 
  Henrique de Moraes Holschuh 



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-09 Thread Lucas Nussbaum
On 08/11/16 at 16:01 -0200, Henrique de Moraes Holschuh wrote:
> I fear it might be bad, but
> I would love to be pleasantly surprised that people did get libpthreads
> locking right most of the time...

I wonder if it has been considered to "fix" glibc so that the misuses
that are tolerated without TSX are also tolerated with TSX? Or is that
impossible?

Lucas



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-12 Thread Lucas Nussbaum
On 07/11/16 at 21:52 +0100, Lucas Nussbaum wrote:
> Hi,
> 
> On 06/11/16 at 17:41 -0200, Henrique de Moraes Holschuh wrote:
> > On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > > It's worth noting that TSX is broken in 'Haswell' processors and is
> > > supposed to be disabled via a microcode update.  I don't know whether
> > > glibc avoids using it on these processors if the microcode update is
> > > not applied.  (Linux doesn't appear to hide the feature flags.)
> > 
> > It does avoid it.  For glibc libpthreads, Debian has blacklisted Intel
> > TSX use [in libpthreads] on all of Haswell and much of Broadwell.
> > 
> > But anything else *will* attempt to use it, people query cpuid directly
> > for these things.  You need a hypervisor that filters cpuid().
> 
> How can one know what glibc does on a given CPU? (preferably without
> access to the hardware)
> 
> I could try to run an archive rebuild on hardware where glibc leverages
> TSX to see what happens.

I did an archive rebuild on Amazon EC2 using m4.16xlarge instances, which
use a CPU with TSX enabled.

I've filed bugs for the packages that failed during that rebuild, but
don't fail on m4.large instances:
https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=qa-ftbfs-2016;users=debian...@lists.debian.org

It's not impossible that some of them are caused by problems with
building in parallel, unrelated to TSX.

L.



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-12 Thread Henrique de Moraes Holschuh
Lucas,

Thanks for trying a build run with TSX enabled.

On Sat, 12 Nov 2016, Lucas Nussbaum wrote:
> I did an archive rebuild on Amazon EC2 using m4.16xlarge instances, that
> use a CPU with TSX enabled.

What microcode revision is that Xeon E5-2686 running?

> I've filed bugs for the packages that failed during that rebuild, but
> don't fail on m4.large instances:
> https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=qa-ftbfs-2016;users=debian...@lists.debian.org

We still need that instrumented libc if one is to test applications,
though, as most packages have little in the way of automated regression
test suites.  And people need to test the packages (using the
applications) with such an instrumented libc installed (or running on a
box with TSX active).

-- 
  Henrique Holschuh



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-13 Thread Lucas Nussbaum
On 12/11/16 at 18:51 -0200, Henrique de Moraes Holschuh wrote:
> Lucas,
> 
> Thanks for trying a build run with TSX enabled.
> 
> On Sat, 12 Nov 2016, Lucas Nussbaum wrote:
> > I did an archive rebuild on Amazon EC2 using m4.16xlarge instances, that
> > use a CPU with TSX enabled.
> 
> What microcode revision is that Xeon E5-2686 running?

microcode: CPU0 sig=0x406f1, pf=0x1, revision=0xb14

(That's just on one node. I'm assuming that all nodes had the same
microcode revision, which is probably a reasonable bet)

Lucas



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-14 Thread Gert Wollny
Am Sonntag, den 06.11.2016, 01:12 -0200 schrieb Henrique de Moraes
Holschuh:
> 
> 
> 
> Unfortunately, when hardware lock elision support was added to glibc
> upstream, libpthreads was *not* changed to properly assert() this
> forbidden condition on the non-hardware-elision codepaths.  Such an
> assert() would have given us consistent behavior, thus flushing the
> bugs out in the open... at the cost of a performance hit (I have no
> idea how severe), and much screaming.

An alternative to find problems with (un-)locking should be to compile
the program in question with -fsanitize=thread (on amd64) and run it.

Unfortunately, in current unstable with thread sanitizer one might get
#796246 (at least I had this).

Best, 
Gert 






Re: libc recently more aggressive about pthread locks in stable ?

2016-11-15 Thread Adrian Bunk
On Mon, Nov 14, 2016 at 10:31:18AM +0100, Gert Wollny wrote:
> Am Sonntag, den 06.11.2016, 01:12 -0200 schrieb Henrique de Moraes
> Holschuh:
> > 
> > 
> > 
> > Unfortunately, when hardware lock elision support was added to glibc
> > upstream, libpthreads was *not* changed to properly assert() this
> > forbidden condition on the non-hardware-elision codepaths.  Such an
> > assert() would have given us consistent behavior, thus flushing the
> > bugs out in the open... at the cost of a performance hit (I have no
> > idea how severe), and much screaming.
> 
> An alternative to find problems with (un-)locking should be to compile
> the program in question with -fsanitize=thread (on amd64) and run it.
> 
> Unfortunately, in current unstable with thread sanitizer one might get
> #796246 (at least I had this).

Does "-fsanitize=thread -no-pie" work for you?

> Best, 
> Gert 

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-16 Thread Gert Wollny
Am Dienstag, den 15.11.2016, 18:06 +0200 schrieb Adrian Bunk:
> 
> > Unfortunately, in current unstable with thread sanitizer one might
> > get #796246 (at least I had this).
> 
> Does "-fsanitize=thread -no-pie" work for you?

Indeed, that fixed the problem with g++-6.2 (g++-5.4 doesn't have this
problem).

Best, 
Gert 



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-17 Thread Henrique de Moraes Holschuh
On Wed, Nov 9, 2016, at 06:26, Lucas Nussbaum wrote:
> On 08/11/16 at 16:01 -0200, Henrique de Moraes Holschuh wrote:
> > I fear it might be bad, but
> > I would love to be pleasantly surprised that people did get libpthreads
> > locking right most of the time...
> 
> I wonder if it has been considered to "fix" glibc so that the misuses
> that are tolerated without TSX are also tolerated with TSX? Or is that
> impossible?

AFAIK, the hardware cannot be programmed to tolerate this kind of
programming error.  And I don't think that's a bad thing. Locking bugs
are already subtle enough when the whole deal is fully visible to
software and depends only on trivial atomic operations on machine word
sizes (32-bit on ia32/amd64). Hidden by hardware transactional memory,
they would go from subtle and difficult to debug straight into utterly
nasty hellbug land if the hardware was too permissive about misuse.

One can handle the SIGSEGV and attempt to recover, I suppose -- which is
painful enough to get right, and that assumes such a thing is possible
at all in the first place: we are talking about a threaded application
here -- but that is so very slow, that it is simply not worth it as far
as I am concerned.  Not that I think it would be desirable to do so in
the first place: locking bugs are best fixed, not papered over.

This is an area where KISS is absolutely required, too.  Handling that
SIGSEGV to trigger a safe whole-application exit while saving user data
is one thing, attempting to resume execution from a signal raised while
inside a transactional state that has been aborted(!) is quite another.
This is NOT the kind of thing I would ever trust current and future
processors to always get right.  It reeks of an errata minefield one
should never enter willingly.

The deal with *current* Debian stable is that, if the breakage is too
widespread, we simply might not be able to do the right thing (fix the
real bugs).  IMHO, this is not a valid excuse to paper over the breakage
for unstable (or even the next stable, as far as I am concerned.  I'd
rather delay the release, although it is _not_ clear at this time that
such a thing would be needed).   It is not really about Intel TSX, it is
about broken locking that was *already* causing hard-to-debug issues in
many cases (I believe Ian said ghostscript was already showing hard to
debug hangs in this thread), and that Intel TSX happened to expose.

-- 
  Henrique de Moraes Holschuh 



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-17 Thread Ian Jackson
Henrique de Moraes Holschuh writes ("Re: libc recently more aggressive about 
pthread locks in stable ?"):
> [stuff]

Thanks for your clear explanations.

> The deal with *current* Debian stable is that, if the breakage is too
> widespread, we simply might not be able to do the right thing (fix the
> real bugs).  IMHO, this is not a valid excuse to paper over the breakage
> for unstable (or even the next stable, as far as I am concerned.  I'd
> rather delay the release, although it is _not_ clear at this time that
> such a thing would be needed).   It is not really about Intel TSX, it is
> about broken locking that was *already* causing hard-to-debug issues in
> many cases (I believe Ian said ghostscript was already showing hard to
> debug hangs in this thread), and Intel TSX happened to expose.

Indeed.

(Although the ghostscript bug only took me an hour or two in total
from it going wrong to having a fixed build for the affected machine,
I think probably most maintainers have less experience of this kind of
thing.)

Ian.

-- 
Ian Jackson   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-17 Thread Lucas Nussbaum
On 17/11/16 at 08:31 -0200, Henrique de Moraes Holschuh wrote:
> The deal with *current* Debian stable is that, if the breakage is too
> widespread, we simply might not be able to do the right thing (fix the
> real bugs).

Based on the number of bugs uncovered by my archive rebuild, I'm really
not sure that this class of bugs is widespread and needs to be
special-cased. Relying on users to report problems, and then fixing
them, is probably enough.

L.



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-17 Thread Henrique de Moraes Holschuh
On Sun, Nov 13, 2016, at 14:42, Lucas Nussbaum wrote:
> On 12/11/16 at 18:51 -0200, Henrique de Moraes Holschuh wrote:
> > Thanks for trying a build run with TSX enabled.
> > 
> > On Sat, 12 Nov 2016, Lucas Nussbaum wrote:
> > > I did an archive rebuild on Amazon EC2 using m4.16xlarge instances, that
> > > use a CPU with TSX enabled.
> > 
> > What microcode revision is that Xeon E5-2686 running?
> 
> microcode: CPU0 sig=0x406f1, pf=0x1, revision=0xb14

This is quite outdated, and it is in fact below the minimum recommended
revision for Linux use (as defined by "whatever Intel is distributing in
the latest Linux public microcode distribution" [1]).  I wouldn't trust
it for general production work, unless we're talking a distributed
application built upon resilient nodes and resilient inter-node result
cross-checking.  Kinda cloudy, I suppose :-)

Note that I don't think at all that it invalidates any results you got
about buildability under TSX.  It is just a reminder of yet another
reason to never trust the cloud too much.

It is also a reminder to ensure all of our (Debian) boxes are running
the latest relevant firmware update level and have the intel-microcode
package from stable non-free installed.

[1] Which is, as of 2016-11-04:
sig 0x000406f1, pf_mask 0xef, 2016-10-07, rev 0xb1f, size 25600

In a couple months[2], it will migrate to Debian stable, which currently
has:
sig 0x000406f1, pf_mask 0xef, 2016-06-06, rev 0xb1d, size 25600

[2] I enforce a one-month quarantine in Debian testing before I even
submit a stable p-u request, and that one will typically sit in s-p-u
for a while (after it is approved) until the next stable point release
is out. Anyone that needs it earlier can get it from stable-backports. I
should upload the stable backport of intel-microcode 3.20161104-1 before
this weekend.
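
The revision comparison above can be sketched in a few lines of Python.
This is only an illustration: the function names and the REFERENCE table
are hypothetical, the log line format is the one quoted in this thread,
and it assumes that microcode revisions for one signature can be
compared as plain integers.

```python
import re

# Matches the kernel log line format quoted in this thread, e.g.
# "microcode: CPU0 sig=0x406f1, pf=0x1, revision=0xb14"
LINE_RE = re.compile(
    r"sig=0x([0-9a-fA-F]+), pf=0x([0-9a-fA-F]+), revision=0x([0-9a-fA-F]+)"
)

# Reference revisions for signature 0x406f1, copied from the footnotes
# above. The table layout itself is an assumption made for this sketch.
REFERENCE = {0x406F1: {"latest": 0xB1F, "debian-stable": 0xB1D}}


def parse_microcode_line(line):
    """Return (signature, platform flags, revision) as integers."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("unrecognised microcode line: %r" % line)
    sig, pf, rev = (int(group, 16) for group in match.groups())
    return sig, pf, rev


def microcode_outdated(line, reference="latest"):
    """True if the running revision is below the chosen reference."""
    sig, _pf, rev = parse_microcode_line(line)
    return rev < REFERENCE[sig][reference]
```

Fed the line Lucas posted, microcode_outdated() reports the EC2 node as
behind both reference revisions.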

-- 
  Henrique de Moraes Holschuh 



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-17 Thread Henrique de Moraes Holschuh
On Thu, Nov 17, 2016, at 09:11, Lucas Nussbaum wrote:
> On 17/11/16 at 08:31 -0200, Henrique de Moraes Holschuh wrote:
> > The deal with *current* Debian stable is that, if the breakage is too
> > widespread, we simply might not be able to do the right thing (fix the
> > real bugs).
> 
> Based on the number of bugs uncovered by my archive rebuild, I'm really
> not sure that this class of bugs is widespread, and requires to be
> special-cased. Relying on users to report problems, and then fix them
> is probably enough.

The rebuild was helpful, but it depends on the application/library
regression test suite to detect any application locking issues.  Not
every package has those, or runs those during build :-(

-- 
  Henrique de Moraes Holschuh 



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-17 Thread Adrian Bunk
On Thu, Nov 17, 2016 at 09:28:34AM -0200, Henrique de Moraes Holschuh wrote:
> On Thu, Nov 17, 2016, at 09:11, Lucas Nussbaum wrote:
> > On 17/11/16 at 08:31 -0200, Henrique de Moraes Holschuh wrote:
> > > The deal with *current* Debian stable is that, if the breakage is too
> > > widespread, we simply might not be able to do the right thing (fix the
> > > real bugs).
> > 
> > Based on the number of bugs uncovered by my archive rebuild, I'm really
> > not sure that this class of bugs is widespread, and requires to be
> > special-cased. Relying on users to report problems, and then fix them
> > is probably enough.
> 
> The rebuild was helpful, but it depends on the application/library
> regression test suite to detect any application locking issues.  Not
> every package has those, or runs those during build :-(

But we do already have > 1 year of widespread testing by users
running unstable/testing on machines with TSX enabled.

So for unstable/stretch this does not seem to be a huge problem.

These are normal bugs that should be found and fixed if possible,
just like passing a pointer in an int or many other kinds of bugs.

Or do I miss anything here?

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-17 Thread Henrique de Moraes Holschuh
On Thu, Nov 17, 2016, at 09:50, Adrian Bunk wrote:
> But we do already have > 1 year of widespread testing by users
> running unstable/testing on machines with TSX enabled.
> 
> So for unstable/stretch this does not seem to be a huge problem.
> 
> These are normal bugs that should be found and fixed if possible,
> just like passing a pointer in an int or many other kinds of bugs.
> 
> Or do I miss anything here?

I am not sure we have that much user coverage, given the blacklisting we
did in glibc.  Maybe a search for lock elision bug reports in relevant
Ubuntu releases would help... I think their glibc packages are close
enough to ours that they would have the same issues we do.

The current blacklist in Jessie's glibc is:
((model == 63 && stepping <= 2) || (model == 60 && stepping <= 3) ||
(model == 69 && stepping <= 1) || (model == 70 && stepping <= 1) ||
(model == 61 && stepping <= 4) || (model == 71 && stepping <= 1) ||
(model == 86 && stepping <= 2))

Which should be (ignoring ES/QS steppings, which were also blacklisted):

Haswell:
Signature: 0x306f2, main model: Haswell-E (Xeon E5v3, Core Extreme 4th
gen)
Signature: 0x306c3, main model: Haswell-DT (desktop Core 4th gen)
Signature: 0x40651, main model: Haswell low power (mobile Core  4th gen)
Signature: 0x40661, main model: Haswell "Crystal Well" (Core 4th gen
with eDRAM)

So, almost all of Haswell is blacklisted. The known exception is Xeon
E7v3 (signature 0x306f4), which was not blacklisted because apparently
either Intel TSX works well enough there, or it is never reported as
enabled in the first place.

Broadwell:
Signature: 0x306d4, main model: Broadwell (desktop and mobile Core 5th
gen)
Signature: 0x40671, main model: Broadwell "Brystal Well" (Core 5th gen
with eDRAM, also Xeon E3v4)
Signature: 0x50662, main model: Broadwell-DE (Xeon D-1500, stepping V1)
-- due to BDE42.

Note that newer steppings of Broadwell-DE are not blacklisted (0x50663:
stepping V2, 0x50664: stepping Y0 -- BDE85 fixed by up-to-date
microcode).   Also, Broadwell-EN/EP/EX (Core Extreme 5th gen,  Xeon E5v4
and E7v4) are not blacklisted, either.
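
To check a given signature against the blacklist quoted above, one can
decode it into family/model/stepping the way CPUID leaf 1 lays them out.
The following is a minimal sketch, not the glibc code: the function
names are made up for illustration, and the restriction to family 6 is
an assumption (the quoted condition only shows model and stepping).

```python
def decode_signature(sig):
    """Split a CPUID signature (leaf 1, EAX) into (family, model, stepping)."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # The extended family is only added when the base family is 15.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model bits only count for base families 6 and 15.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


# Highest blacklisted stepping per model, copied from the condition above.
BLACKLIST = {63: 2, 60: 3, 69: 1, 70: 1, 61: 4, 71: 1, 86: 2}


def elision_blacklisted(sig):
    """True if the quoted blacklist condition matches this signature."""
    family, model, stepping = decode_signature(sig)
    # Assumption: the condition is only meant for Intel family 6 parts.
    return family == 6 and model in BLACKLIST and stepping <= BLACKLIST[model]
```

For instance, 0x306f2 decodes to model 63, stepping 2 and matches the
blacklist, while 0x306f4 (Xeon E7v3) and 0x406f1 (model 79, the EC2 Xeon
mentioned earlier in the thread) do not.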

-- 
  Henrique de Moraes Holschuh 



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-17 Thread Adrian Bunk
On Thu, Nov 17, 2016 at 11:38:46AM -0200, Henrique de Moraes Holschuh wrote:
> On Thu, Nov 17, 2016, at 09:50, Adrian Bunk wrote:
> > But we do already have > 1 year of widespread testing by users
> > running unstable/testing on machines with TSX enabled.
> > 
> > So for unstable/stretch this does not seem to be a huge problem.
> > 
> > These are normal bugs that should be found and fixed if possible,
> > just like passing a pointer in an int or many other kinds of bugs.
> > 
> > Or do I miss anything here?
> 
> I am not sure we have that much user coverage, given the blacklisting we
> did in glibc.
>...

Skylake is not blacklisted anywhere, I assume?

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: libc recently more aggressive about pthread locks in stable ?

2016-11-17 Thread Henrique de Moraes Holschuh
On Thu, Nov 17, 2016, at 12:12, Adrian Bunk wrote:
> On Thu, Nov 17, 2016 at 11:38:46AM -0200, Henrique de Moraes Holschuh
> wrote:
> > On Thu, Nov 17, 2016, at 09:50, Adrian Bunk wrote:
> > > But we do already have > 1 year of widespread testing by users
> > > running unstable/testing on machines with TSX enabled.
> > > 
> > > So for unstable/stretch this does not seem to be a huge problem.
> > > 
> > > These are normal bugs that should be found and fixed if possible,
> > > just like passing a pointer in an int or many other kinds of bugs.
> > > 
> > > Or do I miss anything here?
> > 
> > I am not sure we have that much user coverage, given the blacklisting we
> > did in glibc.
> >...
> 
> Skylake is not blacklisted anywhere, I assume?

Not by us, at least.  The processor model, firmware or microcode might
limit the availability of Intel TSX.
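
Whether a given box advertises TSX at all can be read from the "hle" and
"rtm" flags in /proc/cpuinfo. A small sketch, with a hypothetical helper
name; note that it only shows what the CPU reports, not whether glibc
actually uses lock elision on it.

```python
def tsx_flags(cpuinfo_text):
    """Return the subset of {'hle', 'rtm'} found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # All logical CPUs normally report the same flags, so the
            # first "flags" line is taken as representative.
            return {"hle", "rtm"} & set(value.split())
    return set()
```

On a Linux system, passing the contents of /proc/cpuinfo to tsx_flags()
returns an empty set when neither flag is advertised.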

But it took quite a while for out-of-the-box Skylake BIOSes (i.e.
installed to the motherboard in the factory) to actually run Linux well
enough to finish a Debian install run (!).  Actually, it took a while
for it to do so even when the user updated the BIOS from the motherboard
vendor site before attempting to install Debian...

-- 
  Henrique de Moraes Holschuh