Bug#934752: libc6: SEGFAULTs caused by tcache after upgrade to Buster

2019-08-27 Thread Florian Weimer
* Pavel Matěja:

> Sorry for late answer.
>
> On 17. 08. 19 22:18, Florian Weimer wrote:
>> * Pavel Matěja:
>>
>>> The strange means they appear only on 2 servers out of 6.
>>> Servers with Xeon E5606 and Pentium G6950 were running fine while Xeon
>>> E3-1220 v6 produced crashes.
>>> It did not matter if the host Debian was Stretch or Buster.
>> Do you see crashes on stretch as well?  What does the backtrace look
>> like there?

> I newer saw the SEGFAULT when we had Stretch based chroot.
>
> I had just one SEGFAULT on Stretch host but I didn't collect coredumps
> back then.
> Unfortunately the server is already running Buster.
> Since the bug is caused by new libc in chroot I should be able to
> install just kernel from Stretch and wait for the SEGFAULT, right?
> I think the backtrace will be the same anyway.

If I recall correctly, stretch doesn't have the tcache code.  If the
crash happened there as well, it's something else.

>>> SSLv3 and TLS code path looked quite distinct to cause the same problem.
>>> Based on info that SEGFAULTs are related to memory allocation in new
>>> libc and CPU performance I found
>>> http://51.15.138.76/patch/17499/
>>> where Wilco Dijkstra discuss some problems with tcache which "leads to
>>> various crashes in benchtests"
>> I was under the impression that this problem only occurs if one of the
>> tunables has an out-of-bounds value.  Do you set any tunables?

> No, I didn't even know they existed.
> I did not read the libc sources yet so I don't know what does the
> patch actually fixes neither if it helps with my problem.

Then the patch will not help to fix the crash.

(By the way, even if the crash goes away if you use a tunable to disable
the thread cache, it could still be timing-related.  It's definitely
possible that the faster malloc/free implementation exposes pre-existing
data races.)

Thanks,
Florian



Bug#934752: libc6: SEGFAULTs caused by tcache after upgrade to Buster

2019-08-27 Thread Pavel Matěja

Sorry for the late answer.

On 17. 08. 19 22:18, Florian Weimer wrote:
> * Pavel Matěja:
>
>> The strange means they appear only on 2 servers out of 6.
>> Servers with Xeon E5606 and Pentium G6950 were running fine while Xeon
>> E3-1220 v6 produced crashes.
>> It did not matter if the host Debian was Stretch or Buster.
> Do you see crashes on stretch as well?  What does the backtrace look
> like there?

I never saw the SEGFAULT when we had a Stretch-based chroot.

I had just one SEGFAULT on a Stretch host, but I didn't collect coredumps
back then.
Unfortunately the server is already running Buster.
Since the bug is caused by the new libc in the chroot, I should be able to
install just the kernel from Stretch and wait for the SEGFAULT, right?
I think the backtrace will be the same anyway.

>> SSLv3 and TLS code path looked quite distinct to cause the same problem.
>> Based on info that SEGFAULTs are related to memory allocation in new
>> libc and CPU performance I found
>> http://51.15.138.76/patch/17499/
>> where Wilco Dijkstra discuss some problems with tcache which "leads to
>> various crashes in benchtests"
> I was under the impression that this problem only occurs if one of the
> tunables has an out-of-bounds value.  Do you set any tunables?

No, I didn't even know they existed.
I have not read the libc sources yet, so I don't know what the patch
actually fixes, nor whether it helps with my problem.


Pavel Matěja



Bug#934752: libc6: SEGFAULTs caused by tcache after upgrade to Buster

2019-08-17 Thread Florian Weimer
* Pavel Matěja:

> The strange means they appear only on 2 servers out of 6.
> Servers with Xeon E5606 and Pentium G6950 were running fine while Xeon 
> E3-1220 v6 produced crashes.
> It did not matter if the host Debian was Stretch or Buster.

Do you see crashes on stretch as well?  What does the backtrace look
like there?

> SSLv3 and TLS code path looked quite distinct to cause the same problem.
> Based on info that SEGFAULTs are related to memory allocation in new 
> libc and CPU performance I found
> http://51.15.138.76/patch/17499/
> where Wilco Dijkstra discuss some problems with tcache which "leads to 
> various crashes in benchtests"

I was under the impression that this problem only occurs if one of the
tunables has an out-of-bounds value.  Do you set any tunables?



Bug#934752: libc6: SEGFAULTs caused by tcache after upgrade to Buster

2019-08-17 Thread Aurelien Jarno
Hi,

On 2019-08-14 14:50, Pavel Matěja wrote:
> Package: glibc
> Version: 2.28-10:amd64
> 
> Dear Maintainer,
> 
> We are running manually compiled Apache and OpenSSL on Debian servers in
> Debian-based chroots.
> After chroot upgrade from Stretch to Buster we started to see strange
> SEGFAULTs.
> The strange means they appear only on 2 servers out of 6.
> Servers with Xeon E5606 and Pentium G6950 were running fine while Xeon
> E3-1220 v6 produced crashes.
> It did not matter if the host Debian was Stretch or Buster.

[snip]
 
> SSLv3 and TLS code path looked quite distinct to cause the same problem.
> Based on info that SEGFAULTs are related to memory allocation in new libc
> and CPU performance I found
> http://51.15.138.76/patch/17499/
> where Wilco Dijkstra discuss some problems with tcache which "leads to
> various crashes in benchtests"

This patch looks like an early version of the one that has been merged in
glibc 2.29 to fix tunables tcache issues:

https://sourceware.org/bugzilla/show_bug.cgi?id=24531

The patch has been backported to the upstream glibc 2.28 branch:
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=58d2672f64176fcb323859d3bd5240fb1cf8f25c

Once we have the fix reaching unstable and then testing, I'll schedule
an upload to buster with the changes from the upstream glibc 2.28 branch.

> As workaround I tried to
> export GLIBC_TUNABLES=glibc.malloc.tcache_count=0
> in Apache startup script and I saw no SEGFAULT since.
> 
> I have coredumps but they contain production private keys for Apache which I
> can't share and to make things even worse they are 1,6GB each.
> 
> I understand this is heisenbug which you won't be able to reproduce. The CPU
> model dependency is beyond my comprehension.
> I'm curious if you are familiar with the new tcache and if you think if the
> patch in discussion can help.
> I'll try to build libc6 package with it to confirm final solution but I'm
> confused by the patch tree so far.

You can easily build a fixed glibc package this way (provided you have
the glibc build-dependencies, devscripts and git installed):
  apt-get source glibc
  cd glibc-2.28/
  quilt pop -a
  debian/rules update-from-upstream
  dch -i    # set the version you want and add a new changelog entry
  debuild

Regards,
Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net



Bug#934752: libc6: SEGFAULTs caused by tcache after upgrade to Buster

2019-08-14 Thread Pavel Matěja

Package: glibc
Version: 2.28-10:amd64

Dear Maintainer,

We are running manually compiled Apache and OpenSSL on Debian servers in
Debian-based chroots.
After upgrading the chroot from Stretch to Buster we started to see strange
SEGFAULTs.

The strange part is that they appear on only 2 servers out of 6.
Servers with Xeon E5606 and Pentium G6950 were running fine while those
with Xeon E3-1220 v6 produced crashes.

It did not matter whether the host Debian was Stretch or Buster.

I was able to collect coredumps and get backtraces. They look like:
(gdb) bt
#0  tcache_get (tc_idx=0) at malloc.c:2934
#1  __GI___libc_malloc (bytes=3) at malloc.c:3042
#2  0x7fd8cc0961be in CRYPTO_malloc (num=3, file=0x7fd8cc2a548c "ssl/statem/extensions_clnt.c", line=1376) at crypto/mem.c:222
#3  0x7fd8cc26c7b9 in tls_parse_stoc_ec_pt_formats (s=0x7fd8640592d0, pkt=0x7fd864061810, context=256, x=0x0, chainidx=0)
    at ssl/statem/extensions_clnt.c:1376
#4  0x7fd8cc266af5 in tls_parse_extension (s=0x7fd8640592d0, idx=TLSEXT_IDX_ec_point_formats, context=256, exts=0x7fd864061770, x=0x0, chainidx=0)
    at ssl/statem/extensions.c:715
#5  0x7fd8cc266bbb in tls_parse_all_extensions (s=0x7fd8640592d0, context=256, exts=0x7fd864061770, x=0x0, chainidx=0, fin=1)
    at ssl/statem/extensions.c:748
#6  0x7fd8cc2798b6 in tls_process_server_hello (s=0x7fd8640592d0, pkt=0x7fd83cff8440) at ssl/statem/statem_clnt.c:1698
#7  0x7fd8cc277fc7 in ossl_statem_client_process_message (s=0x7fd8640592d0, pkt=0x7fd83cff8440) at ssl/statem/statem_clnt.c:1039
#8  0x7fd8cc275499 in read_state_machine (s=0x7fd8640592d0) at ssl/statem/statem.c:636
#9  0x7fd8cc274f15 in state_machine (s=0x7fd8640592d0, server=0) at ssl/statem/statem.c:434
#10 0x7fd8cc274a1b in ossl_statem_connect (s=0x7fd8640592d0) at ssl/statem/statem.c:250
#11 0x7fd8cc25b098 in SSL_do_handshake (s=0x7fd8640592d0) at ssl/ssl_lib.c:3599
#12 0x7fd8cc257199 in SSL_connect (s=0x7fd8640592d0) at ssl/ssl_lib.c:1653
#13 0x7fd8c957c934 in ssl_io_filter_handshake (filter_ctx=0x7fd85809a090) at ssl_engine_io.c:1243
#14 0x7fd8c957deca in ssl_io_filter_output (f=0x7fd85809a0e8, bb=0x7fd85406b8b0) at ssl_engine_io.c:1760

..

(gdb) bt
#0  tcache_get (tc_idx=0) at malloc.c:2934
#1  __GI___libc_malloc (bytes=16) at malloc.c:3042
#2  0x7fd8cc0961be in CRYPTO_malloc (num=16, file=0x7fd8cc159913 "crypto/bio/bss_mem.c", line=115) at crypto/mem.c:222
#3  0x7fd8cc0961f1 in CRYPTO_zalloc (num=16, file=0x7fd8cc159913 "crypto/bio/bss_mem.c", line=115) at crypto/mem.c:230
#4  0x7fd8cbf9ca0a in mem_init (bi=0x7fd860044130, flags=0) at crypto/bio/bss_mem.c:115
#5  0x7fd8cbf9cb3d in mem_new (bi=0x7fd860044130) at crypto/bio/bss_mem.c:138
#6  0x7fd8cbf9541a in BIO_new (method=0x7fd8cc204980 ) at crypto/bio/bio_lib.c:94
#7  0x7fd8cc2454a3 in ssl3_init_finished_mac (s=0x7fd8600a7be0) at ssl/s3_enc.c:322
#8  0x7fd8cc281eae in tls_setup_handshake (s=0x7fd8600a7be0) at ssl/statem/statem_lib.c:91
#9  0x7fd8cc274ea2 in state_machine (s=0x7fd8600a7be0, server=0) at ssl/statem/statem.c:419
#10 0x7fd8cc274a1b in ossl_statem_connect (s=0x7fd8600a7be0) at ssl/statem/statem.c:250
#11 0x7fd8cc25b098 in SSL_do_handshake (s=0x7fd8600a7be0) at ssl/ssl_lib.c:3599
#12 0x7fd8cc257199 in SSL_connect (s=0x7fd8600a7be0) at ssl/ssl_lib.c:1653
#13 0x7fd8c957c934 in ssl_io_filter_handshake (filter_ctx=0x7fd8580e8b78) at ssl_engine_io.c:1243
#14 0x7fd8c957deca in ssl_io_filter_output (f=0x7fd8580e8bd0, bb=0x55b212b0d518) at ssl_engine_io.c:1760

..
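
Both backtraces fault in tcache_get with tc_idx=0, although the requests
are for 3 and 16 bytes. That is what the size-to-bin mapping would predict.
The following is only a sketch of that arithmetic, assuming the x86_64
constants that glibc 2.28 is believed to use (SIZE_SZ=8, MALLOC_ALIGNMENT=16,
MINSIZE=32). The function name tcache_idx is made up for illustration and
is not a glibc interface.

```shell
# Sketch of the request-size -> tcache bin mapping.
# ASSUMPTION: x86_64 constants (SIZE_SZ=8, MALLOC_ALIGNMENT=16, MINSIZE=32).
# tcache_idx is a hypothetical helper, not part of glibc.
tcache_idx() {
    req=$1
    size_sz=8
    align=16
    minsize=32
    # Chunk size: add the size-field overhead, round up to the alignment,
    # and never go below the minimum chunk size.
    csize=$(( (req + size_sz + align - 1) / align * align ))
    if [ "$csize" -lt "$minsize" ]; then
        csize=$minsize
    fi
    # Bins are spaced one alignment unit apart, starting at MINSIZE.
    echo $(( (csize - minsize + align - 1) / align ))
}
```

Under these assumptions `tcache_idx 3` and `tcache_idx 16` both give 0, and
25 is the first request size that lands in bin 1. Two unrelated OpenSSL call
paths would therefore share one bin, which is at least consistent with a
single damaged free list being hit from both.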

The SSLv3 and TLS code paths looked too distinct to cause the same problem.
Based on the information that the SEGFAULTs are related to memory
allocation in the new libc and to CPU performance, I found

http://51.15.138.76/patch/17499/
where Wilco Dijkstra discusses some problems with tcache which "leads to
various crashes in benchtests"


As a workaround I added
export GLIBC_TUNABLES=glibc.malloc.tcache_count=0
to the Apache startup script and I have seen no SEGFAULT since.
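
To limit the workaround to one process instead of exporting the variable
for everything the startup script launches, the tunable can be passed
through env. This is a minimal sketch: run_without_tcache and show_tunables
are hypothetical helper names, and show_tunables assumes a Linux /proc.

```shell
# Run a single command with the thread cache disabled, leaving the
# caller's environment untouched. Hypothetical helper name.
run_without_tcache() {
    env GLIBC_TUNABLES=glibc.malloc.tcache_count=0 "$@"
}

# Print the GLIBC_TUNABLES entry from the environment a running process
# was started with, or nothing if it is not set.
# ASSUMPTION: Linux /proc; $1 is the process ID. Hypothetical helper name.
show_tunables() {
    tr '\0' '\n' < "/proc/$1/environ" | grep '^GLIBC_TUNABLES='
}
```

For example, `run_without_tcache sh -c 'echo "$GLIBC_TUNABLES"'` should print
the tunable string, and running `show_tunables` with the PID of a worker
shows whether the setting actually reached that process.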

I have coredumps, but they contain production private keys for Apache, so
I can't share them, and to make things worse they are 1.6 GB each.


I understand this is a heisenbug which you won't be able to reproduce. The
CPU model dependency is beyond my comprehension.
I'm curious whether you are familiar with the new tcache and whether you
think the patch under discussion can help.
I'll try to build a libc6 package with it to confirm the final solution, but
I'm confused by the patch tree so far.


-- System Information:
Debian Release: Buster
Architecture: amd64 (x86_64)
Kernel: Linux 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64 GNU/Linux


diff --git a/malloc/malloc.c b/malloc/malloc.c
index 801ba1f499b566e677b763fc84f8ba86f4f7ccd0..4db7283cc27118cd7d39410febf7be8f7633780a 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -2915,10 +2915,12 @@ typedef struct tcache_entry
    time), this is for performance reasons.  */
 typedef struct tcache_perthread_