On 31/03/2019 at 15:19, Aurelien Jarno wrote:
> This bug is very likely a bug present in old glibc versions. It has been
> brought to light when enabling TLS support in openblas and not by a new
> glibc version.
> 
> Right now the bug has been workarounded by disabling TLS support in
> openblas. The way to handle this bug is to write a small testcase that
> can be forwarded upstream. It's not an easy task though.
> 

Hi,

I've made a test case here [0].
I have not tested it against the latest glibc commit,
but it does reproduce the deadlock with glibc 2.28 on Linux.

To run the test case, do this:
```
gcc test_compiler_tls.c -o test_compiler_tls -ldl -g -pthread
gcc test_compiler_tls_lib.c -shared -o test_compiler_tls_lib.so \
 -g -pthread -fPIC
./test_compiler_tls ./test_compiler_tls_lib &
gdb --pid $! -ex 'thr a a bt'
```

This reproduces the deadlock that I found in openblas:
1- The test_thread opens the library, which calls the library's
   constructor
2- The library's constructor creates a thread,
   `thread_that_use_tls_after_sleep`
3- The thread `thread_that_use_tls_after_sleep` sleeps for 100ms (this
   needs to be long enough for dlclose to be called before the sleep
   ends)
4- The test_thread closes the library with dlclose
5- dlclose locks `dl_load_lock` and calls the library's destructor
6- The library's destructor waits for `thread_that_use_tls_after_sleep`
   to finish
7- The `thread_that_use_tls_after_sleep` thread tries to read the TLS
   variable, which causes a call to `__tls_get_addr`
8- `__tls_get_addr` blocks in `tls_get_addr_tail`, which tries to take
   the same `dl_load_lock` that dlclose already holds
9- Nothing happens after that: the dlclose thread holds the lock while
   waiting for the `thread_that_use_tls_after_sleep` thread to finish,
   and that thread is blocked trying to take the same lock, so it never
   exits.
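
As a rough illustration of the lock ordering in these steps (this is not
the code from the gist), the pattern can be simulated with an ordinary
mutex. Everything named here is hypothetical: `fake_load_lock` stands in
for `dl_load_lock`, `fake_tls_access` for the library thread, and
`simulate_dlclose_deadlock` for the closing thread. The sketch assumes a
timed lock is acceptable so that it reports the blocked thread instead
of hanging forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-in for glibc's dl_load_lock. The real lock is
 * internal to the dynamic loader and is recursive, but a second thread
 * still blocks on it in the same way. */
static pthread_mutex_t fake_load_lock = PTHREAD_MUTEX_INITIALIZER;

/* Plays the library thread: it needs the lock that the closing thread
 * already holds. A timed lock is used so the sketch gives up after
 * 200 ms rather than deadlocking for real. */
static void *fake_tls_access(void *arg)
{
    int *timed_out = arg;
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += 200L * 1000 * 1000;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    int rc = pthread_mutex_timedlock(&fake_load_lock, &deadline);
    assert(rc == 0 || rc == ETIMEDOUT);
    if (rc == 0)
        pthread_mutex_unlock(&fake_load_lock);
    *timed_out = (rc == ETIMEDOUT);
    return NULL;
}

/* Plays the closing thread: take the lock, then join the library
 * thread while still holding it, as the destructor does.
 * Returns 1 if the worker could not get the lock, 0 if it could,
 * -1 if the worker thread could not be created. */
int simulate_dlclose_deadlock(void)
{
    pthread_t worker;
    int timed_out = 0;

    pthread_mutex_lock(&fake_load_lock);
    if (pthread_create(&worker, NULL, fake_tls_access, &timed_out) != 0) {
        pthread_mutex_unlock(&fake_load_lock);
        return -1;
    }
    pthread_join(worker, NULL);   /* joined with the lock still held */
    pthread_mutex_unlock(&fake_load_lock);
    return timed_out;
}
```

Calling `simulate_dlclose_deadlock()` from a `main` built with
`-pthread` should return 1, meaning the worker never obtained the lock
while the joining thread held it. With an untimed lock, as in the real
case, both threads would wait forever.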

See [1] for the stacktrace.

Thread 3 is the library's thread created in its constructor and joined
in its destructor.
Thread 2 is the thread that calls dlopen and dlclose.
Thread 1 is a "monitoring" thread that implements a timeout of 10s
(useful if this test needs to run on a CI system).
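
A monitoring timeout of this kind could be sketched as follows. This is
only an assumption about one way to do it, not the gist's code: the
helper `run_with_timeout`, the struct `watched_job`, and the two demo
jobs are hypothetical names. The job runs in its own thread while the
caller waits on a condition variable with a deadline.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical bookkeeping shared by the watcher and the job thread. */
struct watched_job {
    void (*fn)(void);
    pthread_mutex_t lock;
    pthread_cond_t done_cond;
    int done;
};

static void *job_trampoline(void *arg)
{
    struct watched_job *job = arg;

    job->fn();
    pthread_mutex_lock(&job->lock);
    job->done = 1;
    pthread_cond_signal(&job->done_cond);
    pthread_mutex_unlock(&job->lock);
    return NULL;
}

/* Returns 0 if fn finished in time, 1 on timeout, -1 on setup failure. */
int run_with_timeout(void (*fn)(void), long timeout_ms)
{
    struct watched_job *job = malloc(sizeof *job);
    struct timespec deadline;
    pthread_t worker;
    int rc = 0, finished;

    if (job == NULL)
        return -1;
    job->fn = fn;
    job->done = 0;
    pthread_mutex_init(&job->lock, NULL);
    pthread_cond_init(&job->done_cond, NULL);

    /* pthread_cond_timedwait takes an absolute CLOCK_REALTIME deadline. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    if (pthread_create(&worker, NULL, job_trampoline, job) != 0) {
        pthread_cond_destroy(&job->done_cond);
        pthread_mutex_destroy(&job->lock);
        free(job);
        return -1;
    }

    pthread_mutex_lock(&job->lock);
    while (!job->done && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(&job->done_cond, &job->lock, &deadline);
    finished = job->done;
    pthread_mutex_unlock(&job->lock);

    if (!finished) {
        /* The job may still touch *job, so it is leaked on purpose;
         * a test runner would normally exit right after this. */
        pthread_detach(worker);
        return 1;
    }
    pthread_join(worker, NULL);
    pthread_cond_destroy(&job->done_cond);
    pthread_mutex_destroy(&job->lock);
    free(job);
    return 0;
}

/* Demo jobs: one returns at once, one stays busy for 500 ms. */
void demo_quick_job(void) { }
void demo_stuck_job(void)
{
    struct timespec ts = { 0, 500L * 1000 * 1000 };
    nanosleep(&ts, NULL);
}
```

A CI wrapper could call something like
`run_with_timeout(test_body, 10000)` (where `test_body` is a placeholder
for the dlopen/dlclose sequence) and exit with a failure status when it
returns 1, so a deadlocked run is reported instead of blocking the
pipeline.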

Where dlclose locks `dl_load_lock`: [2]
Where tls_get_addr_tail locks `dl_load_lock`: [3]

[0]: https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7
[1]: https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7#file-gdb_stacktrace-txt
[2]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-close.c#L812
[3]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-tls.c#L761

-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F
