Re: Thread-local storage issues arose again? Firefox frequently crashes on 10.0 aarch64

2024-04-15 Thread PHO

On 4/14/24 22:03, Taylor R Campbell wrote:


> Thanks for tracking this down -- can you file a PR with this
> information, plus `readelf -d libxul.so' output, so we have a place to
> track possible fixes and pullups?


Filed as lib/58154. I put everything you requested in the PR.


Re: Thread-local storage issues arose again? Firefox frequently crashes on 10.0 aarch64

2024-04-14 Thread Taylor R Campbell
> Date: Sun, 14 Apr 2024 14:07:26 +0900
> From: PHO 
> 
> As I mentioned in 
> http://mail-index.netbsd.org/netbsd-users/2024/04/12/msg030915.html, 
> Firefox tab processes crash very frequently on NetBSD/aarch64 10.0. 
> Building it with PKG_OPTIONS.firefox+=debug-info revealed that when it 
> crashes it segfaults at one of these two places non-deterministically:

Thanks for tracking this down -- can you file a PR with this
information, plus `readelf -d libxul.so' output, so we have a place to
track possible fixes and pullups?

Can you also track down exactly what the definitions of `thread_local'
and `THREAD_LOCAL' are in this code (maybe use cc -E to get the
preprocessor output), just so we don't have to guess?

Can you also grant me access to your binary package set, so I can
examine the binaries and libraries in it?

Can you also find out what firefox is dlopening, if anything?

Can you also try building an ld.elf_so with -DDEBUG -DRTLD_DEBUG, and
run firefox with LD_DEBUG=1 in the environment using this ld.elf_so,
and skim through the output or share it with me?


Re: Thread-local storage issues arose again? Firefox frequently crashes on 10.0 aarch64

2024-04-14 Thread Taylor R Campbell
> Date: Sun, 14 Apr 2024 05:15:31 +
> From: Emmanuel Dreyfus 
> 
> On Sun, Apr 14, 2024 at 02:07:26PM +0900, PHO wrote:
> > As I mentioned in
> > http://mail-index.netbsd.org/netbsd-users/2024/04/12/msg030915.html, Firefox
> > tab processes crash very frequently on NetBSD/aarch64 10.0. Building it with
> > PKG_OPTIONS.firefox+=debug-info revealed that when it crashes it segfaults
> > at one of these two places non-deterministically:
> 
> Not sure it is related, but it could be: I get random SIGSEGVs when
> bulk-building NetBSD-10.0/i386 packages on NetBSD-10.0/amd64 XEN3_DOMU.

This is almost certainly unrelated to the issue on firefox/aarch64
which pho@ narrowed down to a thread-local storage matter; can you
start a separate thread or PR for it, with diagnostic information?


Re: Thread-local storage issues arose again? Firefox frequently crashes on 10.0 aarch64

2024-04-14 Thread Manuel Bouyer
On Sun, Apr 14, 2024 at 05:15:31AM +, Emmanuel Dreyfus wrote:
> On Sun, Apr 14, 2024 at 02:07:26PM +0900, PHO wrote:
> > As I mentioned in
> > http://mail-index.netbsd.org/netbsd-users/2024/04/12/msg030915.html, Firefox
> > tab processes crash very frequently on NetBSD/aarch64 10.0. Building it with
> > PKG_OPTIONS.firefox+=debug-info revealed that when it crashes it segfaults
> > at one of these two places non-deterministically:
> 
> Hello
> 
> Not sure it is related, but it could be: I get random SIGSEGVs when
> bulk-building NetBSD-10.0/i386 packages on NetBSD-10.0/amd64 XEN3_DOMU.

I also get these SIGSEGVs on bare-metal amd64 (it's an i9 with 20 cores) when
running builds with high parallelism (packages or build.sh).
And I also get these firefox crashes.

Since I haven't seen the SIGSEGV on Xen yet, I suspected a hardware issue,
but if others are seeing it too, a software bug becomes more likely.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: Thread-local storage issues arose again? Firefox frequently crashes on 10.0 aarch64

2024-04-13 Thread Emmanuel Dreyfus
On Sun, Apr 14, 2024 at 02:07:26PM +0900, PHO wrote:
> As I mentioned in
> http://mail-index.netbsd.org/netbsd-users/2024/04/12/msg030915.html, Firefox
> tab processes crash very frequently on NetBSD/aarch64 10.0. Building it with
> PKG_OPTIONS.firefox+=debug-info revealed that when it crashes it segfaults
> at one of these two places non-deterministically:

Hello

Not sure it is related, but it could be: I get random SIGSEGVs when 
bulk-building NetBSD-10.0/i386 packages on NetBSD-10.0/amd64 XEN3_DOMU.

-- 
Emmanuel Dreyfus
m...@netbsd.org


Thread-local storage issues arose again? Firefox frequently crashes on 10.0 aarch64

2024-04-13 Thread PHO

Hi,

As I mentioned in 
http://mail-index.netbsd.org/netbsd-users/2024/04/12/msg030915.html, 
Firefox tab processes crash very frequently on NetBSD/aarch64 10.0. 
Building it with PKG_OPTIONS.firefox+=debug-info revealed that when it 
crashes it segfaults at one of these two places non-deterministically:


third_party/rlbox/include/rlbox_noop_sandbox.hpp:


rlbox_noop_sandbox_thread_data* get_rlbox_noop_sandbox_thread_data();
#  define RLBOX_NOOP_SANDBOX_STATIC_VARIABLES() \
    thread_local rlbox::rlbox_noop_sandbox_thread_data \
      rlbox_noop_sandbox_thread_info{ 0, 0 }; \
    namespace rlbox { \
      rlbox_noop_sandbox_thread_data* get_rlbox_noop_sandbox_thread_data() \
      { \
        return &rlbox_noop_sandbox_thread_info; \
      } \
    } \
    static_assert(true, "Enforce semi-colon")

> ...

  template
  auto impl_invoke_with_func_ptr(T_Converted* func_ptr, T_Args&&... params)
  {
#ifdef RLBOX_EMBEDDER_PROVIDES_TLS_STATIC_VARIABLES
    auto& thread_data = *get_rlbox_noop_sandbox_thread_data();
#endif
    auto old_sandbox = thread_data.sandbox; // <-- CRASHES HERE!
    thread_data.sandbox = this;
    auto on_exit =
      detail::make_scope_exit([&] { thread_data.sandbox = old_sandbox; });
    return (*func_ptr)(params...);
  }
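To make the failing access concrete: the macro gives each thread its own rlbox_noop_sandbox_thread_data plus an out-of-line accessor that returns its address. impl_invoke_with_func_ptr() then saves, overwrites and later restores the sandbox field of that per-thread object. Here is a minimal C11 sketch of the same shape, assuming a simplified struct. All names in it (sandbox_thread_data, get_thread_data, invoke_in_sandbox, record_sandbox) are hypothetical, and the explicit restore stands in for detail::make_scope_exit.

```c
#include <stddef.h>

/* Hypothetical, simplified C analogue of rlbox_noop_sandbox_thread_data. */
struct sandbox_thread_data {
    void *sandbox;              /* sandbox currently executing on this thread */
    unsigned int last_callback; /* unused here; mirrors the { 0, 0 } initializer */
};

/* One instance per thread, like the thread_local defined by the macro. */
static _Thread_local struct sandbox_thread_data thread_info = { NULL, 0 };

/* Out-of-line accessor returning the calling thread's instance. */
struct sandbox_thread_data *get_thread_data(void)
{
    return &thread_info;
}

/* Hypothetical analogue of impl_invoke_with_func_ptr: mark `self` as the
 * current sandbox while fn runs, then restore the previous value. */
int invoke_in_sandbox(void *self, int (*fn)(int), int arg)
{
    struct sandbox_thread_data *td = get_thread_data();
    void *old_sandbox = td->sandbox; /* corresponds to the "CRASHES HERE" load */
    td->sandbox = self;
    int result = fn(arg);
    td->sandbox = old_sandbox;       /* what make_scope_exit does in the C++ code */
    return result;
}

/* Example callee that records which sandbox it observed. */
void *seen_sandbox;

int record_sandbox(int x)
{
    seen_sandbox = get_thread_data()->sandbox;
    return x * 2;
}
```

Unlike make_scope_exit, the manual restore in this sketch would be skipped if fn() escaped via longjmp. In the C++ original, the scope guard also covers exceptions.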


media/libjpeg/simd/arm/aarch64/jsimd.c:


static THREAD_LOCAL unsigned int simd_support = ~0;
 JSIMD_FASTST3 | JSIMD_FASTTBL;

> ...

LOCAL(void)
init_simd(void)
{
#ifndef NO_GETENV
  char env[2] = { 0 };
#endif
#if defined(__linux__) || defined(ANDROID) || defined(__ANDROID__)
  int bufsize = 1024; /* an initial guess for the line buffer size limit */
#endif

  if (simd_support != ~0U) // <-- CRASHES HERE!
    return;

  simd_support = 0;
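init_simd() treats ~0U as a per-thread "not probed yet" sentinel. The first call on each thread runs the feature detection and caches the result in the THREAD_LOCAL variable, and later calls return early. Below is a minimal C11 sketch of that pattern. probe_simd_features() is a hypothetical stand-in for the real detection code, and probe_count exists only to show that detection runs once per thread.

```c
/* Sentinel meaning "not probed yet on this thread", as in jsimd.c. */
#define SIMD_UNKNOWN (~0U)

static _Thread_local unsigned int simd_support = SIMD_UNKNOWN;
static _Thread_local unsigned int probe_count = 0; /* illustration only */

/* Hypothetical stand-in for the CPU feature detection in init_simd(). */
static unsigned int probe_simd_features(void)
{
    probe_count++;
    return 1U; /* pretend the SIMD extension is available */
}

static void init_simd(void)
{
    if (simd_support != SIMD_UNKNOWN) /* TLS read matching the crash site */
        return;
    simd_support = probe_simd_features();
}

unsigned int get_simd_support(void)
{
    init_simd();
    return simd_support;
}

unsigned int get_probe_count(void)
{
    return probe_count;
}
```

Because both variables have thread storage duration, a newly created thread starts with the sentinel again and probes on its own. Nothing is shared between threads, so no locking is needed.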


So both of these cases involve TLS, that is, tab processes segfault 
while attempting to access thread-local variables. At run time these 
functions reside in libxul.so, which is dlopen'ed by the main process. I 
recall there were a few issues in TLS handling in the past, but 
riastradh@ fixed them before we branched 10.0, right?


"readelf -r libxul.so" shows no R_AARCH64_TLS_TPR in its relocation 
table but only shows R_AARCH64_TLSDESC, so I believe these variables use 
local-dynamic model. I tried to create a minimal reproducer but it 
didn't crash:

https://gist.github.com/depressed-pho/b6894fdaef94a1b9aa5459b1a2f65590

So I speculated that there was some kind of limit on the size of TLS 
blocks that dlopen(3) could sanely handle, and that libxul.so exceeded 
it. As I mentioned in the previous mail, I modified /usr/pkg/bin/firefox 
based on this speculation:



#!/bin/sh
LD_PRELOAD=/usr/pkg/lib/firefox/libxul.so /usr/pkg/lib/firefox/firefox "$@"


To my surprise this actually worked! Firefox hasn't crashed even once 
since this modification! Help, riastradh@, TLS is convoluted and I have 
nearly zero knowledge about this monstrosity!
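One way to put a number on the size-limit speculation is to print the PT_TLS program header of every object loaded in a process. This could be done, for example, in a small test program that dlopen()s libxul.so and then calls the helper. The sketch assumes dl_iterate_phdr(3) as provided by both NetBSD and glibc; the glibc man page asks for _GNU_SOURCE. report_tls_blocks() is a hypothetical helper. p_memsz covers .tdata plus .tbss, while p_filesz covers only the initialized .tdata part.

```c
#define _GNU_SOURCE /* per the glibc dl_iterate_phdr(3) man page; harmless elsewhere */
#include <link.h>
#include <stddef.h>
#include <stdio.h>

/* Called once per loaded object; `data` points at the largest memsz seen. */
static int tls_block_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    size_t *max_memsz = data;
    (void)size;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        if (info->dlpi_phdr[i].p_type != PT_TLS)
            continue;
        size_t memsz = (size_t)info->dlpi_phdr[i].p_memsz;   /* .tdata + .tbss */
        size_t filesz = (size_t)info->dlpi_phdr[i].p_filesz; /* .tdata only */
        const char *name = (info->dlpi_name && info->dlpi_name[0])
                               ? info->dlpi_name : "(main program)";
        printf("%s: TLS memsz=%zu filesz=%zu\n", name, memsz, filesz);
        if (memsz > *max_memsz)
            *max_memsz = memsz;
    }
    return 0; /* a non-zero return would stop the iteration */
}

/* Print the TLS block size of every loaded object; return the largest. */
size_t report_tls_blocks(void)
{
    size_t max_memsz = 0;
    dl_iterate_phdr(tls_block_cb, &max_memsz);
    return max_memsz;
}
```

Comparing the figure reported for libxul.so with the one for the minimal reproducer's library could help show whether TLS block size is really the variable that matters.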