https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109164
Bug ID: 109164
Summary: aarch64 thread_local initialization error with -ftree-pre and -foptimize-sibling-calls
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: loganh at synopsys dot com
Target Milestone: ---

Created attachment 54687
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54687&action=edit
Bash script that reproduces the issue

With -ftree-pre, -foptimize-sibling-calls, and -O1 enabled on
aarch64-linux-gnu, GCC 12.1.0 can generate code that accesses parts of a
thread_local variable before the corresponding TLS init function is called,
if the variable is accessed from a different TU than the one it is defined in.

This reordering could likely cause a number of different issues, but the one
that I've run into is that:

- When the compiler generates code to call a virtual function on a reference
  to a global thread_local instance of an object defined in a different
  translation unit, and
- The function calls itself in at least one branch,

the address of the object is fetched from TLS before it's initialized, and
when the vtable lookup is attempted on that object to call the virtual
function, the program segfaults.

Here's an example of the kind of code that will trip it up:

    struct Struct {
      virtual void virtual_func();
    };

    extern thread_local Struct& thread_local_ref;

    bool other_func(void);

    bool test_func(void) {
      thread_local_ref.virtual_func();
      return other_func() && test_func();
    }

When this is compiled (on aarch64-linux-gnu, with -O1, -ftree-pre, and
-foptimize-sibling-calls) to an object file and then dumped with
objdump -C -d, this is the code produced:

    0000000000000000 <test_func()>:
       0:   a9be7bfd    stp   x29, x30, [sp, #-32]!
       4:   910003fd    mov   x29, sp
       8:   a90153f3    stp   x19, x20, [sp, #16]
       c:   90000000    adrp  x0, 0 <thread_local_ref>
      10:   f9400000    ldr   x0, [x0]
      14:   d53bd041    mrs   x1, tpidr_el0
      18:   f8606834    ldr   x20, [x1, x0]
      1c:   90000013    adrp  x19, 0 <TLS init function for thread_local_ref>
      20:   f9400273    ldr   x19, [x19]
      24:   b4000053    cbz   x19, 2c <test_func()+0x2c>
      28:   94000000    bl    0 <TLS init function for thread_local_ref>
      2c:   f9400280    ldr   x0, [x20]
      30:   f9400001    ldr   x1, [x0]
      34:   aa1403e0    mov   x0, x20
      38:   d63f0020    blr   x1
      3c:   94000000    bl    0 <other_func()>
      40:   12001c00    and   w0, w0, #0xff
      44:   35ffff00    cbnz  w0, 24 <test_func()+0x24>
      48:   a94153f3    ldp   x19, x20, [sp, #16]
      4c:   a8c27bfd    ldp   x29, x30, [sp], #32
      50:   d65f03c0    ret

Looking at addresses 0x14 through 0x18, you can see that the address of
'thread_local_ref' is read from the TLS block for the thread; the first time
this function is called, this will result in register x20 containing zero,
since the TLS block isn't initialized until the function call at 0x28.
Directly after that, at location 0x2c, a read is attempted from the address in
register x20 (zero), causing a segfault.

Without -ftree-pre and -foptimize-sibling-calls, and without letting
`test_func` call itself on at least one path, the code to get the address of
`thread_local_ref` is generated after the TLS init call, so the problem does
not occur.

I've attached a script that will reproduce what I've shown here, as well as
demonstrate the issue in action with a full executable that will produce the
segfault I've described.
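
To make the ordering problem easier to see without reading the assembly, here
is a minimal single-file sketch that models the two orderings by hand. It is
not the code GCC generates and it is not taken from the attached script. The
names tls_slot, tls_initialized, make_instance, tls_init, load_after_init, and
load_before_init are hypothetical stand-ins, and virtual_func returns an int
here (unlike in the report) only so that the result can be checked.

```cpp
struct Struct {
  virtual ~Struct() = default;
  // Returns a value only so the call can be checked; the report's version is void.
  virtual int virtual_func() { return 42; }
};

// Hypothetical model of the per-thread slot that backs thread_local_ref.
// It is zero until the init function has run, as described for x20 above.
thread_local Struct* tls_slot = nullptr;
thread_local bool tls_initialized = false;

// Hypothetical per-thread object that the reference is bound to.
Struct& make_instance() {
  thread_local Struct instance;
  return instance;
}

// Hypothetical stand-in for "TLS init function for thread_local_ref".
void tls_init() {
  if (!tls_initialized) {
    tls_slot = &make_instance();
    tls_initialized = true;
  }
}

// Expected order: run the init function first, then read the slot.
Struct* load_after_init() {
  tls_init();
  return tls_slot;
}

// Order seen in the disassembly: the slot is read (0x18) before the init
// call (0x28), so the first call on a thread yields a null pointer.
Struct* load_before_init() {
  Struct* p = tls_slot;
  tls_init();
  return p;
}
```

In this model, the first call to load_before_init() on a thread returns a null
pointer, which corresponds to the zero in x20 that is dereferenced at 0x2c,
while load_after_init() always returns a usable object.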