llvmbot wrote:

<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-flang-driver

Author: Sairudra More (Saieiei)

<details>
<summary>Changes</summary>

Flang currently lowers internal procedures passed as actual arguments using 
LLVM's `llvm.init.trampoline` / `llvm.adjust.trampoline` intrinsics, which 
require an executable stack. On modern Linux toolchains and security-hardened 
kernels that enforce W^X (Write XOR Execute), this causes link-time failures 
(`ld.lld: error: ... requires an executable stack`) or runtime `SEGV` from NX 
violations.

This patch introduces a runtime trampoline pool that allocates trampolines from 
a dedicated `mmap`'d region instead of the stack. The pool toggles page 
permissions between writable (for patching) and executable (for dispatch), so 
the stack stays non-executable throughout. On macOS, `MAP_JIT` and
`pthread_jit_write_protect_np` are used for the same effect. An i-cache flush 
(`__builtin___clear_cache` on Linux, `sys_icache_invalidate` on macOS) is 
performed after each write→exec transition.

The feature is gated behind a new driver flag, `-fenable-runtime-trampoline` 
(off by default), which threads through the frontend into the 
`BoxedProcedurePass`. When enabled, the pass emits calls to 
`_FortranATrampolineInit`, `_FortranATrampolineAdjust`, and 
`_FortranATrampolineFree` instead of the legacy intrinsics. The legacy path is 
completely untouched when the flag is off.

The pool is a singleton with a fixed capacity (default 1024 slots, overridable 
via `FLANG_TRAMPOLINE_POOL_SIZE`). Each slot is 32 bytes and holds a small 
architecture-specific stub: currently x86-64 (17 bytes, using `r10` as the 
nest/static-chain register) and AArch64 (24 bytes, using `x18`). The 
implementation compiles on all architectures but will crash at runtime with a 
clear diagnostic if trampoline emission is actually attempted on an unsupported 
target. This avoids breaking the flang-rt build on e.g. RISC-V or PPC64.

Freed slots are poisoned (the callee pointer is overwritten with a sentinel) 
and recycled into a freelist, so the pool can sustain long-running programs 
that repeatedly create and destroy closures.

A few design choices worth calling out:

The runtime avoids all C++ runtime dependencies: no `std::mutex`, no `operator 
new`, no function-local statics with hidden guard variables. Locking is via 
flang-rt's own `Lock` / `CriticalSection`, memory is via 
`AllocateMemoryOrCrash` / `FreeMemory`, and the singleton uses explicit 
double-checked locking with a raw pointer. This was done so the trampoline pool 
links cleanly in minimal / freestanding flang-rt configurations.

`_FortranATrampolineFree` calls are inserted immediately before every 
`func.return` in the enclosing host function. This is conservative but 
correct: the trampoline handle cannot outlive the host's stack frame, since 
the closure captures the host's local variables by reference.

The GNU_STACK note is verified via a dedicated integration test 
(`runtime-trampoline-gnustack.f90`) that compiles and links a Fortran program 
using the runtime path, then inspects the ELF with `llvm-readelf` to confirm 
the stack segment is `RW` (not `RWE`).

**Test coverage:**

- `flang/test/Driver/fenable-runtime-trampoline.f90` — flag forwarding (on, 
off, default)
- `flang/test/Fir/boxproc-runtime-trampoline.fir` — FIR-level FileCheck for 
emitted runtime calls
- `flang/test/Lower/runtime-trampoline.f90` — end-to-end lowering
- `flang-rt/test/Driver/runtime-trampoline-gnustack.f90` — GNU_STACK ELF 
verification

Closes #<!-- -->182813

---

Patch is 68.80 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/183108.diff


23 Files Affected:

- (modified) clang/include/clang/Options/Options.td (+5) 
- (modified) clang/lib/Driver/ToolChains/Flang.cpp (+4) 
- (added) flang-rt/include/flang-rt/runtime/trampoline.h (+69) 
- (modified) flang-rt/lib/runtime/CMakeLists.txt (+1) 
- (added) flang-rt/lib/runtime/trampoline.cpp (+424) 
- (added) flang-rt/test/Driver/runtime-trampoline-gnustack.f90 (+45) 
- (modified) flang/include/flang/Frontend/CodeGenOptions.def (+1) 
- (modified) flang/include/flang/Optimizer/Builder/Runtime/RTBuilder.h (+4) 
- (added) flang/include/flang/Optimizer/Builder/Runtime/Trampoline.h (+47) 
- (modified) flang/include/flang/Optimizer/CodeGen/CGPasses.td (+11-5) 
- (modified) flang/include/flang/Optimizer/Passes/CommandLineOpts.h (+1) 
- (modified) flang/include/flang/Optimizer/Passes/Pipelines.h (+2-1) 
- (added) flang/include/flang/Runtime/trampoline.h (+69) 
- (modified) flang/include/flang/Tools/CrossToolHelpers.h (+2) 
- (modified) flang/lib/Frontend/CompilerInvocation.cpp (+4) 
- (modified) flang/lib/Optimizer/Builder/CMakeLists.txt (+1) 
- (added) flang/lib/Optimizer/Builder/Runtime/Trampoline.cpp (+49) 
- (modified) flang/lib/Optimizer/CodeGen/BoxedProcedure.cpp (+272-192) 
- (modified) flang/lib/Optimizer/Passes/CommandLineOpts.cpp (+2) 
- (modified) flang/lib/Optimizer/Passes/Pipelines.cpp (+11-4) 
- (added) flang/test/Driver/fenable-runtime-trampoline.f90 (+15) 
- (added) flang/test/Fir/boxproc-runtime-trampoline.fir (+67) 
- (added) flang/test/Lower/runtime-trampoline.f90 (+41) 


``````````diff
diff --git a/clang/include/clang/Options/Options.td b/clang/include/clang/Options/Options.td
index 4ac812e92e2cb..93c1f2f529e3e 100644
--- a/clang/include/clang/Options/Options.td
+++ b/clang/include/clang/Options/Options.td
@@ -7567,6 +7567,11 @@ defm stack_arrays : BoolOptionWithoutMarshalling<"f", "stack-arrays",
   PosFlag<SetTrue, [], [ClangOption], "Attempt to allocate array temporaries on the stack, no matter their size">,
   NegFlag<SetFalse, [], [ClangOption], "Allocate array temporaries on the heap (default)">>;
 
+defm enable_runtime_trampoline : BoolOptionWithoutMarshalling<"f",
+  "enable-runtime-trampoline",
+  PosFlag<SetTrue, [], [ClangOption], "Use W^X compliant runtime trampoline pool for internal procedures">,
+  NegFlag<SetFalse, [], [ClangOption], "Use stack-based trampolines for internal procedures (default)">>;
+
 defm loop_versioning : BoolOptionWithoutMarshalling<"f", "version-loops-for-stride",
   PosFlag<SetTrue, [], [ClangOption], "Create unit-strided versions of loops">,
    NegFlag<SetFalse, [], [ClangOption], "Do not create unit-strided loops (default)">>;
diff --git a/clang/lib/Driver/ToolChains/Flang.cpp b/clang/lib/Driver/ToolChains/Flang.cpp
index 8425f8fec62a4..e2f04c4725def 100644
--- a/clang/lib/Driver/ToolChains/Flang.cpp
+++ b/clang/lib/Driver/ToolChains/Flang.cpp
@@ -203,6 +203,10 @@ void Flang::addCodegenOptions(const ArgList &Args,
       !stackArrays->getOption().matches(options::OPT_fno_stack_arrays))
     CmdArgs.push_back("-fstack-arrays");
 
+  if (Args.hasFlag(options::OPT_fenable_runtime_trampoline,
+                   options::OPT_fno_enable_runtime_trampoline, false))
+    CmdArgs.push_back("-fenable-runtime-trampoline");
+
   // -fno-protect-parens is the default for -Ofast.
   if (!Args.hasFlag(options::OPT_fprotect_parens,
                     options::OPT_fno_protect_parens,
diff --git a/flang-rt/include/flang-rt/runtime/trampoline.h b/flang-rt/include/flang-rt/runtime/trampoline.h
new file mode 100644
index 0000000000000..3b3ddff7a0587
--- /dev/null
+++ b/flang-rt/include/flang-rt/runtime/trampoline.h
@@ -0,0 +1,69 @@
+//===-- flang-rt/runtime/trampoline.h ----------------------------*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Internal declarations for the W^X-compliant trampoline pool.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef FLANG_RT_RUNTIME_TRAMPOLINE_H_
+#define FLANG_RT_RUNTIME_TRAMPOLINE_H_
+
+#include <cstddef>
+#include <cstdint>
+
+namespace Fortran::runtime::trampoline {
+
+/// Per-trampoline data entry. Stored in a writable (non-executable) region.
+/// Each entry is paired with a trampoline code stub in the executable region.
+struct TrampolineData {
+  const void *calleeAddress;
+  const void *staticChainAddress;
+};
+
+/// Default number of trampoline slots in the pool.
+/// Can be overridden via FLANG_TRAMPOLINE_POOL_SIZE environment variable.
+constexpr std::size_t kDefaultPoolSize = 1024;
+
+/// Size of each trampoline code stub in bytes (platform-specific).
+#if defined(__x86_64__) || defined(_M_X64)
+// x86-64 trampoline stub (see generateStubsX86_64 in trampoline.cpp):
+//   movabsq $tdata_entry, %r11       # materialize the TDATA entry address
+//   movq    8(%r11), %r10            # load static chain into nest register
+//   jmpq    *(%r11)                  # tail-jump to the callee address
+constexpr std::size_t kTrampolineStubSize = 32;
+constexpr int kNestRegister = 10; // %r10 is the nest/static chain register
+#elif defined(__aarch64__) || defined(_M_ARM64)
+// AArch64 trampoline stub (see generateStubsAArch64 in trampoline.cpp):
+//   ldr x17, .Ldata_addr             # PC-relative load of TDATA entry address
+//   ldr x18, [x17, #8]               # load static chain
+//   ldr x17, [x17]                   # load callee address
+//   br  x17
+constexpr std::size_t kTrampolineStubSize = 32;
+constexpr int kNestRegister = 18; // x18 is the platform register
+#elif defined(__powerpc64__) || defined(__ppc64__)
+constexpr std::size_t kTrampolineStubSize = 48;
+constexpr int kNestRegister = 11; // r11
+#else
+// Fallback: generous size
+constexpr std::size_t kTrampolineStubSize = 64;
+constexpr int kNestRegister = 0;
+#endif
+
+/// Alignment requirement for trampoline code stubs.
+constexpr std::size_t kTrampolineAlignment = 16;
+
+} // namespace Fortran::runtime::trampoline
+
+#endif // FLANG_RT_RUNTIME_TRAMPOLINE_H_
diff --git a/flang-rt/lib/runtime/CMakeLists.txt b/flang-rt/lib/runtime/CMakeLists.txt
index 9fa8376e9b99c..d5e89a169255c 100644
--- a/flang-rt/lib/runtime/CMakeLists.txt
+++ b/flang-rt/lib/runtime/CMakeLists.txt
@@ -88,6 +88,7 @@ set(host_sources
   stop.cpp
   temporary-stack.cpp
   time-intrinsic.cpp
+  trampoline.cpp
   unit-map.cpp
 )
 if (TARGET llvm-libc-common-utilities)
diff --git a/flang-rt/lib/runtime/trampoline.cpp b/flang-rt/lib/runtime/trampoline.cpp
new file mode 100644
index 0000000000000..ad6148f36392e
--- /dev/null
+++ b/flang-rt/lib/runtime/trampoline.cpp
@@ -0,0 +1,424 @@
+//===-- lib/runtime/trampoline.cpp -------------------------------*- C++-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// W^X-compliant trampoline pool implementation.
+//
+// This file implements a runtime trampoline pool that maintains separate
+// memory regions for executable code (RX) and writable data (RW).
+//
+// On Linux the code region transitions RW → RX (never simultaneously W+X).
+// On macOS Apple Silicon the code region uses MAP_JIT with per-thread W^X
+// toggling via pthread_jit_write_protect_np, so the mapping permissions
+// include both W and X but hardware enforces that only one is active at
+// a time on any given thread.
+//
+// Architecture:
+//   - Code region (RX): Contains pre-assembled trampoline stubs that load
+//     callee address and static chain from a paired TDATA entry, then jump
+//     to the callee with the static chain in the appropriate register.
+//   - Data region (RW): Contains TrampolineData entries with {callee_address,
+//     static_chain_address} pairs, one per trampoline slot.
+//   - Free list: Tracks available trampoline slots for O(1) alloc/free.
+//
+// Thread safety: Uses Fortran::runtime::Lock (pthreads on POSIX,
+// CRITICAL_SECTION on Windows) — not std::mutex — to avoid C++ runtime
+// library dependence. A single global lock serializes pool operations.
+// This is a deliberate V1 design choice to keep the initial W^X
+// architectural change minimal. Per-thread lock-free pools are deferred
+// to a future optimization patch.
+//
+// AddressSanitizer note: The trampoline code region is allocated via
+// mmap (not malloc/new), so ASan does not track it. The data region
+// and handles are allocated via malloc (through AllocateMemoryOrCrash),
+// which ASan intercepts normally. No special annotations are needed.
+//
+// See flang/docs/InternalProcedureTrampolines.md for design details.
+//
+//===----------------------------------------------------------------------===//
+
+#include "flang/Runtime/trampoline.h"
+#include "flang-rt/runtime/lock.h"
+#include "flang-rt/runtime/memory.h"
+#include "flang-rt/runtime/terminator.h"
+#include "flang-rt/runtime/trampoline.h"
+
+#include <cassert>
+#include <cstdint>
+#include <cstdlib>
+#include <cstring>
+#include <new> // For placement-new only (no operator new/delete dependency)
+
+// Platform-specific headers for memory mapping.
+#if defined(_WIN32)
+#include <windows.h>
+#else
+#include <sys/mman.h>
+#include <unistd.h>
+#endif
+
+// macOS Apple Silicon requires MAP_JIT and pthread_jit_write_protect_np
+// to create executable memory under the hardened runtime.
+#if defined(__APPLE__) && defined(__aarch64__)
+#include <libkern/OSCacheControl.h>
+#include <pthread.h>
+#endif
+
+// Architecture support check. Stub generators exist only for x86-64 and
+// AArch64. On other architectures the file compiles but the runtime API
+// functions crash with a diagnostic if actually called, so that building
+// flang-rt on e.g. RISC-V or PPC64 never fails.
+#if defined(__x86_64__) || defined(_M_X64) || defined(__aarch64__) || \
+    defined(_M_ARM64)
+#define TRAMPOLINE_ARCH_SUPPORTED 1
+#else
+#define TRAMPOLINE_ARCH_SUPPORTED 0
+#endif
+
+namespace Fortran::runtime::trampoline {
+
+/// A handle returned to the caller. Contains enough info to find
+/// both the trampoline stub and its data entry.
+struct TrampolineHandle {
+  void *codePtr; // Pointer to the trampoline stub in the RX region.
+  TrampolineData *dataPtr; // Pointer to the data entry in the RW region.
+  std::size_t slotIndex; // Index in the pool for free-list management.
+};
+
+// Namespace-scope globals following Flang runtime conventions:
+// - Lock is trivially constructible (pthread_mutex_t / CRITICAL_SECTION)
+// - Pool pointer starts null; initialized under lock (double-checked locking)
+class TrampolinePool; // Forward declaration for pointer below.
+static Lock poolLock;
+static TrampolinePool *poolInstance{nullptr};
+
+/// The global trampoline pool.
+class TrampolinePool {
+public:
+  static TrampolinePool &instance() {
+    if (poolInstance) {
+      return *poolInstance;
+    }
+    CriticalSection critical{poolLock};
+    if (poolInstance) {
+      return *poolInstance;
+    }
+    // Allocate pool using malloc + placement new (trivial constructor).
+    Terminator terminator{__FILE__, __LINE__};
+    void *storage = AllocateMemoryOrCrash(terminator, sizeof(TrampolinePool));
+    poolInstance = new (storage) TrampolinePool();
+    return *poolInstance;
+  }
+
+  /// Allocate a trampoline slot and initialize it.
+  TrampolineHandle *allocate(
+      const void *calleeAddress, const void *staticChainAddress) {
+    CriticalSection critical{lock_};
+    ensureInitialized();
+
+    if (freeHead_ == kInvalidIndex) {
+      // Pool exhausted — fixed size by design for V1.
+      // The pool capacity is controlled by FLANG_TRAMPOLINE_POOL_SIZE
+      // (default 1024). Dynamic slab growth can be added in a follow-up
+      // patch if real workloads demonstrate a need for it.
+      Terminator terminator{__FILE__, __LINE__};
+      terminator.Crash("Trampoline pool exhausted (max %zu slots). "
+                       "Set FLANG_TRAMPOLINE_POOL_SIZE to increase.",
+          poolSize_);
+    }
+
+    std::size_t index = freeHead_;
+    freeHead_ = freeList_[index];
+
+    // Initialize the data entry.
+    dataRegion_[index].calleeAddress = calleeAddress;
+    dataRegion_[index].staticChainAddress = staticChainAddress;
+
+    // Create handle using malloc + placement new.
+    Terminator terminator{__FILE__, __LINE__};
+    void *mem = AllocateMemoryOrCrash(terminator, sizeof(TrampolineHandle));
+    auto *handle = new (mem) TrampolineHandle();
+    handle->codePtr =
+        static_cast<char *>(codeRegion_) + index * kTrampolineStubSize;
+    handle->dataPtr = &dataRegion_[index];
+    handle->slotIndex = index;
+
+    return handle;
+  }
+
+  /// Get the callable address of a trampoline.
+  void *getCallableAddress(TrampolineHandle *handle) { return handle->codePtr; }
+
+  /// Free a trampoline slot.
+  void free(TrampolineHandle *handle) {
+    CriticalSection critical{lock_};
+
+    std::size_t index = handle->slotIndex;
+
+    // Poison the data entry so that any dangling call through a freed
+    // trampoline traps immediately. We use a non-null, obviously-invalid
+    // address (~0 - 1, i.e. 0xFFFF...FFFE) so that the resulting fault is
+    // distinguishable from a null-pointer dereference when debugging.
+    dataRegion_[index].calleeAddress = reinterpret_cast<const void *>(
+        static_cast<uintptr_t>(~uintptr_t{0} - 1));
+    dataRegion_[index].staticChainAddress = nullptr;
+
+    // Return slot to free list.
+    freeList_[index] = freeHead_;
+    freeHead_ = index;
+
+    FreeMemory(handle);
+  }
+
+private:
+  static constexpr std::size_t kInvalidIndex = ~std::size_t{0};
+
+  TrampolinePool() = default;
+
+  void ensureInitialized() {
+    if (initialized_)
+      return;
+    initialized_ = true;
+
+    // Check environment variable for pool size override.
+    // Fixed-size pool by design (V1): avoids complexity of dynamic growth
+    // and re-protection of code pages. The default (1024 slots) is
+    // sufficient for typical Fortran programs. Users can override via:
+    //   export FLANG_TRAMPOLINE_POOL_SIZE=4096
+    poolSize_ = kDefaultPoolSize;
+    if (const char *envSize = std::getenv("FLANG_TRAMPOLINE_POOL_SIZE")) {
+      long val = std::strtol(envSize, nullptr, 10);
+      if (val > 0)
+        poolSize_ = static_cast<std::size_t>(val);
+    }
+
+    // Allocate the data region (RW).
+    dataRegion_ = static_cast<TrampolineData *>(
+        std::calloc(poolSize_, sizeof(TrampolineData)));
+    assert(dataRegion_ && "Failed to allocate trampoline data region");
+
+    // Allocate the code region (initially RW for writing stubs, then RX).
+    std::size_t codeSize = poolSize_ * kTrampolineStubSize;
+#if defined(_WIN32)
+    codeRegion_ = VirtualAlloc(
+        nullptr, codeSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
+#elif defined(__APPLE__) && defined(__aarch64__)
+    // macOS Apple Silicon: MAP_JIT is required for pages that will become
+    // executable. Use pthread_jit_write_protect_np to toggle W↔X.
+    codeRegion_ = mmap(nullptr, codeSize, PROT_READ | PROT_WRITE | PROT_EXEC,
+        MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);
+    if (codeRegion_ == MAP_FAILED)
+      codeRegion_ = nullptr;
+    if (codeRegion_) {
+      // Enable writing on this thread (MAP_JIT defaults to execute).
+      pthread_jit_write_protect_np(0); // 0 = writable
+    }
+#else
+    codeRegion_ = mmap(nullptr, codeSize, PROT_READ | PROT_WRITE,
+        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (codeRegion_ == MAP_FAILED)
+      codeRegion_ = nullptr;
+#endif
+    assert(codeRegion_ && "Failed to allocate trampoline code region");
+
+    // Generate trampoline stubs.
+    generateStubs();
+
+    // Flush instruction cache. Required on architectures with non-coherent
+    // I-cache/D-cache (AArch64, PPC, etc.). On x86-64 this is a no-op
+    // but harmless. Without this, AArch64 may execute stale instructions.
+#if defined(__APPLE__) && defined(__aarch64__)
+    // On macOS, use sys_icache_invalidate (from libkern/OSCacheControl.h).
+    sys_icache_invalidate(codeRegion_, codeSize);
+#elif defined(_WIN32)
+    FlushInstructionCache(GetCurrentProcess(), codeRegion_, codeSize);
+#else
+    __builtin___clear_cache(static_cast<char *>(codeRegion_),
+        static_cast<char *>(codeRegion_) + codeSize);
+#endif
+
+    // Make code region executable and non-writable (W^X).
+#if defined(_WIN32)
+    DWORD oldProtect;
+    VirtualProtect(codeRegion_, codeSize, PAGE_EXECUTE_READ, &oldProtect);
+#elif defined(__APPLE__) && defined(__aarch64__)
+    // Switch back to execute-only (MAP_JIT manages per-thread W^X).
+    pthread_jit_write_protect_np(1); // 1 = executable
+#else
+    mprotect(codeRegion_, codeSize, PROT_READ | PROT_EXEC);
+#endif
+
+    // Initialize free list.
+    freeList_ = static_cast<std::size_t *>(
+        std::malloc(poolSize_ * sizeof(std::size_t)));
+    assert(freeList_ && "Failed to allocate trampoline free list");
+
+    for (std::size_t i = 0; i < poolSize_ - 1; ++i)
+      freeList_[i] = i + 1;
+    freeList_[poolSize_ - 1] = kInvalidIndex;
+    freeHead_ = 0;
+  }
+
+  /// Generate platform-specific trampoline stubs in the code region.
+  /// Each stub loads callee address and static chain from its paired
+  /// TDATA entry and jumps to the callee.
+  void generateStubs() {
+#if defined(__x86_64__) || defined(_M_X64)
+    generateStubsX86_64();
+#elif defined(__aarch64__) || defined(_M_ARM64)
+    generateStubsAArch64();
+#else
+    // Unsupported architecture — should never be reached because the
+    // extern "C" API functions guard with TRAMPOLINE_ARCH_SUPPORTED.
+    // Zero-fill as a safety net; the zeroed bytes are never valid stubs.
+    std::memset(codeRegion_, 0, poolSize_ * kTrampolineStubSize);
+#endif
+  }
+
+#if defined(__x86_64__) || defined(_M_X64)
+  /// Generate x86-64 trampoline stubs.
+  ///
+  /// Each stub does:
+  ///   movabsq $dataEntry, %r11         ; load TDATA entry address
+  ///   movq    8(%r11), %r10            ; load static chain -> nest register
+  ///   jmpq    *(%r11)                  ; jump to callee address
+  ///
+  /// Total: 10 + 4 + 3 = 17 bytes, padded to kTrampolineStubSize.
+  void generateStubsX86_64() {
+    auto *code = static_cast<uint8_t *>(codeRegion_);
+
+    for (std::size_t i = 0; i < poolSize_; ++i) {
+      uint8_t *stub = code + i * kTrampolineStubSize;
+
+      // Address of the corresponding TDATA entry.
+      auto dataAddr = reinterpret_cast<uint64_t>(&dataRegion_[i]);
+
+      std::size_t off = 0;
+
+      // movabsq $dataAddr, %r11    (REX.W + B, opcode 0xBB for r11)
+      stub[off++] = 0x49; // REX.WB
+      stub[off++] = 0xBB; // MOV r11, imm64
+      std::memcpy(&stub[off], &dataAddr, 8);
+      off += 8;
+
+      // movq 8(%r11), %r10         (load staticChainAddress into r10)
+      stub[off++] = 0x4D; // REX.WRB
+      stub[off++] = 0x8B; // MOV r/m64 -> r64
+      stub[off++] = 0x53; // ModRM: [r11 + disp8], r10
+      stub[off++] = 0x08; // disp8 = 8
+
+      // jmpq *(%r11)               (jump to calleeAddress)
+      stub[off++] = 0x41; // REX.B
+      stub[off++] = 0xFF; // JMP r/m64
+      stub[off++] = 0x23; // ModRM: [r11], opcode extension 4
+
+      // Pad the rest with INT3 (0xCC) for safety.
+      while (off < kTrampolineStubSize)
+        stub[off++] = 0xCC;
+    }
+  }
+#endif
+
+#if defined(__aarch64__) || defined(_M_ARM64)
+  /// Generate AArch64 trampoline stubs.
+  ///
+  /// Each stub does:
+  ///   ldr x17, .Ldata_addr         ; load TDATA entry address
+  ///   ldr x18, [x17, #8]           ; load static chain -> x18 (nest reg)
+  ///   ldr x17, [x17]               ; load callee address
+  ///   br  x17                      ; jump to callee
+  ///   .Ldata_addr:
+  ///     .quad <address of dataRegion_[i]>
+  ///
+  /// Total: 4*4 + 8 = 24 bytes, padded to kTrampolineStubSize.
+  void generateStubsAArch64() {
+    auto *code = static_cast<uint8_t *>(codeRegion_);
+
+    for (std::size_t i = 0; i < poolSize_; ++i) {
+      auto *stub = reinterpret_cast<uint32_t *>(code + i * kTrampolineStubSize);
+
+      // Address of the corresponding TDATA entry.
+      auto dataAddr = reinterpret_cast<uint64_t>(&dataRegion_[i]);
+
+      // ldr x17, .Ldata_addr (PC-relative load, offset = 4 instructions = 16
+      // bytes) LDR (literal): opc=01, V=0, imm19=(16/4)=4, Rt=17
+      stub[0] = 0x58000091; // ldr x17, #16  (imm19=4, shifted left 2 = 16)
+                            // Encoding: 0101 1000 0000 0000 0000 0000 1001 0001
+
+      // ldr x18, [x17, #8]  (load static chain)
+      // LDR (unsigned offset): size=11, V=0, opc=01, imm12=1(×8), Rn=17, Rt=18
+      stub[1] = 0xF9400632; // ldr x18, [x17, #8]
+
+      // ldr x17, [x17]      (load callee address)
+      // LDR (unsigned offset): size=11, V=0, opc=01, imm12=0, Rn=17, Rt=17
+      stub[2] = 0xF9400231; // ldr x17, [x17, #0]
+
+      // br x17
+      stub[3] = 0xD61F0220; // br x17
+
+      // .Ldata_addr: .quad dataRegion_[i]
+      std::memcpy(&stub[4], &dataAddr, 8);
+
+      // Pad remaining with BRK #0 (trap) for safety.
+      std::size_t usedWords = 4 + 2; // 4 instructions + 1 quad (2 words)
+      for (std::size_t w = usedWords;
+          w < kTrampolineStubSize / sizeof(uint32_t); ++w)
+        stub[w] = 0xD4200000; // brk #0
+    }
+  }
+#endif
+
+  Lock lock_;
+  bool initialized_{false};
+  std::size_t poolSize_{0};
+
+  void *codeRegion_{nullptr}; // RX after initialization
+  TrampolineData *da...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/183108
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
