[openmp] [clang] [llvm] [OpenMP] Rework handling of global ctor/dtors in OpenMP (PR #71739)

2023-11-10 Thread Joseph Huber via cfe-commits


@@ -1038,6 +1048,109 @@ struct CUDADeviceTy : public GenericDeviceTy {
   using CUDAStreamManagerTy = GenericDeviceResourceManagerTy;
   using CUDAEventManagerTy = GenericDeviceResourceManagerTy;
 
+  Error callGlobalCtorDtorCommon(GenericPluginTy , DeviceImageTy ,
+ bool IsCtor) {
+// Perform a quick check for the named kernel in the image. The kernel
+// should be created by the 'nvptx-lower-ctor-dtor' pass.
+GenericGlobalHandlerTy  = Plugin.getGlobalHandler();
+GlobalTy Global(IsCtor ? "nvptx$device$init" : "nvptx$device$fini",
+sizeof(void *));
+if (auto Err = Handler.getGlobalMetadataFromImage(*this, Image, Global)) {
+  consumeError(std::move(Err));
+  return Plugin::success();
+}
+
+// The Nvidia backend cannot handle creating the ctor / dtor array
+// automatically so we must create it ourselves. The backend will emit
+// several globals that contain function pointers we can call. These are
+// prefixed with a known name due to Nvidia's lack of section support.
+const ELF64LEObjectFile *ELFObj =
+Handler.getOrCreateELFObjectFile(*this, Image);
+if (!ELFObj)
+  return Plugin::error("Unable to create ELF object for image %p",
+   Image.getStart());
+
+// Search for all symbols that contain a constructor or destructor.
+SmallVector> Funcs;
+for (ELFSymbolRef Sym : ELFObj->symbols()) {
+  auto NameOrErr = Sym.getName();
+  if (!NameOrErr)
+return NameOrErr.takeError();
+
+  if (!NameOrErr->starts_with(IsCtor ? "__init_array_object_"
+ : "__fini_array_object_"))
+continue;
+
+  uint16_t priority;
+  if (NameOrErr->rsplit('_').second.getAsInteger(10, priority))
+return Plugin::error("Invalid priority for constructor or destructor");
+
+  Funcs.emplace_back(*NameOrErr, priority);
+}
+
+// Sort the created array to be in priority order.
+llvm::sort(Funcs, [=](auto x, auto y) { return x.second < y.second; });
+
+// Allocate a buffer to store all of the known constructor / destructor
+// functions in so we can iterate them on the device.
+void *Buffer =
+allocate(Funcs.size() * sizeof(void *), nullptr, TARGET_ALLOC_SHARED);

jhuber6 wrote:

It's much more convenient than copying over the buffer. `SHARED` in CUDA 
context would be "migratable" memory without async access AFAIK. So this will 
most likely just invoke a migration once it's accessed. Unsure if that's slower 
or faster than waiting on an explicit memcpy. 

https://github.com/llvm/llvm-project/pull/71739
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[openmp] [clang] [llvm] [OpenMP] Rework handling of global ctor/dtors in OpenMP (PR #71739)

2023-11-10 Thread Johannes Doerfert via cfe-commits


@@ -313,12 +313,18 @@ static void 
registerGlobalCtorsDtorsForImage(__tgt_bin_desc *Desc,
 DP("Adding ctor " DPxMOD " to the pending list.\n",
DPxPTR(Entry->addr));
 Device.PendingCtorsDtors[Desc].PendingCtors.push_back(Entry->addr);
+MESSAGE("Calling deprecated constructor for entry %s will be removed "

jdoerfert wrote:

Add "WARNING: ", or similar.

https://github.com/llvm/llvm-project/pull/71739
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[openmp] [clang] [llvm] [OpenMP] Rework handling of global ctor/dtors in OpenMP (PR #71739)

2023-11-09 Thread Matt Arsenault via cfe-commits


@@ -2794,6 +2794,14 @@ void ItaniumCXXABI::registerGlobalDtor(CodeGenFunction 
, const VarDecl ,
   if (D.isNoDestroy(CGM.getContext()))
 return;
 
+  // OpenMP offloading supports C++ constructors and destructors but we do not
+  // always have 'atexit' available. Instead lower these to use the LLVM global
+  // destructors which we can handle directly in the runtime.
+  if (CGM.getLangOpts().OpenMP && CGM.getLangOpts().OpenMPIsTargetDevice &&
+  !D.isStaticLocal() &&
+  (CGM.getTriple().isAMDGPU() || CGM.getTriple().isNVPTX()))

arsenm wrote:

Oh look, it's both of my favorite patterns. Can you refine this into something 
better than language X | language Y and AMDGPU || PTX 

https://github.com/llvm/llvm-project/pull/71739
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[openmp] [clang] [llvm] [OpenMP] Rework handling of global ctor/dtors in OpenMP (PR #71739)

2023-11-09 Thread Jan Patrick Lehr via cfe-commits

https://github.com/jplehr commented:

I have only briefly looked at the NVPTX implementation.

https://github.com/llvm/llvm-project/pull/71739
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[openmp] [clang] [llvm] [OpenMP] Rework handling of global ctor/dtors in OpenMP (PR #71739)

2023-11-08 Thread Joseph Huber via cfe-commits

https://github.com/jhuber6 updated 
https://github.com/llvm/llvm-project/pull/71739

>From 5e378ae3efdebedb044528167131c8cae4571a59 Mon Sep 17 00:00:00 2001
From: Joseph Huber 
Date: Tue, 7 Nov 2023 17:12:31 -0600
Subject: [PATCH] [OpenMP] Rework handling of global ctor/dtors in OpenMP

Summary:
This patch reworks how we handle global constructors in OpenMP.
Previously, we emitted individual kernels that were all registered and
called individually. In order to provide more generic support, this
patch moves all handling of this to the target backend and the runtime
plugin. This has the benefit of supporting the GNU extensions for
constructors an destructors, removing a class of failures related to
shared library destruction order, and allows targets other than OpenMP
to use the same support without needing to change the frontend.

This is primarily done by calling kernels that the backend emits to
iterate a list of ctor / dtor functions. For x64, this is automatic and
we get it for free with the standard `dlopen` handling. For AMDGPU, we
emit `amdgcn.device.init` and `amdgcn.device.fini` functions which
handle everything atuomatically and simply need to be called. For NVPTX,
a patch https://github.com/llvm/llvm-project/pull/71549 provides the
kernels to call, but the runtime needs to set up the array manually by
pulling out all the known constructor / destructor functions.

One concession that this patch requires is the change that for GPU
targets in OpenMP offloading we will use `llvm.global_dtors` instead of
using `atexit`. This is because `atexit` is a separate runtime function
that does not mesh well with the handling we're trying to do here. This
should be equivalent in all cases except for cases where we would need
to destruct manually such as:

```
struct S { ~S() { foo(); } };
void foo() {
  static S s;
}
```

However this is broken in many other ways on the GPU, so it is not
regressing any support, simply increasing the scope of what we can
handle.

This changes the handling of ctors / dtors. This patch now outputs a
information message regarding the deprecation if the old format is used.
This will be completely removed in a later release.

Depends on: https://github.com/llvm/llvm-project/pull/71549
---
 clang/lib/CodeGen/CGDeclCXX.cpp   |  13 +-
 clang/lib/CodeGen/CGOpenMPRuntime.cpp | 130 --
 clang/lib/CodeGen/CGOpenMPRuntime.h   |   8 --
 clang/lib/CodeGen/CodeGenFunction.h   |   5 +
 clang/lib/CodeGen/CodeGenModule.h |  14 +-
 clang/lib/CodeGen/ItaniumCXXABI.cpp   |   8 ++
 .../amdgcn_openmp_device_math_constexpr.cpp   |  48 +--
 .../amdgcn_target_global_constructor.cpp  |  30 ++--
 clang/test/OpenMP/declare_target_codegen.cpp  |   1 -
 ...x_declare_target_var_ctor_dtor_codegen.cpp |  35 +
 .../llvm/Frontend/OpenMP/OMPIRBuilder.h   |   4 -
 llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp |   7 +-
 .../plugins-nextgen/amdgpu/src/rtl.cpp|  52 +++
 .../common/PluginInterface/GlobalHandler.h|  10 +-
 .../PluginInterface/PluginInterface.cpp   |   7 +
 .../common/PluginInterface/PluginInterface.h  |  14 ++
 .../plugins-nextgen/cuda/src/rtl.cpp  | 115 
 openmp/libomptarget/src/rtl.cpp   |   6 +
 18 files changed, 291 insertions(+), 216 deletions(-)

diff --git a/clang/lib/CodeGen/CGDeclCXX.cpp b/clang/lib/CodeGen/CGDeclCXX.cpp
index 3fa28b343663f61..e08a1e5f42df20c 100644
--- a/clang/lib/CodeGen/CGDeclCXX.cpp
+++ b/clang/lib/CodeGen/CGDeclCXX.cpp
@@ -327,6 +327,15 @@ void CodeGenFunction::registerGlobalDtorWithAtExit(const 
VarDecl ,
   registerGlobalDtorWithAtExit(dtorStub);
 }
 
+/// Register a global destructor using the LLVM 'llvm.global_dtors' global.
+void CodeGenFunction::registerGlobalDtorWithLLVM(const VarDecl ,
+ llvm::FunctionCallee Dtor,
+ llvm::Constant *Addr) {
+  // Create a function which calls the destructor.
+  llvm::Function *dtorStub = createAtExitStub(VD, Dtor, Addr);
+  CGM.AddGlobalDtor(dtorStub);
+}
+
 void CodeGenFunction::registerGlobalDtorWithAtExit(llvm::Constant *dtorStub) {
   // extern "C" int atexit(void (*f)(void));
   assert(dtorStub->getType() ==
@@ -519,10 +528,6 @@ CodeGenModule::EmitCXXGlobalVarDeclInitFunc(const VarDecl 
*D,
D->hasAttr()))
 return;
 
-  if (getLangOpts().OpenMP &&
-  getOpenMPRuntime().emitDeclareTargetVarDefinition(D, Addr, PerformInit))
-return;
-
   // Check if we've already initialized this decl.
   auto I = DelayedCXXInitPosition.find(D);
   if (I != DelayedCXXInitPosition.end() && I->second == ~0U)
diff --git a/clang/lib/CodeGen/CGOpenMPRuntime.cpp 
b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
index a8e1150e44566b8..d2be8141a3a4b31 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntime.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
@@ -1747,136 +1747,6 @@ llvm::Function