gregrodgers added a comment.

I like the idea of using an automatic include as a cc1 option (-include).
However, I would prefer a more general automatic include for OpenMP, not just
for math functions (__clang_cuda_device_functions.h). Clang CUDA automatically
includes __clang_cuda_runtime_wrapper.h, which in turn includes other files
such as __clang_cuda_device_functions.h as needed. Let's hypothetically call
my proposed automatic include for OpenMP __clang_openmp_runtime_wrapper.h.
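
On the driver side, the wiring for such an automatic include could be quite
small. A minimal sketch, assuming the usual addClangTargetOptions hook and my
proposed (hypothetical) header name:

  // Sketch only: add the OpenMP wrapper header to every device cc1.
  if (DeviceOffloadKind == Action::OFK_OpenMP) {
    CC1Args.push_back("-include");
    CC1Args.push_back("__clang_openmp_runtime_wrapper.h");
  }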

Just because Clang CUDA defines functions in __clang_cuda_device_functions.h
and automatically includes them does not make it right for OpenMP. In general,
function definitions in headers should be avoided. The current function
definitions in __clang_cuda_device_functions.h only work for hostile nv GPUs
:). Here is how we can avoid function definitions in the headers: in a new
OpenMP build process, we can build libm-nvptx.bc by compiling
__clang_cuda_device_functions.h in a device-only compile (e.g., a clang
invocation with --cuda-device-only and -emit-llvm). Assuming current naming
conventions, these files would be installed in the same directory as
libomptarget.so (.../lib).

How do we tell clang cc1 to use this bc library? Use -mlink-builtin-bitcode.
AddMathDeviceFunctions would then look something like this:

if (this is for device cc1) {

  CC1Args.push_back("-mlink-builtin-bitcode");
  if (getTriple().isNVPTX())
    CC1Args.push_back(DriverArgs.MakeArgString("libm-nvptx.bc"));
  if (getTriple().getArch() == llvm::Triple::amdgcn)
    CC1Args.push_back(DriverArgs.MakeArgString("libm-amdgcn.bc"));

}

You can think of a libm-<arch>.bc file as the device library equivalent of the
host libm.so or libm.a. This concept of "host-consistent" library definitions
can go beyond math libraries. In fact, I believe we should co-opt the -l
(--library) option: the driver toolchain should look for a device bc library
for any -lX command line option. This gives us a strategy for adding
user-defined device libraries.
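
A minimal sketch of that -l lookup in the toolchain, assuming hypothetical
LibraryPath and DeviceArch variables for where the bc libs live and which
architecture we are targeting:

  for (const Arg *A : DriverArgs.filtered(options::OPT_l)) {
    // Probe for a device bc counterpart of -lX next to libomptarget.so,
    // e.g. -lfoo -> libfoo-nvptx.bc.
    SmallString<128> BCLib(LibraryPath);
    llvm::sys::path::append(
        BCLib, "lib" + llvm::Twine(A->getValue()) + "-" + DeviceArch + ".bc");
    if (llvm::sys::fs::exists(BCLib)) {
      CC1Args.push_back("-mlink-builtin-bitcode");
      CC1Args.push_back(DriverArgs.MakeArgString(BCLib));
    }
  }

If no device counterpart is found, -lX would simply keep its host-side
meaning.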

The above code hints at the idea of architecture-specific bc files (nvptx vs
amdgcn). The nvptx version would call into the CUDA libdevice. For Radeon
processors, we may want processor-optimized versions of the libraries, just
like there are sub-architecture-optimized versions of the CUDA libdevice. If
we build --cuda-gpu-arch-optimized versions of the math bc libs, then the
above code will get a bit more complex, depending on the naming convention of
the bc lib and the value of --cuda-gpu-arch (which should have an alias,
--offload-arch).
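
A sketch of how that selection might look, assuming the device cc1 sees its
GPU via -march (as the CUDA toolchain does today) and a hypothetical
libm-<gpu-arch>.bc naming convention:

  StringRef GpuArch = DriverArgs.getLastArgValue(options::OPT_march_EQ);
  // Prefer a sub-architecture-optimized lib (e.g. libm-sm_70.bc or
  // libm-gfx906.bc), falling back to the generic per-triple one.
  SmallString<128> LibmPath(LibraryPath);
  llvm::sys::path::append(LibmPath, "libm-" + GpuArch + ".bc");
  if (!llvm::sys::fs::exists(LibmPath)) {
    LibmPath = LibraryPath;
    llvm::sys::path::append(LibmPath, "libm-nvptx.bc");
  }
  CC1Args.push_back("-mlink-builtin-bitcode");
  CC1Args.push_back(DriverArgs.MakeArgString(LibmPath));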

Using a bc lib significantly reduces the complexity of
__clang_openmp_runtime_wrapper.h. We would no longer need to see the math
device function definitions or the nv headers that they require. However, the
wrapper does need to correct the behaviour of rogue system headers that define
host-optimized functions. We can fix this by adding the following to
__clang_openmp_runtime_wrapper.h so that host passes still get host-optimized
functions:

#if defined(__AMDGCN__) || defined(__NVPTX__)
// Suppress host-optimized inline definitions from system headers
// (e.g. glibc honours __NO_INLINE__) during device passes.
#define __NO_INLINE__ 1
#endif

There is a tradeoff to using pre-compiled bc libs: it makes compile-time macro
logic hard to implement. For example, we can't do this:

#if defined(__CLANG_CUDA_APPROX_TRANSCENDENTALS__)
#define __FAST_OR_SLOW(fast, slow) fast
#else
#define __FAST_OR_SLOW(fast, slow) slow
#endif

The OpenMP build process would either need to build an alternative bc library
for each such option, or a supplemental bc library that addresses these types
of options. If some option is turned on, then the alternative lib, or a
particular ordering of libs, would be used to build the clang cc1 command.
For example, the above code for AddMathDeviceFunctions would grow to something
like this:

  ...
  if (getTriple().isNVPTX()) {
    if (LangOpts.CUDADeviceApproxTranscendentals || LangOpts.FastMath) {
      CC1Args.push_back("-mlink-builtin-bitcode");
      CC1Args.push_back(DriverArgs.MakeArgString("libm-fast-nvptx.bc"));
    }
    CC1Args.push_back("-mlink-builtin-bitcode");
    CC1Args.push_back(DriverArgs.MakeArgString("libm-nvptx.bc"));
  }

I personally believe that pre-built bc libraries with some consistency with
their host-equivalent libraries are a saner approach to device libraries than
complex header logic that is customized for each architecture.


Repository:
  rC Clang

https://reviews.llvm.org/D47849


