> It'd be great to be able to do GPU / NN stuff in Nim (again).

Unfortunately for the time being my focus is cryptography. For GPU / NN, I 
think Flambeau is the better bet.

It will have the side effect of solving Cuda/OpenCL/Vulkan codegen via LLVM, but 
the kernels would still have to be (re-)implemented.

> Note that I really dislike LLVM -- it sucks as a user because the libraries 
> break compatibility regularly. Your distro provides LLVM13 but your compiler 
> needs LLVM14, etc. Then Apple's LLVM builds often don't work with regular 
> LLVM libs. I guess that's just part of the nature of using any of the machine 
> learning libraries though. :/

That was one of my concerns:

  * There has been talk of changing the JIT in LLVM from MCJIT to ORCv1 and 
then ORCv2, but the documentation for ORC 
(<https://llvm.org/docs/tutorial/BuildingAJIT1.html>) is still WIP and has been 
for about 4 years now: "Warning: This tutorial is currently being updated to 
account for ORC API changes. Only Chapters 1 and 2 are up-to-date. Example code 
from Chapters 3 to 5 will compile and run, but has not been updated."
  * The latest Cuda and the latest Clang/LLVM shipped by distros often mismatch 
in version, which matters for Nvidia codegen via Clang.



However, I don't think it will be an issue for my libraries:

  * Instead of hardcoding a libLLVM-15.so version, you can autodetect it with 
`{.passl: gorge("llvm-config --libs").}` (see the sketch after this list).
  * LLVM IR is by necessity very stable and hasn't changed in years. The C API 
is also fully featured (compared to many other popular C++ projects like 
OpenCV) and is depended on by many languages like Rust, Julia, ...



Basically the only versioning woes I should get would be around support for new 
backends; for example, OpenCL, OpenGL and Vulkan kernel generation via SPIR-V is 
only available from LLVM 15 onward.

> I'm not sure if pure CUDA C++ kernels have the same issue too.

Cuda C++ is also usually forward compatible; it's only new features like tensor 
cores, new synchronization primitives or unified memory that require newer 
versions.

The latest big breakage was hardware-level, with the RTX 2XXX series and 
[Independent Thread Scheduling](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling). 
GPU threads are organized in groups of 32 called warps, and within a warp they 
used to execute the same instructions in lockstep, which also meant executing 
all branches of an if/then/else whenever at least one thread had to take a 
different branch from the others. The RTX 2XXX series and later allow 
independent branching, and since lockstep execution was a decade-old assumption, 
lots of synchronization code broke on those GPUs.
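
To make the breakage concrete, here is an illustrative before/after, just Cuda 
fragments held in Nim strings the way a runtime compiler would receive them (the 
variable names are made up). Old warp-synchronous code leaned on the implicit 
lockstep; with Independent Thread Scheduling the sync point has to be spelled 
out with the `*_sync` intrinsics or `__syncwarp()`:

```nim
# Old warp-synchronous reduction step: correctness relied on all 32 threads
# of the warp executing each line together, so no barrier was written.
const lockstepReduce = """
  val += __shfl_down(val, 16);
  val += __shfl_down(val, 8);
"""

# With Independent Thread Scheduling that assumption is gone: the mask-taking
# *_sync intrinsics (or an explicit __syncwarp()) make the sync point explicit.
const explicitSyncReduce = """
  val += __shfl_down_sync(0xffffffff, val, 16);
  val += __shfl_down_sync(0xffffffff, val, 8);
"""
```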

> Also for the C++ compilation flags it would be possible to compile to source 
> and then manually compile the generated C++ code. For CUDA kernels I'd guess 
> that Nim features that use GNU features like computed goto's wouldn't be used.

If you do it at compile time:

  * I'm not sure you can portably do 
`staticWrite("foo.cpp"); compile("foo.cpp")`; you might need 2-stage compilation.
  * Or you do Nim -> Cuda codegen, with the same woes as Arraymancer regarding 
compiler and compilation-flag configuration.



If you do it at runtime via NVRTC, which is my recommendation if you only want 
to support Nvidia, it should be the easiest to maintain and deploy. I'm only 
considering LLVM myself, so I write LLVM IR and then all backends supported by 
LLVM (Nvidia, AMD, Intel via OpenCL/SPIR-V) are available.
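
A minimal sketch of that NVRTC route, assuming a Linux `libnvrtc.so` and a toy 
kernel of my own; the five `nvrtc*` entry points declared below are the actual 
NVRTC C API, everything else is illustrative:

```nim
type
  NvrtcProgram = pointer        # opaque nvrtcProgram handle
  NvrtcResult  = cint           # 0 is NVRTC_SUCCESS

const Nvrtc = "libnvrtc.so"     # assumption: adjust for your platform

proc nvrtcCreateProgram(prog: ptr NvrtcProgram; src, name: cstring;
                        numHeaders: cint; headers, includeNames: cstringArray):
                        NvrtcResult {.importc, cdecl, dynlib: Nvrtc.}
proc nvrtcCompileProgram(prog: NvrtcProgram; numOptions: cint;
                         options: cstringArray): NvrtcResult {.importc, cdecl, dynlib: Nvrtc.}
proc nvrtcGetPTXSize(prog: NvrtcProgram; size: ptr csize_t): NvrtcResult {.importc, cdecl, dynlib: Nvrtc.}
proc nvrtcGetPTX(prog: NvrtcProgram; ptx: cstring): NvrtcResult {.importc, cdecl, dynlib: Nvrtc.}
proc nvrtcDestroyProgram(prog: ptr NvrtcProgram): NvrtcResult {.importc, cdecl, dynlib: Nvrtc.}

# Toy kernel, shipped as a plain string and compiled on the user's machine.
const kernelSrc = """
extern "C" __global__ void axpy(float a, const float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}
"""

proc compileToPtx(src: string): string =
  var prog: NvrtcProgram
  doAssert nvrtcCreateProgram(addr prog, src.cstring, "axpy.cu", 0, nil, nil) == 0
  doAssert nvrtcCompileProgram(prog, 0, nil) == 0       # default compile options
  var size: csize_t
  doAssert nvrtcGetPTXSize(prog, addr size) == 0
  result = newString(size.int)
  doAssert nvrtcGetPTX(prog, result.cstring) == 0
  doAssert nvrtcDestroyProgram(addr prog) == 0

when isMainModule:
  echo compileToPtx(kernelSrc)  # PTX text, ready for the Cuda driver API
```

The resulting PTX would then be loaded with the Cuda driver API 
(`cuModuleLoadData`), which is omitted here; the nice property is that nothing 
about the host C/C++ compiler or its flags leaks into the build.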
