> It'd be great to be able to do GPU / NN stuff in Nim (again).

Unfortunately, for the time being my focus is cryptography. For GPU / NN, I think Flambeau is the better bet.
It will have a side-effect of solving Cuda/OpenCL/Vulkan codegen via LLVM, but the kernels would still have to be (re-)implemented.

> Note that I really dislike LLVM -- it sucks as a user because the libraries break compatibility regularly. Your distro provides LLVM13 but your compiler needs LLVM14, etc. Then Apple's LLVM builds often don't work with regular LLVM libs. I guess that's just part of the nature of using any of the machine learning libraries though. :/

That was one of my concerns:

* There has been talk of changing the JIT in LLVM from MCJIT to ORCv1 and then ORCv2, and the documentation for ORC is still WIP: <https://llvm.org/docs/tutorial/BuildingAJIT1.html>. It has been for about 4 years now:
  > Warning: This tutorial is currently being updated to account for ORC API changes. Only Chapters 1 and 2 are up-to-date.
  > Example code from Chapters 3 to 5 will compile and run, but has not been updated.
* The latest Cuda and the latest Clang/LLVM shipped by distros regularly mismatch for Nvidia codegen via Clang.

However, I don't think it will be an issue for my libraries:

* Instead of hardcoding a libLLVM-15.so version, you can autodetect it with `{.passl: gorge("llvm-config --libs").}` (a sketch follows at the end of this post).
* LLVM IR by necessity is very stable and hasn't changed for years. The C API is also fully-featured (compared to many other popular C++ projects like OpenCV) and is depended on by many languages like Rust, Julia, ...

Basically the only versioning woes I should get would be support for new backends; for example, OpenCL/OpenGL/Vulkan kernel generation via SPIR-V is only available from LLVM-15 onwards.

> I'm not sure if pure CUDA C++ kernels have the same issue too.

Cuda C++ is also usually forward compatible; only new features like tensor cores, new synchronization primitives or unified memory require newer versions. The latest big breakage was hardware-level, with the RTX 2XXX series and [Independent Thread Scheduling](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling). GPU threads are organized in groups of 32 called a warp, and within a warp they used to execute the same instructions, which also meant executing all branches of an if/then/else if at least one thread had to take a different branch from the others. RTX 2XXX and later allow independent branching, and since lockstep execution was a decade-old assumption, lots of synchronization code broke on those GPUs.

> Also for the C++ compilation flags it would be possible to compile to source and then manually compile the generated C++ code. For CUDA kernels I'd guess that Nim features that use GNU features like computed goto's wouldn't be used.

If you do it at compile time:

* I'm not sure you can portably do `staticWrite("foo.cpp"); compile("foo.cpp")`, you might need 2-stage compilation.
* Or you have Nim -> Cuda codegen with the same woes as Arraymancer regarding compiler and compilation flags config.

If you do it at runtime via NVRTC, which is my recommendation if you only want to support Nvidia, it should be the easiest to maintain and deploy (see the NVRTC sketch at the end of this post). I'm only considering LLVM, so I write LLVM IR and then all backends supported by LLVM (Nvidia, AMD, Intel via OpenCL/SPIR-V) are available.
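Here is a minimal sketch of the `llvm-config` autodetection mentioned above, assuming `llvm-config` from the distro's LLVM package is on the PATH (the exact flag set is my assumption, adjust it to the components you actually link):

```nim
# Minimal sketch: query the installed LLVM at compile time instead of
# hardcoding a libLLVM-15.so name. `gorge` is Nim's compile-time process
# execution (an alias of staticExec), so the flags are baked in at build time.
{.passc: gorge("llvm-config --cflags").}
{.passl: gorge("llvm-config --libs --system-libs --ldflags").}
```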
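And a hedged sketch of the NVRTC runtime-compilation path: the bindings below are hand-written assumptions against the NVRTC C API (`libnvrtc` ships with the Cuda toolkit), not an existing Nim wrapper, and error handling is reduced to asserts:

```nim
# Hedged sketch: compile a Cuda C++ kernel to PTX at runtime with NVRTC.
# The library name and these minimal bindings are assumptions (Linux,
# Cuda toolkit installed); a real wrapper would cover the full nvrtc.h API.
type NvrtcProgram = distinct pointer

const Nvrtc = "libnvrtc.so"   # assumption: Linux; adjust for Windows/macOS

proc nvrtcCreateProgram(prog: ptr NvrtcProgram, src, name: cstring,
                        numHeaders: cint, headers, includeNames: ptr cstring): cint {.
  importc, cdecl, dynlib: Nvrtc.}
proc nvrtcCompileProgram(prog: NvrtcProgram, numOptions: cint, options: ptr cstring): cint {.
  importc, cdecl, dynlib: Nvrtc.}
proc nvrtcGetPTXSize(prog: NvrtcProgram, size: ptr csize_t): cint {.
  importc, cdecl, dynlib: Nvrtc.}
proc nvrtcGetPTX(prog: NvrtcProgram, ptx: cstring): cint {.
  importc, cdecl, dynlib: Nvrtc.}

const saxpySrc = """
extern "C" __global__ void saxpy(float a, const float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}
"""

proc compileToPtx(src: string): string =
  ## Returns PTX text, ready to be loaded with the Cuda driver API.
  var prog: NvrtcProgram
  doAssert nvrtcCreateProgram(prog.addr, src.cstring, "saxpy.cu", 0, nil, nil) == 0
  doAssert nvrtcCompileProgram(prog, 0, nil) == 0
  var size: csize_t
  doAssert nvrtcGetPTXSize(prog, size.addr) == 0
  result = newString(size.int)              # size includes the trailing NUL
  doAssert nvrtcGetPTX(prog, result.cstring) == 0

when isMainModule:
  echo compileToPtx(saxpySrc)
```

The resulting PTX can then be loaded and launched through the Cuda driver API (`cuModuleLoadData`, `cuLaunchKernel`), so end users never need nvcc or a specific host compiler at build time.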