Hello Peter,

On 03/18/2013 07:22 PM, Peter Colberg wrote:
> In case you are familiar with the OpenCL support of the NVIDIA driver,
> it has not been getting any better with the recent CUDA 5.0 release.
> As I had to experience, their OpenCL implementation seems to be
> serial, which prevents multi-GPU scheduling, or kernel execution and
> memory transfer overlapping, when using a single host thread. I fear
> the worst case, that OpenCL support would eventually be dropped
> entirely.
I've heard about this. It's unfortunate, but understandable business-wise
for the leading player in the GPU compute game. Of course, it's bad news
for the end users who could really benefit from a widespread and
well-supported open heterogeneous compute API without vendor lock-in.
Let's hope we can improve the situation with pocl.

> A PTX device backend to POCL would allow users to develop portable
> OpenCL (1.2) codes now, and run them on existing installations of
> NVIDIA GPUs using the CUDA driver, until we have mature open-source
> support for AMD and NVIDIA GPUs in distributions.
>
> What do you think about this idea? Does the CUDA driver API expose
> enough functionality to implement OpenCL? How difficult would it
> be to use the NVPTX backend of LLVM for compilation?

It should be doable with the CUDA API and the LLVM NVPTX backend. I took
a look at the CUDA API some time ago with this exact idea in mind, but
didn't have the time to move forward with it. How much work it would be,
I'm not sure, as I haven't tested either the LLVM NVPTX backend or the
API. My guess is it shouldn't be too hard to get something running,
because we have the previous drivers for heterogeneous device setups as
examples. If you are up for the task, take a look at the pocl device
drivers for cellspu, TCE (ttasim), or Tom Stellard's unfinished Gallium
compute / AMD R600 driver.

A key point is that the kernel compilation chain is simpler than for
CPUs, as the NVIDIA GPUs are SIMT. Thus, the kernel compiler should skip
most of the complex passes and feed the single work-item kernel to the
device. For R600 it's similar, but in that case it can sometimes benefit
from multi-work-item (replicated) input due to the ILP in the VLIW lanes.
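To make the compilation side a bit more concrete, below is a rough,
untested sketch of how the linked single work-item kernel module could be
pushed through the NVPTX backend via the LLVM-C target machine API to get
PTX text. The nvptx64-nvidia-cuda triple and the "sm_20" target CPU are
placeholder assumptions; a real driver would pick them per device:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <llvm-c/Core.h>
  #include <llvm-c/Target.h>
  #include <llvm-c/TargetMachine.h>

  /* Untested sketch: feed the linked single work-item kernel module to
   * the NVPTX backend and return the PTX as a malloc'd C string.
   * Triple and "sm_20" are placeholder assumptions. */
  char *emit_ptx(LLVMModuleRef kernel_module)
  {
    char *error = NULL;
    LLVMTargetRef target;
    LLVMTargetMachineRef tm;
    LLVMMemoryBufferRef buf;

    LLVMInitializeNVPTXTargetInfo();
    LLVMInitializeNVPTXTarget();
    LLVMInitializeNVPTXTargetMC();
    LLVMInitializeNVPTXAsmPrinter();

    if (LLVMGetTargetFromTriple("nvptx64-nvidia-cuda", &target, &error)) {
      fprintf(stderr, "NVPTX target not available: %s\n", error);
      exit(1);
    }

    tm = LLVMCreateTargetMachine(target, "nvptx64-nvidia-cuda", "sm_20",
                                 "", LLVMCodeGenLevelDefault,
                                 LLVMRelocDefault, LLVMCodeModelDefault);

    /* "Assembly" output from the NVPTX backend is PTX text, which is
     * what cuModuleLoadData() on the host side accepts. */
    if (LLVMTargetMachineEmitToMemoryBuffer(tm, kernel_module,
                                            LLVMAssemblyFile, &error,
                                            &buf)) {
      fprintf(stderr, "PTX emission failed: %s\n", error);
      exit(1);
    }

    size_t size = LLVMGetBufferSize(buf);
    char *ptx = malloc(size + 1);
    memcpy(ptx, LLVMGetBufferStart(buf), size);
    ptx[size] = '\0';

    LLVMDisposeMemoryBuffer(buf);
    LLVMDisposeTargetMachine(tm);
    return ptx;
  }

Since the input is already a single work-item kernel, none of the
work-group replication passes used for the CPU targets are needed before
this step.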
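On the device layer side, a pocl CUDA/PTX driver could then hand that PTX
string to the CUDA driver API roughly as follows. Again just an untested
sketch: the kernel name, the single buffer argument and the
local-size-to-block mapping are made up for illustration, and real code
would cache the context and module instead of recreating them per launch:

  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda.h>

  #define CU_CHECK(call)                                           \
    do {                                                           \
      CUresult err_ = (call);                                      \
      if (err_ != CUDA_SUCCESS) {                                  \
        fprintf(stderr, "CUDA driver error %d at %s:%d\n",         \
                (int)err_, __FILE__, __LINE__);                    \
        exit(1);                                                   \
      }                                                            \
    } while (0)

  /* Untested sketch: JIT the PTX produced above, look up one kernel by
   * name and launch it over 'work_items' items with a single buffer
   * argument. */
  void launch_ptx_kernel(const char *ptx, const char *kernel_name,
                         void *host_buf, size_t bytes, size_t work_items)
  {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction func;
    CUdeviceptr dev_buf;

    CU_CHECK(cuInit(0));
    CU_CHECK(cuDeviceGet(&dev, 0));
    CU_CHECK(cuCtxCreate(&ctx, 0, dev));

    /* The driver JIT-compiles the PTX for the actual device here. */
    CU_CHECK(cuModuleLoadData(&mod, ptx));
    CU_CHECK(cuModuleGetFunction(&func, mod, kernel_name));

    /* clCreateBuffer + clEnqueueWriteBuffer equivalents. */
    CU_CHECK(cuMemAlloc(&dev_buf, bytes));
    CU_CHECK(cuMemcpyHtoD(dev_buf, host_buf, bytes));

    /* OpenCL local size -> CUDA block, number of work-groups -> grid. */
    unsigned block = 128;
    unsigned grid = (unsigned)((work_items + block - 1) / block);
    void *args[] = { &dev_buf };
    CU_CHECK(cuLaunchKernel(func, grid, 1, 1, block, 1, 1,
                            0 /* smem */, 0 /* stream */, args, NULL));
    CU_CHECK(cuCtxSynchronize());

    CU_CHECK(cuMemcpyDtoH(host_buf, dev_buf, bytes));
    CU_CHECK(cuMemFree(dev_buf));
    CU_CHECK(cuModuleUnload(mod));
    CU_CHECK(cuCtxDestroy(ctx));
  }

Overlapping transfers and kernels, or driving multiple GPUs, would then
just be a matter of using streams and one context per device in the
driver, independent of NVIDIA's OpenCL implementation.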
-- 
--Pekka