On 03/22/2013 11:12 PM, Peter Colberg wrote:
> Maybe there is a magic switch in the NVIDIA driver to enable the
> JIT compilation of this kind of PTX code? In any case, it is not
> exposed by the CUDA driver API.
I checked the examples in the NVIDIA compute SDK, and there is a matrix multiplication case which uses shared memory and the CUDA driver API. Does that example work for you? It's in NVIDIA_GPU_Computing_SDK/C/src/matrixMulDrv.

It uses cuModuleLoadDataEx() similarly to the way you do. The main difference I see is that it sets the maximum registers per CUDA thread to 32. Also, the kernel uses "automatic locals" in contrast to the "host allocated" kernel local arguments, if that makes any difference.

-- 
--Pekka
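For reference, a minimal sketch of how a PTX image can be loaded with an explicit register cap through the driver API, roughly along the lines of what matrixMulDrv does. The option names are real CUDA driver API symbols, but the PTX string, function name, and error handling are placeholders:

    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: JIT-compile a PTX image while capping the register
       usage per thread at 32, as the matrixMulDrv SDK example does.
       'ptx_image' is assumed to hold the NUL-terminated PTX text. */
    static CUmodule load_ptx_with_reg_cap(const char *ptx_image)
    {
        CUjit_option options[1];
        void *option_values[1];
        CUmodule module;

        options[0] = CU_JIT_MAX_REGISTERS;
        /* option value is passed by value, cast through size_t */
        option_values[0] = (void *)(size_t)32;

        if (cuModuleLoadDataEx(&module, ptx_image, 1, options,
                               option_values) != CUDA_SUCCESS) {
            fprintf(stderr, "cuModuleLoadDataEx failed\n");
            return NULL;
        }
        return module;
    }

Regarding the locals: with statically declared __shared__ arrays ("automatic locals") the shared memory size is already known from the PTX at JIT time, whereas with host-allocated local arguments the size has to be supplied at launch time (e.g. via the shared memory size argument of cuLaunchKernel), which might be where the behaviour diverges.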
