On 03/22/2013 11:12 PM, Peter Colberg wrote:
> Maybe there is a magic switch in the NVIDIA driver to enable the
> JIT compilation of this kind of PTX code? In any case, it is not
> exposed by the CUDA driver API.

I checked the examples in the NVIDIA compute SDK and there is
a matrix multiplication case which uses shared memory and the
CUDA driver API. Does that example work for you?

It's in NVIDIA_GPU_Computing_SDK/C/src/matrixMulDrv. It uses
cuModuleLoadDataEx() in much the same way you do. The main
difference I see there is that it caps the maximum registers
per CUDA thread at 32.
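
For reference, the loading step looks roughly like this. This is
only a rough, untested sketch of the relevant part, not a copy of
the SDK code, and it assumes cuInit() and a current context have
already been set up:

    #include <cuda.h>
    #include <stdio.h>

    /* Load a NUL-terminated PTX string with the driver API,
     * asking the JIT to limit each thread to 32 registers. */
    static CUmodule load_ptx(const char *ptx_source)
    {
        CUmodule module;
        CUjit_option options[1]      = { CU_JIT_MAX_REGISTERS };
        void       *option_values[1] = { (void *)(size_t)32 };

        CUresult res = cuModuleLoadDataEx(&module, ptx_source,
                                          1, options, option_values);
        if (res != CUDA_SUCCESS) {
            fprintf(stderr, "cuModuleLoadDataEx failed: %d\n", (int)res);
            return NULL;
        }
        return module;
    }
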

Also, the kernel uses "automatic locals" rather than "host
allocated" kernel local arguments, in case that makes a difference.

-- 
--Pekka

