Hi,

We implement automatic fusion of elemwise computations in Theano (with the restriction, for now, that the fused op can have only one output; some work could remove this). As we support strides and broadcasting, this adds a lot of pointer arithmetic overhead in the GPU kernel compared to the contiguous-memory case. I regularly see fused elemwise ops with 5-8 input arrays, and from time to time even more.
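To give an idea of that overhead, here is a minimal Python sketch (not our actual generated kernel code; strided_offset is just an illustrative name) of the index arithmetic a generic strided kernel has to redo for every element of every input, assuming row-major flattening:

    def strided_offset(flat_index, shape, strides):
        """Byte offset of element `flat_index` in a strided array.

        The strided case pays one divmod plus one multiply-add per
        dimension, per input, per element.
        """
        offset = 0
        for dim in reversed(range(len(shape))):
            flat_index, idx = divmod(flat_index, shape[dim])
            offset += idx * strides[dim]
        return offset

    # Element 5 of a C-contiguous (3, 4) float32 array:
    # strided_offset(5, (3, 4), (16, 4)) == 5 * 4 == 20 bytes

A contiguous kernel skips all of this: the flat index times the item size is already the offset.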
To lower the pointer arithmetic overhead, we try to "merge" dimensions. Small example:

    a = gpuarray.zeros((3, 4))
    b = gpuarray.zeros((4, 3))
    b_t = b.T  # a transposed view of b

When we compile the code, we don't know the strides, so we enable some optimizations on the fly. The code that tries to merge dimensions for an elemwise operation on a and b_t would do something like this (with a and b standing for the two inputs):

"""
if all inputs and the output are contiguous:
    call the contiguous gpu kernel
    return

ndim = a.ndim
a_strides, a_shape = a.strides, a.shape
b_strides, b_shape = b.strides, b.shape
out_strides, out_shape = out.strides, out.shape

# The two dimensions can be merged when every array is row-major
# contiguous across them and there is no broadcasting.
if (a_strides[0] == a_strides[1] * a_shape[1] and
        b_strides[0] == b_strides[1] * b_shape[1] and
        out_strides[0] == out_strides[1] * out_shape[1] and
        a_shape[0] == b_shape[0] and a_shape[1] == b_shape[1]):  # no broadcasting
    ndim = 1
    a_strides = (a_strides[1],)
    a_shape = (a_shape[0] * a_shape[1],)
    b_strides = (b_strides[1],)
    b_shape = (b_shape[0] * b_shape[1],)
    # idem for out_strides and out_shape

if ndim == 1:
    call the gpu kernel for ndim == 1 with a, b, a_strides, b_strides,
        a_shape, b_shape, out_shape, out_strides
elif ndim == 2:
    call the gpu kernel for ndim == 2 with a, b, a_strides, b_strides,
        a_shape, b_shape, out_shape, out_strides
"""

In that case it would be fast enough in Python. But this is simplified code: the real version uses a loop over all dimensions and handles some other cases that I don't recall right now. When we use tensors of 4-5 or 9 dimensions with an elemwise operation that takes tens of inputs, the generated code gets more complicated, and I'm not sure how fast it will execute on the GPU. I was just trying to see the potential problems we could have. When I redo those optimizations, I will try them in Python first and check whether this is a problem or not. At least I know that in that case I will need to use something other than PyCUDA.

thanks

Fred

2011/5/3 Andreas Kloeckner <li...@informa.tiker.net>:
> On Tue, 3 May 2011 11:20:36 -0400, Frédéric Bastien <no...@nouiz.org> wrote:
>> Sorry to reply to my own post, but I have a question whose answer I
>> can't find on the documentation site.
>>
>> Before calling some GPU functions, in some cases I do some
>> preprocessing of the parameters for optimization. Currently this is
>> done in C, so it is not a bottleneck. When I move this to
>> pycuda.gpuarray, I can do the preprocessing in Python, but I fear it
>> will be slow. Is there a way to have SourceModule define a C function
>> and call it from Python? It is this function that would call the GPU
>> function itself.
>>
>> I know I can do it another way, but if it could be done directly in
>> the same system it would be great.
>
> PyCUDA makes no attempt to help you call host functions, whether those
> go on and call CUDA kernels or no. If your preprocessing can be
> parametrized in some way, we might be able to shove it into one of
> PyCUDA's compiled modules. If that code needs to be generated on the
> spot, that brings in a whole different set of issues.
>
> Can you describe what your preprocessor has to do?
>
> Andreas
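P.S. The generalized merging loop I mention above could look roughly like this in Python. This is only a sketch of the idea, not the Theano code (merge_dims is a made-up name), and it assumes broadcasting has already been resolved so all arrays share one shape, with broadcast dimensions carrying stride 0:

    def merge_dims(shape, strides_list):
        """Collapse adjacent dimensions that are jointly contiguous.

        `shape` is the common shape of all arrays; `strides_list` holds
        one stride tuple, in bytes, per input/output array.
        """
        shape = list(shape)
        strides_list = [list(st) for st in strides_list]
        dim = len(shape) - 1
        while dim > 0:
            # dims (dim-1, dim) can be fused iff, for every array,
            # shape[dim] steps along dim equal one step along dim-1.
            if all(st[dim - 1] == st[dim] * shape[dim]
                   for st in strides_list):
                shape[dim - 1] *= shape[dim]
                del shape[dim]
                for st in strides_list:
                    del st[dim - 1]  # keep the inner stride
            dim -= 1
        return tuple(shape), [tuple(st) for st in strides_list]

    # For the example above (float32, so 4-byte items):
    # merge_dims((3, 4), [(16, 4), (16, 4)]) -> ((12,), [(4,), (4,)])
    # merge_dims((3, 4), [(16, 4), (4, 12)]) -> ((3, 4), [(16, 4), (4, 12)])
    #                       a and b_t: the transposed view blocks the merge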