Hi,

We implement automatic fusion of elemwise computations in Theano (with the restriction, for now, that the fused op can have only one output; some work could remove this). As we support strides and broadcasting, this adds a lot of pointer arithmetic overhead in the GPU kernel compared to the contiguous-memory case. I regularly see fused elemwise ops with 5-8 input arrays, and from time to time even more.
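To give an idea of that overhead, here is a minimal Python sketch (not our actual generated kernel code; strided_offset is just an illustrative name) of the index arithmetic a generic strided kernel has to redo for every element of every input, assuming row-major flattening:

    def strided_offset(flat_index, shape, strides):
        """Byte offset of element `flat_index` in a strided array.

        The strided case pays one divmod plus one multiply-add per
        dimension, per input, per element.
        """
        offset = 0
        for dim in reversed(range(len(shape))):
            flat_index, idx = divmod(flat_index, shape[dim])
            offset += idx * strides[dim]
        return offset

    # Element 5 of a C-contiguous (3, 4) float32 array:
    # strided_offset(5, (3, 4), (16, 4)) == 5 * 4 == 20 bytes

A contiguous kernel skips all of this: the flat index times the item size is already the offset.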
To lower the pointer arithmetic overhead, we try to "merge" dimensions. Small example:

    a = gpuarray.zeros((3, 4))
    b = gpuarray.zeros((4, 3))
    b_t = b.T  # a transposed view of b

When we compile the code, we don't know the strides, so we enable some optimizations on the fly. The code that tries to merge dimensions for an elemwise operation on a and b_t would do something like this (with a and b standing for the two inputs):

"""
if all inputs and the output are contiguous:
    call the contiguous gpu kernel
    return

ndim = a.ndim
a_strides, a_shape = a.strides, a.shape
b_strides, b_shape = b.strides, b.shape
out_strides, out_shape = out.strides, out.shape

# The two dimensions can be merged when every array is row-major
# contiguous across them and there is no broadcasting.
if (a_strides[0] == a_strides[1] * a_shape[1] and
        b_strides[0] == b_strides[1] * b_shape[1] and
        out_strides[0] == out_strides[1] * out_shape[1] and
        a_shape[0] == b_shape[0] and a_shape[1] == b_shape[1]):  # no broadcasting
    ndim = 1
    a_strides = (a_strides[1],)
    a_shape = (a_shape[0] * a_shape[1],)
    b_strides = (b_strides[1],)
    b_shape = (b_shape[0] * b_shape[1],)
    # idem for out_strides and out_shape

if ndim == 1:
    call the gpu kernel for ndim == 1 with a, b, a_strides, b_strides,
        a_shape, b_shape, out_shape, out_strides
elif ndim == 2:
    call the gpu kernel for ndim == 2 with a, b, a_strides, b_strides,
        a_shape, b_shape, out_shape, out_strides
"""

In that case it would be fast enough in Python. But this is simplified code: the real version uses a loop over all dimensions and handles some other cases that I don't recall right now. When we use tensors of 4-5 or 9 dimensions with an elemwise operation that takes tens of inputs, the generated code gets more complicated, and I'm not sure how fast it will execute on the GPU. I was just trying to see the potential problems we could have. When I redo those optimizations, I will try them in Python first and check whether this is a problem or not. At least I know that in that case I will need to use something other than PyCUDA.

thanks

Fred

2011/5/3 Andreas Kloeckner <li...@informa.tiker.net>:
> On Tue, 3 May 2011 11:20:36 -0400, Frédéric Bastien <no...@nouiz.org> wrote:
>> Sorry to reply to my own post, but I have a question whose answer I
>> can't find on the documentation site.
>>
>> Before calling some GPU functions, in some cases I do some
>> preprocessing of the parameters for optimization. Currently this is
>> done in C, so it is not a bottleneck. When I move this to
>> pycuda.gpuarray, I can do the preprocessing in Python, but I fear it
>> will be slow. Is there a way to have SourceModule define a C function
>> and call it from Python? It is this function that would call the GPU
>> function itself.
>>
>> I know I can do it another way, but if it could be done directly in
>> the same system it would be great.
>
> PyCUDA makes no attempt to help you call host functions, whether those
> go on and call CUDA kernels or no. If your preprocessing can be
> parametrized in some way, we might be able to shove it into one of
> PyCUDA's compiled modules. If that code needs to be generated on the
> spot, that brings in a whole different set of issues.
>
> Can you describe what your preprocessor has to do?
>
> Andreas
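P.S. The generalized merging loop I mention above could look roughly like this in Python. This is only a sketch of the idea, not the Theano code (merge_dims is a made-up name), and it assumes broadcasting has already been resolved so all arrays share one shape, with broadcast dimensions carrying stride 0:

    def merge_dims(shape, strides_list):
        """Collapse adjacent dimensions that are jointly contiguous.

        `shape` is the common shape of all arrays; `strides_list` holds
        one stride tuple, in bytes, per input/output array.
        """
        shape = list(shape)
        strides_list = [list(st) for st in strides_list]
        dim = len(shape) - 1
        while dim > 0:
            # dims (dim-1, dim) can be fused iff, for every array,
            # shape[dim] steps along dim equal one step along dim-1.
            if all(st[dim - 1] == st[dim] * shape[dim]
                   for st in strides_list):
                shape[dim - 1] *= shape[dim]
                del shape[dim]
                for st in strides_list:
                    del st[dim - 1]  # keep the inner stride
            dim -= 1
        return tuple(shape), [tuple(st) for st in strides_list]

    # For the example above (float32, so 4-byte items):
    # merge_dims((3, 4), [(16, 4), (16, 4)]) -> ((12,), [(4,), (4,)])
    # merge_dims((3, 4), [(16, 4), (4, 12)]) -> ((3, 4), [(16, 4), (4, 12)])
    #                       a and b_t: the transposed view blocks the merge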