Hi Mani,

> Thanks for the reply. That fixed it. I get only a 10% speed up using the
> cusp options. Is the residual evaluation at each iteration happening on
> the CPU or the GPU?

The residual evaluation happens on the CPU unless a dedicated kernel is provided for it (which is not the case in ex19).


> Is there any way one can do the residual evaluation
> on the GPU too, after the data has been transferred?

Technically it is possible by extracting the underlying GPU buffers from the vector objects and managing the Field data manually. Frankly, I don't know the current state of the local-to-global mappings; you will likely have to do quite a bit of copying between host and device by hand.


> Ex42 shows how it
> can be done using cusp but it looks really ugly and I want to use
> OpenCL. Basically can I do something like this?
>
> DMGetLocalVector(da, &localX); //Vector is now in GPU.
> DMDAVecGetArray(da, localX, &x); //Array is on GPU.
>
> //Create buffers for OpenCL
> buffer = cl::Buffer(context, CL_MEM_USE_HOST_PTR |
>                              CL_MEM_READ_WRITE,
>                     sizeofarray, &x[X2Start-Ng][X1Start-Ng],
>                     &clErr);
>
> (I'm hoping that here CL_MEM_USE_HOST_PTR will give a pointer to the
> data already on the GPU)
>
> // Launch OpenCL kernels and now map the buffers to read off the data.
>
> DMDAVecRestoreArray(da, localX, &x);
> DMRestoreLocalVector(da, &localX);
>
> I think the question is whether DMDAVecGetArray will return a pointer to
> the data on the GPU or not.

*VecGetArray() will always return a plain C pointer, because C does not allow functions to be overloaded. Buffers in OpenCL are of type cl_mem, so this won't work. Also, you won't be able to copy a two-dimensional array with just one pointer &x[][]. As far as I know, we don't have any API which provides GPU buffers directly, but maybe Matt recently added some functions for this to work with FEM.
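
To illustrate the point about the two-dimensional array, here is a rough sketch in plain C (no PETSc or OpenCL calls). It assumes the array is reached through row pointers, as in x[j][i], and that the rows are not guaranteed to be adjacent in memory. The names pack_window and unpack_window are made up for this example and are not part of any library. The idea is to pack the window row by row into one contiguous staging buffer, which is what an explicit host-to-device copy such as clEnqueueWriteBuffer would need, and to scatter the result back afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy an ny-by-nx window of a row-pointer 2D array into one contiguous
 * buffer, row by row. rows[j] points at the first element of row j
 * (assumed layout for this sketch). (ys, xs) is the window's first row
 * and first column; dst must hold at least ny * nx doubles. */
void pack_window(double *dst, double *const *rows,
                 size_t ys, size_t xs, size_t ny, size_t nx)
{
    assert(dst != NULL && rows != NULL);
    for (size_t j = 0; j < ny; ++j) {
        memcpy(dst + j * nx, rows[ys + j] + xs, nx * sizeof(double));
    }
}

/* Inverse of pack_window: scatter the contiguous buffer back into the
 * same window of the row-pointer array. */
void unpack_window(double *const *rows, const double *src,
                   size_t ys, size_t xs, size_t ny, size_t nx)
{
    assert(src != NULL && rows != NULL);
    for (size_t j = 0; j < ny; ++j) {
        memcpy(rows[ys + j] + xs, src + j * nx, nx * sizeof(double));
    }
}
```

These two extra copies on the host come on top of the transfers to and from the device, which is part of why the manual route gets expensive.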

As far as I can tell, providing only the kernel won't suffice, because we don't have GPU implementations for 'Field' data available. Hence, you would have to copy the x and b arrays to the device manually and then copy everything back, which is most likely too much of a performance hit to be worth the effort. Since GPUs are becoming more and more integrated into CPUs, it's questionable whether it's worth the time to implement such additional memory management for accelerators if they disappear in their discrete PCI-Express form a few years from now...

Best regards,
Karli
