You've only described a K x N processing problem, where you would run
N kernels that each process one row of K values.  You haven't
described any cross-communication or data shuffling between multiple
such sub-problems, nor the approximate amount of work per unit of
input or output data.  Are your tasks truly independent?  What data
management do you have to do to get your inputs into a parallel or
distributed decomposition?
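To make that concrete, a fully independent K x N decomposition can be
sketched in plain NumPy; the per-row reduction here is just a
hypothetical stand-in for whatever your real kernel does, and the
sizes are made up:

```python
import numpy as np

# Hypothetical problem sizes; substitute your own K and N.
K, N = 1024, 8

data = np.random.rand(N, K)

def row_kernel(row):
    # Stand-in for the real per-row work.
    return row.sum()

# Each row is processed independently: no cross-row communication,
# no shared state, so the N tasks can be farmed out freely.
results = np.array([row_kernel(row) for row in data])
```

If your real kernels look like this -- no data exchanged between
rows -- then almost any decomposition strategy will work.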

At one far extreme, a high throughput job manager could be used to
execute a set of independent PyOpenCL programs, each sized to fit on
your OpenCL devices, each processing a different input file containing
a subset of your N rows of data.
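As a minimal sketch of that extreme, the pool below stands in for the
job manager and each worker handles one input file holding a subset of
rows.  The file layout, chunk sizes, and per-file work are all
illustrative assumptions; a real job manager would launch separate
processes or cluster jobs rather than threads:

```python
import os
import tempfile
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # Each task is fully independent: load its own file, compute,
    # return.  In the real setup this would be one PyOpenCL program.
    rows = np.load(path)
    return rows.sum(axis=1)  # stand-in for the per-row kernel

# Write each subset of rows to its own file, as the job manager
# would see them.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(4):
    p = os.path.join(tmpdir, "chunk_%d.npy" % i)
    np.save(p, np.random.rand(16, 128))
    paths.append(p)

# The pool plays the role of the high-throughput job manager,
# dispatching one independent task per input file.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_file, paths))
```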

In the middle lie a huge number of choices for balancing IO, memory,
and compute resources.  That spectrum has spawned many different
research programs, each focusing on its own niche and machine model.

If you really want to abstract away the GPGPU devices, you might look
at OpenMP or similar projects that have tried to add such devices as
offload targets for auto-vectorization.  I don't work in that space,
and so have no specific recommendations.

At the other extreme, I adopted PyOpenCL to let me do my ad-hoc
processing in Python and NumPy with OpenCL callouts for certain
bottlenecks.  I have some image-processing tasks where there isn't
even enough compute per byte of input to justify the PCIe transfer
cost: they run at the same speed on an i7-4700MQ mobile quad-core CPU
(using just SIMD+multicore) as on a desktop Kepler GPU.
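A quick back-of-the-envelope model makes that kind of decision
concrete.  Every number below is an illustrative assumption, not a
measurement of any particular machine:

```python
# Offload only pays off if the compute time saved exceeds the time
# spent moving data over PCIe.  All figures here are assumptions.
bytes_per_pixel = 4
pixels = 4096 * 4096
total_bytes = pixels * bytes_per_pixel

pcie_bw = 6e9            # effective PCIe bandwidth, bytes/s (assumed)
flops_per_byte = 2       # low arithmetic intensity (assumed)
cpu_flops = 100e9        # SIMD+multicore CPU throughput (assumed)
gpu_flops = 1000e9       # GPU throughput (assumed)

transfer_s = 2 * total_bytes / pcie_bw   # to the device and back
cpu_s = total_bytes * flops_per_byte / cpu_flops
gpu_s = total_bytes * flops_per_byte / gpu_flops

# With so little work per byte, the PCIe round trip alone dwarfs the
# CPU compute time, so offloading to the GPU does not help.
offload_wins = (gpu_s + transfer_s) < cpu_s
```

With these assumed figures the transfer takes roughly 20x longer than
the entire CPU computation, which is why staying on the host can be
the right answer for low-intensity kernels.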

For me, the data IO from disk or network would also dominate, so
distributed processing is pointless.  Even so, I have used explicit
sub-block decomposition to split my large images into smaller OpenCL
tasks that can be marshaled through system RAM or the GPU to improve
locality and limit the intermediate working-set sizes.
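The sub-block idea itself is simple; here is a minimal NumPy sketch,
with a hypothetical per-tile operation standing in for the OpenCL
task and an arbitrary tile size:

```python
import numpy as np

def process_tile(tile):
    # Stand-in for the OpenCL task run on one sub-block.
    return tile * 2.0

def process_in_tiles(image, tile=256):
    # Walk the image in tile x tile sub-blocks so only one block's
    # working set is live at a time (edge tiles may be smaller).
    out = np.empty_like(image)
    h, w = image.shape
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            block = image[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] = process_tile(block)
    return out

img = np.random.rand(600, 500)   # dimensions not multiples of the tile
tiled = process_in_tiles(img)
```

The same loop structure works when `process_tile` enqueues a buffer
upload, kernel launch, and download per block, which is what keeps the
intermediate working set bounded.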


Karl


_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl