Thanks for the replies, @Eli and @David. I suppose that, given a 'small' enough list of sub-lists, doing what I need on the host is good enough, instead of moving the data and doing the task on the device. I'm just looking out for scenarios where the list of sub-lists is 'large' enough that moving the work to the GPU with PyCUDA would be much faster. If I do go ahead with the for loop, as Eli and I both had in mind, the running time would be O(n), where n is the number of sub-lists. Since I have lots of threads I can put to use, I was thinking I'd rather do that O(n) work on the device, making it effectively constant-time, relatively speaking. Of course, that's the ideal case and doesn't account much for the host-device copy overhead.
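For what it's worth, here is a minimal host-side sketch of the kind of preparation that usually precedes a per-sub-list kernel launch (all names here are my own, not from the thread): flattening the ragged list of sub-lists into one contiguous array plus an offsets array, so that GPU thread i can handle sub-list i.

```python
import numpy as np

def flatten_sublists(sublists):
    """Flatten a ragged list of sub-lists into one contiguous data
    array plus an offsets array; a kernel thread i would then work
    on data[offsets[i]:offsets[i + 1]]."""
    lengths = [len(s) for s in sublists]
    offsets = np.zeros(len(sublists) + 1, dtype=np.int32)
    offsets[1:] = np.cumsum(lengths)
    data = np.asarray([x for s in sublists for x in s], dtype=np.float32)
    return data, offsets

# Host-side O(n) reference loop: e.g. sum each sub-list.
sublists = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
data, offsets = flatten_sublists(sublists)
sums = [float(data[offsets[i]:offsets[i + 1]].sum())
        for i in range(len(sublists))]
```

A PyCUDA kernel would then receive `data`, `offsets`, and an output array of length n, with one thread per sub-list; whether that beats the host loop depends on n and on the copy overhead mentioned above.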
Best regards, ./francis
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda