Thanks for the quick answer. Mea culpa, i just realised i was reallocating the texture on every loop iteration, which was causing the lag i was experiencing. I should really check my code 7 times before posting to mailing lists (or i should stop drinking beer while coding)... By the way, i am interested in contributing to pycuda. I'm not very experienced in GPGPU (my research group just acquired a bunch of capable graphics cards), but i think i can help with the numpy/scipy integration and maybe some numerical / image processing algorithms.
cheers,
J-Pascal

Andreas Klöckner wrote:
> On Thursday, 05 February 2009, J-Pascal Mercier wrote:
>> Hi,
>>
>> I have a kernel that is invoked in a loop with data calculated from the
>> last kernel iteration. The kernel uses textures as input data. Right now,
>> i use the function Memcpy2D/3D to copy the resulting GPUarray back to a
>> texture, but unfortunately this operation is very slow. I have only been
>> able to achieve 3-4 GB/s, which is way lower than the 50-60 GB/s i can
>> achieve in C with the function cudaMemcpyToArray, which unfortunately is
>> part of the Runtime API. My guess is that the problem comes from the
>> parameters of Memcpy2D/3D, but i can't get the right ones to speed up the
>> process. The function looks like:
>
> Odd--that sounds like the data is actually crossing the PCIe bus, which
> would be less than useful.
>
> I have a suspicion: your memory pitch is off. The manpage for
> cuMemAllocPitch says this here:
>
>     The pitch returned by cuMemAllocPitch() is guaranteed to work with
>     cuMemcpy2D() under all circumstances. For allocations of 2D arrays, it
>     is recommended that programmers consider performing pitch allocations
>     using cuMemAllocPitch(). Due to alignment restrictions in the hardware,
>     this is especially true if the application will be performing 2D memory
>     copies between different regions of device memory (whether linear
>     memory or CUDA arrays).
>
> That reveals a small deficiency in PyCuda: there needs to be a way to
> allocate GPUArrays that results in cuMemAllocPitch being used for the
> allocation. I'll look into that (but if you're willing to cook up a patch,
> that wouldn't hurt, either). In the meantime, can you check (using just
> pycuda.driver.mem_alloc_pitch) whether that fixes it?
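For anyone following along, here is a minimal sketch of the pitched-allocation idea being suggested above. The row-padding arithmetic is shown as plain Python; the 256-byte alignment value and the `cuda_array` name are assumptions for illustration (the real pitch comes from the driver, not from this helper), and the PyCuda calls in the trailing comment are an untested sketch of how I'd expect the copy to be set up.

```python
# cuMemAllocPitch pads each row of a 2D allocation so that every row starts
# on an alignment boundary the copy engine likes. A plain mem_alloc gives
# pitch == row width, which can push cuMemcpy2D onto a slow path.

def padded_pitch(width_in_bytes, alignment=256):
    """Round a row width up to the next multiple of `alignment` bytes.

    256 is only an illustrative texture-alignment value; the actual pitch
    is whatever the driver returns from mem_alloc_pitch."""
    return ((width_in_bytes + alignment - 1) // alignment) * alignment

# e.g. a 1000-column float32 image: 4000-byte rows get padded to 4096 bytes,
# so the device buffer holds pitch * height bytes, not width * height.
print(padded_pitch(1000 * 4))  # -> 4096

# With PyCuda, the pitched allocation and the device-to-array copy would
# look roughly like this (untested sketch, GPU required):
#
#   import pycuda.driver as drv
#   devptr, pitch = drv.mem_alloc_pitch(width_in_bytes, height,
#                                       access_size=4)
#   copy = drv.Memcpy2D()
#   copy.set_src_device(devptr)
#   copy.src_pitch = pitch              # the driver-chosen pitch, not width
#   copy.set_dst_array(cuda_array)      # the texture's underlying CUDA array
#   copy.width_in_bytes = width_in_bytes
#   copy.height = height
#   copy(aligned=True)
```

The key point is that `src_pitch` must be the pitch the driver handed back, not the logical row width; mixing the two is a common way to end up on the slow copy path.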
> Andreas
>
> _______________________________________________
> PyCuda mailing list
> [email protected]
> http://tiker.net/mailman/listinfo/pycuda_tiker.net
