Thanks for the quick answer. Mea culpa, i just realised i was reallocating the texture on every loop iteration, which was causing the lag i was experiencing. I should really check my code 7 times before posting to mailing lists (or i should stop drinking beer while coding)... By the way, i am interested in contributing to pycuda. I'm not very experienced in GPGPU (my research group just acquired a bunch of capable graphics cards), but i think i can help with the numpy/scipy integration and maybe some numerical / image processing algorithms.
cheers,
J-Pascal

Andreas Klöckner wrote:
> On Thursday, 05 February 2009, J-Pascal Mercier wrote:
>> Hi,
>>
>> I have a kernel that is invoked in a loop with data calculated from the
>> last kernel iteration. The kernel uses textures as input data. Right now,
>> i use the function Memcpy2D/3D to copy the resulting GPUarray back to a
>> texture, but unfortunately this operation is very slow. I have only been
>> able to achieve 3-4 GB/s, which is way lower than the 50-60 GB/s i can
>> achieve in C with the function cudaMemcpyToArray, which unfortunately is
>> part of the Runtime API. My guess is that the problem comes from the
>> parameters of Memcpy2D/3D, but i can't get the right ones to speed up the
>> process. The function looks like:
>
> Odd--that sounds like the data is actually crossing the PCIe bus, which
> would be less than useful.
>
> I have a suspicion: your memory pitch is off. The manpage for
> cuMemAllocPitch says this here:
>
>     The pitch returned by cuMemAllocPitch() is guaranteed to work with
>     cuMemcpy2D() under all circumstances. For allocations of 2D arrays, it
>     is recommended that programmers consider performing pitch allocations
>     using cuMemAllocPitch(). Due to alignment restrictions in the hardware,
>     this is especially true if the application will be performing 2D memory
>     copies between different regions of device memory (whether linear
>     memory or CUDA arrays).
>
> That reveals a small deficiency in PyCuda: there needs to be a way to
> allocate GPUArrays that results in cuMemAllocPitch being used for the
> allocation. I'll look into that (but if you're willing to cook up a patch,
> that wouldn't hurt, either). In the meantime, can you check (using just
> pycuda.driver.mem_alloc_pitch) whether that fixes it?
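For anyone following along, here is a minimal sketch of the pitched-allocation idea being suggested above. The row-padding arithmetic is shown as plain Python; the 256-byte alignment value and the `cuda_array` name are assumptions for illustration (the real pitch comes from the driver, not from this helper), and the PyCuda calls in the trailing comment are an untested sketch of how I'd expect the copy to be set up.

```python
# cuMemAllocPitch pads each row of a 2D allocation so that every row starts
# on an alignment boundary the copy engine likes. A plain mem_alloc gives
# pitch == row width, which can push cuMemcpy2D onto a slow path.

def padded_pitch(width_in_bytes, alignment=256):
    """Round a row width up to the next multiple of `alignment` bytes.

    256 is only an illustrative texture-alignment value; the actual pitch
    is whatever the driver returns from mem_alloc_pitch."""
    return ((width_in_bytes + alignment - 1) // alignment) * alignment

# e.g. a 1000-column float32 image: 4000-byte rows get padded to 4096 bytes,
# so the device buffer holds pitch * height bytes, not width * height.
print(padded_pitch(1000 * 4))  # -> 4096

# With PyCuda, the pitched allocation and the device-to-array copy would
# look roughly like this (untested sketch, GPU required):
#
#   import pycuda.driver as drv
#   devptr, pitch = drv.mem_alloc_pitch(width_in_bytes, height,
#                                       access_size=4)
#   copy = drv.Memcpy2D()
#   copy.set_src_device(devptr)
#   copy.src_pitch = pitch              # the driver-chosen pitch, not width
#   copy.set_dst_array(cuda_array)      # the texture's underlying CUDA array
#   copy.width_in_bytes = width_in_bytes
#   copy.height = height
#   copy(aligned=True)
```

The key point is that `src_pitch` must be the pitch the driver handed back, not the logical row width; mixing the two is a common way to end up on the slow copy path.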
> Andreas
>
> _______________________________________________
> PyCuda mailing list
> [email protected]
> http://tiker.net/mailman/listinfo/pycuda_tiker.net
