Hi, I am putting together some timing tests with PyCUDA for batch-loading an image sequence. I first timed a normal, synchronous transfer from ordinary (pageable) host memory to global memory.
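For reference, a baseline measurement of this kind could look roughly like the sketch below. It assumes a single GPU reachable through pycuda.autoinit and times the copy with CUDA events; the function names and the repeat count are made up for illustration and are not part of PyCUDA.

```python
def bandwidth_gb_s(nbytes, elapsed_ms):
    """Convert a byte count and a time in milliseconds to GB/s."""
    return nbytes / (elapsed_ms * 1e-3) / 1e9


def time_pageable_htod(batch, repeats=10):
    """Average time in ms of a synchronous host-to-device copy.

    `batch` is an ordinary (pageable) contiguous numpy array.
    """
    # Imported lazily so bandwidth_gb_s() can be used without a GPU.
    import pycuda.autoinit  # noqa: F401  (creates a context on the first device)
    import pycuda.driver as cuda

    dev_buf = cuda.mem_alloc(batch.nbytes)
    start, stop = cuda.Event(), cuda.Event()

    start.record()
    for _ in range(repeats):
        cuda.memcpy_htod(dev_buf, batch)   # blocks until the copy has finished
    stop.record()
    stop.synchronize()                     # wait for the stop event itself

    return start.time_till(stop) / repeats  # time_till() reports milliseconds
```

On a machine with a GPU, time_pageable_htod(batch) returns the mean copy time, and bandwidth_gb_s(batch.nbytes, ms) turns it into a throughput figure that can be compared across the other transfer modes.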
Now I would like to test pagelocked memory. Specifically, I would like to test three cases:

1. Single-stream, synchronous pagelocked transfers.
2. Multi-stream, asynchronous pagelocked transfers.
3. Zero-copy memory using device-mapped memory.

For the first one, do I simply call pycuda.driver.memcpy_htod/memcpy_dtoh on the pagelocked memory? (I am passing mem_flags=0 when creating the pagelocked memory; I assume that corresponds to cudaHostAllocDefault?)

For the second, I would use the memcpy_htod_async/memcpy_dtoh_async calls with more than one stream (my laptop supports concurrent kernels).

For the final one, I would create my own context using pycuda.driver.Device.make_context with the ctx_flags.MAP_HOST flag, allocate the pagelocked memory using host_alloc_flags.DEVICEMAP, and call my kernel with the device pointer?

Am I on the right track? I had a hard time finding good tutorials or source code (even in the PyCUDA examples section), so I plan to submit some examples if I have time :)

Also... excellent library!

Best regards,
Alexander
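For concreteness, the three cases described above could be sketched as follows. This is an illustration under assumptions rather than a tested benchmark: the helper names (chunk_bounds, pinned_sync_copy, pinned_async_copy, make_mapping_context, mapped_host_array) are hypothetical, the batch is assumed to be a C-contiguous numpy array with one frame per leading index, and timing code is left out.

```python
def chunk_bounds(n_items, n_chunks):
    """Split range(n_items) into n_chunks contiguous (start, stop) pairs."""
    edges = [n_items * i // n_chunks for i in range(n_chunks + 1)]
    return list(zip(edges[:-1], edges[1:]))


def pinned_sync_copy(batch):
    """Case 1: synchronous copy from pagelocked host memory."""
    import pycuda.driver as cuda  # lazy import; needs an active context

    host = cuda.pagelocked_empty(batch.shape, batch.dtype, mem_flags=0)
    host[...] = batch                      # stage the data in pinned memory
    dev = cuda.mem_alloc(host.nbytes)
    cuda.memcpy_htod(dev, host)            # returns once the copy is done
    return dev


def pinned_async_copy(batch, n_streams=2):
    """Case 2: one asynchronous copy per stream, each covering a slice.

    Assumes n_streams <= len(batch) so that no slice is empty.
    """
    import pycuda.driver as cuda

    host = cuda.pagelocked_empty(batch.shape, batch.dtype)
    host[...] = batch
    dev = cuda.mem_alloc(host.nbytes)
    frame_bytes = host[0].nbytes
    streams = [cuda.Stream() for _ in range(n_streams)]
    for stream, (lo, hi) in zip(streams, chunk_bounds(len(host), n_streams)):
        # int(dev) is the raw device address, so slices can be offset by hand.
        cuda.memcpy_htod_async(int(dev) + lo * frame_bytes, host[lo:hi], stream)
    for stream in streams:
        stream.synchronize()               # host buffer must outlive the copies
    return dev


def make_mapping_context(device_index=0):
    """Create a context that allows host memory to be mapped (case 3)."""
    import pycuda.driver as cuda

    cuda.init()
    flags = cuda.ctx_flags.SCHED_AUTO | cuda.ctx_flags.MAP_HOST
    return cuda.Device(device_index).make_context(flags=flags)


def mapped_host_array(batch):
    """Case 3: zero-copy. Call inside a context from make_mapping_context()."""
    import pycuda.driver as cuda

    host = cuda.pagelocked_empty(
        batch.shape, batch.dtype, mem_flags=cuda.host_alloc_flags.DEVICEMAP)
    host[...] = batch
    # The array's .base is the PagelockedHostAllocation that owns the memory.
    dev_ptr = int(host.base.get_device_pointer())
    # Wrap dev_ptr in numpy.intp when passing it to a kernel, and synchronize
    # the context before reading kernel results back through `host`.
    return host, dev_ptr
```

Cases 1 and 2 only need some active context (for example from pycuda.autoinit), whereas case 3 needs the context returned by make_mapping_context(), which should be released with its pop() method when the test is finished. Splitting the batch by frames in chunk_bounds keeps every asynchronous copy contiguous in both host and device memory.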
