On 07.05.2018 at 17:05, Oscar Amoros Huguet wrote:
To clarify a bit what I was saying in the last email: when I said CUDA 
non-blocking streams, I meant non-default streams. All non-blocking streams are 
non-default streams, but non-default streams can be blocking or non-blocking 
with respect to the default stream. 
https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html

So using cuMemcpyAsync would allow the memory copies to overlap with any 
other copy or kernel execution enqueued in any other non-default stream. 
https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/

If cuStreamSynchronize has to be called right after the last cuMemcpyAsync 
call, I see different ways of implementing this, but you will most likely 
prefer the following:

Add cuMemcpyAsync to the list of CUDA functions.
Add a field of type CUstream to AVCUDADeviceContext, and set it to 0 (zero) by default. 
Let's name it "CUstream cuda_stream"?
Always call cuMemcpyAsync instead of cuMemcpy, passing cuda_stream as the last 
parameter: cuMemcpyAsync(..., ..., ..., cuda_stream);
After the last cuMemcpyAsync, call cuStreamSynchronize on cuda_stream: 
cuStreamSynchronize(cuda_stream);
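
The steps above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual ffmpeg code: every Sketch* type, the function table, sketch_copy_planes and the fake_* helpers are hypothetical names made up for this example. The two table slots take their arguments in the same order as cuMemcpyAsync (dst, src, byte count, stream) and cuStreamSynchronize (stream), and the host-side fakes exist only so the sketch builds and runs without the CUDA toolkit or a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins so the sketch builds without the CUDA toolkit.
 * With <cuda.h> these would be CUresult, CUstream and CUdeviceptr. */
typedef int       SketchResult;   /* 0 plays the role of CUDA_SUCCESS */
typedef void     *SketchStream;   /* NULL plays the role of stream 0  */
typedef uintptr_t SketchDevPtr;

/* Hypothetical function table; the slots mirror the parameter order of
 * cuMemcpyAsync and cuStreamSynchronize. */
typedef struct SketchCudaFunctions {
    SketchResult (*memcpy_async)(SketchDevPtr dst, SketchDevPtr src,
                                 size_t bytes, SketchStream stream);
    SketchResult (*stream_synchronize)(SketchStream stream);
} SketchCudaFunctions;

/* Models only the proposed new field of AVCUDADeviceContext. */
typedef struct SketchDeviceContext {
    SketchStream cuda_stream;     /* 0 by default; the user may override it */
} SketchDeviceContext;

typedef struct SketchPlane {
    SketchDevPtr dst, src;
    size_t       bytes;
} SketchPlane;

/* Enqueue every plane copy on the context's stream, then wait once. */
static SketchResult sketch_copy_planes(const SketchCudaFunctions *cu,
                                       const SketchDeviceContext *ctx,
                                       const SketchPlane *planes,
                                       int nb_planes)
{
    assert(cu && ctx && planes && nb_planes >= 0);
    for (int i = 0; i < nb_planes; i++) {
        SketchResult err = cu->memcpy_async(planes[i].dst, planes[i].src,
                                            planes[i].bytes,
                                            ctx->cuda_stream);
        if (err != 0)
            return err;           /* stop at the first failed enqueue */
    }
    /* Single synchronization point after the last enqueued copy. */
    return cu->stream_synchronize(ctx->cuda_stream);
}

/* Host-side fakes, only so the sketch can be exercised without a GPU. */
static int fake_copies, fake_syncs;

static SketchResult fake_memcpy_async(SketchDevPtr dst, SketchDevPtr src,
                                      size_t bytes, SketchStream stream)
{
    (void)stream;
    memcpy((void *)dst, (const void *)src, bytes);
    fake_copies++;
    return 0;
}

static SketchResult fake_stream_synchronize(SketchStream stream)
{
    (void)stream;
    fake_syncs++;
    return 0;
}
```

In this shape the copy loop never creates or destroys a stream; the only thing that differs between the default behavior and the overlapping one is the value stored in cuda_stream.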

If the user does not change the context and the stream, the behavior will be exactly the 
same as it is now, with no synchronization hazards, because passing "0" as the CUDA 
stream makes the calls blocking, as if they weren't asynchronous calls.

But if users want the copies to overlap with the rest of their application, 
they can set their own CUDA context and their own non-default stream.

In either case, ffmpeg does not have to handle CUDA stream creation and 
destruction, which makes it simpler.

Hope you like it!

A different idea I'm looking at right now is to get rid of the memcpy entirely, turning the mapped cuvid frame into an AVFrame itself, with a buffer_ref that unmaps the cuvid frame when freeing it, instead of allocating a whole new buffer and copying it over. I'm not sure how that will play out with available free surfaces, but I will test.

I'll also add the stream basically like you described, as it seems useful to have around anyway.

If the previously mentioned approach does not work, I'll implement this as described, probably for all cuMemcpy* calls in ffmpeg, as it at least runs the 2 or 3 plane copies asynchronously. I'm not sure whether it can be changed to actually do them in parallel.


_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
