Removing the need for the memcpy itself would clearly be the best option.

Looking at Nsight, I see that NVDEC internally calls a color space 
transformation kernel on the default stream, and does not synchronize with the 
calling CPU thread. The cuMemcpy calls you have right now use the same default 
stream and do block the calling CPU thread, so they perform an implicit 
synchronization with it.

This means that if you remove the memcpys and the user wants to safely make 
any CUDA call over the results of this kernel, they have two options:
1. Either they use the same default stream (which is what I'm trying to avoid 
here).
2. Or the NvDecoder call "bool Decode(const uint8_t *pData, int nSize, uint8_t 
***pppFrame, int *pnFrameReturned, uint32_t flags = 0, int64_t **ppTimestamp = 
NULL, int64_t timestamp = 0, CUstream stream = 0)" uses the CUDA stream 
specified by FFmpeg, as we were saying in the previous emails, instead of not 
specifying any stream and therefore always defaulting to stream 0, the default 
stream. So: Decode(..., ..., ..., ..., ..., ..., ..., cuda_stream).

The second option has another benefit. If the FFmpeg user specifies their own 
non-default stream, then this kernel joins the "overlapping world" and can 
overlap with any other CUDA task, saving even more time.

Hope it helps!

If there are other places where cuMemcpy is called (we don't use it, but I 
think I saw it somewhere in the code), I think it would be nice to have the 
option to use a custom CUDA stream there as well, keeping the current behavior 
otherwise simply by not setting a custom stream.
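
The change in those places could be sketched roughly like this (a minimal 
sketch: the cuda_stream field is the one proposed for AVCUDADeviceContext in 
the quoted mail below, the plane loop and names around it are illustrative):

```cpp
// Defaults to 0, the default stream; together with the
// cuStreamSynchronize below this matches the current blocking
// cuMemcpy behavior when the user sets nothing
CUstream cuda_stream = hw_ctx->cuda_stream;

// Queue all plane copies asynchronously on the configured stream
for (int i = 0; i < nb_planes; i++)
    cuMemcpyAsync(dst[i], src[i], plane_size[i], cuda_stream);

// Synchronize once, after the last copy. With a non-default stream
// the copies can overlap with work enqueued on any other stream
cuStreamSynchronize(cuda_stream);
```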

P.S.: I have been thinking of asking NVIDIA whether there is a way to not call 
this kernel and get whatever comes from the decoder directly, so we can 
transform it to the format we need. That is, calling one kernel instead of two. 
I'll let you know if we do, in case this becomes an option. I wonder what 
uint32_t flags is used for, though. It's not explained in the headers.

-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> On Behalf Of Timo 
Rothenpieler
Sent: Monday, May 7, 2018 5:13 PM
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally 
created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for 
decoding with NVDEC

Am 07.05.2018 um 17:05 schrieb Oscar Amoros Huguet:
> To clarify a bit what I was saying in the last email. When I said CUDA 
> non-blocking streams, I meant non-default streams. All non-blocking 
> streams are non-default streams, but non-default streams can be 
> blocking or non-blocking with respect to the default stream. 
> https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html
> 
> So, using cuMemcpyAsync would allow the memory copies to overlap with 
> any other copy or kernel execution enqueued in any other non-default 
> stream. 
> https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
> 
> If cuStreamSynchronize has to be called right after the last cuMemcpyAsync 
> call, I see different ways of implementing this, but probably you will most 
> likely prefer the following:
> 
> - Add cuMemcpyAsync to the list of CUDA functions.
> - Add a field of type CUstream to AVCUDADeviceContext, and set it to 0 
> (zero) by default. Let's name it "CUstream cuda_stream"?
> - Always call cuMemcpyAsync instead of cuMemcpy, passing cuda_stream as 
> the last parameter: cuMemcpyAsync(..., ..., ..., cuda_stream);
> - After the last cuMemcpyAsync, call cuStreamSynchronize on cuda_stream: 
> cuStreamSynchronize(cuda_stream);
> 
> If the user does not change the context and the stream, the behavior will be 
> exactly the same as it is now, with no synchronization hazards, because 
> passing "0" as the CUDA stream makes the calls blocking, as if they weren't 
> asynchronous calls.
> 
> But if the user wants the copies to overlap with the rest of their 
> application, they can set their own CUDA context and their own 
> non-default stream.
> 
> In any of the cases, ffmpeg does not have to handle cuda stream creation and 
> destruction, which makes it simpler.
> 
> Hope you like it!

A different idea I'm looking at right now is to get rid of the memcpy entirely, 
turning the mapped cuvid frame into an AVFrame itself, with a buffer_ref that 
unmaps the cuvid frame when freeing it, instead of allocating a whole new 
buffer and copying it over.
I'm not sure how that will play out with available free surfaces, but I will 
test.

I'll also add the stream basically like you described, as it seems useful to 
have around anyway.

If the previously mentioned approach does not work, I'll implement this as 
described, probably for all cuMemcpy* in ffmpeg, as it at least runs the 
2/3 plane copies asynchronously. Not sure if it can be changed to actually do 
them in parallel.

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel