Removing the need for the memcpy itself would clearly be best. Looking at Nsight, I see that NVDEC internally calls a color space transformation kernel on the default stream and does not synchronize with the calling CPU thread. The cuMemcpy calls you have right now use the same default stream and do block the calling CPU thread, so they perform an implicit synchronization with the CPU thread.

This means that if you remove the memcpys and the user wants to make any CUDA call on the results of this kernel, they have two options to do it safely:

1. Either they use the same default stream (which is what I'm trying to avoid here).
2. Or the NvDecoder call "bool Decode(const uint8_t *pData, int nSize, uint8_t ***pppFrame, int *pnFrameReturned, uint32_t flags = 0, int64_t **ppTimestamp = NULL, int64_t timestamp = 0, CUstream stream = 0)" uses the CUDA stream specified by ffmpeg, as we were saying in the previous emails, instead of not specifying any stream and therefore always defaulting to stream 0, the default stream. So: Decode(..., ..., ..., ..., ..., ..., ..., cuda_stream)

The second option has another benefit. If the ffmpeg user specifies their own non-default stream, this kernel joins the "overlapping world" and can overlap with any other CUDA task, saving even more time.

Hope it helps!

If there are other places where cuMemcpy is called (we don't use it, but I think I saw it somewhere in the code), I think it would be nice to have the option to use a custom CUDA stream, and to keep the behavior as is otherwise, just by not setting a custom stream.

P.S.: I had thoughts of talking to NVIDIA to find out whether there is a way to not call this kernel and get whatever comes from the decoder directly, so we can transform it to the format we need. That is, calling one kernel instead of two. I'll let you know if we do, in case this becomes an option.

I wonder what uint32_t flags is used for, though. It's not explained in the headers.
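
In case it helps the discussion, here is a rough sketch in C of the control flow I proposed in my previous email (quoted below): a stream field that defaults to 0, every plane copy issued on that stream, and a single synchronize after the last copy. Everything in it is a stand-in so that it compiles without the CUDA headers: fake_stream, sketch_device_ctx, fake_memcpy_async, fake_stream_synchronize and sketch_copy_planes are hypothetical names of mine, not ffmpeg or CUDA API. In real code they would correspond to CUstream, AVCUDADeviceContext, cuMemcpyAsync (or a 2D variant) and cuStreamSynchronize, and the copies would only be enqueued instead of executed immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for CUstream. NULL (0) plays the role of the default stream. */
typedef void *fake_stream;

/* Hypothetical model of the device context with the proposed extra field. */
typedef struct {
    fake_stream cuda_stream; /* 0 by default, optionally set by the user */
} sketch_device_ctx;

/* Small call log so the order of operations can be inspected. */
enum { EV_COPY, EV_SYNC };
typedef struct {
    int kind;
    fake_stream stream;
} sketch_event;

static sketch_event sketch_log[16];
static size_t sketch_log_len;

static void sketch_record(int kind, fake_stream s)
{
    if (sketch_log_len < sizeof(sketch_log) / sizeof(sketch_log[0])) {
        sketch_log[sketch_log_len].kind = kind;
        sketch_log[sketch_log_len].stream = s;
        sketch_log_len++;
    }
}

/* Stand-in for an async copy. A real one would only enqueue the copy on
 * stream s; this one copies immediately and records the call. */
static void fake_memcpy_async(void *dst, const void *src, size_t n,
                              fake_stream s)
{
    memcpy(dst, src, n);
    sketch_record(EV_COPY, s);
}

/* Stand-in for a stream synchronize. A real one would block the CPU thread
 * until all work enqueued on s has finished; this one only records the call. */
static void fake_stream_synchronize(fake_stream s)
{
    sketch_record(EV_SYNC, s);
}

/* Issue all plane copies on the context's stream, then wait once at the end. */
static void sketch_copy_planes(const sketch_device_ctx *ctx,
                               void *const dst[], const void *const src[],
                               const size_t size[], int nb_planes)
{
    for (int i = 0; i < nb_planes; i++)
        fake_memcpy_async(dst[i], src[i], size[i], ctx->cuda_stream);
    fake_stream_synchronize(ctx->cuda_stream);
}
```

With cuda_stream left at 0, the caller still gets finished copies when the function returns, which is the behavior we have today. A user who sets their own non-default stream keeps that guarantee, and the copies can additionally overlap with work on other streams.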

-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> On Behalf Of Timo Rothenpieler
Sent: Monday, May 7, 2018 5:13 PM
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] Added the possibility to pass an externally created CUDA context to libavutil/hwcontext.c/av_hwdevice_ctx_create() for decoding with NVDEC

On 07.05.2018 at 17:05, Oscar Amoros Huguet wrote:
> To clarify a bit what I was saying in the last email: when I said CUDA
> non-blocking streams, I meant non-default streams. All non-blocking
> streams are non-default streams, but non-default streams can be
> blocking or non-blocking with respect to the default stream.
> https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html
>
> So, using cuMemcpyAsync would allow the memory copies to overlap with
> any other copy or kernel execution enqueued in any other non-default
> stream.
> https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
>
> If cuStreamSynchronize has to be called right after the last cuMemcpyAsync
> call, I see different ways of implementing this, but you will most
> likely prefer the following:
>
> Add cuMemcpyAsync to the list of CUDA functions.
> Add a field of type CUstream to AVCUDADeviceContext, and set it to 0 (zero)
> by default. Let's name it "CUstream cuda_stream"?
> Always call cuMemcpyAsync instead of cuMemcpy, passing cuda_stream as
> the last parameter: cuMemcpyAsync(..., ..., ..., cuda_stream);
> After the last cuMemcpyAsync, call cuStreamSynchronize on cuda_stream:
> cuStreamSynchronize(cuda_stream);
>
> If the user does not change the context and the stream, the behavior will be
> exactly the same as it is now. No synchronization hazards, because passing
> "0" as the CUDA stream makes the calls blocking, as if they weren't
> asynchronous calls.
>
> But if the user wants the copies to overlap with the rest of their
> application, they can set their own CUDA context and their own non-default
> stream.
>
> In any of the cases, ffmpeg does not have to handle CUDA stream creation and
> destruction, which makes it simpler.
>
> Hope you like it!

A different idea I'm looking at right now is to get rid of the memcpy entirely, turning the mapped cuvid frame into an AVFrame itself, with a buffer_ref that unmaps the cuvid frame when freeing it, instead of allocating a whole new buffer and copying it over.
I'm not sure how that will play out with available free surfaces, but I will test.

I'll also add the stream basically like you described, as it seems useful to have around anyway.
If the previously mentioned approach does not work, I'll implement this as described, probably for all cuMemcpy* calls in ffmpeg, as it at least runs the 2/3 plane copies asynchronously. Not sure if it can be changed to actually do them in parallel.

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel