On Tue, 3 Mar 2020 at 21:23, Philip Langdale <[email protected]> wrote:
>
> On Sun, 1 Mar 2020 07:16:05 +0300
> Dennis Mungai <[email protected]> wrote:
>
> > Hello there,
> >
> > I've run into some scenarios where a long-running FFmpeg process
> > configured to use NVDEC crashes with the error message "illegal
> > memory access", something related to CUDA.
> >
> > I'm unable to consistently reproduce this issue with concurrent runs
> > as I'm transcoding live channels (provided as mpegts udp streams).
> > When I'm back at my desk I'll try to copy and paste the exact error
> > message and the FFmpeg command used.
> >
> > Are there private options that can be passed to the NVDEC hwaccel for
> > maximum stability? I've seen the use of -extra_hw_frames 2 being
> > recommended on a related ticket, which presented a segfault when
> > handling encodes with B-frames and a deinterlace filter in the same
> > flow, but I'm unable to replicate such a workaround.
> >
> > Warm regards,
> >
> > Dennis.
>
> I'd need to see the error, and ideally a backtrace, to even begin
> investigating this. From your description, if it's a SIGABRT from
> inside the cuda library, then it's likely an internal cuda issue -
> perhaps related to a memory leak that only becomes an issue for very
> long decode periods. And then nvidia people would need to look at it.
>
> Thanks,
>
> --phil
Hello there,

I think I've stumbled upon the solution. The fix is to set this environment variable:

CUDA_DEVICE_ORDER=PCI_BUS_ID

When you have multiple NVENC-capable GPUs on the same host, the device index CUDA returns differs, by default, from what nvidia-smi reports: NVIDIA assigns GPU index 0 to what it assumes is the "fastest GPU on the system", or even worse, to what it assumes is the first PCI slot at boot (and this can change across runs). This heuristic applies even if identical GPUs are installed.

The real disaster unfolds when the nvdec/cuda hwaccel is in use with -hwaccel_output_format set (preventing download of textures to system memory) and -hwaccel_device set to a specific device: even within the same run, that device index *will* change *if* a filter chained to the hwaccel, say scale_npp or scale_cuda, is re-initialized. On resumption, the device index known to the prior context is not guaranteed to match up, and boom, a segfault (as described above).

Setting the variable above completely eliminates the problem. Back to happy camping :-)

Carl's observation, backed by Phil, proved to be most telling: just because it triggers a segfault doesn't necessarily mean it's an FFmpeg problem.

Documentation on the same:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars

And other threads mentioning similar issues with CUDA's default device-ordering behavior when multiple GPUs are present:

1. https://devtalk.nvidia.com/default/topic/605113/cuda-programming-and-performance/no-gpu-selected-code-working-properly-hows-this-possible-/?offset=11#3939141
2. https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/

Hope this is of help to someone else who stumbles on the same issue(s).
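For anyone who wants a concrete shape for this, here is a minimal sketch. The udp:// addresses, scale target, and codec choices are placeholders standing in for my actual live-channel setup, not values from it; only the flags themselves come from the discussion above:

```shell
# Pin CUDA device enumeration to PCI bus order so that the index passed
# to -hwaccel_device matches what nvidia-smi reports, and stays stable
# even if a chained filter (scale_cuda / scale_npp) is re-initialized.
export CUDA_DEVICE_ORDER=PCI_BUS_ID

# Transcode a live mpegts udp stream entirely on GPU 0.
# Input/output addresses and the scale target are placeholders.
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -hwaccel_device 0 \
       -i udp://239.0.0.1:5000 \
       -vf scale_cuda=1280:720 \
       -c:v h264_nvenc -c:a copy \
       -f mpegts udp://239.0.0.2:5000
```

Note that the variable must be in the environment before ffmpeg starts (export it as above, or prefix the command with CUDA_DEVICE_ORDER=PCI_BUS_ID), since CUDA reads it once at initialization.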
_______________________________________________
ffmpeg-user mailing list
[email protected]
https://ffmpeg.org/mailman/listinfo/ffmpeg-user

To unsubscribe, visit link above, or email
[email protected] with subject "unsubscribe".
