On Monday, November 10th, 2025 at 3:09 PM, Thilo Schunck via ffmpeg-devel 
<[email protected]> wrote:

> 
> 
> Hi Team!
> 
> Apologies if I am breaking the submission rules, but as of now I don't know 
> better :-)
> 
> I figured out that on ARM, "hwdownload" is quite slow.
> It turns out this is caused by image_copy_plane in imgutils.c, which does a 
> memcpy loop:
> 
> for (;height > 0; height--) {
>     memcpy(dst, src, bytewidth);
>     dst += dst_linesize;
>     src += src_linesize;
> }
> 
> As a quick-and-dirty PoC, I created 4 threads and split the copy among them. 
> In my case this improved throughput from about 26 to 51 fps.
> 
> ./ffmpeg -hide_banner -hwaccel v4l2request -hwaccel_output_format drm_prime \
> -threads 4 \
> -i ../Big_Buck_Bunny_720_10s_10MB.mp4 \
> -filter_complex "[0:v]hwdownload,format=nv12[myOut]" -map "[myOut]" \
> -f null -
> 
> Maybe someone is interested in this improvement with cleaned-up code.
> My PoC uses a hard-coded 4 threads, which is for sure bad ...

I could see `hwcontext_drm` specifically using an internal `AVSliceThread` for 
`RAM<->VRAM` transfers. We have to be careful about thread safety, though, as 
`av_hwframe_transfer_data` may be called from multiple threads. So in the worst 
case, we would need to create and tear down the `AVSliceThread` for each frame 
being transferred.
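
To make the idea concrete, here is a minimal sketch of how the row copy could be 
partitioned into independent slice jobs. `copy_plane_slice` and its parameters 
are hypothetical and not existing FFmpeg API; it only borrows the usual 
`jobnr`/`nb_jobs` slicing convention, and the actual thread pool that would 
dispatch the jobs is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of FFmpeg: copies only the rows that belong
 * to job `jobnr` out of `nb_jobs`, so that jobs can run on separate threads
 * without ever writing to the same destination rows. */
void copy_plane_slice(uint8_t *dst, ptrdiff_t dst_linesize,
                      const uint8_t *src, ptrdiff_t src_linesize,
                      size_t bytewidth, int height,
                      int jobnr, int nb_jobs)
{
    assert(nb_jobs > 0 && jobnr >= 0 && jobnr < nb_jobs);

    /* Rows [start, end) of this job; the ranges of all jobs tile [0, height). */
    int start = (int)((int64_t)height * jobnr / nb_jobs);
    int end   = (int)((int64_t)height * (jobnr + 1) / nb_jobs);

    dst += start * dst_linesize;
    src += start * src_linesize;
    for (int y = start; y < end; y++) {
        memcpy(dst, src, bytewidth);
        dst += dst_linesize;
        src += src_linesize;
    }
}
```

A slice-thread worker would call this once per job. Calling it sequentially 
for `jobnr = 0 .. nb_jobs - 1` produces the same result as the single loop, 
which makes the partitioning easy to check before any threading is involved.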

As a possible alternative, `vf_hwdownload` could become frame-threaded, though 
that would only improve throughput, not latency.
