On Monday, November 10th, 2025 at 3:09 PM, Thilo Schunck via ffmpeg-devel
<[email protected]> wrote:
>
>
> Hi Team!
>
> Apologies if I am breaking the submission rules; as of now I don't know any
> better :-)
>
> I found that on ARM, "hwdownload" is quite slow.
> It turns out this is caused by image_copy_plane in imgutils.c, which does a
> memcpy loop:
>
> for (; height > 0; height--) {
>     memcpy(dst, src, bytewidth);
>     dst += dst_linesize;
>     src += src_linesize;
> }
>
> As a quick-and-dirty PoC, I created 4 threads and split the copy across them.
> In my case this improved throughput from about 26 fps to 51 fps.
>
> ./ffmpeg -hide_banner -hwaccel v4l2request -hwaccel_output_format drm_prime \
> -threads 4 \
> -i ../Big_Buck_Bunny_720_10s_10MB.mp4 \
> -filter_complex "[0:v]hwdownload,format=nv12[myOut]" -map "[myOut]" \
> -f null -
>
> Maybe someone is interested in this improvement once the code is cleaned up.
> My PoC hard-codes 4 threads, which is certainly not ideal ...
I could see `hwcontext_drm` specifically using an internal `AVSliceThread` for
`RAM<->VRAM` transfers. We have to be careful about thread safety, though, as
`av_hwframe_transfer_data` may be called from multiple threads. So in the worst
case, we would need to create and tear down the `AVSliceThread` for every frame
being transferred.

As a possible alternative, `vf_hwdownload` could become frame-threaded, though
that would only improve throughput, not latency.
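
To make the row-splitting idea concrete, here is a minimal sketch of a plane
copy divided into horizontal bands. It is not FFmpeg code and does not use the
internal slice-thread API: plain POSIX threads stand in for it, and the names
`copy_plane_threaded`, `CopyJob` and `MAX_COPY_THREADS` are hypothetical.
Threads are created and joined on every call, which corresponds to the
worst-case per-frame setup and teardown mentioned above, so the sketch needs no
shared state and can be called from several threads at once.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_COPY_THREADS 16 /* hypothetical cap for this sketch */

/* One contiguous band of rows, handled by a single worker. */
typedef struct CopyJob {
    uint8_t       *dst;
    const uint8_t *src;
    ptrdiff_t      dst_linesize;
    ptrdiff_t      src_linesize;
    size_t         bytewidth;
    int            rows;
} CopyJob;

/* Same row-by-row memcpy loop as in the quoted code, limited to one band. */
static void *copy_rows(void *arg)
{
    const CopyJob *job = arg;
    uint8_t       *dst = job->dst;
    const uint8_t *src = job->src;

    for (int y = 0; y < job->rows; y++) {
        memcpy(dst, src, job->bytewidth);
        dst += job->dst_linesize;
        src += job->src_linesize;
    }
    return NULL;
}

/* Hypothetical helper: copy one plane using up to nb_threads bands.
 * The last band runs on the calling thread; if a worker cannot be
 * created, its band is copied on the calling thread as well. */
static void copy_plane_threaded(uint8_t *dst, ptrdiff_t dst_linesize,
                                const uint8_t *src, ptrdiff_t src_linesize,
                                size_t bytewidth, int height, int nb_threads)
{
    pthread_t threads[MAX_COPY_THREADS];
    CopyJob   jobs[MAX_COPY_THREADS];
    int       started = 0;

    if (height <= 0)
        return;
    if (nb_threads < 1)
        nb_threads = 1;
    if (nb_threads > MAX_COPY_THREADS)
        nb_threads = MAX_COPY_THREADS;
    if (nb_threads > height)
        nb_threads = height;

    for (int i = 0; i < nb_threads; i++) {
        /* Band i covers rows [start, end). */
        int start = (int)((int64_t)height * i / nb_threads);
        int end   = (int)((int64_t)height * (i + 1) / nb_threads);

        jobs[i].dst          = dst + start * dst_linesize;
        jobs[i].src          = src + start * src_linesize;
        jobs[i].dst_linesize = dst_linesize;
        jobs[i].src_linesize = src_linesize;
        jobs[i].bytewidth    = bytewidth;
        jobs[i].rows         = end - start;

        if (i == nb_threads - 1 ||
            pthread_create(&threads[started], NULL, copy_rows, &jobs[i]) != 0)
            copy_rows(&jobs[i]);
        else
            started++;
    }

    /* Wait for all workers before returning to the caller. */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
}
```

Whether this helps in practice depends on how expensive thread creation is
relative to one plane copy; a persistent pool would avoid that cost but brings
back the thread-safety question.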