[FFmpeg-devel] [PATCH v3] libavcodec/vp8dec: fix the multi-thread HWAccel decode error
Fix the issue: https://github.com/intel/media-driver/issues/317

The root cause is that update_dimensions() is called multiple times when the decoder uses more than one thread, and each decode thread's call into get_pixel_format() triggers hwaccel_uninit/hwaccel_init more than once, while only one hwaccel should be shared by all decode threads. In the current code there are three situations in update_dimensions():
1. First call. Whether single-threaded or multi-threaded, get_pixel_format() should be called after the dimensions are set;
2. Dimension change at runtime. The dimensions need to be updated when macroblocks_base is already allocated, and get_pixel_format() should be called to recreate the frames according to the updated dimensions;
3. First call from the other threads. After decoder init, the other threads call update_dimensions() once to allocate macroblocks_base and set the dimensions, but get_pixel_format() should not be called because the low-level frames and context are already created.
With this fix, get_pixel_format() is only called when needed.
Signed-off-by: Wang, Shaofei
Reviewed-by: Jun, Zhao
Reviewed-by: Haihao Xiang
---
Updated typo in the commit message

 libavcodec/vp8.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/libavcodec/vp8.c b/libavcodec/vp8.c
index ba79e5f..0a7f38b 100644
--- a/libavcodec/vp8.c
+++ b/libavcodec/vp8.c
@@ -187,7 +187,7 @@ static av_always_inline
 int update_dimensions(VP8Context *s, int width, int height, int is_vp7)
 {
     AVCodecContext *avctx = s->avctx;
-    int i, ret;
+    int i, ret, dim_reset = 0;
 
     if (width  != s->avctx->width || ((width+15)/16 != s->mb_width ||
         (height+15)/16 != s->mb_height) && s->macroblocks_base ||
         height != s->avctx->height) {
@@ -196,9 +196,12 @@ int update_dimensions(VP8Context *s, int width, int height, int is_vp7)
         ret = ff_set_dimensions(s->avctx, width, height);
         if (ret < 0)
             return ret;
+
+        dim_reset = (s->macroblocks_base != NULL);
     }
 
-    if (!s->actually_webp && !is_vp7) {
+    if ((s->pix_fmt == AV_PIX_FMT_NONE || dim_reset) &&
+        !s->actually_webp && !is_vp7) {
         s->pix_fmt = get_pixel_format(s);
         if (s->pix_fmt < 0)
             return AVERROR(EINVAL);
-- 
1.8.3.1
[FFmpeg-devel] [PATCH v2] libavcodec/vp8dec: fix the multi-thread HWAccel decode error
Fix the issue: https://github.com/intel/media-driver/issues/317

The root cause is that update_dimensions() is called multiple times when the decoder uses more than one thread, and each decode thread's call into get_pixel_format() triggers hwaccel_uninit/hwaccel_init more than once, while only one hwaccel should be shared by all decode threads. In the current code there are three situations in update_dimensions():
1. First call. Whether single-threaded or multi-threaded, get_pixel_format() should be called after the dimensions are set;
2. Dimension change at runtime. The dimensions need to be updated when macroblocks_base is already allocated, and get_pixel_format() should be called to recreate the frames according to the updated dimensions;
3. First call from the other threads. After decoder init, the other threads call update_dimensions() once to allocate macroblocks_base and set the dimensions, but get_pixel_format() should not be called because the low-level frames and context are already created.
With this fix, get_pixel_format() is only called when needed.

Signed-off-by: Wang, Shaofei
Reviewed-by: Jun, Zhao
Reviewed-by: Haihao Xiang
---
Previous code reviews:

2019-03-06 9:25 GMT+01:00, Wang, Shaofei:
>> -----Original Message-----
>> From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf
>> Of Carl Eugen Hoyos
>> Sent: Wednesday, March 6, 2019 3:49 PM
>> To: FFmpeg development discussions and patches
>> Subject: Re: [FFmpeg-devel] [PATCH] libavcodec/vp8dec: fix the
>> multi-thread HWAccel decode error
>>
>> 2018-08-09 9:09 GMT+02:00, Jun Zhao:
>> > the root cause is update_dimentions call get_pixel_format will
>> > trigger the hwaccel_uninit/hwaccel_init, in current context, there
>> > are 3 situations in the update_dimentions():
>> > 1. First time calling. No matter single thread or multithread,
>> >    get_pixel_format() should be called after dimentions were set;
>> > 2. Dimention changed at the runtime. Dimention need to be
>> >    updated when macroblocks_base is already allocated,
>> >    get_pixel_format() should be called to recreate new frames
>> >    according to updated dimention;
>> > 3. Multithread first time calling. After decoder init, the
>> >    other threads will call update_dimentions() at first time
>> >    to allocate macroblocks_base and set dimentions.
>> >    But get_pixel_format() is shouldn't be called due to low
>> >    level frames and context are already created.
>> > In this fix, we only call update_dimentions as need.
>> >
>> > Signed-off-by: Wang, Shaofei
>> > Reviewed-by: Jun, Zhao
>> > ---
>> >  libavcodec/vp8.c | 7 +++++--
>> >  1 files changed, 5 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/libavcodec/vp8.c b/libavcodec/vp8.c
>> > index 3adfeac..18d1ada 100644
>> > --- a/libavcodec/vp8.c
>> > +++ b/libavcodec/vp8.c
>> > @@ -187,7 +187,7 @@ static av_always_inline
>> >  int update_dimensions(VP8Context *s, int width, int height, int is_vp7)
>> >  {
>> >      AVCodecContext *avctx = s->avctx;
>> > -    int i, ret;
>> > +    int i, ret, dim_reset = 0;
>> >
>> >      if (width  != s->avctx->width || ((width+15)/16 != s->mb_width ||
>> >          (height+15)/16 != s->mb_height) && s->macroblocks_base ||
>> >          height != s->avctx->height) {
>> > @@ -196,9 +196,12 @@ int update_dimensions(VP8Context *s, int width, int height, int is_vp7)
>> >          ret = ff_set_dimensions(s->avctx, width, height);
>> >          if (ret < 0)
>> >              return ret;
>> > +
>> > +        dim_reset = (s->macroblocks_base != NULL);
>> >      }
>> >
>> > -    if (!s->actually_webp && !is_vp7) {
>> > +    if ((s->pix_fmt == AV_PIX_FMT_NONE || dim_reset) &&
>> > +        !s->actually_webp && !is_vp7) {
>>
>> Why is the new variable dim_reset needed?
>> Wouldn't the patch be simpler if you used s->macroblocks_base here?
> Since dim_reset was set in the "if" segment, it equals (width !=
> s->avctx->width || ((width+15)/16 != s->mb_width ||
> (height+15)/16 != s->mb_height) || height != s->avctx->height) &&
> s->macroblocks_base

Thank you!
Carl Eugen

 libavcodec/vp8.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/libavcodec/vp8.c b/libavcodec/vp8.c
index ba79e5f..0a7f38b 100644
--- a/libavcodec/vp8.c
+++ b/libavcodec/vp8.c
@@ -187,7 +187,7 @@ static av_always_inline
 int update_dimensions(VP8Context *s, int width, int height, int is_vp7)
 {
     AVCodecContext *avctx = s->avctx;
-    int i, ret;
+    int i, ret, dim_reset = 0;
 
     if (width  != s->avctx->width || ((width+15)/16 != s->mb_width ||
         (height+15)/16 != s->mb_height) && s->macroblocks_base ||
         height != s->avctx->height) {
@@ -196,9 +196,12 @@ int update_dimensions(VP8Context *s, int width, int height, int is_vp7)
         ret = ff_set_dimensions(s->avctx, width, height);
         if (ret < 0)
             return ret;
+
+        dim_reset = (s->macroblocks_base != NULL);
     }
 
-    if (!s->actual
[FFmpeg-devel] [PATCH] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
It enabled MULTIPLE SIMPLE filter graph concurrency, which brings about 4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration. Below are some test cases and comparisons as reference.

(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with VAAPI:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
    -hwaccel_output_format vaapi \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
    -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6%           4%           15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
    -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei
---
The patch only takes effect on multiple SIMPLE filter graph pipelines.
Passed FATE and refined the possible data race; AFL tested, without
introducing extra crashes/hangs.

 fftools/ffmpeg.c | 172 +--
 fftools/ffmpeg.h | 13 +
 2 files changed, 169 insertions(+), 16 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..5f6e712 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -164,7 +164,13 @@ static struct termios oldtty;
 static int restore_tty;
 #endif
 
+/* enable abr threads when there were multiple simple filter graphs */
+static int abr_threads_enabled = 0;
+
 #if HAVE_THREADS
+pthread_mutex_t fg_config_mutex;
+pthread_mutex_t ost_init_mutex;
+
 static void free_input_threads(void);
 #endif
 
@@ -509,6 +515,17 @@ static void ffmpeg_cleanup(int ret)
             }
             av_fifo_freep(&fg->inputs[j]->ist->sub2video.sub_queue);
         }
+#if HAVE_THREADS
+        if (abr_threads_enabled) {
+            av_frame_free(&fg->inputs[j]->input_frm);
+            pthread_mutex_lock(&fg->inputs[j]->process_mutex);
+            fg->inputs[j]->waited_frm = NULL;
+            fg->inputs[j]->t_end = 1;
+            pthread_cond_signal(&fg->inputs[j]->process_cond);
+            pthread_mutex_unlock(&fg->inputs[j]->process_mutex);
+            pthread_join(fg->inputs[j]->abr_thread, NULL);
+        }
+#endif
         av_buffer_unref(&fg->inputs[j]->hw_frames_ctx);
         av_freep(&fg->inputs[j]->name);
         av_freep(&fg->inputs[j]);
@@ -1419,12 +1436,13 @@ static void finish_output_stream(OutputStream *ost)
  *
  * @return 0 for success, <0 for severe errors
  */
-static int reap_filters(int flush)
+static int reap_filters(int flush, InputFilter *ifilter)
 {
     AVFrame *filtered_frame = NULL;
     int i;
 
-    /* Reap all buffers present in the buffer sinks */
+    /* Reap all buffers present in the buffer sinks or just reap specified
+     * buffer which related with the filter graph who got ifilter as input */
     for (i = 0; i < nb_output_streams; i++) {
         OutputStream *ost = output_streams[i];
         OutputFile    *of = output_files[ost->file_index];
@@ -1432,13 +1450,25 @@ static int reap_filters(int flush)
         AVCodecContext *enc = ost->enc_ctx;
         int ret = 0;
 
+        if (ifilter && abr_threads_enabled)
+            if (ost != ifilter->graph->outputs[0]->ost)
+                continue;
+
         if (!ost->filter || !ost->filter->graph->graph)
             continue;
         filter = ost->filter->filter;
 
         if (!ost->initialized) {
             char error[1024] = "";
+#if HAVE_THREADS
+            if (abr_threads_enabled)
+                pthread_mutex_lock(&ost_init_mutex);
+#endif
             ret = init_output_stream(ost, error, sizeof(error));
+#if HAVE_THREADS
+            if (abr_threads_enabled)
+                pthread_mutex_unlock(&ost_init_mutex);
+#endif
             if (ret < 0) {
                 av_log(NULL, AV_LOG_ER
[FFmpeg-devel] [PATCH v7] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
It enabled MULTIPLE SIMPLE filter graph concurrency, which brings about 4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration. Below are some test cases and comparisons as reference.

(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with VAAPI:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
    -hwaccel_output_format vaapi \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
    -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6%           4%           15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
    -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei
Reviewed-by: Michael Niedermayer
Reviewed-by: Mark Thompson
---
The patch only takes effect on multiple SIMPLE filter graph pipelines.
Passed FATE and refined the possible data race; AFL tested, without
introducing extra crashes/hangs:

american fuzzy lop 2.52b (ffmpeg_g)
  process timing:
    run time        : 0 days, 9 hrs, 48 min, 48 sec
    last new path   : 0 days, 0 hrs, 0 min, 0 sec
    last uniq crash : none seen yet
    last uniq hang  : 0 days, 9 hrs, 19 min, 23 sec
  overall results:
    cycles done : 0 | total paths : 1866 | uniq crashes : 0 | uniq hangs : 35
  cycle progress:
    now processing : 0 (0.00%) | paths timed out : 0 (0.00%)
  map coverage:
    map density : 24.91% / 36.60% | count coverage : 2.40 bits/tuple
  stage progress:
    now trying : calibration | stage execs : 0/8 (0.00%)
    total execs : 123k | exec speed : 3.50/sec (...)
  findings in depth:
    favored paths : 1 (0.05%) | new edges on : 1100 (58.95%)
    total crashes : 0 (0 unique) | total tmouts : 52 (47 unique)
  fuzzing strategy yields:
    bit flips : 0/0, 0/0, 0/0 | byte flips : 0/0, 0/0, 0/0
    arithmetics : 0/0, 0/0, 0/0 | known ints : 0/0, 0/0, 0/0
    dictionary : 0/0, 0/0, 0/0 | havoc : 0/0, 0/0
    trim : 0.00%/1828, n/a
  path geometry:
    levels : 2 | pending : 1866 | pend fav : 1 | own finds : 1862
    imported : n/a | stability : 76.69%
  [cpu000: 59%]

 fftools/ffmpeg.c | 172 +--
 fftools/ffmpeg.h | 13 +
 2 files changed, 169 insertions(+), 16 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..59a953a 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -164,7 +164,13 @@ static struct termios oldtty;
 static int restore_tty;
 #endif
 
+/* enable abr threads when there were multiple simple filter graphs */
+static int abr_threads_enabled = 0;
+
 #if HAVE_THREADS
+pthread_mutex_t fg_config_mutex;
+pthread_mutex_t ost_init_mutex;
+
 static void free_input_threads(void);
 #endif
 
@@ -509,6 +515,17 @@ static void ffmpeg_cleanup(int ret)
             }
             av_fifo_freep(&fg->inputs[j]->ist->sub2video.sub_queue);
         }
+#if HAVE_THREADS
+        if (abr_threads_enabled) {
+            av_frame_free(&fg->inputs[
[FFmpeg-devel] [PATCH v6] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
It enabled multiple simple filter graph concurrency, which brings about 4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration. Below are some test cases and comparisons as reference.

(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with VAAPI:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
    -hwaccel_output_format vaapi \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
    -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6%           4%           15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
    -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei
---
Passed FATE and refined the possible data race.
The patch only takes effect on multiple SIMPLE filter graph pipelines.

 fftools/ffmpeg.c | 172 +--
 fftools/ffmpeg.h | 13 +
 2 files changed, 169 insertions(+), 16 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..c0c9ca8 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -164,7 +164,13 @@ static struct termios oldtty;
 static int restore_tty;
 #endif
 
+/* enable abr threads when there were multiple simple filter graphs */
+static int abr_threads_enabled = 0;
+
 #if HAVE_THREADS
+pthread_mutex_t fg_config_mutex;
+pthread_mutex_t ost_init_mutex;
+
 static void free_input_threads(void);
 #endif
 
@@ -509,6 +515,17 @@ static void ffmpeg_cleanup(int ret)
             }
             av_fifo_freep(&fg->inputs[j]->ist->sub2video.sub_queue);
         }
+#if HAVE_THREADS
+        if (abr_threads_enabled) {
+            av_frame_free(&fg->inputs[j]->input_frm);
+            pthread_mutex_lock(&fg->inputs[j]->process_mutex);
+            fg->inputs[j]->waited_frm = NULL;
+            fg->inputs[j]->t_end = 1;
+            pthread_cond_signal(&fg->inputs[j]->process_cond);
+            pthread_mutex_unlock(&fg->inputs[j]->process_mutex);
+            pthread_join(fg->inputs[j]->abr_thread, NULL);
+        }
+#endif
         av_buffer_unref(&fg->inputs[j]->hw_frames_ctx);
         av_freep(&fg->inputs[j]->name);
         av_freep(&fg->inputs[j]);
@@ -1419,12 +1436,13 @@ static void finish_output_stream(OutputStream *ost)
  *
  * @return 0 for success, <0 for severe errors
  */
-static int reap_filters(int flush)
+static int reap_filters(int flush, InputFilter *ifilter)
 {
     AVFrame *filtered_frame = NULL;
     int i;
 
-    /* Reap all buffers present in the buffer sinks */
+    /* Reap all buffers present in the buffer sinks or just reap specified
+     * buffer which related with the filter graph who got ifilter as input */
     for (i = 0; i < nb_output_streams; i++) {
         OutputStream *ost = output_streams[i];
         OutputFile    *of = output_files[ost->file_index];
@@ -1432,13 +1450,25 @@ static int reap_filters(int flush)
         AVCodecContext *enc = ost->enc_ctx;
         int ret = 0;
 
+        if (ifilter && abr_threads_enabled)
+            if (ost != ifilter->graph->outputs[0])
+                continue;
+
         if (!ost->filter || !ost->filter->graph->graph)
             continue;
         filter = ost->filter->filter;
 
         if (!ost->initialized) {
             char error[1024] = "";
+#if HAVE_THREADS
+            if (abr_threads_enabled)
+                pthread_mutex_lock(&ost_init_mutex);
+#endif
             ret = init_output_stream(ost, error, sizeof(error));
+#if HAVE_THREADS
+            if (abr_threads_enabled)
+                pthread_mutex_unlock(&ost_init_mutex);
+#endif
             if (ret < 0) {
                 av_log(NULL, AV_LOG_ERROR, "Error initializing output stream %d:%d -- %s\n",
[FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
It enabled multiple filter graph concurrency, which brings about 4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration. Below are some test cases and comparisons as reference.

(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with VAAPI:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
    -hwaccel_output_format vaapi \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
    -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6%           4%           15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
    -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei
Reviewed-by: Zhao, Jun
---
 fftools/ffmpeg.c        | 121 ++++++++++++++++++++++++++++++++++++++++----
 fftools/ffmpeg.h        | 14 ++
 fftools/ffmpeg_filter.c | 1 +
 3 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..676c783 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -509,6 +509,15 @@ static void ffmpeg_cleanup(int ret)
             }
             av_fifo_freep(&fg->inputs[j]->ist->sub2video.sub_queue);
         }
+#if HAVE_THREADS
+        fg->inputs[j]->waited_frm = NULL;
+        av_frame_free(&fg->inputs[j]->input_frm);
+        pthread_mutex_lock(&fg->inputs[j]->process_mutex);
+        fg->inputs[j]->t_end = 1;
+        pthread_cond_signal(&fg->inputs[j]->process_cond);
+        pthread_mutex_unlock(&fg->inputs[j]->process_mutex);
+        pthread_join(fg->inputs[j]->abr_thread, NULL);
+#endif
         av_buffer_unref(&fg->inputs[j]->hw_frames_ctx);
         av_freep(&fg->inputs[j]->name);
         av_freep(&fg->inputs[j]);
@@ -1419,12 +1428,13 @@ static void finish_output_stream(OutputStream *ost)
  *
  * @return 0 for success, <0 for severe errors
  */
-static int reap_filters(int flush)
+static int reap_filters(int flush, InputFilter *ifilter)
 {
     AVFrame *filtered_frame = NULL;
     int i;
 
-    /* Reap all buffers present in the buffer sinks */
+    /* Reap all buffers present in the buffer sinks or just reap specified
+     * buffer which related with the filter graph who got ifilter as input */
     for (i = 0; i < nb_output_streams; i++) {
         OutputStream *ost = output_streams[i];
         OutputFile    *of = output_files[ost->file_index];
@@ -1436,6 +1446,11 @@ static int reap_filters(int flush)
             continue;
         filter = ost->filter->filter;
 
+        if (ifilter) {
+            if (ifilter != output_streams[i]->filter->graph->inputs[0])
+                continue;
+        }
+
         if (!ost->initialized) {
             char error[1024] = "";
             ret = init_output_stream(ost, error, sizeof(error));
@@ -2179,7 +2194,8 @@ static int ifilter_send_frame(InputFilter *ifilter, AVFrame *frame)
         }
     }
 
-    ret = reap_filters(1);
+    ret = HAVE_THREADS ? reap_filters(1, ifilter) : reap_filters(1, NULL);
+
     if (ret < 0 && ret != AVERROR_EOF) {
         av_log(NULL, AV_LOG_ERROR, "Error while filtering: %s\n", av_err2str(ret));
         return ret;
@@ -2252,12 +2268,100 @@ static int decode(AVCodecContext *avctx, AVFrame *frame, int *got_frame, AVPacke
     return 0;
 }
 
+#if HAVE_THREADS
+static void *filter_pipeline(void *arg)
+{
+    InputFilter *fl = arg;
+    AVFrame *frm;
+    int ret;
+    while (1) {
+        pthread_mutex_lock(&fl->process_mutex);
+        while (fl->waited_frm == NULL && !fl->t_end)
+            pthread_cond_wait(&fl->process_cond, &fl->process_mutex);
+        pthread_mutex_unlock(&fl->process_mutex);
+
+        if (fl->t_end) break;
+
+        frm = fl->waited_frm;
+
[FFmpeg-devel] [PATCH v4] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
It enabled multiple filter graph concurrency, which brings about 4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration. Below are some test cases and comparisons as reference.

(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with VAAPI:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
    -hwaccel_output_format vaapi \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
    -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6%           4%           15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
    -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei
Reviewed-by: Zhao, Jun
---
 fftools/ffmpeg.c        | 112 +++++++++++++++++++++++++++++++++++++++---
 fftools/ffmpeg.h        | 14 ++
 fftools/ffmpeg_filter.c | 4 ++
 3 files changed, 124 insertions(+), 6 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..67b1a2a 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -1419,13 +1419,18 @@ static void finish_output_stream(OutputStream *ost)
  *
  * @return 0 for success, <0 for severe errors
  */
-static int reap_filters(int flush)
+static int reap_filters(int flush, InputFilter *ifilter)
 {
     AVFrame *filtered_frame = NULL;
     int i;
 
-    /* Reap all buffers present in the buffer sinks */
+    /* Reap all buffers present in the buffer sinks or just reap specified
+     * input filter buffer */
     for (i = 0; i < nb_output_streams; i++) {
+        if (ifilter) {
+            if (ifilter != output_streams[i]->filter->graph->inputs[0])
+                continue;
+        }
         OutputStream *ost = output_streams[i];
         OutputFile    *of = output_files[ost->file_index];
         AVFilterContext *filter;
@@ -2179,7 +2184,8 @@ static int ifilter_send_frame(InputFilter *ifilter, AVFrame *frame)
         }
     }
 
-    ret = reap_filters(1);
+    ret = HAVE_THREADS ? reap_filters(1, ifilter) : reap_filters(1, NULL);
+
     if (ret < 0 && ret != AVERROR_EOF) {
         av_log(NULL, AV_LOG_ERROR, "Error while filtering: %s\n", av_err2str(ret));
         return ret;
@@ -2208,6 +2214,14 @@ static int ifilter_send_eof(InputFilter *ifilter, int64_t pts)
 
     ifilter->eof = 1;
 
+#if HAVE_THREADS
+    ifilter->waited_frm = NULL;
+    pthread_mutex_lock(&ifilter->process_mutex);
+    ifilter->t_end = 1;
+    pthread_cond_signal(&ifilter->process_cond);
+    pthread_mutex_unlock(&ifilter->process_mutex);
+    pthread_join(ifilter->f_thread, NULL);
+#endif
     if (ifilter->filter) {
         ret = av_buffersrc_close(ifilter->filter, pts, AV_BUFFERSRC_FLAG_PUSH);
         if (ret < 0)
@@ -2252,12 +2266,95 @@ static int decode(AVCodecContext *avctx, AVFrame *frame, int *got_frame, AVPacke
     return 0;
 }
 
+#if HAVE_THREADS
+static void *filter_pipeline(void *arg)
+{
+    InputFilter *fl = arg;
+    AVFrame *frm;
+    int ret;
+    while (1) {
+        pthread_mutex_lock(&fl->process_mutex);
+        while (fl->waited_frm == NULL && !fl->t_end)
+            pthread_cond_wait(&fl->process_cond, &fl->process_mutex);
+        pthread_mutex_unlock(&fl->process_mutex);
+
+        if (fl->t_end) break;
+
+        frm = fl->waited_frm;
+        ret = ifilter_send_frame(fl, frm);
+        if (ret < 0) {
+            av_log(NULL, AV_LOG_ERROR,
+                   "Failed to inject frame into filter network: %s\n", av_err2str(ret));
+        } else {
+            ret = reap_filters(0, fl);
+        }
+        fl->t_error = ret;
+
+        pthread_mutex_lock(&fl->finish_mutex);
+        fl->waited_frm = NULL;
+        pthread_cond_signal(&fl->finish_cond);
+        pthread_mutex_unlock(&fl->finish_mutex);
+
[FFmpeg-devel] [PATCH v3] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
With new option "-abr_pipeline" it enabled multiple filter graph concurrency, which brings about 4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration. Below are some test cases and comparisons as reference.

(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with VAAPI:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
    -hwaccel_output_format vaapi \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
    -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null \
    -abr_pipeline

test results:   2 encoders   5 encoders   10 encoders
    Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:   2 encoders   5 encoders   10 encoders
    Improved    6%           4%           15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null

test results:   2 scale   5 scale   10 scale
    Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
    -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null \
    -abr_pipeline

test results:   2 scale   5 scale   10 scale
    Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei
Reviewed-by: Zhao, Jun
---
 fftools/ffmpeg.c        | 228 ++++++++++++++++++++++++++++++++++++++++----
 fftools/ffmpeg.h        | 15 +++
 fftools/ffmpeg_filter.c | 4 +
 fftools/ffmpeg_opt.c    | 6 +-
 4 files changed, 237 insertions(+), 16 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..7dbff15 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -1523,6 +1523,109 @@ static int reap_filters(int flush)
     return 0;
 }
 
+static int pipeline_reap_filters(int flush, InputFilter *ifilter)
+{
+    AVFrame *filtered_frame = NULL;
+    int i;
+
+    for (i = 0; i < nb_output_streams; i++) {
+        if (ifilter == output_streams[i]->filter->graph->inputs[0]) break;
+    }
+    OutputStream *ost = output_streams[i];
+    OutputFile    *of = output_files[ost->file_index];
+    AVFilterContext *filter;
+    AVCodecContext *enc = ost->enc_ctx;
+    int ret = 0;
+
+    if (!ost->filter || !ost->filter->graph->graph)
+        return 0;
+    filter = ost->filter->filter;
+
+    if (!ost->initialized) {
+        char error[1024] = "";
+        ret = init_output_stream(ost, error, sizeof(error));
+        if (ret < 0) {
+            av_log(NULL, AV_LOG_ERROR, "Error initializing output stream %d:%d -- %s\n",
+                   ost->file_index, ost->index, error);
+            exit_program(1);
+        }
+    }
+
+    if (!ost->filtered_frame && !(ost->filtered_frame = av_frame_alloc()))
+        return AVERROR(ENOMEM);
+    filtered_frame = ost->filtered_frame;
+
+    while (1) {
+        double float_pts = AV_NOPTS_VALUE; // this is identical to filtered_frame.pts but with higher precision
+        ret = av_buffersink_get_frame_flags(filter, filtered_frame,
+                                            AV_BUFFERSINK_FLAG_NO_REQUEST);
+        if (ret < 0) {
+            if (ret != AVERROR(EAGAIN) && ret != AVERROR_EOF) {
+                av_log(NULL, AV_LOG_WARNING,
+                       "Error in av_buffersink_get_frame_flags(): %s\n", av_err2str(ret));
+            } else if (flush && ret == AVERROR_EOF) {
+                if (av_buffersink_get_type(filter) == AVMEDIA_TYPE_VIDEO)
+                    do_video_out(of, ost, NULL, AV_NOPTS_VALUE);
+            }
+            break;
+        }
+        if (ost->finished) {
+            av_frame_unref(filtered_frame);
+            continue;
+        }
+        if (filtered_frame->pts != AV_NOPTS_VALUE) {
+            int64_t start_time = (of->start_time == AV_NOPTS_VALUE) ? 0 : of->start_time;
+            AVRational filter_tb = av_buffersink_get_time_base(filter);
+            AVRational tb = enc->time_base;
+            int extra_bits = av_clip(29 - av_log2(tb.den), 0, 16);
+
+            tb.den <<= extra_bits;
+            float_pts =
+                av_rescale_q(filtered_frame->pts, filter_tb, tb) -
+                av_rescale_q(start_time, AV_TIME_BASE_Q, tb);
+            float_pts /= 1 << extra_bits;
+            // avoid exact midpoints to reduce the chance of roundi
[FFmpeg-devel] [PATCH] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.
With the new option "-abr_pipeline", multiple filter graphs run concurrently, which brings an obvious improvement in some 1:N scenarios with CPU and GPU acceleration. Below are some test cases and comparisons for reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For the Intel GPU acceleration case, 1 decode to N scaling, by VAAPI:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
    -hwaccel_output_format vaapi \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_vaapi=1280:720:format=nv12,hwdownload" \
    -pix_fmt nv12 -f null /dev/null \
    -vf "scale_vaapi=720:480:format=nv12,hwdownload" \
    -pix_fmt nv12 -f null /dev/null \
    -abr_pipeline
test results:
             2 scale   5 scale   10 scale
Improved     34%       184%      240%

For the Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null
test results:
             2 scale   5 scale   10 scale
Improved     12%       21%       21%

For CPU-only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
    -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null \
    -abr_pipeline
test results:
             2 scale   5 scale   10 scale
Improved     25%       107%      148%

For 1:N transcode by GPU acceleration with VAAPI:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
    -hwaccel_output_format vaapi \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
    -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null \
    -abr_pipeline
test results:
             2 encoders   5 encoders   10 encoders
Improved     6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null
test results:
             2 encoders   5 encoders   10 encoders
Improved     6%           4%           15%

Signed-off-by: Wang, Shaofei
Reviewed-by: Zhao, Jun
---
 fftools/ffmpeg.c        | 239 +---
 fftools/ffmpeg.h        |  12 +++
 fftools/ffmpeg_filter.c |   6 ++
 fftools/ffmpeg_opt.c    |   6 +-
 4 files changed, 249 insertions(+), 14 deletions(-)