[FFmpeg-devel] [PATCH 1/7] lavc/me_cmp: R-V V pix_abs

2024-02-06 Thread flow gg
From d4d6b3ea040f3f7997463b4452813bc75d1c9f9d Mon Sep 17 00:00:00 2001
From: sunyuechi
Date: Sat, 3 Feb 2024 10:58:13 +0800
Subject: [PATCH 1/7] lavc/me_cmp: R-V V pix_abs

C908:
pix_abs_0_0_c: 534.0
pix_abs_0_0_rvv_i32: 136.2
pix_abs_1_0_c: 287.7
pix_abs_1_0_rvv_i32: 125.2
sad_0_c: 534.0
sad_0_r

Re: [FFmpeg-devel] [PATCH 1/7] lavc/me_cmp: R-V V pix_abs

2024-02-06 Thread Rémi Denis-Courmont
Hi,

To sum a vector, you should only reduce once at the end of the function, cf. how it's done in existing scalar products. Reduction instructions are (intrinsically) slow.

-- Rémi Denis-Courmont http://www.remlab.net/
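The two accumulation strategies under discussion can be modelled in plain C. This is only a sketch of the data flow, not the RVV assembly from the patch: the names sad_reduce_per_row, sad_reduce_once and LANES are hypothetical, and scalar code like this says nothing about the timings on real hardware (later in this thread, the per-row variant is reported as the faster one on C908 for this function).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LANES 16 /* hypothetical lane count: one accumulator per pixel column */

/* Variant A: collapse each row to a scalar as soon as it is computed,
 * i.e. one reduction per row. */
static int sad_reduce_per_row(const uint8_t *a, const uint8_t *b,
                              ptrdiff_t stride, int w, int h)
{
    int sum = 0;

    for (int y = 0; y < h; y++) {
        int row = 0;
        for (int x = 0; x < w; x++)
            row += abs(a[x] - b[x]);
        sum += row; /* per-row reduction result */
        a += stride;
        b += stride;
    }
    return sum;
}

/* Variant B: keep one accumulator per lane across all rows and
 * reduce a single time at the end of the function. */
static int sad_reduce_once(const uint8_t *a, const uint8_t *b,
                           ptrdiff_t stride, int w, int h)
{
    uint32_t acc[LANES] = { 0 };
    int sum = 0;

    assert(w <= LANES);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            acc[x] += (uint32_t)abs(a[x] - b[x]);
        a += stride;
        b += stride;
    }
    for (int i = 0; i < LANES; i++) /* the single final reduction */
        sum += (int)acc[i];
    return sum;
}
```

Both functions return the same sum of absolute differences for any block with w <= LANES; they differ only in where the reduction happens, which is the point being debated.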

Re: [FFmpeg-devel] [PATCH 1/7] lavc/me_cmp: R-V V pix_abs

2024-02-06 Thread flow gg
I think this holds in most cases, but for this particular function, reducing only once is slower.

The currently submitted version takes roughly:
pix_abs_0_0_rvv_i32: 136.2

The version that reduces only once takes:
pix_abs_0_0_rvv_i32: 169.2

Here is the implementation o

Re: [FFmpeg-devel] [PATCH 1/7] lavc/me_cmp: R-V V pix_abs

2024-02-08 Thread Rémi Denis-Courmont
On Wednesday, 7 February 2024, 02:01:23 EET, flow gg wrote:
> I think in most cases it is like this, but specifically for this function,
> using Reduction only once would be slower.
>
> The currently submitted version roughly takes:
> pix_abs_0_0_rvv_i32: 136.2
>
> The version that uses Re

Re: [FFmpeg-devel] [PATCH 1/7] lavc/me_cmp: R-V V pix_abs

2024-02-08 Thread flow gg
From my understanding, to use larger group multipliers, one needs to utilize vlse64 (8x8) vlse128 (16x16).

However, due to the use in tests of

ptr = img2 + y * WIDTH + x;
d2 = call_ref(NULL, img1, ptr, WIDTH, h);
d1 = call_new(NULL, img1, ptr, WIDTH, h);

will get: pix_abs_1_0_rvv_i32 (fatal sig

Re: [FFmpeg-devel] [PATCH 1/7] lavc/me_cmp: R-V V pix_abs

2024-02-09 Thread Rémi Denis-Courmont
On 9 February 2024 00:39:38 GMT+02:00, flow gg wrote:
>From my understanding, to use larger group multipliers, one needs to
>utilize vlse64 (8x8) vlse128 (16x16).
>
>However, due to the use in tests of
>
>ptr = img2 + y * WIDTH + x;
>d2 = call_ref(NULL, img1, ptr, WIDTH, h);
>d1 = call_new(NUL

Re: [FFmpeg-devel] [PATCH 1/7] lavc/me_cmp: R-V V pix_abs

2024-02-09 Thread flow gg
The issue here is that any load wider than e8 fails the test (bus error), so vlse64 and similar methods cannot be used...

On Fri, 9 Feb 2024 at 18:32, Rémi Denis-Courmont wrote:
>
> On 9 February 2024 00:39:38 GMT+02:00, flow gg wrote:
> >From my understanding, to use larger group multipliers, one n
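The alignment question behind the reported bus error can be illustrated with a small C model. It is a sketch under stated assumptions: WIDTH is given a hypothetical value of 64, and is_aligned, pixel_ptr and load_u64 are illustrative helpers, not FFmpeg or checkasm functions. Whether a misaligned wide element access actually traps depends on the hardware and execution environment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WIDTH 64 /* hypothetical row stride; the real test defines its own */

/* True if p is a multiple of n bytes (n is the element size being loaded). */
static int is_aligned(const void *p, size_t n)
{
    return (uintptr_t)p % n == 0;
}

/* Address of pixel (x, y), formed the same way as `ptr` in the quoted test. */
static const uint8_t *pixel_ptr(const uint8_t *img, int x, int y)
{
    return img + y * WIDTH + x;
}

/* Alignment-agnostic 64-bit load: copying byte-wise avoids any
 * requirement that p be 8-byte aligned. */
static uint64_t load_u64(const uint8_t *p)
{
    uint64_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}
```

With a 16-byte aligned image base, pixel_ptr(img, 8, 2) is still 8-byte aligned, while pixel_ptr(img, 3, 1) is not, so an arbitrary x offset can defeat the alignment that a 64-bit element load may require even when the buffer itself is aligned.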

Re: [FFmpeg-devel] [PATCH 1/7] lavc/me_cmp: R-V V pix_abs

2024-02-10 Thread Rémi Denis-Courmont
On Friday, 9 February 2024, 17:34:40 EET, flow gg wrote:
> The issue here is that any load greater than e8 will fail the test(Bus
> error), so it cannot use vlse64 or similar methods...

AFAICT, data is aligned on 16 bytes here, so using larger element sizes should not be a problem. That

Re: [FFmpeg-devel] [PATCH 1/7] lavc/me_cmp: R-V V pix_abs

2024-02-10 Thread Rémi Denis-Courmont
On Saturday, 10 February 2024, 11:14:11 EET, Rémi Denis-Courmont wrote:
> But your patchset seems to leave those out anyway.

Never mind that bit, I missed other mails.

-- レミ・デニ-クールモン http://www.remlab.net/