Hi Chen,

Thanks for the feedback. I will make the suggested changes wherever possible.
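To make sure I understand the PMOVZXBW+PMADDWD suggestion before reworking
the patch, here is a rough, untested sketch of the 8bpp column macro as I
read it (the macro name NORM_FACT_COL_W is just a placeholder of mine; it
assumes the same SSIMRD_SHIFT constant and m3 accumulator as the current
NORM_FACT_COL):

;; sketch only, untested; assumes m0 was loaded with
;; vpmovzxbw m0, [r0]   (16 bytes -> 16 words)
%macro NORM_FACT_COL_W 0
    vpsrlw         m1, m0, SSIMRD_SHIFT ; x >> shift on all 16 words at once
    vpmaddwd       m1, m1, m1           ; pairwise x0*x0 + x1*x1 -> 8 dwords
    vpaddd         m3, m1               ; 32-bit accumulate; sum fits in 28 bits at 8bpp
%endmacro

If the 28-bit bound holds, the dword sums in m3 only need to be widened to
64 bits once after the row loop, and the per-iteration VPSRLDQ/VPMULDQ pair
goes away entirely.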
On Thu, Mar 7, 2019 at 4:37 PM Niranjankumar Balasubramanian <[email protected]> wrote:

> Hi Chen,
> Thanks for your suggestions. Your feedback is noted.
>
> On Thu, Mar 7, 2019 at 3:41 PM chen <[email protected]> wrote:
>
>> Just to say it works.
>>
>> First of all, the expected algorithm is the square of (x >> shift).
>> The input is 8 bits (I assume we are talking about 8bpp; 16bpp is
>> similar), and the multiply of two 8-bit values gives a 16-bit result.
>> The function works at the CU level, so blockSize is at most 64, i.e.
>> 6 bits per dimension.
>> So the maximum dynamic range of the accumulated sum is 16 + 6 + 6 = 28
>> bits.
>>
>> Given that, the uint64_t output is unnecessary in 8bpp mode.
>>
>> Moreover, PMOVZXBD+VPMULDQ can be replaced by PMOVZXBW+PMADDWD (keep in
>> mind that PMADDUBSW only works when one of its inputs is unsigned);
>> that change could improve processing throughput by 3~4x.
>> I also don't see why VPMULLD was not used; it would almost double the
>> performance.
>>
>> Further, the VPSRLDQ below is an extra instruction that is only needed
>> because VPMULDQ was chosen:
>>
>> +    vpmuldq        m2, m1, m1
>> +    vpsrldq        m1, m1, 4
>> +    vpmuldq        m1, m1, m1
>>
>> Regards,
>> Min
>>
>> At 2019-03-07 17:36:19, "Dinesh Kumar Reddy" <[email protected]> wrote:
>>
>>> +static void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)
>>> +{
>>> +    *z_k = 0;
>>> +    for (uint32_t block_yy = 0; block_yy < blockSize; block_yy += 1)
>>> +    {
>>> +        for (uint32_t block_xx = 0; block_xx < blockSize; block_xx += 1)
>>> +        {
>>> +            uint32_t temp = src[block_yy * blockSize + block_xx] >> shift;
>>> +            *z_k += temp * temp;
>>> +        }
>>> +    }
>>> +}
>>> +
>>> diff -r d12a4caf7963 -r 19f27e0c8a6f source/common/x86/pixel-a.asm
>>> --- a/source/common/x86/pixel-a.asm     Wed Feb 27 12:35:02 2019 +0530
>>> +++ b/source/common/x86/pixel-a.asm     Mon Mar 04 15:36:38 2019 +0530
>>> @@ -388,6 +388,16 @@
>>>      vpaddq         m7, m6
>>>  %endmacro
>>>
>>> +%macro NORM_FACT_COL 1
>>> +    vpsrld         m1, m0, SSIMRD_SHIFT
>>> +    vpmuldq        m2, m1, m1
>>> +    vpsrldq        m1, m1, 4
>>> +    vpmuldq        m1, m1, m1
>>> +
>>> +    vpaddq         m1, m2
>>> +    vpaddq         m3, m1
>>> +%endmacro
>>> +
>>>  ; FIXME avoid the spilling of regs to hold 3*stride.
>>>  ; for small blocks on x86_32, modify pixel pointer instead.
>>>
>>> @@ -16303,3 +16313,266 @@
>>>      movq           [r4], xm4
>>>      movq           [r6], xm7
>>>      RET
>>> +
>>> +
>>> +;static void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)
>>> +;{
>>> +;    *z_k = 0;
>>> +;    for (uint32_t block_yy = 0; block_yy < blockSize; block_yy += 1)
>>> +;    {
>>> +;        for (uint32_t block_xx = 0; block_xx < blockSize; block_xx += 1)
>>> +;        {
>>> +;            uint32_t temp = src[block_yy * blockSize + block_xx] >> shift;
>>> +;            *z_k += temp * temp;
>>> +;        }
>>> +;    }
>>> +;}
>>> +;--------------------------------------------------------------------------------------
>>> +; void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)
>>> +;--------------------------------------------------------------------------------------
>>> +INIT_YMM avx2
>>> +cglobal normFact8, 4, 5, 6
>>> +    mov            r4d, 8
>>> +    vpxor          m3, m3               ;z_k
>>> +    vpxor          m5, m5
>>> +.row:
>>> +%if HIGH_BIT_DEPTH
>>> +    vpmovzxwd      m0, [r0]             ;src
>>> +%elif BIT_DEPTH == 8
>>> +    vpmovzxbd      m0, [r0]
>>> +%else
>>> +    %error Unsupported BIT_DEPTH!
>>> +%endif
>
> --
> Regards,
> Akil

--
Regards,
Akil R
_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel
