Hi,

thanks for your review.

> +#ifdef __MACH__
> +# define MACH
> +#else
> +# define MACH #
> This is not good idea to bypass .const_data

MACH uses ".const_data" directive, which is invalid for ELF. For ELF the
directive is ".rodata":

> ELF .section .rodata
> MACH .const_data

> + ushll v0.8h, v0.8b, #0
> ...
> + mul v16.8h, v0.8h, v24.8h
> Why not MULL?

That would not work for the rest of the computation. Part of the data in
v0 gets used in the next computation, and then I would have to split mla
into a mull + add.

> + orr v0.16b, v1.16b, v1.16b
> This is equal to MOV, I guess compiler will replace to right instruction on
> ARM64

I replaced orr with mov instructions.

> + // sum row[0-7]
> + dup v18.2d, v16.d[1]
> + dup v19.2d, v17.d[1]
> + add v16.4h, v16.4h, v18.4h
> + add v17.4h, v17.4h, v19.4h
> + trn1 v16.2d, v16.2d, v17.2d
> How about ADDP?

I replaced the above 5 instructions with the following 3 and the
performance improved.

    trn1 v20.2d, v16.2d, v17.2d
    trn2 v21.2d, v16.2d, v17.2d
    add v16.8h, v20.8h, v21.8h

Please see attached the amended patch.

Thanks,
Sebastian
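
Note on the last change (dup/dup/add/add/trn1 replaced by trn1/trn2/add): the following is only a rough scalar C sketch of why the two sequences should give the same result. It is not part of the patch. The type `vec8h` and the functions `sum_halves_old`, `sum_halves_new` and `check_equivalence` are hypothetical names made up for this illustration, and each 128-bit register is modelled as eight 16-bit lanes with wrap-around addition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a 128-bit NEON register seen as 8 x 16-bit lanes. */
typedef struct { uint16_t h[8]; } vec8h;

/* Original 5-instruction sequence: dup, dup, add.4h, add.4h, trn1.2d */
static vec8h sum_halves_old(vec8h v16, vec8h v17)
{
    vec8h v18, v19, out;
    /* dup vN.2d, vM.d[1]: both 64-bit halves receive the upper half of vM */
    for (int i = 0; i < 8; i++) {
        v18.h[i] = v16.h[4 + (i & 3)];
        v19.h[i] = v17.h[4 + (i & 3)];
    }
    /* add .4h: only the low four lanes are added; a 64-bit write clears the upper half */
    for (int i = 0; i < 4; i++) {
        v16.h[i] = (uint16_t)(v16.h[i] + v18.h[i]);
        v17.h[i] = (uint16_t)(v17.h[i] + v19.h[i]);
        v16.h[4 + i] = 0;
        v17.h[4 + i] = 0;
    }
    /* trn1 .2d: low half taken from v16.d[0], high half from v17.d[0] */
    for (int i = 0; i < 4; i++) {
        out.h[i] = v16.h[i];
        out.h[4 + i] = v17.h[i];
    }
    return out;
}

/* Replacement 3-instruction sequence: trn1.2d, trn2.2d, add.8h */
static vec8h sum_halves_new(vec8h v16, vec8h v17)
{
    vec8h v20, v21, out;
    for (int i = 0; i < 4; i++) {
        v20.h[i] = v16.h[i];          /* trn1: { v16.d[0], v17.d[0] } */
        v20.h[4 + i] = v17.h[i];
        v21.h[i] = v16.h[4 + i];      /* trn2: { v16.d[1], v17.d[1] } */
        v21.h[4 + i] = v17.h[4 + i];
    }
    /* add .8h: all eight lanes are added in one instruction */
    for (int i = 0; i < 8; i++)
        out.h[i] = (uint16_t)(v20.h[i] + v21.h[i]);
    return out;
}

/* Compare both models on one sample input, including a lane that wraps around. */
static void check_equivalence(void)
{
    vec8h a = {{1, 2, 3, 4, 5, 6, 7, 8}};
    vec8h b = {{10, 20, 30, 40, 50, 60, 70, 65535}};
    vec8h r_old = sum_halves_old(a, b);
    vec8h r_new = sum_halves_new(a, b);
    assert(memcmp(&r_old, &r_new, sizeof r_old) == 0);
}
```

In this model both variants leave the low 64 bits holding the lane-wise sum of the two halves of v16 and the high 64 bits holding the same for v17. The new sequence gets there with one full-width add instead of two half-width ones, and without the two dup instructions.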
0001-arm64-port-luma_vpp.patch
_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel