https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
--- Comment #24 from rguenther at suse dot de <rguenther at suse dot de> ---
On September 27, 2020 4:56:43 AM GMT+02:00, crazylht at gmail dot com <gcc-bugzi...@gcc.gnu.org> wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
>
>--- Comment #22 from Hongtao.liu <crazylht at gmail dot com> ---
>>One of my workmates found that if we disable vectorization for SPEC2017
>>525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
>>explicit function attribute
>>__attribute__((optimize("no-tree-vectorize"))), it can speed up by 4%.
>
>For CLX, if we disable slp vectorization in sub4x4_dct by
>__attribute__((optimize("no-tree-slp-vectorize"))), it can also speed up
>by 4%.
>
>> Thanks Richi! Should we take care of this case? or neglect this kind of
>> extension as "no instruction"? I intended to handle it in target specific
>> code, but it isn't recorded into cost vector while it seems too heavy to do
>> the bb_info slp_instances revisits in finish_cost.
>
>For the i386 backend, unsigned char --> unsigned short is not "no
>instruction", but in this case
>---
>_134 = MEM[(pixel *)pix1_295 + 2B];
>_135 = (short unsigned int) _134;
>---
>
>It could be combined and optimized to
>---
>movzbl 19(%rcx), %r8d
>---
>
>So, if an "unsigned char" variable is loaded from memory, then the
>conversion would also be "no instruction". I'm not sure if the backend cost
>model could handle such a situation.

I think all attempts to address this from the side of the scalar cost are
going to be difficult and fragile.
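
For readers following along, here is a minimal sketch of the pattern under discussion. It is not the actual x264 source: diff_4x4 and the NO_SLP macro are hypothetical names chosen for illustration, and the loop only imitates the pixel-difference step of sub4x4_dct. The attribute string is the one quoted in comment #22. Each pixel is loaded from memory and widened right away, which is the load plus zero-extension pair that x86 can fold into a single movzbl.

```c
#include <stdint.h>

typedef uint8_t pixel;

/* GCC-specific: turn off SLP vectorization for this one function.
   Other compilers build it with their default settings (assumption:
   they do not understand the optimize attribute). */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SLP __attribute__((optimize("no-tree-slp-vectorize")))
#else
#define NO_SLP
#endif

/* Hypothetical stand-in for the pixel-difference step of sub4x4_dct. */
NO_SLP void diff_4x4(int16_t d[16], const pixel *pix1, int stride1,
                     const pixel *pix2, int stride2)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            /* Load and zero-extend; on x86 this pair can become one movzbl. */
            unsigned short a = pix1[x];
            unsigned short b = pix2[x];
            d[4 * y + x] = (int16_t)(a - b);
        }
        pix1 += stride1;
        pix2 += stride2;
    }
}
```

Comparing the assembly from gcc -O3 -S with and without the attribute should show whether the scalar version keeps the folded movzbl loads. Whether that reproduces the reported 4% difference depends on the target and the GCC version.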