[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

crazylht at gmail dot com via Gcc-bugs Sat, 26 Sep 2020 19:57:31 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789


--- Comment #22 from Hongtao.liu <crazylht at gmail dot com> ---
> One of my workmates found that if we disable vectorization for SPEC2017
> 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
> explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
> it can speed up by 4%.

On CLX, disabling only SLP vectorization in sub4x4_dct with
__attribute__((optimize("no-tree-slp-vectorize"))) also gives a 4% speedup.
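
As a rough illustration of how that attribute is applied to a single function,
here is a minimal sketch. The function pixel_diff_4x4 and the NO_SLP macro are
hypothetical stand-ins, not the x264 source; the attribute is only applied when
building with GCC, since other compilers may not accept it.

```c
#include <stdint.h>

/* Apply the GCC-specific attribute only when building with GCC itself. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SLP __attribute__((optimize("no-tree-slp-vectorize")))
#else
#define NO_SLP
#endif

/* Hypothetical stand-in for a 4x4 pixel-difference step: each 8-bit pixel
   is widened and subtracted, producing a 16-bit result. */
NO_SLP
void pixel_diff_4x4(int16_t diff[16], const uint8_t *pix1, int stride1,
                    const uint8_t *pix2, int stride2)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            diff[y * 4 + x] = (int16_t)(pix1[x] - pix2[x]);
        pix1 += stride1;   /* advance to the next row of each block */
        pix2 += stride2;
    }
}
```

Only this one function is compiled without SLP vectorization; the rest of the
translation unit keeps the command-line optimization flags.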

> Thanks Richi! Should we take care of this case? or neglect this kind of
> extension as "no instruction"? I was intent to handle it in target specific
> code, but it isn't recorded into cost vector while it seems too heavy to do
> the bb_info slp_instances revisits in finish_cost.

For the i386 backend, unsigned char --> unsigned short is not "no instruction",
but in this case
---
_134 = MEM[(pixel *)pix1_295 + 2B];
_135 = (short unsigned int) _134;
---

It could be combined and optimized to 
---
movzbl  19(%rcx), %r8d
---

So, if the "unsigned char" value is loaded from memory, then the conversion
would also be "no instruction". I'm not sure whether the backend cost model can
handle such a situation.
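
For reference, the scalar pattern in question looks roughly like the sketch
below. The function name load_widen is hypothetical; on x86-64 a compiler would
typically be expected to fold the load and the widening into one zero-extending
load (a movzbl-style instruction), though the exact output depends on the
compiler and flags.

```c
#include <stdint.h>

/* Hypothetical helper: load one 8-bit pixel from memory and widen it to
   16 bits, mirroring the load + (short unsigned int) conversion pair. */
uint16_t load_widen(const uint8_t *pix1)
{
    uint8_t p = pix1[2];   /* 8-bit load from pix1 + 2 bytes */
    return (uint16_t)p;    /* zero-extension to unsigned short */
}
```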
