https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106038
Hongtao.liu <crazylht at gmail dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
The vectorizer saw 2 scalar loads + 2 bit_ops + 2 scalar stores vs. 1 unaligned_load
+ 1 bit_op + 1 unaligned_store, so scaling only the cost of bit_op doesn't help.
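To make the arithmetic concrete, here is a minimal sketch of that comparison. The unit costs and the helper names (scalar_cost, vector_cost, vector_wins) are hypothetical and are not taken from GCC's x86 cost tables; the point is only that the load and store counts dominate, so the bit_op multiplier has to become quite large before the decision flips.

```c
#include <assert.h>

/* Hypothetical per-statement costs, NOT the values GCC actually uses. */
enum { LOAD_COST = 1, BITOP_COST = 1, STORE_COST = 1 };

/* Scalar side: 2 loads + 2 bit_ops + 2 stores. */
static int scalar_cost(void)
{
    return 2 * LOAD_COST + 2 * BITOP_COST + 2 * STORE_COST;
}

/* Vector side: 1 unaligned load + 1 bit_op + 1 unaligned store,
   with only the bit_op cost multiplied by bitop_scale. */
static int vector_cost(int bitop_scale)
{
    assert(bitop_scale >= 1);
    return LOAD_COST + bitop_scale * BITOP_COST + STORE_COST;
}

/* Nonzero when the vector form is still considered cheaper. */
static int vector_wins(int bitop_scale)
{
    return vector_cost(bitop_scale) < scalar_cost();
}
```

Under these assumed unit costs the scalar side costs 6, and the vector side stays cheaper until the bit_op cost is scaled by 4.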
At the RTL level, we have:
(note 3 14 4 2 NOTE_INSN_DELETED)
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 (set (reg:V2QI 87 [ vect__20.19 ])
        (mem:V2QI (reg:DI 91) [0 MEM <const vector(2) unsigned char> [(const uint8_t *)b_11(D)]+0 S2 A8])) "test.c":31:1 1414 {*movv2qi_internal}
     (expr_list:REG_DEAD (reg:DI 91)
        (nil)))
(insn 8 7 9 2 (set (reg:V2QI 88 [ vect__18.16 ])
        (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])) "test.c":31:1 1414 {*movv2qi_internal}
     (expr_list:REG_EQUIV (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])
        (nil)))
(insn 9 8 10 2 (parallel [
            (set (reg:V2QI 89 [ vect__21.20 ])
                (xor:V2QI (reg:V2QI 87 [ vect__20.19 ])
                    (reg:V2QI 88 [ vect__18.16 ])))
            (clobber (reg:CC 17 flags))
        ]) "test.c":31:1 1627 {xorv2qi3}
     (expr_list:REG_DEAD (reg:V2QI 88 [ vect__18.16 ])
        (expr_list:REG_DEAD (reg:V2QI 87 [ vect__20.19 ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (expr_list:REG_EQUIV (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])
                    (nil))))))
(insn 10 9 0 2 (set (mem:V2QI (reg/v/f:DI 85 [ a ]) [0 MEM <vector(2) unsigned char> [(uint8_t *)a_10(D)]+0 S2 A8])
        (reg:V2QI 89 [ vect__21.20 ])) "test.c":31:1 1414 {*movv2qi_internal}
     (expr_list:REG_DEAD (reg:V2QI 89 [ vect__21.20 ])
If the RA can allocate pseudos 87/88/89 to GPRs, the result would be the same as the non-vectorized version.
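As a rough illustration of the GPR case, the sketch below treats the two bytes as one 16-bit value held in a general-purpose register. test.c is not shown in this comment, so xor2_scalar is only a hypothetical stand-in for the original source, xor2_gpr is an illustrative name, and both assume that a and b do not overlap. It models the V2QI load/xor/store sequence at the source level and says nothing about the code GCC actually emits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two scalar statements; the real
   test.c is not shown. Assumes a and b do not overlap. */
static void xor2_scalar(uint8_t *restrict a, const uint8_t *restrict b)
{
    a[0] ^= b[0];
    a[1] ^= b[1];
}

/* The same operation with both bytes handled as a single 16-bit
   integer, i.e. one load per operand, one xor, one store. */
static void xor2_gpr(uint8_t *restrict a, const uint8_t *restrict b)
{
    uint16_t va, vb;
    memcpy(&va, a, sizeof va);   /* 2-byte load with no alignment requirement */
    memcpy(&vb, b, sizeof vb);
    va ^= vb;                    /* xor is bytewise, so byte order is irrelevant */
    memcpy(a, &va, sizeof va);   /* 2-byte store */
}
```

Because xor operates on each bit independently, both functions leave the same two bytes in a regardless of endianness.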