https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92175

            Bug ID: 92175
           Summary: x86 backend claims V4SI multiplication support,
                    preventing more optimal pattern
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

Costing has

  /* Without sse4.1, we don't have PMULLD; it's emulated with 7
     insns, including two PMULUDQ.  */
  else if (mode == V4SImode && !(TARGET_SSE4_1 || TARGET_AVX))
    return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 5);

but for a testcase doing just x * 2 that is excessive.  The vectorizer
would change that to x << 1 via vect_recog_mult_pattern (yeah, oddly not
to x + x ...).

This causes SSE vectorization to be disregarded easily, falling back to
MMX "emulation" mode which doesn't claim V4SImode multiplication support
producing essentially SSE code but with only half of the lanes doing
useful work.

I'm not sure if pattern recog should try costing here.  Certainly the
vectorizer won't try the PMULUDQ variant if the backend would claim to
not support V4SImode mult.

Noticed for the testcase in PR92173.
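
For illustration only, here is a minimal sketch of the two things being
compared above.  It is not the testcase from PR92173, and all function
names (dbl_mul, dbl_shl, dbl_add, mul_v4si) are made up for this example.
The first three functions show that x * 2, x << 1 and x + x compute the
same value (unsigned elements are used so that wraparound is well defined
in C).  mul_v4si shows one well-known way to get a 4 x 32-bit low multiply
from SSE2 alone, using two PMULUDQ plus shifts and shuffles; it is not
claimed to be the exact sequence GCC expands to, and it falls back to a
scalar loop when SSE2 is not available.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Doubling written as a multiply, the form the cost model prices
   as a full V4SI multiplication.  */
void dbl_mul(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2;
}

/* The same operation as a shift.  */
void dbl_shl(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] << 1;
}

/* The same operation as an addition.  */
void dbl_add(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + src[i];
}

/* General 4 x 32-bit multiply keeping the low 32 bits of each product.  */
void mul_v4si(uint32_t out[4], const uint32_t a[4], const uint32_t b[4])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PMULUDQ: 64-bit products of lanes 0 and 2.  */
    __m128i even = _mm_mul_epu32(va, vb);
    /* Move lanes 1 and 3 into positions 0 and 2, then PMULUDQ again.  */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(va, 4),
                                _mm_srli_si128(vb, 4));
    /* Gather the low 32 bits of each 64-bit product into lanes 0 and 1.  */
    __m128i lo_even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i lo_odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    /* Interleave back into lane order 0, 1, 2, 3.  */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi32(lo_even, lo_odd));
#else
    /* Scalar fallback for targets without SSE2.  */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * b[i];
#endif
}
```

In the SSE2 path that is two multiplies, two byte shifts, two shuffles
and one unpack, against a single shift or add per vector for the x * 2
case.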