https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113042
Bug ID: 113042
Summary: popcount of 8bits and 128bits can be improved for !TARGET_CSSC
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: pinskia at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

Take:
```
unsigned h8 (const unsigned char *restrict a)
{
  return __builtin_popcountg (a[0]);
}

unsigned __int128 h128 (const unsigned __int128 *restrict a)
{
  return __builtin_popcountg (a[0]);
}
```

Currently h8 produces:
```
        ldr     b31, [x0]
        cnt     v31.8b, v31.8b
        addv    b31, v31.8b
        fmov    w0, s31
        ret
```

But the addv is not needed here: only byte lane 0 holds loaded data, the ldr zeroes the remaining lanes, and so the cnt result in lane 0 is already the answer. We could instead just get:
```
        ldr     b31, [x0]
        cnt     v31.8b, v31.8b
        smov    w0, v31.b[0]
        ret
```

For h128, the 128-bit value is split into two 64-bit halves, so there are two cnt and two addv, plus a scalar add:
```
        ldp     d30, d31, [x0]
        mov     x1, 0
        cnt     v30.8b, v30.8b
        cnt     v31.8b, v31.8b
        addv    b30, v30.8b
        addv    b31, v31.8b
        fmov    x2, d30
        fmov    x0, d31
        add     x0, x2, x0
        ret
```

But we could instead do a single cnt and a single addv over all 16 bytes:
```
        ldr     q30, [x0]
        mov     x1, 0
        cnt     v30.16b, v30.16b
        addv    b30, v30.16b
        fmov    x0, d30
        ret
```

Basically we need to implement the popcountqi2 and popcountti2 patterns.

Note that for TARGET_CSSC, I suspect using the scalar cnt will still be better, so I won't enable these patterns for that.
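For reference, here is a small scalar model of the two proposed sequences, only to sanity-check the reasoning. It is a sketch, not the pattern implementation: `cnt_byte`, `model_h8` and `model_h128` are made-up names, and the 128-bit input is passed as two 64-bit halves purely for convenience.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: models one byte lane of the vector cnt
   instruction, i.e. the number of set bits in a single byte.  */
static unsigned cnt_byte (uint8_t b)
{
  unsigned n = 0;
  while (b)
    {
      b &= (uint8_t) (b - 1);   /* Clear the lowest set bit.  */
      n++;
    }
  return n;
}

/* Models the proposed h8 sequence: lane 0 after cnt is already the
   full result, so no across-lanes add is needed.  */
unsigned model_h8 (uint8_t x)
{
  return cnt_byte (x);
}

/* Models the proposed h128 sequence: cnt on all 16 byte lanes, then
   one addv summing the lanes.  The maximum sum is 128, which still
   fits in the single byte that addv writes.  */
unsigned model_h128 (uint64_t lo, uint64_t hi)
{
  uint8_t lanes[16];
  unsigned sum = 0;

  /* Lay the two halves out as 16 bytes, like ldr q30 would.  */
  memcpy (lanes, &lo, 8);
  memcpy (lanes + 8, &hi, 8);

  for (int i = 0; i < 16; i++)
    sum += cnt_byte (lanes[i]);
  return sum;
}
```

Since the result is a plain sum over bytes, the byte order used when filling `lanes` does not matter, which is also why a single 16-byte cnt plus addv can replace the two 8-byte halves.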