https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113042

            Bug ID: 113042
           Summary: popcount of 8bits and 128bits can be improved for
                    !TARGET_CSSC
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: pinskia at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take:
```
unsigned h8 (const unsigned char *restrict a) {
  return __builtin_popcountg (a[0]);
}


unsigned __int128 h128 (const unsigned __int128 *restrict a) {
  return __builtin_popcountg (a[0]);
}

```

Currently h8 produces:
```
        ldr     b31, [x0]
        cnt     v31.8b, v31.8b
        addv    b31, v31.8b
        fmov    w0, s31
        ret
```
But the addv is not needed here and we could instead just get:
```
        ldr     b31, [x0]
        cnt     v31.8b, v31.8b
        smov    w0, v31.b[0]
        ret
```
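
The reasoning here is that `ldr b31` zeroes the rest of the vector register, so after `cnt` only lane 0 can be non-zero and the across-lanes sum equals lane 0. The value is also at most 8, so sign- versus zero-extension on the move out makes no difference. A small portable C model of that argument follows. It is only a sketch, and the helper names are made up for illustration; none of this is GCC code:

```c
#include <stdint.h>

/* Hypothetical model of CNT on one byte lane: number of set bits, 0..8.  */
static unsigned cnt_lane (uint8_t b)
{
  unsigned n = 0;
  for (; b; b &= (uint8_t) (b - 1))   /* clear the lowest set bit */
    n++;
  return n;
}

/* Model of the current sequence: ldr b, cnt .8b, addv.  */
static unsigned h8_with_addv (uint8_t x)
{
  uint8_t lanes[8] = { x };           /* lanes 1..7 are zero, as after ldr b31 */
  unsigned sum = 0;
  for (int i = 0; i < 8; i++)
    sum += cnt_lane (lanes[i]);
  return sum;
}

/* Model of the proposed sequence: ldr b, cnt .8b, move lane 0 out.  */
static unsigned h8_lane0_only (uint8_t x)
{
  return cnt_lane (x);
}
```

Both models agree for every 8-bit input, which is why the addv adds nothing in this case.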

For h128, we currently get two cnt and two addv:
```
        ldp     d30, d31, [x0]
        mov     x1, 0
        cnt     v30.8b, v30.8b
        cnt     v31.8b, v31.8b
        addv    b30, v30.8b
        addv    b31, v31.8b
        fmov    x2, d30
        fmov    x0, d31
        add     x0, x2, x0
        ret
```

But we could instead do:
```
        ldr     q30, [x0]
        mov     x1, 0
        cnt     v30.16b, v30.16b
        addv    b30, v30.16b
        fmov    x0, d30
        ret
```
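
A single 16-byte cnt followed by one addv is enough because the 128-bit popcount is just the sum of the 16 per-byte counts, and that sum is at most 128, so it fits in the 8-bit addv result. A rough C model of the equivalence follows. The function names are hypothetical, and the two `uint64_t` halves stand in for the 128-bit value:

```c
#include <stdint.h>

/* Model of the current code: two 64-bit popcounts, then a scalar add.  */
static unsigned popcount128_two_halves (uint64_t lo, uint64_t hi)
{
  return (unsigned) __builtin_popcountll (lo)
         + (unsigned) __builtin_popcountll (hi);
}

/* Model of the proposed code: per-byte counts over 16 lanes, one sum.  */
static unsigned popcount128_one_vector (uint64_t lo, uint64_t hi)
{
  unsigned sum = 0;
  for (int i = 0; i < 8; i++)
    {
      sum += (unsigned) __builtin_popcount ((unsigned) ((lo >> (8 * i)) & 0xff));
      sum += (unsigned) __builtin_popcount ((unsigned) ((hi >> (8 * i)) & 0xff));
    }
  return sum;   /* at most 16 * 8 = 128 */
}
```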

Basically we need to implement popcountqi2 and popcountti2 patterns.

Note that for TARGET_CSSC, I suspect using the scalar cnt will still be better,
so I won't enable these patterns in that case.
