https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109287

            Bug ID: 109287
           Summary: Optimizing sal shr pairs when inlining function
           Product: gcc
           Version: 12.2.0
               URL: https://gcc.godbolt.org/z/aPTsjc1sM
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: milasudril at gmail dot com
  Target Milestone: ---
            Target: x86-64_linux_gnu

I was trying to construct a span type to be used for working with a tile-based
image:

```
#include <cstdint>
#include <type_traits>
#include <cstddef>

template<class T, size_t TileSize>
class span_2d_tiled
{
public:
    using IndexType = size_t;

    static constexpr size_t tile_size()
    {
        return TileSize;
    }

    constexpr explicit span_2d_tiled(): span_2d_tiled{0u, 0u, nullptr} {}

    constexpr explicit span_2d_tiled(IndexType w, IndexType h, T* ptr):
        m_tilecount_x{1 + (w - 1)/TileSize},
        m_tilecount_y{1 + (h - 1)/TileSize},
        m_ptr{ptr}
    {}

    constexpr auto tilecount_x() const { return m_tilecount_x; }

    constexpr auto tilecount_y() const { return m_tilecount_y; }

    constexpr T& operator()(IndexType x, IndexType y) const
    {
        auto const x_tile = x/TileSize;
        auto const y_tile = y/TileSize;
        auto const x_offset = x%TileSize;
        auto const y_offset = y%TileSize;
        auto const tile_start = y_tile*m_tilecount_x + x_tile;

        return *(m_ptr + tile_start + y_offset*TileSize + x_offset);
    }

private:
    IndexType m_tilecount_x;
    IndexType m_tilecount_y;
    T* m_ptr;
};

template<size_t TileSize, class Func>
void visit_tiles(size_t x_count, size_t y_count, Func&& f)
{
    for(size_t k = 0; k != y_count; ++k)
    {
        for(size_t l = 0; l != x_count; ++l)
        {
            for(size_t y = 0; y != TileSize; ++y)
            {
                for(size_t x = 0; x != TileSize; ++x)
                {
                    f(l*TileSize + x, k*TileSize + y);
                }
            }
        }
    }
}

void do_stuff(float);

void call_do_stuff(span_2d_tiled<float, 16> foo)
{
    visit_tiles<decltype(foo)::tile_size()>(
        foo.tilecount_x(), foo.tilecount_y(),
        [foo](size_t x, size_t y){
            do_stuff(foo(x, y));
        });
}
```

Here, the user of this API wants to access individual pixels. Thus,
visit_tiles transforms the coordinates before calling f: it multiplies the tile
index by TileSize and adds the offset within the tile. In the callback, the
pixel value is looked up. To do that, operator() must find out which tile the
pixel belongs to, and the offset within that tile, which means that the inverse
transformation (division and remainder by TileSize) must be applied. As can be
seen in the Godbolt link, GCC does not fully recognize that the two
transformations cancel out after inlining. However, the latest clang appears to
do a much better job with the same settings. It also unrolls the inner loop,
much better than what I get from GCC when I use

```
#pragma GCC unroll 16
```
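
To make the expected simplification concrete, here is a minimal sketch of the
round trip described above. The names tile_coord, to_pixel, from_pixel,
round_trip_is_identity and check_example are hypothetical and not part of the
reproducer; they only isolate the arithmetic done in visit_tiles and
operator(). The sketch assumes that tile*TileSize does not wrap around. With
unsigned arithmetic the identity does not hold in general, so an optimizer
would presumably need range information to fold it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: a coordinate along one axis, split into
// a tile index and an offset inside that tile (offset < TileSize).
struct tile_coord
{
    std::size_t tile;
    std::size_t offset;
};

// Forward transform, as in visit_tiles: (tile, offset) -> pixel coordinate.
template<std::size_t TileSize>
constexpr std::size_t to_pixel(tile_coord c)
{
    return c.tile*TileSize + c.offset;
}

// Inverse transform, as in operator(): pixel coordinate -> (tile, offset).
template<std::size_t TileSize>
constexpr tile_coord from_pixel(std::size_t x)
{
    return tile_coord{x/TileSize, x%TileSize};
}

// Returns true if the inverse undoes the forward transform for every
// offset inside the given tile. Assumes tile*TileSize does not wrap.
template<std::size_t TileSize>
constexpr bool round_trip_is_identity(std::size_t tile)
{
    for(std::size_t offset = 0; offset != TileSize; ++offset)
    {
        auto const c = from_pixel<TileSize>(
            to_pixel<TileSize>(tile_coord{tile, offset}));
        if(c.tile != tile || c.offset != offset)
        {
            return false;
        }
    }
    return true;
}

static_assert(round_trip_is_identity<16>(0), "tile 0 round-trips");
static_assert(round_trip_is_identity<16>(3), "tile 3 round-trips");

// Runtime spot check of one concrete value, mirroring the static_asserts.
inline void check_example()
{
    auto const c = from_pixel<16>(to_pixel<16>(tile_coord{2, 3}));
    assert(c.tile == 2 && c.offset == 3);
}
```

Since the loop bounds in visit_tiles guarantee x < TileSize and y < TileSize,
the division and remainder in operator() could in principle be replaced by the
loop variables themselves once the lambda is inlined.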
