Re: [PATCH v1 3/6] libstdc++: Improve low-rank layout_{left, right}::stride.

Tomasz Kaminski Mon, 28 Jul 2025 04:05:21 -0700

On Mon, Jul 28, 2025 at 10:24 AM Tomasz Kaminski <tkami...@redhat.com>
wrote:


>
>
>
> On Mon, Jul 28, 2025 at 10:03 AM Luc Grosheintz <luc.groshei...@gmail.com>
> wrote:
>
>>
>>
>> On 7/28/25 08:02, Tomasz Kaminski wrote:
>> > On Sun, Jul 27, 2025 at 2:47 PM Luc Grosheintz <
>> luc.groshei...@gmail.com>
>> > wrote:
>> >
>> >> The methods layout_{left,right}::mapping::stride are defined
>> >> as
>> >>
>> >>    \prod_{i = 0}^r E[i]
>> >>    \prod_{i = r+1}^n E[i]
>> >>
>> >> This is computed as the product of a pre-comupted static product and
>> the
>> >> product of the required dynamic extents.
>> >>
>> >> Disassembly shows that even for low-rank extents, i.e. rank == 1 and
>> >> rank == 2, with at least one  dynamic extent, the generated code loads
>> >> two values; and then runs the loop over at most one element, e.g.
>> >>
>> >>   220:  48 8b 0c f5 00 00 00   mov    rcx,QWORD PTR [rsi*8+0x0]
>> >>   227:  00
>> >>   228:  48 8b 04 f5 00 00 00   mov    rax,QWORD PTR [rsi*8+0x0]
>> >>   22f:  00
>> >>   230:  48 c1 e1 02            shl    rcx,0x2
>> >>   234:  74 1a                  je     250 <stride_left_d5+0x30>
>> >>   236:  48 01 f9               add    rcx,rdi
>> >>   239:  0f 1f 80 00 00 00 00   nop    DWORD PTR [rax+0x0]
>> >>   240:  48 63 17               movsxd rdx,DWORD PTR [rdi]
>> >>   243:  48 83 c7 04            add    rdi,0x4
>> >>   247:  48 0f af c2            imul   rax,rdx
>> >>   24b:  48 39 f9               cmp    rcx,rdi
>> >>   24e:  75 f0                  jne    240 <stride_left_d5+0x20>
>> >>   250:  c3                     ret
>> >>
>> >> If there's no dynamic extents, it simply loads the precomputed product
>> >> of static extents.
>> >>
>> >> For rank == 1 the answer is constant `1`; for rank == 2 it's either 1
>> or
>> >> extents.extent(k), with k == 0 for layout_left and k == 1 for
>> >> layout_right.
>> >>
>> >> Consider,
>> >>
>> >>    using Ed = std::extents<int, dyn>;
>> >>    int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t r)
>> >>    { return m.stride(r); }
>> >>
>> >>    using E3d = std::extents<int, 3, dyn>;
>> >>    int stride_left_3d(const std::layout_left::mapping<E3d>& m, size_t
>> r)
>> >>    { return m.stride(r); }
>> >>
>> >>    using Ed5 = std::extents<int, dyn, 5>;
>> >>    int stride_left_d5(const std::layout_left::mapping<Ed5>& m, size_t
>> r)
>> >>    { return m.stride(r); }
>> >>
>> >> The optimized code for these three cases is:
>> >>
>> >>    0000000000000060 <stride_left_d>:
>> >>    60:  b8 01 00 00 00         mov    eax,0x1
>> >>    65:  c3                     ret
>> >>
>> >>    0000000000000090 <stride_left_3d>:
>> >>    90:  48 83 fe 01            cmp    rsi,0x1
>> >>    94:  19 c0                  sbb    eax,eax
>> >>    96:  83 e0 fe               and    eax,0xfffffffe
>> >>    99:  83 c0 03               add    eax,0x3
>> >>    9c:  c3                     ret
>> >>
>> >>    00000000000000a0 <stride_left_d5>:
>> >>    a0:  b8 01 00 00 00         mov    eax,0x1
>> >>    a5:  48 85 f6               test   rsi,rsi
>> >>    a8:  74 02                  je     ac <stride_left_d5+0xc>
>> >>    aa:  8b 07                  mov    eax,DWORD PTR [rdi]
>> >>    ac:  c3                     ret
>> >>
>> >> For rank == 1 it simply returns 1 (as expected). For rank == 2, it
>> >> either implements a branchless formula, or conditionally loads one
>> >> value. In all cases involving a dynamic extent this seems like it's
>> >> always doing clearly less work, both in terms of computation and loads.
>> >>
>> >> For rank == 2, it trades loading one value for a branchless sequence of
>> >> four instructions that don't require loading any values.
>> >>
>> > I will put this optimization into the __fwd_prod and __rev_pord
>> functions,
>> > so it will be applied for all uses. This will also avoid us creating
>> this
>> > caching
>> > tables for such small ranks.
>>
>> The problem is that we don't have the same amount of information in
>> the stride and __fwd_prod. The valid values for __r are 1, ..., rank;
>> for __fwd_prod it's inclusive, while in stride it's exclusive. Therefore,
>> we can do the optimization with one comparison less in stride than in
>> __fwd_prod.
>>
> Make sense, and I am OK having this optimization there. However, out of
> curiosity, if we put always_inline on the __fwd_prod, wouldn't compiler be
> able
> to eliminate __rank == __i. Also once we have separate call to __size, we
> could
> put assertions (just as a comment) that for __fwd_prod and __rev_prod __r
> is
> required to be smaller than rank().
>
I think I would pursue that direction, and have similar if constexpr in the
__size function:
__rank == 0 -> 1
__rank == 1 -> extent(0)
__rank == 2 -> extent(0) * extent(1)
__rank == rank_dynamic() -> __ext_prod with __sta_ext == 1
otherwise -> use __sta_extr.

>
>> I purposefully checked mdspan::size (part of the previous commit); and
>> on optimized builds it fully unrolls the loop and does everything
>> automatically, meaning it doesn't need help and we're not repeating
>> the same optimization several times.
>>
>> As for avoiding the tables for small ranks, we can refactor _RevProd
>> as follows:
>>
>>    template<size_t... _Extents>
>>      struct _RevProd
>>      {
>>         size_t _S_value(size_t i)
>>         { return _S_data[i]; }
>>
>>      private:
>>         array _S_data = consteval ...;
>>      }
>>
>> and use partial specialization to eliminate _S_data.
>>
>> >
>> >>
>> >> libstdc++-v3/ChangeLog:
>> >>
>> >>          * include/std/mdspan (layout_left::mapping::stride): Optimize
>> >>          for rank <= 2.
>> >>          (layout_right::mapping::stride): Ditto.
>> >>
>> >> Signed-off-by: Luc Grosheintz <luc.groshei...@gmail.com>
>> >> ---
>> >>   libstdc++-v3/include/std/mdspan | 14 ++++++++++++--
>> >>   1 file changed, 12 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/libstdc++-v3/include/std/mdspan
>> >> b/libstdc++-v3/include/std/mdspan
>> >> index 06ccf3e3827..f288af96cdb 100644
>> >> --- a/libstdc++-v3/include/std/mdspan
>> >> +++ b/libstdc++-v3/include/std/mdspan
>> >> @@ -652,7 +652,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >>         requires (extents_type::rank() > 0)
>> >>         {
>> >>          __glibcxx_assert(__i < extents_type::rank());
>> >> -       return __mdspan::__fwd_prod(_M_extents, __i);
>> >> +       if constexpr (extents_type::rank() == 1)
>> >> +         return 1;
>> >> +       else if constexpr (extents_type::rank() == 2)
>> >> +         return __i == 0 ? 1 : _M_extents.extent(0);
>> >> +       else
>> >> +         return __mdspan::__fwd_prod(_M_extents, __i);
>> >>         }
>> >>
>> >>         template<typename _OExtents>
>> >> @@ -797,7 +802,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >>         requires (extents_type::rank() > 0)
>> >>         {
>> >>          __glibcxx_assert(__i < extents_type::rank());
>> >> -       return __mdspan::__rev_prod(_M_extents, __i);
>> >> +       if constexpr (extents_type::rank() == 1)
>> >> +         return 1;
>> >> +       else if constexpr (extents_type::rank() == 2)
>> >> +         return __i == 0 ? _M_extents.extent(1) : 1;
>> >> +       else
>> >> +         return __mdspan::__rev_prod(_M_extents, __i);
>> >>         }
>> >>
>> >>         template<typename _OExtents>
>> >> --
>> >> 2.50.0
>> >>
>> >>
>> >
>>
>>

Re: [PATCH v1 3/6] libstdc++: Improve low-rank layout_{left, right}::stride.

Reply via email to