On Fri, 18 Jul 2025 at 07:59, Tomasz Kaminski <tkami...@redhat.com> wrote:
>
>
>
> On Thu, Jul 17, 2025 at 7:02 PM Jonathan Wakely <jwak...@redhat.com> wrote:
>>
>> This implements the missing functions in _Utf_iterator to support
>> reverse iteration. All existing tests pass when the view is reversed, so
>> that the same code units are seen when iterating forwards or backwards.
>>
>> libstdc++-v3/ChangeLog:
>>
>>         * include/bits/unicode.h (_Utf_iterator::operator--): Reorder
>>         conditions and update position after reading a code unit.
>>         (_Utf_iterator::_M_read_reverse): Define.
>>         (_Utf_iterator::_M_read_utf8): Return extracted code point.
>>         (_Utf_iterator::_M_read_reverse_utf8): Define.
>>         (_Utf_iterator::_M_read_reverse_utf16): Define.
>>         (_Utf_iterator::_M_read_reverse_utf32): Define.
>>         * testsuite/ext/unicode/view.cc: Add checks for reversed views
>>         and reverse iteration.
>> ---
>>
>> Tested x86_64-linux.
>
> Please add a test where we have a string that contains only utf8 continuation 
> code-points.

Ah yes, to check the case where we find four continuation bytes in a
row, good idea. That isn't needed in the forward iteration case,
because you always notice the error on the first continuation byte
that doesn't follow a start byte, so it wasn't covered by the existing
tests.

> Outside that LGTM.
>>
>>
>>  libstdc++-v3/include/bits/unicode.h        | 136 ++++++++++++++++++++-
>>  libstdc++-v3/testsuite/ext/unicode/view.cc |  68 +++++++----
>>  2 files changed, 174 insertions(+), 30 deletions(-)
>>
>> diff --git a/libstdc++-v3/include/bits/unicode.h 
>> b/libstdc++-v3/include/bits/unicode.h
>> index 9a38462e8102..9ff019594b7b 100644
>> --- a/libstdc++-v3/include/bits/unicode.h
>> +++ b/libstdc++-v3/include/bits/unicode.h
>> @@ -209,10 +209,15 @@ namespace __unicode
>>        constexpr _Utf_iterator&
>>        operator--() requires bidirectional_iterator<_Iter>
>>        {
>> -       if (!_M_buf_index && _M_curr() != _M_first())
>> -         _M_read_reverse();
>> -       else if (_M_buf_index)
>> +       if (_M_buf_index > 0)
>>           --_M_buf_index;
>> +       else if (_M_curr() != _M_first())
>> +         {
>> +           _M_read_reverse();
>> +           _M_buf_index = _M_buf_last - 1;
>> +           ranges::advance(_M_curr(), -_M_to_increment);
>> +         }
>> +       // else erroneous, but ignored for now.
>>         return *this;
>>        }
>>
>> @@ -269,7 +274,18 @@ namespace __unicode
>>        }
>>
>>        constexpr void
>> -      _M_read_reverse(); // TODO
>> +      _M_read_reverse() requires bidirectional_iterator<_Iter>
>> +      {
>> +       if constexpr (sizeof(_FromFmt) == sizeof(uint8_t))
>> +         _M_read_reverse_utf8();
>> +       else if constexpr (sizeof(_FromFmt) == sizeof(uint16_t))
>> +         _M_read_reverse_utf16();
>> +       else
>> +         {
>> +           static_assert(sizeof(_FromFmt) == sizeof(uint32_t));
>> +           _M_read_reverse_utf32();
>> +         }
>> +      }
>>
>>        template<typename>
>>         struct _Guard
>> @@ -285,7 +301,7 @@ namespace __unicode
>>           _It _M_orig;
>>         };
>>
>> -      constexpr void
>> +      constexpr char32_t
>>        _M_read_utf8()
>>        {
>>         _Guard<_Iter> __g{this, _M_curr()};
>> @@ -383,6 +399,8 @@ namespace __unicode
>>           __c = _S_error();
>>
>>         _M_update(__c, __to_incr);
>> +
>> +       return __c;
>>        }
>>
>>        constexpr void
>> @@ -425,6 +443,114 @@ namespace __unicode
>>         _M_update(__c, 1);
>>        }
>>
>> +      constexpr void
>> +      _M_read_reverse_utf8() requires bidirectional_iterator<_Iter>
>> +      {
>> +       const auto __first = _M_first();
>> +       auto __curr = _M_curr();
>> +       // The code point we decode:
>> +       char32_t __c{};
>> +       // The last code unit read:
>> +       uint8_t __u = *--__curr;
>> +       // Count of bytes read:
>> +       uint8_t __to_incr = 1;
>> +
>> +       if (__u <= 0x7F) [[likely]]
>> +         {
>> +           _M_update(__u, 1);
>> +           return;
>> +         }
>> +
>> +       // Continuation bytes match 10xxxxxx
>> +       auto __is_continuation = [](uint8_t __b) {
>> +         return (__b & 0xC0) == 0x80;
>> +       };
>> +       // 0xC0 and 0xC1 would produce overlong encodings of ASCII 
>> characters.
>> +       // 0xF5-0xFF would produce code points above U+10FFFF
>> +       auto __is_invalid = [](uint8_t __b) {
>> +         return (__b & 0xFE) == 0xC0 || __b >= 0xF5;
>> +       };
>> +
>> +       // No valid or invalid multibyte sequence is longer than 4 bytes,
>> +       // so skip back over at most four bytes.
>> +       while (__is_continuation(__u) && __to_incr < 4 && __curr != __first)
>> +         {
>> +           ++__to_incr;
>> +           __u = *--__curr;
>> +         }
>> +
>> +       // If the last byte read was a continuation byte then either we read
>> +       // four continuation bytes, or stopped at the start of the sequence.
>> +       // Either way, the maximal subparts are the individual continuation
>> +       // bytes so each one should be replaced with U+FFFD.
>> +       if (__is_continuation(__u) || __is_invalid(__u)) [[unlikely]]
>
> I was wondering why you are checking, as it looked like optimizing the error 
> path,
> but then realized that this is required to make __seq_length computation below
> valid.

Yes, std::countl_one on a start byte tells you how long a multibyte
sequence is, so a 2-byte sequence has 0b110xxxxx as the start byte,
and a 3-byte sequence has 0b1110xxxx as the start byte, and 0b11110xxx
for a 4-byte sequence. But that only works if we actually have a valid
start byte, so I need to exclude the cases where we don't have a valid
start byte. I'll add a comment before the line calling countl_one.

>>
>> +         {
>> +           // Either found four continuation bytes (maximum allowed is 
>> three)
>> +           // or first non-continuation byte is an invalid UTF-8 code unit.
>> +           _M_update(_S_error(), 1);
>> +           return;
>> +         }
>> +       int __seq_length = std::countl_one((unsigned char)__u);
>> +       if (__seq_length < __to_incr) [[unlikely]]
>> +         {
>> +           // If the expected number of continuation bytes is less than
>> +           // the number we found, then the last continuation byte is a
>> +           // maximal subpart and the decremented iterator points to it.
>> +           _M_update(_S_error(), 1);
>> +           return;
>> +         }
>> +
>> +       auto __orig = std::__exchange(_M_curr(), std::move(__curr));
>> +       if (_M_read_utf8() == _S_error()) [[unlikely]]
>
> This is a nice trick, to not repeat all checks, and compute the actual value 
> of
> code point.

I initially started implementing a reverse decoder, inspecting each
byte in turn and checking if it formed part of a valid sequence. That
was too hard for my little brain.

Reusing the forward decoder here is what Eddie Nolan suggested to me
and is what he does in https://github.com/bemanproject/utf_view

Reply via email to