On Fri, 18 Jul 2025 at 07:59, Tomasz Kaminski <tkami...@redhat.com> wrote: > > > > On Thu, Jul 17, 2025 at 7:02 PM Jonathan Wakely <jwak...@redhat.com> wrote: >> >> This implements the missing functions in _Utf_iterator to support >> reverse iteration. All existing tests pass when the view is reversed, so >> that the same code units are seen when iterating forwards or backwards. >> >> libstdc++-v3/ChangeLog: >> >> * include/bits/unicode.h (_Utf_iterator::operator--): Reorder >> conditions and update position after reading a code unit. >> (_Utf_iterator::_M_read_reverse): Define. >> (_Utf_iterator::_M_read_utf8): Return extracted code point. >> (_Utf_iterator::_M_read_reverse_utf8): Define. >> (_Utf_iterator::_M_read_reverse_utf16): Define. >> (_Utf_iterator::_M_read_reverse_utf32): Define. >> * testsuite/ext/unicode/view.cc: Add checks for reversed views >> and reverse iteration. >> --- >> >> Tested x86_64-linux. > > Please add a test where we have a string that contains only utf8 continuation > code-points.
Ah yes, to check the case where we find four continuation bytes in a row, good idea. That isn't needed in the forward iteration case, because you always notice the error on the first continuation byte that doesn't follow a start byte, so it wasn't covered by the existing tests. > Outside that LGTM. >> >> >> libstdc++-v3/include/bits/unicode.h | 136 ++++++++++++++++++++- >> libstdc++-v3/testsuite/ext/unicode/view.cc | 68 +++++++---- >> 2 files changed, 174 insertions(+), 30 deletions(-) >> >> diff --git a/libstdc++-v3/include/bits/unicode.h >> b/libstdc++-v3/include/bits/unicode.h >> index 9a38462e8102..9ff019594b7b 100644 >> --- a/libstdc++-v3/include/bits/unicode.h >> +++ b/libstdc++-v3/include/bits/unicode.h >> @@ -209,10 +209,15 @@ namespace __unicode >> constexpr _Utf_iterator& >> operator--() requires bidirectional_iterator<_Iter> >> { >> - if (!_M_buf_index && _M_curr() != _M_first()) >> - _M_read_reverse(); >> - else if (_M_buf_index) >> + if (_M_buf_index > 0) >> --_M_buf_index; >> + else if (_M_curr() != _M_first()) >> + { >> + _M_read_reverse(); >> + _M_buf_index = _M_buf_last - 1; >> + ranges::advance(_M_curr(), -_M_to_increment); >> + } >> + // else erroneous, but ignored for now. >> return *this; >> } >> >> @@ -269,7 +274,18 @@ namespace __unicode >> } >> >> constexpr void >> - _M_read_reverse(); // TODO >> + _M_read_reverse() requires bidirectional_iterator<_Iter> >> + { >> + if constexpr (sizeof(_FromFmt) == sizeof(uint8_t)) >> + _M_read_reverse_utf8(); >> + else if constexpr (sizeof(_FromFmt) == sizeof(uint16_t)) >> + _M_read_reverse_utf16(); >> + else >> + { >> + static_assert(sizeof(_FromFmt) == sizeof(uint32_t)); >> + _M_read_reverse_utf32(); >> + } >> + } >> >> template<typename> >> struct _Guard >> @@ -285,7 +301,7 @@ namespace __unicode >> _It _M_orig; >> }; >> >> - constexpr void >> + constexpr char32_t >> _M_read_utf8() >> { >> _Guard<_Iter> __g{this, _M_curr()}; >> @@ -383,6 +399,8 @@ namespace __unicode >> __c = _S_error(); >> >> _M_update(__c, __to_incr); >> + >> + return __c; >> } >> >> constexpr void >> @@ -425,6 +443,114 @@ namespace __unicode >> _M_update(__c, 1); >> } >> >> + constexpr void >> + _M_read_reverse_utf8() requires bidirectional_iterator<_Iter> >> + { >> + const auto __first = _M_first(); >> + auto __curr = _M_curr(); >> + // The code point we decode: >> + char32_t __c{}; >> + // The last code unit read: >> + uint8_t __u = *--__curr; >> + // Count of bytes read: >> + uint8_t __to_incr = 1; >> + >> + if (__u <= 0x7F) [[likely]] >> + { >> + _M_update(__u, 1); >> + return; >> + } >> + >> + // Continuation bytes match 10xxxxxx >> + auto __is_continuation = [](uint8_t __b) { >> + return (__b & 0xC0) == 0x80; >> + }; >> + // 0xC0 and 0xC1 would produce overlong encodings of ASCII >> characters. >> + // 0xF5-0xFF would produce code points above U+10FFFF >> + auto __is_invalid = [](uint8_t __b) { >> + return (__b & 0xFE) == 0xC0 || __b >= 0xF5; >> + }; >> + >> + // No valid or invalid multibyte sequence is longer than 4 bytes, >> + // so skip back over at most four bytes. >> + while (__is_continuation(__u) && __to_incr < 4 && __curr != __first) >> + { >> + ++__to_incr; >> + __u = *--__curr; >> + } >> + >> + // If the last byte read was a continuation byte then either we read >> + // four continuation bytes, or stopped at the start of the sequence. >> + // Either way, the maximal subparts are the individual continuation >> + // bytes so each one should be replaced with U+FFFD. >> + if (__is_continuation(__u) || __is_invalid(__u)) [[unlikely]] > > I was wondering why you are checking, as it looked like optimizing the error > path, > but then realized that this is required to make __seq_length computation below > valid. Yes, std::countl_one on a start byte tells you how long a multibyte sequence is, so a 2-byte sequence has 0b110xxxxx as the start byte, and a 3-byte sequence has 0b1110xxxx as the start byte, and 0b11110xxx for a 4-byte sequence. But that only works if we actually have a valid start byte, so I need to exclude the cases where we don't have a valid start byte. I'll add a comment before the line calling countl_one. >> >> + { >> + // Either found four continuation bytes (maximum allowed is >> three) >> + // or first non-continuation byte is an invalid UTF-8 code unit. >> + _M_update(_S_error(), 1); >> + return; >> + } >> + int __seq_length = std::countl_one((unsigned char)__u); >> + if (__seq_length < __to_incr) [[unlikely]] >> + { >> + // If the expected number of continuation bytes is less than >> + // the number we found, then the last continuation byte is a >> + // maximal subpart and the decremented iterator points to it. >> + _M_update(_S_error(), 1); >> + return; >> + } >> + >> + auto __orig = std::__exchange(_M_curr(), std::move(__curr)); >> + if (_M_read_utf8() == _S_error()) [[unlikely]] > > This is a nice trick, to not repeat all checks, and compute the actual value > of > code point. I initially started implementing a reverse decoder, inspecting each byte in turn and checking if it formed part of a valid sequence. That was too hard for my little brain. Reusing the forward decoder here is what Eddie Nolan suggested to me and is what he does in https://github.com/bemanproject/utf_view