Re: [Issue 8 drafts 0001633]: Add memrchr

Steffen Nurpmeso via austin-group-l at The Open Group Tue, 27 Aug 2024 13:25:51 -0700

Jilles Tjoelker via austin-group-l at The Open Group wrote in
 <20240827145000.ga2...@stack.nl>:
 |On Fri, Aug 23, 2024 at 12:27:01PM +0200, Alejandro C via austin-group-l
 |at The Open Group wrote:
 |> wmemrchr(), and in general w*(), are functions that deal with wide
 |> characters --which have a fixed width--, not multi-byte characters
 |> --which have a variable width--.
 |
 |> Thus, searching backwards for a wc should be a trivial loop:
 |> [snip] 
 |
 |I agree that the rationale is incorrect. However, I still agree that
 |wmemrchr() should not be added to the standard. Not only would it be
 |invention, but it would boil down to doing work to improve UTF-32
 |support (in most implementations). UTF-32 is inefficient with little
 |compensation (since single code points aren't that meaningful in today's
 |Unicode).


I agree that UTF-32 as such is not very helpful, as in practice
you have to watch for grapheme boundaries, and a grapheme can very
well include modifiers etc. (i.e. even for a base character in the
low 7-bit ASCII compatible range).
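
To illustrate the point about the ASCII range, here is a minimal
sketch, assuming syntactically correct UTF-8 input. The function name
utf8_count_code_points() is made up for this example and is not part
of any standard. It counts code points by skipping continuation bytes;
a plain "e" followed by U+0301 (combining acute accent) counts as two
code points although it is displayed as a single character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string
 * that is assumed to be syntactically correct UTF-8.  Every byte that
 * is not a continuation byte (10xxxxxx) starts a new code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the three-byte string "e\xCC\x81" this gives 2, so neither the
byte count nor the code point count matches what a reader perceives
as one character.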

(And to remark that, at least for older ISO C and POSIX revisions,
wchar_t is not necessarily UTF-32 at all, and Citrus in particular
uses nifty things (when i last looked, .. a decade ago).)

And yes, the entire family of functions is *not* usable in
practice: iswupper*(), towupper*(), these are all thoughtless ISO
C inventions that never spent a thought on reading the Unicode or
ISO 10646 standard at all.
For true internationalization you have to look at entire sentences
if you want to perform case conversions (if applicable) etc etc.
I think this came up on this list a decade ago, but even ISO C23,
as far as i have glanced over it, does not support anything usable
at all.

Anyhow, this isolated inspection of bytes or UTF-32 characters is
rarely useful at all.  And for UTF-8 it is pretty easy to create
jump tables, i think the glib library did that already over two
decades ago (i.e., if you know the UTF-8 string is syntactically
correct, looking at the first byte gives the length of the
multi-byte sequence), and for backward scanning, well, i guess
there are mathematical tricks for scanning multiple bytes
backward and looking for a starter byte.  No "trivial loop" though,
but text processing was only trivial as long as it was all
American (and otherwise careless).
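
A rough sketch of the two ideas above, assuming the input is already
known to be syntactically correct UTF-8. This is not GLib's code; the
names utf8_len_tab, utf8_seq_len() and utf8_prev() are made up for
illustration. The backward scan is the plain byte-at-a-time variant,
not one of the multi-byte tricks.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length indexed by the top 5 bits of the lead byte.
 * 0 marks bytes that cannot start a sequence. */
static const unsigned char utf8_len_tab[32] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 0xxxxxxx: ASCII */
    0, 0, 0, 0, 0, 0, 0, 0,                         /* 10xxxxxx: continuation */
    2, 2, 2, 2,                                     /* 110xxxxx */
    3, 3,                                           /* 1110xxxx */
    4,                                              /* 11110xxx */
    0                                               /* 11111xxx: invalid */
};

/* Length of the multi-byte sequence introduced by `lead`, 0 if `lead`
 * is not a starter byte. */
size_t utf8_seq_len(unsigned char lead)
{
    return utf8_len_tab[lead >> 3];
}

/* Index of the starter byte of the sequence that ends right before
 * `pos`.  Requires pos > 0 and valid UTF-8 in s[0..pos). */
size_t utf8_prev(const unsigned char *s, size_t pos)
{
    assert(pos > 0);
    do {
        --pos;
    } while (pos > 0 && (s[pos] & 0xC0) == 0x80);
    return pos;
}
```

With the six bytes "a\xC3\xA4\xE2\x82\xAC" (a, U+00E4, U+20AC),
utf8_prev() called with 6, then 3, then 1 steps back to offsets 3, 1
and 0, and utf8_seq_len() of the bytes found there yields 3, 2 and 1.
This only locates code points; it does nothing about grapheme
boundaries.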

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
