[Perl/perl5] f35b37: Refactor utf8 to code point conversion

Karl Williamson via perl5-changes Sat, 22 Jan 2022 09:47:25 -0800

  Branch: refs/heads/pr0
  Home:   https://github.com/Perl/perl5
  Commit: f35b37e4239f6127b55db10c657aed7eca2ba1cd
      
https://github.com/Perl/perl5/commit/f35b37e4239f6127b55db10c657aed7eca2ba1cd
  Author: Karl Williamson <[email protected]>
  Date:   2022-01-22 (Sat, 22 Jan 2022)


  Changed paths:
    M inline.h

  Log Message:
  -----------
  Refactor utf8 to code point conversion

Most such conversions occur in the inlined function
Perl_utf8n_to_uvchr_msgs(), which several macros like utf8n_to_uvchr()
expand to.

This commit effectively removes a conditional from inside the loop, and
avoids some conditionals when converting the common case of the input
being UTF-8 invariant (ASCII on ASCII platforms).

Prior to this commit, the loop handled its first iteration differently
from the later ones.  Hoisting that first-iteration work into pre-loop
initialization removes the conditional.  Doing so meant rearranging the
loop into a while(1) with its exit conditions in the middle.
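
As a rough illustration of that hoisting (a simplified sketch, not the
actual code in inline.h), the two shapes might look like the following.
The names seq_len(), decode_first_in_loop() and decode_hoisted() are
hypothetical, and the decoder handles only plain UTF-8 with minimal
validation, returning U+FFFD on failure as an assumed convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sequence length implied by the start byte,
 * or 0 if the byte cannot start a sequence. */
static size_t seq_len(uint8_t b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Before: every pass through the loop asks "is this the first byte?" */
static uint32_t decode_first_in_loop(const uint8_t *s, size_t len, size_t *used)
{
    size_t need = 0, i;
    uint32_t cp = 0;

    assert(len > 0);
    for (i = 0; i < len; i++) {
        if (i == 0) {                       /* first-time-only work */
            need = seq_len(s[0]);
            if (need == 0) break;
            cp = (need == 1) ? s[0] : (uint32_t)(s[0] & (0xFF >> (need + 1)));
        }
        else {
            if ((s[i] & 0xC0) != 0x80) break;   /* not a continuation byte */
            cp = (cp << 6) | (s[i] & 0x3F);
        }
        if (i + 1 == need) { *used = need; return cp; }
    }
    *used = 0;
    return 0xFFFD;
}

/* After: the first byte is handled before the loop, which becomes a
 * while(1) whose exit tests sit inside the body. */
static uint32_t decode_hoisted(const uint8_t *s, size_t len, size_t *used)
{
    size_t need, i = 1;
    uint32_t cp;

    assert(len > 0);
    need = seq_len(s[0]);
    if (need == 0 || need > len) { *used = 0; return 0xFFFD; }
    cp = (need == 1) ? s[0] : (uint32_t)(s[0] & (0xFF >> (need + 1)));

    while (1) {
        if (i == need) { *used = need; return cp; }  /* exit: complete */
        if ((s[i] & 0xC0) != 0x80) break;            /* exit: malformed */
        cp = (cp << 6) | (s[i] & 0x3F);
        i++;
    }
    *used = 0;
    return 0xFFFD;
}
```

In the hoisted form, a one-byte input returns on the first test inside
the loop without ever evaluating the "first byte?" branch.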

All calls to this function from the Perl core pass in a non-empty
string.  But outside calls could conceivably pass an empty one, which
could lead to reading outside the buffer.  An extra check is added for
non-core calls, as is already done elsewhere.
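
One way to picture that guard is the sketch below.  It is an
assumption-laden illustration, not the code from this commit:
MY_CORE_BUILD is a made-up build flag standing in for "compiled as part
of the core", first_byte_checked() is a hypothetical function, and
returning 0 with *used set to 0 for empty input is an assumed
convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: defined only when building the core itself. */
/* #define MY_CORE_BUILD */

static uint32_t first_byte_checked(const uint8_t *s, size_t len, size_t *used)
{
#ifdef MY_CORE_BUILD
    /* Core callers are trusted never to pass an empty string, so the
     * release build pays nothing; debugging builds still verify it. */
    assert(len > 0);
#else
    /* Outside callers get a real check so that s[0] is never read
     * from an empty buffer. */
    if (len == 0) {
        *used = 0;
        return 0;
    }
#endif
    *used = 1;
    return s[0];
}
```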

This change means that calls from core execute no more conditionals than
the typical:

    if (UTF8_IS_INVARIANT(*s)) {
        code_point = *s;
    }
    else {
        code_point = utf8n_to_uvchr(s, ...);
    }

I'm therefore thinking these can now just be replaced by the simpler

    code_point = utf8n_to_uvchr(s, ...);

without a noticeable hit in performance.  The essential difference is
that the former gets its code point from the string already being
examined, while the latter looks up data in a 450-byte static array that
is referred to constantly, and so is likely to be cached.
