Branch: refs/heads/pr0
Home: https://github.com/Perl/perl5
Commit: f35b37e4239f6127b55db10c657aed7eca2ba1cd
https://github.com/Perl/perl5/commit/f35b37e4239f6127b55db10c657aed7eca2ba1cd
Author: Karl Williamson <[email protected]>
Date: 2022-01-22 (Sat, 22 Jan 2022)
Changed paths:
M inline.h
Log Message:
-----------
Refactor utf8 to code point conversion
Most such conversions occur in the inlined function
Perl_utf8n_to_uvchr_msgs(), which several macros like utf8n_to_uvchr()
expand to.
This commit effectively removes a conditional from inside the loop, and
avoids some conditionals when converting the common case of the input
being UTF-8 invariant (ASCII on ASCII platforms).
Prior to this commit, the code treated the first pass through the loop
differently from the later passes. Hoisting that work into pre-loop
initialization removes the conditional. That meant rearranging the
loop to be a while (1), with its exit conditions in the middle.
All calls to this function from the Perl core pass in a non-empty
string. But outside calls could conceivably pass an empty one which
could lead to reading outside the buffer. An extra check is added to
non-core calls, as is already done elsewhere.
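
The two changes described above (hoisting the first-byte handling out of
the loop, and guarding against empty input) can be sketched roughly as
follows. This is only an illustration, not the code in inline.h: the
names seq_len(), decode_before() and decode_after() are hypothetical,
and the decoder is a simplified bit-twiddling one that does no overlong
or surrogate checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte count implied by a UTF-8 start byte, or 0 if
 * the byte cannot start a sequence.  Simplified: no overlong checks. */
static size_t seq_len(uint8_t b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Before: every pass through the loop tests "is this the first byte?". */
static int decode_before(const uint8_t *s, size_t avail, uint32_t *cp)
{
    size_t len = 0, i;
    uint32_t v = 0;

    for (i = 0; i < avail; i++) {
        if (i == 0) {                       /* first-time-only work */
            len = seq_len(s[0]);
            if (len == 0) return 0;
            v = (len == 1) ? s[0] : (uint32_t)(s[0] & (0xFF >> (len + 1)));
        }
        else {
            if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation */
            v = (v << 6) | (s[i] & 0x3F);
        }
        if (i + 1 == len) { *cp = v; return 1; }
    }
    return 0;                               /* empty or truncated input */
}

/* After: first-byte work is hoisted; the loop exits from the middle. */
static int decode_after(const uint8_t *s, size_t avail, uint32_t *cp)
{
    size_t len, i = 1;
    uint32_t v;

    if (avail == 0) return 0;               /* guard for empty input */
    len = seq_len(s[0]);
    if (len == 0) return 0;
    v = (len == 1) ? s[0] : (uint32_t)(s[0] & (0xFF >> (len + 1)));

    while (1) {
        if (i == len)   { *cp = v; return 1; }     /* exit: complete */
        if (i == avail) return 0;                  /* exit: truncated */
        if ((s[i] & 0xC0) != 0x80) return 0;       /* exit: malformed */
        v = (v << 6) | (s[i] & 0x3F);
        i++;
    }
}
```

In this sketch a single-byte (invariant) input returns from decode_after()
on the first test inside the loop, without ever examining a second byte.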
This change means that calls from core execute no more conditionals than
the typical:

    if (UTF8_IS_INVARIANT(*s)) {
        code_point = *s;
    }
    else {
        code_point = utf8n_to_uvchr(s, ...)
    }

I'm therefore thinking these can now just be replaced by the simpler

    code_point = utf8n_to_uvchr(s, ...)

without a noticeable hit in performance. The essential difference is
that the former gets its code point from the string already being
examined, and the latter looks up data in a 450-byte static array that
is referenced constantly, so it is likely to be cached.
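
To make the comparison of the two calling patterns concrete, here is a
minimal sketch, assuming an ASCII platform where the invariants are the
bytes below 0x80. The functions decode_any(), get_cp_guarded() and
get_cp_direct() are hypothetical stand-ins, not Perl API, and
decode_any() handles only 1- and 2-byte sequences, without the table
lookup the real function uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a general decoder.  Handles 1- and 2-byte
 * sequences only; sets *used to 0 and returns 0 on anything else. */
static uint32_t decode_any(const uint8_t *s, size_t avail, size_t *used)
{
    if (avail >= 1 && s[0] < 0x80) {        /* invariant: one byte */
        *used = 1;
        return s[0];
    }
    if (avail >= 2 && s[0] >= 0xC2 && s[0] <= 0xDF
        && (s[1] & 0xC0) == 0x80)
    {
        *used = 2;
        return ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    }
    *used = 0;
    return 0;
}

/* Pattern 1: the caller special-cases invariants itself.
 * Assumes avail >= 1, as calls from core do. */
static uint32_t get_cp_guarded(const uint8_t *s, size_t avail, size_t *used)
{
    if (s[0] < 0x80) {
        *used = 1;
        return s[0];
    }
    return decode_any(s, avail, used);
}

/* Pattern 2: the caller always goes through the decoder. */
static uint32_t get_cp_direct(const uint8_t *s, size_t avail, size_t *used)
{
    return decode_any(s, avail, used);
}
```

Both patterns yield the same code point and length for valid input; the
only difference is where the invariant test is performed.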