Branch: refs/heads/blead
  Home:   https://github.com/Perl/perl5
  Commit: 61405b774c1372b18658a22e8b0f05df1456e676
      
https://github.com/Perl/perl5/commit/61405b774c1372b18658a22e8b0f05df1456e676
  Author: Karl Williamson <k...@cpan.org>
  Date:   2024-01-01 (Mon, 01 Jan 2024)

  Changed paths:
    M t/op/tr.t
    M toke.c

  Log Message:
  -----------
  Fix tr/\N{latin1}...\N{above latin1}/

When a string is being parsed, it isn't made UTF-8 until necessary; that
is, until the parser first finds a character that requires UTF-8 to
represent.  If
all the characters prior to that one are ASCII, all that is needed is to
convert that one to UTF-8 and to turn on the UTF-8 flag, so that all
future characters encountered in the parse will be represented in UTF-8.

This works because every ASCII character has the same representation
whether or not the string is UTF-8; such characters are "UTF-8
invariant".  But if a UTF-8 *variant* character appeared in the string
before the UTF-8-required one, it must be converted to its UTF-8
representation when the string is converted.
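
As a rough illustration of the invariant/variant distinction on an ASCII
platform, here is a minimal sketch.  The function names are hypothetical
and are not the macros toke.c actually uses; the sketch assumes the input
is a single Latin-1 byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: on ASCII platforms a byte is UTF-8 invariant
 * exactly when its high bit is clear. */
static int is_utf8_invariant(uint8_t c)
{
    return c < 0x80;
}

/* Hypothetical helper: write the UTF-8 form of one Latin-1 byte into
 * out (room for 2 bytes) and return the number of bytes written. */
static size_t latin1_to_utf8(uint8_t c, uint8_t *out)
{
    assert(out != NULL);
    if (is_utf8_invariant(c)) {
        out[0] = c;                              /* same byte either way */
        return 1;
    }
    out[0] = (uint8_t)(0xC0 | (c >> 6));         /* lead byte */
    out[1] = (uint8_t)(0x80 | (c & 0x3F));       /* continuation byte */
    return 2;
}
```

Under these assumptions 'A' (0x41) stays one byte, while 0xE9 becomes the
two bytes 0xC3 0xA9.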

To know whether such a conversion will be needed, it suffices to keep a
count of the variant characters seen as the parse proceeds.

If nothing in the string requires UTF-8 by the end of the parse, the
count is ignored and the string remains non-UTF-8.

And if the count is zero when a UTF-8-required character is found, as
mentioned above, that character is converted to UTF-8, and the flag is
set to use UTF-8 going forward.

But a non-zero count at the first UTF-8-required character indicates
that before proceeding, the already-parsed string must be reparsed to
convert the variant characters already in it to UTF-8.
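
The three cases above can be sketched roughly as follows.  This is a
simplified model for ASCII platforms, not the code in toke.c: the
parse_state struct, the fixed-size buffer, and all function names are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a string under construction. */
typedef struct {
    uint8_t buf[256];
    size_t  len;            /* bytes stored so far */
    size_t  variant_count;  /* stored bytes >= 0x80 while still non-UTF-8 */
    int     is_utf8;        /* set once some character required UTF-8 */
} parse_state;

/* Encode one code point (<= 0x10FFFF) as UTF-8; return the byte count. */
static size_t encode_utf8(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80)    { out[0] = (uint8_t)cp; return 1; }
    if (cp < 0x800)   { out[0] = (uint8_t)(0xC0 | (cp >> 6));
                        out[1] = (uint8_t)(0x80 | (cp & 0x3F)); return 2; }
    if (cp < 0x10000) { out[0] = (uint8_t)(0xE0 | (cp >> 12));
                        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
                        out[2] = (uint8_t)(0x80 | (cp & 0x3F)); return 3; }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Re-encode the bytes already stored.  Each variant byte grows to two
 * bytes, so work backwards to do the conversion in place. */
static void upgrade_parsed_bytes(parse_state *st)
{
    size_t src = st->len;
    size_t dst = st->len + st->variant_count;

    assert(!st->is_utf8);
    st->len = dst;
    while (src > 0) {
        uint8_t c = st->buf[--src];
        if (c < 0x80) {
            st->buf[--dst] = c;
        } else {
            st->buf[--dst] = (uint8_t)(0x80 | (c & 0x3F));
            st->buf[--dst] = (uint8_t)(0xC0 | (c >> 6));
        }
    }
    st->variant_count = 0;
}

/* Append one code point; return 0 on success, -1 if it cannot be stored. */
static int add_char(parse_state *st, uint32_t cp)
{
    if (cp > 0x10FFFF || st->len + st->variant_count + 4 > sizeof st->buf)
        return -1;
    if (!st->is_utf8 && cp < 0x100) {
        st->buf[st->len++] = (uint8_t)cp;
        if (cp >= 0x80)
            st->variant_count++;   /* needed for every input notation */
        return 0;
    }
    if (!st->is_utf8) {            /* first UTF-8-required character */
        if (st->variant_count > 0)
            upgrade_parsed_bytes(st);
        st->is_utf8 = 1;
    }
    st->len += encode_utf8(cp, st->buf + st->len);
    return 0;
}
```

In this model, adding 'a', U+E9, and then U+100 leaves the bytes
61 C3 A9 C4 80 in the buffer.  If the increment were skipped for U+E9,
the raw byte E9 would be left unconverted in a string now flagged as
UTF-8.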

The count was not being incremented when the input notation used \N{};
this commit fixes that.  It was being incremented when the input
notation used \x{}, which is much more common in the field, so this bug
went unnoticed for a long time.

Fixes #21748

(Just for the record, on EBCDIC platforms more characters are UTF-8
invariant than on ASCII platforms; the macros called here hide that
from the code.)

