Branch: refs/heads/blead
Home:   https://github.com/Perl/perl5
Commit: 61405b774c1372b18658a22e8b0f05df1456e676
    https://github.com/Perl/perl5/commit/61405b774c1372b18658a22e8b0f05df1456e676
Author: Karl Williamson <k...@cpan.org>
Date:   2024-01-01 (Mon, 01 Jan 2024)
Changed paths:
  M t/op/tr.t
  M toke.c

Log Message:
-----------
Fix tr/\N{latin1}...\N{above latin1}/

When a string is being parsed, it isn't made UTF-8 until necessary; that is, until the parse first finds a character that requires UTF-8 to represent. If all the characters prior to that one are ASCII, all that is needed is to convert that one to UTF-8 and to turn on the UTF-8 flag, so that all future characters encountered in the parse will be represented in UTF-8. This is because all ASCII characters have the same representation in UTF-8 as they do otherwise; they are "UTF-8 invariant".

But if a UTF-8 *variant* character was in the string prior to the UTF-8-required one, it must be converted to its UTF-8 representation when the string is converted. To detect this case, all that is needed is to increment a count of variant characters as the parse proceeds. If nothing in the string requires UTF-8 by the end of the parse, the count is ignored and the string remains non-UTF-8. And if the count is zero when a UTF-8-required character is found, then, as mentioned above, that character is converted to UTF-8, and the flag is set to use UTF-8 going forward. But a non-zero count at the first UTF-8-required character indicates that, before proceeding, the already-parsed string must be reparsed to convert the variant characters already in it to UTF-8.

The count was not being incremented when the input notation used \N{}; this commit fixes that. It was being incremented when the input notation used \x{}, which is much more common in the field, so this bug went unnoticed for a long time.

Fixes #21748

(Just for the record, on EBCDIC platforms more characters are UTF-8 invariant than on ASCII platforms; the macros called here hide that from the code.)
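The counting scheme described above can be sketched in a few lines of C. This is not the code in toke.c; it is a minimal illustration that assumes an ASCII platform, a fixed-size buffer, and input already decoded to code points. The names parse_buf, put_utf8, upgrade_in_place and append_cp are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parse state; none of these names come from toke.c. */
typedef struct {
    unsigned char buf[256];
    size_t len;
    size_t variant_count; /* code points 0x80..0xFF stored while non-UTF-8 */
    int is_utf8;
} parse_buf;

/* Encode one code point as UTF-8 at d; returns the number of bytes written. */
static size_t put_utf8(unsigned char *d, uint32_t cp)
{
    if (cp < 0x80) {
        d[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        d[0] = (unsigned char)(0xC0 | (cp >> 6));
        d[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        d[0] = (unsigned char)(0xE0 | (cp >> 12));
        d[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        d[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    d[0] = (unsigned char)(0xF0 | (cp >> 18));
    d[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    d[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    d[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Reparse the bytes stored so far, expanding each variant byte to two
 * UTF-8 bytes. Works backward so no unread byte is overwritten. The
 * result is only correct if variant_count is accurate. */
static void upgrade_in_place(parse_buf *p)
{
    size_t new_len = p->len + p->variant_count;
    size_t src = p->len;
    size_t dst = new_len;

    assert(new_len <= sizeof p->buf);
    while (src > 0) {
        unsigned char c = p->buf[--src];
        if (c < 0x80) {
            p->buf[--dst] = c; /* invariant: copied unchanged */
        } else {
            p->buf[--dst] = (unsigned char)(0x80 | (c & 0x3F));
            p->buf[--dst] = (unsigned char)(0xC0 | (c >> 6));
        }
    }
    p->len = new_len;
}

/* Append one code point, switching to UTF-8 only when first required. */
static void append_cp(parse_buf *p, uint32_t cp)
{
    if (!p->is_utf8) {
        if (cp <= 0xFF) {
            assert(p->len < sizeof p->buf);
            if (cp >= 0x80)
                p->variant_count++; /* the step the \N{} path was missing */
            p->buf[p->len++] = (unsigned char)cp;
            return;
        }
        /* First UTF-8-required character: reparse only if needed. */
        if (p->variant_count > 0)
            upgrade_in_place(p);
        p->is_utf8 = 1;
    }
    assert(p->len + 4 <= sizeof p->buf);
    p->len += put_utf8(p->buf + p->len, cp);
}
```

Starting from a zeroed parse_buf, appending U+0041, U+00E9 and U+0100 in that order leaves the five bytes 41 C3 A9 C4 80 with the flag set. If the variant_count increment is skipped for U+00E9, the reparse never runs and the raw byte E9 is left inside a buffer flagged as UTF-8, which is the kind of corruption the log message describes.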