[Perl/perl5] 1ae36e: locale.c: Convert to conditional ? : operator

Karl Williamson via perl5-changes Sun, 12 Nov 2023 16:45:16 -0800

  Branch: refs/heads/blead
  Home:   https://github.com/Perl/perl5
  Commit: 1ae36e9bbaa155adcc9230c5e4232b276a2a33fb
      
https://github.com/Perl/perl5/commit/1ae36e9bbaa155adcc9230c5e4232b276a2a33fb
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)


  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: Convert to conditional ? : operator

I think this makes it less clumsy.


  Commit: 523eea4765a728d47445079c80c774a5b3902576
      
https://github.com/Perl/perl5/commit/523eea4765a728d47445079c80c774a5b3902576
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: Rename variable

This was shadowing an outer variable, and conflating two things.  We are
looking for the UTF8ness of some strings in a locale to try to divine if
the locale itself is a UTF-8 one or not.  But we're doing this in the
context of trying to find the CODESET of the locale, like 8859-1 or
UTF-8.  And the CODESET name itself is always going to be an ASCII
string.  Thus there are two types of utf8ness being looked at
here, and the names of the variables for each should be distinct.


  Commit: d8fc44bc3477dad1dbd623be1e193e8a2a70401f
      
https://github.com/Perl/perl5/commit/d8fc44bc3477dad1dbd623be1e193e8a2a70401f
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: Avoid some mallocs

By reusing this buffer, we don't have to realloc unless the next thing
to store in it is bigger than the first.  The order of calling already
has abbreviations (hence shorter) coming after their full names.
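
A minimal sketch of the buffer-reuse idea described above, not the actual
locale.c code.  The type scratch_buf and the function scratch_store() are
hypothetical names made up for illustration.  The buffer only ever grows,
so storing a full name first and its abbreviation second costs a single
allocation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical grow-only scratch buffer (illustrative; not from locale.c). */
typedef struct {
    char  *buf;   /* NULL until first use */
    size_t size;  /* bytes currently allocated */
} scratch_buf;

/* Copy 'src' into the scratch buffer, reallocating only when the existing
 * allocation is too small.  Returns the stored string, or NULL if the
 * allocation fails (the old contents are left untouched in that case). */
static const char *
scratch_store(scratch_buf *sb, const char *src)
{
    size_t needed = strlen(src) + 1;    /* include the trailing NUL */

    if (needed > sb->size) {
        char *grown = realloc(sb->buf, needed);
        if (grown == NULL) {
            return NULL;
        }
        sb->buf  = grown;
        sb->size = needed;
    }

    memcpy(sb->buf, src, needed);
    return sb->buf;
}
```

Calling scratch_store() with "September" and then with "Sep" leaves the
allocated size unchanged on the second call, which is the saving the log
message refers to.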


  Commit: 868f26346d64105799b6828a6bd136ba01464cdc
      
https://github.com/Perl/perl5/commit/868f26346d64105799b6828a6bd136ba01464cdc
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: Remove potential infinite recursive call

In Configurations where this #ifdef'd code is compiled, we recursively
call my_langinfo().  Prior to this commit, the call asked for the
UTF8ness of the returned string.  Depending on the particular values
involved, that could lead to this same code being executed to determine
the UTF8ness of the locale.  This would have proceeded ad infinitum
except a previous commit had created flags so as to skip any call that
would recurse infinitely.  But that can lead to erroneous results,
because when skipped, we may not know what the answer is.

This commit avoids all that by not asking the recursed call to return
the UTF8ness of the string, but instead using a heuristic to get its value
here.  This avoids needing to know the locale's UTF8ness (which is where
the infinite recursion would come from).  The heuristic is that if it is
illegal UTF-8, it isn't UTF-8; if it is plain ASCII, we can't tell; and
if it is legal UTF-8, it will be tentatively considered UTF-8.  This is
just one iteration of a loop through a bunch of strings, so that after
all the accumulated evidence of all iterations, we have confidence that
the total result is correct.
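
The three-way heuristic above could be sketched roughly as follows.  This
is an illustration only: the enum utf8ness_guess and the function
guess_utf8ness() are hypothetical names, and the validity check assumes
strict Unicode UTF-8 (no overlongs, no surrogates, nothing above
U+10FFFF), which may differ from what perl itself accepts.

```c
#include <stddef.h>

/* Hypothetical result type for a single string (not from locale.c). */
typedef enum {
    NOT_UTF8,    /* contains a sequence that is illegal UTF-8 */
    CANT_TELL,   /* plain ASCII: consistent with any ASCII-based code set */
    MAYBE_UTF8   /* has non-ASCII bytes and all are well-formed UTF-8 */
} utf8ness_guess;

static utf8ness_guess
guess_utf8ness(const unsigned char *s, size_t len)
{
    int    saw_multibyte = 0;
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t        extra;    /* continuation bytes expected */
        unsigned long cp, min;  /* code point; smallest legal value */

        if (c < 0x80) {         /* ASCII says nothing either way */
            i++;
            continue;
        }

        if (c >= 0xC2 && c <= 0xDF)      { extra = 1; cp = c & 0x1F; min = 0x80;    }
        else if ((c & 0xF0) == 0xE0)     { extra = 2; cp = c & 0x0F; min = 0x800;   }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return NOT_UTF8;   /* stray continuation or invalid lead byte */

        if (len - i <= extra) {
            return NOT_UTF8;    /* sequence is cut off by end of string */
        }

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return NOT_UTF8;
            }
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return NOT_UTF8;    /* overlong, out of range, or surrogate */
        }

        saw_multibyte = 1;
        i += extra + 1;
    }

    return saw_multibyte ? MAYBE_UTF8 : CANT_TELL;
}
```

A caller looping over many locale strings would treat any NOT_UTF8 as
decisive, ignore CANT_TELL, and accumulate MAYBE_UTF8 results as evidence.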

There are other code sections that also have the potential for infinite
recursion.  The next commits handle these.


  Commit: 2a8fb4df77474a4ee6fba9462c25cdb81d0977c9
      
https://github.com/Perl/perl5/commit/2a8fb4df77474a4ee6fba9462c25cdb81d0977c9
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: Differently avoid infinite recursion

This commit changes the mechanism for avoiding potential infinite
recursion in my_localeconv().

Normally the UTF8ness of the locale is determined.  Then all strings
returned by localeconv() are examined to see if their SVs need to be
marked as UTF-8 or not.  Knowing that the locale is or isn't UTF-8 helps
in that determination.

But in figuring that out, some Configurations call this function asking
for just a single item.  That would lead to infinite recursion.

To avoid that, on such Configurations prior to this commit, the
UTF8ness of the overall locale wasn't calculated, but instead each
item's UTF8ness was calculated individually.  It's complicated, but it
turns out doing this finesses the issue.  See below for a fuller
explanation.

This commit changes things so that for single item calls, the UTF8-ness
isn't determined here, but the caller does it itself, and it doesn't
generally need the locale's UTF8ness to make that determination.

To expand on why it's complicated:  This situation arises only on
Configurations where calculating the UTF8ness of the locale may not be
reliable.  But it very likely is reliable except for English locales
whose currency symbol is plain ASCII, such as the USA and Canada and
other former members of the British Empire who use the dollar sign for
their currency symbol.  (I told you it was complicated.)  But for such
locales, the strings are going to all be ASCII, so they aren't going to
be UTF-8, so we don't need to know the locale's UTF8ness.  What the
previous mechanism and this new one share is that both use the function
get_locale_string_utf8ness_i(), and that function has the intelligence
to not need the locale's UTF8ness for an ASCII string.


  Commit: 95a47f01ee33fb949895f73a78f8144fe4c97cef
      
https://github.com/Perl/perl5/commit/95a47f01ee33fb949895f73a78f8144fe4c97cef
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M embedvar.h
    M intrpvar.h
    M locale.c
    M sv.c

  Log Message:
  -----------
  locale.c: Collapse near duplicate code; removes recursion

We do the same operation on both an LC_MONETARY string and various LC_TIME
strings.  Previously the latter was changed to not have the potential
for infinite recursion.  The goal is to make both instances the same
here.  The way this commit does that is to have both instances share the
same code path.  Previously the operation in each was simple, but now it
is more complicated, with further revisions to come.

This however entails setting up some data structures to cope with the
difference in locale categories, which this commit does.


  Commit: 30d9e85a161d1fa966be9e598fa5d8d12430c005
      
https://github.com/Perl/perl5/commit/30d9e85a161d1fa966be9e598fa5d8d12430c005
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: Move code to earlier and use its result

Under most Configurations, we can determine a locale's code set (like
"ISO 8859-1") directly.  In a few, we have to use more complicated
means, one of which is to see if the name of the locale conforms to the
XPG standard, which includes the code set as part of the name.

It turns out that in most instances, all we care about is if the code
set is UTF-8 or not, and there are ways to do that even on locale names
that don't meet the XPG standard.  However, those ways aren't foolproof.
Especially with English locales whose currency symbol is ASCII (like a
dollar sign or a string of ASCII characters), the code can't make a
definitive determination, and chooses the incorrect answer.  In those
cases, if the locale name does meet XPG criteria, we could use that
as a tie breaker to get the correct answer.

That's what this commit does.  It moves the extraction of the code set
from the locale name to prior to the UTF8ness determination, and when
there is ambiguity (as in the English locales), it uses the found code
set name to resolve the ambiguity.
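
As a rough illustration of pulling the code set out of an XPG-style name
(language[_territory][.codeset][@modifier]), here is a small sketch.  The
functions xpg_codeset() and codeset_names_utf8() are hypothetical helpers,
not the code in locale.c, and the set of spellings accepted as "UTF-8" is
an assumption.

```c
#include <ctype.h>
#include <string.h>

/* Copy the code set portion of an XPG-style locale name into 'out'.
 * Returns 1 on success, 0 if the name has no code set part or it does
 * not fit in 'out'.  (Hypothetical helper for illustration.) */
static int
xpg_codeset(const char *locale_name, char *out, size_t out_size)
{
    const char *dot = strchr(locale_name, '.');
    const char *start;
    size_t      len;

    if (dot == NULL) {
        return 0;                   /* e.g. "C" or "en_US" */
    }

    start = dot + 1;
    len   = strcspn(start, "@");    /* stop at any @modifier */

    if (len == 0 || len >= out_size) {
        return 0;
    }

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}

/* Return 1 if 'codeset' spells UTF-8, ignoring case and hyphens
 * (so "UTF-8", "utf8" and "Utf-8" all match). */
static int
codeset_names_utf8(const char *codeset)
{
    const char *want = "utf8";

    for (; *codeset != '\0'; codeset++) {
        if (*codeset == '-') {
            continue;
        }
        if (*want == '\0' || tolower((unsigned char) *codeset) != *want) {
            return 0;
        }
        want++;
    }

    return *want == '\0';
}
```

For "en_US.UTF-8" this yields "UTF-8", and for "de_DE.ISO-8859-1@euro" it
yields "ISO-8859-1"; a name such as "C" has no code set part, so the tie
breaker is simply unavailable.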


  Commit: 976959c8df1a685a1a37bbfa818e585f7859f698
      
https://github.com/Perl/perl5/commit/976959c8df1a685a1a37bbfa818e585f7859f698
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: White space, comments only

Recent commits have removed the containing blocks for these two sections
of code, so outdent them.


  Commit: f85b57b3e4f82dfebc574b19e1d743d5847ec3f7
      
https://github.com/Perl/perl5/commit/f85b57b3e4f82dfebc574b19e1d743d5847ec3f7
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: Move a couple lines to earlier in the file

This attaches the comment more closely to what it is commenting.


  Commit: ba2aeae3cad4bf8abb07eba1321662da38c683ed
      
https://github.com/Perl/perl5/commit/ba2aeae3cad4bf8abb07eba1321662da38c683ed
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: Remove some unnecessary code

This code was a relic which had been lately only necessary to prevent
potential infinite recursion.  The recursion has been removed, so this
can too.


  Commit: 7a10096df44ed38f4ce1b16b41069115ad5b84bf
      
https://github.com/Perl/perl5/commit/7a10096df44ed38f4ce1b16b41069115ad5b84bf
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: Use MB_CUR_MAX to resolve ambiguities

This code is only compiled for rare Configurations where two
C99-required functions aren't available.  This likely means that someone
ran Configure using arguments to deny their use, which in turn means they
have been found to be buggy on this platform.

The code was originally developed before we required C99, but now
functions as a workaround for their absence, and the code doesn't give
perfect results.  In particular, it fails for English UTF-8 locales
whose names don't meet the XPG standard and which use the dollar sign as
the currency symbol (or other string of ASCII characters).

But, using another C99 feature, MB_CUR_MAX, fixes that.  Our experience
in the field is that it works well.
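
A hedged sketch of how MB_CUR_MAX can act as a tie breaker.  The function
name mb_cur_max_suggests_utf8() and the threshold of 4 are assumptions
made for this illustration; the actual test in locale.c may differ.  The
idea is only that a single-byte locale reports MB_CUR_MAX as 1, whereas a
UTF-8 locale needs room for at least 4 bytes per character.

```c
#include <locale.h>
#include <stdlib.h>

/* Returns 1 if the current LC_CTYPE locale's MB_CUR_MAX is large enough
 * to be consistent with UTF-8, 0 if it rules UTF-8 out.  A return of 1 is
 * not proof: some other multibyte encodings also report 4 or more.
 * (Hypothetical helper; the threshold is an assumption.) */
static int
mb_cur_max_suggests_utf8(void)
{
    return MB_CUR_MAX >= 4;
}
```

This is most useful in the ambiguous case described above: when every
string examined is plain ASCII, a value of 1 says the locale cannot be
UTF-8, which the string checks alone could not establish.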


  Commit: a15d3a085745f320ab9b75bdf56b647a9f9914f9
      
https://github.com/Perl/perl5/commit/a15d3a085745f320ab9b75bdf56b647a9f9914f9
  Author: Karl Williamson <k...@cpan.org>
  Date:   2023-11-12 (Sun, 12 Nov 2023)

  Changed paths:
    M locale.c

  Log Message:
  -----------
  locale.c: Use fewer CPU cycles

Now that we have the code set name (if any) before having calculated the
UTF-8ness of the locale (from the past few commits), we can skip some
calculations.  It is very unlikely to be coincidental for the name to be
"UTF-8" and a string from that locale to be syntactically legal
UTF-8 (unless the string is all ASCII).  This is because of the highly
restrictive syntax of UTF-8.  Thus, if we find this situation, we can
presume that the name is telling the truth, and we don't have to keep
checking all the possible strings that we previously did.


Compare: https://github.com/Perl/perl5/compare/1d74e8214dd5...a15d3a085745
