commit:     7cd8eb7cc675990c6f435c4aec7870ed28dcb8b8
Author:     Kerin Millar <kfm <AT> plushkava <DOT> net>
AuthorDate: Tue Aug 12 17:21:06 2025 +0000
Commit:     Kerin Millar <kfm <AT> plushkava <DOT> net>
CommitDate: Tue Aug 12 17:30:50 2025 +0000
URL:        https://gitweb.gentoo.org/proj/locale-gen.git/commit/?id=7cd8eb7c

Better align the behaviour of normalize() with localedef(1)

The purpose of the normalize() subroutine is to normalize the codeset
portion of a locale name as the localedef(1) utility would; at least, to
the extent that is necessary for locale-gen(8) to function correctly.

Presently, it does so by splitting the string into two parts (with the
<period> character serving as a separator), stripping all instances of
the <hyphen-minus> from the second part, then converting the second part
to lowercase. This approach dates from April 2006, when the --update
option was introduced. However, it is not strictly correct. For one
thing, only the codeset part is supposed to be altered. Consider the
following locale name.

  de_DE.ISO-8859-15@euro

The present routine operates on a substring of "ISO-8859-15@euro".
Instead, it ought to operate only on "ISO-8859-15". While it seems
unlikely ever to occur, imagine a scenario in which GNU introduces a new
modifier that does not consist exclusively of lower case characters. The
present routine would convert such a modifier to lowercase and strip any
<hyphen-minus> from it, thereby corrupting the locale name.
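
As a minimal sketch of that scenario, the following reproduces the logic
of the present routine under the assumed name normalize_old(). The
locale name "xx_XX.UTF-8@Foo-Bar" and its mixed-case modifier are purely
hypothetical; no such locale is known to exist.

```perl
use v5.14;      # needed for the /r modifier of s///
use warnings;

# Assumed name; mirrors the logic of the routine prior to this commit.
sub normalize_old {
    my ($canonical) = @_;
    # In scalar context, the list assignment yields the number of fields.
    if (2 == (my ($locale, $charmap) = split /\./, $canonical, 3)) {
        # Everything after the <period> is altered, modifier included.
        return join '.', $locale, lc($charmap =~ s/-//gr);
    }
    die "Can't normalize $canonical";
}

# An existing modifier survives only because it is already lowercase.
say normalize_old('de_DE.ISO-8859-15@euro'); # de_DE.iso885915@euro

# Hypothetical mixed-case modifier: it is mangled along with the codeset.
say normalize_old('xx_XX.UTF-8@Foo-Bar');    # xx_XX.utf8@foobar
```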

Another issue is that the routine strips only the <hyphen-minus>
character. Instead, it ought to strip all non-alphanumeric characters,
as does the normalize_codeset() function of the localedef(1) utility.

Revise the normalize() subroutine so as to address both of these issues
and maintain a theoretically high degree of forward-compatibility.
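
The revised behaviour can be sketched in isolation as follows. The name
normalize_new() is assumed for illustration, the call to
render_printable() is replaced by plain interpolation, and the inputs
"xx_XX.UTF-8@Foo-Bar" and "xx_XX.ISO_8859-1" are hypothetical.

```perl
use v5.14;      # needed for the /r modifier of tr///
use warnings;

# Assumed name; mirrors the logic of the routine as revised by this commit.
sub normalize_new {
    my ($canonical) = @_;
    # Match only the codeset: whatever follows the first <period>, up to
    # but not including the <commercial-at> that introduces a modifier.
    if ($canonical !~ m/(?<=\.)[^@]+/p) {
        die "Can't normalize $canonical";
    }
    # Delete every character that is not alphanumeric, then lowercase.
    my $codeset = lc ${^MATCH} =~ tr/0-9A-Za-z//cdr;
    return ${^PREMATCH} . $codeset . ${^POSTMATCH};
}

say normalize_new('en_US.UTF-8');            # en_US.utf8
say normalize_new('de_DE.ISO-8859-15@euro'); # de_DE.iso885915@euro

# Hypothetical inputs: the modifier is preserved verbatim, and characters
# other than <hyphen-minus> are also stripped from the codeset.
say normalize_new('xx_XX.UTF-8@Foo-Bar');    # xx_XX.utf8@Foo-Bar
say normalize_new('xx_XX.ISO_8859-1');       # xx_XX.iso88591
```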

Fixes: 2df969d53ea596038a4857060a42d7f2fd25d7e3
Link: https://sourceware.org/git?p=glibc.git;a=blob;f=locale/programs/localedef.c;hb=glibc-2.42#l561
Signed-off-by: Kerin Millar <kfm <AT> plushkava.net>

 locale-gen | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/locale-gen b/locale-gen
index fda1830..4b5edd6 100755
--- a/locale-gen
+++ b/locale-gen
@@ -249,11 +249,14 @@ sub list_locales ($prefix) {
 }
 
 sub normalize ($canonical) {
-       if (2 == (my ($locale, $charmap) = split /\./, $canonical, 3)) {
-               # en_US.UTF-8 => en_US.utf8; en_US.ISO-8859-1 => en_US.iso88591
-               return join '.', $locale, lc($charmap =~ s/-//gr);
-       } else {
+       # This is similar to the normalize_codeset() function of localedef(1).
+       if ($canonical !~ m/(?<=\.)[^@]+/p) {
                die "Can't normalize " . render_printable($canonical);
+       } else {
+               # en_US.UTF-8            => en_US.utf8
+               # de_DE.ISO-8859-15@euro => de_DE.iso885915@euro
+               my $codeset = lc ${^MATCH} =~ tr/0-9A-Za-z//cdr;
+               return ${^PREMATCH} . $codeset . ${^POSTMATCH};
        }
 }
 
