commit:     7cd8eb7cc675990c6f435c4aec7870ed28dcb8b8
Author:     Kerin Millar <kfm <AT> plushkava <DOT> net>
AuthorDate: Tue Aug 12 17:21:06 2025 +0000
Commit:     Kerin Millar <kfm <AT> plushkava <DOT> net>
CommitDate: Tue Aug 12 17:30:50 2025 +0000
URL:        https://gitweb.gentoo.org/proj/locale-gen.git/commit/?id=7cd8eb7c

Better align the behaviour of normalize() with localedef(1)

The purpose of the normalize() subroutine is to normalize the codeset
portion of a locale name as the localedef(1) utility would; at least, to
the extent that is necessary for locale-gen(8) to function correctly.
Presently, it does so by splitting the string into two parts (with the
<period> character serving as a separator), stripping all instances of
the <hyphen-minus> from the second part, then converting the second part
to lowercase. This approach dates from April 2006, when the --update
option was introduced. However, it is not strictly correct.

For one thing, only the codeset part is supposed to be altered. Consider
the following locale name.

de_DE.ISO-8859-15@euro

The present routine operates on a substring of "ISO-8859-15@euro".
Instead, it ought to operate only on "ISO-8859-15". While it seems
unlikely ever to occur, imagine a scenario in which GNU introduces a new
modifier that does not consist exclusively of lower case characters.

Another issue is that the routine strips only the <hyphen-minus>
character. Instead, it ought to strip all non-alphanumeric characters,
as does the normalize_codeset() function of the localedef(1) utility.

Revise the normalize() subroutine so as to address both of these issues
and maintain a theoretically high degree of forward-compatibility.

Fixes: 2df969d53ea596038a4857060a42d7f2fd25d7e3
Link: https://sourceware.org/git?p=glibc.git;a=blob;f=locale/programs/localedef.c;hb=glibc-2.42#l561
Signed-off-by: Kerin Millar <kfm <AT> plushkava.net>

 locale-gen | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/locale-gen b/locale-gen
index fda1830..4b5edd6 100755
--- a/locale-gen
+++ b/locale-gen
@@ -249,11 +249,14 @@ sub list_locales ($prefix) {
 }
 
 sub normalize ($canonical) {
-	if (2 == (my ($locale, $charmap) = split /\./, $canonical, 3)) {
-		# en_US.UTF-8 => en_US.utf8; en_US.ISO-8859-1 => en_US.iso88591
-		return join '.', $locale, lc($charmap =~ s/-//gr);
-	} else {
+	# This is similar to the normalize_codeset() function of localedef(1).
+	if ($canonical !~ m/(?<=\.)[^@]+/p) {
 		die "Can't normalize " . render_printable($canonical);
+	} else {
+		# en_US.UTF-8 => en_US.utf8
+		# de_DE.ISO-8859-15@euro => de_DE.iso885915@euro
+		my $codeset = lc ${^MATCH} =~ tr/0-9A-Za-z//cdr;
+		return ${^PREMATCH} . $codeset . ${^POSTMATCH};
 	}
 }
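
For readers who want to try the change outside of locale-gen, the two
variants can be compared with a small standalone script. This is only a
sketch under stated assumptions: the names normalize_old() and
normalize_new() are hypothetical, the die messages are simplified because
render_printable() is defined elsewhere in locale-gen, and the locale name
"xx_XX.ISO_8859-1@Mixed" is an invented input chosen to exercise both
issues described in the commit message.

```perl
#!/usr/bin/perl
use v5.14;      # s///r and tr///r need 5.14; /p and ${^MATCH} need 5.10
use warnings;

# Hypothetical copy of the routine as it was before this commit.
sub normalize_old {
    my ($canonical) = @_;
    if (2 == (my ($locale, $charmap) = split /\./, $canonical, 3)) {
        # Strips <hyphen-minus> only, and lowercases any modifier too.
        return join '.', $locale, lc($charmap =~ s/-//gr);
    }
    die "Can't normalize $canonical\n";
}

# Hypothetical copy of the routine as revised by this commit.
sub normalize_new {
    my ($canonical) = @_;
    # Match the codeset only: everything after a <period>, up to any "@".
    if ($canonical !~ m/(?<=\.)[^@]+/p) {
        die "Can't normalize $canonical\n";
    }
    # Delete every character that is not an ASCII letter or digit,
    # then lowercase what remains.
    my $codeset = lc(${^MATCH} =~ tr/0-9A-Za-z//cdr);
    return ${^PREMATCH} . $codeset . ${^POSTMATCH};
}

# The last name is invented for illustration; it is not a real locale.
for my $name ('en_US.UTF-8', 'de_DE.ISO-8859-15@euro', 'xx_XX.ISO_8859-1@Mixed') {
    printf "%-24s old: %-24s new: %s\n",
        $name, normalize_old($name), normalize_new($name);
}
```

With the invented input, the old routine would be expected to keep the
<low-line> and to lowercase the modifier, whereas the revised routine
leaves the modifier untouched and strips every non-alphanumeric character
from the codeset.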
