Re: Question regarding gettext behavior on iconv failure

Bruno Haible via austin-group-l at The Open Group Mon, 03 May 2021 14:38:36 -0700

Hi Eric,

> The example in question set up several .po files and a specific
> environment to test various pluralization/transcoding fallbacks, and
> concludes with a snippet where a string with an encoding error in
> ISO-8859-1 is output in spite of an iconv failure, rather than the
> string passed in to ngettext():
>
>     n_recipients = 1;
>     // The following outputs "1 Empfänger" encoded in UTF-8:
>     printf("%s\n", ngettext("recipient", "recipients", n_recipients));
>
>     bind_textdomain_codeset("mail", "ASCII");
>
>     n_recipients = 1;
>     // The following outputs "recipient" with the same encoding as the
>     // "recipient" argument to ngettext (remember, the system is assumed
>     // to not support conversion from ISO/IEC 8859-1 to ASCII):
>     printf("%s\n", ngettext("recipient", "recipients", n_recipients));
>     // On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e.
>     // no conversion is done). I think we already agreed on considering
>     // this behavior a bug,


I cannot reproduce this. Find attached my (complete) test case.

GNU gettext calls iconv_open() with arguments that indicate that a
conversion that is not 1:1 (e.g. transliteration) is preferable to a
failure.

The result thus depends on the iconv implementation. For GNU gettext
the recommended iconv implementations are:
  - on glibc systems: GNU libc,
  - otherwise: GNU libiconv.
Therefore here are the results on GNU libc (2.32) and on some other OS
(FreeBSD 13) with GNU libiconv:

With a mail.po that contains only umlauts:

Output on glibc systems (e.g. 2.32):
1 Empfänger
1 Empfaenger

Output on non-glibc systems with GNU libiconv:
1 Empfänger
1 Empf"anger

With a mail-utf8.po that also contains Hanzi characters:

Output on glibc systems (e.g. 2.32):
1 Empfänger Chinese (中文,普通话,汉语)      你好
1 Empfaenger Chinese (??,???,??)      ??

Output on non-glibc systems with GNU libiconv:
1 Empfänger Chinese (中文,普通话,汉语)      你好
recipient

As you can see:

  * For the first line of output, since the output encoding is UTF-8,
    iconv() never needed transliteration and never failed.

  * For the second line of output, in the first three cases, iconv()
    did transliteration, and the result was always an ASCII string.
    (The quality of glibc's transliteration of Hanzi characters to
    question marks can be debated, though.)

  * In the last case, iconv() failed, and thus GNU gettext output
    the corresponding argument to ngettext() untranslated.

> This raises a few questions: does the GNU gettext team agree that this
> can be considered a bug

No. Please provide a reproducible test case, that produces wrong results
on an interesting platform. NetBSD 3.0 or IRIX 6.5, for example, don't
count.

Bruno

/* Preparations:
- Install locale named 'de_DE.UTF-8' (using localedef).
- Find attached mail.po
- $ mkdir -p de/LC_MESSAGES
  $ msgfmt -c -o de/LC_MESSAGES/mail.mo mail.po
  or
  $ msgfmt -c -o de/LC_MESSAGES/mail.mo mail-utf8.po
- $ gcc -Wall foo.c
- $ LC_ALL=de_DE.UTF-8 ./a.out
*/

#include <libintl.h>
#include <locale.h>
#include <stdio.h>

int
main ()
{
  if (setlocale (LC_ALL, "") == NULL)
    return 1;
  textdomain ("mail");
  bindtextdomain ("mail", ".");

  unsigned int n_recipients;

  n_recipients = 1;
  // The following outputs "1 Empfänger" encoded in UTF-8:
  printf("%s\n", ngettext("recipient", "recipients", n_recipients));

  bind_textdomain_codeset("mail", "ASCII");

  n_recipients = 1;
  // The following outputs "recipient" with the same encoding as the "recipient"
  // argument to ngettext (remember, the system is assumed to not support
  // conversion from ISO/IEC 8859-1 to ASCII):
  printf("%s\n", ngettext("recipient", "recipients", n_recipients));
  // On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e. no
  // conversion is done). I think we already agreed on considering this
  // behavior a bug,
}
/*
With a mail.po that contains only umlauts:

Output on glibc systems (e.g. 2.32):
1 Empfänger
1 Empfaenger

Output on non-glibc systems with GNU libiconv:
1 Empfänger
1 Empf"anger

With a mail-utf8.po that also contains Hanzi characters:

Output on glibc systems (e.g. 2.32):
1 Empfänger Chinese (中文,普通话,汉语)      你好
1 Empfaenger Chinese (??,???,??)      ??

Output on non-glibc systems with GNU libiconv:
1 Empfänger Chinese (中文,普通话,汉语)      你好
recipient

*/

msgid ""
msgstr ""
"Content-Type: text/plain; charset=ISO_8859-1\n"
"Plural-Forms: nplurals=4; plural= n==1?0: (n>1 && n< 5)?1: (n==0)? 2:3;\n"

msgid "recipient"
msgid_plural "recipients"
msgstr[0] "1 Empfänger"
msgstr[1] "2 bis 4 Empfänger"
msgstr[2] "keine Empfänger"
msgstr[3] "mehr als 4 Empfänger"

msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"Plural-Forms: nplurals=4; plural= n==1?0: (n>1 && n< 5)?1: (n==0)? 2:3;\n"

msgid "recipient"
msgid_plural "recipients"
msgstr[0] "1 Empfänger Chinese (中文,普通话,汉语)      你好"
msgstr[1] "2 bis 4 Empfänger"
msgstr[2] "keine Empfänger"
msgstr[3] "mehr als 4 Empfänger"
