Re: MinGW vs. setlocale

Eli Zaretskii Thu, 12 Jun 2014 11:35:31 -0700

> From: [email protected] (Ludovic Courtès)
> Cc: [email protected]
> Date: Thu, 12 Jun 2014 10:39:08 +0200
> 
> > I now know what is the reason for that, and I cannot say that I'm
> > happier: it's libunistring's fault.  All these tests call libunistring
> > functions that require the locale's language as an argument.  Problem
> > is, libunistring doesn't support languages such as "fra" or "trk", it
> > only supports "fr" and "tr".  In general, it only supports 3-letter
> > language codes for those languages for which a 2-letter code doesn't
> > exist.  By contrast, Windows _always_ uses 3-letter codes in valid
> > locale names.
> >
> > So what happens is that locale_language always returns an empty
> > string, and Guile calls u32_casecoll etc. with that empty string,
> > which only works in the "C" locale.  In any other locale, the
> > comparison fails with EILSEQ, and Guile throws the appropriate
> > exception.


It turns out the truth is actually much worse.  The main problem is
not with the language, it's with the locale's codeset.  On Windows,
libunistring always thinks that the codeset is the default console
codepage, disregarding any changes by 'setlocale'.  Therefore, text in
any script that is not supported by the default console codepage will
always cause EILSEQ from libunistring.

And while the locale's language is an argument to libunistring APIs,
and therefore can be "fixed" in Guile, the codeset is extracted and
used internally inside libunistring, and never exposed to any API.
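
To illustrate the caller-side "fix" for the language argument, here is a
minimal sketch in C.  The helper name `iso639_from_windows` and the lookup
table are hypothetical (they are not Guile's or libunistring's code), and
the table lists only the language codes that appear in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table: Windows 3-letter language abbreviations and the
   2-letter codes that libunistring understands.  Deliberately tiny;
   a real table would need many more entries.  */
static const struct { const char *win; const char *iso; } lang_map[] = {
  { "fra", "fr" },
  { "trk", "tr" },
  { "ell", "el" },
};

/* Return the 2-letter code for the language part of LOCALE_NAME
   (for example "trk_TUR.1254"), or "" if the language is unknown.  */
const char *
iso639_from_windows (const char *locale_name)
{
  /* The language part ends at the first '_' or '.', or at the end.  */
  size_t len = strcspn (locale_name, "_.");

  for (size_t i = 0; i < sizeof lang_map / sizeof lang_map[0]; i++)
    if (strlen (lang_map[i].win) == len
        && strncmp (locale_name, lang_map[i].win, len) == 0)
      return lang_map[i].iso;
  return "";
}
```

With such a helper the caller could pass "tr" rather than an empty string
for "trk_TUR.1254".  It does nothing about the codeset problem, which is
internal to libunistring.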

> OK.  (It would be nice if someone would take over maintainership of
> libunistring...)

Indeed.  But given the above situation, I just went ahead, built
libunistring from sources, and patched them.  Doing so made almost all
the problems with i18n.test disappear (and I also discovered 2
problems with my previous patch that you already pushed, see below).

I still have one problem left: the Turkish character-mapping tests
are failing.  I think that's because somehow the LC_ALL environment
variable gets set to "C".  With the current libunistring code, that
setting in the environment overrides what's been set by 'setlocale',
and the Turkish language rules are not used.
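
The mismatch can be sketched in C as two ways of asking "which locale is
in effect?".  This is not libunistring's actual code: both function names
are hypothetical, and the environment lookup assumes the usual POSIX
precedence (LC_ALL, then the category's own variable, then LANG).

```c
#include <locale.h>
#include <stdlib.h>

/* What the environment asks for: LC_ALL wins over the category's own
   variable, which wins over LANG.  Falls back to "C".  */
const char *
locale_name_from_environment (const char *category_var)
{
  const char *vars[] = { "LC_ALL", category_var, "LANG" };

  for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++)
    {
      const char *value = getenv (vars[i]);
      if (value != NULL && value[0] != '\0')
        return value;
    }
  return "C";
}

/* What the program actually installed with setlocale.  Passing NULL
   only queries the current setting, it does not change it.  */
const char *
locale_name_from_setlocale (int category)
{
  const char *name = setlocale (category, NULL);
  return name != NULL ? name : "C";
}
```

When LC_ALL=C is in the environment and the program has switched to a
Turkish locale with 'setlocale', the first function still reports "C"
while the second reports the Turkish locale.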

I will fix this in libunistring, but do you have any idea which code
might be pushing LC_ALL=C into Guile's environment during the test
suite run?

By the way, how do I run a single test from test-suite?

> > (Btw, why does Guile use libunistring instead of the ANSI functions
> > for locale-dependent string comparison and collation?)
> 
> Because strings are internally either Latin-1 or UTF-32 (UCS-4).

Of course, but did you see what libunistring does?  It calls libiconv
to encode the Unicode strings into the locale's codeset (that's why I
got EILSEQ earlier, see above), and then works with the encoded
string.  Since Guile has a libiconv interface as well, it could easily
do the same, no?  Once the string is in the locale's codeset, all the
libc functions will DTRT wrt collating, sorting, etc.
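
A rough sketch in C of that approach: convert to the locale's codeset
with iconv, then let strcoll do the work.  It assumes a POSIX-like
environment where nl_langinfo (CODESET) and iconv are available and
where iconv accepts the names "UCS-4LE" and "UCS-4BE".  The function
names are hypothetical, and strings with embedded NULs are not handled.

```c
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Convert LEN UCS-4 (UTF-32) code points in host byte order to a
   NUL-terminated string in the current locale's codeset.  Returns a
   malloc'ed buffer, or NULL with errno set (EILSEQ when a character
   cannot be represented in that codeset).  */
static char *
ucs4_to_locale (const uint32_t *s, size_t len)
{
  const uint16_t probe = 1;
  const char *from = *(const unsigned char *) &probe ? "UCS-4LE" : "UCS-4BE";
  iconv_t cd = iconv_open (nl_langinfo (CODESET), from);
  if (cd == (iconv_t) -1)
    return NULL;

  size_t outsize = len * MB_LEN_MAX + 1;
  char *buf = malloc (outsize);
  if (buf == NULL)
    {
      iconv_close (cd);
      errno = ENOMEM;
      return NULL;
    }

  char *in = (char *) s;
  size_t inleft = len * sizeof *s;
  char *out = buf;
  size_t outleft = outsize - 1;

  /* Convert, then flush any pending shift state.  */
  if (iconv (cd, &in, &inleft, &out, &outleft) == (size_t) -1
      || iconv (cd, NULL, NULL, &out, &outleft) == (size_t) -1)
    {
      int saved = errno;
      free (buf);
      iconv_close (cd);
      errno = saved;
      return NULL;
    }
  *out = '\0';
  iconv_close (cd);
  return buf;
}

/* Compare two UCS-4 strings using the locale's collation rules.
   Stores the strcoll result in *RESULT and returns 0, or returns -1
   with errno set if either string cannot be converted.  */
int
collate_ucs4 (const uint32_t *a, size_t alen,
              const uint32_t *b, size_t blen, int *result)
{
  char *la = ucs4_to_locale (a, alen);
  if (la == NULL)
    return -1;
  char *lb = ucs4_to_locale (b, blen);
  if (lb == NULL)
    {
      int saved = errno;
      free (la);
      errno = saved;
      return -1;
    }
  *result = strcoll (la, lb);
  free (la);
  free (lb);
  return 0;
}
```

The conversion step is where EILSEQ shows up when the text cannot be
encoded in the codeset, so the result depends entirely on which codeset
is reported for the locale.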

> >> --8<---------------cut here---------------start------------->8---
> >> scheme@(guile-user)> ,m (ice-9 i18n)
> >> scheme@(ice-9 i18n)> (locale-decimal-point (make-locale LC_ALL "fr_FR"))
> >> $2 = ","
> >> scheme@(ice-9 i18n)> (locale-thousands-separator (make-locale LC_ALL "fr_FR"))
> >> $3 = " "
> >> --8<---------------cut here---------------end--------------->8---
> >
> > I did try that, and saw a strange thing: the thousands separator is
> > displayed as "\xa0".  That is very strange, because nl_langinfo does
> > return " " for the French locale, as expected.  Why would the blank be
> > translated into NBSP?  Can this also be due to libunistring problems?
> 
> NBSP is actually a better answer than just space, because it’d be unwise
> to introduce a break in the middle of a number.

But nl_langinfo returns a blank.  So who converts that to NBSP?
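
One way to check what nl_langinfo really hands back is to dump the
separator's bytes, which tells a plain blank (20) apart from a Latin-1
NBSP (a0).  A small sketch in C, assuming a platform that provides
nl_langinfo with the THOUSEP item; the helper name `thousep_hex` is
hypothetical.

```c
#include <langinfo.h>
#include <stdio.h>

/* Write the bytes of the current locale's thousands separator into
   BUF as lowercase hex digits.  Returns the number of characters
   written, not counting the terminating NUL; output is truncated if
   BUF is too small.  */
size_t
thousep_hex (char *buf, size_t size)
{
  const unsigned char *sep = (const unsigned char *) nl_langinfo (THOUSEP);
  size_t used = 0;

  if (size == 0)
    return 0;
  buf[0] = '\0';

  /* Each byte needs two hex digits plus room for the final NUL.  */
  for (; *sep != '\0' && used + 3 <= size; sep++)
    used += (size_t) snprintf (buf + used, size - used, "%02x", *sep);
  return used;
}
```

The result reflects the LC_NUMERIC locale in effect at the time of the
call, so 'setlocale' has to be called first.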

> So does ‘number->locale-string’ return "123\xa0456" for you?

No, I get "123456".  I will revisit this after I finish fixing
libunistring.

> >> >   UNRESOLVED: i18n.test: format ~h: French: 12345.5678
> >> >   UNRESOLVED: i18n.test: format ~h: English: 12345.5678
> >> >
> >> > ~h is not supported on Windows.
> >> 
> >> ~h is implemented using ‘number->locale-string’.
> >
> > Maybe I'm confused, but isn't ~h about position directive in formats?
> 
> Yes, but that’s implemented in Scheme, in ice-9/format.scm.

Thanks for the pointer, I guess I will need to take a closer look at
that.

> > These don't work on Windows.
> 
> What doesn’t work?  ‘format’ doesn’t rely on any non-portable OS
> facility.

I meant positional arguments in printf formats.

Anyway, part of the changes for i18n.test I sent before were in error,
sorry.  Here's a small patch relative to the current git:

diff --git a/test-suite/tests/i18n.test b/test-suite/tests/i18n.test
index c63e3ac..b51ff15 100644
--- a/test-suite/tests/i18n.test
+++ b/test-suite/tests/i18n.test
@@ -99,7 +99,7 @@
 
 (define %turkish-utf8-locale-name
   (if mingw?
-      "tur_TRK.1254"
+      "trk_TUR.1254"
       "tr_TR.UTF-8"))
 
 (define %german-utf8-locale-name
@@ -109,7 +109,7 @@
 
 (define %greek-utf8-locale-name
   (if mingw?
-      "grc_ELL.1253"
+      "ell_GRC.1253"
       "el_GR.UTF-8"))
 
 (define %american-english-locale-name
@@ -164,14 +164,13 @@
   (under-locale-or-unresolved %french-utf8-locale thunk))
 
 (define (under-turkish-utf8-locale-or-unresolved thunk)
-  ;; FreeBSD 8.2 and 9.1, Solaris 2.10, Darwin 8.11.0, and MinGW have
+  ;; FreeBSD 8.2 and 9.1, Solaris 2.10, and Darwin 8.11.0 have
   ;; a broken tr_TR locale where `i' is mapped to uppercase `I'
   ;; instead of `İ', so disable tests on that platform.
   (if (or (string-contains %host-type "freebsd8")
           (string-contains %host-type "freebsd9")
           (string-contains %host-type "solaris2.10")
-          (string-contains %host-type "darwin8")
-          (string-contains %host-type "mingw32"))
+          (string-contains %host-type "darwin8"))
       (throw 'unresolved)
       (under-locale-or-unresolved %turkish-utf8-locale thunk)))
