> From: l...@gnu.org (Ludovic Courtès)
> Cc: guile-devel@gnu.org
> Date: Thu, 12 Jun 2014 10:39:08 +0200
>
> > I now know what is the reason for that, and I cannot say that I'm
> > happier: it's libunistring's fault. All these tests call libunistring
> > functions that require the locale's language as an argument. Problem
> > is, libunistring doesn't support languages such as "fra" or "trk", it
> > only supports "fr" and "tr". In general, it only supports 3-letter
> > language codes for those languages for which a 2-letter code doesn't
> > exist. By contrast, Windows _always_ uses 3-letter codes in valid
> > locale names.
> >
> > So what happens is that locale_language always returns an empty
> > string, and Guile calls u32_casecoll etc. with that empty string,
> > which only works in the "C" locale. In any other locale, the
> > comparison fails with EILSEQ, and Guile throws the appropriate
> > exception.

It turns out the truth is actually much worse. The main problem is not
with the language, it's with the locale's codeset. On Windows,
libunistring always thinks that the codeset is the default console
codepage, disregarding any changes by 'setlocale'. Therefore, text in
any script that is not supported by the default console codepage will
always cause EILSEQ from libunistring. And while the locale's language
is an argument to libunistring APIs, and therefore can be "fixed" in
Guile, the codeset is extracted and used internally inside
libunistring, and never exposed to any API.

> OK. (It would be nice if someone would take over maintainership of
> libunistring...)

Indeed. But given the above situation, I just went ahead, built
libunistring from sources, and patched them. Doing so made almost all
the problems with i18n.test disappear (and I also discovered 2 problems
with my previous patch that you already pushed, see below).

I still have one problem left: the Turkish character-mapping tests are
failing. I think that's because somehow the LC_ALL environment variable
gets set to "C". With the current libunistring code, that setting in
the environment overrides what's been set by 'setlocale', and the
Turkish language rules are not used. I will fix this in libunistring,
but do you have any idea which code might be pushing LC_ALL=C into
Guile's environment during the test suite run?

By the way, how do I run a single test from test-suite?

> > (Btw, why does Guile use libunistring instead of the ANSI functions
> > for locale-dependent string comparison and collation?)
>
> Because strings are internally either Latin-1 or UTF-32 (UCS-4).

Of course, but did you see what libunistring does? It calls libiconv
to encode the Unicode strings into the locale's codeset (that's why I
got EILSEQ earlier, see above), and then works with the encoded string.
Since Guile has a libiconv interface as well, it could easily do the
same, no?
Once the string is in the locale's codeset, all the libc functions will
DTRT wrt collating, sorting, etc.

> >> --8<---------------cut here---------------start------------->8---
> >> scheme@(guile-user)> ,m (ice-9 i18n)
> >> scheme@(ice-9 i18n)> (locale-decimal-point (make-locale LC_ALL "fr_FR"))
> >> $2 = ","
> >> scheme@(ice-9 i18n)> (locale-thousands-separator (make-locale LC_ALL
> >> "fr_FR"))
> >> $3 = " "
> >> --8<---------------cut here---------------end--------------->8---
> >
> > I did try that, and saw a strange thing: the thousands separator is
> > displayed as "\xa0". That is very strange, because nl_langinfo does
> > return " " for the French locale, as expected. Why would the blank be
> > translated into NBSP? Can this also be due to libunistring problems?
>
> NBSP is actually a better answer than just space, because it’d be unwise
> to introduce a break in the middle of a number.

But nl_langinfo returns a blank. So who converts that to NBSP?

> So does ‘number->locale-string’ return "123\xa0456" for you?

No, I get "123456". I will revisit this after I finish fixing
libunistring.

> >> > UNRESOLVED: i18n.test: format ~h: French: 12345.5678
> >> > UNRESOLVED: i18n.test: format ~h: English: 12345.5678
> >> >
> >> > ~h is not supported on Windows.
> >>
> >> ~h is implemented using ‘number->locale-string’.
> >
> > Maybe I'm confused, but isn't ~h about position directive in formats?
>
> Yes, but that’s implemented in Scheme, in ice-9/format.scm.

Thanks for the pointer, I guess I will need to take a better look at
that.

> > These don't work on Windows.
>
> What doesn’t work? ‘format’ doesn’t rely on any non-portable OS
> facility.

I meant positional arguments in printf formats.

Anyway, part of the changes for i18n.test I sent before were in error,
sorry.
Here's a small patch relative to the current git:

diff --git a/test-suite/tests/i18n.test b/test-suite/tests/i18n.test
index c63e3ac..b51ff15 100644
--- a/test-suite/tests/i18n.test
+++ b/test-suite/tests/i18n.test
@@ -99,7 +99,7 @@
 
 (define %turkish-utf8-locale-name
   (if mingw?
-      "tur_TRK.1254"
+      "trk_TUR.1254"
       "tr_TR.UTF-8"))
 
 (define %german-utf8-locale-name
@@ -109,7 +109,7 @@
 
 (define %greek-utf8-locale-name
   (if mingw?
-      "grc_ELL.1253"
+      "ell_GRC.1253"
       "el_GR.UTF-8"))
 
 (define %american-english-locale-name
@@ -164,14 +164,13 @@
   (under-locale-or-unresolved %french-utf8-locale thunk))
 
 (define (under-turkish-utf8-locale-or-unresolved thunk)
-  ;; FreeBSD 8.2 and 9.1, Solaris 2.10, Darwin 8.11.0, and MinGW have
+  ;; FreeBSD 8.2 and 9.1, Solaris 2.10, and Darwin 8.11.0 have
   ;; a broken tr_TR locale where `i' is mapped to uppercase `I'
   ;; instead of `İ', so disable tests on that platform.
   (if (or (string-contains %host-type "freebsd8")
           (string-contains %host-type "freebsd9")
           (string-contains %host-type "solaris2.10")
-          (string-contains %host-type "darwin8")
-          (string-contains %host-type "mingw32"))
+          (string-contains %host-type "darwin8"))
       (throw 'unresolved)
       (under-locale-or-unresolved %turkish-utf8-locale thunk)))
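
To make the collation suggestion above more concrete, here is a rough C
sketch of "encode into the locale's codeset with iconv, then let libc
compare". It is only an illustration under stated assumptions, not code
from Guile or libunistring: the helper names u32_to_locale and
u32_locale_coll are hypothetical, the strings are assumed to be
zero-terminated UTF-32 in host byte order, and the codeset name is taken
from nl_langinfo (CODESET), which exists on POSIX systems but would need
a replacement on native Windows.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Length of a zero-terminated UTF-32 string, in code points.  */
size_t u32_len(const uint32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: encode a zero-terminated UTF-32 string into the
   codeset of the current locale.  Returns a malloc'ed NUL-terminated
   string, or NULL with errno set (typically EILSEQ when a character
   cannot be represented in that codeset).  */
char *u32_to_locale(const uint32_t *s)
{
    /* Host byte order decides which UCS-4 flavour iconv should read.  */
    const uint16_t probe = 1;
    const char *from =
        (*(const unsigned char *)&probe == 1) ? "UCS-4LE" : "UCS-4BE";

    iconv_t cd = iconv_open(nl_langinfo(CODESET), from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t n = u32_len(s);
    size_t inleft = n * sizeof *s;
    /* MB_LEN_MAX bytes per code point, plus room for a final shift
       sequence and the terminating NUL.  */
    size_t outsize = (n + 1) * MB_LEN_MAX + 1;
    char *out = malloc(outsize);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *)s;          /* iconv does not modify the input */
    char *outp = out;
    size_t outleft = outsize - 1;
    int saved = 0;

    /* Convert the text, then flush any pending shift state.  */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1
        || iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)
        saved = errno;
    iconv_close(cd);

    if (saved != 0) {
        free(out);
        errno = saved;
        return NULL;
    }
    *outp = '\0';
    return out;
}

/* Hypothetical helper: compare two UTF-32 strings with the collation
   rules of the current locale.  Stores the strcoll result in *result
   and returns 0, or returns -1 with errno set if encoding failed.  */
int u32_locale_coll(const uint32_t *a, const uint32_t *b, int *result)
{
    assert(a != NULL && b != NULL && result != NULL);

    char *la = u32_to_locale(a);
    if (la == NULL)
        return -1;
    char *lb = u32_to_locale(b);
    if (lb == NULL) {
        int saved = errno;
        free(la);
        errno = saved;
        return -1;
    }

    *result = strcoll(la, lb);
    free(la);
    free(lb);
    return 0;
}
```

A caller would first select the locale with setlocale (LC_ALL, "") or a
specific locale name, and then call u32_locale_coll (a, b, &r). Note
that this sketch inherits the limitation discussed earlier in this
message: text that cannot be encoded in the locale's codeset makes the
conversion fail instead of being compared.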