Misbehaviour of apr_os_locale_encoding on Windows

Роман Донченко Mon, 12 Apr 2010 08:55:34 -0700

Hello,

On Windows, apr_os_locale_encoding returns the code page of the default user locale (to be precise, it uses the current thread locale, but that starts as the default user locale), and, well, I have a problem with that.

The problem is that that code page is essentially meaningless. See [1] for a discussion of what the various default locales mean, and note that the code page used by non-Unicode applications is the one from the default *system* locale, and that's the code page that I think is the right choice for apr_os_locale_encoding. Why? Well, consider this example (Unicode-capable reader required).

Let's set our user locale to English (Canada) (code page = 1252), our system locale to Russian (code page = 1251), and try to use Subversion:


F:\Temp>svnadmin create testrepo

F:\Temp>svn co file:///F:/Temp/testrepo testwc
Checked out revision 0.

F:\Temp>echo. > testwc/test.txt

F:\Temp>svn add testwc\test.txt
A         testwc\test.txt

F:\Temp>svn ci testwc -m "В лесу родилась ёлочка."
Adding         testwc\test.txt
Transmitting file data .
Committed revision 1.

F:\Temp>svn log testwc\test.txt
------------------------------------------------------------------------
r1 | ?iiai | 2010-04-12 17:58:02 +0400 (Mon, 12 Apr 2010) | 1 line

A eano ?iaeeanu ?ei?ea.
------------------------------------------------------------------------

What happened here? My log message was initially passed to svn in CP1251, because that's the code page of the system locale. svn, however, interpreted it as CP1252, which led it to believe that the message was actually "Â ëåñó ðîäèëàñü ¸ëî÷êà.". This is obviously broken. It then converted the message to CP866, the console output code page, which is normally the right course of action, but here it additionally obfuscated the message by dropping the accents and some characters. The username was mangled in the same way.
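The two conversions described above can be reproduced outside of Subversion. What follows is a minimal Python sketch, not APR or Subversion code. The function names are made up for illustration, and the lossy conversion to CP866 is only approximated (an assumption): accents are stripped via Unicode canonical decomposition and anything the console code page still lacks becomes "?". The real Windows best-fit tables may differ for other input.

```python
import unicodedata


def misread(text, actual="cp1251", assumed="cp1252"):
    """Encode text in the code page the OS really used, then decode the
    bytes with the code page the program wrongly assumed."""
    return text.encode(actual).decode(assumed)


def approx_best_fit(text, target="cp866"):
    """Rough stand-in for a lossy console conversion (an assumption, not
    the actual Windows best-fit tables): drop combining accents, then
    replace whatever the target code page lacks with '?'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.encode(target, errors="replace").decode(target)


message = "В лесу родилась ёлочка."

garbled = misread(message)
print(garbled)                   # Â ëåñó ðîäèëàñü ¸ëî÷êà.
print(approx_best_fit(garbled))  # A eano ?iaeeanu ?ei?ea.

# Decoding with the system locale's code page instead keeps the text intact.
print(misread(message, assumed="cp1251") == message)  # True
```

With the assumed code page set to CP1251, the message survives the round trip, which is the behaviour one would expect from an encoding meant for talking to the OS.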

Now, I cheated a little, because Subversion doesn't actually use apr_os_locale_encoding in this instance, but its internal mechanism for determining the code page is the same, and I believe it showcases the undesired behaviour well. apr_os_locale_encoding needs to return an encoding that can be used to interoperate with the OS and other applications, and that's the system locale's code page.


The proposed fix is trivial:

Index: misc/win32/charset.c
===================================================================
--- misc/win32/charset.c        (revision 933252)
+++ misc/win32/charset.c        (working copy)
@@ -30,11 +30,7 @@
 #ifdef _UNICODE
     int i;
 #endif
-#if defined(_WIN32_WCE)
-    LCID locale = GetUserDefaultLCID();
-#else
-    LCID locale = GetThreadLocale();
-#endif
+    LCID locale = GetSystemDefaultLCID();
     int len = GetLocaleInfo(locale, LOCALE_IDEFAULTANSICODEPAGE, NULL, 0);
     char *cp = apr_palloc(pool, (len * sizeof(TCHAR)) + 2);
     if (0 < GetLocaleInfo(locale, LOCALE_IDEFAULTANSICODEPAGE, (TCHAR*)(cp + 2), len))


Cheers,
Roman.

[1] http://blogs.msdn.com/michkap/archive/2005/02/01/364707.aspx
