Hello,
On Windows, apr_os_locale_encoding returns the code page of the default
user locale (to be precise, it uses the current thread locale, but that
starts as the default user locale), and, well, I have a problem with that.
The problem is that this code page is essentially meaningless. See [1] for
a discussion of what the various default locales mean, and note that the
code page used by non-Unicode applications is the one from the default
*system* locale. That's the code page that I think is the right choice for
apr_os_locale_encoding. Why? Well, consider this example (Unicode-capable
reader required).
Let's set our user locale to English (Canada) (code page = 1252), our
system locale to Russian (code page = 1251), and try to use Subversion:
F:\Temp>svnadmin create testrepo
F:\Temp>svn co file:///F:/Temp/testrepo testwc
Checked out revision 0.
F:\Temp>echo. > testwc/test.txt
F:\Temp>svn add testwc\test.txt
A testwc\test.txt
F:\Temp>svn ci testwc -m "В лесу родилась ёлочка."
Adding testwc\test.txt
Transmitting file data .
Committed revision 1.
F:\Temp>svn log testwc\test.txt
------------------------------------------------------------------------
r1 | ?iiai | 2010-04-12 17:58:02 +0400 (Mon, 12 Apr 2010) | 1 line
A eano ?iaeeanu ?ei?ea.
------------------------------------------------------------------------
What happened here? My log message was initially passed to svn in CP1251,
because that's the code page of the system locale. svn, however,
interpreted it as CP1252, the user locale's code page, which led it to
believe that the message was actually "Â ëåñó ðîäèëàñü ¸ëî÷êà.". This is
obviously broken. It then converted the message to CP866, the console
output code page, which is normally the right course of action, but here it
obfuscated the message further by stripping the accents and replacing the
characters it could not map with "?". The username was mangled in the same
way.
Now, I cheated a little, because Subversion doesn't actually use
apr_os_locale_encoding in this instance, but its internal mechanism for
determining the code page is the same, and I believe it showcases the
undesired behaviour well. apr_os_locale_encoding needs to return an
encoding that can be used to interoperate with the OS and other
applications, and that's the system locale's code page.
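To see how the candidates differ on a given machine, something like the following rough sketch could be used. The helper names (make_encoding_name, locale_encoding, print_locale_encodings) are invented for this illustration and are not part of APR. The "CP" + digits naming is an assumption about what apr_os_locale_encoding produces, inferred from the two extra bytes it reserves in front of the code page string. The Windows-only part uses GetLocaleInfoA, GetUserDefaultLCID, GetThreadLocale, GetSystemDefaultLCID and GetACP; the last of these normally reports the system ANSI code page and is printed only for comparison.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: turn a string of decimal digits such as "1251"
 * into an encoding name such as "CP1251".
 * Returns 0 on success, -1 if the input is empty, contains a non-digit,
 * or does not fit into the output buffer. */
int make_encoding_name(const char *digits, char *out, size_t outsz)
{
    size_t i, n;

    assert(digits != NULL && out != NULL);
    n = strlen(digits);
    if (n == 0 || n + 3 > outsz)        /* "CP" + digits + NUL */
        return -1;
    for (i = 0; i < n; i++)
        if (digits[i] < '0' || digits[i] > '9')
            return -1;
    out[0] = 'C';
    out[1] = 'P';
    memcpy(out + 2, digits, n + 1);     /* copies the terminating NUL too */
    return 0;
}

#ifdef _WIN32
/* Hypothetical helper: default ANSI code page of the given locale as "CPnnnn". */
int locale_encoding(LCID lcid, char *out, size_t outsz)
{
    char digits[16];

    if (GetLocaleInfoA(lcid, LOCALE_IDEFAULTANSICODEPAGE,
                       digits, (int)sizeof digits) <= 0)
        return -1;
    return make_encoding_name(digits, out, outsz);
}

/* Print the code page each candidate locale would yield. */
void print_locale_encodings(void)
{
    char name[16];

    if (locale_encoding(GetUserDefaultLCID(), name, sizeof name) == 0)
        printf("user locale:   %s\n", name);
    if (locale_encoding(GetThreadLocale(), name, sizeof name) == 0)
        printf("thread locale: %s\n", name);
    if (locale_encoding(GetSystemDefaultLCID(), name, sizeof name) == 0)
        printf("system locale: %s\n", name);
    printf("GetACP():      CP%u\n", GetACP());
}
#endif
```

With the setup from the example above (user locale English (Canada), system locale Russian), the user and system lines would be expected to disagree, and the system line is the one that matches what non-Unicode applications actually use.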
The proposed fix is trivial:
Index: misc/win32/charset.c
===================================================================
--- misc/win32/charset.c (revision 933252)
+++ misc/win32/charset.c (working copy)
@@ -30,11 +30,7 @@
 #ifdef _UNICODE
     int i;
 #endif
-#if defined(_WIN32_WCE)
-    LCID locale = GetUserDefaultLCID();
-#else
-    LCID locale = GetThreadLocale();
-#endif
+    LCID locale = GetSystemDefaultLCID();
     int len = GetLocaleInfo(locale, LOCALE_IDEFAULTANSICODEPAGE, NULL, 0);
     char *cp = apr_palloc(pool, (len * sizeof(TCHAR)) + 2);
     if (0 < GetLocaleInfo(locale, LOCALE_IDEFAULTANSICODEPAGE, (TCHAR*) (cp + 2), len))
Cheers,
Roman.
[1] http://blogs.msdn.com/michkap/archive/2005/02/01/364707.aspx