On Tue, Aug 6, 2024 at 11:44 PM Peter J. Holzer <hjp-pg...@hjp.at> wrote:
> I assume that "1254" here is the code page.
> But you specified --encoding=UTF-8 above, so your default locale uses a
> different encoding than the template databases. I would expect that to
> cause problems if the template databases contain any characters where
> the encodings differ (such as "ü" in the locale name).

It's weird, but on Windows, PostgreSQL allows UTF-8 encoding with any locale, and thus apparent contradictions:

    /* See notes in createdb() to understand these tests */
    if (!(locale_enc == user_enc ||
          locale_enc == PG_SQL_ASCII ||
          locale_enc == -1 ||
#ifdef WIN32
          user_enc == PG_UTF8 ||
#endif
          user_enc == PG_SQL_ASCII))
    {
        pg_log_error("encoding mismatch");
...

and createdb's comments say that is acceptable because:

     * 3. selected encoding is UTF8 and platform is win32. This is because
     * UTF8 is a pseudo codepage that is supported in all locales since it's
     * converted to UTF16 before being used.

At the time PostgreSQL was ported to Windows, UTF-8 was not a supported encoding in "char"-based system interfaces like strcoll_l(), and the port had to convert to "wchar_t" interfaces and call (in that example) wcscoll_l(). On modern Windows it is supported, and there are two locale names, with and without the ".UTF-8" suffix (cf. glibc systems that have "en_US" and "en_US.UTF-8", where the suffix-less version uses whatever traditional encoding was used for that language before UTF-8 ate the world).

If we were doing the Windows port today, we'd probably not have that special case for Windows, and we wouldn't have the wchar_t conversions. Then I think we'd allow only:

    --locale=tr-TR (defaults to --encoding=WIN1254)
    --locale=tr-TR --encoding=WIN1254
    --locale=tr-TR.UTF-8
    --locale=tr-TR.UTF-8 --encoding=UTF-8

If we come up with an automated (or even manual but documented) way to perform the "Turkish_Türkiye.1254" -> "tr-TR" upgrade as Dave was suggesting upthread, we'll probably want to be careful to tidy up these contradictory settings.
For example, I guess that American databases initialised by EDB's installer must be using --locale="English_United States.1252" and --encoding=UTF-8, and should be changed to "en-US.UTF-8", while those initialised by letting initdb.exe pick the encoding must be using --locale="English_United States.1252" and --encoding=WIN1252 (implicit) and should be changed to "en-US" to match the WIN1252 encoding.

Only if we did that update would we be able to consider removing the extra UTF-16 conversions that are happening very frequently inside PostgreSQL code, which is a waste of CPU cycles and programmer sanity.

(But that's all just speculation from studying the locale code -- I've never really used Windows.)
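
To make the renaming rule above concrete, here is a rough sketch in C. It is not PostgreSQL code: upgrade_locale_name() and the known_locales table are hypothetical names made up for illustration, and the table only holds the two locales mentioned in this thread. It assumes the encoding is given by name ("UTF8" or "UTF-8" for UTF-8) and that anything after the last '.' in the old locale name is the code page, which gets dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mapping table; a real one would need many more entries. */
struct locale_map
{
    const char *old_name;   /* old-style name without the code page suffix */
    const char *tag;        /* BCP 47 style replacement */
};

static const struct locale_map known_locales[] = {
    {"English_United States", "en-US"},
    {"Turkish_Türkiye", "tr-TR"},
};

/*
 * Hypothetical helper: write the new-style locale name into "out".
 * UTF-8 databases get the ".UTF-8" suffix; all others get the bare tag,
 * which implies the traditional code page for that language.
 * Returns false if the old name is unknown or "out" is too small.
 */
static bool
upgrade_locale_name(const char *old_locale, const char *encoding,
                    char *out, size_t outlen)
{
    const char *dot = strrchr(old_locale, '.');
    size_t      baselen = dot ? (size_t) (dot - old_locale) : strlen(old_locale);
    bool        utf8 = strcmp(encoding, "UTF8") == 0 ||
                       strcmp(encoding, "UTF-8") == 0;

    assert(out != NULL && outlen > 0);

    for (size_t i = 0; i < sizeof(known_locales) / sizeof(known_locales[0]); i++)
    {
        const struct locale_map *m = &known_locales[i];

        if (strlen(m->old_name) == baselen &&
            strncmp(m->old_name, old_locale, baselen) == 0)
        {
            int n = snprintf(out, outlen, "%s%s", m->tag, utf8 ? ".UTF-8" : "");

            return n >= 0 && (size_t) n < outlen;
        }
    }
    return false;
}
```

With that, "English_United States.1252" plus UTF8 would come out as "en-US.UTF-8", and the same locale with WIN1252 would come out as plain "en-US". The hard part in practice would be building the name table, not this suffix logic.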