On Tue, Aug 6, 2024 at 11:44 PM Peter J. Holzer <hjp-pg...@hjp.at> wrote:
> I assume that "1254" here is the code page.
> But you specified --encoding=UTF-8 above, so your default locale uses a
> different encoding than the template databases. I would expect that to
> cause problems if the template databases contain any characters where
> the encodings differ (such as "ü" in the locale name).

It's weird, but on Windows, PostgreSQL allows UTF-8 encoding with any
locale, and thus apparent contradictions:

    /* See notes in createdb() to understand these tests */
    if (!(locale_enc == user_enc ||
          locale_enc == PG_SQL_ASCII ||
          locale_enc == -1 ||
#ifdef WIN32
          user_enc == PG_UTF8 ||
#endif
          user_enc == PG_SQL_ASCII))
    {
        pg_log_error("encoding mismatch");

... and createdb's comments say that is acceptable because:

 * 3. selected encoding is UTF8 and platform is win32. This is because
 * UTF8 is a pseudo codepage that is supported in all locales since it's
 * converted to UTF16 before being used.
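
To make the effect of that test easier to see, here is a minimal sketch
of it as a standalone predicate. The enum values and the function name
are hypothetical stand-ins for illustration, not PostgreSQL's actual
definitions, and the WIN32 #ifdef is modelled as a plain boolean
parameter so the logic can be exercised on any platform.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-ins for the encoding IDs; the real constants
 * (PG_UTF8, PG_SQL_ASCII, ...) come from PostgreSQL's headers.
 */
enum sketch_enc
{
    SK_UNKNOWN = -1,    /* assumed meaning of -1: locale's encoding unknown */
    SK_SQL_ASCII,
    SK_UTF8,
    SK_WIN1252,
    SK_WIN1254
};

/* Returns true when the locale/encoding pair would pass the quoted test. */
bool
sketch_encoding_ok(int locale_enc, int user_enc, bool is_win32)
{
    /* This sketch assumes the caller has already chosen an encoding. */
    assert(user_enc != SK_UNKNOWN);

    if (locale_enc == user_enc)
        return true;            /* locale and requested encoding agree */
    if (locale_enc == SK_SQL_ASCII || locale_enc == SK_UNKNOWN)
        return true;            /* locale imposes no usable constraint */
    if (is_win32 && user_enc == SK_UTF8)
        return true;            /* the Windows-only special case */
    return user_enc == SK_SQL_ASCII;
}
```

With is_win32 set, a WIN1254 locale combined with UTF-8 passes; with it
clear, the same pair is rejected as an encoding mismatch.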

At the time PostgreSQL was ported to Windows, UTF-8 was not a
supported encoding in "char"-based system interfaces like strcoll_l(),
and the port had to convert to "wchar_t" interfaces and call (in that
example) wcscoll_l().  On modern Windows UTF-8 is supported there, and there are two
locale names, with and without ".UTF-8" suffix (cf. glibc systems that
have "en_US" and "en_US.UTF-8" where the suffix-less version uses
whatever traditional encoding was used for that language before UTF-8
ate the world).
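
For readers who haven't seen the pattern, the extra hop through wchar_t
looks roughly like the sketch below. This is not the PostgreSQL code: it
uses the standard C mbstowcs() and wcscoll() as portable stand-ins, where
a Windows build would presumably use something like MultiByteToWideChar()
and a locale-aware wide comparison. The function name and the fixed
buffer size are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define SKETCH_MAX 256          /* arbitrary limit chosen for this sketch */

/*
 * Compare two multibyte strings by first widening both to wchar_t.
 * Inputs longer than SKETCH_MAX - 1 wide characters are truncated,
 * which a real implementation would have to handle properly.
 */
int
sketch_wide_coll(const char *a, const char *b)
{
    wchar_t wa[SKETCH_MAX];
    wchar_t wb[SKETCH_MAX];
    size_t  na;
    size_t  nb;

    assert(a != NULL && b != NULL);

    /* The conversion step: this is the per-comparison cost in question. */
    na = mbstowcs(wa, a, SKETCH_MAX - 1);
    nb = mbstowcs(wb, b, SKETCH_MAX - 1);
    if (na == (size_t) -1 || nb == (size_t) -1)
        return strcmp(a, b);    /* invalid input: fall back to byte order */

    /* mbstowcs() does not terminate the output when it fills the limit. */
    wa[na] = L'\0';
    wb[nb] = L'\0';

    return wcscoll(wa, wb);     /* collate using the current locale */
}
```

Every comparison pays for two conversions before any collation work
happens; that overhead goes away if the char-based interface can take
UTF-8 directly.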

If we were doing the Windows port today, we'd probably not have that
special case for Windows, and we wouldn't have the wchar_t
conversions.  Then I think we'd allow only:

--locale=tr-TR (defaults to --encoding=WIN1254)
--locale=tr-TR --encoding=WIN1254
--locale=tr-TR.UTF-8
--locale=tr-TR.UTF-8 --encoding=UTF-8
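
A rough sketch of what such a stricter rule could look like, purely as
an illustration: the function names are hypothetical, and the check only
looks at whether the ".UTF-8" suffix and the requested encoding agree.
It does not verify that a suffix-less locale gets the right legacy code
page (WIN1254 for tr-TR, say), which a real check would also need.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if s ends with suffix. */
static bool
ends_with(const char *s, const char *suffix)
{
    size_t slen = strlen(s);
    size_t suflen = strlen(suffix);

    return slen >= suflen && strcmp(s + slen - suflen, suffix) == 0;
}

/* Accept both spellings of the encoding name that appear in practice. */
static bool
is_utf8_name(const char *encoding)
{
    return strcmp(encoding, "UTF-8") == 0 || strcmp(encoding, "UTF8") == 0;
}

/*
 * Hypothetical stricter rule: a locale with the ".UTF-8" suffix must be
 * paired with UTF-8, and a suffix-less locale must not be.  A NULL
 * encoding means "let it default from the locale", which is always fine.
 */
bool
sketch_locale_encoding_consistent(const char *locale, const char *encoding)
{
    assert(locale != NULL);

    if (encoding == NULL)
        return true;
    return ends_with(locale, ".UTF-8") == is_utf8_name(encoding);
}
```

Under this rule the four combinations listed above pass, while
--locale=tr-TR with --encoding=UTF-8 (what the Windows special case
allows today) would be rejected.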

If we come up with an automated (or even manual but documented) way to
perform the "Turkish_Türkiye.1254" -> "tr-TR" upgrade as Dave was
suggesting upthread, we'll probably want to be careful to tidy up
these contradictory settings.  For example I guess that American
databases initialised by EDB's installer must be using
--locale="English_United States.1252" and --encoding=UTF-8, and should
be changed to "en-US.UTF-8", while those initialised by letting
initdb.exe pick the encoding must be using --locale="English_United
States.1252" and --encoding=WIN1252 (implicit) and should be changed
to "en-US" to match the WIN1252 encoding.
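
As a sketch of the tidy-up rule described above, assuming a lookup
table from old-style names to new-style ones exists (the two entries
here are just the examples from this thread, and the table, struct and
function names are invented for illustration): pick the new base name,
then add the ".UTF-8" suffix only when the database encoding is UTF-8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mapping table; only the examples from this thread. */
struct sketch_map
{
    const char *old_name;
    const char *new_name;
};

static const struct sketch_map sketch_table[] = {
    {"English_United States.1252", "en-US"},
    {"Turkish_Türkiye.1254", "tr-TR"},
};

/*
 * Write the upgraded locale name for (old_name, encoding) into out.
 * Returns false if the old name is not in the table or out is too small.
 */
bool
sketch_upgrade_locale(const char *old_name, const char *encoding,
                      char *out, size_t outlen)
{
    size_t  i;
    int     n;
    bool    utf8;

    assert(old_name != NULL && encoding != NULL && out != NULL);

    utf8 = strcmp(encoding, "UTF-8") == 0 || strcmp(encoding, "UTF8") == 0;

    for (i = 0; i < sizeof(sketch_table) / sizeof(sketch_table[0]); i++)
    {
        if (strcmp(old_name, sketch_table[i].old_name) == 0)
        {
            /* The suffix follows the database encoding, not the old name. */
            n = snprintf(out, outlen, "%s%s",
                         sketch_table[i].new_name, utf8 ? ".UTF-8" : "");
            return n >= 0 && (size_t) n < outlen;
        }
    }
    return false;
}
```

The point of the design is that the code page in the old name (".1252")
is ignored when choosing the suffix; the actual database encoding
decides it.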

Only if we did that update would we be able to consider removing the
extra UTF-16 conversions that are happening very frequently inside
PostgreSQL code, which is a waste of CPU cycles and programmer sanity.
(But that's all just speculation from studying the locale code -- I've
never really used Windows.)

