Re: Windows default locale vs initdb

2024-08-08 Thread Andrew Dunstan


On 2024-08-08 Th 4:08 AM, Ertan Küçükoglu wrote:


I already installed Visual Studio 2022 with C++ support, as
suggested in
https://www.postgresql.org/docs/current/install-windows-full.html
and cloned the source code onto the system.
However, I cannot find any "src/tools/msvc" directory. It is missing.
The documentation says I need everything in there:
"The tools for building using Visual C++ or Platform SDK are in
the src\tools\msvc directory."
It seems I will need help setting up the build environment.


I am willing to be a tester for Windows, provided I can get help
setting up the build environment.
The documentation also seems to need an update, as I could not find
the necessary files.



If you're trying to build the master branch those documents no longer 
apply. You will need to build using meson, as documented here: 




cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Windows default locale vs initdb

2024-08-08 Thread Ertan Küçükoglu
>
> I already installed Visual Studio 2022 with C++ support, as suggested in
> https://www.postgresql.org/docs/current/install-windows-full.html
> and cloned the source code onto the system.
> However, I cannot find any "src/tools/msvc" directory. It is missing.
> The documentation says I need everything in there:
> "The tools for building using Visual C++ or Platform SDK are in the
> src\tools\msvc directory."
> It seems I will need help setting up the build environment.
>

I am willing to be a tester for Windows, provided I can get help
setting up the build environment.
The documentation also seems to need an update, as I could not find the
necessary files.

Thanks & Regards,
Ertan


Re: Windows default locale vs initdb

2024-08-06 Thread Thomas Munro
On Tue, Jul 23, 2024 at 11:19 AM Thomas Munro  wrote:
> On Tue, Jul 23, 2024 at 1:44 AM Andrew Dunstan  wrote:
> > I have an environment I can use for testing. But what exactly am I
> > testing? :-) Install a few "problem" language/region settings, switch
> > the system and ensure initdb runs ok?

I thought a bit more about what to do with the messy .UTF-8 situation
on Windows, and I think I might see a way forward that harmonises the
code and behaviour with Unix, and deletes a lot of special case code.
But it's only theories + CI so far.

0001, 0002:  As before, teach initdb.exe to choose eg "en-US" by default.

0003:  Force people to choose locales that match the database
encoding, as we do on Unix.  That is, forbid contradictory
combinations like --locale="English_United States.1252"
--encoding=UTF8, which are currently allowed (and the world is full of
such database clusters because that is how the EDB installer GUI makes
them).  The only allowed combinations for American English should now
be: --locale="en-US" --encoding="WIN1252", and --locale="en-US.UTF-8"
--encoding="UTF8".  You can still use the old names if you like, by
explicitly writing --locale="English_United States.1252", but the
encoding then has to be WIN1252.  It's crazy to mix them up, let's ban
that.

Obviously there is a pg_upgrade case to worry about there.  We'd have
to "fix" the now illegal combinations, and I don't know exactly how
yet.
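
For readers who want the 0003 rule spelled out, here is a minimal sketch in C.  It is not code from the patch: implied_encoding() and locale_matches_encoding() are hypothetical names, the ansi_codepage parameter stands in for whatever the OS would report for a name with no suffix (1252 for "en-US"), and mapping a code page number to "WIN" plus that number is a simplification that does not hold for every code page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: derive the encoding name implied by a Windows locale
 * name.  A ".UTF-8" suffix implies UTF8, a numeric suffix such as ".1252"
 * implies that code page, and a name without a suffix falls back to the
 * ANSI code page supplied by the caller (an assumption of this sketch).
 */
static void
implied_encoding(const char *locale, int ansi_codepage,
				 char *out, size_t outlen)
{
	const char *dot = strrchr(locale, '.');

	if (dot != NULL && strcmp(dot + 1, "UTF-8") == 0)
		snprintf(out, outlen, "UTF8");
	else if (dot != NULL && dot[1] != '\0' &&
			 strspn(dot + 1, "0123456789") == strlen(dot + 1))
		snprintf(out, outlen, "WIN%s", dot + 1);
	else
		snprintf(out, outlen, "WIN%d", ansi_codepage);
}

/* Hypothetical check: accept only combinations where the two agree. */
static bool
locale_matches_encoding(const char *locale, int ansi_codepage,
						const char *encoding)
{
	char		implied[32];

	assert(locale != NULL && encoding != NULL);
	implied_encoding(locale, ansi_codepage, implied, sizeof(implied));
	return strcmp(implied, encoding) == 0;
}
```

With these definitions, locale_matches_encoding("en-US.UTF-8", 1252, "UTF8") is accepted, while the combination "English_United States.1252" with "UTF8" is rejected, which is the contradictory case described above.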

0004:  Rip out the code that does extra wchar_t conversions for
collations.  If I've understood correctly, we don't need them: if you
have a .UTF-8 locale then your encoding is UTF-8 and should be able to
use strcoll_l() directly.  Right?

0005:  Something similar was being done for strftime().  And we might
as well use strftime_l() instead while we're here (part of general
movement to use _l functions and stop splattering setlocale() all over
the place, for the multithreaded future).
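
To illustrate the direction of 0004 and 0005, here is a small sketch of calling the _l functions directly with an explicit locale object, written against the POSIX interfaces.  The helper names compare_in_locale() and weekday_name_in_locale() are made up for this example and are not from the patches; on Windows the CRT spells these _strcoll_l() and _strftime_l() and takes a _locale_t from _create_locale(), so treat the POSIX spelling as an assumption about how the idea would look on Unix.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <time.h>

/*
 * Compare two strings under an explicitly passed locale, without touching
 * the process-wide setlocale() state.  Returns <0, 0 or >0 like strcoll().
 */
static int
compare_in_locale(const char *a, const char *b, locale_t loc)
{
	assert(loc != (locale_t) 0);
	return strcoll_l(a, b, loc);
}

/*
 * Format the full weekday name for wday (0 = Sunday) under "loc".
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static size_t
weekday_name_in_locale(char *buf, size_t buflen, int wday, locale_t loc)
{
	struct tm	tm;

	assert(loc != (locale_t) 0);
	memset(&tm, 0, sizeof(tm));
	tm.tm_wday = wday;
	return strftime_l(buf, buflen, "%A", &tm, loc);
}
```

A caller would obtain the locale object once with newlocale(LC_ALL_MASK, name, (locale_t) 0), pass it to both helpers, and release it with freelocale().  No wchar_t conversion appears anywhere, which is the point being made for databases whose locale and encoding already agree.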

These patches pass on CI.  Do they give the expected results when used
on a real Windows system?

There are a few more places where we do wchar_t conversions that could
probably be stripped out too, if my assumptions are correct, and we
could dig further if the basic idea can be validated and people think
this is going in a good direction.
From 886815244ab43092562ae3118cd5588a2fad5bb2 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 20 Nov 2023 14:24:35 +1300
Subject: [PATCH v6 1/5] MinGW has GetLocaleInfoEx().

To use BCP 47 locale names like "en-US" without a suffix ".encoding", we
need to be able to call GetLocaleInfoEx() to look up the encoding.  That
was previously gated for MSVC only, but MinGW has had the function for
many years.  Remove that gating, because otherwise our MinGW build farm
animals would fail when a later commit switches to using the new names by
default.

There are probably other places where _MSC_VER is being used as a proxy
for detecting MinGW with an out-of-date idea about missing functions.

Discussion: https://postgr.es/m/CA%2BhUKGLsV3vTjPp7bOZBr3JTKp3Brkr9V0Qfmc7UvpWcmAQL4A%40mail.gmail.com
---
 src/port/chklocale.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/port/chklocale.c b/src/port/chklocale.c
index 8cb81c8640..a15b0d5349 100644
--- a/src/port/chklocale.c
+++ b/src/port/chklocale.c
@@ -204,7 +204,6 @@ win32_langinfo(const char *ctype)
 	char	   *r = NULL;
 	char	   *codepage;
 
-#if defined(_MSC_VER)
 	uint32		cp;
 	WCHAR		wctype[LOCALE_NAME_MAX_LENGTH];
 
@@ -229,7 +228,6 @@ win32_langinfo(const char *ctype)
 		}
 	}
 	else
-#endif
 	{
 		/*
 		 * Locale format on Win32 is <Language>_<Country>.<CodePage>.  For
-- 
2.39.2

From 357751c04cdd3dc7dea1ee9409356d818af70d5d Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v6 2/5] Default to IETF BCP 47 locale names in initdb on
 Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because (1) they are unstable and explicitly not recommended for
use in databases and (2) they may contain non-ASCII characters, which we
can't put in our shared catalogs.  Since setlocale() returns such names,
on Windows use GetUserDefaultLocaleName() if the user didn't provide an
explicit locale.  It returns BCP 47 strings like "en-US".

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha 
Reviewed-by:
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 13 +++--
 src/bin/initdb/initdb.c   | 31 +--
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 834cb30c85..adb21eb079 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,17 @@ initdb --locale=sv_SE
 system un

Re: Windows default locale vs initdb

2024-07-22 Thread Thomas Munro
On Tue, Jul 23, 2024 at 1:44 AM Andrew Dunstan  wrote:
> I have an environment I can use for testing. But what exactly am I
> testing? :-) Install a few "problem" language/region settings, switch
> the system and ensure initdb runs ok?

I just want to know about any weird unexpected consequences of using
BCP47 locale names, before we change the default in v18.  The only
concrete thing I found so far was that MinGW didn't like it, but I
provided a fix for that.  It'd still be possible to initialise a new
cluster with the old style names if you really want to, but you'd have
to pass it in explicitly; I was wondering if that could be necessary
in some pg_upgrade scenario but I guess not, it just clobbers
template0's pg_database row with values from the source database, and
recreates everything else so I think it should be fine (?).  I am a
little uneasy about the new names not having .encoding but there
doesn't seem to be an issue with that (such locales exist on Unix
too), and the OS still knows which encoding they use in that case.
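
As a rough illustration of how the OS "still knows" the encoding for a name like "en-US", the sketch below asks Windows for the user's default BCP 47 name and then for that locale's ANSI code page.  GetUserDefaultLocaleName() and GetLocaleInfoEx() are the real Windows calls discussed in this thread, but the wrapper default_locale_and_codepage() is a hypothetical helper, and the non-Windows branch exists only so the example compiles elsewhere; it simply reports the current LC_CTYPE name and a code page of 0.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <locale.h>
#endif

/*
 * Hypothetical helper: fetch the default locale name and, on Windows, its
 * default ANSI code page.  Returns 0 on success and -1 on failure.
 */
static int
default_locale_and_codepage(char *name, size_t namelen, unsigned *codepage)
{
	assert(name != NULL && codepage != NULL);
#ifdef _WIN32
	WCHAR		wname[LOCALE_NAME_MAX_LENGTH];
	DWORD		cp = 0;

	/* Returns a BCP 47 name such as "en-US"; 0 means failure. */
	if (GetUserDefaultLocaleName(wname, LOCALE_NAME_MAX_LENGTH) == 0)
		return -1;

	/* Ask for the code page as a number rather than as a string. */
	if (GetLocaleInfoEx(wname,
						LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
						(LPWSTR) &cp, sizeof(cp) / sizeof(WCHAR)) == 0)
		return -1;

	/* BCP 47 names are plain ASCII, so narrowing them is lossless. */
	if (WideCharToMultiByte(CP_ACP, 0, wname, -1,
							name, (int) namelen, NULL, NULL) == 0)
		return -1;
	*codepage = (unsigned) cp;
	return 0;
#else
	/* Stand-in for other platforms: report the current LC_CTYPE name. */
	const char *cur = setlocale(LC_CTYPE, NULL);

	if (cur == NULL || strlen(cur) >= namelen)
		return -1;
	strcpy(name, cur);
	*codepage = 0;				/* no Windows code page to report */
	return 0;
#endif
}
```

On an American English Windows system one would expect this to yield "en-US" and 1252, though that has not been verified here.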




Re: Windows default locale vs initdb

2024-07-22 Thread Ertan Küçükoglu
On Mon, 22 Jul 2024 at 16:44, Andrew Dunstan wrote:

> I have an environment I can use for testing. But what exactly am I
> testing? :-) Install a few "problem" language/region settings, switch
> the system and ensure initdb runs ok?
>
> Other than Turkish, which locales should I install?
>

Thomas earlier listed a few:
"From a quick search of other recent cases: Czech Republic -> Czechia,
Swaziland -> Eswatini, Cape Verde -> Cabo Verde, and more, plus others
that we have older records of in the mailing list that seemed to
change in some minor technical way: Macau, Hong Kong, Norwegian etc."

I am not sure whether all of them need testing, though.

Thanks & Regards,
Ertan


Re: Windows default locale vs initdb

2024-07-22 Thread Andrew Dunstan



On 2024-07-21 Su 10:51 PM, Thomas Munro wrote:

Ertan Küçükoglu offered to try to review and test this, so here's a rebase.

Some notes:

* it turned out that the Turkish i/I test problem I mentioned earlier
in this thread[1] was just always broken on Windows, we just didn't
ever test with UTF-8 before Meson took over; it's skipped now, see
commit cff4e5a3[2]

* it seems that you can't actually put encodings like .1252 on the end
(.UTF-8 must be a special case); I don't know if we should look into a
better UTF-8 mode for modern Windows, but that'd be a separate project

* this patch only benefits people who run initdb.exe without
explicitly specifying a locale; probably a good number of real systems
in the wild actually use EDB's graphical installer which initialises a
cluster and has its own way of choosing the locale, as discussed in
Ertan's thread[3]

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGJZskvCh%3DQm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA%40mail.gmail.com#3a00c08214a4285d2f3c4297b0ac2be2
[2] https://github.com/postgres/postgres/commit/cff4e5a3
[3] 
https://www.postgresql.org/message-id/flat/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ%40mail.gmail.com



I have an environment I can use for testing. But what exactly am I 
testing? :-) Install a few "problem" language/region settings, switch 
the system and ensure initdb runs ok?


Other than Turkish, which locales should I install?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Windows default locale vs initdb

2024-07-22 Thread Ertan Küçükoglu
On Mon, 22 Jul 2024 at 14:00, Thomas Munro wrote:

> Sorry, I didn't mean to put you on the spot :-)  Yeah you'd need to
> install a compiler, various libraries and tools to be able to build
> from source with a patch.  Unfortunately I'm not the best person to
> explain how to do that on Windows as I don't use it.  Honestly it
> might be a bit too much new stuff to figure out at once just to test
> this small patch.  What I'd be hoping for is confirmation that there
> are no weird unintended consequences or problems I'm not seeing since
> I'm writing blind patches based on documentation only, but it's
> probably too much to ask to figure out the whole development
> environment and then go on an open ended expedition looking for
> unknown problems.
>

I already installed Visual Studio 2022 with C++ support, as suggested in
https://www.postgresql.org/docs/current/install-windows-full.html
and cloned the source code onto the system.
However, I cannot find any "src/tools/msvc" directory. It is missing.
The documentation says I need everything in there:
"The tools for building using Visual C++ or Platform SDK are in the
src\tools\msvc directory."
It seems I will need help setting up the build environment.


Re: Windows default locale vs initdb

2024-07-22 Thread Thomas Munro
On Mon, Jul 22, 2024 at 8:04 PM Ertan Küçükoglu
 wrote:
> I am a complete noob about PostgreSQL development.
> I don't know about the PostgreSQL CI system.
> I will be needing some help as to how to do the tests.
> I have access to different Windows OSes (v10, Server 2022 mainly).
> These systems can be set to English or Turkish locales if needed.
> I can also add new Windows versions if needed.
> I do not know how to use patch files. I am also not sure what tests I should 
> do.
> Do I need to set up a Windows build system for PostgreSQL CI?
> Will I download some files (EXE, etc) ready for testing? Copy them over an 
> existing installation for testing?

Sorry, I didn't mean to put you on the spot :-)  Yeah you'd need to
install a compiler, various libraries and tools to be able to build
from source with a patch.  Unfortunately I'm not the best person to
explain how to do that on Windows as I don't use it.  Honestly it
might be a bit too much new stuff to figure out at once just to test
this small patch.  What I'd be hoping for is confirmation that there
are no weird unintended consequences or problems I'm not seeing since
I'm writing blind patches based on documentation only, but it's
probably too much to ask to figure out the whole development
environment and then go on an open ended expedition looking for
unknown problems.




Re: Windows default locale vs initdb

2024-07-22 Thread Thomas Munro
On Mon, Jul 22, 2024 at 8:38 PM Zaid Shabbir  wrote:
> Can you please list some of the use cases for the patch? Other than
> Turkish, does this patch have an impact on other locales too?

Hi Zaid,

Yes, initdb.exe would use BCP47 codes by default for all languages.
Who knows which country will change its name next?

From a quick search of other recent cases: Czech Republic -> Czechia,
Swaziland -> Eswatini, Cape Verde -> Cabo Verde, and more, plus others
that we have older records of in the mailing list that seemed to
change in some minor technical way: Macau, Hong Kong, Norwegian etc.
The Windows manual says:

"We do not recommend this form for locale strings embedded in
code or serialized to storage, because these strings are more likely
to be changed by an operating system update than the locale name
form."

It's pretty bad for our users when it happens and the Windows locale
name changes: a database cluster that suddenly can't start, and even
after you've figured out why and adjusted the references in
postgresql.conf, you still can't connect.  There is also the problem
that some of the old full names have non-ASCII characters (Türkiye,
São Tomé and Príncipe, Curaçao, Côte d'Ivoire, Åland) which is bad at
least in theory because we use the string at times and in places where
it is not clear what encoding the name itself has.

I don't use Windows myself, I've just been watching this train wreck
replaying in a loop for long enough.  Clearly it's going to take some
time to wean the user community off the unstable names, and it struck
me that the default is probably the main source of them in new
clusters, hence this patch.




Re: Windows default locale vs initdb

2024-07-22 Thread Zaid Shabbir
Hello Thomas,

Can you please list some of the use cases for the patch? Other than
Turkish, does this patch have an impact on other locales too?


Regards,
Zaid


On Mon, Jul 22, 2024 at 7:52 AM Thomas Munro  wrote:

> Ertan Küçükoglu offered to try to review and test this, so here's a rebase.
>
> Some notes:
>
> * it turned out that the Turkish i/I test problem I mentioned earlier
> in this thread[1] was just always broken on Windows, we just didn't
> ever test with UTF-8 before Meson took over; it's skipped now, see
> commit cff4e5a3[2]
>
> * it seems that you can't actually put encodings like .1252 on the end
> (.UTF-8 must be a special case); I don't know if we should look into a
> better UTF-8 mode for modern Windows, but that'd be a separate project
>
> * this patch only benefits people who run initdb.exe without
> explicitly specifying a locale; probably a good number of real systems
> in the wild actually use EDB's graphical installer which initialises a
> cluster and has its own way of choosing the locale, as discussed in
> Ertan's thread[3]
>
> [1]
> https://www.postgresql.org/message-id/flat/CA%2BhUKGJZskvCh%3DQm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA%40mail.gmail.com#3a00c08214a4285d2f3c4297b0ac2be2
> [2] https://github.com/postgres/postgres/commit/cff4e5a3
> [3]
> https://www.postgresql.org/message-id/flat/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ%40mail.gmail.com
>


Re: Windows default locale vs initdb

2024-07-22 Thread Ertan Küçükoglu
Hi,

I am a complete noob about PostgreSQL development.
I don't know about the PostgreSQL CI system.
I will be needing some help as to how to do the tests.
I have access to different Windows OSes (v10, Server 2022 mainly).
These systems can be set to English or Turkish locales if needed.
I can also add new Windows versions if needed.
I do not know how to use patch files. I am also not sure what tests I
should do.
Do I need to set up a Windows build system for PostgreSQL CI?
Will I download some files (EXE, etc) ready for testing? Copy them over an
existing installation for testing?

Thanks for your help.

Regards,
Ertan

On Mon, 22 Jul 2024 at 05:52, Thomas Munro wrote:

> Ertan Küçükoglu offered to try to review and test this, so here's a rebase.
>
> Some notes:
>
> * it turned out that the Turkish i/I test problem I mentioned earlier
> in this thread[1] was just always broken on Windows, we just didn't
> ever test with UTF-8 before Meson took over; it's skipped now, see
> commit cff4e5a3[2]
>
> * it seems that you can't actually put encodings like .1252 on the end
> (.UTF-8 must be a special case); I don't know if we should look into a
> better UTF-8 mode for modern Windows, but that'd be a separate project
>
> * this patch only benefits people who run initdb.exe without
> explicitly specifying a locale; probably a good number of real systems
> in the wild actually use EDB's graphical installer which initialises a
> cluster and has its own way of choosing the locale, as discussed in
> Ertan's thread[3]
>
> [1]
> https://www.postgresql.org/message-id/flat/CA%2BhUKGJZskvCh%3DQm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA%40mail.gmail.com#3a00c08214a4285d2f3c4297b0ac2be2
> [2] https://github.com/postgres/postgres/commit/cff4e5a3
> [3]
> https://www.postgresql.org/message-id/flat/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ%40mail.gmail.com
>


Re: Windows default locale vs initdb

2024-07-21 Thread Thomas Munro
Ertan Küçükoglu offered to try to review and test this, so here's a rebase.

Some notes:

* it turned out that the Turkish i/I test problem I mentioned earlier
in this thread[1] was just always broken on Windows, we just didn't
ever test with UTF-8 before Meson took over; it's skipped now, see
commit cff4e5a3[2]

* it seems that you can't actually put encodings like .1252 on the end
(.UTF-8 must be a special case); I don't know if we should look into a
better UTF-8 mode for modern Windows, but that'd be a separate project

* this patch only benefits people who run initdb.exe without
explicitly specifying a locale; probably a good number of real systems
in the wild actually use EDB's graphical installer which initialises a
cluster and has its own way of choosing the locale, as discussed in
Ertan's thread[3]

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGJZskvCh%3DQm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA%40mail.gmail.com#3a00c08214a4285d2f3c4297b0ac2be2
[2] https://github.com/postgres/postgres/commit/cff4e5a3
[3] 
https://www.postgresql.org/message-id/flat/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ%40mail.gmail.com
From fb33b7eb5482bae31b70bb54dbe77325b543a89c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 20 Nov 2023 14:24:35 +1300
Subject: [PATCH v5 1/2] MinGW has GetLocaleInfoEx().

To use BCP 47 locale names like "en-US" without a suffix ".encoding", we
need to be able to call GetLocaleInfoEx() to look up the encoding.  That
was previously gated for MSVC only, but MinGW has had the function for
many years.  Remove that gating, because otherwise our MinGW build farm
animals would fail when a later commit switches to using the new names by
default.

There are probably other places where _MSC_VER is being used as a proxy
for detecting MinGW with an out-of-date idea about missing functions.

Discussion: https://postgr.es/m/CA%2BhUKGLsV3vTjPp7bOZBr3JTKp3Brkr9V0Qfmc7UvpWcmAQL4A%40mail.gmail.com
---
 src/port/chklocale.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/port/chklocale.c b/src/port/chklocale.c
index 8cb81c8640e..a15b0d5349b 100644
--- a/src/port/chklocale.c
+++ b/src/port/chklocale.c
@@ -204,7 +204,6 @@ win32_langinfo(const char *ctype)
 	char	   *r = NULL;
 	char	   *codepage;
 
-#if defined(_MSC_VER)
 	uint32		cp;
 	WCHAR		wctype[LOCALE_NAME_MAX_LENGTH];
 
@@ -229,7 +228,6 @@ win32_langinfo(const char *ctype)
 		}
 	}
 	else
-#endif
 	{
 		/*
 		 * Locale format on Win32 is <Language>_<Country>.<CodePage>.  For
-- 
2.45.2

From dc726a61aace86bda62687e3aa1411753ba3f1a4 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v5 2/2] Default to IETF BCP 47 locale names in initdb on
 Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because (1) they are unstable and explicitly not recommended for
use in databases and (2) they may contain non-ASCII characters, which we
can't put in our shared catalogs.  Since setlocale() returns such names,
on Windows use GetUserDefaultLocaleName() if the user didn't provide an
explicit locale.  It returns BCP 47 strings like "en-US".

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha 
Reviewed-by:
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 13 +++--
 src/bin/initdb/initdb.c   | 31 +--
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 834cb30c85a..adb21eb0799 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,17 @@ initdb --locale=sv_SE
 system under what names depends on what was provided by the operating
 system vendor and what was installed.  On most Unix systems, the command
 locale -a will provide a list of available locales.
-Windows uses more verbose locale names, such as German_Germany
-or Swedish_Sweden.1252, but the principles are the same.
+   
+
+   
+Windows uses BCP 47 language tags, like ICU.
+For example, sv-SE represents Swedish as spoken in Sweden.
+Windows also supports more verbose locale names based on full names
+such as German_Germany or Swedish_Sweden.1252,
+but these are not recommended because they are not stable across operating
+system updates due to changes in geographical names, and may contain
+non-ASCII characters which are not supported in PostgreSQL's shared
+catalogs.

 

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f00718a0150..393232b6cec 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -64,6 +64,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include 
+#endif
+
 #include "access/xlog_internal.h"
 #incl

Re: Windows default locale vs initdb

2023-12-13 Thread Thomas Munro
Here is a thought that occurs to me, as I follow along with Jeff
Davis's evolving proposals for built-in collations and ctypes:  What
would stop us from dropping support for the libc (sic) provider on
Windows?  That may sound radical and likely to cause extra work for
people on upgrade, but how does that compare to the pain of keeping
this barely maintained code in the tree?  Suppose the idea in this
thread goes ahead and we get people to transition to the modern locale
names: there is non-zero transitional/upgrade pain there too.  How
delicious it would be to just nuke the whole thing from orbit, and
keep only cross-platform code that is maintained with enthusiasm by
active hackers.

That's probably a little extreme, but it's the direction my thoughts
start to go in when confronting the realisation that it's up to us
[Unix hackers making drive-by changes], no one is coming to help us
[from the Windows user community].

I've even heard others talk about dropping Windows completely, due to
the maintenance imbalance.  This would be somewhat more fine grained.
(One could use a similar argument to drop non-NTFS filesystems and
turn on POSIX-mode file links, to end that other locus of struggle.)




Re: Windows default locale vs initdb

2023-11-19 Thread Thomas Munro
I clicked "Trigger" to get a Mingw test run of this, and it failed[1].
I see why: our function win32_langinfo() believes that it shouldn't
call GetLocaleInfoEx() on non-MSVC compilers, so we see 'initdb:
error: could not find suitable encoding for locale "en-US"'.  I think
it has fallback code that parses the ".1252" or whatever on the end of
the name, but "en-US" hasn't got one.  I don't know the first thing
about Mingw but it looks like a declaration for that function arrived
6 years ago[2], and deleting the "#if defined(_MSC_VER)" fixes the
problem and the tests pass[3].  As far as I know, we don't support any
Mingw but the very latest: it's not a target with real users who have
version requirements, it's just a developer [in]convenience, so if it
passes on CI and whatever MSYS version "fairywren" runs in the build
farm right now, that should be enough.

I could just do that in this patch, but I suppose that also means that
someone needs to go through pg_locale.c and other places that test
_MSC_VER not because they actually care about the compiler but because
they want to detect some crusty old Mingw version, and see what else
can be deleted as a result, possibly including a lot of fallback code.
It feels like a separate cleanup for a separate patch.

[1] https://cirrus-ci.com/task/5301814774464512
[2] 
https://github.com/mirror/mingw-w64/blame/eff726c461e09f35eeaed125a3570fa5f807f02b/mingw-w64-tools/widl/include/winnls.h#L931
[3] https://cirrus-ci.com/task/6558569718349824




Re: Windows default locale vs initdb

2023-11-19 Thread Thomas Munro
Another country has changed its name, and a Windows OS update has
again broken every PostgreSQL cluster in that whole country[1] (or at
least those that had accepted initdb's default choice of locale,
probably most).  Let's get to the bottom of this, because otherwise it
is simply going to keep happening, causing administrative pain for a
lot of people.

Here is a rebase of the basic patch I proposed last time, and a
re-statement of what we know:

1.  initdb chooses a default locale using a technique that gives you
an unstable ("Czech Republic"->"Czechia", "Turkey"->"Türkiye"),
non-ASCII ("Norwegian (Bokmål)") string that we are warned we should
not store anywhere.  We store it, and then later it is not recognised.
Instead we should select an IETF BCP 47 locale name, based on stable
ISO country and language codes, like "en-US", "tr-TR" etc.  Here is
the patch to teach initdb to use that, unchanged from v3 except that I
tweaked the docs a bit.

2.  In Windows 10+ it is now also possible to put ".UTF-8" on the end
of locale names.  I couldn't figure out whether we should do that, and
what effect it has on ctypes -- apparently not the effect I expected
(see upthread).  Was our UTF-8 support on Windows already broken, and
this new ".UTF-8" thing is just a new way to reach that brokenness?
Is it OK to continue to choose the "legacy" single byte encodings by
default on that OS, and consider that a separate topic for separate
research?

3.  It is not clear to me how we should deal with pg_upgrade.
Eventually we want all of the old-school names to fade away, and
pg_upgrade would need to be part of that.  Perhaps there is some API
that can be used to translate to the new canonical forms without us
having to maintain translation tables and other messiness in our tree.

4.  Eventually we should probably ban non-ASCII characters from
entering the relevant catalogues (they are shared, so their encoding
is undefined except that they must be a superset of ASCII), and delete
all the old win32setlocale.c kludges, after we reach a point where
everyone should be using exclusively BCP 47.

[1] 
https://www.postgresql.org/message-id/flat/18196-b10f93dfbde3d7db%40postgresql.org
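
Points 1 and 4 above can be made concrete with a small checker.  This is only a sketch under stated assumptions: locale_name_is_ascii() and looks_like_bcp47() are hypothetical names, and the shape test is deliberately rough (a 2 or 3 letter language subtag followed by alphanumeric subtags of at most 8 characters, separated by hyphens) rather than a full BCP 47 parser.  It also does not accept a trailing ".UTF-8".

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* True if every byte of the name is 7-bit ASCII. */
static bool
locale_name_is_ascii(const char *name)
{
	assert(name != NULL);
	for (const unsigned char *p = (const unsigned char *) name; *p; p++)
	{
		if (*p > 127)
			return false;
	}
	return true;
}

/*
 * Rough shape test for names like "en-US" or "tr-TR".  Traditional Windows
 * names such as "English_United States.1252" fail because of the
 * underscore, space and dot; bare words such as "Czechia" fail because the
 * first subtag is longer than three letters.
 */
static bool
looks_like_bcp47(const char *name)
{
	size_t		sublen = 0;
	bool		first = true;

	assert(name != NULL);
	for (const char *p = name;; p++)
	{
		unsigned char c = (unsigned char) *p;

		if (c == '-' || c == '\0')
		{
			/* End of a subtag: check its length. */
			if (sublen == 0 || sublen > 8)
				return false;
			if (first && (sublen < 2 || sublen > 3))
				return false;
			first = false;
			sublen = 0;
			if (c == '\0')
				return true;
		}
		else if (c < 128 && isalnum(c))
		{
			/* The language subtag must consist of letters only. */
			if (first && !isalpha(c))
				return false;
			sublen++;
		}
		else
			return false;
	}
}
```

A check of this kind could, in principle, be applied before a name is stored in a shared catalog; whether and where to do that is exactly the open question in point 4.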
From d015005cca08bc1c7ae487392ed7b5a4cfa58748 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v4] Default to IETF BCP 47 locale names in initdb on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because (1) they are unstable and not recommended for use in
databases and (2) they may contain non-ASCII characters, which we can't
put in our shared catalogs.  Since setlocale() returns such names, on
Windows use GetUserDefaultLocaleName() if the user didn't provide an
explicit locale.  It returns BCP 47 strings like "en-US".

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha 
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 13 +++--
 src/bin/initdb/initdb.c   | 31 +--
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 74783d148f..9a2cd5c2d5 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,17 @@ initdb --locale=sv_SE
 system under what names depends on what was provided by the operating
 system vendor and what was installed.  On most Unix systems, the command
 locale -a will provide a list of available locales.
-Windows uses more verbose locale names, such as German_Germany
-or Swedish_Sweden.1252, but the principles are the same.
+   
+
+   
+Windows uses BCP 47 language tags, like ICU.
+For example, sv-SE represents Swedish as spoken in Sweden.
+Windows also supports more verbose locale names based on full names
+such as German_Germany or Swedish_Sweden.1252,
+but these are not recommended because they are not stable across operating
+system updates due to changes in geographical names, and may contain
+non-ASCII characters which are not supported in PostgreSQL's shared
+catalogs.

 

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..021e847240 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -64,6 +64,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include 
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2132,6 +2136,7 @@ locale_date_order(const char *locale)
 static void
 check_locale_name(int category, const char *locale, char **canonname)
 {
+	char	   *locale_copy;
 	char	   *save;
 	char	   *res;
 
@@ -2147,10 +2152,30 @@ check

Re: Windows default locale vs initdb

2022-12-22 Thread Thomas Munro
On Fri, Jul 29, 2022 at 3:33 PM Thomas Munro  wrote:
> On Fri, Jul 22, 2022 at 11:59 PM Juan José Santamaría Flecha
>  wrote:
> > TL;DR; What I want to show through this example is that Windows ACP is not 
> > modified by setlocale(), it can only be done through the Windows registry 
> > and only in recent releases.
>
> Thanks, that was helpful, and so was that SO link.
>
> So it sounds like I should forget about the v3-0002 patch, but the
> v3-0001 and v3-0003 patches might have a future.  And it sounds like
> we might need to investigate maybe defending ourselves against the ACP
> being different than what we expect (ie not matching the database
> encoding)?  Did I understand correctly that you're looking into that?

I'm going to withdraw this entry.  The sooner we get something like
0001 into a release, the sooner the world will be rid of PostgreSQL
clusters initialised with the bad old locale names that the manual
very clearly tells you not to use for databases.  However, I don't
understand this ACP/registry vs database encoding stuff and how it
relates to the use of BCP47 locale names, which puts me off changing
anything until we do.




Re: Windows default locale vs initdb

2022-07-28 Thread Thomas Munro
On Fri, Jul 22, 2022 at 11:59 PM Juan José Santamaría Flecha
 wrote:
> TL;DR; What I want to show through this example is that Windows ACP is not 
> modified by setlocale(), it can only be done through the Windows registry and 
> only in recent releases.

Thanks, that was helpful, and so was that SO link.

So it sounds like I should forget about the v3-0002 patch, but the
v3-0001 and v3-0003 patches might have a future.  And it sounds like
we might need to investigate maybe defending ourselves against the ACP
being different than what we expect (ie not matching the database
encoding)?  Did I understand correctly that you're looking into that?




Re: Windows default locale vs initdb

2022-07-22 Thread Juan José Santamaría Flecha
On Wed, Jul 20, 2022 at 1:44 PM Thomas Munro  wrote:

> On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
>  wrote:
> > Still, WIN1252 is not the wrong answer for what we are asking. Even if
> > you enable UTF-8 support [1], the system will use the current default
> > Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
>
> I'm still confused about what that means.  Suppose we decided to
> insist by adding a ".UTF-8" suffix to the name, as that page says we
> can now that we're on Windows 10+, when building the default locale
> name (see experimental 0002 patch, attached).  It initially seemed to
> have the right effect:
>
> The database cluster will be initialized with locale "en-US.UTF-8".
> The default database encoding has accordingly been set to "UTF8".
> The default text search configuration will be set to "english".
>

Let me try to explain this using the "Beta: Use Unicode UTF-8 for
worldwide language support" option [1].

- Currently in a system with the language settings of "English_United
States" and that option disabled, when executing initdb you get:

The database cluster will be initialized with locale "English_United
States.1252".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

And as a test for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
ERROR:  character with byte sequence 0xc5 0x9f in encoding "UTF8" has no
equivalent in encoding "WIN1252"

We get this error even if the database encoding is UTF8. It is caused by
the tr_tr locales being encoded in WIN1254. We can discuss this in another
thread, and I can propose a patch.
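
To make the byte sequence in that error concrete, here is a minimal sketch (not PostgreSQL code; the function name is made up for illustration) that decodes a two-byte UTF-8 sequence. The bytes 0xc5 0x9f decode to U+015F, the "ş" in "şubat", which exists in WIN1254 but has no slot in WIN1252.

```c
#include <stdint.h>

/*
 * Decode a two-byte UTF-8 sequence into a Unicode code point.
 * Returns 0 if the bytes do not form a valid two-byte sequence.
 * (Hypothetical helper, for illustration only.)
 */
uint32_t
decode_utf8_2byte(unsigned char b1, unsigned char b2)
{
	/* lead byte must be 110xxxxx, and at least 0xC2 to avoid overlong forms */
	if ((b1 & 0xE0) != 0xC0 || b1 < 0xC2)
		return 0;
	/* continuation byte must be 10xxxxxx */
	if ((b2 & 0xC0) != 0x80)
		return 0;
	/* five payload bits from the lead byte, six from the continuation */
	return ((uint32_t) (b1 & 0x1F) << 6) | (uint32_t) (b2 & 0x3F);
}
```

Calling decode_utf8_2byte(0xC5, 0x9F) yields 0x015F (LATIN SMALL LETTER S WITH CEDILLA).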

- If we enable the UTF-8 support option, then the same test goes as:

The database cluster will be initialized with locale "English_United
States.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

And for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
 to_char
---------
 şubat
(1 row)

In this case the Windows locales are actually UTF8 encoded.

TL;DR: What I want to show with this example is that the Windows ACP is not
modified by setlocale(); it can only be changed through the Windows registry,
and only in recent releases.
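
A rough way to check this on a given machine is sketched below. It is only an illustration: the function name is invented, and the non-Windows branch exists solely so the file compiles elsewhere. GetACP() and setlocale() are the real calls; everything else is an assumption.

```c
#include <locale.h>
#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Returns 1 if the ANSI code page differs before and after setlocale(),
 * 0 if it is unchanged, and -1 where the check does not apply
 * (non-Windows builds, or setlocale() rejecting the name).
 * (Hypothetical helper, for illustration only.)
 */
int
acp_changed_by_setlocale(const char *locale_name)
{
#ifdef _WIN32
	UINT		before = GetACP();
	UINT		after;

	if (setlocale(LC_ALL, locale_name) == NULL)
		return -1;
	after = GetACP();
	return before != after;
#else
	(void) locale_name;			/* nothing to check off Windows */
	return -1;
#endif
}
```

If the statement above holds, calling this with a name such as "tr-TR" should return 0 on Windows, whatever the "Beta: Use Unicode UTF-8" option is set to.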


> But then the Turkish i test in contrib/citext/sql/citext_utf8.sql
> failed[1]:
>
> SELECT 'i'::citext = 'İ'::citext AS t;
>  t
>  ---
> - t
> + f
>  (1 row)
>

This is the current state of affairs:

- Windows:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | İ

- Linux:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | i

latin_capital_dotted does not lowercase to the same value on the two
platforms.

[1]
https://stackoverflow.com/questions/56419639/what-does-beta-use-unicode-utf-8-for-worldwide-language-support-actually-do

Regards,

Juan José Santamaría Flecha


Re: Windows default locale vs initdb

2022-07-20 Thread Thomas Munro
On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
 wrote:
> On Tue, Jul 19, 2022 at 4:47 AM Thomas Munro  wrote:
>> As for whether "accordingly" still applies, by the logic of
>> win32_langinfo()...  Windows still considers WIN1252 to be the default
>> ANSI code page for "en-US", though it'd work with UTF-8 too.  I'm not
>> sure what to make of that.  The goal here was to give Windows users
>> good defaults, but WIN1252 is probably not what most people actually
>> want.  Hmph.
>
>
> Still, WIN1252 is not the wrong answer for what we are asking. Even if you 
> enable UTF-8 support [1], the system will use the current default Windows 
> ANSI code page (ACP) for the locale and UTF-8 for the code page.

I'm still confused about what that means.  Suppose we decided to
insist by adding a ".UTF-8" suffix to the name, as that page says we
can now that we're on Windows 10+, when building the default locale
name (see experimental 0002 patch, attached).  It initially seemed to
have the right effect:

The database cluster will be initialized with locale "en-US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
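
For illustration only, appending the suffix to a BCP 47 name could look like the sketch below. This is not the code from the experimental 0002 patch (which is not shown here); the function name and return convention are assumptions.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "<name>.UTF-8" in buf, e.g. "en-US" -> "en-US.UTF-8".
 * Returns 0 on success, -1 if buf is too small for the result.
 * (Hypothetical helper, for illustration only.)
 */
int
append_utf8_codeset(const char *bcp47_name, char *buf, size_t buflen)
{
	int			n = snprintf(buf, buflen, "%s.UTF-8", bcp47_name);

	/* snprintf reports the length it wanted; reject truncated output */
	if (n < 0 || (size_t) n >= buflen)
		return -1;
	return 0;
}
```

The resulting string would then be what gets handed to setlocale() in place of the plain tag.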

But then the Turkish i test in contrib/citext/sql/citext_utf8.sql failed[1]:

SELECT 'i'::citext = 'İ'::citext AS t;
 t
 ---
- t
+ f
 (1 row)

About the pg_upgrade problem, maybe it's OK ... existing old format
names should continue to work, but we can still remove the weird code
that does locale name tweaking, right?  pg_upgraded databases should
contain fixed names (ie that were fixed by old initdb so should
continue to work), and new clusters will get BCP 47 names.

I don't really know, I was just playing with rough ideas by sending
patches to CI here...

[1] https://cirrus-ci.com/task/6423238052937728
From b007eb45e575956d5035f4152f72177abddc2762 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v3 1/3] Default to BCP 47 locale in initdb on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because they are unstable and not recommended for use in
databases.  Since setlocale() returns such names, on Windows use
GetUserDefaultLocaleName() if the user didn't provide an explicit
locale.

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha 
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 10 --
 src/bin/initdb/initdb.c   | 31 +--
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 445fd175d8..b656ca489f 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,14 @@ initdb --locale=sv_SE
 system under what names depends on what was provided by the operating
 system vendor and what was installed.  On most Unix systems, the command
 locale -a will provide a list of available locales.
-Windows uses more verbose locale names, such as German_Germany
-or Swedish_Sweden.1252, but the principles are the same.
+   
+
+   
+Windows uses BCP 47 language tags, like ICU.
+For example, sv-SE represents Swedish as spoken in Sweden.
+Windows also supports more verbose locale names based on English words,
+such as German_Germany or Swedish_Sweden.1252,
+but these are not recommended.

 

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 89b888eaa5..3af08b7b99 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -59,6 +59,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include 
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2007,6 +2011,7 @@ locale_date_order(const char *locale)
 static void
 check_locale_name(int category, const char *locale, char **canonname)
 {
+	char	   *locale_copy;
 	char	   *save;
 	char	   *res;
 
@@ -2022,10 +2027,30 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* for setlocale() call */
 	if (!locale)
-		locale = "";
+	{
+#ifdef WIN32
+		wchar_t		wide_name[LOCALE_NAME_MAX_LENGTH];
+		char		name[LOCALE_NAME_MAX_LENGTH];
+
+		/* use Windows API to find the default in BCP47 format */
+		if (GetUserDefaultLocaleName(wide_name, LOCALE_NAME_MAX_LENGTH) == 0)
+			pg_fatal("failed to get default locale name: error code %lu",
+	 GetLastError());
+		if (WideCharToMultiByte(CP_ACP, 0, wide_name, -1, name,
+LOCALE_NAME_MAX_LENGTH, NULL, NULL) == 0)
+			pg_fatal("failed to convert locale name: error code %lu",
+	 GetLastError());
+		locale_copy = pg_strdup(name);
+#else
+		/* use enviro

Re: Windows default locale vs initdb

2022-07-20 Thread Juan José Santamaría Flecha
On Tue, Jul 19, 2022 at 4:47 AM Thomas Munro  wrote:

> As for whether "accordingly" still applies, by the logic of
> win32_langinfo()...  Windows still considers WIN1252 to be the default
> ANSI code page for "en-US", though it'd work with UTF-8 too.  I'm not
> sure what to make of that.  The goal here was to give Windows users
> good defaults, but WIN1252 is probably not what most people actually
> want.  Hmph.
>

Still, WIN1252 is not the wrong answer for what we are asking. Even if you
enable UTF-8 support [1], the system will use the current default Windows
ANSI code page (ACP) for the locale and UTF-8 for the code page.

[1]
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170

Regards,

Juan José Santamaría Flecha


Re: Windows default locale vs initdb

2022-07-20 Thread Juan José Santamaría Flecha
On Tue, Jul 19, 2022 at 12:59 AM Thomas Munro 
wrote:

> Now that museum-grade Windows has been defenestrated, we are free to
> call GetUserDefaultLocaleName().  Here's a patch.
>

This LGTM.

>
> I think we should also convert to POSIX format when making the
> collname in your pg_import_system_collations() proposal, so that
> COLLATE "en_US" works (= a SQL identifier), but that's another
> thread[1].  I don't think we should do it in collcollate or
> datcollate, which is a string for the OS to interpret.
>

That thread has been split [1], but that is how the current version behaves.

>
> With my garbage collector hat on, I would like to rip out all of the
> support for traditional locale names, eventually.  Deleting kludgy
> code is easy and fun -- 0002 is a first swing at that -- but there
> remains an important unanswered question.  How should someone
> pg_upgrade a "English_Canada.1252" cluster if we now reject that name?
>  We'd need to do a conversion to "en-CA", or somehow tell the user to.
> H.
>

Is there a safe way to do that in pg_upgrade or would we be forcing users
to pg_dump into the new cluster?

[1]
https://www.postgresql.org/message-id/flat/0050ec23-34d9-2765-9015-98c04f0e18ac%40postgrespro.ru

Regards,

Juan José Santamaría Flecha


Re: Windows default locale vs initdb

2022-07-18 Thread Thomas Munro
On Tue, Jul 19, 2022 at 10:58 AM Thomas Munro  wrote:
> Here's a patch.

I added this to the next commitfest, and cfbot promptly told me about
some warnings I needed to fix.  That'll teach me to post a patch
tested with "ci-os-only: windows".  Looking more closely at some error
messages that report GetLastError() where I'd mixed up %d and %lu, I
see also that I didn't quite follow existing conventions for wording
when reporting Windows error numbers, so I fixed that too.
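
For anyone wondering about the %d vs %lu mix-up: GetLastError() returns a DWORD, which is an unsigned long on Windows, so %lu is the matching conversion. A portable sketch of the idea follows; the typedef is a stand-in for DWORD and the function name is made up, so treat both as assumptions rather than anything in the tree.

```c
#include <stdio.h>

/* stand-in for the Windows DWORD type, which is an unsigned long */
typedef unsigned long win_error_code;

/*
 * Format "<what>: error code <n>" into buf using %lu, the conversion
 * that matches unsigned long.  Returns snprintf's result.
 * (Hypothetical helper, for illustration only.)
 */
int
format_win_error(char *buf, size_t buflen, const char *what,
				 win_error_code code)
{
	return snprintf(buf, buflen, "%s: error code %lu", what, code);
}
```

With %d the compiler warns about the mismatched argument type, which is the kind of warning cfbot picked up.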

In the "startcreate" step on CI you can see that it says:

The database cluster will be initialized with locale "en-US".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

As for whether "accordingly" still applies, by the logic of
win32_langinfo()...  Windows still considers WIN1252 to be the default
ANSI code page for "en-US", though it'd work with UTF-8 too.  I'm not
sure what to make of that.  The goal here was to give Windows users
good defaults, but WIN1252 is probably not what most people actually
want.  Hmph.
From 95f2684150e2938f2e555d16bbed4295a6dad279 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH v2 1/2] Default to BCP 47 locale in initdb on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because they are unstable and not recommended for use in
databases.  Since setlocale() returns such names, on Windows use
GetUserDefaultLocaleName() if the user didn't provide an explicit
locale.

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha 
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 10 --
 src/bin/initdb/initdb.c   | 31 +--
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 445fd175d8..b656ca489f 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,14 @@ initdb --locale=sv_SE
 system under what names depends on what was provided by the operating
 system vendor and what was installed.  On most Unix systems, the command
 locale -a will provide a list of available locales.
-Windows uses more verbose locale names, such as German_Germany
-or Swedish_Sweden.1252, but the principles are the same.
+   
+
+   
+Windows uses BCP 47 language tags, like ICU.
+For example, sv-SE represents Swedish as spoken in Sweden.
+Windows also supports more verbose locale names based on English words,
+such as German_Germany or Swedish_Sweden.1252,
+but these are not recommended.

 

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 89b888eaa5..3af08b7b99 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -59,6 +59,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include 
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2007,6 +2011,7 @@ locale_date_order(const char *locale)
 static void
 check_locale_name(int category, const char *locale, char **canonname)
 {
+	char	   *locale_copy;
 	char	   *save;
 	char	   *res;
 
@@ -2022,10 +2027,30 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* for setlocale() call */
 	if (!locale)
-		locale = "";
+	{
+#ifdef WIN32
+		wchar_t		wide_name[LOCALE_NAME_MAX_LENGTH];
+		char		name[LOCALE_NAME_MAX_LENGTH];
+
+		/* use Windows API to find the default in BCP47 format */
+		if (GetUserDefaultLocaleName(wide_name, LOCALE_NAME_MAX_LENGTH) == 0)
+			pg_fatal("failed to get default locale name: error code %lu",
+	 GetLastError());
+		if (WideCharToMultiByte(CP_ACP, 0, wide_name, -1, name,
+LOCALE_NAME_MAX_LENGTH, NULL, NULL) == 0)
+			pg_fatal("failed to convert locale name: error code %lu",
+	 GetLastError());
+		locale_copy = pg_strdup(name);
+#else
+		/* use environment to find the default */
+		locale_copy = pg_strdup("");
+#endif
+	}
+	else
+		locale_copy = pg_strdup(locale);
 
 	/* set the locale with setlocale, to see if it accepts it. */
-	res = setlocale(category, locale);
+	res = setlocale(category, locale_copy);
 
 	/* save canonical name if requested. */
 	if (res && canonname)
@@ -2054,6 +2079,8 @@ check_locale_name(int category, const char *locale, char **canonname)
 			pg_fatal("invalid locale settings; check LANG and LC_* environment variables");
 		}
 	}
+
+	free(locale_copy);
 }
 
 /*
-- 
2.35.1

From 1e0b75b4c8958397a8e660fa0b8759f1da78a753 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 19 Jul 2022 08:53:08 +1200
Subject: [PATCH v2 2/2] Remove support for old Windows loc

Re: Windows default locale vs initdb

2022-07-18 Thread Thomas Munro
On Wed, Dec 15, 2021 at 11:32 PM Juan José Santamaría Flecha
 wrote:
> On Sun, May 16, 2021 at 6:29 AM Noah Misch  wrote:
>> On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
>> > The question we asked ourselves
>> > multiple times in the other thread was how we're supposed to get to
>> > the modern BCP 47 form when creating the template databases.  It looks
>> > like one possibility, since Vista, is to call
>> > GetUserDefaultLocaleName()[2]
>>
>> > No patch, but I wondered if any Windows hackers have any feedback on
>> > relative sanity of trying to fix all these problems this way.
>>
>> Sounds reasonable.  If PostgreSQL v15 would otherwise run on Windows Server
>> 2003 R2, this is a good time to let that support end.
>>
> The value returned by GetUserDefaultLocaleName() is a system configured 
> parameter, independent of what you set with setlocale(). It might be 
> reasonable for initdb but not for a backend in most cases.

Agreed.  Only for initdb, and only if you didn't specify a locale name
on the command line.

> You can get the locale POSIX-ish name using GetLocaleInfoEx(), but this is no 
> longer recommended, because using LCIDs is no longer recommended [1]. 
> Although, this would work for legacy locales. Please find attached a POC 
> patch showing this approach.

Now that museum-grade Windows has been defenestrated, we are free to
call GetUserDefaultLocaleName().  Here's a patch.

One thing you did in your patch that I disagree with, I think, was to
convert a BCP 47 name to a POSIX name early, that is, s/-/_/.  I think
we should use the locale name exactly as Windows (really, under the
covers, ICU) spells it.  There is only one place in the tree today
that really wants a POSIX locale name, and that's LC_MESSAGES,
accessed by GNU gettext, not Windows.  We already had code to cope
with that.

I think we should also convert to POSIX format when making the
collname in your pg_import_system_collations() proposal, so that
COLLATE "en_US" works (= a SQL identifier), but that's another
thread[1].  I don't think we should do it in collcollate or
datcollate, which is a string for the OS to interpret.
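
To be concrete about what "convert to POSIX format" means here, it is just the s/-/_/ mentioned above. A minimal sketch, with an invented function name and no claim to match the code in either patch:

```c
#include <stddef.h>

/*
 * Copy a BCP 47 name into buf with every '-' replaced by '_',
 * e.g. "en-US" -> "en_US".  Returns 0 on success, -1 if buf is
 * too small.  (Hypothetical helper, for illustration only.)
 */
int
bcp47_to_posix_name(const char *name, char *buf, size_t buflen)
{
	size_t		i;

	if (buflen == 0)
		return -1;
	for (i = 0; name[i] != '\0'; i++)
	{
		/* keep one byte free for the terminating NUL */
		if (i + 1 >= buflen)
			return -1;
		buf[i] = (name[i] == '-') ? '_' : name[i];
	}
	buf[i] = '\0';
	return 0;
}
```

The point is that this would only be applied where a SQL identifier or gettext wants it, never to the string stored for the OS to interpret.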

With my garbage collector hat on, I would like to rip out all of the
support for traditional locale names, eventually.  Deleting kludgy
code is easy and fun -- 0002 is a first swing at that -- but there
remains an important unanswered question.  How should someone
pg_upgrade a "English_Canada.1252" cluster if we now reject that name?
 We'd need to do a conversion to "en-CA", or somehow tell the user to.
H.
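
One possible shape for such a conversion is a lookup table keyed on the traditional name with the code page suffix stripped. The sketch below is purely hypothetical: the table, the struct, and the function name are invented for illustration, and a real mapping would need far more entries and some answer for names it does not know.

```c
#include <stddef.h>
#include <string.h>

/* one row of a hypothetical traditional-name to BCP 47 mapping */
typedef struct
{
	const char *traditional;	/* "Language_Country", no code page */
	const char *bcp47;
} locale_name_map;

/* illustrative entries only; a real table would be much larger */
static const locale_name_map example_map[] = {
	{"English_Canada", "en-CA"},
	{"English_United States", "en-US"},
	{"German_Germany", "de-DE"},
	{"Swedish_Sweden", "sv-SE"},
};

/*
 * Look up a traditional Windows locale name, ignoring any ".codepage"
 * suffix.  Returns the BCP 47 tag, or NULL if the name is not in the
 * table.
 */
const char *
traditional_to_bcp47(const char *name)
{
	const char *dot = strchr(name, '.');
	size_t		len = dot ? (size_t) (dot - name) : strlen(name);
	size_t		i;

	for (i = 0; i < sizeof(example_map) / sizeof(example_map[0]); i++)
	{
		if (strlen(example_map[i].traditional) == len &&
			strncmp(example_map[i].traditional, name, len) == 0)
			return example_map[i].bcp47;
	}
	return NULL;
}
```

A NULL result is the case where pg_upgrade would have to stop and ask the user for a name.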

[1] 
https://www.postgresql.org/message-id/flat/CAC%2BAXB0WFjJGL1n33bRv8wsnV-3PZD0A7kkjJ2KjPH0dOWqQdg%40mail.gmail.com
From d6d677fd185242590f0f716cf69d09e735122ff7 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 19 Jul 2022 06:31:17 +1200
Subject: [PATCH 1/2] Default to BCP 47 locale in initdb on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Avoid selecting traditional Windows locale names written with English
words, because they are unstable and not recommended for use in
databases.  Since setlocale() returns such names, instead use
GetUserDefaultLocaleName() if the user didn't provide an explicit
locale.

Also update the documentation to recommend BCP 47 over the traditional
names when providing explicit values to initdb.

Reviewed-by: Juan José Santamaría Flecha 
Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 10 --
 src/bin/initdb/initdb.c   | 28 +++-
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 445fd175d8..22e33f0f57 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -83,8 +83,14 @@ initdb --locale=sv_SE
 system under what names depends on what was provided by the operating
 system vendor and what was installed.  On most Unix systems, the command
 locale -a will provide a list of available locales.
-Windows uses more verbose locale names, such as German_Germany
-or Swedish_Sweden.1252, but the principles are the same.
+   
+
+   
+Windows uses BCP 47 language tags.
+For example, sv-SE represents Swedish as spoken in Sweden.
+Windows also supports more verbose locale names based on English words,
+such as German_Germany or Swedish_Sweden.1252,
+but these are not recommended.

 

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 89b888eaa5..57c5ecf3cf 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -59,6 +59,10 @@
 #include "sys/mman.h"
 #endif
 
+#ifdef WIN32
+#include 
+#endif
+
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
@@ -2022,7 +2026,27 @@ check_locale_name(int category, const char *locale, char **canonname)
 
 	/* for set

Re: Windows default locale vs initdb

2021-12-15 Thread Juan José Santamaría Flecha
On Sun, May 16, 2021 at 6:29 AM Noah Misch  wrote:

> On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
>
> > The question we asked ourselves
> > multiple times in the other thread was how we're supposed to get to
> > the modern BCP 47 form when creating the template databases.  It looks
> > like one possibility, since Vista, is to call
> > GetUserDefaultLocaleName()[2]
>
> > No patch, but I wondered if any Windows hackers have any feedback on
> > relative sanity of trying to fix all these problems this way.
>
> Sounds reasonable.  If PostgreSQL v15 would otherwise run on Windows Server
> 2003 R2, this is a good time to let that support end.
>

The value returned by GetUserDefaultLocaleName() is a system-configured
parameter, independent of what you set with setlocale(). It might be
reasonable for initdb but not for a backend in most cases.

You can get a POSIX-ish locale name using GetLocaleInfoEx(), but this is
no longer recommended, because using LCIDs is no longer recommended [1].
It would still work for legacy locales, though. Please find attached a POC
patch showing this approach.

[1] https://docs.microsoft.com/en-us/globalization/locale/locale-names

Regards,

Juan José Santamaría Flecha


0001-POC-Make-Windows-locale-POSIX-looking.patch
Description: Binary data


Re: Windows default locale vs initdb

2021-05-15 Thread Noah Misch
On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
> Currently initdb sets up template databases with old-style Windows
> locale names reported by the OS, and they seem to have caused us quite
> a few problems over the years:
> 
> db29620d "Work around Windows locale name with non-ASCII character."
> aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
> db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
> 9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

> I suppose that was the only form available at the time the code was
> written, so there was no choice.

Right.

> The question we asked ourselves
> multiple times in the other thread was how we're supposed to get to
> the modern BCP 47 form when creating the template databases.  It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2]

> No patch, but I wondered if any Windows hackers have any feedback on
> relative sanity of trying to fix all these problems this way.

Sounds reasonable.  If PostgreSQL v15 would otherwise run on Windows Server
2003 R2, this is a good time to let that support end.




Re: Windows default locale vs initdb

2021-04-19 Thread Peter Eisentraut

On 19.04.21 07:42, Thomas Munro wrote:

> It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2], which doesn't appear to have been
> discussed before on this list.  That doesn't allow you to ask for the
> default for each individual category, but I don't know if that is even
> a concept for Windows user settings.


pg_newlocale_from_collation() doesn't support collcollate != collctype 
on Windows anyway, so that wouldn't be an issue.





Re: Windows default locale vs initdb

2021-04-19 Thread Andrew Dunstan


On 4/19/21 10:26 AM, Dave Page wrote:
>
>
> On Mon, Apr 19, 2021 at 11:52 AM Andrew Dunstan wrote:
>
>
> My understanding from Microsoft staff at conferences is that
> Azure's PostgreSQL SAS runs on Linux, not Windows.
>
>
> This is from a regular Azure Database for PostgreSQL single server:
>
> postgres=> select version();
>                           version
> ------------------------------------------------------------
>  PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
> (1 row)
>
> And this is from the new Flexible Server preview:
>
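> postgres=> select version();
>                                                      version
> -----------------------------------------------------------------------------------------------------------------
>  PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
> (1 row)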
>
> So I guess it's a case of "it depends".
>

Good to know. A year or two back at more than one conference I tried to enlist 
some of these folks in helping us with Windows PostgreSQL and their reply was 
that they knew nothing about it because they were on Linux :-) I guess things 
change over time.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Windows default locale vs initdb

2021-04-19 Thread Dave Page
On Mon, Apr 19, 2021 at 11:52 AM Andrew Dunstan  wrote:

>
> My understanding from Microsoft staff at conferences is that Azure's
> PostgreSQL SAS runs on Linux, not Windows.
>

This is from a regular Azure Database for PostgreSQL single server:

postgres=> select version();
                          version
------------------------------------------------------------
 PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
(1 row)

And this is from the new Flexible Server preview:

postgres=> select version();
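                                                     version
-----------------------------------------------------------------------------------------------------------------
 PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
(1 row)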

So I guess it's a case of "it depends".

-- 
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com


Re: Windows default locale vs initdb

2021-04-19 Thread Pavel Stehule
On Mon, Apr 19, 2021 at 12:52, Andrew Dunstan wrote:

>
>
> On Mon, Apr 19, 2021 at 4:53 AM Pavel Stehule 
> wrote:
>
>>
>>
>> On Mon, Apr 19, 2021 at 7:43, Thomas Munro wrote:
>>
>>> Hi,
>>>
>>> Moving this topic into its own thread from the one about collation
>>> versions, because it concerns pre-existing problems, and that thread
>>> is long.
>>>
>>> Currently initdb sets up template databases with old-style Windows
>>> locale names reported by the OS, and they seem to have caused us quite
>>> a few problems over the years:
>>>
>>> db29620d "Work around Windows locale name with non-ASCII character."
>>> aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
>>> db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
>>> 9f12a3b9 "Tolerate version lookup failure for old style Windows
>>> locale..."
>>>
>>> ... and probably more, and also various threads about, for example,
>>> "German_German.1252" vs "German_Switzerland.1252" which seem to get
>>> confused or badly canonicalised or rejected somewhere in the mix.
>>>
>>> I hadn't focused on any of that before, being a non-Windows-user, but
>>> the entire contents of win32setlocale.c supports the theory that
>>> Windows' manual meant what it said when it said[1]:
>>>
>>> "We do not recommend this form for locale strings embedded in
>>> code or serialized to storage, because these strings are more likely
>>> to be changed by an operating system update than the locale name
>>> form."
>>>
>>> I suppose that was the only form available at the time the code was
>>> written, so there was no choice.  The question we asked ourselves
>>> multiple times in the other thread was how we're supposed to get to
>>> the modern BCP 47 form when creating the template databases.  It looks
>>> like one possibility, since Vista, is to call
>>> GetUserDefaultLocaleName()[2], which doesn't appear to have been
>>> discussed before on this list.  That doesn't allow you to ask for the
>>> default for each individual category, but I don't know if that is even
>>> a concept for Windows user settings.  It may be that some of the other
>>> nearby functions give a better answer for some reason.  But one thing
>>> is clear from a test that someone kindly ran for me: it reports
>>> standardised strings like "en-NZ", not strings like "English_New
>>> Zealand.1252".
>>>
>>> No patch, but I wondered if any Windows hackers have any feedback on
>>> relative sanity of trying to fix all these problems this way.
>>>
>>
>> Last weekend I talked with one user about one interesting (and messing)
>> issue. They needed to create a new database with Czech collation on Azure
>> SAS. There was not any entry in pg_collation for Czech language. The reply
>> from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0'
>> ENCODING 'utf8' LOCALE 'cs_CZ.UTF8' and it was working.
>>
>>
>>
> My understanding from Microsoft staff at conferences is that Azure's
> PostgreSQL SAS runs on Linux, not Windows.
>

I had different information, but something was still wrong, because
there were no Czech locales in pg_collation.



>
> cheers
>
> andrew
>


Re: Windows default locale vs initdb

2021-04-19 Thread Andrew Dunstan
On Mon, Apr 19, 2021 at 4:53 AM Pavel Stehule 
wrote:

>
>
> On Mon, Apr 19, 2021 at 7:43, Thomas Munro wrote:
>
>> Hi,
>>
>> Moving this topic into its own thread from the one about collation
>> versions, because it concerns pre-existing problems, and that thread
>> is long.
>>
>> Currently initdb sets up template databases with old-style Windows
>> locale names reported by the OS, and they seem to have caused us quite
>> a few problems over the years:
>>
>> db29620d "Work around Windows locale name with non-ASCII character."
>> aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
>> db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
>> 9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."
>>
>> ... and probably more, and also various threads about, for example,
>> "German_German.1252" vs "German_Switzerland.1252" which seem to get
>> confused or badly canonicalised or rejected somewhere in the mix.
>>
>> I hadn't focused on any of that before, being a non-Windows-user, but
>> the entire contents of win32setlocale.c supports the theory that
>> Windows' manual meant what it said when it said[1]:
>>
>> "We do not recommend this form for locale strings embedded in
>> code or serialized to storage, because these strings are more likely
>> to be changed by an operating system update than the locale name
>> form."
>>
>> I suppose that was the only form available at the time the code was
>> written, so there was no choice.  The question we asked ourselves
>> multiple times in the other thread was how we're supposed to get to
>> the modern BCP 47 form when creating the template databases.  It looks
>> like one possibility, since Vista, is to call
>> GetUserDefaultLocaleName()[2], which doesn't appear to have been
>> discussed before on this list.  That doesn't allow you to ask for the
>> default for each individual category, but I don't know if that is even
>> a concept for Windows user settings.  It may be that some of the other
>> nearby functions give a better answer for some reason.  But one thing
>> is clear from a test that someone kindly ran for me: it reports
>> standardised strings like "en-NZ", not strings like "English_New
>> Zealand.1252".
>>
>> No patch, but I wondered if any Windows hackers have any feedback on
>> relative sanity of trying to fix all these problems this way.
>>
>
> Last weekend I talked with a user about an interesting (and messy)
> issue. They needed to create a new database with Czech collation on Azure
> SAS. There was no entry in pg_collation for the Czech language. The reply
> from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0'
> ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and that worked.
>
>
>
My understanding from Microsoft staff at conferences is that Azure's
PostgreSQL SAS runs on Linux, not Windows.

cheers

andrew


Re: Windows default locale vs initdb

2021-04-19 Thread Pavel Stehule
On Mon, Apr 19, 2021 at 7:43 AM Thomas Munro wrote:

> Hi,
>
> Moving this topic into its own thread from the one about collation
> versions, because it concerns pre-existing problems, and that thread
> is long.
>
> Currently initdb sets up template databases with old-style Windows
> locale names reported by the OS, and they seem to have caused us quite
> a few problems over the years:
>
> db29620d "Work around Windows locale name with non-ASCII character."
> aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
> db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
> 9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."
>
> ... and probably more, and also various threads about, for example,
> "German_Germany.1252" vs "German_Switzerland.1252", which seem to get
> confused or badly canonicalised or rejected somewhere in the mix.
>
> I hadn't focused on any of that before, being a non-Windows-user, but
> the entire contents of win32setlocale.c supports the theory that
> Windows' manual meant what it said when it said[1]:
>
> "We do not recommend this form for locale strings embedded in
> code or serialized to storage, because these strings are more likely
> to be changed by an operating system update than the locale name
> form."
>
> I suppose that was the only form available at the time the code was
> written, so there was no choice.  The question we asked ourselves
> multiple times in the other thread was how we're supposed to get to
> the modern BCP 47 form when creating the template databases.  It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2], which doesn't appear to have been
> discussed before on this list.  That doesn't allow you to ask for the
> default for each individual category, but I don't know if that is even
> a concept for Windows user settings.  It may be that some of the other
> nearby functions give a better answer for some reason.  But one thing
> is clear from a test that someone kindly ran for me: it reports
> standardised strings like "en-NZ", not strings like "English_New
> Zealand.1252".
>
> No patch, but I wondered if any Windows hackers have any feedback on
> relative sanity of trying to fix all these problems this way.
>

Last weekend I talked with a user about an interesting (and messy)
issue. They needed to create a new database with Czech collation on Azure
SAS. There was no entry in pg_collation for the Czech language. The reply
from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0'
ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and that worked.

Regards

Pavel


> [1]
> https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-160
> [2]
> https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuserdefaultlocalename
>
>
>


Windows default locale vs initdb

2021-04-18 Thread Thomas Munro
Hi,

Moving this topic into its own thread from the one about collation
versions, because it concerns pre-existing problems, and that thread
is long.

Currently initdb sets up template databases with old-style Windows
locale names reported by the OS, and they seem to have caused us quite
a few problems over the years:

db29620d "Work around Windows locale name with non-ASCII character."
aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

... and probably more, and also various threads about, for example,
"German_Germany.1252" vs "German_Switzerland.1252", which seem to get
confused or badly canonicalised or rejected somewhere in the mix.

I hadn't focused on any of that before, being a non-Windows-user, but
the entire contents of win32setlocale.c supports the theory that
Windows' manual meant what it said when it said[1]:

"We do not recommend this form for locale strings embedded in
code or serialized to storage, because these strings are more likely
to be changed by an operating system update than the locale name
form."

I suppose that was the only form available at the time the code was
written, so there was no choice.  The question we asked ourselves
multiple times in the other thread was how we're supposed to get to
the modern BCP 47 form when creating the template databases.  It looks
like one possibility, since Vista, is to call
GetUserDefaultLocaleName()[2], which doesn't appear to have been
discussed before on this list.  That doesn't allow you to ask for the
default for each individual category, but I don't know if that is even
a concept for Windows user settings.  It may be that some of the other
nearby functions give a better answer for some reason.  But one thing
is clear from a test that someone kindly ran for me: it reports
standardised strings like "en-NZ", not strings like "English_New
Zealand.1252".
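
To make the idea concrete, here is a rough sketch in C, not a patch.  It
assumes a Vista-or-later SDK for GetUserDefaultLocaleName().  The helper
names get_default_locale_name() and looks_like_bcp47() are made up for
illustration and do not exist in initdb; the second one is only a crude
heuristic for telling "en-NZ"-style names apart from old-style names such
as "English_New Zealand.1252", not a real BCP 47 validator.  On
non-Windows builds the lookup simply reports failure.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper: fetch the user's default locale name in the modern
 * form (e.g. "en-NZ").  Returns false on failure or on non-Windows builds.
 */
static bool
get_default_locale_name(char *buf, size_t buflen)
{
#ifdef _WIN32
	wchar_t		wbuf[LOCALE_NAME_MAX_LENGTH];

	/* Returns 0 on failure, otherwise the length including the terminator. */
	if (GetUserDefaultLocaleName(wbuf, LOCALE_NAME_MAX_LENGTH) == 0)
		return false;

	/* Assumption: locale names are plain ASCII, so narrowing is lossless. */
	return WideCharToMultiByte(CP_ACP, 0, wbuf, -1,
							   buf, (int) buflen, NULL, NULL) != 0;
#else
	(void) buf;
	(void) buflen;
	return false;
#endif
}

/*
 * Hypothetical heuristic: accept names made of hyphen-separated
 * alphanumeric subtags of 1-8 characters whose first subtag is 2-3
 * letters.  Old-style names fail because they contain '_', ' ' or '.'.
 */
static bool
looks_like_bcp47(const char *name)
{
	size_t		len = 0;		/* length of the current subtag */
	bool		first = true;	/* still inside the language subtag? */
	const char *p;

	for (p = name;; p++)
	{
		if (*p == '-' || *p == '\0')
		{
			if (len < 1 || len > 8)
				return false;
			if (first && (len < 2 || len > 3))
				return false;
			if (*p == '\0')
				return true;
			first = false;
			len = 0;
		}
		else if (isalnum((unsigned char) *p) &&
				 (!first || isalpha((unsigned char) *p)))
			len++;
		else
			return false;
	}
}
```

With these, initdb-like code could call get_default_locale_name() and use
the result only when looks_like_bcp47() accepts it, falling back to the
existing setlocale()-based behaviour otherwise.  Whether that fallback is
the right policy is exactly the kind of feedback I'm after.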

No patch, but I wondered if any Windows hackers have any feedback on
relative sanity of trying to fix all these problems this way.

[1] 
https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-160
[2] 
https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuserdefaultlocalename