Re: [HACKERS] initdb initalization failure for collation "ja_JP"
I wrote:
>> One question that I've got is why the ICU portion refuses to load
>> any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()).
>> Surely this is completely wrong?  I should think that what we load into
>> pg_collation ought to be independent of template1's encoding, the same
>> as it is for libc collations, and the right place to be making a test
>> like that is where somebody attempts to use an ICU collation.

Pursuant to the second part of that: I checked on what happens if you try
to use an ICU collation in a database with a not-supported-by-ICU
encoding.  We have to cope with that scenario even with the current
(broken IMO) initdb behavior, because even if template1 has a supported
encoding, it's possible to create another database that doesn't.

It does fail more or less cleanly; you get an "encoding "foo" not
supported by ICU" message at runtime (out of get_encoding_name_for_icu).
But that's quite a bit unlike the behavior for libc collations: with
those, you get an error in collation name lookup, along the lines of

    collation "en_DK.utf8" for encoding "SQL_ASCII" does not exist

The attached proposed patch makes the behavior for ICU collations the
same, by dint of injecting the is_encoding_supported_by_icu() check into
collation name lookup.

			regards, tom lane

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 64f6fee..029a132 100644
*** a/src/backend/catalog/namespace.c
--- b/src/backend/catalog/namespace.c
*** OpfamilyIsVisible(Oid opfid)
*** 1915,1923 ****
--- 1915,1974 ----
  }
  
  /*
+  * lookup_collation
+  *		If there's a collation of the given name/namespace, and it works
+  *		with the given encoding, return its OID.  Else return InvalidOid.
+  */
+ static Oid
+ lookup_collation(const char *collname, Oid collnamespace, int32 encoding)
+ {
+ 	Oid			collid;
+ 	HeapTuple	colltup;
+ 	Form_pg_collation collform;
+ 
+ 	/* Check for encoding-specific entry (exact match) */
+ 	collid = GetSysCacheOid3(COLLNAMEENCNSP,
+ 							 PointerGetDatum(collname),
+ 							 Int32GetDatum(encoding),
+ 							 ObjectIdGetDatum(collnamespace));
+ 	if (OidIsValid(collid))
+ 		return collid;
+ 
+ 	/*
+ 	 * Check for any-encoding entry.  This takes a bit more work: while libc
+ 	 * collations with collencoding = -1 do work with all encodings, ICU
+ 	 * collations only work with certain encodings, so we have to check that
+ 	 * aspect before deciding it's a match.
+ 	 */
+ 	colltup = SearchSysCache3(COLLNAMEENCNSP,
+ 							  PointerGetDatum(collname),
+ 							  Int32GetDatum(-1),
+ 							  ObjectIdGetDatum(collnamespace));
+ 	if (!HeapTupleIsValid(colltup))
+ 		return InvalidOid;
+ 	collform = (Form_pg_collation) GETSTRUCT(colltup);
+ 	if (collform->collprovider == COLLPROVIDER_ICU)
+ 	{
+ 		if (is_encoding_supported_by_icu(encoding))
+ 			collid = HeapTupleGetOid(colltup);
+ 		else
+ 			collid = InvalidOid;
+ 	}
+ 	else
+ 	{
+ 		collid = HeapTupleGetOid(colltup);
+ 	}
+ 	ReleaseSysCache(colltup);
+ 	return collid;
+ }
+ 
+ /*
   * CollationGetCollid
   *		Try to resolve an unqualified collation name.
   *		Returns OID if collation found in search path, else InvalidOid.
+  *
+  * Note that this will only find collations that work with the current
+  * database's encoding.
   */
  Oid
  CollationGetCollid(const char *collname)
*** CollationGetCollid(const char *collname)
*** 1935,1953 ****
  		if (namespaceId == myTempNamespace)
  			continue;			/* do not look in temp namespace */
  
! 		/* Check for database-encoding-specific entry */
! 		collid = GetSysCacheOid3(COLLNAMEENCNSP,
! 								 PointerGetDatum(collname),
! 								 Int32GetDatum(dbencoding),
! 								 ObjectIdGetDatum(namespaceId));
! 		if (OidIsValid(collid))
! 			return collid;
! 
! 		/* Check for any-encoding entry */
! 		collid = GetSysCacheOid3(COLLNAMEENCNSP,
! 								 PointerGetDatum(collname),
! 								 Int32GetDatum(-1),
! 								 ObjectIdGetDatum(namespaceId));
  		if (OidIsValid(collid))
  			return collid;
  	}
--- 1986,1992 ----
  		if (namespaceId == myTempNamespace)
  			continue;			/* do not look in temp namespace */
  
! 		collid = lookup_collation(collname, namespaceId, dbencoding);
  		if (OidIsValid(collid))
  			return collid;
  	}
*** CollationGetCollid(const char *collname)
*** 1961,1966 ****
--- 2000,2008 ----
   * Determine whether a collation (identified by OID) is visible in the
   * current search path.  Visible means "would be found by searching
   * for the unqualified collation name".
+  *
+  * Note that only collations that work with the current database's encoding
+  * will be considered visible.
   */
  bool
  CollationIsVisible(Oid collid)
*** CollationIsVisible(Oid collid)
*** 1990,1998 ****
  	{
  		/*
  		 * If it is in the path, it might still not be visible; it could be
! 		 * hidden by another conversion of the same name earlier in the path.
! 		 * So we must do a slow chec
Re: [HACKERS] initdb initalization failure for collation "ja_JP"
I wrote:
> One question that I've got is why the ICU portion refuses to load
> any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()).
> Surely this is completely wrong?  I should think that what we load into
> pg_collation ought to be independent of template1's encoding, the same
> as it is for libc collations, and the right place to be making a test
> like that is where somebody attempts to use an ICU collation.  But
> I've not tried to test it.

So I did test that, and found out the presumable reason why that's there:
icu_from_uchar() falls over if the database encoding is unsupported, and
we use that to convert ICU "display names" for use as comments for the
ICU collations.

But that's not very much less wrongheaded, because it will allow non-ASCII
characters into the initial database contents, which is absolutely not
acceptable.  We assume we can bit-copy the contents of template0 and it
will be valid in any encoding.

Therefore, I think the right thing to do is remove that test and change
get_icu_locale_comment() so that it rejects non-ASCII text, making the
encoding conversion trivial, as in the attached patch.

On my Fedora 25 laptop, the only collations that go without a comment in
this approach are the "nb" ones (Norwegian Bokmål).  As I recall, that
locale is a second-class citizen for other reasons already, precisely
because of its loony insistence on a non-ASCII name even when we're
asking for an Anglicized version.

I'm inclined to add a test to reject non-ASCII in the ICU locale names as
well as the comments.  We've had to do that for libc locale names, and
this experience shows that the ICU locale maintainers don't have their
heads screwed on any straighter.  But this patch doesn't do that.
			regards, tom lane

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 1c43f0b..6febe63 100644
*** a/src/backend/commands/collationcmds.c
--- b/src/backend/commands/collationcmds.c
*** get_icu_language_tag(const char *localen
*** 431,437 ****
  /*
   * Get a comment (specifically, the display name) for an ICU locale.
!  * The result is a palloc'd string.
   */
  static char *
  get_icu_locale_comment(const char *localename)
--- 431,439 ----
  /*
   * Get a comment (specifically, the display name) for an ICU locale.
!  * The result is a palloc'd string, or NULL if we can't get a comment
!  * or find that it's not all ASCII.  (We can *not* accept non-ASCII
!  * comments, because the contents of template0 must be encoding-agnostic.)
   */
  static char *
  get_icu_locale_comment(const char *localename)
*** get_icu_locale_comment(const char *local
*** 439,444 ****
--- 441,447 ----
  	UErrorCode	status;
  	UChar		displayname[128];
  	int32		len_uchar;
+ 	int32		i;
  	char	   *result;
  
  	status = U_ZERO_ERROR;
*** get_icu_locale_comment(const char *local
*** 446,456 ****
  							   displayname, lengthof(displayname),
  							   &status);
  	if (U_FAILURE(status))
! 		ereport(ERROR,
! 				(errmsg("could not get display name for locale \"%s\": %s",
! 						localename, u_errorName(status))));
! 
! 	icu_from_uchar(&result, displayname, len_uchar);
  
  	return result;
  }
--- 449,468 ----
  							   displayname, lengthof(displayname),
  							   &status);
  	if (U_FAILURE(status))
! 		return NULL;			/* no good reason to raise an error */
! 
! 	/* Check for non-ASCII comment */
! 	for (i = 0; i < len_uchar; i++)
! 	{
! 		if (displayname[i] > 127)
! 			return NULL;
! 	}
! 
! 	/* OK, transcribe */
! 	result = palloc(len_uchar + 1);
! 	for (i = 0; i < len_uchar; i++)
! 		result[i] = displayname[i];
! 	result[len_uchar] = '\0';
  
  	return result;
  }
*** pg_import_system_collations(PG_FUNCTION_
*** 642,655 ****
  
  	/* Load collations known to ICU */
  #ifdef USE_ICU
- 	if (!is_encoding_supported_by_icu(GetDatabaseEncoding()))
- 	{
- 		ereport(NOTICE,
- 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- 				 errmsg("encoding \"%s\" not supported by ICU",
- 						pg_encoding_to_char(GetDatabaseEncoding()))));
- 	}
- 	else
  	{
  		int			i;
--- 654,659 ----
*** pg_import_system_collations(PG_FUNCTION_
*** 661,666 ****
--- 665,671 ----
  	{
  		const char *name;
  		char	   *langtag;
+ 		char	   *icucomment;
  		const char *collcollate;
  		UEnumeration *en;
  		UErrorCode	status;
*** pg_import_system_collations(PG_FUNCTION_
*** 686,693 ****
  			CommandCounterIncrement();
  
! 			CreateComments(collid, CollationRelationId, 0,
! 						   get_icu_locale_comment(name));
  		}
  
  	/*
--- 691,700 ----
  			CommandCounterIncrement();
  
! 			icucomment = get_icu_locale_comment(name);
! 			if (icucomment)
! 				CreateComments(collid, CollationRelationId, 0,
! 							   icucomment);
  		}
  
  	/*
*** pg_import_system_collations(PG_FUNCTION_
*** 720,727 ****
  			CommandCounterIncrement();
  
! 			CreateComments(colli
Re: [HACKERS] initdb initalization failure for collation "ja_JP"
I wrote:
> Now the hard part of that is that because pg_import_system_collations
> isn't using a temporary staging table, but is just inserting directly
> into pg_collation, there isn't any way for it to eliminate duplicates
> unless it uses if_not_exists behavior all the time.  So there seem to
> be two ways to proceed:
> 1. Drop pg_import_system_collations' if_not_exists argument and just
> define it as adding any collations not already known in pg_collation.
> 2. Significantly rewrite it so that it de-dups the collation set by
> hand before trying to insert into pg_collation.

After further thought, I realized that there's another argument for
making pg_import_system_collations() always do if-not-exists, which is
that as it stands it's inconsistent anyway, because it silently uses
if-not-exists behavior for aliases.  So attached is a draft patch that
drops if_not_exists and tweaks the alias insertion code to guarantee
deterministic behavior.

I also corrected a failure to use CommandCounterIncrement() in the ICU
part of the code, which would cause problems if ICU can ever report
duplicate names --- something I don't especially care to assume is
impossible.  Also, I fixed things so that pg_import_system_collations()
doesn't emit any chatter about duplicate existing names, because it
didn't take long to realize that that behavior was useless and
irritating.  (Perhaps we should change the function to return the number
of entries successfully added?  But I didn't do that here.)

I've tested this with a faked version of "locale -a" that generates
duplicated entries, and it does what I think it should.

One question that I've got is why the ICU portion refuses to load
any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()).
Surely this is completely wrong?  I should think that what we load into
pg_collation ought to be independent of template1's encoding, the same
as it is for libc collations, and the right place to be making a test
like that is where somebody attempts to use an ICU collation.  But
I've not tried to test it.

Also, I'm a bit tempted to remove setup_collation()'s manual insertion
of 'ucs_basic' in favor of adding a DATA() line for that to
pg_collation.h.  As things stand, if for some reason "locale -a" were to
report a locale named that, initdb would fail.  In the old code, it
would not have failed --- it's uncertain whether you would have gotten
the forced UTF8/C definition or libc's definition, but you would have
gotten only one.  However, that approach would result in 'ucs_basic'
being treated as pinned, which perhaps we don't want.  If we don't want
it to be pinned, another idea is just to make setup_collation() insert
it first not last, thereby ensuring that the SQL definition wins over
any libc entries.

Comments?

			regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e073f7b..bd2eaaa 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** postgres=# SELECT * FROM pg_walfile_name
*** 19711,19717 ****
        pg_import_system_collations
!       pg_import_system_collations(if_not_exists boolean, schema regnamespace)
        void
        Import operating system collations
--- 19711,19717 ----
        pg_import_system_collations
!       pg_import_system_collations(schema regnamespace)
        void
        Import operating system collations
*** postgres=# SELECT * FROM pg_walfile_name
*** 19730,19747 ****
!       pg_import_system_collations populates the system
!       catalog pg_collation with collations based on all the
!       locales it finds on the operating system.  This is what
        initdb uses; see  for more details.  If additional
        locales are installed into the operating system later on, this function
!       can be run again to add collations for the new locales.  In that case, the
!       parameter if_not_exists should be set to true to
!       skip over existing collations.  The schema
!       parameter would typically be pg_catalog, but that is
!       not a requirement.  (Collation objects based on locales that are no longer
!       present on the operating system are never removed by this function.)
--- 19730,19748 ----
!       pg_import_system_collations adds collations to the system
!       catalog pg_collation based on all the
!       locales it finds in the operating system.  This is what
        initdb uses; see  for more details.  If additional
        locales are installed into the operating system later on, this function
!       can be run again to add collations for the new locales.  Locales that
!       match existing entries in pg_collation will be skipped.
!       (But collation objects based on locales that are no longer
!       present in the operating system are not removed by this function.)
!       The schema parameter would typically
!       be pg_catalog, but that is no
Re: [HACKERS] initdb initalization failure for collation "ja_JP"
On 20/06/2017 17:37, Tom Lane wrote:
> I wrote:
>> Marco Atzeri writes:
>>> Building on Cygwin latest 10 beta1 or head source,
>>> make check fails as:
>>> ...
>>> performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
>>> CEST [16860] FATAL:  collation "ja_JP" for encoding "EUC_JP" already exists
>
>> Hmph.  Could we see the results of "locale -a | grep ja_JP" ?
>
> Despite the lack of followup from the OP, I'm pretty troubled by this
> report.  It shows that the reimplementation of OS collation data import
> as pg_import_system_collations() is a whole lot more fragile than the
> original coding.  We have never before trusted "locale -a" to not
> produce duplicate outputs, not since the very beginning in 414c5a2e.
> AFAICS, the current coding has also lost the protections we added very
> shortly after that in 853c1750f; and it has also lost the admittedly
> rather arbitrary, but at least deterministic, preference order for
> conflicting short aliases that was in the original initdb code.

Hi Tom,
I raised the duplication issue on the cygwin mailing list, and one of the
core developers reports that they saw the same issues on Linux in the
past.

https://cygwin.com/ml/cygwin/2017-06/msg00253.html

> 			regards, tom lane

Regards
Marco

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb initalization failure for collation "ja_JP"
On 18/06/2017 16:48, Tom Lane wrote:
> Marco Atzeri writes:
>> Building on Cygwin latest 10 beta1 or head source,
>> make check fails as:
>> ...
>> performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
>> CEST [16860] FATAL:  collation "ja_JP" for encoding "EUC_JP" already exists
>
> Hmph.  Could we see the results of "locale -a | grep ja_JP" ?
>
> 			regards, tom lane

$ locale -a | grep -i ja
ja_JP
ja_JP
ja_JP.utf8
ja_JP.ujis
ja_JP@cjknarrow
ja_JP.utf8@cjknarrow
japanese
japanese.euc
japanese.sjis

$ locale -a | grep -i "euc"
japanese.euc
korean.euc

Regards
Marco
Re: [HACKERS] initdb initalization failure for collation "ja_JP"
I wrote:
> Marco Atzeri writes:
>> Building on Cygwin latest 10 beta1 or head source,
>> make check fails as:
>> ...
>> performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
>> CEST [16860] FATAL:  collation "ja_JP" for encoding "EUC_JP" already exists

> Hmph.  Could we see the results of "locale -a | grep ja_JP" ?

Despite the lack of followup from the OP, I'm pretty troubled by this
report.  It shows that the reimplementation of OS collation data import
as pg_import_system_collations() is a whole lot more fragile than the
original coding.  We have never before trusted "locale -a" to not produce
duplicate outputs, not since the very beginning in 414c5a2e.  AFAICS, the
current coding has also lost the protections we added very shortly after
that in 853c1750f; and it has also lost the admittedly rather arbitrary,
but at least deterministic, preference order for conflicting short
aliases that was in the original initdb code.

I suppose the idea was to see whether we actually needed those defenses,
but since we have here a failure report after less than a month of beta,
it seems clear to me that we do.  I think we need to upgrade
pg_import_system_collations to have all the same logic that was there
before.

Now the hard part of that is that because pg_import_system_collations
isn't using a temporary staging table, but is just inserting directly
into pg_collation, there isn't any way for it to eliminate duplicates
unless it uses if_not_exists behavior all the time.  So there seem to
be two ways to proceed:

1. Drop pg_import_system_collations' if_not_exists argument and just
define it as adding any collations not already known in pg_collation.

2. Significantly rewrite it so that it de-dups the collation set by
hand before trying to insert into pg_collation.

#2 seems like a lot more work, but on the other hand, we might need most
of that logic anyway to get back deterministic alias handling.  However,
since I cannot see any real-world use case at all for if_not_exists =
false, I figure we might as well do #1 and take whatever simplification
we can get that way.

I'm willing to do the legwork on this, but before I start, does anyone
have any ideas or objections?

			regards, tom lane
Re: [HACKERS] initdb initalization failure for collation "ja_JP"
Marco Atzeri writes:
> Building on Cygwin latest 10 beta1 or head source,
> make check fails as:
> ...
> performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
> CEST [16860] FATAL:  collation "ja_JP" for encoding "EUC_JP" already exists

Hmph.  Could we see the results of "locale -a | grep ja_JP" ?

			regards, tom lane
[HACKERS] initdb initalization failure for collation "ja_JP"
Building on Cygwin latest 10 beta1 or head source,
make check fails as:

--------------------------- initdb.log ---------------------------
The database cluster will be initialized with locales
  COLLATE:  en_US.UTF-8
  CTYPE:    en_US.UTF-8
  MESSAGES: C
  MONETARY: en_US.UTF-8
  NUMERIC:  en_US.UTF-8
  TIME:     en_US.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /cygdrive/e/cyg_pub/devel/postgresql/prova/postgresql-10.0-0.1.x86_64/build/src/test/regress/./tmp_check/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 30
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... sysv
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... 2017-05-31 23:23:22.214
CEST [16860] FATAL:  collation "ja_JP" for encoding "EUC_JP" already exists
------------------------------------------------------------------

This does not happen on 9.6.x.
Any idea what to look at?

Regards
Marco