Re: GUCs to control abbreviated sort keys
On Fri, Jan 27, 2023 at 12:29 PM Jeff Davis wrote: > I am using these GUCs for testing the various collation paths in my > collation refactoring branch. Speaking of testing, has anyone ever tried porting Tom's random test program[1] to ICU? [1] https://www.postgresql.org/message-id/31913.1458747...@sss.pgh.pa.us
Re: GUCs to control abbreviated sort keys
On Fri, Jan 27, 2023 at 12:34 PM Jeff Davis wrote: > It is non-deterministic, but I tried with two generated files, and got > similar results. Jeff and I coordinated off-list. It turned out that the nondeterministic nature of the program to generate test data was behind my initial inability to recreate Jeff's results. Once Jeff provided me with the exact data that he saw the problem with, I was able to recreate the problematic case for abbreviated keys. It turns out that this was due to aborting abbreviation way too late in the process. It would happen relatively late in the process, when more than 50% of all tuples had already had abbreviations generated by ICU. This was a marginal case for abbreviated keys, which is precisely why it only happened this long into the process. That factor is also likely why I couldn't recreate the problem at first, even though I had test data that was substantially the same as the data required to show the problem. Attached patch fixes the issue. It teaches varstr_abbrev_abort to do something similar to every other abbreviated keys abort function: stop estimating cardinality entirely (give up on giving up) once there are a certain number of distinct abbreviated keys, regardless of any other factor. This is very closely based on existing code from numeric_abbrev_abort, though I use a cutoff of 10k rather than a cutoff of 100k. This difference is justified by the special considerations for text, where we authoritative comparisons have further optimizations such as strcoll caching and the memcmp equality fast path. It's also required to actually fix the test case at hand -- 100k isn't enough to avoid the performance issue Jeff reported. I think that this should be committed to HEAD only. -- Peter Geoghegan v1-0001-Abort-abbreviated-keys-for-text-less-eagerly.patch Description: Binary data
Re: GUCs to control abbreviated sort keys
On Fri, 2023-01-27 at 11:41 -0800, Peter Geoghegan wrote: > I cannot recreate the issue you describe. Interesting. For my test: glibc 2.35 ICU 70.1 gcc11.3.0LLVM 14.0.0 > It's not impossible that the perl program you wrote produces > non-deterministic output It is non-deterministic, but I tried with two generated files, and got similar results. Right now I suspect the ICU version might be the reason. I'll try with 72. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: GUCs to control abbreviated sort keys
On Thu, Jan 26, 2023 at 3:29 PM Jeff Davis wrote: > On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote: > > Maybe an easier way to enable or disable it in the source code with a > > #define would serve this. Making it a GUC right away seems a bit > > heavy-handed. Further exploration and tweaking might well require > > further source code changes, so relying on a source code level toggle > > would seem appropriate. > > I am using these GUCs for testing the various collation paths in my > collation refactoring branch. I'm fine with adding the GUC as a developer option. I think that there is zero chance of the changes to tuplesort.c having appreciable overhead. > I find them pretty useful, and when I saw a regression, I thought > others might think it was useful, too. But if not I'll just leave them > in my branch and withdraw from this thread. I cannot recreate the issue you describe. With abbreviated keys, your exact test case takes 00:16.620 on my system. Without abbreviated keys, it takes 00:21.255. To me it appears to be a moderately good case for abbreviated keys, though certainly not as good as some cases that I've seen -- ~3x improvements are common enough. As a point of reference, the same test case with the C collation and with abbreviated keys takes 00:10.822. When I look at the "trace_sort" output for the C collation with abbreviated keys, and compare it to the equivalent "trace_sort" output for the original "en-US-x-icu" collation from your test case, it is clear that the overhead of generating collated abbreviated keys within ICU is relatively high -- the initial scan of the table (which is where we generate all abbreviated keys here) takes 4.45 seconds in the ICU case, and only 1.65 seconds in the "C" locale case. I think that you should look into that same difference on your own system, so that we can compare and contrast. The underlying issue might well have something to do with the ICU version you're using, or some other detail of your environment. I'm using Debian unstable here. Postgres links to the system ICU, which is ICU 72. It's not impossible that the perl program you wrote produces non-deterministic output, which should be controlled for, since it might just be significant. I see this on my system, having run the perl program as outlined in your test case: $ ls -l /tmp/strings.txt -rw-r--r-- 1 pg pg 431886574 Jan 27 11:13 /tmp/strings.txt $ sha1sum /tmp/strings.txt 22f60dc12527c215c8e3992e49d31dc531261a83 /tmp/strings.txt Does that match what you see on your system? -- Peter Geoghegan
Re: GUCs to control abbreviated sort keys
On Wed, Jan 25, 2023 at 4:16 PM Jeff Davis wrote: > $ perl text_generator.pl 1000 10 > /tmp/strings.txt > > CREATE TABLE s (t TEXT); > COPY s FROM '/tmp/strings.txt'; > VACUUM FREEZE s; > CHECKPOINT; > SET work_mem='10GB'; > SET max_parallel_workers = 0; > SET max_parallel_workers_per_gather = 0; > > SET sort_abbreviated_keys = false; > EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu"; > -- 20875ms > > SET sort_abbreviated_keys = true; > EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu"; > -- 22931ms > > Regression for abbreviated keys optimization in this case: 9.8% That's interesting. Do you have any idea why this happens? I've been a bit busy the last few days so haven't had a chance to look at the test case until now. It seems like it's just a lorum ipsum generator, except that each line is made to contain a random number of words, and certain letters from the Latin alphabet are replaced with other symbols. But why is that a problem for abbreviated keys? The most obvious way for things to go wrong is for the first 8 bytes of the strxfrm() blob to be very low-entropy, but it's not really clear to me what about your test case would make that more likely. I guess another explanation could be if having a few non-ASCII characters mixed into the string makes strxfrm() a lot slower. -- Robert Haas EDB: http://www.enterprisedb.com
Re: GUCs to control abbreviated sort keys
On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote: > Maybe an easier way to enable or disable it in the source code with a > #define would serve this. Making it a GUC right away seems a bit > heavy-handed. Further exploration and tweaking might well require > further source code changes, so relying on a source code level toggle > would seem appropriate. I am using these GUCs for testing the various collation paths in my collation refactoring branch. I find them pretty useful, and when I saw a regression, I thought others might think it was useful, too. But if not I'll just leave them in my branch and withdraw from this thread. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: GUCs to control abbreviated sort keys
On 25.01.23 22:16, Jeff Davis wrote: I am highlighting this case because the existence of a single non- contrived case or regression suggests that we may want to explore further and tweak heuristics. That's quite natural when the heuristics are based on a complex dependency like a collation provider. The sort_abbreviated_keys GUC makes that kind of exploration and tweaking a lot easier. Maybe an easier way to enable or disable it in the source code with a #define would serve this. Making it a GUC right away seems a bit heavy-handed. Further exploration and tweaking might well require further source code changes, so relying on a source code level toggle would seem appropriate.
Re: GUCs to control abbreviated sort keys
On Tue, 2023-01-24 at 19:43 -0600, Justin Pryzby wrote: > I think "an optimization, if applicable" is either too terse, or > somehow > wrong. Maybe: > > > Enables or disables the use of abbreviated keys, a sort > > optimization... Done. > > + optimization could return wrong results. Set to > > + true if certain that > > strxfrm() > > + can be trusted. > > "if you are certain"; or "if it is ..." Done. Thank you, rebased patch attached. -- Jeff Davis PostgreSQL Contributor Team - AWS From 39ed011cc51ba3a4af5e3b559a7b8de25fb895a5 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Sat, 21 Jan 2023 12:44:07 -0800 Subject: [PATCH v2] Introduce GUCs to control abbreviated keys sort optimization. The setting sort_abbreviated_keys turns the optimization on or off overall. The optimization relies on collation providers, which are complex dependencies, and the performance of the optimization may rely on many factors. Introducing a GUC allows easier diagnosis when this optimization results in worse perforamnce. The setting trust_strxfrm replaces the define TRUST_STRXFRM, allowing users to experiment with the abbreviated keys optimization when using the libc provider. Previously, the optimization only applied to collations using the ICU provider unless specially compiled. By default, allowed only for superusers (because an incorrect setting could lead to wrong results), but can be granted to others. --- doc/src/sgml/config.sgml | 40 ++ src/backend/utils/adt/varlena.c| 10 +++--- src/backend/utils/misc/guc_tables.c| 24 + src/backend/utils/sort/tuplesortvariants.c | 17 ++--- 4 files changed, 82 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index f985afc009..8f55b89f35 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -11252,6 +11252,46 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' + + sort_abbreviated_keys (boolean) + + sort_abbreviated_keys configuration parameter + + + + +Enables or disables the use of abbreviated sort keys, a sort optimization, +if applicable. The default is true. Disabling may +be useful to diagnose problems or measure performance. + + + + + + trust_strxfrm (boolean) + + trust_strxfrm configuration parameter + + + + +Abbreviated keys, a sort optimization, depends on correct behavior of +the operating system function strxfrm() when +using a collation with the libc provider. On some +platforms strxfrm() does not return results +consistent with strcoll(), which means the +optimization could return wrong results. Set to +true if it is certain that +strxfrm() can be trusted. + + +The default value is false. This setting has no +effect if is set to +false. + + + + trace_locks (boolean) diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index fd81c47474..c270022483 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -45,6 +45,9 @@ /* GUC variable */ int bytea_output = BYTEA_OUTPUT_HEX; +/* GUC to enable use of strxfrm() for abbreviated keys */ +bool trust_strxfrm = false; + typedef struct varlena unknown; typedef struct varlena VarString; @@ -2115,7 +2118,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid) * other libc other than Cygwin has so far been shown to have a problem, * we take the conservative course of action for right now and disable * this categorically. (Users who are certain this isn't a problem on - * their system can define TRUST_STRXFRM.) + * their system can set the trust_strxfrm GUC to true.) * * Even apart from the risk of broken locales, it's possible that there * are platforms where the use of abbreviated keys should be disabled at @@ -2128,10 +2131,9 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid) * categorically, we may still want or need to disable it for particular * platforms. */ -#ifndef TRUST_STRXFRM - if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU)) + if (!trust_strxfrm && !collate_c && + !(locale && locale->provider == COLLPROVIDER_ICU)) abbreviate = false; -#endif /* * If we're using abbreviated keys, or if we're using a locale-aware diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 4ac808ed22..fd4a02fbf5 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -102,6 +102,8 @@ extern bool trace_syncscan; #i
Re: GUCs to control abbreviated sort keys
On Tue, 2023-01-24 at 21:42 -0500, Robert Haas wrote: > I find it a bit premature to include this comment in the very first > email what if other people don't like the idea? The trust_strxfrm GUC was pulled from the larger collation refactoring patch, which has been out for a while. The sort_abbreviated_keys GUC is new, and I posted these both in a new thread because they started to look independently useful. If someone doesn't like the idea, they are free to comment, like in every other case (though this patch doesn't seem very controversial to me?). I suppose the wording was off-putting, so I'll choose different words next time. > I would like to hear about the cases where abbreviated keys resulted > in a regression. I want to be clear that this is not a general criticism of the abbreviated keys optimization, nor a comprehensive analysis of its performance. I am highlighting this case because the existence of a single non- contrived case or regression suggests that we may want to explore further and tweak heuristics. That's quite natural when the heuristics are based on a complex dependency like a collation provider. The sort_abbreviated_keys GUC makes that kind of exploration and tweaking a lot easier. Built with meson on linux, gcc 11.3.0, opt -O3. Times are the middle of three runs, taken from the sort operator's "first returned tuple" time in EXPLAIN ANALYZE. Total runtime (as reported in EXPLAIN ANALYZE) is pretty much the same story, but I think there was slightly more noise in that number. $ perl text_generator.pl 1000 10 > /tmp/strings.txt CREATE TABLE s (t TEXT); COPY s FROM '/tmp/strings.txt'; VACUUM FREEZE s; CHECKPOINT; SET work_mem='10GB'; SET max_parallel_workers = 0; SET max_parallel_workers_per_gather = 0; SET sort_abbreviated_keys = false; EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu"; -- 20875ms SET sort_abbreviated_keys = true; EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu"; -- 22931ms Regression for abbreviated keys optimization in this case: 9.8% > I'd also like to know whether there's a realistic possibility that > making this a run-time test could itself result in a regression. The sort_abbreviated_keys branch is happening after tuplesort_begin_common (which creates memory contexts, etc.) and before preparing the sort keys (which involves catalog lookups). The trust_strxfrm branch is happening in the type-specific sort support function, which needs to be looked up in the catalog before being called (using V1 calling convention). It doesn't look likely that a single branch in that path will have a perf impact. Do you have a more specific concern? -- Jeff Davis PostgreSQL Contributor Team - AWS text_generator.pl Description: Perl program
Re: GUCs to control abbreviated sort keys
On Sat, Jan 21, 2023 at 8:16 PM Jeff Davis wrote: > This is fairly simple, so I plan to commit soon. I find it a bit premature to include this comment in the very first email what if other people don't like the idea? I would like to hear about the cases where abbreviated keys resulted in a regression. I'd also like to know whether there's a realistic possibility that making this a run-time test could itself result in a regression. -- Robert Haas EDB: http://www.enterprisedb.com
Re: GUCs to control abbreviated sort keys
On Sat, Jan 21, 2023 at 05:16:01PM -0800, Jeff Davis wrote: > + xreflabel="sort_abbreviated_keys"> > + sort_abbreviated_keys (boolean) > + > + sort_abbreviated_keys configuration > parameter > + > + > + > + > +Enables or disables the use of abbreviated sort keys, an > optimization, > +if applicable. The default is true. Disabling may I think "an optimization, if applicable" is either too terse, or somehow wrong. Maybe: | Enables or disables the use of abbreviated keys, a sort optimization... > +optimization could return wrong results. Set to > +true if certain that > strxfrm() > +can be trusted. "if you are certain"; or "if it is ..." -- Justin
GUCs to control abbreviated sort keys
The attached patch adds GUCs to control the use of the abbreviated keys optimization when sorting. Also, I changed the TRUST_STRXFRM from a #define into a GUC. One reason for these GUCs is to make it easier to diagnose any issues that come up with my collation work. Another is that I observed cases with ICU where the abbreviated keys optimization resulted in a ~8-10% regression, and it would be good to have some basic control over it. I made them developer options because they are more about diagnosing and I don't expect users to change these in production. If the issues with abbreviated keys get more complex (where maybe we need to consider costing each provider?), we can make it more user-facing. This is fairly simple, so I plan to commit soon. -- Jeff Davis PostgreSQL Contributor Team - AWS From 7699f28634f04772c718ac465bbbff48b849f2bc Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Sat, 21 Jan 2023 12:44:07 -0800 Subject: [PATCH v1] Introduce GUCs to control abbreviated keys sort optimization. The setting sort_abbreviated_keys turns the optimization on or off overall. The optimization relies on collation providers, which are complex dependencies, and the performance of the optimization may rely on many factors. Introducing a GUC allows easier diagnosis when this optimization results in worse perforamnce. The setting trust_strxfrm replaces the define TRUST_STRXFRM, allowing users to experiment with the abbreviated keys optimization when using the libc provider. Previously, the optimization only applied to collations using the ICU provider unless specially compiled. By default, allowed only for superusers (because an incorrect setting could lead to wrong results), but can be granted to others. --- doc/src/sgml/config.sgml | 40 ++ src/backend/utils/adt/varlena.c| 10 +++--- src/backend/utils/misc/guc_tables.c| 24 + src/backend/utils/sort/tuplesortvariants.c | 17 ++--- 4 files changed, 82 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index dc9b78b0b7..47ee3ec6e7 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -11248,6 +11248,46 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' + + sort_abbreviated_keys (boolean) + + sort_abbreviated_keys configuration parameter + + + + +Enables or disables the use of abbreviated sort keys, an optimization, +if applicable. The default is true. Disabling may +be useful to diagnose problems or measure performance. + + + + + + trust_strxfrm (boolean) + + trust_strxfrm configuration parameter + + + + +Abbreviated keys, a sort optimization, depends on correct behavior of +the operating system function strxfrm() when +using a collation with the libc provider. On some +platforms strxfrm() does not return results +consistent with strcoll(), which means the +optimization could return wrong results. Set to +true if certain that strxfrm() +can be trusted. + + +The default value is false. This setting has no +effect if is set to +false. + + + + trace_locks (boolean) diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index 33ffdb013a..8b5b467205 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -45,6 +45,9 @@ /* GUC variable */ int bytea_output = BYTEA_OUTPUT_HEX; +/* GUC to enable use of strxfrm() for abbreviated keys */ +bool trust_strxfrm = false; + typedef struct varlena unknown; typedef struct varlena VarString; @@ -2092,7 +2095,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid) * other libc other than Cygwin has so far been shown to have a problem, * we take the conservative course of action for right now and disable * this categorically. (Users who are certain this isn't a problem on - * their system can define TRUST_STRXFRM.) + * their system can set the trust_strxfrm GUC to true.) * * Even apart from the risk of broken locales, it's possible that there * are platforms where the use of abbreviated keys should be disabled at @@ -2105,10 +2108,9 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid) * categorically, we may still want or need to disable it for particular * platforms. */ -#ifndef TRUST_STRXFRM - if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU)) + if (!trust_strxfrm && !collate_c && + !(locale && locale->provider == COLLPROVIDER_ICU)) abbreviate = false; -#endif /* * If we're using abbreviated keys, or if we're using a locale-aware diff --git a/src/backend/utils/misc/guc_tables.c b/src/ba