Re: GUCs to control abbreviated sort keys

2023-02-06 Thread Thomas Munro
On Fri, Jan 27, 2023 at 12:29 PM Jeff Davis  wrote:
> I am using these GUCs for testing the various collation paths in my
> collation refactoring branch.

Speaking of testing, has anyone ever tried porting Tom's random test
program[1] to ICU?

[1] https://www.postgresql.org/message-id/31913.1458747...@sss.pgh.pa.us




Re: GUCs to control abbreviated sort keys

2023-02-06 Thread Peter Geoghegan
On Fri, Jan 27, 2023 at 12:34 PM Jeff Davis  wrote:
> It is non-deterministic, but I tried with two generated files, and got
> similar results.

Jeff and I coordinated off-list. It turned out that the
nondeterministic nature of the program to generate test data was
behind my initial inability to recreate Jeff's results. Once Jeff
provided me with the exact data that he saw the problem with, I was
able to recreate the problematic case for abbreviated keys.

It turns out that this was due to aborting abbreviation way too late
in the process. It would happen relatively late in the process, when
more than 50% of all tuples had already had abbreviations generated by
ICU. This was a marginal case for abbreviated keys, which is precisely
why it only happened this long into the process. That factor is also
likely why I couldn't recreate the problem at first, even though I had
test data that was substantially the same as the data required to show
the problem.

Attached patch fixes the issue. It teaches varstr_abbrev_abort to do
something similar to every other abbreviated keys abort function: stop
estimating cardinality entirely (give up on giving up) once there are
a certain number of distinct abbreviated keys, regardless of any other
factor.

This is very closely based on existing code from numeric_abbrev_abort,
though I use a cutoff of 10k rather than a cutoff of 100k. This
difference is justified by the special considerations for text, where
we authoritative comparisons have further optimizations such as
strcoll caching and the memcmp equality fast path. It's also required
to actually fix the test case at hand -- 100k isn't enough to avoid
the performance issue Jeff reported.

I think that this should be committed to HEAD only.

-- 
Peter Geoghegan


v1-0001-Abort-abbreviated-keys-for-text-less-eagerly.patch
Description: Binary data


Re: GUCs to control abbreviated sort keys

2023-01-27 Thread Jeff Davis
On Fri, 2023-01-27 at 11:41 -0800, Peter Geoghegan wrote:
> I cannot recreate the issue you describe.

Interesting. For my test:

glibc  2.35  ICU 70.1
gcc11.3.0LLVM 14.0.0

> It's not impossible that the perl program you wrote produces
> non-deterministic output

It is non-deterministic, but I tried with two generated files, and got
similar results.

Right now I suspect the ICU version might be the reason. I'll try with
72.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: GUCs to control abbreviated sort keys

2023-01-27 Thread Peter Geoghegan
On Thu, Jan 26, 2023 at 3:29 PM Jeff Davis  wrote:
> On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote:
> > Maybe an easier way to enable or disable it in the source code with a
> > #define would serve this.  Making it a GUC right away seems a bit
> > heavy-handed.  Further exploration and tweaking might well require
> > further source code changes, so relying on a source code level toggle
> > would seem appropriate.
>
> I am using these GUCs for testing the various collation paths in my
> collation refactoring branch.

I'm fine with adding the GUC as a developer option. I think that there
is zero chance of the changes to tuplesort.c having appreciable
overhead.

> I find them pretty useful, and when I saw a regression, I thought
> others might think it was useful, too. But if not I'll just leave them
> in my branch and withdraw from this thread.

I cannot recreate the issue you describe. With abbreviated keys, your
exact test case takes 00:16.620 on my system. Without abbreviated
keys, it takes 00:21.255.

To me it appears to be a moderately good case for abbreviated keys,
though certainly not as good as some cases that I've seen -- ~3x
improvements are common enough.

As a point of reference, the same test case with the C collation and
with abbreviated keys takes 00:10.822. When I look at the "trace_sort"
output for the C collation with abbreviated keys, and compare it to
the equivalent "trace_sort" output for the original "en-US-x-icu"
collation from your test case, it is clear that the overhead of
generating collated abbreviated keys within ICU is relatively high --
the initial scan of the table (which is where we generate all
abbreviated keys here) takes 4.45 seconds in the ICU case, and only
1.65 seconds in the "C" locale case. I think that you should look into
that same difference on your own system, so that we can compare and
contrast.

The underlying issue might well have something to do with the ICU
version you're using, or some other detail of your environment. I'm
using Debian unstable here. Postgres links to the system ICU, which is
ICU 72.

It's not impossible that the perl program you wrote produces
non-deterministic output, which should be controlled for, since it
might just be significant. I see this on my system, having run the
perl program as outlined in your test case:

$ ls -l /tmp/strings.txt
-rw-r--r-- 1 pg pg 431886574 Jan 27 11:13 /tmp/strings.txt
$ sha1sum /tmp/strings.txt
22f60dc12527c215c8e3992e49d31dc531261a83  /tmp/strings.txt

Does that match what you see on your system?

--
Peter Geoghegan




Re: GUCs to control abbreviated sort keys

2023-01-27 Thread Robert Haas
On Wed, Jan 25, 2023 at 4:16 PM Jeff Davis  wrote:
> $ perl text_generator.pl 1000 10 > /tmp/strings.txt
>
> CREATE TABLE s (t TEXT);
> COPY s FROM '/tmp/strings.txt';
> VACUUM FREEZE s;
> CHECKPOINT;
> SET work_mem='10GB';
> SET max_parallel_workers = 0;
> SET max_parallel_workers_per_gather = 0;
>
> SET sort_abbreviated_keys = false;
> EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";
> -- 20875ms
>
> SET sort_abbreviated_keys = true;
> EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";
> -- 22931ms
>
> Regression for abbreviated keys optimization in this case: 9.8%

That's interesting. Do you have any idea why this happens?

I've been a bit busy the last few days so haven't had a chance to look
at the test case until now. It seems like it's just a lorum ipsum
generator, except that each line is made to contain a random number of
words, and certain letters from the Latin alphabet are replaced with
other symbols. But why is that a problem for abbreviated keys? The
most obvious way for things to go wrong is for the first 8 bytes of
the strxfrm() blob to be very low-entropy, but it's not really clear
to me what about your test case would make that more likely. I guess
another explanation could be if having a few non-ASCII characters
mixed into the string makes strxfrm() a lot slower.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: GUCs to control abbreviated sort keys

2023-01-26 Thread Jeff Davis
On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote:
> Maybe an easier way to enable or disable it in the source code with a
> #define would serve this.  Making it a GUC right away seems a bit 
> heavy-handed.  Further exploration and tweaking might well require 
> further source code changes, so relying on a source code level toggle
> would seem appropriate.

I am using these GUCs for testing the various collation paths in my
collation refactoring branch.

I find them pretty useful, and when I saw a regression, I thought
others might think it was useful, too. But if not I'll just leave them
in my branch and withdraw from this thread.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: GUCs to control abbreviated sort keys

2023-01-26 Thread Peter Eisentraut

On 25.01.23 22:16, Jeff Davis wrote:

I am highlighting this case because the existence of a single non-
contrived case or regression suggests that we may want to explore
further and tweak heuristics. That's quite natural when the heuristics
are based on a complex dependency like a collation provider. The
sort_abbreviated_keys GUC makes that kind of exploration and tweaking a
lot easier.


Maybe an easier way to enable or disable it in the source code with a 
#define would serve this.  Making it a GUC right away seems a bit 
heavy-handed.  Further exploration and tweaking might well require 
further source code changes, so relying on a source code level toggle 
would seem appropriate.






Re: GUCs to control abbreviated sort keys

2023-01-25 Thread Jeff Davis
On Tue, 2023-01-24 at 19:43 -0600, Justin Pryzby wrote:
> I think "an optimization, if applicable" is either too terse, or
> somehow
> wrong.  Maybe:
> 
> > Enables or disables the use of abbreviated keys, a sort
> > optimization...

Done.

> > +    optimization could return wrong results. Set to
> > +    true if certain that
> > strxfrm()
> > +    can be trusted.
> 
> "if you are certain"; or "if it is ..."

Done.

Thank you, rebased patch attached.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From 39ed011cc51ba3a4af5e3b559a7b8de25fb895a5 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sat, 21 Jan 2023 12:44:07 -0800
Subject: [PATCH v2] Introduce GUCs to control abbreviated keys sort
 optimization.

The setting sort_abbreviated_keys turns the optimization on or off
overall. The optimization relies on collation providers, which are
complex dependencies, and the performance of the optimization may rely
on many factors. Introducing a GUC allows easier diagnosis when this
optimization results in worse perforamnce.

The setting trust_strxfrm replaces the define TRUST_STRXFRM, allowing
users to experiment with the abbreviated keys optimization when using
the libc provider. Previously, the optimization only applied to
collations using the ICU provider unless specially compiled. By
default, allowed only for superusers (because an incorrect setting
could lead to wrong results), but can be granted to others.
---
 doc/src/sgml/config.sgml   | 40 ++
 src/backend/utils/adt/varlena.c| 10 +++---
 src/backend/utils/misc/guc_tables.c| 24 +
 src/backend/utils/sort/tuplesortvariants.c | 17 ++---
 4 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f985afc009..8f55b89f35 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11252,6 +11252,46 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
   
  
 
+ 
+  sort_abbreviated_keys (boolean)
+  
+   sort_abbreviated_keys configuration parameter
+  
+  
+  
+   
+Enables or disables the use of abbreviated sort keys, a sort optimization,
+if applicable. The default is true. Disabling may
+be useful to diagnose problems or measure performance.
+   
+  
+ 
+
+ 
+  trust_strxfrm (boolean)
+  
+   trust_strxfrm configuration parameter
+  
+  
+  
+   
+Abbreviated keys, a sort optimization, depends on correct behavior of
+the operating system function strxfrm() when
+using a collation with the libc provider. On some
+platforms strxfrm() does not return results
+consistent with strcoll(), which means the
+optimization could return wrong results. Set to
+true if it is certain that
+strxfrm() can be trusted.
+   
+   
+The default value is false. This setting has no
+effect if  is set to
+false.
+   
+  
+ 
+
  
   trace_locks (boolean)
   
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index fd81c47474..c270022483 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -45,6 +45,9 @@
 /* GUC variable */
 int			bytea_output = BYTEA_OUTPUT_HEX;
 
+/* GUC to enable use of strxfrm() for abbreviated keys */
+bool		trust_strxfrm = false;
+
 typedef struct varlena unknown;
 typedef struct varlena VarString;
 
@@ -2115,7 +2118,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * other libc other than Cygwin has so far been shown to have a problem,
 	 * we take the conservative course of action for right now and disable
 	 * this categorically.  (Users who are certain this isn't a problem on
-	 * their system can define TRUST_STRXFRM.)
+	 * their system can set the trust_strxfrm GUC to true.)
 	 *
 	 * Even apart from the risk of broken locales, it's possible that there
 	 * are platforms where the use of abbreviated keys should be disabled at
@@ -2128,10 +2131,9 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * categorically, we may still want or need to disable it for particular
 	 * platforms.
 	 */
-#ifndef TRUST_STRXFRM
-	if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
+	if (!trust_strxfrm && !collate_c &&
+		!(locale && locale->provider == COLLPROVIDER_ICU))
 		abbreviate = false;
-#endif
 
 	/*
 	 * If we're using abbreviated keys, or if we're using a locale-aware
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4ac808ed22..fd4a02fbf5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -102,6 +102,8 @@ extern bool trace_syncscan;
 #ifdef DEBUG_BOUNDED_SORT
 extern bool optimize_bounded_sort;
 #endif
+extern bool sort_abbreviated_keys;
+extern bool trust_strxfrm;
 

Re: GUCs to control abbreviated sort keys

2023-01-25 Thread Jeff Davis
On Tue, 2023-01-24 at 21:42 -0500, Robert Haas wrote:
> I find it a bit premature to include this comment in the very first
> email what if other people don't like the idea?

The trust_strxfrm GUC was pulled from the larger collation refactoring
patch, which has been out for a while. The sort_abbreviated_keys GUC is
new, and I posted these both in a new thread because they started to
look independently useful.

If someone doesn't like the idea, they are free to comment, like in
every other case (though this patch doesn't seem very controversial to
me?). I suppose the wording was off-putting, so I'll choose different
words next time.

> I would like to hear about the cases where abbreviated keys resulted
> in a regression.

I want to be clear that this is not a general criticism of the
abbreviated keys optimization, nor a comprehensive analysis of its
performance.

I am highlighting this case because the existence of a single non-
contrived case or regression suggests that we may want to explore
further and tweak heuristics. That's quite natural when the heuristics
are based on a complex dependency like a collation provider. The
sort_abbreviated_keys GUC makes that kind of exploration and tweaking a
lot easier.

Built with meson on linux, gcc 11.3.0, opt -O3. Times are the middle of
three runs, taken from the sort operator's "first returned tuple" time
in EXPLAIN ANALYZE. Total runtime (as reported in EXPLAIN ANALYZE) is
pretty much the same story, but I think there was slightly more noise
in that number.

$ perl text_generator.pl 1000 10 > /tmp/strings.txt

CREATE TABLE s (t TEXT);
COPY s FROM '/tmp/strings.txt';
VACUUM FREEZE s;
CHECKPOINT;
SET work_mem='10GB';
SET max_parallel_workers = 0;
SET max_parallel_workers_per_gather = 0;

SET sort_abbreviated_keys = false;
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";
-- 20875ms

SET sort_abbreviated_keys = true;
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";
-- 22931ms

Regression for abbreviated keys optimization in this case: 9.8%

> I'd also like to know whether there's a realistic possibility that
> making this a run-time test could itself result in a regression.

The sort_abbreviated_keys branch is happening after
tuplesort_begin_common (which creates memory contexts, etc.) and before
preparing the sort keys (which involves catalog lookups). The
trust_strxfrm branch is happening in the type-specific sort support
function, which needs to be looked up in the catalog before being
called (using V1 calling convention).

It doesn't look likely that a single branch in that path will have a
perf impact. Do you have a more specific concern?

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS




text_generator.pl
Description: Perl program


Re: GUCs to control abbreviated sort keys

2023-01-24 Thread Robert Haas
On Sat, Jan 21, 2023 at 8:16 PM Jeff Davis  wrote:
> This is fairly simple, so I plan to commit soon.

I find it a bit premature to include this comment in the very first
email what if other people don't like the idea?

I would like to hear about the cases where abbreviated keys resulted
in a regression.

I'd also like to know whether there's a realistic possibility that
making this a run-time test could itself result in a regression.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: GUCs to control abbreviated sort keys

2023-01-24 Thread Justin Pryzby
On Sat, Jan 21, 2023 at 05:16:01PM -0800, Jeff Davis wrote:

> +  xreflabel="sort_abbreviated_keys">
> +  sort_abbreviated_keys (boolean)
> +  
> +   sort_abbreviated_keys configuration 
> parameter
> +  
> +  
> +  
> +   
> +Enables or disables the use of abbreviated sort keys, an 
> optimization,
> +if applicable. The default is true. Disabling may

I think "an optimization, if applicable" is either too terse, or somehow
wrong.  Maybe:

| Enables or disables the use of abbreviated keys, a sort optimization...

> +optimization could return wrong results. Set to
> +true if certain that 
> strxfrm()
> +can be trusted.

"if you are certain"; or "if it is ..."

-- 
Justin