Re: Rework of collation code, extensibility

2023-02-23 Thread Jeff Davis
On Wed, 2023-02-22 at 20:49 +0100, Peter Eisentraut wrote:
> > On 14.02.23 00:45, Jeff Davis wrote:
> > > > Now the patches are:
> > > > 
> > > >  0001: pg_strcoll/pg_strxfrm
> > > >  0002: pg_locale_deterministic()
> > > >  0003: cleanup a USE_ICU special case
> > > >  0004: GUCs (only for testing, not for commit)
> > 
> > I haven't read the whole thing again, but this arrangement looks
> > good > to 
> > me.  I don't have an opinion on whether 0004 is actually useful.

Committed with a few revisions after I took a fresh look over the
patch.

The most significant was that I found out that we are also hashing the
NUL byte at the end of the string when the collation is non-
deterministic. The refactoring patch doesn't change that of course, but
the API from pg_strnxfrm() is more clear and I added comments.

Also, ICU uses int32_t for string lengths rather than size_t (I'm not
sure that's a great idea, but that's what ICU does). I clarified the
boundary by changing the argument types of the ICU-specific static
functions to int32_t, while leaving the API entry points as size_t.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Rework of collation code, extensibility

2023-02-22 Thread Peter Eisentraut

On 14.02.23 00:45, Jeff Davis wrote:

Now the patches are:

 0001: pg_strcoll/pg_strxfrm
 0002: pg_locale_deterministic()
 0003: cleanup a USE_ICU special case
 0004: GUCs (only for testing, not for commit)


I haven't read the whole thing again, but this arrangement looks good to 
me.  I don't have an opinion on whether 0004 is actually useful.






Re: Rework of collation code, extensibility

2023-02-20 Thread Jeff Davis
On Mon, 2023-02-13 at 15:45 -0800, Jeff Davis wrote:
> New version attached. Changes:

These patches, especially 0001, have been around for a while, and
they've received some review attention with no outstanding TODOs that
I'm aware of.

I plan to commit v10 (or something close to it) soon unless someone has
additional feedback.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Rework of collation code, extensibility

2023-02-13 Thread Jeff Davis
New version attached. Changes:

* I moved the GUC patch to the end (so you can ignore it if it's not
useful for review)
* I cut out the pg_locale_internal.h rearrangement (at least for now,
it might seem useful after the dust settles on the other changes).
* I added a separate patch for pg_locale_deterministic().
* I added a separate patch for a simple cleanup of a USE_ICU special
case.

Now the patches are:

0001: pg_strcoll/pg_strxfrm
0002: pg_locale_deterministic()
0003: cleanup a USE_ICU special case
0004: GUCs (only for testing, not for commit)

Responses to your review comments inline below:

On Mon, 2023-02-13 at 11:35 +0100, Peter Eisentraut wrote:
> The commit message for 0002 says "Also remove the TRUST_STRXFRM
> define", 
> but I think this is incorrect, as that is done in the 0001 patch.

Fixed.

> I don't like that the pg_strnxfrm() function requires these kinds of 
> repetitive error checks:
> 
> +   if (rsize != bsize)
> +   elog(ERROR, "pg_strnxfrm() returned unexpected
> result");
> 
> This could be checked inside the function itself, so that the callers
> don't have to do this themselves every time.

The current API allows for a pattern like:

   /* avoids extra work if existing buffer is big enough */
   len = pg_strxfrm(buf, src, bufSize, loc);
   if (len >= bufSize)
   {
   buf = repalloc(len+1);
   bufSize = len+1;
   len2 = pg_strxfrm(buf, src, bufSize, loc);
   }

The test for rsize != bsize are just there to check that the underlying
library calls (strxfrm or getSortKey) behave as documented, and we
expect that they'd never be hit. It's hard to move that kind of check
into pg_strxfrm() without making it also manage the buffers.

Do you have a more specific suggestion? I'd like to keep the API
flexible enough that the caller can manage the buffers, like with
abbreviated keys. Perhaps the check can just be removed if we trust
that the library functions at least get the size calculation right? Or
turned into an Assert?


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From 5fedf2efffa4a75d8de974c41af084c5a028551d Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 1 Dec 2022 14:45:15 -0800
Subject: [PATCH v10 1/4] Add pg_strcoll(), pg_strxfrm(), and variants.

Towards multi-lib ICU, which should be based on a clean separation of
the routines required for collation providers. Also offers a generally
better separation of responsibilities.

Callers with NUL-terminated strings should call pg_strcoll() or
pg_strxfrm(); callers with strings and their length should call the
variants pg_strncoll() or pg_strnxfrm().
---
 src/backend/access/hash/hashfunc.c |  45 +-
 src/backend/utils/adt/pg_locale.c  | 784 -
 src/backend/utils/adt/varchar.c|  41 +-
 src/backend/utils/adt/varlena.c| 368 ++
 src/include/utils/pg_locale.h  |  13 +
 5 files changed, 863 insertions(+), 388 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index e3e40d6c21..c0ed995919 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -292,21 +292,19 @@ hashtext(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t*buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-	uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-			uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -350,21 +348,20 @@ hashtextextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t*buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-	uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-			uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+elog(ERROR, "pg_strnxfrm() returned unexpected result");
+

Re: Rework of collation code, extensibility

2023-02-13 Thread Peter Eisentraut

On 01.02.23 00:33, Jeff Davis wrote:

On Tue, 2023-01-31 at 11:40 +0100, Peter Eisentraut wrote:

I don't know to what extent this depends on the abbreviated key GUC
discussion.  Does the rest of this patch set depend on this?


The overall refactoring is not dependent logically on the GUC patch. It
may require some trivial fixup if you eliminate the GUC patch.

I left it there because it makes exploring/testing easier (at least for
me), but the GUC patch doesn't need to be committed if there's no
consensus.


I took another closer look at the 0002 and 0003 patches.

The commit message for 0002 says "Also remove the TRUST_STRXFRM define", 
but I think this is incorrect, as that is done in the 0001 patch.


I don't like that the pg_strnxfrm() function requires these kinds of 
repetitive error checks:


+   if (rsize != bsize)
+   elog(ERROR, "pg_strnxfrm() returned unexpected result");

This could be checked inside the function itself, so that the callers 
don't have to do this themselves every time.


I don't really understand the 0003 patch.  It's a lot of churn but I'm 
not sure that it achieves more clarity or something.


The new function pg_locale_deterministic() seems sensible.  Maybe this 
could be proposed as a separate patch.


I don't understand the new header pg_locale_internal.h.  What is 
"internal" and what is not internal?  What are we hiding from whom? 
There are no code comments about this AFAICT.


pg_locale_struct has new fields

+   char   *collate;
+   char   *ctype;

that are not explained anywhere.

I think this patch would need a bit more explanation and commenting.





Re: Rework of collation code, extensibility

2023-01-31 Thread Jeff Davis
On Tue, 2023-01-31 at 11:40 +0100, Peter Eisentraut wrote:
> I don't know to what extent this depends on the abbreviated key GUC 
> discussion.  Does the rest of this patch set depend on this?

The overall refactoring is not dependent logically on the GUC patch. It
may require some trivial fixup if you eliminate the GUC patch.

I left it there because it makes exploring/testing easier (at least for
me), but the GUC patch doesn't need to be committed if there's no
consensus.

Regards,
Jeff Davis






Re: Rework of collation code, extensibility

2023-01-31 Thread Peter Eisentraut

On 27.01.23 00:47, Jeff Davis wrote:

I'm hoping to commit 0002 and 0003 soon-ish, maybe a week or two,
please let me know if you want me to hold off. (I won't commit the GUCs
unless others find them generally useful; they are included here to
more easily reproduce my performance tests.)


I have looked a bit at 0002 and 0003.  I like the direction.  I'll spend 
a bit more time reviewing it in detail.  It moves a lot of code around.


I don't know to what extent this depends on the abbreviated key GUC 
discussion.  Does the rest of this patch set depend on this?






Re: Rework of collation code, extensibility

2023-01-26 Thread Jeff Davis

Attached v9 and added perf numbers below.

I'm hoping to commit 0002 and 0003 soon-ish, maybe a week or two,
please let me know if you want me to hold off. (I won't commit the GUCs
unless others find them generally useful; they are included here to
more easily reproduce my performance tests.)

My primary motivation is still related to
https://commitfest.postgresql.org/41/3956/ but the combination of
cleaner code and a performance boost seems like reasonable
justification for this patch set independently.

There aren't any clear open items on this patch. Peter Eisentraut asked
me to focus this thread on the refactoring, which I've done by reducing
it to 2 patches, and I left multilib ICU up to the other thread. He
also questioned the increased line count, but I think the currently-low
line count is due to bad style. PeterG provided some review comments,
in particular when to do the tiebreaking, which I addressed.

This patch has been around for a while, so I'll take a fresh look and
see if I see risk areas, and re-run a few sanity checks. Of course more
feedback would also be welcome.

PERFORMANCE:

==
Setup:
==

base: master with v9-0001 applied (GUCs only)
refactor: master with v9-0001, v9-0002, v9-0003 applied

Note that I wasn't able to see any performance difference between the
base and master, v9-0001 just adds some GUCs to make testing easier.

glibc  2.35  ICU 70.1
gcc11.3.0LLVM 14.0.0

built with meson (uses -O3)

$ perl text_generator.pl 1000 10 > /tmp/strings.utf8.txt

CREATE TABLE s (t TEXT);
COPY s FROM '/tmp/strings.utf8.txt';
VACUUM FREEZE s;
CHECKPOINT;
SET work_mem='10GB';
SET max_parallel_workers = 0;
SET max_parallel_workers_per_gather = 0;

=
Test queries:
=

EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "C";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en_US";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";

Timings are measured as the milliseconds to return the first tuple from
the Sort operator (as reported in EXPLAIN ANALYZE). Median of three
runs.


Results:


  baserefactor   speedup

sort_abbreviated_keys=false:
  C   73777273  1.4%
  en_US  35081   35090  0.0%
  en-US-x-ixu20520   19465  5.4%

sort_abbreviated_keys=true:
  C   81058008  1.2%
  en_US  35067   34850  0.6%
  en-US-x-icu22626   21507  5.2%

===
Conclusion:
===

These numbers can move +/-1 percentage point, so I'd interpret anything
less than that as noise. This happens to be the first run where all the
numbers favored the refactoring patch, but it is generally consistent
with what I had seen before.

The important part is that, for ICU, it appears to be a substantial
speedup when using meson (-O3).

Also, when/if the multilib ICU support goes in, that may lose some of
these gains due to an extra indirect function call.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS




text_generator.pl
Description: Perl program
From 39ed011cc51ba3a4af5e3b559a7b8de25fb895a5 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sat, 21 Jan 2023 12:44:07 -0800
Subject: [PATCH v9 1/3] Introduce GUCs to control abbreviated keys sort
 optimization.

The setting sort_abbreviated_keys turns the optimization on or off
overall. The optimization relies on collation providers, which are
complex dependencies, and the performance of the optimization may rely
on many factors. Introducing a GUC allows easier diagnosis when this
optimization results in worse perforamnce.

The setting trust_strxfrm replaces the define TRUST_STRXFRM, allowing
users to experiment with the abbreviated keys optimization when using
the libc provider. Previously, the optimization only applied to
collations using the ICU provider unless specially compiled. By
default, allowed only for superusers (because an incorrect setting
could lead to wrong results), but can be granted to others.
---
 doc/src/sgml/config.sgml   | 40 ++
 src/backend/utils/adt/varlena.c| 10 +++---
 src/backend/utils/misc/guc_tables.c| 24 +
 src/backend/utils/sort/tuplesortvariants.c | 17 ++---
 4 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f985afc009..8f55b89f35 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11252,6 +11252,46 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
   
  
 
+ 
+  sort_abbreviated_keys (boolean)
+  
+   sort_abbreviated_keys configuration parameter
+  
+  
+  
+   
+Enables or disables the use of abbreviated sort keys, a sort optimization,
+if applicable. The default is true. Disabling may
+be useful to 

Re: Rework of collation code, extensibility

2023-01-20 Thread Jeff Davis
On Fri, 2023-01-20 at 12:54 -0800, Jeff Davis wrote:
> Both of these are surprising, and I haven't investigated deeply yet.

It's just because autoconf defaults to -O2 and meson to -O3, at least
on my machine. It turns out that, at -O2, master and the refactoring
branch are even; but at -O3, both get faster, and the refactoring pulls
ahead by a few percentage points.

At least that's what's happening for en-US-x-icu on UTF-8 with my test
data set. I didn't see much of a difference in other situations, but I
didn't retest those other situations this time around.

We should still look into why disabling abbreviated keys improves
performance in some cases. Maybe we need a GUC for that?


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Rework of collation code, extensibility

2023-01-20 Thread Jeff Davis
On Tue, 2023-01-17 at 14:18 -0800, Peter Geoghegan wrote:
> The second goal is a perfectly good enough goal on its own, and one
> that I am totally supportive of. Making the code clearer is icing on
> the cake.

Attached v8, which is just a rebase.

To reiterate: commitfest entry
https://commitfest.postgresql.org/41/3956/ is dependent on these
patches and is a big part of the motivation for refactoring.

> 
> I don't know. Quite possibly not. It would be nice to have some data
> on that, though.

I tested with hash aggregation, which might be more dependent on
pg_locale_deterministic() than sorting. I didn't see any significant
difference between master and the refactoring branch, so I don't see a
need to make that function "inline".

I also re-tested sorting and found some interesting results for en-US-
x-icu on a UTF-8 database (which is I suspect one of the most common
configurations for ICU):

  * the refactoring branch is now more than 5% faster, whether using
abbreviated keys or not
  * disabling abbreviated keys makes sorting 8-10% faster on both
master and the refactoring branch

Both of these are surprising, and I haven't investigated deeply yet.
Maybe something about LTO, some intervening patch, or I just made some
mistakes somewhere (I did this fairly quickly). But as of now, it
doesn't look like the refactoring patch hurts anything.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From d1e2e1757b043c876695b8fa8c304b5126efb3aa Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Mon, 5 Dec 2022 10:43:52 -0800
Subject: [PATCH v8 2/2] Refactor pg_locale_t routines.

  * add pg_locale_internal.h to hide pg_locale_struct
  * move info.lt into info.libc.lt to match icu
  * introduce init_default_locale()
  * introduce pg_locale_deterministic() accessor
  * make default_locale a static global in pg_locale.c
  * refactor pg_newlocale_from_collation()
---
 src/backend/access/hash/hashfunc.c |  82 +++---
 src/backend/commands/collationcmds.c   |   1 +
 src/backend/regex/regc_pg_locale.c |  45 ++--
 src/backend/utils/adt/formatting.c |  25 +-
 src/backend/utils/adt/like.c   |   3 +-
 src/backend/utils/adt/like_support.c   |   3 +-
 src/backend/utils/adt/pg_locale.c  | 342 +++--
 src/backend/utils/adt/varchar.c|  62 ++---
 src/backend/utils/adt/varlena.c|  14 +-
 src/backend/utils/init/postinit.c  |  29 ++-
 src/include/utils/pg_locale.h  |  55 +---
 src/include/utils/pg_locale_internal.h |  68 +
 12 files changed, 402 insertions(+), 327 deletions(-)
 create mode 100644 src/include/utils/pg_locale_internal.h

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index c0ed995919..7cbd39f466 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -282,36 +282,28 @@ hashtext(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) VARDATA_ANY(key),
 		  VARSIZE_ANY_EXHDR(key));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any((uint8_t *) buf, bsize);
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any((uint8_t *) buf, bsize);
+
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -337,7 +329,7 @@ hashtextextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
    VARSIZE_ANY_EXHDR(key),
@@ -345,30 +337,22 @@ hashtextextended(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, 

Re: Rework of collation code, extensibility

2023-01-17 Thread Peter Geoghegan
On Sat, Jan 14, 2023 at 12:03 PM Jeff Davis  wrote:
> The first goal I had was simply that the code was really hard to
> understand and work on, and refactoring was justified to improve the
> situation.
>
> The second goal, which is somewhat dependent on the first goal, is that
> we really need an ability to support multiple ICU libraries, and I
> wanted to do some common groundwork that would be needed for any
> approach we choose there, and provide some hooks to get us there. You
> are right that this goal influenced the first goal.

I don't disagree that it was somewhat independent of the first goal. I
just think that it makes sense to "round up to fully dependent".
Basically it's not independent enough to be worth talking about as an
independent thing, just as a practical matter - it's confusing at the
level of things like the commit message. There is a clear direction
that you're going in here from the start, and your intentions in 0001
do matter to somebody that's just looking at 0001 in isolation. That
is my opinion, at least.

The second goal is a perfectly good enough goal on its own, and one
that I am totally supportive of. Making the code clearer is icing on
the cake.

> ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some
> basic testing and it doesn't seem like it's slower than using the
> length. If passing the length is faster for some reason, it would
> complicate the API because we'd need an entry point that's expecting
> nul-termination and lengths, which is awkward (as Peter E. pointed
> out).

That's good. I'm happy to leave it at that. I was only enquiring.

> I felt it was a little clearer amongst the other code, to a casual
> reader, but I suppose it's a style thing. I will change it if you
> insist.

I certainly won't insist.

> I'd have to expose the pg_locale_t struct, which didn't seem desirable
> to me. Do you think it's enough of a performance concern to be worth
> some ugliness there?

I don't know. Quite possibly not. It would be nice to have some data
on that, though.

-- 
Peter Geoghegan




Re: Rework of collation code, extensibility

2023-01-14 Thread Jeff Davis
On Fri, 2023-01-13 at 11:57 -0800, Peter Geoghegan wrote:
> You're adding a layer of indirection that's going to set things up
> for
> later patches that add a layer of indirection for version ICU
> libraries (and even libc itself), and some of the details only make
> sense in that context. This isn't just refactoring work that could
> just as easily have happened in some quite different context.

Right, well put. I have two goals and felt that they merged into one
patchset, but I think that caused more confusion.

The first goal I had was simply that the code was really hard to
understand and work on, and refactoring was justified to improve the
situation.

The second goal, which is somewhat dependent on the first goal, is that
we really need an ability to support multiple ICU libraries, and I
wanted to do some common groundwork that would be needed for any
approach we choose there, and provide some hooks to get us there. You
are right that this goal influenced the first goal.

I attached new patches:

  v7-0001: pg_strcoll and pg_strxfrm patches combined, your comments
addressed
  v7-0002: add pg_locale_internal.h (and other refactoring)

I will post the other patches in the other thread.

> That means that pg_strcoll()'s relationship to pg_strxfrm()'s isn't
> the same as the well known strcoll()/strxfrm() relationship.

That's a really good point. I changed tiebreaking to be the caller's
responsibility.

> * varstrfastcmp_locale() is no longer capable of calling
> ucol_strcollUTF8() through the shim interface, meaning that it has to
> determine string length based on NUL-termination, when in principle
> it
> could just use the known length of the string.

I think you misread, it still calls ucol_strcollUTF8() when applicable,
which is impoartant because otherwise it would require a conversion to
a UChar string.

ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some
basic testing and it doesn't seem like it's slower than using the
length. If passing the length is faster for some reason, it would
complicate the API because we'd need an entry point that's expecting
nul-termination and lengths, which is awkward (as Peter E. pointed
out).

> * I don't see much point in this new varstr_abbrev_convert()
> variable:
> 
> +   const size_t max_prefix_bytes = sizeof(Datum);
> 
> varstr_abbrev_convert() is concerned with packing abbreviated key
> bytes into Datums, so it's perfectly reasonable to deal with
> Datums/sizeof(Datum) directly.

I felt it was a little clearer amongst the other code, to a casual
reader, but I suppose it's a style thing. I will change it if you
insist.

> * Having a separate pg_strxfrm_prefix_libc() function just to throw
> an
> error doesn't really add much IMV.

Removed.

> Comments on 0003-*:
> 
> I suggest that some of the very trivial functions you have here (such
> as pg_locale_deterministic()) be made inline functions.

I'd have to expose the pg_locale_t struct, which didn't seem desirable
to me. Do you think it's enough of a performance concern to be worth
some ugliness there?


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From 965212ed108d0bc8884b38a277b23713788876ea Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 1 Dec 2022 14:45:15 -0800
Subject: [PATCH v7 1/2] Add pg_strcoll(), pg_strxfrm(), and variants.

In preparation for multi-lib ICU, which should be based on a clean
separation of the routines required for collation providers. Also
offers a generally better separation of responsibilities.

Callers with NUL-terminated strings should call pg_strcoll() or
pg_strxfrm(); callers with strings and their length should call the
variants pg_strncoll() or pg_strnxfrm().

Also remove the TRUST_STRXFRM define, and replace with a developer GUC
for easier testing.
---
 src/backend/access/hash/hashfunc.c  |  45 +-
 src/backend/utils/adt/pg_locale.c   | 783 +++-
 src/backend/utils/adt/varchar.c |  41 +-
 src/backend/utils/adt/varlena.c | 368 ++---
 src/backend/utils/misc/guc_tables.c |  11 +
 src/include/utils/pg_locale.h   |  14 +
 6 files changed, 874 insertions(+), 388 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index e3e40d6c21..c0ed995919 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -292,21 +292,19 @@ hashtext(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t*buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-	uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-			uchar, ulen, 

Re: Rework of collation code, extensibility

2023-01-13 Thread Peter Geoghegan
On Wed, Jan 11, 2023 at 3:44 PM Jeff Davis  wrote:
> Attached trivial rebase as v6.

Some review comments for this v6.

Comments on 0001-*:

* I think that 0002-* can be squashed with 0001-*, since there isn't
any functional reason why you'd want to commit the strcoll() and
strxfrm() changes separately.

Sometimes it can be useful to break things up, despite the fact that
it couldn't possibly make sense to commit just one of the resulting
patches on its own. However, I don't think that that's appropriate
here. There is no apparent conceptual boundary that you're
highlighting by splitting things up like this. strxfrm() and strcoll()
are defined in terms of each other -- they're siblings, joined at the
hip -- so this seems kinda jarring.

* Your commit message for 0001 (and other patches) don't set things up
by describing what the point is, and what the work anticipates. I
think that they should do that.

You're adding a layer of indirection that's going to set things up for
later patches that add a layer of indirection for version ICU
libraries (and even libc itself), and some of the details only make
sense in that context. This isn't just refactoring work that could
just as easily have happened in some quite different context.

* I'm not sure that pg_strcoll() should be breaking ties itself. We
break ties using strcmp() for historical reasons, but must not do that
for deterministic ICU collations, which may be obscured.

That means that pg_strcoll()'s relationship to pg_strxfrm()'s isn't
the same as the well known strcoll()/strxfrm() relationship. That kind
of makes pg_strcoll() somewhat more than a strcoll() shim, which is
inconsistent. Another concern is that the deterministic collation
handling isn't handled in any one layer, which would have been nice.

Do we need to do things this way? What's it adding?

* varstrfastcmp_locale() is no longer capable of calling
ucol_strcollUTF8() through the shim interface, meaning that it has to
determine string length based on NUL-termination, when in principle it
could just use the known length of the string.

Presumably this might have performance implications. Have you thought
about that?

Some comments on 0002-*:

* I don't see much point in this new varstr_abbrev_convert() variable:

+   const size_t max_prefix_bytes = sizeof(Datum);

varstr_abbrev_convert() is concerned with packing abbreviated key
bytes into Datums, so it's perfectly reasonable to deal with
Datums/sizeof(Datum) directly.

* Having a separate pg_strxfrm_prefix_libc() function just to throw an
error doesn't really add much IMV.

Comments on 0003-*:

I suggest that some of the very trivial functions you have here (such
as pg_locale_deterministic()) be made inline functions.

Comments on 0006-*:

* get_builtin_libc_library() could be indented in a way that would
make it easier to understand.

--
Peter Geoghegan




Re: Rework of collation code, extensibility

2023-01-11 Thread Jeff Davis
On Wed, 2023-01-11 at 15:08 +0100, Peter Eisentraut wrote:
> I think the refactoring that you proposed in the thread "Refactor to 
> introduce pg_strcoll()." was on a sensible track.  Maybe we should
> try 
> to get that done.

Those should be patches 0001-0003 in this thread (now at v6), which are
all pure refactoring.

Let's consider those patches the topic of this thread and I'll move
0004-0007 back to the multi-lib ICU thread on the next revision.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Rework of collation code, extensibility

2023-01-11 Thread vignesh C
On Thu, 22 Dec 2022 at 11:11, Jeff Davis  wrote:
>
> On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote:
> > Attached is a new patch series. I think there are enough changes that
> > this has become more of a "rework" of the collation code rather than
> > just a refactoring. This is a continuation of some prior work[1][2]
> > in
> > a new thread given its new scope.
>
> Here's version 5. There are a number of fixes, and better tests, and
> it's passing in CI.
>
> The libc hook support is still experimental, but what's working is
> passing in CI, even on windows. The challenges with libc hook support
> are:
>
>  * It obviously doesn't replace all of libc, so the separation is not
> as clean and there are a number of callers throughout the code that
> don't necessarily care about specific collations.
>
>  * libc relies on setlocale() / uselocale(), which is global state and
> not as easy to track.
>
>  * More platform issues (obviously) and harder to test.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
c971a5b27ac946e7c94f7f655d321279512c7ee7 ===
=== applying patch ./v5-0003-Refactor-pg_locale_t-routines.patch

Hunk #1 FAILED at 88.
...
1 out of 9 hunks FAILED -- saving rejects to file
src/backend/utils/adt/formatting.c.rej
patching file src/backend/utils/adt/like.c
Hunk #1 FAILED at 24.
Hunk #2 succeeded at 97 (offset 1 line).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/utils/adt/like.c.rej

[1] - http://cfbot.cputube.org/patch_41_4058.log

Regards,
Vignesh




Re: Rework of collation code, extensibility

2023-01-11 Thread Peter Eisentraut

On 06.01.23 08:04, Jeff Davis wrote:

The existing code is not great, in my opinion: it doesn't have clear
API boundaries, the comments are insufficient, and lots of special
cases need to be handled awkwardly by callers. That style is hard to
beat when it comes to the raw line count; but it's quite difficult to
understand and work on.

I think my changes are an improvement, but obviously that depends on
the opinion of others who are working in this part of the code. What do
you think?


I think the refactoring that you proposed in the thread "Refactor to 
introduce pg_strcoll()." was on a sensible track.  Maybe we should try 
to get that done.  The multiple-ICU stuff is still experimental and has 
its own rather impressive thread, so I don't think it's sensible to try 
to sort that out here.






Re: Rework of collation code, extensibility

2023-01-05 Thread Jeff Davis
On Wed, 2023-01-04 at 22:46 +0100, Peter Eisentraut wrote:
> It combines some refactoring that was previously posted with partial 
> support for multiple ICU libraries with partial support for some new 
> hooks.  Shouldn't those be three separate threads?

Originally they felt more separate to me, too; but as I worked on them
it seemed better to consider them as a patch series. Whatever is easier
for reviewers works for me, though.

>   I think the multiple 
> ICU libraries already does have a separate thread; how does this
> relate 
> to that work?

Multilib ICU support adds complexity, and my hope is that this patch
set cleans up and organizes things to better prepare for that
complexity.

>   I don't know what the hooks are supposed to be for?

I found them very useful for testing during development. One of the
patches adds a test module for the ICU hook, and I think that's a
valuable place to test regardless of whether any other extension uses
the hook. Also, if proper multilib support doesn't land in 16, then the
hooks could be a way to build rudimentary multilib support (or at least
some kind of ICU version lockdown) until it does land.

When Thomas's work is in place, I expect the hooks to change slightly.
The hooks are not meant to set any specific API in stone.

>   What 
> other locale libraries are you thinking about using this way?  How
> can 
> we asses whether these interfaces are sufficient for that?

I'm not considering any other locale libraries, nor did I see much
discussion of that.

>   The 
> refactoring patches don't look convincing just by looking at the
> numbers:
> 
>   3 files changed, 406 insertions(+), 247 deletions(-)
>   6 files changed, 481 insertions(+), 150 deletions(-)
>   12 files changed, 400 insertions(+), 323 deletions(-)

The existing code is not great, in my opinion: it doesn't have clear
API boundaries, the comments are insufficient, and lots of special
cases need to be handled awkwardly by callers. That style is hard to
beat when it comes to the raw line count; but it's quite difficult to
understand and work on.

I think my changes are an improvement, but obviously that depends on
the opinion of others who are working in this part of the code. What do
you think?


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Rework of collation code, extensibility

2023-01-04 Thread Peter Eisentraut

On 22.12.22 06:40, Jeff Davis wrote:

On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote:

Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code rather than
just a refactoring. This is a continuation of some prior work[1][2]
in
a new thread given its new scope.


Here's version 5. There are a number of fixes, and better tests, and
it's passing in CI.

The libc hook support is still experimental, but what's working is
passing in CI, even on windows. The challenges with libc hook support
are:

  * It obviously doesn't replace all of libc, so the separation is not
as clean and there are a number of callers throughout the code that
don't necessarily care about specific collations.

  * libc relies on setlocale() / uselocale(), which is global state and
not as easy to track.

  * More platform issues (obviously) and harder to test.


I'm confused by this patch set.

It combines some refactoring that was previously posted with partial 
support for multiple ICU libraries with partial support for some new 
hooks.  Shouldn't those be three separate threads?  I think the multiple 
ICU libraries already does have a separate thread; how does this relate 
to that work?  I don't know what the hooks are supposed to be for?  What 
other locale libraries are you thinking about using this way?  How can 
we asses whether these interfaces are sufficient for that?  The 
refactoring patches don't look convincing just by looking at the numbers:


 3 files changed, 406 insertions(+), 247 deletions(-)
 6 files changed, 481 insertions(+), 150 deletions(-)
 12 files changed, 400 insertions(+), 323 deletions(-)

My sense is this is trying to do too many things at once, and those 
things are each not fully developed yet.






Re: Rework of collation code, extensibility

2022-12-17 Thread Ted Yu
On Sat, Dec 17, 2022 at 8:54 PM John Naylor 
wrote:

>
> > nul-terminate -> null-terminate
>
> NUL is a common abbreviation for the zero byte (but not for zero
> pointers). See the ascii manpage.
>
> --
> John Naylor
> EDB: http://www.enterprisedb.com
>
> Ah.

`nul-terminated` does appear in the codebase.
Should have checked earlier.


Re: Rework of collation code, extensibility

2022-12-17 Thread John Naylor
On Sun, Dec 18, 2022 at 10:28 AM Ted Yu  wrote:

> It seems the `else` is not needed (since when the if branch is taken, we
return from the func).

By that same logic, this review comment is not needed, since compiler
vendors don't charge license fees by the number of keywords. ;-)
Joking aside, we don't really have a project style preference for this case.

> nul-terminate -> null-terminate

NUL is a common abbreviation for the zero byte (but not for zero pointers).
See the ascii manpage.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Rework of collation code, extensibility

2022-12-17 Thread Ted Yu
On Sat, Dec 17, 2022 at 7:14 PM Jeff Davis  wrote:

> Attached is a new patch series. I think there are enough changes that
> this has become more of a "rework" of the collation code rather than
> just a refactoring. This is a continuation of some prior work[1][2] in
> a new thread given its new scope.
>
> Benefits:
>
> 1. Clearer division of responsibilities.
> 2. More consistent between libc and ICU providers.
> 3. Hooks that allow extensions to replace collation provider libraries.
> 4. New tests for the collation provider library hooks.
>
> There are a lot of changes, and still some loose ends, but I believe a
> few of these patches are close to ready.
>
> This set of changes does not express an opinion on how we might want to
> support multiple provider libraries in core; but whatever we choose, it
> should be easier to accomplish. Right now, the hooks have limited
> information on which to make the choice for a specific version of a
> collation provider library, but that's because there's limited
> information in the catalog. If the discussion here[3] concludes in
> adding collation provider library or library version information to the
> catalog, we can add additional parameters to the hooks.
>
> [1]
>
> https://postgr.es/m/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.ca...@j-davis.com
> [2]
>
> https://postgr.es/m/c4fda90ec6a7568a896f243a38eb273c3b5c3d93.ca...@j-davis.com
> [3]
>
> https://postgr.es/m/ca+hukgleqmhnpzrgacisoueyfgz8w6ewdhtk2h-4qn0iosf...@mail.gmail.com
>
>
> --
> Jeff Davis
> PostgreSQL Contributor Team - AWS
>
> Hi,
For pg_strxfrm_libc in v4-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch:

+#ifdef HAVE_LOCALE_T
+   if (locale)
+   return strxfrm_l(dest, src, destsize, locale->info.lt);
+   else
+#endif
+   return strxfrm(dest, src, destsize);

It seems the `else` is not needed (since when the if branch is taken, we
return from the func).

+   /* nul-terminate arguments */

nul-terminate -> null-terminate

For pg_strnxfrm(), I think `result` can be removed - we directly return the
result from pg_strnxfrm_libc or pg_strnxfrm_icu.

Cheers