Re: [PATCH] Run UTF8-dependent tests for citext [Re: daitch_mokotoff module]

2022-01-05 Thread Tom Lane
Dag Lem  writes:
> Please find attached a patch to run the previously commented-out
> UTF8-dependent tests for citext, according to best practice. For now I
> don't dare to touch the unaccent module, which seems to be UTF8-only
> anyway.

I tried this on a bunch of different locale settings and concluded that
we need to restrict the locale to avoid failures: it falls over with
locale C.  With that, it passes on all UTF8 LANG settings on RHEL8
and FreeBSD 12, and all except am_ET.UTF-8 on current macOS.  I'm not
sure what the deal is with am_ET, but macOS has a long and sad history
of wonky UTF8 locales, so I was actually expecting worse.  If the
buildfarm shows more problems, we can restrict it further --- I won't
be too upset if we end up restricting to just Linux systems, like
collate.linux.utf8.  Anyway, pushed to see what happens.

regards, tom lane




[PATCH] Run UTF8-dependent tests for citext [Re: daitch_mokotoff module]

2022-01-05 Thread Dag Lem
Tom Lane  writes:

> Dag Lem  writes:
>
>> Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that
>> two other modules are using UTF8 characters in tests - citext and
>> unaccent.
>
> Yeah, neither of those have been upgraded to said best practice.
> (If you feel like doing the legwork to improve that situation,
> that'd be great.)
>

Please find attached a patch to run the previously commented-out
UTF8-dependent tests for citext, according to best practice. For now I
don't dare to touch the unaccent module, which seems to be UTF8-only
anyway.


Best regards

Dag Lem

diff --git a/contrib/citext/Makefile b/contrib/citext/Makefile
index a7de52928d..789932fe36 100644
--- a/contrib/citext/Makefile
+++ b/contrib/citext/Makefile
@@ -11,7 +11,7 @@ DATA = citext--1.4.sql \
 	citext--1.0--1.1.sql
 PGFILEDESC = "citext - case-insensitive character string data type"
 
-REGRESS = citext
+REGRESS = citext citext_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/citext/expected/citext.out b/contrib/citext/expected/citext.out
index 3bac0534fb..48b4de8993 100644
--- a/contrib/citext/expected/citext.out
+++ b/contrib/citext/expected/citext.out
@@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t;
  t
 (1 row)
 
--- Multibyte sanity tests. Uncomment to run.
--- SELECT 'À'::citext =  'À'::citext AS t;
--- SELECT 'À'::citext =  'à'::citext AS t;
--- SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
--- SELECT 'À'::citext <> 'B'::citext AS t;
--- Test combining characters making up canonically equivalent strings.
--- SELECT 'Ä'::text   <> 'Ä'::text   AS t;
--- SELECT 'Ä'::citext <> 'Ä'::citext AS t;
--- Test the Turkish dotted I. The lowercase is a single byte while the
--- uppercase is multibyte. This is why the comparison code can't be optimized
--- to compare string lengths.
--- SELECT 'i'::citext = 'İ'::citext AS t;
--- Regression.
--- SELECT 'láska'::citext <> 'laská'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
--- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
 -- Test > and >=
 SELECT 'B'::citext > 'a'::citext AS t;
  t 
diff --git a/contrib/citext/expected/citext_1.out b/contrib/citext/expected/citext_1.out
index 57fc863f7a..8ab4d4224e 100644
--- a/contrib/citext/expected/citext_1.out
+++ b/contrib/citext/expected/citext_1.out
@@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t;
  t
 (1 row)
 
--- Multibyte sanity tests. Uncomment to run.
--- SELECT 'À'::citext =  'À'::citext AS t;
--- SELECT 'À'::citext =  'à'::citext AS t;
--- SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
--- SELECT 'À'::citext <> 'B'::citext AS t;
--- Test combining characters making up canonically equivalent strings.
--- SELECT 'Ä'::text   <> 'Ä'::text   AS t;
--- SELECT 'Ä'::citext <> 'Ä'::citext AS t;
--- Test the Turkish dotted I. The lowercase is a single byte while the
--- uppercase is multibyte. This is why the comparison code can't be optimized
--- to compare string lengths.
--- SELECT 'i'::citext = 'İ'::citext AS t;
--- Regression.
--- SELECT 'láska'::citext <> 'laská'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
--- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
 -- Test > and >=
 SELECT 'B'::citext > 'a'::citext AS t;
  t 
diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
new file mode 100644
index 00..1f4fa79aff
--- /dev/null
+++ b/contrib/citext/expected/citext_utf8.out
@@ -0,0 +1,119 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+   AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding