Re: Improving tracking/processing of buildfarm test failures

2024-11-01 Thread Alexander Lakhin

Hello hackers,

Please take a look at the October report on buildfarm failures:
# SELECT br, count(*) FROM failures WHERE dt >= '2024-10-01' AND
 dt < '2024-11-01' GROUP BY br;
REL_12_STABLE: 9
REL_13_STABLE: 9
REL_14_STABLE: 19
REL_15_STABLE: 25
REL_16_STABLE: 12
REL_17_STABLE: 14
master: 109
-- Total: 197
(Counting test failures only, excluding indent-check, Configure, Build
errors.)

# SELECT COUNT(*) FROM (SELECT DISTINCT issue_link FROM failures WHERE
 dt >= '2024-10-01' AND dt < '2024-11-01');
22

# SELECT issue_link, count(*) FROM failures WHERE dt >= '2024-10-01' AND
 dt < '2024-11-01' GROUP BY issue_link ORDER BY 2 DESC LIMIT 6;
https://www.postgresql.org/message-id/d63a3295-cac1-4a8e-9de1-0ebab996d...@eisentraut.org: 54
-- Fixed

https://www.postgresql.org/message-id/caa4ek1k_ikmjekqsof9ssdau4s8_cu6n15rp13-j4cmskn-...@mail.gmail.com: 23
-- Fixed

https://www.postgresql.org/message-id/362289.1730241...@sss.pgh.pa.us: 11
-- Will be fixed soon

https://www.postgresql.org/message-id/2480333.1729784...@sss.pgh.pa.us: 6
-- Fixed

https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986b...@gmail.com: 5

https://www.postgresql.org/message-id/c638873f-8b1e-4770-ba49-5a0b3e140...@iki.fi: 4
-- Fixed

# SELECT count(*) FROM failures WHERE dt >= '2024-10-01' AND
 dt < '2024-11-01' AND issue_link IS NULL; -- Unsorted/unhelpful failures
74

Short-lived failures: 165 (+ 11 from 362289.1730241...@sss.pgh.pa.us)
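
(For reference, the queries above run against a small hand-maintained tracking table. A hypothetical minimal definition consistent with them could look like the sketch below; the columns other than dt, br and issue_link are guesses.)

-- Hypothetical sketch only; the real tracking table may differ.
CREATE TABLE failures (
    dt         timestamptz NOT NULL,  -- when the failure occurred
    br         text        NOT NULL,  -- branch, e.g. 'master', 'REL_17_STABLE'
    animal     text,                  -- buildfarm member name (assumed column)
    stage      text,                  -- failed test stage (assumed column)
    issue_link text                   -- discussion URL; NULL = unsorted failure
);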

Best regards,
Alexander




Re: Inval reliability, especially for inplace updates

2024-10-31 Thread Alexander Lakhin

Hello Noah,

31.10.2024 04:39, Noah Misch wrote:

I had pushed this during the indicated week, before your mail.  Reverting it
is an option.  Let's see if more opinions arrive.


I've accidentally discovered an incorrect behaviour caused by commit
4eac5a1fa. Running this script:
for ((j=1; j<=100; j++)); do
echo "iteration $j"

cat << 'EOF' | timeout 60 psql >>psql-$SID.log || { res=1; echo "hanged on iteration $j"; break; }
SELECT format('CREATE TABLE t%s (a int, b text);', g) FROM generate_series(1, 50) g
\gexec

SELECT format('DROP TABLE t%s;', g) FROM generate_series(1, 50) g
\gexec
EOF
done

with
autovacuum = on
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_analyze_threshold = 1

in parallel using separate servers (the full script is attached), like:
parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40`

I can catch the following:
...
3   hanged on iteration 51
...
19  hanged on iteration 64
...
39  hanged on iteration 99

And after the script run, I see the server processes hanging:
law  1081433   1  0 16:22 ?    00:00:00 .../usr/local/pgsql/bin/postgres
law  1081452 1081433  0 16:22 ?    00:00:00 postgres: checkpointer
law  1081453 1081433  0 16:22 ?    00:00:00 postgres: background writer
law  1081460 1081433  0 16:22 ?    00:00:00 postgres: walwriter
law  1081462 1081433  0 16:22 ?    00:00:00 postgres: autovacuum launcher
law  1081464 1081433  0 16:22 ?    00:00:00 postgres: logical replication launcher
law  1143065 1081433  0 16:32 ?    00:00:00 postgres: postgres postgres [local] CREATE TABLE
law  1143263 1081433  0 16:32 ?    00:00:00 postgres: autovacuum worker postgres
law  1143320 1081433  0 16:32 ?    00:00:00 postgres: autovacuum worker postgres
law  1143403 1081433  0 16:32 ?    00:00:00 postgres: autovacuum worker


Attaching to process 1143065
...
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=, abstime=0x0, clockid=0, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fed9a8171b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=) at ./nptl/futex-internal.c:139
#3  0x7feda4674c5f in do_futex_wait (sem=sem@entry=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x7feda4674cf8 in __new_sem_wait_slow64 (sem=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a8171b8) at pg_sema.c:327
#6  0x561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
#7  0x561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182
#8  0x561dd6d4f506 in heapam_index_fetch_tuple (scan=0x561dd8cb6588, tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28, slot=0x561dd8cb75a0, call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at heapam_handler.c:146

...
(the full backtrace is attached)

All three autovacuum workers (1143263, 1143320, 1143403) are also waiting
for the same buffer lock:
#5  0x561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327
#6  0x561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
#7  0x561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182
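
(While the hang is in progress, the stuck backends can also be observed from a separate session; the query below is only an illustration, and the wait_event one would expect for this LockBuffer() wait is BufferContent.)

SELECT pid, backend_type, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock';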

Probably, this can be reproduced with VACUUM pg_class/pg_type/..., but I
haven't found out the exact combination needed yet.

Also, as a side note, these processes can't be terminated with SIGTERM; I
have to kill them.

Initially I saw this on a slowed-down VM, but with the attached patch
applied I could reproduce it on my workstation too.

Best regards,
Alexander

reproi.sh
Description: application/shellscript
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2168259247..730ef9b5a2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -138,6 +138,10 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 			heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf);
 	}
 
+uint64 r = 0;
+for (uint64 i = 0; i < 10; i++) r += (r ^ i);
+Assert(r != 0);
+
 	/* Obtain share-lock on the buffer so we can examine visibility */
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
 	got_heap_tuple = heap_hot_search_buffer(tid,

Re: Better error reporting from extension scripts (Was: Extend ALTER OPERATOR)

2024-10-28 Thread Alexander Lakhin

28.10.2024 19:06, Tom Lane wrote:

I've also dumped buf in read_whole_file() and found that in both
PG_BINARY_R and "r" modes the 0d 0a ending is preserved. But it changed
to 0a with the "rt" mode (see [1]), and it makes the test (and the whole
`meson test`) pass for me.

Interesting.  I believe we decided years ago that we didn't need to
use "rt" mode because that was the default on Windows, but was that
a misreading of the documentation?  The link you provided doesn't
give any hint that there are more than two behaviors.

However ... the link you provided also mentions that text mode
includes treating control-Z as EOF; which I'd forgotten, but it
does make it less safe than I thought to use text mode for
reading script files.


I think this other behavior can be explained by the pgwin32_fopen()/
pgwin32_open() coding (O_TEXT is assumed implicitly only #ifdef FRONTEND).

Anyway, as you noticed, \x1A injected into test_extsql really leads to
truncation of the file contents on read (with "rt"), so I agree that using
the text/translation mode here is not an improvement.


What I'm now thinking is that we should revert 924e03917 after
all (that is, go back to using PG_BINARY_R) and instead make
read_whole_file manually squash \r\n to \n if we're on Windows.
Ugly, but I have yet to find anything about that platform that
isn't.


Yes, I think this should help.
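
(For illustration only, a standalone sketch of the \r\n-squashing idea; it is not the actual read_whole_file() change, whose buffer handling differs.)

#include <stdio.h>
#include <string.h>

/* Collapse \r\n sequences to \n in place and return the new length. */
static int
squash_crlf(char *buf, int len)
{
	char	   *src = buf,
			   *dst = buf;

	while (src < buf + len)
	{
		if (src[0] == '\r' && src + 1 < buf + len && src[1] == '\n')
			src++;				/* drop the \r, keep the \n */
		*dst++ = *src++;
	}
	return (int) (dst - buf);
}

int
main(void)
{
	char		buf[] = "CREATE FUNCTIN my_erroneous_func(int)\r\nAS $$ SELECT 1 $$;\r\n";
	int			len = squash_crlf(buf, (int) strlen(buf));

	fwrite(buf, 1, len, stdout);	/* prints the text with plain \n endings */
	return 0;
}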

Best regards,
Alexander




Re: Better error reporting from extension scripts (Was: Extend ALTER OPERATOR)

2024-10-28 Thread Alexander Lakhin

Hello Tom,

27.10.2024 20:41, Tom Lane wrote:

I wrote:

In the no-good-deed-goes-unpunished department: buildfarm member
hamerkop doesn't like this patch [1].  The diffs look like
...
So what I'd like to do to fix this is to change
-   if ((file = AllocateFile(filename, PG_BINARY_R)) == NULL)
+   if ((file = AllocateFile(filename, "r")) == NULL)

Well, that didn't fix it :-(.  I went so far as to extract the raw log
files from the buildfarm database, and what they show is that there is
absolutely no difference between the lines diff is claiming are
different:

-QUERY:  CREATE FUNCTIN my_erroneous_func(int) RETURNS int LANGUAGE SQL\r\n
+QUERY:  CREATE FUNCTIN my_erroneous_func(int) RETURNS int LANGUAGE SQL\r\n

It's the same both before and after 924e03917, which made the code
change depicted above, so that didn't help.

So I'm pretty baffled.  I suppose the expected and result files must
actually be different, and something in subsequent processing is
losing the difference before it gets to the buildfarm database.
But I don't have the ability to debug that from here.  Does anyone
with access to hamerkop want to poke into this?

Without additional information, the only thing I can think of that
I have any confidence will eliminate these failures is to reformat
the affected test cases so that they produce just a single line of
output.  That's kind of annoying from a functionality-coverage point
of view, but I'm not sure avoiding it is worth moving mountains for.



I've managed to reproduce the issue with the core.autocrlf=true git setting
(which sets CR+LF line ending in test_ext7--2.0--2.1bad.sql) and with diff
from msys:
C:\msys64\usr\bin\diff.exe --version
diff (GNU diffutils) 3.8

(Gnu/Win32 Diff [1] doesn't detect those EOL differences and thus the test
doesn't fail.)
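
(A hedged repro recipe distilled from the description above; the script's location under src/test/modules/test_extensions/ is assumed here.)

git config core.autocrlf true
rm src/test/modules/test_extensions/test_ext7--2.0--2.1bad.sql
git checkout -- src/test/modules/test_extensions/test_ext7--2.0--2.1bad.sql
# the re-checked-out script now has CRLF line endings; re-run the
# test_extensions suite with msys diff 3.8 first in PATH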

I can really see different line endings with hexdump:
hexdump -C ...testrun\test_extensions\regress\regression.diffs
0230  20 20 20 20 20 20 20 20  5e 0a 2d 51 55 45 52 59  | ^.-QUERY|
0240  3a 20 20 43 52 45 41 54  45 20 46 55 4e 43 54 49  |: CREATE FUNCTI|
0250  4e 20 6d 79 5f 65 72 72  6f 6e 65 6f 75 73 5f 66  |N my_erroneous_f|
0260  75 6e 63 28 69 6e 74 29  20 52 45 54 55 52 4e 53 |unc(int) RETURNS|
0270  20 69 6e 74 20 4c 41 4e  47 55 41 47 45 20 53 51  | int LANGUAGE SQ|
0280  4c 0a 2b 51 55 45 52 59  3a 20 20 43 52 45 41 54 |L.+QUERY:  CREAT|
0290  45 20 46 55 4e 43 54 49  4e 20 6d 79 5f 65 72 72  |E FUNCTIN my_err|
02a0  6f 6e 65 6f 75 73 5f 66  75 6e 63 28 69 6e 74 29 |oneous_func(int)|
02b0  20 52 45 54 55 52 4e 53  20 69 6e 74 20 4c 41 4e  | RETURNS int LAN|
02c0  47 55 41 47 45 20 53 51  4c 0d 0a 20 41 53 20 24  |GUAGE SQL.. AS $|

hexdump -C .../testrun/test_extensions/regress/results/test_extensions.out | 
grep -C5 FUNCTIN
0b80  20 5e 0d 0a 51 55 45 52  59 3a 20 20 43 52 45 41  | ^..QUERY:  CREA|
0b90  54 45 20 46 55 4e 43 54  49 4e 20 6d 79 5f 65 72  |TE FUNCTIN my_er|
0ba0  72 6f 6e 65 6f 75 73 5f  66 75 6e 63 28 69 6e 74 |roneous_func(int|
0bb0  29 20 52 45 54 55 52 4e  53 20 69 6e 74 20 4c 41  |) RETURNS int LA|
0bc0  4e 47 55 41 47 45 20 53  51 4c 0d 0d 0a 41 53 20  |NGUAGE SQL...AS |

whilst
hexdump -C .../src/test/modules/test_extensions/expected/test_extensions.out | 
grep -C5 FUNCTIN
0b80  20 5e 0d 0a 51 55 45 52  59 3a 20 20 43 52 45 41  | ^..QUERY:  CREA|
0b90  54 45 20 46 55 4e 43 54  49 4e 20 6d 79 5f 65 72  |TE FUNCTIN my_er|
0ba0  72 6f 6e 65 6f 75 73 5f  66 75 6e 63 28 69 6e 74 |roneous_func(int|
0bb0  29 20 52 45 54 55 52 4e  53 20 69 6e 74 20 4c 41  |) RETURNS int LA|
0bc0  4e 47 55 41 47 45 20 53  51 4c 0d 0a 41 53 20 24  |NGUAGE SQL..AS $|

It looks like --strip-trailing-cr doesn't work as desired for this diff version.

I've also dumped buf in read_whole_file() and found that in both
PG_BINARY_R and "r" modes the 0d 0a ending is preserved. But it changed
to 0a with the "rt" mode (see [1]), and it makes the test (and the whole
`meson test`) pass for me.

[1] https://gnuwin32.sourceforge.net/packages/diffutils.htm
[2] 
https://learn.microsoft.com/en-us/cpp/c-runtime-library/file-translation-constants?view=msvc-170

Best regards,
Alexander




Re: Statistics Import and Export

2024-10-27 Thread Alexander Lakhin

Hello Jeff and Corey,

26.10.2024 01:18, Jeff Davis wrote:

On Tue, 2024-09-17 at 05:02 -0400, Corey Huinker wrote:

I've taken most of Jeff's work, reincorporated it into roughly the
same patch structure as before, and am posting it now.

I have committed the import side of this patch series; that is, the
function calls that can load stats into an existing cluster without the
need to ANALYZE.

The pg_restore_*_stats() functions are designed such that pg_dump can
emit the calls. Some design choices of the functions worth noting:


Please look at the following seemingly atypical behavior of the new
functions:
CREATE TABLE test(id int);

SELECT pg_restore_attribute_stats(
  'relation', 'test'::regclass,
  'attname', 'id'::name,
  'inherited', false);

SELECT pg_restore_attribute_stats(
  'relation', 'test'::regclass,
  'attname', 'id'::name,
  'inherited', false
) FROM generate_series(1, 2);
ERROR:  XX000: tuple already updated by self
LOCATION:  simple_heap_update, heapam.c:4353

Or:
SELECT pg_clear_attribute_stats('test'::regclass, 'id'::name, false)
FROM generate_series(1, 2);
ERROR:  XX000: tuple already updated by self
LOCATION:  simple_heap_delete, heapam.c:3108

Best regards,
Alexander




Re: heap_inplace_lock vs. autovacuum w/ LOCKTAG_TUPLE

2024-10-26 Thread Alexander Lakhin

Hello Noah,

27.10.2024 07:09, Noah Misch wrote:

On Sat, Oct 26, 2024 at 11:49:36AM -0700, Noah Misch wrote:

intra-grant-inplace-db.spec got a novel failure today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sarus&dt=2024-10-26%2014%3A08%3A58

The isolationtester_waiting query is supposed to detect that step vac2 is
blocked.  vac2 didn't finish within the timeout, but isolationtester_waiting
never considered it blocked.
... work on reproducing this.


FWIW, there was a similar failure in August: [1], and I also could not
reproduce that locally, yet wrote a preliminary analysis at [2] in the
Unsorted section, in the hope of seeing it again and continuing the investigation.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2024-08-29%2013%3A57%3A57
[2] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures

Best regards,
Alexander




Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2024-10-25 Thread Alexander Lakhin

Hello Alvaro and Tender Wang,

24.10.2024 17:00, Alexander Lakhin wrote:

Please look at a new anomaly introduced with 53af9491a. When running the
following script:
CREATE TABLE t (a int, b int, PRIMARY KEY (a, b));
CREATE TABLE pt (a int, b int, FOREIGN KEY (a, b) REFERENCES t(a, b))
 PARTITION BY LIST (a);

CREATE TABLE tp1 (x int, a int, b int);
ALTER TABLE tp1 DROP COLUMN x;

ALTER TABLE pt ATTACH PARTITION tp1 FOR VALUES IN (1);

ALTER TABLE pt DETACH PARTITION tp1;


I've also discovered another anomaly with a similar setup, but it's not
related to 53af9491a.

CREATE TABLE t (a int, PRIMARY KEY (a));
INSERT INTO t VALUES (1);

CREATE TABLE pt (b int, a int) PARTITION BY RANGE (a);
ALTER TABLE pt DROP COLUMN b;
ALTER TABLE pt ADD FOREIGN KEY (a) REFERENCES t ON DELETE SET DEFAULT (a);

CREATE TABLE tp1 PARTITION OF pt FOR VALUES FROM (1) TO (2);
ALTER TABLE pt DETACH PARTITION tp1;

DELETE FROM t;
\d+ t

This script ends up with:
ERROR:  invalid attribute number 2
ERROR:  cache lookup failed for attribute 2 of relation 16398

(Perhaps it deserves a separate discussion.)

Best regards,
Alexander





Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2024-10-25 Thread Alexander Lakhin

Hello Alvaro,

22.10.2024 17:32, Alvaro Herrera wrote:

Yeah.  I pushed these patches finally, thanks!


Please look at a new anomaly introduced with 53af9491a. When running the
following script:
CREATE TABLE t (a int, b int, PRIMARY KEY (a, b));
CREATE TABLE pt (a int, b int, FOREIGN KEY (a, b) REFERENCES t(a, b))
 PARTITION BY LIST (a);

CREATE TABLE tp1 (x int, a int, b int);
ALTER TABLE tp1 DROP COLUMN x;

ALTER TABLE pt ATTACH PARTITION tp1 FOR VALUES IN (1);

ALTER TABLE pt DETACH PARTITION tp1;

I get a memory access error detected by Valgrind:
2024-10-24 12:05:04.645 UTC [1079077] LOG:  statement: ALTER TABLE pt DETACH PARTITION tp1;
==00:00:00:07.887 1079077== Invalid read of size 2
==00:00:00:07.887 1079077==    at 0x4A61DD: DetachPartitionFinalize (tablecmds.c:19545)
==00:00:00:07.887 1079077==    by 0x4A5C11: ATExecDetachPartition (tablecmds.c:19386)
==00:00:00:07.887 1079077==    by 0x48561E: ATExecCmd (tablecmds.c:5540)
==00:00:00:07.887 1079077==    by 0x4845DE: ATRewriteCatalogs (tablecmds.c:5203)
==00:00:00:07.887 1079077==    by 0x4838EC: ATController (tablecmds.c:4758)
==00:00:00:07.887 1079077==    by 0x4834F1: AlterTable (tablecmds.c:4404)
==00:00:00:07.887 1079077==    by 0x7D6D52: ProcessUtilitySlow (utility.c:1318)
==00:00:00:07.887 1079077==    by 0x7D65F7: standard_ProcessUtility (utility.c:1067)
==00:00:00:07.887 1079077==    by 0x7D54F7: ProcessUtility (utility.c:523)
==00:00:00:07.887 1079077==    by 0x7D3D70: PortalRunUtility (pquery.c:1158)
==00:00:00:07.887 1079077==    by 0x7D3FE7: PortalRunMulti (pquery.c:1316)
==00:00:00:07.887 1079077==    by 0x7D3431: PortalRun (pquery.c:791)

Reproduced on REL_15_STABLE .. master.

Best regards,
Alexander




Re: pgbench: Improve result outputs related to failed transactinos

2024-10-24 Thread Alexander Lakhin

Hello Tatsuo-san,

11.10.2024 07:54, Tatsuo Ishii wrote:

...
  - number of transactions actually pocessed: 1 (tps = 410.846343)
...

Patch pushed.


Please consider fixing a typo that sneaked into that commit: pocessed -> processed?

Best regards,
Alexander




Re: race condition in pg_class

2024-10-21 Thread Alexander Lakhin

Hello Noah,

25.09.2024 01:43, Noah Misch wrote:

Pushed, but the pushes contained at least one defect:



Please look at an anomaly introduced with a07e03fd8.
With the attached modification for intra-grant-inplace.spec, running this
test triggers a Valgrind-detected error for me:
==00:00:00:09.624 319033== Conditional jump or move depends on uninitialised value(s)
==00:00:00:09.624 319033==    at 0x25D120: DoesMultiXactIdConflict (heapam.c:7373)
==00:00:00:09.624 319033==    by 0x25B45A: heap_inplace_lock (heapam.c:6265)
==00:00:00:09.624 319033==    by 0x27D8CB: systable_inplace_update_begin (genam.c:867)
==00:00:00:09.624 319033==    by 0x33F717: index_update_stats (index.c:2856)
==00:00:00:09.624 319033==    by 0x33FEE2: index_build (index.c:3106)
==00:00:00:09.625 319033==    by 0x33C7D3: index_create (index.c:1276)
==00:00:00:09.625 319033==    by 0x451000: DefineIndex (indexcmds.c:1216)
==00:00:00:09.625 319033==    by 0x48D091: ATExecAddIndex (tablecmds.c:9156)
==00:00:00:09.625 319033==    by 0x483F8E: ATExecCmd (tablecmds.c:5302)
==00:00:00:09.625 319033==    by 0x483877: ATRewriteCatalogs (tablecmds.c:5186)
==00:00:00:09.625 319033==    by 0x482B9A: ATController (tablecmds.c:4741)
==00:00:00:09.625 319033==    by 0x4827A1: AlterTable (tablecmds.c:4387)
==00:00:00:09.625 319033==

Perhaps current_is_member in heap_inplace_lock() should be initialized
before the DoesMultiXactIdConflict() call as in other places...
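
(A minimal sketch of that suggestion; the surrounding code of heap_inplace_lock() and its local variables xwait, infomask and lockmode are elided, and this is not the actual patch.)

	bool		current_is_member = false;	/* initialize, as other callers do */

	if (DoesMultiXactIdConflict((MultiXactId) xwait, infomask,
								lockmode, &current_is_member))
	{
		/* ... wait for the conflicting multixact members ... */
	}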

Best regards,
Alexander

diff --git a/src/test/isolation/specs/intra-grant-inplace.spec b/src/test/isolation/specs/intra-grant-inplace.spec
index 2992c85b44..3339c9f400 100644
--- a/src/test/isolation/specs/intra-grant-inplace.spec
+++ b/src/test/isolation/specs/intra-grant-inplace.spec
@@ -90,9 +90,9 @@ permutation
 
 # inplace wait NO KEY UPDATE w/ KEY SHARE
 permutation
-	keyshr5
 	b3
 	sfnku3
+	keyshr5
 	addk2(r3)
 	r3
 


Re: ECPG cleanup and fix for clang compile-time problem

2024-10-17 Thread Alexander Lakhin

Hello Tom,

16.10.2024 19:26, Tom Lane wrote:

Alexander Lakhin  writes:

Maybe you would like to fix in passing several (not new) defects, I've
found while playing with ecpg under Valgrind:

Done.  After evaluation I concluded that none of these were worth the
trouble to back-patch, but by all means let's fix such things in HEAD.


Thank you for fixing these defects!

I've spent a day testing ecpg preprocessor and found another couple of
bugs:
1)
EXEC SQL BEGIN DECLARE SECTION;
int i = 1;
EXEC SQL END DECLARE SECTION;

EXEC SQL DECLARE c CURSOR FOR SELECT :i;

{;}
}

EXEC SQL OPEN c;

==1247560==
==1247560== Invalid read of size 4
==1247560==    at 0x121C13: dump_variables (variable.c:462)
==1247560==    by 0x10CF75: output_statement (output.c:157)
==1247560==    by 0x116B6B: base_yyparse (preproc.y:1233)
==1247560==    by 0x10C78F: main (ecpg.c:483)
==1247560==  Address 0x4e39bc0 is 16 bytes inside a block of size 32 free'd
==1247560==    at 0x484B27F: free (vg_replace_malloc.c:872)
==1247560==    by 0x1219AE: remove_variables (variable.c:351)
==1247560==    by 0x11899F: base_yyparse (preproc.y:7853)
==1247560==    by 0x10C78F: main (ecpg.c:483)
==1247560==  Block was alloc'd at
==1247560==    at 0x4848899: malloc (vg_replace_malloc.c:381)
==1247560==    by 0x120703: mm_alloc (util.c:87)
==1247560==    by 0x120C3C: new_variable (variable.c:12)
==1247560==    by 0x11C27C: base_yyparse (preproc.y:8984)
==1247560==    by 0x10C78F: main (ecpg.c:483)
...

---
2)
EXEC SQL BEGIN DECLARE SECTION;
char s[100];
EXEC SQL END DECLARE SECTION;

EXEC SQL DECLARE cur_1 CURSOR FOR SELECT 1;
EXEC SQL FETCH cur_1 INTO :s[0];

==1247848== Invalid read of size 4
==1247848==    at 0x121388: find_variable (variable.c:238)
==1247848==    by 0x11D684: base_yyparse (preproc.y:9751)
==1247848==    by 0x10C78F: main (ecpg.c:483)
==1247848==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

---
Also, processing of .../ecpg/test/sql/include.pgc, containing only:
EXEC SQL INCLUDE ../sql;

emits merely:
input in flex scanner failed

I think that's all that can be found here without extra effort.

Best regards,
Alexander




Re: replace strtok()

2024-10-17 Thread Alexander Lakhin

Hello Ranier,

16.10.2024 14:14, Ranier Vilela wrote:

Em qua., 16 de out. de 2024 às 04:45, Peter Eisentraut  
escreveu:


This wouldn't fix anything, I think.  If colors is NULL, then strsep()
already returns NULL, so the added code does nothing.

If colors is NULL, then the delimiter is not found and strsep will return
the entire string *stringp, so the token becomes invalid.

IMO, I think it must be necessary to check whether colors is NULL too.



I've tested your proposed change, and what I'm seeing is that
PG_COLOR=always PG_COLORS="error=01;31" initdb
doesn't color the "error" word, while with only Peter's patch it works as expected.


Does your change work differently for you?

Best regards,
Alexander

Re: ECPG cleanup and fix for clang compile-time problem

2024-10-15 Thread Alexander Lakhin

Hello Tom,

14.10.2024 21:25, Tom Lane wrote:

Attached are rebased and renumbered 0006-0008, mostly to keep the
cfbot happy.  We could actually stop here, if we were feeling lazy,
but now that I've done the work I'm inclined to push forward with
the rest.

The rest is just memory leak removal, and I suspect that nobody really
cares that much about small leakage in the preprocessor: you'd have to
be running some darn big files through it to notice.  FTR, here are
the total leaks reported by valgrind for running the ecpg regression
tests, using code like


Maybe you would like to fix in passing several (not new) defects, I've
found while playing with ecpg under Valgrind:
echo "
  EXEC SQL DECLARE cur1 CURSOR FOR stmt1;
" > t.pgc

valgrind  .../preproc/ecpg ... t.pgc

==831888== Conditional jump or move depends on uninitialised value(s)
==831888==    at 0x10C7B0: main (ecpg.c:490)
==831888==
char_array.pgc:2: WARNING: cursor "cur1" has been declared but not opened

Another case:
  EXEC SQL DECLARE cur_1 CURSOR FOR stmt_1;
  EXEC SQL FETCH cur_1 INTO :f1[[i];

==1335775==
==1335775== Conditional jump or move depends on uninitialised value(s)
==1335775==    at 0x121294: find_variable (variable.c:211)
==1335775==    by 0x11D661: base_yyparse (preproc.y:9749)
==1335775==    by 0x10C78F: main (ecpg.c:483)
==1335775==
==1335775== Conditional jump or move depends on uninitialised value(s)
==1335775==    at 0x121299: find_variable (variable.c:211)
==1335775==    by 0x11D661: base_yyparse (preproc.y:9749)
==1335775==    by 0x10C78F: main (ecpg.c:483)
==1335775==
==1335775== Invalid read of size 1
==1335775==    at 0x12128B: find_variable (variable.c:211)
==1335775==    by 0x11D661: base_yyparse (preproc.y:9749)
==1335775==    by 0x10C78F: main (ecpg.c:483)
==1335775==  Address 0x4e3bc80 is 0 bytes after a block of size 8,208 alloc'd
==1335775==    at 0x4848899: malloc (vg_replace_malloc.c:381)
==1335775==    by 0x120585: mm_alloc (util.c:87)
==1335775==    by 0x12065A: loc_alloc (util.c:151)
==1335775==    by 0x120701: loc_strdup (util.c:172)
==1335775==    by 0x10D9EC: base_yylex_location (parser.c:261)
==1335775==    by 0x10D4A1: filtered_base_yylex (parser.c:75)
==1335775==    by 0x114CA4: base_yyparse (preproc.c:39316)
==1335775==    by 0x10C78F: main (ecpg.c:483)
==1335775==
declare.pgc:2: ERROR: variable "f1" is not declared

One more case:
  EXEC SQL BEGIN DECLARE SECTION;
    int i;
  EXEC SQL END DECLARE SECTION;

  EXEC SQL DECLARE C CURSOR FOR SELECT 1;
  {
    EXEC SQL FETCH 1 IN C INTO :i;
  }
  EXEC SQL MOVE BACKWARD 1 IN C;

==961441== Invalid read of size 1
==961441==    at 0x484FBD7: strcmp (vg_replace_strmem.c:924)
==961441==    by 0x11442F: add_additional_variables (preproc.y:470)
==961441==    by 0x117DEF: base_yyparse (preproc.y:3548)
==961441==    by 0x10C78F: main (ecpg.c:483)
==961441==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

Best regards,
Alexander




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-10-15 Thread Alexander Lakhin

Hello Peter,

11.09.2024 10:37, Peter Eisentraut wrote:


This has been committed.



I've discovered that starting from 0785d1b8b,
make check -C src/bin/pg_combinebackup
fails under Valgrind, with the following diagnostics:
2024-10-15 14:29:52.883 UTC [3338981] 002_compare_backups.pl STATEMENT:  UPLOAD_MANIFEST
==00:00:00:20.028 3338981== Conditional jump or move depends on uninitialised value(s)
==00:00:00:20.028 3338981==    at 0xA3E68F: json_lex (jsonapi.c:1496)
==00:00:00:20.028 3338981==    by 0xA3ED13: json_lex (jsonapi.c:1666)
==00:00:00:20.028 3338981==    by 0xA3D5AF: pg_parse_json_incremental (jsonapi.c:822)
==00:00:00:20.028 3338981==    by 0xA40ECF: json_parse_manifest_incremental_chunk (parse_manifest.c:194)
==00:00:00:20.028 3338981==    by 0x31656B: FinalizeIncrementalManifest (basebackup_incremental.c:237)
==00:00:00:20.028 3338981==    by 0x73B4A4: UploadManifest (walsender.c:709)
==00:00:00:20.028 3338981==    by 0x73DF4A: exec_replication_command (walsender.c:2185)
==00:00:00:20.028 3338981==    by 0x7C58C3: PostgresMain (postgres.c:4762)
==00:00:00:20.028 3338981==    by 0x7BBDA7: BackendMain (backend_startup.c:107)
==00:00:00:20.028 3338981==    by 0x6CF60F: postmaster_child_launch (launch_backend.c:274)
==00:00:00:20.028 3338981==    by 0x6D546F: BackendStartup (postmaster.c:3415)
==00:00:00:20.028 3338981==    by 0x6D2B21: ServerLoop (postmaster.c:1648)
==00:00:00:20.028 3338981==

(Initializing
    dummy_lex.inc_state = NULL;
before
    partial_result = json_lex(&dummy_lex);
makes these TAP tests pass for me.)

Best regards,
Alexander




Re: replace strtok()

2024-10-15 Thread Alexander Lakhin

Hello Ranier and Peter,

15.10.2024 14:44, Ranier Vilela wrote:



Em ter., 15 de out. de 2024 às 03:45, Peter Eisentraut  
escreveu:


Thanks for the analysis.  I think moreover we *only* need to check the
"stringp" for NULL, not the return value of strsep(), which would never
be NULL in our case.  So I propose the attached patch as a variant of yours.

I'm not 100% sure, but the contrib passwordcheck uses and It's not very safe.
The checks of NULL return are cheap, and will protect unwary users.

So I'm neutral here.


I also wonder if other places touched by 5d2e1cc11 need corrections too.
I played with
PG_COLOR=always PG_COLORS="error=01;31" .../initdb

and it looks like this free() call in pg_logging_init():
        char       *colors = strdup(pg_colors_env);

        if (colors)
        {
...
            while ((token = strsep(&colors, ":")))
            {
...
            }

            free(colors);
        }
gets NULL in colors (strsep() has advanced the pointer to NULL by then, so the strdup'd string is leaked).
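
(One possible shape of a fix, purely for illustration and mirroring the snippet above, would be to remember the pointer returned by strdup() so that it can still be freed after strsep() has consumed the string.)

        char       *colors = strdup(pg_colors_env);

        if (colors)
        {
            char       *colors_start = colors;  /* keep the original allocation */
            char       *token;

            while ((token = strsep(&colors, ":")))
            {
                /* ... parse the "name=value" token ... */
            }

            free(colors_start);     /* free(colors) would free NULL and leak */
        }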

Best regards,
Alexander

Re: Inconsistency between try_mergejoin_path and create_mergejoin_plan

2024-10-12 Thread Alexander Lakhin

Hello Richard and Tom,

04.09.2024 06:50, Richard Guo wrote:

I pushed this patch with the test case remaining, as it adds only a
minimal number of test cycles.  I explained in the commit message why
the test case is included in equivclass.sql rather than in join.sql.


While playing with the equivclass test, I've discovered that the next step
to define a complete set of operators in the test:
@@ -65,6 +65,7 @@
 procedure = int8alias1eq,
 leftarg = int8, rightarg = int8alias1,
 restrict = eqsel, join = eqjoinsel,
+commutator = =,
 merges
 );

produces an internal error:
ERROR:  XX000: operator 32312 is not a member of opfamily 1976
LOCATION:  get_op_opfamily_properties, lsyscache.c:149

pg_regress/equivclass BACKTRACE:
get_op_opfamily_properties at lsyscache.c:149:3
MJExamineQuals at nodeMergejoin.c:228:19
ExecInitMergeJoin at nodeMergejoin.c:1608:25
ExecInitNode at execProcnode.c:303:27
InitPlan at execMain.c:964:14

Maybe the error itself is not that unexpected, but I'm confused by a
comment above the function:
 * Caller should already have verified that opno is a member of opfamily,
 * therefore we raise an error if the tuple is not found.
 */
void
get_op_opfamily_properties(Oid opno, Oid opfamily, bool ordering_op,
   int *strategy,
   Oid *lefttype,
   Oid *righttype)
{
    HeapTuple   tp;
    Form_pg_amop amop_tup;

    tp = SearchSysCache3(AMOPOPID,
 ObjectIdGetDatum(opno),
 CharGetDatum(ordering_op ? AMOP_ORDER : AMOP_SEARCH),
 ObjectIdGetDatum(opfamily));
    if (!HeapTupleIsValid(tp))
    elog(ERROR, "operator %u is not a member of opfamily %u",
 opno, opfamily);

This behavior is reproducible on commit a33cf1041, dated 2007-01-23, which
added "ALTER OPERATOR FAMILY", but I think it can also be reproduced on
earlier commits with manual catalog editing. (The comment was added by
a78fcfb51 from 2006-12-23, which introduced operator families.)

Best regards,
Alexander




Re: replace strtok()

2024-10-09 Thread Alexander Lakhin

Hello Peter,

23.07.2024 15:38, Peter Eisentraut wrote:

This has been committed.  Thanks.


Please look at the SCRAM secret, which breaks parse_scram_secret(),
perhaps because strsep() doesn't return NULL where strtok() did:

CREATE ROLE r PASSWORD 
'SCRAM-SHA-256$4096:hpFyHTUsSWcR7O9P$LgZFIt6Oqdo27ZFKbZ2nV+=vtnYM995pDh9ca6WSi120qVV5NeluNfUPkwm7Vqat25RjSPLkGeoZBQs6wVv+um4=';


Core was generated by `postgres: law regression [local] CREATE ROLE 
 '.
Program terminated with signal SIGSEGV, Segmentation fault.

#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
(gdb) bt
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
#1  0x563625e9e5b0 in parse_scram_secret (...) at auth-scram.c:655

Best regards,
Alexander




Re: Testing autovacuum wraparound (including failsafe)

2024-10-08 Thread Alexander Lakhin

Hello Masahiko-san,

01.12.2023 05:14, Masahiko Sawada wrote:

FYI I've configured the buildfarm animal perentie to run regression
tests including xid_wraparound:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=perentie&br=HEAD


Please look at a failure produced by perentie recently: [1].

I've analyzed all the available detailed perentie logs (starting from
2024-04-04) and got the following durations of the pg_ctl stop operation
at the end of 001_emergency_vacuum.pl (from the
module-xid_wraparound-check stage): see perentie-timings.txt and
perentie-timings.png attached. So it looks like perentie needs a larger
PGCTLTIMEOUT for the test (maybe 180 seconds would work?).
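
(For illustration, the timeout could be raised like this for a local run; on a buildfarm animal the equivalent would presumably go through build_env in its configuration, which is an assumption on my part.)

export PGCTLTIMEOUT=180
PG_TEST_EXTRA="xid_wraparound" make -s check -C src/test/modules/xid_wraparound/ PROVE_FLAGS="--timer"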

Though it's not clear to me why this test takes so long on that animal,
even when it succeeds. For example, [2] shows:
[09:28:23] t/001_emergency_vacuum.pl .. ok   225894 ms ( 0.00 usr 0.00 sys +  
0.31 cusr  0.43 csys =  0.74 CPU)
[09:30:53] t/002_limits.pl  ok   150291 ms ( 0.00 usr 0.00 sys +  
1.85 cusr  1.50 csys =  3.35 CPU)
[09:49:33] t/003_wraparounds.pl ... ok  1119766 ms ( 0.00 usr 0.00 sys +  
1.68 cusr  2.39 csys =  4.07 CPU)

While what I'm seeing locally on my Fedora 40 VM is:
PG_TEST_EXTRA="xid_wraparound" make -s check -C src/test/modules/xid_wraparound/ 
PROVE_FLAGS="--timer"
# +++ tap check in src/test/modules/xid_wraparound +++
[04:41:56] t/001_emergency_vacuum.pl .. ok    18852 ms ( 0.01 usr 0.00 sys +  
0.14 cusr  0.28 csys =  0.43 CPU)
[04:42:15] t/002_limits.pl  ok    18539 ms ( 0.00 usr 0.00 sys +  
0.72 cusr  0.88 csys =  1.60 CPU)
[04:42:34] t/003_wraparounds.pl ... ok    74368 ms ( 0.00 usr 0.00 sys +  
0.82 cusr  2.57 csys =  3.39 CPU)

Also, maybe it would make sense to run this test for REL_17_STABLE too, as
dodo has not been with us since 2024-09-04, and I don't know whether there are
other animals running these tests (having xid_wraparound in PG_TEST_EXTRA).

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=perentie&dt=2024-10-05%2000%3A00%3A14
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=perentie&dt=2024-10-06%2000%3A00%3A13

Best regards,
Alexander

dt=2024-04-04 00:02:21-waiting for server to shut down.. done
dt=2024-04-05 00:00:07-waiting for server to shut 
down done
dt=2024-04-06 00:00:07-waiting for server to shut 
down. done
dt=2024-04-07 00:00:05-waiting for server to shut 
down
 done
dt=2024-04-08 00:02:47-waiting for server to shut 
down done
dt=2024-04-09 00:02:43-waiting for server to shut 
down done
dt=2024-04-10 00:00:05-waiting for server to shut 
down... done
dt=2024-04-11 00:00:06-waiting for server to shut down.. done
dt=2024-04-12 00:00:06-waiting for server to shut 
down.. done
dt=2024-04-13 00:00:07-waiting for server to shut 
down... done
dt=2024-04-14 00:00:06-waiting for server to shut 
down. done
dt=2024-04-15 00:02:48-waiting for server to shut 
down. done
dt=2024-04-16 00:00:07-waiting for server to shut 
down... done
dt=2024-04-17 00:02:33-waiting for server to shut 
down... done
dt=2024-04-18 00:00:06-waiting for server to shut 
down done
dt=2024-04-19 00:00:05-waiting for server to shut 
down done
dt=2024-04-20 00:00:06-waiting for server to shut 
down... done
dt=2024-04-21 00:00:05-waiting for server to shut 
down done
dt=2024-04-22 00:00:05-waiting for server to shut 
down... done
dt=2024-04-23 00:00:06-waiting for server to shut 
down done
dt=2024-04-24 00:00:05-waiting for server to shut 
down done
dt=2024-04-25 00:02:47-waiting for server to shut 
down... done
dt=2024-04-26 00:00:06-waiting for server to shut 
down.. done
dt=2024-04-27 00:00:06-waiting for server to shut 
down... done
dt=2024-04-28 00:00:05-waiting for server to shut 
down... done
dt=2024-04-29 00:02:31-waiting for server to shut 
down. done
dt=2024-05-01 00:00:03-waiting for server to shut 
down... done
dt=2024-05-02 00:02:36-waiting for server to shut 
down. done
dt=202

Re: IPC::Run accepts bug reports

2024-10-08 Thread Alexander Lakhin

Hello Noah,

04.10.2024 21:57, Noah Misch wrote:

That makes sense.  Would you file this at
https://github.com/cpan-authors/IPC-Run/issues?  I suppose that code should
become roughly:

   do { $r = POSIX::read(...) } while (!defined($r) && $!{EINTR});
   croak ... unless defined($r);



Just for reference: I've filed the bug report at:
https://github.com/cpan-authors/IPC-Run/issues/176

Best regards,
Alexander




Re: IPC::Run accepts bug reports

2024-10-04 Thread Alexander Lakhin

Hello Noah,

16.06.2024 02:48, Noah Misch wrote:

I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
affecting PostgreSQL.  If you know of IPC::Run defects, please report them.
If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on
it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175
NetBSD-10-specific behavior coping.


It looks like a recent indri failure [1] revealed one more IPC::Run
anomaly. The failure log contains:
# Running: pgbench -n -t 1 -Dfoo=bla -Dnull=null -Dtrue=true -Done=1 -Dzero=0.0 -Dbadtrue=trueXXX 
-Dmaxint=9223372036854775807 -Dminint=-9223372036854775808 -M prepared -f 
/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_error_sleep_undefined_variable

[22:38:14.887](0.014s) ok 362 - pgbench script error: sleep undefined variable 
status (got 2 vs expected 2)
[22:38:14.887](0.000s) ok 363 - pgbench script error: sleep undefined variable 
stdout /(?^:processed: 0/1)/
[22:38:14.887](0.000s) not ok 364 - pgbench script error: sleep undefined 
variable stderr /(?^:sleep: undefined variable)/
[22:38:14.887](0.000s)
[22:38:14.887](0.000s) #   Failed test 'pgbench script error: sleep undefined variable stderr /(?^:sleep: undefined 
variable)/'

#   at t/001_pgbench_with_server.pl line 1242.
[22:38:14.887](0.000s) #   ''
# doesn't match '(?^:sleep: undefined variable)'

So the pgbench process exited as expected and stdout contained the expected
string, but stderr happened to be empty.

Maybe such behavior is specific to macOS, and even on indri it's the
only failure of that ilk out of 2000+ runs (and I couldn't reproduce
this in a Linux VM), but I find this place in IPC::Run suspicious:
sub _read {
...
    my $r = POSIX::read( $_[0], $s, 10_000 );
    croak "$!: read( $_[0] )" if not($r) and !$!{EINTR};

That is, EINTR is recognized as an expected error of sorts, but there is no
retry in this case. Thus, with the following modification, which simulates
read() failing with EINTR:
 sub _read {
 confess 'undef' unless defined $_[0];
 my $s = '';
-    my $r = POSIX::read( $_[0], $s, 10_000 );
+    my $r;
+if (int(rand(100)) == 0)
+{
+   $r = 0;  $! = Errno::EINTR;
+}
+else
+{
+    $r = POSIX::read( $_[0], $s, 10_000 );
+}
 croak "$!: read( $_[0] )" if not($r) and !$!{EINTR};

I can see failures like the one in question when running that test.

Perhaps I could reproduce the issue with a program that sends signals
to running (pgbench) processes (and thus interrupts read()), if that makes
sense.

What do you think?

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2024-10-02%2002%3A34%3A16

Best regards,
Alexander




Re: query_id, pg_stat_activity, extended query protocol

2024-10-03 Thread Alexander Lakhin

Hello Michael,

02.10.2024 06:29, Michael Paquier wrote:

On Wed, Oct 02, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote:

Fortunately, it's still pretty warm here, so I'm wearing T-shirt and my
sleeve isn't long enough for that, but if you gave me 2-3 days, I would
focus on researching this area...

Sure, thanks.  I am also spending a few days thinking about patterns
around that before doing anything.  The buildfarm is green, so there
is some flexibility.


I've managed to falsify the Assert in ExecutorRun() with the following:
SET compute_query_id = 'off';
SET track_activities = 'off';

CREATE PROCEDURE p1()
LANGUAGE plpgsql
AS $$
BEGIN
    PERFORM 1;
END;
$$;

CREATE PROCEDURE p2(x int)
LANGUAGE plpgsql
AS $$
BEGIN
  IF x = 1 THEN
    SET track_activities = 'on';
    SET compute_query_id = 'on';
  END IF;
  CALL p1();
END;
$$;

CALL p2(0);

CALL p2(1);

TRAP: failed Assert("!IsQueryIdEnabled() || !pgstat_track_activities || !debug_query_string || pgstat_get_my_query_id() 
!= 0"), File: "execMain.c", Line: 312, PID: 3765791


Best regards,
Alexander




Re: query_id, pg_stat_activity, extended query protocol

2024-10-01 Thread Alexander Lakhin

Hello Michael,

02.10.2024 03:52, Michael Paquier wrote:

Alexander, I've thought about a couple of fancy cases for
ExecutorRun() but I could not break it.  Perhaps you have something in
your sleeve that would also break this case?
--


Fortunately, it's still pretty warm here, so I'm wearing T-shirt and my
sleeve isn't long enough for that, but if you gave me 2-3 days, I would
focus on researching this area...

Best regards,
Alexander




Re: query_id, pg_stat_activity, extended query protocol

2024-10-01 Thread Alexander Lakhin

Hello Michael,

01.10.2024 05:07, Michael Paquier wrote:

On Mon, Sep 30, 2024 at 10:07:55AM +0900, Michael Paquier wrote:
...
And done this part.


If I'm not missing something, all the patches discussed here are committed
now, so maybe I've encountered a new anomaly.

Please try the following script:
BEGIN;
PREPARE s AS SELECT 1;
SELECT $1 \bind 1 \g
EXECUTE s;

It produces for me:
TRAP: failed Assert("!IsQueryIdEnabled() || !pgstat_track_activities || !debug_query_string || pgstat_get_my_query_id() 
!= 0"), File: "execMain.c", Line: 423, PID: 1296466


Best regards,
Alexander




promote request not received timely on slow Windows machines

2024-10-01 Thread Alexander Lakhin

Hello hackers,

Windows animal drongo reiterated a failure of 002_archiving_standby.pl [1]
recently [2] and this pushed me to run the issue to ground.

The log of the failed test contains:
[17:11:11.519](0.001s) ok 3 - recovery_end_command not executed yet
### Promoting node "standby"
# Running: pg_ctl -D 
C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\data/t_002_archiving_standby_data/pgdata 
-l C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\log/002_archiving_standby.log promote
waiting for server to promote... stopped waiting

pg_ctl: server did not promote in time
[17:20:06.095](534.576s) Bail out!  command "pg_ctl -D 
C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\data/t_002_archiving_standby_data/pgdata 
-l C:\\prog\\bf\\root\\REL_17_STABLE\\pgsql.build/testrun/recovery/002_archiving\\log/002_archiving_standby.log promote" 
exited with value 1


That is, the test failed because the pg_ctl timeout (180 seconds in this case)
was reached. The standby log file contains:
2024-09-29 17:11:10.319 UTC [6408:3] LOG:  recovery restart point at 0/3028BF8
2024-09-29 17:11:10.319 UTC [6408:4] DETAIL:  Last completed transaction was at 
log time 2024-09-29 17:10:57.783965+00.
The system cannot find the file specified.
2024-09-29 17:11:10.719 UTC [7440:5] 002_archiving.pl LOG: disconnection: session time: 0:00:00.488 user=pgrunner 
database=postgres host=127.0.0.1 port=62549

The system cannot find the file specified.
...
The system cannot find the file specified.
2024-09-29 17:20:08.237 UTC [6684:4] LOG:  received immediate shutdown request
The system cannot find the file specified.
The system cannot find the file specified.

but it doesn't contain the expected log message (which I see when the test
passes successfully):
LOG:  received promote request

I reproduced this failure on a slowed-down Windows VM and found that
StartupProcTriggerHandler(), which should set promote_signaled, might not be
called for a long time because of missing interrupt checks.

Namely, the only place where interrupts are checked during WAL processing
in this case is inside WaitForWALToBecomeAvailable():
    if (!TimestampDifferenceExceeds(last_fail_time, now,
wal_retrieve_retry_interval))
    {
...
    elog(LOG, "waiting for WAL to become available at 
%X/%X",
 LSN_FORMAT_ARGS(RecPtr));
and
    (void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch, 
...);
...
                    }

(elog() -> errfinish() -> CHECK_FOR_INTERRUPTS();
WaitLatch() -> WaitEventSetWait() -> WaitEventSetWaitBlock() ->
 pgwin32_dispatch_queued_signals())

With log_min_messages = DEBUG2, other places appear, e.g.:
    elog(DEBUG2, "switched WAL source from %s to %s after %s",

and the test works reliably, but by default the only message logged
there is "waiting for WAL...".

So if the timestamp difference constantly exceeds
wal_retrieve_retry_interval, which is set to 100ms in the test,
CHECK_FOR_INTERRUPTS() is not called at all.
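
(Purely as an illustration of the gap, not a proposed patch: an explicit check in the retry path would dispatch the queued signal even when the elog()/WaitLatch() branch is skipped; whether this is the right place for a fix is a separate question.)

	/* hypothetical addition inside the retry loop of WaitForWALToBecomeAvailable() */
	CHECK_FOR_INTERRUPTS();		/* on Windows this dispatches queued signals,
								 * letting StartupProcTriggerHandler() run */

	if (!TimestampDifferenceExceeds(last_fail_time, now,
									wal_retrieve_retry_interval))
	{
		/* existing code: elog(LOG, "waiting for WAL ...") and WaitLatch(...) */
	}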

I had also wondered why this test has failed only on Windows so far, while we
have even slower machines running Linux or FreeBSD. So I've run the test
on Linux with wal_retrieve_retry_interval decreased to 1ms and a debug
message added inside StartupProcTriggerHandler(), and got:
2024-10-01 07:33:22.301 UTC [3107649] LOG: !!!StartupProcTriggerHandler
2024-10-01 07:33:22.301 UTC [3107649] BACKTRACE:
StartupProcTriggerHandler at startup.c:96:19
wrapper_handler at pqsignal.c:110:2
__restore_rt at ??:0
__GI___wait4 at wait4.c:30:10
do_system at system.c:171:11
RestoreArchivedFile at xlogarchive.c:177:7
existsTimeLineHistory at timeline.c:235:3
findNewestTimeLine at timeline.c:277:6
rescanLatestTimeLine at xlogrecovery.c:4115:14
WaitForWALToBecomeAvailable at xlogrecovery.c:3674:10
...

Thus, signals are processed inside the system() call on this platform, but
this apparently is not so on Windows.


[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-07-26%2013%3A50%3A45
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-29%2015%3A30%3A27

Best regards,
Alexander




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-10-01 Thread Alexander Lakhin

Hello Tom and Jelte,

31.08.2024 07:04, Alexander Lakhin wrote:

I've tested your fix with the modification I proposed upthread:
 idle_session_timeout_enabled = false;
 }
+if (rand() % 10 == 0) pg_usleep(1);
 /*
  * (5) disable async signal conditions again.

and can confirm that the issue is gone. On 8749d850f~1, the test failed
on iterations 3, 3, 12 for me, but on current REL_17_STABLE, 100 test
iterations succeeded.



One month later, treehopper has found a way to break that test: [1].
The failure log contains:
2024-09-30 19:34:31.347 UTC [3201034:13] fdw_retry_check LOG: execute 
: DECLARE c2 CURSOR FOR
    SELECT count(*) FROM ((("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (TRUE)) INNER JOIN "S 1"."T 1" r4 ON (TRUE)) 
INNER JOIN "S 1"."T 1" r6 ON (TRUE))

2024-09-30 19:34:31.464 UTC [3201033:10] pg_regress/query_cancel ERROR:  
canceling statement due to statement timeout
2024-09-30 19:34:31.464 UTC [3201033:11] pg_regress/query_cancel STATEMENT:  SELECT count(*) FROM ft1 a CROSS JOIN ft1 b 
CROSS JOIN ft1 c CROSS JOIN ft1 d;

2024-09-30 19:34:31.466 UTC [3201035:1] [unknown] LOG:  connection received: 
host=[local]
2024-09-30 19:34:31.474 UTC [3201034:14] fdw_retry_check LOG: statement: FETCH 
100 FROM c2
2024-09-30 19:35:01.485 UTC [3201033:12] pg_regress/query_cancel WARNING:  could not get result of cancel request due to 
timeout


It looks like this time the cancel request arrived at the remote backend
while it was processing FETCH, presumably at the DoingCommandRead stage.
I've managed to reproduce the issue with the additional modification:
@@ -1605,7 +1605,10 @@ postgresIterateForeignScan(ForeignScanState *node)
  * first call after Begin or ReScan.
  */
 if (!fsstate->cursor_exists)
+{
 create_cursor(node);
+pg_usleep(10);
+}

With postgres_fdw/Makefile modified to repeat the query_cancel test, I get:
ok 13    - query_cancel  245 ms
not ok 14    - query_cancel    30258 ms
ok 15    - query_cancel  249 ms
...
ok 19    - query_cancel  236 ms
not ok 20    - query_cancel    30258 ms
ok 21    - query_cancel  225 ms
..
not ok 33    - query_cancel    30272 ms
1..33
# 3 of 33 tests failed.

(Please find attached the complete patch.)

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=treehopper&dt=2024-09-30%2019%3A21%3A14

Best regards,
Alexander

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index b9fa699305..3e1260e125 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -16,7 +16,7 @@ SHLIB_LINK_INTERNAL = $(libpq)
 EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql postgres_fdw--1.0--1.1.sql
 
-REGRESS = postgres_fdw query_cancel
+REGRESS = postgres_fdw query_cancel $(shell printf 'query_cancel %.0s' `seq 31`)
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/postgres_fdw.c 
b/contrib/postgres_fdw/postgres_fdw.c
index fc65d81e21..9190fbb3cb 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1605,7 +1605,10 @@ postgresIterateForeignScan(ForeignScanState *node)
 * first call after Begin or ReScan.
 */
if (!fsstate->cursor_exists)
+{
create_cursor(node);
+pg_usleep(10);
+}
 
/*
 * Get some more tuples, if we've run out.
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a750dc800b..c281547e22 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4717,6 +4717,10 @@ PostgresMain(const char *dbname, const char *username)
idle_session_timeout_enabled = false;
}
 
+if (rand() % 10 == 0)
+{
+pg_usleep(1);
+}
/*
 * (5) disable async signal conditions again.
 *


Re: Improving tracking/processing of buildfarm test failures

2024-10-01 Thread Alexander Lakhin

Hello hackers,

Please take a look at the September report on buildfarm failures:
# SELECT br, count(*) FROM failures WHERE dt >= '2024-09-01' AND
 dt < '2024-10-01' GROUP BY br;
REL_12_STABLE: 12
REL_13_STABLE: 11
REL_14_STABLE: 12
REL_15_STABLE: 22
REL_16_STABLE: 35
REL_17_STABLE: 19
master: 33
-- Total: 144
(Counting test failures only, excluding indent-check, Configure, Build
errors.)

# SELECT COUNT(*) FROM (SELECT DISTINCT issue_link FROM failures WHERE
 dt >= '2024-09-01' AND dt < '2024-10-01');
16

# SELECT issue_link, count(*) FROM failures WHERE dt >= '2024-09-01' AND
 dt < '2024-10-01' GROUP BY issue_link ORDER BY 2 DESC LIMIT 6;
https://www.postgresql.org/message-id/ZuOKOut5hhDlf_bP@nathan: 51
-- Fixed

https://www.postgresql.org/message-id/20240924224352.93.nmi...@google.com: 31
-- Fixed

https://www.postgresql.org/message-id/a9a97e83-9ec8-5de5-bf69-80e9560f5...@gmail.com: 8
-- An environmental issue? Still not fixed

https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986b...@gmail.com: 5


https://www.postgresql.org/message-id/ca+hiwqhzknkoocutauzlmawxek-t+bgcmct4nm6qeng0xrh...@mail.gmail.com: 3
-- Fixed

https://www.postgresql.org/message-id/ZuOKOut5hhDlf_bP%40nathan: 3


# SELECT count(*) FROM failures WHERE dt >= '2024-09-01' AND
 dt < '2024-10-01' AND issue_link IS NULL; -- Unsorted/unhelpful failures
28

Short-lived failures: 91

Also please find attached the updated script I use to check the failures.

Best regards,
Alexander

check-bf-failures.sh
Description: application/shellscript


Re: msys inet_pton strangeness

2024-09-30 Thread Alexander Lakhin

Hello Andrew and Thomas,

29.09.2024 18:47, Andrew Dunstan wrote:


I'm inclined to think we might need to reverse the order of the last two. TBH I don't really understand how this has 
worked up to now.




I've looked at the last successful run [1] and discovered that
fe-secure-common.c didn't compile cleanly either:
ccache gcc ... /home/pgrunner/bf/root/REL_15_STABLE/pgsql.build/../pgsql/src/interfaces/libpq/fe-secure-common.c
C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql/src/interfaces/libpq/fe-secure-common.c: In function 'pq_verify_peer_name_matches_certificate_ip':
C:/tools/nmsys64/home/pgrunner/bf/root/REL_15_STABLE/pgsql/src/interfaces/libpq/fe-secure-common.c:219:21: warning: implicit declaration of function 'inet_pton'; did you mean 'inet_aton'? [-Wimplicit-function-declaration]

  219 | if (inet_pton(AF_INET6, host, &addr) == 1)
  | ^
  | inet_aton

So it worked only because the missing declaration generated just a
warning, not an error.

30.09.2024 01:28, Thomas Munro wrote:

Just an idea...

--- a/src/include/port/win32.h
+++ b/src/include/port/win32.h
@@ -16,7 +16,7 @@
   * get support for GetLocaleInfoEx() with locales. For everything else
   * the minimum version is Windows XP (0x0501).
   */
-#if defined(_MSC_VER) && _MSC_VER >= 1900
+#if !defined(_MSC_VER) || _MSC_VER >= 1900
  #define MIN_WINNT 0x0600
  #else
  #define MIN_WINNT 0x0501


This change works for me in the msys case. I have no VS 2013 on hand to
test the other branch, but it looks like HAVE_INET_PTON is set to 1
unconditionally in src/tools/msvc/Solution.pm, so we would probably stumble
upon the same issue with _MSC_VER = 1800. What if we just set
MIN_WINNT to 0x0600 for REL_15_STABLE? Or maybe it would make sense to get
that old Visual Studio and recheck?

The other question that I still have is: where do we expect to get the system
_WIN32_WINNT from? As far as I can see, in the fe-secure-common.c case we
have the following include chain:
#include "postgres_fe.h"
    #include "c.h" // no other includes above
        #include "postgres_ext.h"
            #include "pg_config_ext.h"
            ...
            #include "pg_config.h"
            #include "pg_config_manual.h"    /* must be after pg_config.h */
            #include "pg_config_os.h"        /* must be before any system 
header files */
                // checks _WIN32_WINNT:
                #if defined(_WIN32_WINNT) && _WIN32_WINNT < MIN_WINNT

So if pg_config_os.h is really included before any system headers,
checking _WIN32_WINNT makes sense only when that define is passed with
-D_WIN32_WINNT, no?

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2024-09-19%2023%3A10%3A10&stg=build

Best regards,
Alexander




Re: msys inet_pton strangeness

2024-09-29 Thread Alexander Lakhin

29.09.2024 18:47, Andrew Dunstan wrote:

Yeah, src/include/port/win32/sys/socket.h has:

#include 
#include 
#include 

I'm inclined to think we might need to reverse the order of the last two. TBH I don't really understand how this has 
worked up to now.




As far as I can see, in my environment _WIN32_WINNT is defined with
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x603
#endif

inside C:/msys64/ucrt64/include/_mingw.h, which can be included
 from C:/msys64/ucrt64/include/corecrt.h:10,
 from C:/msys64/ucrt64/include/crtdefs.h:10,
 from ../../../src/include/pg_config_os.h:40,
 from ../../../src/include/c.h:56,
 from ../../../src/include/postgres_fe.h:25,
 from fe-secure-common.c:20

or (if HAVE_CRTDEFS_H is not defined):
 from C:/msys64/ucrt64/include/corecrt.h:10,
 from C:/msys64/ucrt64/include/corecrt_stdio_config.h:10,
 from C:/msys64/ucrt64/include/stdio.h:9,
 from ../../../src/include/c.h:59,
 from ../../../src/include/postgres_fe.h:25,
 from fe-secure-common.c:20

or (if winsock2.h included directly):
 from C:/msys64/ucrt64/include/windows.h:9,
 from C:/msys64/ucrt64/include/winsock2.h:23

so including winsock2.h is sufficient to include _mingw.h, but it doesn't
redefine _WIN32_WINNT, unfortunately.

Best regards,
Alexander

Re: msys inet_pton strangeness

2024-09-28 Thread Alexander Lakhin

Hello Thomas and Andrew,

28.09.2024 23:52, Thomas Munro wrote:

On Sun, Sep 29, 2024 at 6:26 AM Andrew Dunstan  wrote:

We should have included ws2tcpip.h, which includes this:

#define InetPtonA inet_pton
WINSOCK_API_LINKAGE INT WSAAPI InetPtonA(INT Family, LPCSTR pStringBuf, PVOID 
pAddr);

It's conditioned on (_WIN32_WINNT >= 0x0600), but that should be true.

Can you print out the value to be sure?  I can't imagine they'd set it
lower themselves or make it go backwards in an upgrade, but perhaps
it's somehow not being set at all, and then we do:

#if defined(_MSC_VER) && _MSC_VER >= 1900
#define MIN_WINNT 0x0600
#else
#define MIN_WINNT 0x0501
#endif

In 16 we don't do that anymore, we just always set it to 0x0A00
(commit 495ed0ef2d72).  And before 15, we didn't want that function
yet (commit c1932e542863).


FWIW, I'm observing the same here.
For a trivial test.c (compiled with the same command line as
fe-secure-common.c) like:
"===_WIN32"
_WIN32;
"===_WIN32_WINNT";
_WIN32_WINNT;

with gcc -E (from mingw-w64-ucrt-x86_64-gcc 14.2.0-1), I get:
"===_WIN32"
1;
"===_WIN32_WINNT";
_WIN32_WINNT;

That is, _WIN32_WINNT is not defined, but with #include  above,
I see:
"===_WIN32_WINNT";
0x603

With #include "postgres_fe.h" (as in fe-secure-common.c) I get:
"===_WIN32_WINNT";
0x0501;
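
(For reference, the expansions above can be inspected with something like the command below; the real compile used the same flags as for fe-secure-common.c, which are omitted here.)

gcc -E test.c | grep -v '^#'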

Best regards,
Alexander




Re: query_id, pg_stat_activity, extended query protocol

2024-09-25 Thread Alexander Lakhin

Hello Michael and Sami,

18.09.2024 08:46, Michael Paquier wrote:

So, I have applied 0001 down to 14, followed by 0002 on HEAD.



Please look at the script, which triggers Assert added by 24f520594:
(assuming shared_preload_libraries=pg_stat_statements)
SELECT repeat('x', 100) INTO t FROM generate_series(1, 10);
CREATE FUNCTION f() RETURNS int LANGUAGE sql IMMUTABLE RETURN 0;
CREATE INDEX ON t(f());

TRAP: failed Assert("!IsQueryIdEnabled() || pgstat_get_my_query_id() != 0"), File: 
"execMain.c", Line: 300, PID: 1288609
ExceptionalCondition at assert.c:52:13
ExecutorRun at execMain.c:302:6
postquel_getnext at functions.c:903:24
fmgr_sql at functions.c:1198:15
ExecInterpExpr at execExprInterp.c:746:8
ExecInterpExprStillValid at execExprInterp.c:2034:1
ExecEvalExprSwitchContext at executor.h:367:13
evaluate_expr at clauses.c:4997:14
evaluate_function at clauses.c:4505:1
simplify_function at clauses.c:4092:12
eval_const_expressions_mutator at clauses.c:2591:14
expression_tree_mutator_impl at nodeFuncs.c:3550:12
eval_const_expressions_mutator at clauses.c:3712:1
eval_const_expressions at clauses.c:2267:1
RelationGetIndexExpressions at relcache.c:5079:20
BuildIndexInfo at index.c:2426:7
...

Best regards,
Alexander




Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-23 Thread Alexander Lakhin

Hello Nathan,

20.09.2024 19:51, Nathan Bossart wrote:

Here's a (probably naive) attempt at fixing these, too.  I'll give each
path a closer look once it feels like we've identified all the bugs.


Thank you for the updated patch!

I tested it with two code modifications: the first makes each created
expression index TOASTed (by prepending 1M of spaces to the indexprs
value), and the second makes each created index an expression index (by
modifying index_elem_options in gram.y). Both modifications are kludgy, so
I don't dare to publish them. With them in place, I found no other
snapshot-related issues during `make check-world`.

Best regards,
Alexander




Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-19 Thread Alexander Lakhin

Hello Nathan,

19.09.2024 21:36, Nathan Bossart wrote:

On Thu, Sep 19, 2024 at 12:00:00PM +0300, Alexander Lakhin wrote:

completed with:
DROP INDEX CONCURRENTLY def_vec_quantizer_idx;

triggers an assertion failure:
TRAP: failed Assert("HaveRegisteredOrActiveSnapshot()"), File: 
"toast_internals.c", Line: 668, PID: 3723372

Ha, that was fast.  The attached patch seems to fix the assertion failures.
It's probably worth checking if any of the adjacent code paths are
affected, too.



Thank you for your attention to that issue!

I've found another two paths to reach that condition:
CREATE INDEX CONCURRENTLY ON def (vec_quantizer(id, :'b'));
ERROR:  cannot fetch toast data without an active snapshot

REINDEX INDEX CONCURRENTLY def_vec_quantizer_idx;
(or REINDEX TABLE CONCURRENTLY def;)
TRAP: failed Assert("HaveRegisteredOrActiveSnapshot()"), File: 
"toast_internals.c", Line: 668, PID: 2934502
ExceptionalCondition at assert.c:52:13
init_toast_snapshot at toast_internals.c:670:2
toast_delete_datum at toast_internals.c:429:60
toast_tuple_cleanup at toast_helper.c:303:30
heap_toast_insert_or_update at heaptoast.c:335:9
heap_update at heapam.c:3752:14
simple_heap_update at heapam.c:4210:11
CatalogTupleUpdate at indexing.c:324:2
index_concurrently_swap at index.c:1649:2
ReindexRelationConcurrently at indexcmds.c:4270:3
ReindexIndex at indexcmds.c:2962:1
ExecReindex at indexcmds.c:2884:4
ProcessUtilitySlow at utility.c:1570:22
...

Perhaps it would make sense to check all CatalogTupleUpdate(pg_index, ...)
calls (I've found 10 such instances, but haven't checked them yet).

Best regards,
Alexander




Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-19 Thread Alexander Lakhin

Hello Nathan,

18.09.2024 22:52, Nathan Bossart wrote:

Committed.  I waffled on whether to add a test for system indexes that used
pg_index's varlena columns, but I ended up leaving it out.  I've attached
it here in case anyone thinks we should add it.


I've discovered that Jonathan's initial script:
CREATE TABLE def (id int);
SELECT array_agg(n) b FROM generate_series(1,10_000) n \gset
CREATE OR REPLACE FUNCTION vec_quantizer (a int, b int[]) RETURNS bool
AS $$ SELECT true $$ LANGUAGE SQL IMMUTABLE;
CREATE INDEX ON def (vec_quantizer(id, :'b'));

completed with:
DROP INDEX CONCURRENTLY def_vec_quantizer_idx;

triggers an assertion failure:
TRAP: failed Assert("HaveRegisteredOrActiveSnapshot()"), File: 
"toast_internals.c", Line: 668, PID: 3723372

with the following stack trace:
ExceptionalCondition at assert.c:52:13
init_toast_snapshot at toast_internals.c:670:2
toast_delete_datum at toast_internals.c:429:60
toast_tuple_cleanup at toast_helper.c:303:30
heap_toast_insert_or_update at heaptoast.c:335:9
heap_update at heapam.c:3752:14
simple_heap_update at heapam.c:4210:11
CatalogTupleUpdate at indexing.c:324:2
index_set_state_flags at index.c:3522:2
index_concurrently_set_dead at index.c:1848:2
index_drop at index.c:2286:3
doDeletion at dependency.c:1362:5
deleteOneObject at dependency.c:1279:12
deleteObjectsInList at dependency.c:229:3
performMultipleDeletions at dependency.c:393:2
RemoveRelations at tablecmds.c:1594:2
ExecDropStmt at utility.c:2008:4
...

This class of assert failures is not new (see, e.g., bugs #13809, #18127),
but this concrete instance (with index_set_state_flags()) emerged with
b52c4fc3c and may be worth fixing while we're at it...

Best regards,
Alexander

Re: Should rolpassword be toastable?

2024-09-18 Thread Alexander Lakhin

Hello Tom and Nathan,

23.09.2023 21:00, Alexander Lakhin wrote:

23.09.2023 17:39, Tom Lane wrote:

I'm also now more than just slightly skeptical about whether
pg_database should have a toast table.  Has anybody tried,
say, storing a daticurules field wide enough to end up
out-of-line?


So for now only pg_authid is worthy of condemnation, AFAICS.



Let me remind you of this issue in light of b52c4fc3c.
Yes, it's the opposite case, but maybe it makes sense to fix it now, in the
hope that ~1 year of testing will bring something helpful for both changes.

Best regards,
Alexander




Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bind

2024-09-17 Thread Alexander Lakhin

Hello Michael and Anthonin,

22.08.2024 10:33, Michael Paquier wrote:

Looks OK to me.  I have spent more time double-checking the whole, and
it looks like we're there, so applied.  Now let's play with it in more
regression tests.  Note that the refactoring patch has been merged
with the original one in a single commit.


Please look at an assertion failure caused by \bind_named:
regression=# SELECT $1 \parse s
\bind_named s

regression=# \bind_named
\bind_named: missing required argument
regression=# 1 \g
psql: common.c:1501: ExecQueryAndProcessResults: Assertion `pset.stmtName != 
((void *)0)' failed.

Best regards,
Alexander




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-09-17 Thread Alexander Lakhin

17.09.2024 10:47, Alexander Korotkov wrote:

Yes, now I did reproduce.  I got that the problem could be that insert
LSN is not yet written at primary, thus wait_for_catchup doesn't wait
for it.  I've worked around that using pg_switch_wal().  The revised
patchset is attached.


Thank you for the revised patch!

The improved test works reliably for me (100 out of 100 runs passed).

Best regards,
Alexander




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-09-16 Thread Alexander Lakhin

Hi Alexander,

16.09.2024 21:55, Alexander Korotkov wrote:

Please find two patches attached.  The first one does minor cleanup
including misuse of words you've pointed.  The second one adds missing
wait_for_catchup().  That should fix the test failure you've spotted.
Please, check if it fixes an issue for you.


Thank you for looking at that!

Unfortunately, the issue is still here — the test failed for me 6 out of
10 runs, as below:
[05:14:02.807](0.135s) ok 8 - got error after standby promote
error running SQL: 'psql::1: ERROR:  recovery is not in progress
HINT:  Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=12734 host=/tmp/04hQ75NuXf dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL 
pg_wal_replay_wait('0/300F248');' at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.


043_wal_replay_wait_standby.log:
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl ERROR: recovery is 
not in progress
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl DETAIL:  Recovery ended before replaying target LSN 
2/570CD648; last replay LSN 0/300F210.

2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl STATEMENT:  CALL 
pg_wal_replay_wait('2/570CD648');
2024-09-17 05:14:02.714 UTC [1817155] LOG:  checkpoint starting: force
2024-09-17 05:14:02.714 UTC [1817154] LOG:  database system is ready to accept 
connections
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl LOG: statement: 
CALL pg_wal_replay_wait('0/300F248');
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl ERROR: recovery is 
not in progress

pg_waldump -p .../t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 
00010003
rmgr: Transaction len (rec/tot): 34/    34, tx:    748, lsn: 0/0300F1E8, prev 0/0300F1A8, desc: COMMIT 
2024-09-17 05:14:01.654874 UTC
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/0300F210, prev 0/0300F1E8, desc: RUNNING_XACTS 
nextXid 749 latestCompletedXid 748 oldestRunningXid 749


I wonder, can you reproduce this with that bgwriter modification?

I've also found two more "achievements" coined by 3c5db1d6b:
doc/src/sgml/func.sgml:   It may also happen that target lsn 
is not achieved
src/backend/access/transam/xlog.c-   * recovery was ended before achieving 
the target LSN.

Best regards,
Alexander




Re: Robocopy might be not robust enough for never-ending testing on Windows

2024-09-16 Thread Alexander Lakhin

Hello Thomas,

17.09.2024 04:01, Thomas Munro wrote:

On Mon, Sep 16, 2024 at 6:00 PM Alexander Lakhin  wrote:

So this leak looks like a recent and still existing defect.

 From my cartoon-like understanding of Windows, I would guess that if
event handles created by a program are leaked after it has exited, it
would normally imply that they've been duplicated somewhere else that
is still running (for example see the way that PostgreSQL's
dsm_impl_pin_segment() calls DuplicateHandle() to give a copy to the
postmaster, so that the memory segment continues to exist after the
backend exits), and if it's that, you'd be able to see the handle
count going up in the process monitor for some longer running process
somewhere (as seen in this report from the Chrome hackers[1]).  And if
it's not that, then I would guess it would have to be a kernel bug
because something outside userspace must be holding onto/leaking
handles.  But I don't really understand Windows beyond trying to debug
PostgreSQL at a distance, so my guesses may be way off.  If we wanted
to try to find a Windows expert to look at a standalone repro, does
your PS script work with *any* source directory, or is there something
about the initdb template, in which case could you post it in a .zip
file so that a non-PostgreSQL person could see the failure mode?

[1] 
https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/


That's very interesting reading. I'll try to research the issue that deeply
later (though I guess this case is different: after logging off and
logging in as another user, I can't see any processes belonging to the
first one, while those "Event objects" in the non-paged pool still occupy
memory), but finding a Windows expert who could perhaps look at
robocopy's sources would be good too (and more productive).

So, the repro we can show is:
rm -r c:\temp\source
mkdir c:\temp\source
for ($i = 1; $i -le 1000; $i++)
{
echo 1 > "c:\temp\source\$i"
}

for ($i = 1; $i -le 1000; $i++)
{
echo "iteration $i"
rm -r c:\temp\target
robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\source c:\temp\target
Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % PoolNonpagedBytes
}

It produces for me (on Windows 10 [Version 10.0.19045.4780]):
iteration 1
...
216887296
...
iteration 1000


--

   Total    Copied   Skipped  Mismatch    FAILED Extras
    Dirs : 1 1 0 0 0 0
   Files :  1000  1000 0 0 0 0
   Bytes : 7.8 k 7.8 k 0 0 0 0
   Times :   0:00:00   0:00:00   0:00:00 0:00:00


   Speed :   17660 Bytes/sec.
   Speed :   1.010 MegaBytes/min.
   Ended : Monday, September 16, 2024 8:58:09 PM

365080576

Just "touch c:\temp\source\$i" is not enough, files must be non-empty for
the leak to happen.

Best regards,
Alexander




Re: Robocopy might be not robust enough for never-ending testing on Windows

2024-09-15 Thread Alexander Lakhin

Hello Thomas,

14.09.2024 23:32, Thomas Munro wrote:

On Sun, Sep 15, 2024 at 1:00 AM Alexander Lakhin  wrote:

(That is, 0.1-0.2 MB leaks per one robocopy run.)

I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
robocopy v14393 on Windows 10 doesn't affect the issue.

I don't understand Windows but that seems pretty weird to me, as it
seems to imply that a driver or something fairly low level inside the
kernel is leaking objects (at least by simple minded analogies to
operating systems I understand better).  Either that or robocop.exe
has userspace stuff involving at least one thread still running
somewhere after it's exited, but that seems unlikely as I guess you'd
have noticed that...


Yes, I see no robocopy process left after the test, and I think userspace
threads would not survive logoff.


Just a thought: I was surveying the block cloning landscape across
OSes and filesystems while looking into clone-based CREATE DATABASE
(CF #4886) and also while thinking about the new TAP test initdb
template copy trick, is that robocopy.exe tries to use Windows' block
cloning magic, just like cp on recent Linux and FreeBSD systems (at
one point I was wondering if that was causing some funky extra flush
stalls on some systems, I need to come back to that...).  It probably
doesn't actually work unless you have Windows 11 kernel with DevDrive
enabled (from reading, no Windows here), but I guess it still probably
uses the new system interfaces, probably something like CopyFileEx().
Does it still leak if you use /nooffload or /noclone?


I tested the following (with the script above):
Windows 10 (Version 10.0.19045.4780):
robocopy.exe (10.0.19041.4717) /NOOFFLOAD
iteration 1
496611328
...
iteration 1000
609701888

That is, it leaks.

/NOCLONE is not supported by that robocopy version:
ERROR : Invalid Parameter #1 : "/NOCLONE"

Then, Windows 11 (Version 10.0.22000.613), robocopy 10.0.22000.469:
iteration 1
141217792
...
iteration 996
151670784
...
iteration 997
152817664
...
iteration 1000
151674880

That is, it doesn't leak.

robocopy.exe /NOOFFLOAD
iteration 1
152666112
...
iteration 1000
153341952

No leak.

/NOCLONE is not supported by that robocopy version.

Then I updated that Windows 11 to Version 10.0.22000.2538 (with KB5031358),
robocopy 10.0.22000.1516:
iteration 1
122753024
...
iteration 1000
244674560

It does leak.

robocopy /NOOFFLOAD
iteration 1
167522304
...
iteration 1000
283484160

It leaks as well.

Finally, I've installed newest Windows 11 Version 10.0.22631.4169, with
robocopy 10.0.22621.3672:
Non-paged pool increased from 133 to 380 MB after 1000 robocopy runs.

robocopy /OFFLOAD leaks too.

/NOCLONE is not supported by that robocopy version.

So this leak looks like a recent and still existing defect.

(Sorry for a delay, fighting with OS updates/installation took me a while.)

Best regards,
Alexander




Re: Robocopy might be not robust enough for never-ending testing on Windows

2024-09-14 Thread Alexander Lakhin

Hello Andrew,

14.09.2024 17:22, Andrew Dunstan wrote:


On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote:

While trying to reproduce inexplicable drongo failures (e. g., [1])
seemingly related to BackgroundPsql, I stumbled upon close, but not
the same issue. ...



Interesting, good detective work. Still, wouldn't this mean drongo would fail 
consistently?


Yes, I think drongo is suffering from another disease, we're yet to find
which.



I wonder why we're using robocopy instead of our own RecursiveCopy module?



AFAICS, robocopy is also used by regress.exe, so switching to the perl
module would require perl even for regress tests. I know that perl was
a requirement for MSVC builds, but maybe that's not so with meson...

Best regards,
Alexander




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-09-13 Thread Alexander Lakhin

Hi Alexander,

10.08.2024 20:18, Alexander Korotkov wrote:

On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov  wrote:

On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov  wrote:
...
Here is a revised version of the patchset.  I've fixed some typos,
identation, etc.  I'm going to push this once it passes cfbot.

The next revision of the patchset fixes uninitialized variable usage
spotted by cfbot.


When running check-world on a rather slow armv7 device, I came across the
043_wal_replay_wait.pl test failure:
t/043_wal_replay_wait.pl .. 7/? # Tests were run but no plan was 
declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 8.

regress_log_043_wal_replay_wait contains:
...
01234[21:58:56.370](1.594s) ok 7 - multiple LSN waiters reported consistent data
### Promoting node "standby"
# Running: pg_ctl -D .../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata -l 
.../src/test/recovery/tmp_check/log/043_wal_replay_wait_standby.log promote

waiting for server to promote done
server promoted
[21:58:56.637](0.268s) ok 8 - got error after standby promote
error running SQL: 'psql::1: ERROR:  recovery is not in progress
HINT:  Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=10228 host=/tmp/Ftj8qpTQht dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL 
pg_wal_replay_wait('0/300D0E8');' at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.


043_wal_replay_wait_standby.log contains:
2024-09-12 21:58:56.518 UTC [15220:1] [unknown] LOG:  connection received: 
host=[local]
2024-09-12 21:58:56.520 UTC [15220:2] [unknown] LOG:  connection authenticated: user="android" method=trust 
(.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.520 UTC [15220:3] [unknown] LOG:  connection authorized: user=android database=postgres 
application_name=043_wal_replay_wait.pl

2024-09-12 21:58:56.527 UTC [15220:4] 043_wal_replay_wait.pl LOG: statement: 
CALL pg_wal_replay_wait('2/570CB4E8');
2024-09-12 21:58:56.535 UTC [15123:7] LOG:  received promote request
2024-09-12 21:58:56.535 UTC [15124:2] FATAL:  terminating walreceiver process 
due to administrator command
2024-09-12 21:58:56.537 UTC [15123:8] LOG:  invalid record length at 0/300D0B0: 
expected at least 24, got 0
2024-09-12 21:58:56.537 UTC [15123:9] LOG:  redo done at 0/300D088 system usage: CPU: user: 0.01 s, system: 0.00 s, 
elapsed: 14.23 s

2024-09-12 21:58:56.537 UTC [15123:10] LOG:  last completed transaction was at 
log time 2024-09-12 21:58:55.322831+00
2024-09-12 21:58:56.540 UTC [15123:11] LOG:  selected new timeline ID: 2
2024-09-12 21:58:56.589 UTC [15123:12] LOG:  archive recovery complete
2024-09-12 21:58:56.590 UTC [15220:5] 043_wal_replay_wait.pl ERROR: recovery is 
not in progress
2024-09-12 21:58:56.590 UTC [15220:6] 043_wal_replay_wait.pl DETAIL:  Recovery ended before replaying target LSN 
2/570CB4E8; last replay LSN 0/300D0B0.

2024-09-12 21:58:56.591 UTC [15121:1] LOG:  checkpoint starting: force
2024-09-12 21:58:56.592 UTC [15220:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.075 user=android 
database=postgres host=[local]

2024-09-12 21:58:56.595 UTC [15120:4] LOG:  database system is ready to accept 
connections
2024-09-12 21:58:56.665 UTC [15227:1] [unknown] LOG:  connection received: 
host=[local]
2024-09-12 21:58:56.668 UTC [15227:2] [unknown] LOG:  connection authenticated: user="android" method=trust 
(.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.668 UTC [15227:3] [unknown] LOG:  connection authorized: user=android database=postgres 
application_name=043_wal_replay_wait.pl

2024-09-12 21:58:56.675 UTC [15227:4] 043_wal_replay_wait.pl LOG: statement: 
CALL pg_wal_replay_wait('0/300D0E8');
2024-09-12 21:58:56.677 UTC [15227:5] 043_wal_replay_wait.pl ERROR: recovery is 
not in progress
2024-09-12 21:58:56.677 UTC [15227:6] 043_wal_replay_wait.pl HINT: Waiting for 
LSN can only be executed during recovery.
2024-09-12 21:58:56.679 UTC [15227:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.015 user=android 
database=postgres host=[local]


Note that last replay LSN is 300D0B0, but the latter pg_wal_replay_wait
call wants to wait for 300D0E8.

pg_waldump -p 
src/test/recovery/tmp_check/t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 
00010003
shows:
rmgr: Heap    len (rec/tot): 59/    59, tx:    748, lsn: 0/0300D048, prev 0/0300D020, desc: INSERT off: 35, 
flags: 0x00, blkref #0: rel 1663/5/16384 blk 0
rmgr: Transaction len (rec/tot): 34/    34, tx:    748, lsn: 0/0300D088, prev 0/0300D048, desc: COMMIT 
2024-09-12 21:58:55.322831 UTC
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/0300D0B0, prev 0/0300D088, desc: RUNNING_XACTS 
nextXid 749 latestCompletedXid 748 oldestRunningXid 749

Re: Yet another way for pg_ctl stop to fail on Windows

2024-09-08 Thread Alexander Lakhin

07.09.2024 21:11, Noah Misch wrote:




Noah, what do you think of handling this error in line with handling of
ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?

I tried the following change:
     switch (GetLastError())
     {
     case ERROR_BROKEN_PIPE:
     case ERROR_BAD_PIPE:
+   case ERROR_PIPE_BUSY:
and saw no issues.

That would be a strict improvement over returning EINVAL like we do today.  We
do use PIPE_UNLIMITED_INSTANCES, so I expect the causes of ERROR_PIPE_BUSY are
process exit and ENOMEM-like situations.  While that change is the best thing
if the process is exiting, it could silently drop the signal in ENOMEM-like
situations.  Consider the following alternative.  If sig==0, just return 0
like you propose, because the process isn't completely gone.  Otherwise, sleep
and retry the signal, like pgwin32_open_handle() retries after certain errors.
What do you think of that?


Thank you for your attention to the issue!

I agree with your approach. It looks like Microsoft recommends looping on
ERROR_PIPE_BUSY [1] (they say "Calling CallNamedPipe is equivalent to
calling the CreateFile ..." at [2]).

So if we aim not only to fix "pg_ctl stop" but to make pgkill() robust,
that's the way to go, IMHO. I'm not sure about the infinite loop they show;
I'd vote for a loop with the same characteristics as in
pgwin32_open_handle().
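
For illustration only, here is a rough sketch of what such a bounded retry
around the CallNamedPipe() call in src/port/kill.c could look like. The
helper name, the retry limit of 10 and the 100 ms delay are assumptions
mirroring pgwin32_open_handle(), not tested code; Noah's sig == 0 shortcut
would stay in the caller, and on failure the caller would still inspect
GetLastError() as today:

/*
 * Hypothetical helper (sketch): retry the signal pipe while it reports
 * ERROR_PIPE_BUSY, up to 10 times with a 100 ms pause, similar to the
 * retry loop in pgwin32_open_handle().  Assumes <windows.h> (or the usual
 * PG win32 headers) is already included.
 */
static BOOL
call_signal_pipe_with_retry(const char *pipename, BYTE sigData,
							BYTE *sigRet, DWORD *bytes)
{
	for (int loops = 0;; loops++)
	{
		if (CallNamedPipe(pipename, &sigData, 1, sigRet, 1, bytes, 1000))
			return TRUE;		/* sent; caller checks the returned byte */

		if (GetLastError() != ERROR_PIPE_BUSY || loops >= 10)
			return FALSE;		/* give up; caller examines GetLastError() */

		Sleep(100);				/* back off briefly before retrying */
	}
}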

I've managed to trigger ERROR_PIPE_BUSY with "pg_ctl reload", when running
20 TAP tests (see attached) in parallel (with 20 threads executing
$node->reload() in each), and with the kill() call inside do_reload()
repeated x100 (see the modification patch attached too):
...
# Running: pg_ctl -D 
.../099_test_pgkill\data/t_099_test_pgkill_node_data/pgdata reload
### Reloading node "node"
# Running: pg_ctl -D 
.../099_test_pgkill\data/t_099_test_pgkill_node_data/pgdata reload
[13:41:46.850](2.400s) # 18
server signaled
server signaled
server signaled
server signaled
server signaled
server signaled
server signaled
server signaled
!!!pgkill| GetLastError(): 231
pg_ctl: could not send reload signal (PID: 3912), iteration: 81, res: -1, 
errno: 22: Invalid argument
server signaled
[13:41:52.594](5.744s) # 19
...

[1] https://learn.microsoft.com/en-us/windows/win32/ipc/named-pipe-client
[2] 
https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-callnamedpipea

Best regards,
Alexander

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index e7e878c22f..cab6eb9472 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -1167,12 +1174,17 @@ do_reload(void)
 		exit(1);
 	}
 
-	if (kill(pid, sig) != 0)
+for (int i = 1; i <= 100; i++)
+{
+int res, en;
+	if ((res = kill(pid, sig)) != 0)
 	{
-		write_stderr(_("%s: could not send reload signal (PID: %d): %m\n"),
-	 progname, (int) pid);
+en = errno;
+		write_stderr(_("%s: could not send reload signal (PID: %d), iteration: %d, res: %d, errno: %d: %m\n"),
+	 progname, (int) pid, i, res, en);
 		exit(1);
 	}
+}
 
 	print_msg(_("server signaled\n"));
 }
diff --git a/src/port/kill.c b/src/port/kill.c
index 412c2f19c1..b3d2c00796 100644
--- a/src/port/kill.c
+++ b/src/port/kill.c
@@ -70,6 +70,7 @@ pgkill(int pid, int sig)
 		return 0;
 	}
 
+int le = GetLastError();
 	switch (GetLastError())
 	{
 		case ERROR_BROKEN_PIPE:
@@ -89,6 +90,7 @@ pgkill(int pid, int sig)
 			errno = EPERM;
 			return -1;
 		default:
+fprintf(stderr, "!!!pgkill| GetLastError(): %d\n", le);
 			errno = EINVAL;		/* unexpected */
 			return -1;
 	}


099_test_pgkill.pl
Description: Perl program


Yet another way for pg_ctl stop to fail on Windows

2024-09-07 Thread Alexander Lakhin

Hello hackers,

While trying to reproduce a recent fairywren (a Windows animal) failure,
I ran amcheck/amcheck/003_cic_2pc in parallel inside a slowed-down
VM and came across another issue:
### Stopping node "CIC_2PC_test" using mode fast
# Running: pg_ctl -D C:\src\postgresql\build/testrun/amcheck_17/003_cic_2pc\data/t_003_cic_2pc_CIC_2PC_test_data/pgdata 
-m fast stop

waiting for server to shut down. failed
pg_ctl: server does not shut down
# pg_ctl stop failed: 256
# Postmaster PID for node "CIC_2PC_test" is 6048
[08:24:52.915](12.792s) Bail out!  pg_ctl stop failed

So "pg_ctl stop" failed due to not a timeout, but some other reason.

With extra logging added, I got:
### Stopping node "CIC_2PC_test" using mode fast
# Running: pg_ctl -D C:\src\postgresql\build/testrun/amcheck_3/003_cic_2pc\data/t_003_cic_2pc_CIC_2PC_test_data/pgdata 
-m fast stop

waiting for server to shut down..!!!pgkill| GetLastError(): 231
postmaster (9596) died untimely? res: -1, errno: 22
 failed

Thus, CallNamedPipe() in pgkill() returned ERROR_PIPE_BUSY (All pipe
instances are busy) and it was handled as an unexpected error.
(The error code 231 was returned in 10 out of 10 failures of this ilk for
me.)

Noah, what do you think of handling this error in line with handling of
ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?

I tried the following change:
    switch (GetLastError())
    {
    case ERROR_BROKEN_PIPE:
    case ERROR_BAD_PIPE:
+   case ERROR_PIPE_BUSY:
and saw no issues.

The reason I'd like to bring the issue to your attention (if you don't
mind) is that it's impossible to understand the cause of such a false
failure if it happens in the buildfarm/CI.

Best regards,
Alexander




Re: On disable_cost

2024-09-06 Thread Alexander Lakhin

Hello Richard,

06.09.2024 12:51, Richard Guo wrote:

Ah I see.  label_sort_with_costsize is only used to label the Sort
node nicely for EXPLAIN, and usually we do not display the cost
numbers in regression tests.


In fact, I see the error with the following (EXPLAIN-less) query:
create table t (x int);

select * from t natural inner join
(select * from (values(1)) v(x)
  union all
 select 1 from t t1 full join t t2 using (x),
   t t3 full join t t4 using (x)
);

2024-09-06 10:01:48.034 UTC [696535:5] psql LOG:  statement: select * from t 
natural inner join
    (select * from (values(1)) v(x)
  union all
 select 1 from t t1 full join t t2 using (x),
   t t3 full join t t4 using (x)
    );
createplan.c:5457:6: runtime error: 4.99254e+09 is outside the range of 
representable values of type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior createplan.c:5457:6 in

(An UBSan-enabled build --with-blocksize=32 is required for this query to
trigger the failure.)

Best regards,
Alexander




Re: On disable_cost

2024-09-06 Thread Alexander Lakhin

Hello Robert,

21.08.2024 17:29, Robert Haas wrote:

I went ahead and committed these patches. ...


Please take a look at the following code:
static void
label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
{
...
    cost_sort(&sort_path, root, NIL,
  lefttree->total_cost,
  plan->plan.disabled_nodes,
  lefttree->plan_rows,
  lefttree->plan_width,
  0.0,
  work_mem,
  limit_tuples);

Given the cost_sort() declaration:
void
cost_sort(Path *path, PlannerInfo *root,
  List *pathkeys, int input_disabled_nodes,
  Cost input_cost, double tuples, int width,
  Cost comparison_cost, int sort_mem,
  double limit_tuples)

Aren't the input_disabled_nodes and input_cost arguments swapped in the
above call?

(I've discovered this with UBSan, which complained
createplan.c:5457:6: runtime error: 4.40465e+09 is outside the range of 
representable values of type 'int'
while executing a query with a large estimated cost.)
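
If so, a call matching the declared parameter order would presumably look
like this (shown only for illustration, with the surrounding code unchanged):

    cost_sort(&sort_path, root, NIL,
              plan->plan.disabled_nodes,	/* input_disabled_nodes */
              lefttree->total_cost,		/* input_cost */
              lefttree->plan_rows,
              lefttree->plan_width,
              0.0,
              work_mem,
              limit_tuples);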

Best regards,
Alexander




Re: GUC names in messages

2024-09-04 Thread Alexander Lakhin

Hello Peter,

[ sorry for the kind of off-topic ]

17.05.2024 14:57, Peter Eisentraut wrote:

I committed your 0001 and 0002 now, with some small fixes.

There has also been quite a bit of new code, of course, since you posted your patches, so we'll probably find a few 
more things that could use adjustment.


I'd be happy to consider the rest of your patch set after beta1 and/or for PG18.


While translating messages, I've encountered a weird message, updated by
17974ec25:
    printf(_("(in \"wal_sync_method\" preference order, except fdatasync is Linux's 
default)\n"));

Does "except ..." make sense here or it's just a copy-and-paste from docs?:
    The default is the first method in the above list that is supported
    by the platform, except that fdatasync is the 
default on
    Linux and FreeBSD.

Best regards,
Alexander




Re: v17 vs v16 performance comparison

2024-09-02 Thread Alexander Lakhin

Hello Thomas,

02.08.2024 12:00, Alexander Lakhin wrote:







I had paid attention to:
Best pg-src-17--.* worse than pg-src-16--.* by 57.9 percents (225.11 > 142.52): 
pg_tpcds.query15
Average pg-src-17--.* worse than pg-src-16--.* by 55.5 percents (230.57 > 
148.29): pg_tpcds.query15
in May and performed `git bisect` for this degradation, which led me to commit
b7b0f3f27 [1].

This time I bisected the following anomaly:
Best pg-src-17--.* worse than pg-src-16--.* by 23.6 percents (192.25 > 155.58): 
pg_tpcds.query21
Average pg-src-17--.* worse than pg-src-16--.* by 25.1 percents (196.19 > 
156.85): pg_tpcds.query21
and to my surprise I got "b7b0f3f27 is the first bad commit".

Moreover, bisecting of another anomaly:
Best pg-src-17--.* worse than pg-src-16--.* by 24.2 percents (24269.21 > 
19539.89): pg_tpcds.query72
Average pg-src-17--.* worse than pg-src-16--.* by 24.2 percents (24517.66 > 
19740.12): pg_tpcds.query72
pointed at the same commit again.

...

But besides that, I've found a separate regression. Bisecting for this 
degradation:
Best pg-src-17--.* worse than pg-src-16--.* by 105.0 percents (356.63 > 
173.96): s64da_tpcds.query95
Average pg-src-17--.* worse than pg-src-16--.* by 105.2 percents (357.79 > 
174.38): s64da_tpcds.query95
pointed at f7816aec2.




Meanwhile I've bisected another degradation:
Best pg-src-17--.* worse than pg-src-16--.* by 11.3 percents (7.17 > 6.44): 
job.query6f
and came to the commit b7b0f3f27 again.


Now that the unfairness in all-cached parallel seq scan is eliminated (with
3ed3683618), I've re-run the same performance tests and got new results
(please see attached). As we can see, the aforementioned pg_tpcds.query72
got better:
  2024-05-15  2024-07-30 2024-09-03
pg-src-16--1   20492.58    19669.34    19913.32
pg-src-17--1   25286.10    24269.21    20654.95
pg-src-16--2   20769.88    19539.89    20429.72
pg-src-17--2   25771.90    24530.51    21244.92
pg-src-17--3   25978.55    24753.25    20904.09
pg-src-16--3   20943.10    20011.13    20086.61

We can also see the improvement of pg_tpcds.query16, but not on all runs:
  2024-05-15  2024-07-30 2024-09-03
pg-src-16--1 105.36   94.31   97.74
pg-src-17--1 145.74  145.53  145.51
pg-src-16--2 101.82   98.36   96.63
pg-src-17--2 140.07  146.90   96.93
pg-src-17--3 154.89  148.11  106.18
pg-src-16--3 101.03  100.94   93.44

So it looks like now we see the same instability that we observed before
([1]).

Unfortunately, the troublesome tpcds.query15 hasn't produced good numbers
this time either:
  2024-05-15  2024-07-30 2024-09-03
pg-src-16--1 153.41  142.52  142.54
pg-src-17--1 229.84  225.11  212.51
pg-src-16--2 153.47  150.13  149.37
pg-src-17--2 236.34  227.15  232.73
pg-src-17--3 236.43  239.46  233.77
pg-src-16--3 151.03  152.23  144.90

From a bird's-eye view, the new v17-vs-v16 comparison has only 87 "worse"
entries, while the previous one had 115 (this requires deeper analysis, of
course, but still...).

[1] 
https://www.postgresql.org/message-id/d1fb5c09-dd03-2540-9ec2-86dbfdfa2c65%40gmail.com

Best regards,
Alexander

[Attachment: "Postgres benchmarking results", an HTML report for the run
started 2024-09-02T09:38:29 and ended 2024-09-03T03:22:21, comparing
PostgreSQL 16.4 (REL_16_STABLE/2015dd5c90) and PostgreSQL 17beta3
(REL_17_STABLE/3ed3683618) instances across pgbench_native,
pgbench_reference, pg_tpch and other benchmark suites.]

Re: Typos in the code and README

2024-09-02 Thread Alexander Lakhin

Hello,

12.08.2024 14:59, David Rowley wrote:

(I know Daniel mentioned he'd get to these, but the ScanDirection one
was my fault and I needed to clear that off my mind. I did a few
others while on this topic.)


Thank you, David, for working on that!

I've gathered another bunch of defects with the possible substitutions.
Please take a look:
adapated -> adapted

becasue -> because

cancelled -> canceled (introduced by 90f517821, but see 8c9da1441)

cange -> change

comand -> command

CommitTSSLRU -> CommitTsSLRU (introduced by 53c2a97a9; maybe the fix
 should be back-patched...)

connectOptions2 -> pqConnectOptions2 (see 774bcffe4)

Injections points -> Injection points

jsetate -> jsestate

LockShmemSize -> remove the sentence? (added by ec0baf949, outdated with
 a794fb068)

MaybeStartSlotSyncWorker -> LaunchMissingBackgroundProcesses (the logic to
 start B_SLOTSYNC_WORKER moved from the former to the latter function with
 3354f8528)

multixact_member_buffer -> multixact_member_buffers

per_data_data -> per_buffer_data (see code below the comment; introduced by
 b5a9b18cd)

per_buffer_private -> remove the function declaration? (the duplicate
 declaration was added by a858be17c)

performancewise -> performance-wise? (coined by a7f107df2)

pgstat_add_kind -> pgstat_register_kind (see 7949d9594)

pg_signal_autovacuum -> pg_signal_autovacuum_worker (see d2b74882c)

recoveery -> recovery

RegisteredWorker -> RegisteredBgWorker

RUNNING_XACT -> RUNNING_XACTS

sanpshot -> snapshot

TypeEntry -> TypeCacheEntry (align with AttoptCacheEntry, from the same
 commit 40064a8ee)

The corresponding patch is attached for your convenience.

Best regards,
Alexander

diff --git a/contrib/test_decoding/specs/skip_snapshot_restore.spec b/contrib/test_decoding/specs/skip_snapshot_restore.spec
index 3f1fb6f02c7..7b35dbcc9f3 100644
--- a/contrib/test_decoding/specs/skip_snapshot_restore.spec
+++ b/contrib/test_decoding/specs/skip_snapshot_restore.spec
@@ -39,7 +39,7 @@ step "s2_get_changes_slot0" { SELECT data FROM pg_logical_slot_get_changes('slot
 # serializes consistent snapshots to the disk at LSNs where are before
 # s0-transaction's commit. After s0-transaction commits, "s1_init" resumes but
 # must not restore any serialized snapshots and will reach the consistent state
-# when decoding a RUNNING_XACT record generated after s0-transaction's commit.
+# when decoding a RUNNING_XACTS record generated after s0-transaction's commit.
 # We check if the get_changes on 'slot1' will not return any s0-transaction's
 # changes as its confirmed_flush_lsn will be after the s0-transaction's commit
 # record.
diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 9bc23a9a938..af7864a1b5b 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3891,8 +3891,8 @@ static const PgStat_KindInfo custom_stats = {
  it with pgstat_register_kind and a unique ID used to
  store the entries related to this type of statistics:
 
-extern PgStat_Kind pgstat_add_kind(PgStat_Kind kind,
-   const PgStat_KindInfo *kind_info);
+extern PgStat_Kind pgstat_register_kind(PgStat_Kind kind,
+const PgStat_KindInfo *kind_info);
 
  While developing a new extension, use
  PGSTAT_KIND_EXPERIMENTAL for
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index a03d56541d0..8c37d7eba76 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2017,7 +2017,7 @@ check_multixact_offset_buffers(int *newval, void **extra, GucSource source)
 }
 
 /*
- * GUC check_hook for multixact_member_buffer
+ * GUC check_hook for multixact_member_buffers
  */
 bool
 check_multixact_member_buffers(int *newval, void **extra, GucSource source)
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 91f0fd6ea3e..b2457f121a7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -382,7 +382,7 @@ RefreshMatViewByOid(Oid matviewOid, bool is_create, bool skipData,
 	 * command tag is left false in cmdtaglist.h. Otherwise, the change of
 	 * completion tag output might break applications using it.
 	 *
-	 * When called from CREATE MATERIALIZED VIEW comand, the rowcount is
+	 * When called from CREATE MATERIALIZED VIEW command, the rowcount is
 	 * displayed with the command tag CMDTAG_SELECT.
 	 */
 	if (qc)
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index d9cf9e7d75e..d7065726749 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -369,7 +369,7 @@ pg_wal_replay_wait(PG_FUNCTION_ARGS)
 	 */
 	InvalidateCatalogSnapshot();
 
-	/* Give up if there is still an active or registered sanpshot. */
+	/* Give up if there is still an active or registered snapshot. */
 	if (GetOldestSnapshot())
 		ereport(ERROR,
 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
diff --git a/src/backe

Re: Improving tracking/processing of buildfarm test failures

2024-09-01 Thread Alexander Lakhin

Hello hackers,

Please take a look at the August report on buildfarm failures:
# SELECT br, count(*) FROM failures WHERE dt >= '2024-08-01' AND
 dt < '2024-09-01' GROUP BY br;
REL_12_STABLE: 2
REL_13_STABLE: 2
REL_14_STABLE: 12
REL_15_STABLE: 3
REL_16_STABLE: 5
REL_17_STABLE: 17
HEAD: 38
-- Total: 79
(Counting test failures only, excluding indent-check, Configure, Build
errors.)

# SELECT COUNT(*) FROM (SELECT DISTINCT issue_link FROM failures WHERE
 dt >= '2024-08-01' AND dt < '2024-09-01');
21

# SELECT issue_link, count(*) FROM failures WHERE dt >= '2024-08-01' AND
 dt < '2024-09-01' GROUP BY issue_link ORDER BY 2 DESC LIMIT 6;
https://www.postgresql.org/message-id/8ce8261a-bf3a-25e6-b473-4808f50a6ea7%40gmail.com:
 13
-- An environmental issue; fixed

https://www.postgresql.org/message-id/a9a97e83-9ec8-5de5-bf69-80e9560f5...@gmail.com:
 9
-- An environmental issue?; probably fixed

https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319...@gmail.com:
 7
-- Fixed

https://www.postgresql.org/message-id/c720cdc3-5ce0-c410-4854-70788175c...@gmail.com:
 6
-- Expected to be fixed with Release 18 of the buildfarm client

https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986b...@gmail.com:
 5

https://www.postgresql.org/message-id/3618203.1722473...@sss.pgh.pa.us: 4
-- Fixed

# SELECT count(*) FROM failures WHERE dt >= '2024-08-01' AND
 dt < '2024-09-01' AND issue_link IS NULL; -- Unsorted/unhelpful failures
13

Short-lived failures: 21

There were also two mysterious, never-before-seen failures, both of which
occurred on POWER animals:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kingsnake&dt=2024-08-19%2019%3A17%3A59
 - REL_17_STABLE
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2024-08-29%2013%3A57%3A57
 - REL_15_STABLE

(I'm not sure yet whether they should be considered "unhelpful". I'll
wait for more information from these animals/the buildfarm in general to
determine what to do with these failures.)

Best regards,
Alexander




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-08-30 Thread Alexander Lakhin

Hello Tom,

30.08.2024 23:55, Tom Lane wrote:

Pushed with that addition and some comment-tweaking.  We'll see
whether that actually makes things more stable, but I don't think
it could make it worse.


Thank you for fixing that issue!

I've tested your fix with the modification I proposed upthread:
 idle_session_timeout_enabled = false;
 }
+if (rand() % 10 == 0) pg_usleep(1);
 /*
  * (5) disable async signal conditions again.

and can confirm that the issue is gone. On 8749d850f~1, the test failed
on iterations 3, 3, 12 for me, but on current REL_17_STABLE, 100 test
iterations succeeded.

At the same time, mylodon confirmed my other finding at [1] and failed [2] with:
-ERROR:  canceling statement due to statement timeout
+ERROR:  canceling statement due to user request

[1] 
https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319466%40gmail.com
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-08-30%2023%3A03%3A31

Best regards,
Alexander




Re: configure failures on chipmunk

2024-08-29 Thread Alexander Lakhin

Hello Heikki,

26.08.2024 07:00, Alexander Lakhin wrote:

Could you also take a look at the kerberos setup on chipmunk? Now that
chipmunk goes beyond configure (on HEAD and REL_17_STABLE), it fails on the
kerberosCheck stage, ...


I see that chipmunk turned green.

Thank you for taking care of that!

Best regards,
Alexander




Re: Streaming read-ready sequential scan code

2024-08-29 Thread Alexander Lakhin

Hello Thomas,

29.08.2024 01:16, Thomas Munro wrote:

Yeah.  That's quite interesting, and must destabilise that
simple-minded demo.  I'm curious to know exactly what contention is
causing that (about 3/4 of a millisecond that I don't see and now I
want to know what it's waiting for), but it's a very crude test
lacking timer resolution in the earlier messages, and it's an
unrelated topic and a distraction.  Perhaps it explains why you saw
two different behaviours in Q15 with the patch and I didn't, though.
Really it shouldn't be so sensitive to such variations, it's obviously
a terrible plan, and TPC-DS needs a planner hacker mega-brain applied
to it; I'm going to try to nerd-snipe one...


I looked at two perf profiles of such out-of-sync processes and found no
extra calls whatsoever in the slow one; it just has the number of perf
samples increased proportionally. That made me suspect CPU frequency
scaling... Indeed, with the "performance" governor set and the boost mode
disabled, I'm now seeing much more stable numbers (I do this tuning before
running performance tests, but I had forgotten about it when I ran that
test of yours, my bad).

I'm sorry for the noise and the distraction.

Still, now I can confirm your results. Without the patch, two parallel
workers gave "Buffers: shared hit=217 / Buffers: shared hit=226" 10 times
out of 10. And with the patch, I've got
"shared hit=219 / shared hit=224" 3 times,
"shared hit=220 / shared hit=223" 4 times,
"shared hit=221 / shared hit=222" 3 times of 10.

On b7b0f3f27~1, my results are:
"shared hit=219 / shared hit=224": 2
"shared hit=220 / shared hit=223": 3
"shared hit=221 / shared hit=222": 4
"shared hit=218 / shared hit=225": 1

Best regards,
Alexander




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-08-28 Thread Alexander Lakhin

Hello Alvaro,

Let me show you another related anomaly, which drongo kindly discovered
recently: [1]. That test failed with:
 SELECT dblink_cancel_query('dtest1');
- dblink_cancel_query
--
- OK
+   dblink_cancel_query
+--
+ cancel request timed out
 (1 row)

I've managed to reproduce this when running 20 dblink tests in parallel,
and with extra logging added (see attached) I've got:
...
2024-08-28 10:17:12.949 PDT [8236:204] pg_regress/dblink LOG: statement: SELECT 
dblink_cancel_query('dtest1');
!!!PQcancelPoll|8236| conn->status: 2
!!!PQcancelPoll|8236| conn->status: 3
!!!PQconnectPoll|8236| before pqPacketSend(..., &cancelpacket, ...)
!!!pqPacketSend|8236| before pqFlush
!!!pqsecure_raw_write|8236| could not send data to server: Socket is not 
connected (0x2749/10057)
!!!pqPacketSend|8236| after pqFlush, STATUS_OK
!!!PQconnectPoll|8236| after pqPacketSend, STATUS_OK
2024-08-28 10:17:12.950 PDT [5548:7] pg_regress LOG:  statement: select * from foo 
where f1 < 3
2024-08-28 10:17:12.951 PDT [8692:157] DEBUG:  forked new backend, pid=4644 
socket=5160
2024-08-28 10:17:12.973 PDT [4644:1] [unknown] LOG:  connection received: 
host=::1 port=55073
2024-08-28 10:17:12.973 PDT [4644:2] [unknown] LOG: !!!BackendInitialize| 
before ProcessSSLStartup()
!!!PQcancelPoll|8236| conn->status: 4
!!!PQcancelPoll|8236| conn->status: 4
2024-08-28 10:17:24.060 PDT [1436:1] DEBUG:  snapshot of 0+0 running transaction ids (lsn 0/194C4E0 oldest xid 780 
latest complete 779 next xid 780)

!!!PQcancelPoll|8236| conn->status: 4
2024-08-28 10:17:42.951 PDT [4644:3] [unknown] LOG: !!!BackendInitialize| 
ProcessSSLStartup() returned -1
2024-08-28 10:17:42.951 PDT [4644:4] [unknown] DEBUG: shmem_exit(0): 0 
before_shmem_exit callbacks to make
...

Thus, pqsecure_raw_write(), called via PQcancelPoll() -> PQconnectPoll() ->
pqPacketSend() -> pqFlush) -> pqSendSome() -> pqsecure_write(), returned
the WSAENOTCONN error, but it wasn't noticed at upper levels.
Consequently, the cancelling backend waited for the cancel packet that was
never sent.

The first commit on which I could reproduce this test failure is 2466d6654.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-26%2021%3A35%3A04

Best regards,
Alexander

diff --git a/src/backend/tcop/backend_startup.c b/src/backend/tcop/backend_startup.c
index cfa2755196..fc9abbfe53 100644
--- a/src/backend/tcop/backend_startup.c
+++ b/src/backend/tcop/backend_startup.c
@@ -252,8 +252,10 @@ BackendInitialize(ClientSocket *client_sock, CAC_state cac)
 	RegisterTimeout(STARTUP_PACKET_TIMEOUT, StartupPacketTimeoutHandler);
 	enable_timeout_after(STARTUP_PACKET_TIMEOUT, AuthenticationTimeout * 1000);
 
+elog(LOG, "!!!BackendInitialize| before ProcessSSLStartup()", status);
 	/* Handle direct SSL handshake */
 	status = ProcessSSLStartup(port);
+elog(LOG, "!!!BackendInitialize| ProcessSSLStartup() returned %d", status);
 
 	/*
 	 * Receive the startup packet (which might turn out to be a cancel request
diff --git a/src/interfaces/libpq/fe-cancel.c b/src/interfaces/libpq/fe-cancel.c
index 213a6f43c2..8482dfa3e8 100644
--- a/src/interfaces/libpq/fe-cancel.c
+++ b/src/interfaces/libpq/fe-cancel.c
@@ -209,6 +209,7 @@ PQcancelPoll(PGcancelConn *cancelConn)
 	PGconn	   *conn = &cancelConn->conn;
 	int			n;
 
+fprintf(stderr, "!!!PQcancelPoll|%d| conn->status: %d\n", getpid(), conn->status);
 	/*
 	 * We leave most of the connection establishment to PQconnectPoll, since
 	 * it's very similar to normal connection establishment. But once we get
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 360d9a4547..72a31d29ed 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3407,12 +3407,14 @@ keep_going:		/* We will come back to here until there is
 	cancelpacket.cancelRequestCode = (MsgType) pg_hton32(CANCEL_REQUEST_CODE);
 	cancelpacket.backendPID = pg_hton32(conn->be_pid);
 	cancelpacket.cancelAuthCode = pg_hton32(conn->be_key);
+fprintf(stderr, "!!!PQconnectPoll|%d| before pqPacketSend(..., &cancelpacket, ...)\n", getpid());
 	if (pqPacketSend(conn, 0, &cancelpacket, packetlen) != STATUS_OK)
 	{
 		libpq_append_conn_error(conn, "could not send cancel packet: %s",
 SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
 		goto error_return;
 	}
+fprintf(stderr, "!!!PQconnectPoll|%d| after pqPacketSend, STATUS_OK\n", getpid());
 	conn->status = CONNECTION_AWAITING_RESPONSE;
 	return PGRES_POLLING_READING;
 }
@@ -5012,9 +5014,13 @@ pqPacketSend(PGconn *conn, char pack_type,
 	if (pqPutMsgEnd(conn))
 		return STATUS_ERROR;
 
+if (buf_len == 12)
+fprintf(stderr, "!!!pqPacketSend|%d| before pqFlush\n", getpid());
 	/* Flush to ensure backend gets it. */
 	if (pqFlush(conn))
 		return STATUS_ERROR;
+if (buf_len == 12)
+fprintf(stderr, "!!!pqPacketSend|%d| after pqFlush, STATUS_OK\n", getpid());
 
 	return STATUS

Re: Streaming read-ready sequential scan code

2024-08-28 Thread Alexander Lakhin

28.08.2024 07:45, Thomas Munro wrote:

Huh.  I can reproduce what I showed with pretty low variance, on my
FreeBSD, Linux and macOS systems.  I included
parallel_leader_participation=off so that the workers would hopefully
start as closely together in time as possible, and hopefully allow
only a block or so of variation in the outcome.  If I do:

create or replace function f(i int) returns int language plpgsql
parallel safe as $$
begin
   raise notice '% pid %', clock_timestamp(), pg_backend_pid();
   return i;
end;
$$;

then

postgres=# explain (analyze) select f(i) from t limit 1;
NOTICE:  2024-08-28 16:41:32.845747+12 pid 47019
NOTICE:  2024-08-28 16:41:32.845746+12 pid 47018

shows start times differ by only a few microseconds at most, often 0.
I wonder if your system is more variable at beginning execution?
Maybe you're on a multi-socket system and forking/startup times vary
in some way I can't see on small systems or something like that?


I use a single-socket system with AMD Ryzen 5900X, running Ubuntu 20.04.
With no active background tasks running, I'm seeing:
NOTICE:  2024-08-28 08:17:36.917162+00 pid 320103
NOTICE:  2024-08-28 08:17:36.917163+00 pid 320102

NOTICE:  2024-08-28 08:17:40.592333+00 pid 320143
NOTICE:  2024-08-28 08:17:40.592645+00 pid 320144

With log_min_messages = DEBUG3, I get:
NOTICE:  2024-08-28 08:41:59.309364+00 pid 338263
NOTICE:  2024-08-28 08:41:59.310079+00 pid 338264

And the following messages in the server.log:
2024-08-28 08:41:59.304 UTC [338251] DEBUG:  starting background worker process 
"parallel worker for PID 338262"
2024-08-28 08:41:59.304 UTC [338251] DEBUG:  starting background worker process 
"parallel worker for PID 338262"
2024-08-28 08:41:59.305 UTC [338263] DEBUG:  InitPostgres
2024-08-28 08:41:59.305 UTC [338264] DEBUG:  InitPostgres
2024-08-28 08:41:59.309 UTC [338263] NOTICE:  2024-08-28 08:41:59.309364+00 pid 
338263
2024-08-28 08:41:59.309 UTC [338263] CONTEXT:  PL/pgSQL function f(integer) 
line 3 at RAISE
2024-08-28 08:41:59.309 UTC [338262] NOTICE:  2024-08-28 08:41:59.309364+00 pid 
338263
2024-08-28 08:41:59.309 UTC [338262] CONTEXT:  PL/pgSQL function f(integer) 
line 3 at RAISE
    parallel worker
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(0): 5 before_shmem_exit 
callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(0): 6 on_shmem_exit 
callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  proc_exit(0): 1 callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  exit(0)
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(-1): 0 
before_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(-1): 0 on_shmem_exit 
callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  proc_exit(-1): 0 callbacks to make
2024-08-28 08:41:59.310 UTC [338264] NOTICE:  2024-08-28 08:41:59.310079+00 pid 
338264
2024-08-28 08:41:59.310 UTC [338264] CONTEXT:  PL/pgSQL function f(integer) 
line 3 at RAISE

It looks like the two parallel workers were started simultaneously, but
then the second one lagged behind...

Best regards,
Alexander




Re: Streaming read-ready sequential scan code

2024-08-27 Thread Alexander Lakhin

Hello Thomas,

27.08.2024 09:52, Thomas Munro wrote:

Here's a really simple way to see the new unfairness at the end of a
parallel scan:

drop table if exists t;
create table t (i int);
insert into t select generate_series(1, 10);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;

On my machine, unpatched master shows:

  Worker 0:  actual time=0.036..12.452 rows=51076 loops=1
Buffers: shared hit=226
  Worker 1:  actual time=0.037..12.003 rows=48924 loops=1
Buffers: shared hit=217

The attached patch, which I'd like to push, is effectively what
Alexander tested (blocknums[16] -> blocknums[1]).  There's no point in
using an array of size 1, so I've turned it into a simple variable and
deleted the relevant comments.  My machine shows:

  Worker 0:  actual time=0.038..12.115 rows=49946 loops=1
Buffers: shared hit=221
  Worker 1:  actual time=0.038..12.109 rows=50054 loops=1
Buffers: shared hit=222

That difference may not seem huge, but other pre-existing things are
going pathologically wrong in the reported query that magnify it (see
my earlier analysis).  It's an interesting problem that will require
more study (my earlier analysis missed a detail that I'll write about
separately), but it doesn't seem to be new or have easy fixes, so that
will have to be for later work.


I've tried your query and could not get stable results, unfortunately.
The following script:
rm -rf "$PGDATA"; initdb -D "$PGDATA" >initdb.log 2>&1

pg_ctl -s -l server.log start

cat << EOF | psql | grep 'Parallel Seq Scan' -A10 | grep 'Worker ' -A1
create table t (i int);
insert into t select generate_series(1, 10);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;
EOF

pg_ctl -s stop

gives me unstable numbers on unpatched master:
 Worker 0:  actual time=0.024..5.814 rows=51076 loops=1
   Buffers: shared hit=226
 Worker 1:  actual time=0.023..5.614 rows=48924 loops=1
   Buffers: shared hit=217
---
 Worker 0:  actual time=0.027..5.130 rows=36612 loops=1
   Buffers: shared hit=162
 Worker 1:  actual time=0.013..5.605 rows=63388 loops=1
   Buffers: shared hit=281
---
 Worker 0:  actual time=0.025..5.447 rows=47460 loops=1
   Buffers: shared hit=210
 Worker 1:  actual time=0.019..5.688 rows=52540 loops=1
   Buffers: shared hit=233

and also with the patch applied:
 Worker 0:  actual time=0.012..4.486 rows=55478 loops=1
   Buffers: shared hit=246
 Worker 1:  actual time=0.014..4.430 rows=44522 loops=1
   Buffers: shared hit=197
---
 Worker 0:  actual time=0.013..4.269 rows=55822 loops=1
   Buffers: shared hit=247
 Worker 1:  actual time=0.017..4.238 rows=44178 loops=1
   Buffers: shared hit=196
---
 Worker 0:  actual time=0.014..4.974 rows=50624 loops=1
   Buffers: shared hit=224
 Worker 1:  actual time=0.016..4.932 rows=49376 loops=1
   Buffers: shared hit=219
---
 Worker 0:  actual time=0.012..5.459 rows=65648 loops=1
   Buffers: shared hit=291
 Worker 1:  actual time=0.022..5.109 rows=34352 loops=1
   Buffers: shared hit=152

Please correct me if I'm doing something wrong.

Best regards,
Alexander




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-08-26 Thread Alexander Lakhin

Hello Peter,

22.04.2024 20:59, Peter Geoghegan wrote:


Pushed a fix like that just now. 


I'm sorry to bother you again, but I've come across another assertion
failure. Please try the following query (I use a clean "postgres" database,
just after initdb):
EXPLAIN SELECT conname
  FROM pg_constraint WHERE conname IN ('pkey', 'ID')
  ORDER BY conname DESC;

SELECT conname
  FROM pg_constraint WHERE conname IN ('pkey', 'ID')
  ORDER BY conname DESC;

It fails for me as below:
 QUERY PLAN

 Index Only Scan Backward using pg_constraint_conname_nsp_index on 
pg_constraint  (cost=0.14..4.18 rows=2 width=64)
   Index Cond: (conname = ANY ('{pkey,ID}'::name[]))
(2 rows)

server closed the connection unexpectedly
...
with the stack trace:
...
#5  0x55a49f81148d in ExceptionalCondition (conditionName=0x55a49f8bb540 
"ItemIdHasStorage(itemId)",
    fileName=0x55a49f8bb4a8 "../../../../src/include/storage/bufpage.h", 
lineNumber=355) at assert.c:66
#6  0x55a49f0f2ddd in PageGetItem (page=0x7f97cbf17000 "", 
itemId=0x7f97cbf2f064)
    at ../../../../src/include/storage/bufpage.h:355
#7  0x55a49f0f9367 in _bt_checkkeys_look_ahead (scan=0x55a4a0ac4548, 
pstate=0x7ffd1a103670, tupnatts=2,
    tupdesc=0x7f97cb5d7be8) at nbtutils.c:4105
#8  0x55a49f0f8ac3 in _bt_checkkeys (scan=0x55a4a0ac4548, 
pstate=0x7ffd1a103670, arrayKeys=true,
    tuple=0x7f97cbf18890, tupnatts=2) at nbtutils.c:3612
#9  0x55a49f0ebb4b in _bt_readpage (scan=0x55a4a0ac4548, 
dir=BackwardScanDirection, offnum=20, firstPage=true)
    at nbtsearch.c:1863
...
(gdb) f 7
#7  0x55a49f0f9367 in _bt_checkkeys_look_ahead (scan=0x55a4a0ac4548, 
pstate=0x7ffd1a103670, tupnatts=2,
    tupdesc=0x7f97cb5d7be8) at nbtutils.c:4105
4105    ahead = (IndexTuple) PageGetItem(pstate->page,
(gdb) p aheadoffnum
$1 = 24596
(gdb) p pstate->offnum
$2 = 20
(gdb) p pstate->targetdistance
$3 = -24576

Best regards,
Alexander




Re: configure failures on chipmunk

2024-08-25 Thread Alexander Lakhin

Hello Heikki,

21.08.2024 09:29, Heikki Linnakangas wrote:
Yeah, I've been slowly upgrading it to a new debian/raspbian version, and "ccache" started segfaulting, not sure why. 
I compiled ccache from sources, and now it seems to work when I run it on its own. Not sure why the buildfarm run 
still failed.




Could you also take a look at the kerberos setup on chipmunk? Now that
chipmunk goes beyond configure (on HEAD and REL_17_STABLE), it fails on the
kerberosCheck stage, with the following error:
Can't exec "/usr/sbin/kdb5_util": No such file or directory at 
/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/kerberos/../../../src/test/perl/PostgreSQL/Test/Utils.pm line 349.


Best regards,
Alexander




Re: type cache cleanup improvements

2024-08-25 Thread Alexander Lakhin

Hello Alexander,

22.08.2024 19:52, Alexander Korotkov wrote:

If no objections, I'm planning to push this after reverting PARTITION
SPLIT/MERGE.



Please try to perform `make check` against a CLOBBER_CACHE_ALWAYS build.
trilobite failed it:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=trilobite&dt=2024-08-25%2005%3A22%3A07

and I'm observing the same locally:
...
#5  0x5636d37555f8 in ExceptionalCondition (conditionName=0x5636d39b1940 
"found",
    fileName=0x5636d39b1308 "typcache.c", lineNumber=3077) at assert.c:66
#6  0x5636d37554a4 in delete_rel_type_cache_if_needed 
(typentry=0x5636d41d5d10) at typcache.c:3077
#7  0x5636d3754063 in InvalidateCompositeTypeCacheEntry 
(typentry=0x5636d41d5d10) at typcache.c:2355
#8  0x5636d37541d3 in TypeCacheRelCallback (arg=0, relid=0) at 
typcache.c:2441
...

(gdb) f 6
#6  0x5636d37554a4 in delete_rel_type_cache_if_needed 
(typentry=0x5636d41d5d10) at typcache.c:3077
3077    Assert(found);
(gdb) p found
$1 = false

(This Assert was introduced by c14d4acb8.)

Best regards,
Alexander




Re: DELETE PENDING strikes back, via pg_ctl stop/start

2024-08-23 Thread Alexander Lakhin

21.08.2024 13:00, Alexander Lakhin wrote:

As a recent failure, produced by drongo [1], shows, pg_ctl stop/start
sequence may break on Windows due to the transient DELETE PENDING state of
postmaster.pid.



lorikeet didn't lose its chance to add two cents to the conversation and
failed on "pg_ctl stop" [1]:
waiting for server to shut downpg_ctl: could not open PID file 
"data-C/postmaster.pid": Permission denied

I find it less interesting, because Cygwin-targeted code doesn't try to
handle the DELETE PENDING state at all.

I've made a simple test (see attached), which merely executes stop/start
in a loop, and observed that running 10 test jobs in parallel is enough to
get:
### Stopping node "node" using mode fast
# Running: pg_ctl -D .../tmp_check/t_099_pg_ctl_stop+start_node_data/pgdata -m 
fast stop
waiting for server to shut downpg_ctl: could not open PID file 
".../tmp_check/t_099_pg_ctl_stop+start_node_data/pgdata/postmaster.pid": Permission denied

# pg_ctl stop failed: 256

or
# Running: pg_ctl -D .../tmp_check/t_099_pg_ctl_stop+start_node_data/pgdata -m 
fast stop
waiting for server to shut downpg_ctl: could not open PID file 
".../tmp_check/t_099_pg_ctl_stop+start_node_data/pgdata/postmaster.pid": Device or resource busy

# pg_ctl stop failed: 256

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2024-08-22%2009%3A52%3A46

Best regards,
Alexander

099_pg_ctl_stop+start.pl
Description: Perl program


RISC-V animals sporadically produce weird memory-related failures

2024-08-22 Thread Alexander Lakhin

Hello hackers,

While investigating a recent copperhead failure [1] with the following
diagnostics:
2024-08-20 20:56:47.318 CEST [2179731:95] LOG:  server process (PID 2184722) 
was terminated by signal 11: Segmentation fault
2024-08-20 20:56:47.318 CEST [2179731:96] DETAIL:  Failed process was running: COPY hash_f8_heap FROM 
'/home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/data/hash.data';


Core was generated by `postgres: pgbf regression [local] COPY   
 '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x002ac8e62674 in heap_multi_insert (relation=0x3f9525c890, slots=0x2ae68a5b30, ntuples=, 
cid=, options=, bistate=0x2ae6891c18) at heapam.c:2296

2296            tuple->t_tableOid = slots[i]->tts_tableOid;
#0  0x002ac8e62674 in heap_multi_insert (relation=0x3f9525c890, slots=0x2ae68a5b30, ntuples=, 
cid=, options=, bistate=0x2ae6891c18) at heapam.c:2296
#1  0x002ac8f41656 in table_multi_insert (bistate=, options=, cid=, 
nslots=1000, slots=0x2ae68a5b30, rel=) at ../../../src/include/access/tableam.h:1460
#2  CopyMultiInsertBufferFlush (miinfo=miinfo@entry=0x3ff87bceb0, buffer=0x2ae68a5b30, 
processed=processed@entry=0x3ff87bce90) at copyfrom.c:415
#3  0x002ac8f41f6c in CopyMultiInsertInfoFlush (processed=0x3ff87bce90, curr_rri=0x2ae67eacf8, miinfo=0x3ff87bceb0) 
at copyfrom.c:532

#4  CopyFrom (cstate=cstate@entry=0x2ae6897fc0) at copyfrom.c:1242
...
$1 = {si_signo = 11,  ... _sigfault = {si_addr = 0x2ae600cbcc}, ...

I discovered a similar-looking failure, [2]:
2023-02-11 18:33:09.222 CET [2591215:73] LOG:  server process (PID 2596066) was 
terminated by signal 11: Segmentation fault
2023-02-11 18:33:09.222 CET [2591215:74] DETAIL:  Failed process was running: COPY bt_i4_heap FROM 
'/home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/data/desc.data';


Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x002adc9bc61a in heap_multi_insert (relation=0x3fa3bd53a8, slots=0x2b098a13c0, ntuples=, 
cid=, options=, bistate=0x2b097eda10) at heapam.c:2095

2095            tuple->t_tableOid = slots[i]->tts_tableOid;

But then I also found different failures on copperhead, all looking like
memory-related anomalies:
[3]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  fixempties (f=0x0, nfa=0x2b02a59410) at regc_nfa.c:2246
2246                for (a = inarcsorig[s2->no]; a != NULL; a = a->inchain)

[4]
pgsql.build/src/bin/pg_rewind/tmp_check/log/regress_log_004_pg_xlog_symlink
malloc(): memory corruption (fast)

[5]
2022-11-22 20:22:48.907 CET [1364156:4] LOG:  server process (PID 1364221) was 
terminated by signal 11: Segmentation fault
2022-11-22 20:22:48.907 CET [1364156:5] DETAIL:  Failed process was running: BASE_BACKUP LABEL 'pg_basebackup base 
backup' PROGRESS NOWAIT  TABLESPACE_MAP  MANIFEST 'yes'


[6]
psql exited with signal 11 (core dumped): '' while running 'psql -XAtq -d port=60743 host=/tmp/zHq9Kzn2b5 
dbname='postgres' -f - -v ON_ERROR_STOP=1' at 
/home/pgbf/buildroot/REL_14_STABLE/pgsql.build/contrib/bloom/../../src/test/perl/PostgresNode.pm line 1855.


[7]
- locktype | classid | objid | objsubid | mode  | granted
+ locktype | classid | objid | objsubid | mode  | gr_nted
(the most mysterious case)

[8]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  GetMemoryChunkContext (pointer=0x2b21bca1f8) at 
../../../../src/include/utils/memutils.h:128
128        context = *(MemoryContext *) (((char *) pointer) - sizeof(void *));
...
$1 = {si_signo = 11, ... _sigfault = {si_addr = 0x2b21bca1f0}, ...

[9]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  fixempties (f=0x0, nfa=0x2ac0bf4c60) at regc_nfa.c:2246
2246                for (a = inarcsorig[s2->no]; a != NULL; a = a->inchain)


Moreover, the other RISC-V animal, boomslang, produced weird failures too:
[10]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x002ae6b50abe in ExecInterpExpr (state=0x2b20ca0040, econtext=0x2b20c9fba8, isnull=) at 
execExprInterp.c:678

678                resultslot->tts_values[resultnum] = state->resvalue;

[11]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x002addf22728 in ExecInterpExpr (state=0x2ae0af8848, econtext=0x2ae0b16028, isnull=) at 
execExprInterp.c:666

666                resultslot->tts_values[resultnum] = 
scanslot->tts_values[attnum];

[12]
INSERT INTO ftable SELECT * FROM generate_series(1, 7) i;

Core was generated by `postgres: buildfarm contrib_regression_postgres_fdw 
[local] INS'.
Program terminated with signal SIGABRT, Aborted.

As far as I can see, these animals run on Debian 10 with the kernel
version 5.15.5-2~bpo11+1 (2022-01-10), but RISC-V was declared an
official Debian architecture on 2023-07-23 [14]. So maybe the OS
version installed is not stable enough for testing...
(I've tried running the regression tests on a RISC-V machine emulated with
qemu, running Debian trixie, kernel version 6.8.

DELETE PENDING strikes back, via pg_ctl stop/start

2024-08-21 Thread Alexander Lakhin

Hello hackers,

As a recent failure, produced by drongo [1], shows, pg_ctl stop/start
sequence may break on Windows due to the transient DELETE PENDING state of
postmaster.pid.

Please look at the excerpt from the failure log:
...
pg_createsubscriber: stopping the subscriber
2024-08-19 18:02:47.608 UTC [6988:4] LOG:  received fast shutdown request
2024-08-19 18:02:47.608 UTC [6988:5] LOG:  aborting any active transactions
2024-08-19 18:02:47.612 UTC [5884:2] FATAL:  terminating walreceiver process 
due to administrator command
2024-08-19 18:02:47.705 UTC [7036:1] LOG:  shutting down
pg_createsubscriber: server was stopped
### the server instance (1) had only emitted "shutting down" so far, but pg_ctl
### considered it stopped and returned 0 to pg_createsubscriber
[18:02:47.900](2.828s) ok 29 - run pg_createsubscriber without --databases
...
pg_createsubscriber: starting the standby with command-line options
pg_createsubscriber: pg_ctl command is: ...
2024-08-19 18:02:48.163 UTC [5284:1] FATAL:  could not create lock file 
"postmaster.pid": File exists
pg_createsubscriber: server was started
pg_createsubscriber: checking settings on subscriber
### pg_createsubscriber attempts to start new server instance (2), but
### it fails due to "postmaster.pid" still found on disk
2024-08-19 18:02:48.484 UTC [6988:6] LOG:  database system is shut down
### the server instance (1) is finally stopped and postmaster.pid unlinked

With extra debug logging and the ntries limit decreased to 10 (in
CreateLockFile()), I reproduced the failure easily (when running 20 tests
in parallel) and got additional information (see attached).

IIUC, the issue is caused by inconsistent checks for postmaster.pid
existence:
"pg_ctl stop" ... -> get_pgpid() calls fopen(pid_file, "r"),
 which fails with ENOENT for the DELETE_PENDING state (see
 pgwin32_open_handle()).

"pg_ctl start" ... -> CreateLockFile() calls
    fd = open(filename, O_RDWR | O_CREAT | O_EXCL, pg_file_create_mode);
which fails with EEXIST for the same state of postmaster.pid.
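
To make the divergence easier to see, here is a small standalone sketch using
plain POSIX calls (the file name and messages are illustrative); on Windows,
the DELETE_PENDING state makes both branches fire for the very same on-disk
state, because the open() wrapper maps it to ENOENT without O_CREAT but to
EEXIST with it:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *pid_file = "postmaster.pid";
	FILE	   *pidf;
	int			fd;

	/* the "pg_ctl stop" style check: ENOENT means "server already stopped" */
	pidf = fopen(pid_file, "r");
	if (pidf == NULL && errno == ENOENT)
		printf("stop path: no pid file, server considered stopped\n");
	else if (pidf != NULL)
		fclose(pidf);

	/* the "pg_ctl start" style check (CreateLockFile): EEXIST means "cannot start" */
	fd = open(pid_file, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0 && errno == EEXIST)
		printf("start path: lock file exists, startup refused\n");
	else if (fd >= 0)
		close(fd);

	return 0;
}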

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-19%2017%3A32%3A54

Best regards,
Alexander

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 537d92c0cf..570d2d2557 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1219,11 +1219,13 @@ CreateLockFile(const char *filename, bool amPostmaster,
 		/*
 		 * Couldn't create the pid file. Probably it already exists.
 		 */
-		if ((errno != EEXIST && errno != EACCES) || ntries > 100)
+		if ((errno != EEXIST && errno != EACCES) || ntries > 10)
+{
 			ereport(FATAL,
 	(errcode_for_file_access(),
-	 errmsg("could not create lock file \"%s\": %m",
-			filename)));
+	 errmsg("could not create lock file (ntries: %d) \"%s\": %m",
+			ntries, filename)));
+}
 
 		/*
 		 * Read the file to get the old owner's PID.  Note race condition
diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index 6295783cde..583ed7d449 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -1479,7 +1479,7 @@ stop_standby_server(const char *datadir)
 	char	   *pg_ctl_cmd;
 	int			rc;
 
-	pg_ctl_cmd = psprintf("\"%s\" stop -D \"%s\" -s", pg_ctl_path,
+	pg_ctl_cmd = psprintf("\"%s\" stop -D \"%s\" ", pg_ctl_path,
 		  datadir);
 	pg_log_debug("pg_ctl command is: %s", pg_ctl_cmd);
 	rc = system(pg_ctl_cmd);
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index e7e878c22f..04787c6aec 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -276,6 +276,9 @@ get_pgpid(bool is_status_request)
 	pidf = fopen(pid_file, "r");
 	if (pidf == NULL)
 	{
+int en = errno;
+write_stderr("!!!get_pgpid| pid_file: %s, pidf: %p, errno: %d\n", pid_file, pidf, en);
+errno = en;
 		/* No pid file, not an error on startup */
 		if (errno == ENOENT)
 			return 0;
@@ -723,7 +726,10 @@ wait_for_postmaster_stop(void)
 		pid_t		pid;
 
 		if ((pid = get_pgpid(false)) == 0)
+{
+write_stderr("!!!wait_for_postmaster_stop| pid: %d\n", pid);
 			return true;		/* pid file is gone */
+}
 
 		if (kill(pid, 0) != 0)
 		{
diff --git a/src/port/open.c b/src/port/open.c
index 13e49af8d4..f7cbc819c0 100644
--- a/src/port/open.c
+++ b/src/port/open.c
@@ -138,8 +138,14 @@ pgwin32_open_handle(const char *fileName, int fileFlags, bool backup_semantics)
 		 * invisible.  With O_CREAT, we have no choice but to report that
 		 * there's a file in the way (which wouldn't happen on Unix).
 		 */
+DWORD ntstat = pg_RtlGetLastNtStatus();
+if (strstr(fileName, "postmaster.pid") != NULL)
+{
+fprintf(stderr, "!!!pgwin32_open_handle| fileFlags: %X, err: %d, ntstatus: %X\n", fileFlags, err, ntstat);
+}
+
 		if (err == ERROR_ACCESS_DENIED &&
-			pg_RtlGetLastNtStatus() == STATUS_DELETE_PENDING)
+			ntstat == STATUS_DELETE_PENDING)
 		{
 			if (fileFlags & O_CREAT)
 err = ERROR_FILE_EXISTS;
@@

Re: Remove dependence on integer wrapping

2024-08-21 Thread Alexander Lakhin

Hello Nathan,

21.08.2024 00:21, Nathan Bossart wrote:

I've combined all the current proposed changes into one patch.  I've also
introduced signed versions of the negation functions into int.h to avoid
relying on multiplication.



Thank you for taking care of this!

I'd like to add some info to show how big the iceberg is.

Beside other trap-triggered places in date/time conversion functions, I
also discovered:
1)
CREATE TABLE jt(j jsonb); INSERT INTO jt VALUES('[]'::jsonb);
UPDATE jt SET j[0][-2147483648] = '0';

#4  0x7f15ab00d7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x5570113b2591 in __addvsi3 ()
#6  0x5570111d55a0 in push_null_elements (ps=0x7fff37385fb8, 
num=-2147483648) at jsonfuncs.c:1707
#7  0x5570111d5749 in push_path (st=0x7fff37385fb8, level=0, 
path_elems=0x55701300c880, path_nulls=0x55701300d520,
    path_len=2, newval=0x7fff37386030) at jsonfuncs.c:1770

The "problematic" code:
    while (num-- > 0)
    *ps = 0;
looks innocent to me, but it is not good enough for -ftrapv.
I think there could be other similar places and this raises two questions:
can they be reached with INT_MIN and what to do if so?
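
For illustration, here is a self-contained snippet (assuming gcc or clang with
-ftrapv) that shows the same effect outside the server: the loop body never
executes for num = INT_MIN, but the post-decrement still computes INT_MIN - 1,
which overflows and traps:

#include <limits.h>
#include <stdio.h>

int main(void)
{
	int		num = INT_MIN;

	while (num-- > 0)
		;						/* body never reached */

	printf("num = %d\n", num);	/* not reached when built with -ftrapv */
	return 0;
}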

By the way, the same can be seen with CC=clang CPPFLAGS="-ftrapv". Please
look at the code produced by both compilers for x86_64:
https://godbolt.org/z/vjszjf4b3
(clang generates ud1, while gcc uses call __addvsi3)

A side question is: should jsonb subscripting accept negative indexes
when the target array is not initialized yet?

Compare:
CREATE TABLE jt(j jsonb); INSERT INTO jt VALUES('[]'::jsonb);
UPDATE jt SET j[0][-1] = '0';
SELECT * FROM jt;
   j
---
 [[0]]

with
CREATE TABLE jt(j jsonb); INSERT INTO jt VALUES('[[]]'::jsonb);
UPDATE jt SET j[0][-1] = '0';
ERROR:  path element at position 2 is out of range: -1

2)
SELECT x, lag(x, -2147483648) OVER (ORDER BY x) FROM (SELECT 1) x;

#4  0x7fa7d00f47f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x5623a7336851 in __negvsi2 ()
#6  0x5623a726ae35 in leadlag_common (fcinfo=0x7ffd59cca950, forward=false, 
withoffset=true, withdefault=false)
    at windowfuncs.c:551
#7  0x5623a726af19 in window_lag_with_offset (fcinfo=0x7ffd59cca950) at 
windowfuncs.c:594
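
Presumably such call sites need a guarded negation; here is a minimal
standalone sketch of what such a helper could look like (the name
neg_s32_overflow is illustrative, not necessarily what ended up in int.h):

#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static inline bool
neg_s32_overflow(int32_t a, int32_t *result)
{
	if (a == INT32_MIN)
		return true;			/* -INT32_MIN is not representable */
	*result = -a;
	return false;
}

int main(void)
{
	int32_t		r = 0;

	/* e.g. the lag() offset above: -2147483648 cannot be negated safely */
	printf("overflow: %d\n", neg_s32_overflow(INT32_MIN, &r));
	return 0;
}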

As to the 32-bit Debian case I wrote about before, I use gcc (Debian 12.2.0-14).
Please look at the demo code (and its assembly, produced with
gcc -S -ftrapv t.c) attached:
gcc -Wall -Wextra -fsanitize=signed-integer-overflow -Wstrict-overflow=5 \
 -O0 -ftrapv t.c -o t && ./t
Aborted (core dumped)

#4  0xb762226a in __GI_abort () at ./stdlib/abort.c:79
#5  0x00495077 in __mulvdi3.cold ()
#6  0x00495347 in pg_mul_s64_overflow ()

(It looks like -Wstrict-overflow can't help with the static analysis
desired in such cases.)

Moreover, `make check` failed for me with -ftrapv on aarch64 (using gcc 8.3)
as follows:
#1  0x007e1edc48e8 in __GI_abort () at abort.c:79
#2  0x005ee66b71cc in __subvdi3 ()
#3  0x005ee6560e24 in int8gcd_internal (arg1=-9223372036854775808, arg2=1) 
at int8.c:623
#4  0x005ee62f576c in ExecInterpExpr (state=0x5eeaba9d18, econtext=0x5eeaba95f0, 
isnull=)
    at execExprInterp.c:770
...
#13 0x005ee64e5d84 in exec_simple_query (
    query_string=query_string@entry=0x5eeaac7500 "SELECT a, b, gcd(a, b), gcd(a, -b), gcd(b, a), gcd(-b, a)\nFROM 
(VALUES (0::int8, 0::int8),\n", ' ' , "(0::int8, 29893644334::int8),\n", ' ' , 
"(288484263558::int8, 29893644334::int8),\n", ' ' ...) at postgres.c:1284


So I wonder whether enabling -ftrapv can really help us prepare the code
for -fno-wrapv?

Best regards,
Alexander

#include <stdio.h>

typedef char bool;
typedef long long int int64;

static inline bool
pg_mul_s64_overflow(int64 a, int64 b, int64 *result)
{
	return __builtin_mul_overflow(a, b, result);
}

int main()
{
	int64 a = 9223372036854775807L;
	int64 b = 2L;
	int64 c = 0;
	bool r;
	r = pg_mul_s64_overflow(a, b, &c);
	printf("r: %d, c: %lld\n", r, c);
}
.file   "t.c"
.text
.globl  __mulvdi3
.type   pg_mul_s64_overflow, @function
pg_mul_s64_overflow:
.LFB0:
.cfi_startproc
pushl   %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl%esp, %ebp
.cfi_def_cfa_register 5
pushl   %edi
pushl   %esi
pushl   %ebx
subl    $76, %esp
.cfi_offset 7, -12
.cfi_offset 6, -16
.cfi_offset 3, -20
call    __x86.get_pc_thunk.ax
addl    $_GLOBAL_OFFSET_TABLE_, %eax
movl    %eax, -52(%ebp)
movl    8(%ebp), %eax
movl    %eax, -32(%ebp)
movl    12(%ebp), %eax
movl    %eax, -28(%ebp)
movl    16(%ebp), %eax
movl    %eax, -40(%ebp)
movl    20(%ebp), %eax
movl    %eax, -36(%ebp)
movl    $0, -64(%ebp)
movl    $0, -60(%ebp)
movl    -32(%ebp), %eax
movl    -28(%ebp), %edx
movl    %edx, %eax
movl    %eax, %edx
sarl    $31, %edx
movl    %eax, -48(%ebp)
movl    %edx,

Re: why is pg_upgrade's regression run so slow?

2024-08-19 Thread Alexander Lakhin

Hello Andrew,

29.07.2024 13:54, Andrew Dunstan wrote:


On 2024-07-29 Mo 4:00 AM, Alexander Lakhin wrote:


And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.




Well, that last fact explains the discrepancy I originally complained about. How the heck did that happen? It looks 
like we just ignored its use in Makefile.global.in :-(


Please look at the attached patch (partially based on ideas from [1]) for
meson.build, that aligns it with `make` in regard to use of TEMP_CONFIG.

Maybe it could be implemented via a separate meson option, but that would
also require modifying at least the buildfarm client...

[1] 
https://www.postgresql.org/message-id/CAN55FZ304Kp%2B510-iU5-Nx6hh32ny9jgGn%2BOB5uqPExEMK1gQQ%40mail.gmail.com

Best regards,
Alexanderdiff --git a/meson.build b/meson.build
index cd711c6d01..90f4e6da04 100644
--- a/meson.build
+++ b/meson.build
@@ -3388,6 +3388,12 @@ else
   runningcheck = false
 endif
 
+temp_config_env = run_command([
+  python,
+  '-c',
+  'import os; print(os.getenv("TEMP_CONFIG", ""))'
+], check: true).stdout().strip()
+
 testwrap = files('src/tools/testwrap')
 
 foreach test_dir : tests
@@ -3451,6 +3457,10 @@ foreach test_dir : tests
   env = test_env
   env.prepend('PATH', temp_install_bindir, test_dir['bd'])
 
+  if temp_config_env != ''
+test_command_base += [ '--temp-config=' + temp_config_env ]
+  endif
+
   test_kwargs = {
 'protocol': 'tap',
 'priority': 10,


Re: The xversion-upgrade test fails to stop server

2024-08-18 Thread Alexander Lakhin

Hello Andrew,

04.06.2024 14:56, Andrew Dunstan wrote:




but I can't see ../snapshot_too_old/output_iso/regression.diff and
.../snapshot_too_old/output_iso/log/postmaster.log in the log.




will do.



I've discovered a couple of other failures where the interesting log files
are not saved.

[1] has only inst/logfile saved and that's not enough for TAP tests in
src/test/modules.

I've emulated the failure (not trying to guess the real cause) with:
--- a/src/test/modules/test_custom_rmgrs/Makefile
+++ b/src/test/modules/test_custom_rmgrs/Makefile
@@ -14,0 +15,4 @@ TAP_TESTS = 1
+remove:
+    make uninstall
+install: remove

and saw mostly the same with the buildfarm client REL_17.

I've tried the following addition to run_build.pl:
@@ -2194,6 +2194,8 @@ sub make_testmodules_install_check
    my @logs = glob("$pgsql/src/test/modules/*/regression.diffs");
    push(@logs, "inst/logfile");
    $log->add_log($_) foreach (@logs);
+   @logs = glob("$pgsql/src/test/modules/*/tmp_check/log/*");
+   $log->add_log($_) foreach (@logs);
and it works as expected for me.

The output of another failure ([2]) contains only:
\342\226\266 1/1 + partition_prune  3736 ms FAIL
but no regression.diffs
(Fortunately, in this particular case I found "FATAL:  incorrect checksum
in control file" in inst/logfile, so that can explain the failure.)

Probably, run_build.pl needs something like:
@@ -1848,7 +1848,7 @@ sub make_install_check
    @checklog = run_log("cd $pgsql/src/test/regress && $make 
$chktarget");
    }
    my $status = $? >> 8;
-   my @logfiles = ("$pgsql/src/test/regress/regression.diffs", 
"inst/logfile");
+   my @logfiles = ("$pgsql/src/test/regress/regression.diffs", 
"$pgsql/testrun/regress-running/regress/regression.diffs", "inst/logfile");



A similar failure has occurred today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2024-06-08%2001%3A41%3A41

So maybe it would make sense to increase default PGCTLTIMEOUT for
buildfarm clients, say, to 180 seconds?



Sure. For now I have added it to the config on crake, but we can make it a 
default.


By the way, opaleye failed once again (with 120 seconds timeout):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=opaleye&dt=2024-08-13%2002%3A04%3A07

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2024-08-06%2014%3A56%3A39
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-08-17%2001%3A12%3A50

Best regards,
Alexander




Re: Remove dependence on integer wrapping

2024-08-18 Thread Alexander Lakhin

18.08.2024 00:52, Joseph Koshakow wrote:

The largest possible (theoretical) value for `nbuckets` is
`1073741824`, the largest power of 2 that fits into an `int`. So, the
largest possible value for `nbuckets << 1` is `2147483648`. This can
fully fit in a `uint32`, so the simple fix for this case is to cast
`nbuckets` to a `uint32` before shifting. I've attached this fix,
Alexander if you have time I would appreciate if you were able to test
it.



Yes, I've tested v25-0002-*.patch and can confirm that this fix works
as well.
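
For reference, here is a standalone illustration of the overflow being avoided
(assuming a 32-bit int; the variable names are illustrative): shifting the
signed value would overflow and trap under -ftrapv, while casting to uint32
first keeps the doubled value representable:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int32_t		nbuckets = 1073741824;	/* 2^30, the value mentioned above */
	uint32_t	doubled;

	/* nbuckets << 1 on the signed type would overflow (and trap with -ftrapv) */
	doubled = (uint32_t) nbuckets << 1;

	printf("%u\n", (unsigned) doubled); /* 2147483648 */
	return 0;
}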

Best regards,
Alexander




Re: Refactoring postmaster's code to cleanup after child exit

2024-08-18 Thread Alexander Lakhin

Hello Heikki,

10.08.2024 00:13, Heikki Linnakangas wrote:


Committed the patches up to and including this one, with tiny comment changes.


I've noticed that on the current HEAD server.log contains binary data
(an invalid process name) after a child crash. For example, while playing
with -ftapv, I've got:
SELECT to_date('2024 613566758 1', 'IYYY IW ID');
server closed the connection unexpectedly

grep -a 'was terminated' server.log
2024-08-18 07:07:06.482 UTC|||66c19d96.3482f6|LOG:  `�!x� (PID 3441407) was 
terminated by signal 6: Aborted

It looks like this was introduced by commit 28a520c0b (IIUC, namebuf in
CleanupBackend() may stay uninitialized in some code paths).

Best regards,
Alexander




Re: Remove dependence on integer wrapping

2024-08-17 Thread Alexander Lakhin

Hello Joe,

17.08.2024 22:16, Joseph Koshakow wrote:

Hi,

I wanted to take this opportunity to provide a brief summary of
outstanding work.

> Also there are several trap-producing cases with date types:
> SELECT to_date('1', 'CC');
> SELECT to_timestamp('10,999', 'Y,YYY');
> SELECT make_date(-2147483648, 1, 1);

This is resolved with Matthew's patches, which I've rebased, squashed
and attached to this email. They still require a review.



I filed a separate bug report about date/time conversion issues
yesterday. Maybe it was excessive, but it also demonstrates other
problematic cases:
https://www.postgresql.org/message-id/18585-db646741dd649abd%40postgresql.org

Best regards,
Alexander




Re: Remove dependence on integer wrapping

2024-08-16 Thread Alexander Lakhin

Hello Nathan and Joe,

16.08.2024 19:52, Nathan Bossart wrote:

On Thu, Aug 15, 2024 at 10:49:46PM -0400, Joseph Koshakow wrote:

This updated version LGTM, I agree it's a good use of pg_abs_s32().

Committed.


Thank you for working on that issue!

I've tried `make check` with CC=gcc-13 CPPFLAGS="-O0 -ftrapv" and got a
server crash:
2024-08-16 17:14:36.102 UTC postmaster[1982703] LOG:   (PID 1983867) was 
terminated by signal 6: Aborted
2024-08-16 17:14:36.102 UTC postmaster[1982703] DETAIL:  Failed process was running: 
select '[]'::jsonb ->> -2147483648;
with the stack trace
...
#5  0x556aec224a11 in __negvsi2 ()
#6  0x556aec046238 in jsonb_array_element_text (fcinfo=0x556aedd70240) at 
jsonfuncs.c:993
#7  0x556aebc90b68 in ExecInterpExpr (state=0x556aedd70160, 
econtext=0x556aedd706a0, isnull=0x7ffdf82211e4)
    at execExprInterp.c:765
...
(gdb) f 6
#6  0x556aec046238 in jsonb_array_element_text (fcinfo=0x556aedd70240) at 
jsonfuncs.c:993
993 if (-element > nelements)
(gdb) p element
$1 = -2147483648

So it looks like jsonb_array_element_text() still needs the same
treatment as jsonb_array_element().
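
Presumably the same pg_abs_s32()-based rewrite would do; here is a standalone
illustration of the pattern (abs_s32() below is just a stand-in mirroring what
such a helper is expected to do, and the values mimic the failing query above).
It runs cleanly even with -ftrapv because no signed negation is performed:

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

static inline uint32_t
abs_s32(int32_t x)				/* stand-in for pg_abs_s32() */
{
	return (x < 0) ? 0 - (uint32_t) x : (uint32_t) x;
}

int main(void)
{
	int32_t		element = INT32_MIN;	/* as in: '[]'::jsonb ->> -2147483648 */
	uint32_t	nelements = 0;			/* empty array */

	/* "if (-element > nelements)" would trap here under -ftrapv */
	if (abs_s32(element) > nelements)
		printf("subscript out of range, returning NULL\n");

	return 0;
}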

Moreover, I tried to use "-ftrapv" on 32-bit Debian and came across
another failure:
select '9223372036854775807'::int8 * 2147483648::int8;
server closed the connection unexpectedly
...
#4  0xb76a in __GI_abort () at ./stdlib/abort.c:79
#5  0x004cb2e1 in __mulvdi3.cold ()
#6  0x00abe7ab in pg_mul_s64_overflow (a=9223372036854775807, b=2147483648, 
result=0xbff1da68)
    at ../../../../src/include/common/int.h:264
#7  0x00abfbff in int8mul (fcinfo=0x14d9d04) at int8.c:496
#8  0x00782675 in ExecInterpExpr (state=0x14d9c4c, econtext=0x14da15c, 
isnull=0xbff1dc3f) at execExprInterp.c:765

Whilst
select '9223372036854775807'::int8 * 2147483647::int8;
emits
ERROR:  bigint out of range

I've also discovered another trap-triggering case for a 64-bit platform:
select 1 union all select 1 union all select 1 union all select 1 union all
select 1 union all select 1 union all select 1 union all select 1 union all
select 1 union all select 1 union all select 1 union all select 1 union all
select 1 union all select 1 union all select 1 union all select 1 union all
select 1 union all select 1 union all select 1 union all select 1 union all
select 1 union all select 1 union all select 1 union all select 1 union all
select 1 union all select 1 union all select 1 union all select 1 union all
select 1 union all select 1 union all select 1;

server closed the connection unexpectedly
...
#5  0x5576cfb1c9f3 in __negvdi2 ()
#6  0x5576cf627c68 in bms_singleton_member (a=0x5576d09f7fb0) at 
bitmapset.c:691
#7  0x5576cf72be0f in fix_append_rel_relids (root=0x5576d09df198, varno=31, 
subrelids=0x5576d09f7fb0)
    at prepjointree.c:3830
#8  0x5576cf7278c2 in pull_up_simple_subquery (root=0x5576d09df198, 
jtnode=0x5576d09f7470, rte=0x5576d09de300,
    lowest_outer_join=0x0, containing_appendrel=0x5576d09f7368) at 
prepjointree.c:1277
...
(gdb) f 6
#6  0x5576cf627c68 in bms_singleton_member (a=0x5576d09f7fb0) at 
bitmapset.c:691
691 if (result >= 0 || HAS_MULTIPLE_ONES(w))
(gdb) p/x w
$1 = 0x8000
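
If I read this correctly, the rightmost-bit isolation behind
HAS_MULTIPLE_ONES() relies on the usual x & -x idiom (hence the __negvdi2
frame), which overflows when only the sign bit of the 64-bit word is set.
A standalone illustration (assuming gcc or clang with -ftrapv):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int64_t		w = INT64_MIN;			/* word with only the top bit set */
	int64_t		rightmost;

	rightmost = w & (-w);				/* -w == -INT64_MIN traps here */

	printf("%lld\n", (long long) rightmost);	/* not reached with -ftrapv */
	return 0;
}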

Best regards,
Alexander




Re: Remove dependence on integer wrapping

2024-08-15 Thread Alexander Lakhin

Hello Joe,

05.08.2024 02:55, Joseph Koshakow wrote:


On Fri, Jun 14, 2024 at 8:00 AM Alexander Lakhin  wrote:
>
>    And the most interesting case to me:
>    SET temp_buffers TO 10;
>
>    CREATE TEMP TABLE t(i int PRIMARY KEY);
>    INSERT INTO t VALUES(1);
>
...
Alex, are you able to get a full stack trace for this panic? I'm unable
to reproduce this because I don't have enough memory in my system. I've
tried reducing `BLCKSZ` to 1024, which is the lowest value allowed per
my understanding, and I still don't have enough memory.


Yes, please take a look at it (sorry for the late reply):

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140438687430464) 
at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140438687430464) at 
./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140438687430464, signo=signo@entry=6) at 
./nptl/pthread_kill.c:89
#3  0x7fba70025476 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/posix/raise.c:26
#4  0x7fba7000b7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x563945aed511 in __addvsi3 ()
#6  0x563945a6c106 in init_htab (hashp=0x563947700980, nelem=10) at 
dynahash.c:720
#7  0x563945a6bd22 in hash_create (tabname=0x563945c591d9 "Local Buffer Lookup Table", nelem=10, 
info=0x7ffd4d394620, flags=40) at dynahash.c:567

#8  0x5639457f2760 in el () at localbuf.c:635
#9  0x5639457f19e3 in ExtendBufferedRelLocal (bmr=..., fork=MAIN_FORKNUM, flags=8, extend_by=1, 
extend_upto=4294967295, buffers=0x7ffd4d3948e0, extended_by=0x7ffd4d3947ac) at localbuf.c:326
#10 0x5639457e8851 in ExtendBufferedRelCommon (bmr=..., fork=MAIN_FORKNUM, strategy=0x0, flags=8, extend_by=1, 
extend_upto=4294967295, buffers=0x7ffd4d3948e0, extended_by=0x7ffd4d39488c) at bufmgr.c:2175
#11 0x5639457e6850 in ExtendBufferedRelBy (bmr=..., fork=MAIN_FORKNUM, strategy=0x0, flags=8, extend_by=1, 
buffers=0x7ffd4d3948e0, extended_by=0x7ffd4d39488c) at bufmgr.c:923
#12 0x5639452d8ae6 in RelationAddBlocks (relation=0x7fba650abd78, bistate=0x0, num_pages=1, use_fsm=true, 
did_unlock=0x7ffd4d394a3d) at hio.c:341
#13 0x5639452d944a in RelationGetBufferForTuple (relation=0x7fba650abd78, len=32, otherBuffer=0, options=0, 
bistate=0x0, vmbuffer=0x7ffd4d394ac4, vmbuffer_other=0x0, num_pages=1) at hio.c:767
#14 0x5639452be996 in heap_insert (relation=0x7fba650abd78, tup=0x5639476ecfc0, cid=0, options=0, bistate=0x0) at 
heapam.c:2019
#15 0x5639452cee84 in heapam_tuple_insert (relation=0x7fba650abd78, slot=0x5639476ecf30, cid=0, options=0, 
bistate=0x0) at heapam_handler.c:251
#16 0x5639455b3b07 in table_tuple_insert (rel=0x7fba650abd78, slot=0x5639476ecf30, cid=0, options=0, bistate=0x0) at 
../../../src/include/access/tableam.h:1405
#17 0x5639455b5c60 in ExecInsert (context=0x7ffd4d394d20, resultRelInfo=0x5639476ec390, slot=0x5639476ecf30, 
canSetTag=true, inserted_tuple=0x0, insert_destrel=0x0) at nodeModifyTable.c:1139

#18 0x5639455ba942 in ExecModifyTable (pstate=0x5639476ec180) at 
nodeModifyTable.c:4077
#19 0x563945575425 in ExecProcNodeFirst (node=0x5639476ec180) at 
execProcnode.c:469
#20 0x563945568095 in ExecProcNode (node=0x5639476ec180) at 
../../../src/include/executor/executor.h:274
#21 0x56394556af65 in ExecutePlan (estate=0x5639476ebf00, planstate=0x5639476ec180, use_parallel_mode=false, 
operation=CMD_INSERT, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x5639476f5470,

    execute_once=true) at execMain.c:1646
#22 0x5639455687e3 in standard_ExecutorRun (queryDesc=0x5639476f3e70, direction=ForwardScanDirection, count=0, 
execute_once=true) at execMain.c:363
#23 0x5639455685b9 in ExecutorRun (queryDesc=0x5639476f3e70, direction=ForwardScanDirection, count=0, 
execute_once=true) at execMain.c:304
#24 0x56394584986e in ProcessQuery (plan=0x5639476f5310, sourceText=0x56394760d610 "INSERT INTO t VALUES(1);", 
params=0x0, queryEnv=0x0, dest=0x5639476f5470, qc=0x7ffd4d395180) at pquery.c:160
#25 0x56394584b445 in PortalRunMulti (portal=0x56394768ab20, isTopLevel=true, setHoldSnapshot=false, 
dest=0x5639476f5470, altdest=0x5639476f5470, qc=0x7ffd4d395180) at pquery.c:1278
#26 0x56394584a93c in PortalRun (portal=0x56394768ab20, count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0x5639476f5470, altdest=0x5639476f5470, qc=0x7ffd4d395180) at pquery.c:791

#27 0x563945842fd9 in exec_simple_query (query_string=0x56394760d610 "INSERT 
INTO t VALUES(1);") at postgres.c:1284
#28 0x563945848536 in PostgresMain (dbname=0x563947644900 "regression", username=0x5639476448e8 "law") at 
postgres.c:4766

#29 0x56394583eb67 in BackendMain (startup_data=0x7ffd4d395404 "", 
startup_data_len=4) at backend_startup.c:107
#30 0x56394574e00e in postmaster_child_launch (child_type=B_BACKEND, startup_

Re: v17 vs v16 performance comparison

2024-08-02 Thread Alexander Lakhin

01.08.2024 06:41, Tom Lane wrote:



But beside that, I've found a separate regression. Bisecting for this 
degradation:
Best pg-src-17--.* worse than pg-src-16--.* by 105.0 percents (356.63 > 
173.96): s64da_tpcds.query95
Average pg-src-17--.* worse than pg-src-16--.* by 105.2 percents (357.79 > 
174.38): s64da_tpcds.query95
pointed at f7816aec2.

I'm not terribly concerned about that.  The nature of planner changes
like that is that some queries will get worse and some better, because
the statistics and cost estimates we're dealing with are not perfect.
It is probably worth drilling down into that test case to understand
where the planner is going wrong, with an eye to future improvements;
but I doubt it's something we need to address for v17.


Please find attached two plans for that query [1].
(I repeated the benchmark for f7816aec2 and f7816aec2~1 five times and
made sure that both plans are stable.)

Meanwhile I've bisected another degradation:
Best pg-src-17--.* worse than pg-src-16--.* by 11.3 percents (7.17 > 6.44): 
job.query6f
and came to the commit b7b0f3f27 again.

[1] 
https://github.com/swarm64/s64da-benchmark-toolkit/blob/master/benchmarks/tpcds/queries/queries_10/95.sql

Best regards,
Alexander

[{
  "Query Text": "-- RNGSEED: 1\n\n-- EXPLAIN (FORMAT JSON)\nwith ws_wh 
as\n(select ws1.ws_order_number,ws1.ws_warehouse_sk wh1,ws2.ws_warehouse_sk 
wh2\n from web_sales ws1,web_sales ws2\n where ws1.ws_order_number = 
ws2.ws_order_number\n   and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)\n 
select\n   count(distinct ws_order_number) as \"order count\"\n  
,sum(ws_ext_ship_cost) as \"total shipping cost\"\n  ,sum(ws_net_profit) as 
\"total net profit\"\nfrom\n   web_sales ws1\n  ,date_dim\n  
,customer_address\n  ,web_site\nwhere\nd_date between '2000-3-01' and\n 
  (cast('2000-3-01' as date) + 60)\nand ws1.ws_ship_date_sk = 
d_date_sk\nand ws1.ws_ship_addr_sk = ca_address_sk\nand ca_state = 'GA'\nand 
ws1.ws_web_site_sk = web_site_sk\nand web_company_name = 'pri'\nand 
ws1.ws_order_number in (select ws_order_number\n
from ws_wh)\nand ws1.ws_order_number in (select wr_order_number\n   
 from web_returns,ws_wh\nwhere 
wr_order_number = ws_wh.ws_order_number)\norder by count(distinct 
ws_order_number)\nlimit 100;\n",
  "Plan": {
"Node Type": "Limit",
"Parallel Aware": false,
"Async Capable": false,
"Startup Cost": 1244768.73,
"Total Cost": 1244768.73,
"Plan Rows": 1,
"Plan Width": 72,
"Actual Startup Time": 277228.286,
"Actual Total Time": 277228.294,
"Actual Rows": 1,
"Actual Loops": 1,
"Output": ["(count(DISTINCT ws1.ws_order_number))", 
"(sum(ws1.ws_ext_ship_cost))", "(sum(ws1.ws_net_profit))"],
"Plans": [
  {
"Node Type": "Hash Join",
"Parent Relationship": "InitPlan",
"Subplan Name": "CTE ws_wh",
"Parallel Aware": false,
"Async Capable": false,
"Join Type": "Inner",
"Startup Cost": 76039.92,
"Total Cost": 443675.28,
"Plan Rows": 17135538,
"Plan Width": 12,
"Actual Startup Time": 380.616,
"Actual Total Time": 3160.748,
"Actual Rows": 13341520,
"Actual Loops": 1,
"Output": ["ws1_1.ws_order_number", "ws1_1.ws_warehouse_sk", 
"ws2.ws_warehouse_sk"],
"Inner Unique": false,
"Hash Cond": "(ws1_1.ws_order_number = ws2.ws_order_number)",
"Join Filter": "(ws1_1.ws_warehouse_sk <> ws2.ws_warehouse_sk)",
"Rows Removed by Join Filter": 4787014,
"Plans": [
  {
"Node Type": "Seq Scan",
"Parent Relationship": "Outer",
"Parallel Aware": false,
"Async Capable": false,
"Relation Name": "web_sales",
"Schema": "public",
"Alias": "ws1_1",
"Startup Cost": 0.00,
"Total Cost": 52382.52,
"Plan Rows": 1441952,
"Plan Width": 8,
"Actual Startup Time": 0.013,
"Actual Total Time": 223.129,
"Actual Rows": 1441948,
"Actual Loops": 1,
"Output": ["ws1_1.ws_order_number", "ws1_1.ws_warehouse_sk"]
  },
  {
"Node Type": "Hash",
"Parent Relationship": "Inner",
"Parallel Aware": false,
"Async Capable": false,
"Startup Cost": 52382.52,
"Total Cost": 52382.52,
"Plan Rows": 1441952,
"Plan Width": 8,
"Actual Startup Time": 379.625,
"Actual Total Time": 379.626,
"Actual Rows": 1441948,
"Actual Loops": 1,
"Output": ["ws2.ws_warehouse_sk", "ws2.ws_order_number"],
"Hash Buckets": 262144,
"Original Hash Buckets": 262144,
"Hash Batches": 16,
"Original Hash Batches": 16,
"Peak Memory Usage": 5602,
"Plans"

Re: rare crash - FailedAssertion snapbuild.c Line: 580

2024-08-01 Thread Alexander Lakhin

Hello Euler,

01.08.2024 21:09, Euler Taveira wrote:

According to this discussion, there isn't a clue about the root cause.
If you have a test case, share it (mainly if you are observing it in
version 16+ that exposes some data which may be useful for analysis).



Please take a look at [1], where I presented a reproducer for apparently
the same issue.

[1] 
https://www.postgresql.org/message-id/b91cf8ef-b5af-5def-ff05-bd67336ef907%40gmail.com

Best regards,
Alexander

Re: Improving tracking/processing of buildfarm test failures

2024-08-01 Thread Alexander Lakhin

02.07.2024 15:00, Alexander Lakhin wrote:


One month later,  I'd like to summarize failures that I've investigated
and classified during June, 2024 on the aforementioned wiki page.
(Maybe it would make sense to issue a monthly report with such information
in the future.)


Please take a look at the July report on buildfarm failures:
# SELECT br, count(*) FROM failures WHERE dt >= '2024-07-01' AND
 dt < '2024-08-01' GROUP BY br;

REL_12_STABLE: 11
REL_13_STABLE: 9
REL_14_STABLE: 7
REL_15_STABLE: 10
REL_16_STABLE: 9
REL_17_STABLE: 68
HEAD: 106
-- Total: 220
(Counting test failures only, excluding indent-check, Configure, Build
errors.)

# SELECT COUNT(*) FROM (SELECT DISTINCT issue_link FROM failures WHERE
 dt >= '2024-07-01' AND dt < '2024-08-01');
40

# SELECT issue_link, count(*) FROM failures WHERE dt >= '2024-07-01' AND
 dt < '2024-08-01' GROUP BY issue_link ORDER BY 2 DESC LIMIT 9;

https://www.postgresql.org/message-id/20240404170055.qynecay7szu3d...@awork3.anarazel.de:
 29
-- An environmental issue

https://www.postgresql.org/message-id/a9a97e83-9ec8-5de5-bf69-80e9560f5...@gmail.com:
 20
-- Probably fixed

https://www.postgresql.org/message-id/1545399.1720554...@sss.pgh.pa.us: 11
-- Fixed

https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319...@gmail.com:
 9

https://www.postgresql.org/message-id/db093cce-7eec-8516-ef0f-891895178...@gmail.com:
 8
-- An environmental issue; probably fixed

https://www.postgresql.org/message-id/b2037a8d-fe6b-d299-da17-ff5f3214e...@gmail.com:
 8

https://www.postgresql.org/message-id/3e2cbd24-f45e-4b2b-ba83-8149214f0...@dunslane.net:
 8
-- Fixed

https://www.postgresql.org/message-id/68de6498-0449-a113-dd03-e198dded0...@gmail.com:
 8
-- Fixed

https://www.postgresql.org/message-id/3618203.1722473...@sss.pgh.pa.us: 8
-- Fixed

# SELECT count(*) FROM failures WHERE dt >= '2024-07-01' AND
 dt < '2024-08-01' AND issue_link IS NULL; -- Unsorted/unhelpful failures
17

And one more metric that might be useful, though it also requires time
analysis: 83 short-lived (eliminated immediately) failures.

I also wrote a simple script (see attached) to check for unknown buildfarm
failures using the "HTML API", to make sure no failures are missed. Surely, it
could be improved in many ways, but I find it rather useful as-is.

Best regards,
Alexander

#!/bin/bash

TMP=${TMPDIR:-/tmp}
wget "https://buildfarm.postgresql.org/cgi-bin/show_failures.pl"; -O 
"$TMP/failures.html"
wget "https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures"; -O 
"$TMP/known-test-failures.html"
sed -E 's/\&max_days/\&max_days/; s/(hours|mins|secs| i) < /\1 \< /' -i 
"$TMP/failures.html"
cat << 'EOF' > "$TMP/t.xsl"

http://www.w3.org/1999/XSL/Transform";
  xmlns="http://www.w3.org/1999/xhtml";
  xmlns:xhtml="http://www.w3.org/1999/xhtml";>












EOF

for fl in $(xsltproc "$TMP/t.xsl" "$TMP/failures.html"); do
if [[ $fl == show_log* ]]; then
sfl=${fl/\&/\&}
grep -q "$sfl" "$TMP/known-test-failures.html" && continue
echo "An unknown failure found: 
https://buildfarm.postgresql.org/cgi-bin/$fl";
wget "https://buildfarm.postgresql.org/cgi-bin/$fl"; -O 
"$TMP/failure-$fl.log"

il=""
if grep -q -Pzo \
'(?s)pgsql.build/testrun/pg_basebackup/040_pg_createsubscriber/log/regress_log_040_pg_createsubscriber\b.*'\
'ok 29 - run pg_createsubscriber without --databases\s*\n.*'\
'pg_createsubscriber: error: recovery timed out\s*\n.*'\
'not ok 30 - run pg_createsubscriber on node S\s*\n'\
"$TMP/failure-$fl.log"; then

il="https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#040_pg_createsubscriber.pl_fails_due_to_recovery_timed_out_during_pg_createsubscriber_run";

elif grep -q -Pzo 
'(?s)(pgsql.build/testrun/postgres_fdw-running/regress|pgsql.build/testrun/postgres_fdw/regress|pgsql.build/contrib/postgres_fdw)/regression.diffs<.*'\
' ERROR:  canceling statement due to statement timeout\s*\n'\
'\+WARNING:  could not get result of cancel request due to timeout\s*\n'\
"$TMP/failure-$fl.log"; then

il="https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#The_postgres_fdw_test_fails_due_to_an_unexpected_warning_on_canceling_a_statement";

elif grep -q -Pzo \
'(?s)# poll_query_until timed out executing this query:\s*\n'\
'# \s*\n'\
'# SELECT NOT EXISTS \(\s*\n'\
'#  SELECT \*\s*\n'\
'#  FROM pg_database\s*\n'\
"#\s*WHERE age\(d

Re: v17 vs v16 performance comparison

2024-08-01 Thread Alexander Lakhin

Hello Thomas,

01.08.2024 08:57, Thomas Munro wrote:

On Thu, Aug 1, 2024 at 3:00 PM Alexander Lakhin  wrote:

So it looks like q15 from TPC-DS is not the only query suffering from that
change.

I'm going to try to set up a local repro to study these new cases.  If
you have a write-up somewhere of how exactly you run that, that'd be
useful.


I'm using this instrumentation (on my Ubuntu 22.04 workstation):
https://github.com/alexanderlaw/pg-mark.git
README.md can probably serve as a such write-up.

If you install all the prerequisites (some tests, including pg_tpcds,
require downloading additional resources; run-benchmarks.py will ask to
do that), there should be no problems with running benchmarks.

I just added two instances to config.xml:
    
    
and ran
1)
./prepare-instances.py -i pg-src-16 pg-src-17

2)
time ./run-benchmarks.py -i pg-src-16 pg-src-17 pg-src-16 pg-src-17 pg-src-17 
pg-src-16
(it took 1045m55,215s on my machine, so you may prefer to choose a single
benchmark (-b pg_tpcds or maybe s64da_tpcds))

3)
./analyze-benchmarks.py -i 'pg-src-17--.*' 'pg-src-16--.*'

All the upper-level commands to run benchmarks are contained in config.xml,
so you can just execute them separately, but my instrumentation eases
processing of the results by creating one unified benchmark-results.xml.

Please feel free to ask any questions or give your feedback.

Thank you for paying attention to this!

Best regards,
Alexander




v17 vs v16 performance comparison

2024-07-31 Thread Alexander Lakhin

Hello hackers,

I've repeated the performance measurement for REL_17_STABLE (1e020258e)
and REL_16_STABLE (6f6b0f193) and found several benchmarks where v16 is
significantly better than v17. Please find attached an html table with
all the benchmarking results.

I had paid attention to:
Best pg-src-17--.* worse than pg-src-16--.* by 57.9 percents (225.11 > 142.52): 
pg_tpcds.query15
Average pg-src-17--.* worse than pg-src-16--.* by 55.5 percents (230.57 > 
148.29): pg_tpcds.query15
in May, performed `git bisect` for this degradation, that led me to commit
b7b0f3f27 [1].

This time I bisected the following anomaly:
Best pg-src-17--.* worse than pg-src-16--.* by 23.6 percents (192.25 > 155.58): 
pg_tpcds.query21
Average pg-src-17--.* worse than pg-src-16--.* by 25.1 percents (196.19 > 
156.85): pg_tpcds.query21
and to my surprise I got "b7b0f3f27 is the first bad commit".

Moreover, bisecting of another anomaly:
Best pg-src-17--.* worse than pg-src-16--.* by 24.2 percents (24269.21 > 
19539.89): pg_tpcds.query72
Average pg-src-17--.* worse than pg-src-16--.* by 24.2 percents (24517.66 > 
19740.12): pg_tpcds.query72
pointed at the same commit again.

So it looks like q15 from TPC-DS is not the only query suffering from that
change.

But beside that, I've found a separate regression. Bisecting for this 
degradation:
Best pg-src-17--.* worse than pg-src-16--.* by 105.0 percents (356.63 > 
173.96): s64da_tpcds.query95
Average pg-src-17--.* worse than pg-src-16--.* by 105.2 percents (357.79 > 
174.38): s64da_tpcds.query95
pointed at f7816aec2.

Does this deserve more analysis and maybe fixing?

[1] 
https://www.postgresql.org/message-id/63a63690-dd92-c809-0b47-af05459e95d1%40gmail.com

Best regards,
Alexander

./analyze-benchmarks.py -i 'pg-src-17--.*' 'pg-src-16--.*'
Best pg-src-17--.* better than pg-src-16--.* by 5.3 percents (9.04 < 9.55): 
pg_tpch.query1
Average pg-src-17--.* better than pg-src-16--.* by 5.8 percents (9.05 < 9.61): 
pg_tpch.query1
Best pg-src-17--.* better than pg-src-16--.* by 5.3 percents (2.84 < 3.00): 
pg_tpch.query3
Average pg-src-17--.* better than pg-src-16--.* by 5.7 percents (2.85 < 3.02): 
pg_tpch.query3
Best pg-src-17--.* better than pg-src-16--.* by 8.2 percents (1.35 < 1.47): 
pg_tpch.query6
Average pg-src-17--.* better than pg-src-16--.* by 7.5 percents (1.36 < 1.47): 
pg_tpch.query6
Best pg-src-17--.* worse than pg-src-16--.* by 5.9 percents (0.90 > 0.85): 
pg_tpch.query9
Best pg-src-17--.* better than pg-src-16--.* by 8.2 percents (2.58 < 2.81): 
pg_tpch.query13
Average pg-src-17--.* better than pg-src-16--.* by 5.9 percents (2.66 < 2.83): 
pg_tpch.query13
Best pg-src-17--.* better than pg-src-16--.* by 6.2 percents (3.02 < 3.22): 
pg_tpch.query15
Average pg-src-17--.* better than pg-src-16--.* by 6.1 percents (3.03 < 3.23): 
pg_tpch.query15
Best pg-src-17--.* better than pg-src-16--.* by 5.6 percents (8.14 < 8.62): 
pg_tpch.query17
Average pg-src-17--.* better than pg-src-16--.* by 5.2 percents (8.18 < 8.63): 
pg_tpch.query17
Best pg-src-17--.* better than pg-src-16--.* by 5.3 percents (0.18 < 0.19): 
pg_tpch.query19
Best pg-src-17--.* better than pg-src-16--.* by 18.6 percents (4.81 < 5.91): 
pg_tpch.query20
Average pg-src-17--.* better than pg-src-16--.* by 10.6 percents (5.30 < 5.93): 
pg_tpch.query20
Best pg-src-17--.* better than pg-src-16--.* by 5.9 percents (2.07 < 2.20): 
pg_tpcds.vacuum_freeze_time
Average pg-src-17--.* better than pg-src-16--.* by 5.9 percents (2.08 < 2.21): 
pg_tpcds.vacuum_freeze_time
Best pg-src-17--.* better than pg-src-16--.* by 6.9 percents (73.31 < 78.77): 
pg_tpcds.query92
Average pg-src-17--.* better than pg-src-16--.* by 7.2 percents (74.12 < 
79.90): pg_tpcds.query92
Best pg-src-17--.* better than pg-src-16--.* by 6.3 percents (600.38 < 640.59): 
pg_tpcds.query13
Average pg-src-17--.* better than pg-src-16--.* by 6.0 percents (604.42 < 
643.14): pg_tpcds.query13
Best pg-src-17--.* better than pg-src-16--.* by 7.3 percents (1324.73 < 
1429.52): pg_tpcds.query9
Average pg-src-17--.* better than pg-src-16--.* by 7.3 percents (1327.16 < 
1431.45): pg_tpcds.query9
Best pg-src-17--.* worse than pg-src-16--.* by 23.6 percents (192.25 > 155.58): 
pg_tpcds.query21
Average pg-src-17--.* worse than pg-src-16--.* by 25.1 percents (196.19 > 
156.85): pg_tpcds.query21
Best pg-src-17--.* better than pg-src-16--.* by 7.1 percents (4.35 < 4.68): 
pg_tpcds.query93
Best pg-src-17--.* worse than pg-src-16--.* by 54.3 percents (145.53 > 94.31): 
pg_tpcds.query16
Average pg-src-17--.* worse than pg-src-16--.* by 50.0 percents (146.84 > 
97.87): pg_tpcds.query16
Best pg-src-17--.* better than pg-src-16--.* by 5.8 percents (250.23 < 265.52): 
pg_tpcds.query68
Best pg-src-17--.* better than pg-src-16--.* by 5.8 percents (7.38 < 7.83): 
pg_tpcds.query37
Average pg-src-17--.* worse than pg-src-16--.* by 19.4 percents (9.50 > 7.95): 
pg_tpcds.query37
Best pg-src-17--.* worse than pg-src-16--.* by 24.2 percents (24269.21 > 
19539.89): pg_tpcds.query72
Average pg-src-1

The stats.sql test is failing sporadically in v14- on POWER7/AIX 7.1 buildfarm animals

2024-07-31 Thread Alexander Lakhin

Hello hackers,

Yesterday, the buildfarm animal sungazer was benevolent enough to
demonstrate a rare anomaly related to the old stats collector:
test stats    ... FAILED   469155 ms


 1 of 212 tests failed.


--- 
/home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/expected/stats.out
 2022-03-30 01:18:17.0 +
+++ 
/home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/results/stats.out
 2024-07-30 09:49:39.0 +
@@ -165,11 +165,11 @@
   WHERE relname like 'trunc_stats_test%' order by relname;
    relname  | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | 
n_dead_tup
---+---+---+---++
-  trunc_stats_test  | 3 | 0 | 0 | 0 |  0
-  trunc_stats_test1 | 4 | 2 | 1 | 1 |  0
-  trunc_stats_test2 | 1 | 0 | 0 | 1 |  0
-  trunc_stats_test3 | 4 | 0 | 0 | 2 |  2
-  trunc_stats_test4 | 2 | 0 | 0 | 0 |  2
+  trunc_stats_test  | 0 | 0 | 0 | 0 |  0
+  trunc_stats_test1 | 0 | 0 | 0 | 0 |  0
+  trunc_stats_test2 | 0 | 0 | 0 | 0 |  0
+  trunc_stats_test3 | 0 | 0 | 0 | 0 |  0
+  trunc_stats_test4 | 0 | 0 | 0 | 0 |  0
...

inst/logfile contains:
2024-07-30 09:25:11.225 UTC [63307946:1] LOG:  using stale statistics instead of current ones because stats collector is 
not responding
2024-07-30 09:25:11.345 UTC [11206724:559] pg_regress/create_index LOG:  using stale statistics instead of current ones 
because stats collector is not responding

...

That's not the only failure of that kind that occurred on sungazer; there were
also [2] (REL_13_STABLE), [3] (REL_13_STABLE), [4] (REL_12_STABLE).
Moreover, such failures were produced by all the other POWER7/AIX 7.1
animals: hornet ([5], [6]), tern ([7], [8]), mandrill ([9], [10], ...).
But I could not find such failures coming from POWER8 animals: hoverfly
(running AIX 7200-04-03-2038), ayu, boa, chub; and I did not encounter such
anomalies on x86 or ARM platforms either.

Thus, it looks like this stats collector issue is only happening on this
particular platform, and given [11], I think such failures should perhaps
just be ignored for the next two years (until v14 EOL), unless AIX 7.1
is upgraded and we see them on a vendor-supported OS version.

So I'm parking this information here just for reference.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-07-30%2003%3A49%3A35
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-02-09%2009%3A29%3A10
[3] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2022-06-16%2009%3A52%3A47
[4] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-12-13%2003%3A40%3A42
[5] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-03-29%2005%3A27%3A09
[6] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-03-19%2002%3A09%3A07
[7] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-12-16%2009%3A17%3A38
[8] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-04-01%2003%3A09%3A38
[9] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-05%2004%3A22%3A17
[10] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-07-12%2004%3A31%3A37
[11] https://www.postgresql.org/message-id/3154146.1697661946%40sss.pgh.pa.us

Best regards,
Alexander




Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Alexander Lakhin

Hello Tom and Heikki,

29.07.2024 17:15, Tom Lane wrote:

Heikki Linnakangas  writes:

These programs in src/backend/utils/mb/ are unused, and have been unused
and unusable since 2003:
iso.c
win1251.c
win866.c
Attached patch removes them. See commit message for a little more
detailed analysis.

+1.  Seems to have been my oversight in 4c3c8c048d.


I also wonder whether src/test/locale/ still makes sense; does anyone
run those tests (I could not run a single one on a quick attempt)?

(As far as I can tell, KOI8-R fell out of mainstream usage in Russia
twenty years ago...)

Best regards,
Alexander




Re: why is pg_upgrade's regression run so slow?

2024-07-29 Thread Alexander Lakhin

Hello Andrew,

28.07.2024 22:43, Andrew Dunstan wrote:


Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on HEAD for fairywren and drongo - fairywren has
just gone green, and I expect drongo will when it reports in a few hours.


I'm at a loss for an explanation.



I'm observing the same here (using a Windows 10 VM).

With no TEMP_CONFIG set, `meson test` gives me these numbers:
test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 72.34s

test: postgresql:regress / regress/regress
duration: 41.98s

But with debug_parallel_query=regress in TEMP_CONFIG, I see:
test: postgresql:regress / regress/regress
duration: 50.08s

test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 398.45s

With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
parallel workers were started during the test:
...\postgresql\build>grep "starting background worker process" -r 
testrun/pg_upgrade | wc -l
17532

And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.

Best regards,
Alexander




dikkop failed the pg_combinebackupCheck/006_db_file_copy.pl test

2024-07-28 Thread Alexander Lakhin

Hello Tomas,

Please take a look at a recent failure on dikkop [1]. The
regress_log_006_db_file_copy file from that run shows:
[02:08:57.929](0.014s) # initializing database system by copying initdb template
...
[02:09:22.511](24.583s) ok 1 - full backup
...
[02:10:35.758](73.247s) not ok 2 - incremental backup

006_db_file_copy_primary.log contains:
2024-07-28 02:09:29.441 UTC [67785:12] 006_db_file_copy.pl LOG: received replication command: START_REPLICATION SLOT 
"pg_basebackup_67785" 0/400 TIMELINE 1
2024-07-28 02:09:29.441 UTC [67785:13] 006_db_file_copy.pl STATEMENT:  START_REPLICATION SLOT "pg_basebackup_67785" 
0/400 TIMELINE 1

2024-07-28 02:09:29.441 UTC [67785:14] 006_db_file_copy.pl LOG: acquired physical 
replication slot "pg_basebackup_67785"
2024-07-28 02:09:29.441 UTC [67785:15] 006_db_file_copy.pl STATEMENT:  START_REPLICATION SLOT "pg_basebackup_67785" 
0/400 TIMELINE 1

2024-07-28 02:10:29.487 UTC [67785:16] 006_db_file_copy.pl LOG: terminating 
walsender process due to replication timeout
2024-07-28 02:10:29.487 UTC [67785:17] 006_db_file_copy.pl STATEMENT:  START_REPLICATION SLOT "pg_basebackup_67785" 
0/400 TIMELINE 1


It looks like this incremental backup operation was performed slower than
usual (it took more than 60 seconds and apparently was interrupted due to
wal_sender_timeout). But looking at regress_log_006_db_file_copy from the
6 previous (successful) test runs, we can see:
[14:22:16.841](43.215s) ok 2 - incremental backup
[02:14:42.888](34.595s) ok 2 - incremental backup
[17:51:16.152](43.708s) ok 2 - incremental backup
[04:07:16.757](31.087s) ok 2 - incremental backup
[12:15:01.256](49.432s) ok 2 - incremental backup
[01:06:02.482](52.364s) ok 2 - incremental backup

Thus reaching 60s (e.g., due to some background activity) on this animal
seems pretty possible. So maybe it would make sense to increase
wal_sender_timeout for it, say, to 120s?
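
(A minimal way to try such a setting locally, assuming the test lives under
src/bin/pg_combinebackup and reusing the TEMP_CONFIG approach from other
reports here:
$ echo "wal_sender_timeout = 120s" > /tmp/extra.config
$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_combinebackup \
    PROVE_TESTS="t/006*"
For the buildfarm animal itself, the equivalent would presumably be an
extra_config entry in its configuration.)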

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2024-07-27%2023%3A22%3A57

Best regards,
Alexander




Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)

2024-07-26 Thread Alexander Lakhin

Hello Robins,

28.06.2024 13:20, Robins Tharakan wrote:

The past ~1 week, I tried to space out all other tasks on the machine, so as to 
ensure
that 1-min CPU is mostly <2 (and thus not many things hammering the disk) and 
with
that I see 0 failures these past few days. This isn't conclusive by any means, 
but it
does seem that reducing IO contention has helped remove the errors, like what
Alexander suspects / repros here.

Just a note, that I've reverted some of those recent changes now, and so if the 
theory
holds true, I wouldn't be surprised if some of these errors restarted on dodo.


Looking back at the test failures, I can see that the errors indeed reappeared
just after your revert (at 2024-06-28), so that theory proved true,
but I have seen none of them since 2024-07-02. Does that mean you changed
something on dodo or fixed that performance issue?

Could you please describe how you resolved this issue, just for the record?

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-28%2017%3A00%3A28
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-28%2017%3A10%3A12
[3] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-07-01%2012%3A10%3A12
[4] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-07-01%2013%3A01%3A00
[5] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-07-02%2005%3A00%3A36
[6] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-07-02%2018%3A00%3A15

Best regards,
Alexander

Re: fairywren timeout failures on the pg_amcheck/005_opclass_damage test

2024-07-26 Thread Alexander Lakhin

26.07.2024 15:41, Andrew Dunstan wrote:




One way to work around this is to disable debug_parallel_query in the test,
and another possibility is to set max_parallel_workers = 0.




But wouldn't either of those just be masking the problem?



Yes, I'm inclined to consider this behavior a problem (what if the table
contained 1M rows?); that's why I called those solutions workarounds.

Of course, there are parallel_setup_cost and parallel_tuple_cost
parameters, which can prevent this from happening in the wild, but still...
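
(Just to illustrate, the current values of those GUCs can be checked with:
$ psql -c "SHOW parallel_setup_cost" -c "SHOW parallel_tuple_cost"
)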

Best regards,
Alexander




Re: fairywren timeout failures on the pg_amcheck/005_opclass_damage test

2024-07-26 Thread Alexander Lakhin

25.07.2024 15:00, Alexander Lakhin wrote:



The other question is: why is 005_opclass_damage taking so much time there?
...
$ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" 
PROVE_FLAGS="--timer"
[11:11:53] t/005_opclass_damage.pl .. ok 1370 ms ( 0.00 usr 0.00 sys +  
0.10 cusr  0.07 csys =  0.17 CPU)

$ echo "debug_parallel_query = regress" >/tmp/extra.config
$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" 
PROVE_FLAGS="--timer"
[11:12:46] t/005_opclass_damage.pl .. ok    40854 ms ( 0.00 usr 0.00 sys +  
0.10 cusr  0.10 csys =  0.20 CPU)

...
So maybe at least this test should be improved for testing with
debug_parallel_query enabled, if such active use of parallel workers by
pg_amcheck can't be an issue for end users?



When running this test with "log_min_messages = DEBUG2" in my extra.config,
I see thousands of the following messages in the test log:
2024-07-26 09:32:54.544 UTC [2572189:46] DEBUG:  postmaster received pmsignal 
signal
2024-07-26 09:32:54.544 UTC [2572189:47] DEBUG:  registering background worker 
"parallel worker for PID 2572197"
2024-07-26 09:32:54.544 UTC [2572189:48] DEBUG:  starting background worker process 
"parallel worker for PID 2572197"
2024-07-26 09:32:54.547 UTC [2572189:49] DEBUG:  unregistering background worker 
"parallel worker for PID 2572197"
2024-07-26 09:32:54.547 UTC [2572189:50] DEBUG:  background worker "parallel 
worker" (PID 2572205) exited with exit code 0
2024-07-26 09:32:54.547 UTC [2572189:51] DEBUG:  postmaster received pmsignal 
signal
2024-07-26 09:32:54.547 UTC [2572189:52] DEBUG:  registering background worker 
"parallel worker for PID 2572197"
2024-07-26 09:32:54.547 UTC [2572189:53] DEBUG:  starting background worker process 
"parallel worker for PID 2572197"
2024-07-26 09:32:54.549 UTC [2572189:54] DEBUG:  unregistering background worker 
"parallel worker for PID 2572197"
2024-07-26 09:32:54.549 UTC [2572189:55] DEBUG:  background worker "parallel 
worker" (PID 2572206) exited with exit code 0
...

grep ' registering background worker' \
 src/bin/pg_amcheck/tmp_check/log/005_opclass_damage_test.log | wc -l
15669

So this test launches more than 15000 processes when debug_parallel_query
is enabled.

As far as I can see, this is happening because the "PARALLEL UNSAFE"
marking is ignored when the function is called by CREATE INDEX/amcheck.

Namely, with a function defined as:
    CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE sql AS $$
    SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 END; $$;

SELECT int4_asc_cmp(1, 2);
is executed without parallel workers, whereas when it's used by an index:
CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS
...
OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4);

INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs);

CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops);
launches 1000 parallel workers.

(This is reminiscent of bug #18314.)

One way to work around this is to disable debug_parallel_query in the test,
and another possibility is to set max_parallel_workers = 0.
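
A rough local experiment along those lines, reusing the TEMP_CONFIG recipe
quoted above (if I understand correctly, max_parallel_workers = 0 keeps the
forced Gather nodes but prevents any worker launches, which is what seems to
dominate the timing):
$ echo "debug_parallel_query = regress" > /tmp/extra.config
$ echo "max_parallel_workers = 0" >> /tmp/extra.config
$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck \
    PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"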

Best regards,
Alexander




Re: Sporadic connection-setup-related test failures on Cygwin in v15-

2024-07-25 Thread Alexander Lakhin

25.07.2024 19:25, Andrew Dunstan wrote:

+1.  I'm not planning to back-patch that work.  Perhaps lorikeet
could stop testing releases < 16?  They don't work and it's not our
bug[1].  We decided not to drop Cygwin support[2], but I don't think
we're learning anything from investigating that noise in the
known-broken branches.



Sure, it can. I've made that change.



Thank you, Andrew!

I've moved those issues to the "Fixed" category.

Best regards,
Alexander




fairywren timeout failures on the pg_amcheck/005_opclass_damage test

2024-07-25 Thread Alexander Lakhin

Hello hackers,

Please take a look at today's failure [1], which raises several questions
at once:
117/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade TIMEOUT    
3001.48s   exit status 1
180/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage TIMEOUT    
3001.43s   exit status 1

Ok: 227
Expected Fail:  0
Fail:   0
Unexpected Pass:    0
Skipped:    15
Timeout:    2

But looking at the previous test run [2], marked 'OK', I can see almost
the same:
115/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade TIMEOUT    
3001.54s   exit status 1
176/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage TIMEOUT    
3001.26s   exit status 1

Ok: 227
Expected Fail:  0
Fail:   0
Unexpected Pass:    0
Skipped:    15
Timeout:    2

So it's not clear to me why the latter test run is considered failed,
unlike the former.

As to the 005_opclass_damage failure itself, we can find successful test
runs with duration close to 3000s:
[3] 212/242 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK
 2894.75s   10 subtests passed
[4] 212/242 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK
 2941.73s   10 subtests passed

(The average duration across 35 successful runs in June was around 1300s;
but in July only 5 test runs were successful.)

The other question is: why is 005_opclass_damage taking so much time there?
Looking at the last three runs in REL_17_STABLE, I see:
 87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK   
22.80s   10 subtests passed
 87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK   
19.60s   10 subtests passed
 87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK    
6.09s   10 subtests passed

I suppose the most significant factor here is
'HEAD' => [
    'debug_parallel_query = regress'
]
in the buildfarm client's config.

Indeed, it affects the timing of the test very much, judging by what I'm
seeing locally on Linux (I guess process-related overhead might be
higher on Windows):
$ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" 
PROVE_FLAGS="--timer"
[11:11:53] t/005_opclass_damage.pl .. ok 1370 ms ( 0.00 usr 0.00 sys +  
0.10 cusr  0.07 csys =  0.17 CPU)

$ echo "debug_parallel_query = regress" >/tmp/extra.config
$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" 
PROVE_FLAGS="--timer"
[11:12:46] t/005_opclass_damage.pl .. ok    40854 ms ( 0.00 usr 0.00 sys +  
0.10 cusr  0.10 csys =  0.20 CPU)

(Thus we can see a 30x duration increase.)

It's worth noting that this increase is rather new (introduced by
5ae208720); in REL_16_STABLE I'm seeing:
$ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" 
PROVE_FLAGS="--timer"
[11:18:52] t/005_opclass_damage.pl .. ok 1453 ms ( 0.00 usr 0.00 sys +  
0.82 cusr  0.11 csys =  0.93 CPU)

$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" 
PROVE_FLAGS="--timer"
[11:19:18] t/005_opclass_damage.pl .. ok 8032 ms ( 0.00 usr 0.00 sys +  
0.82 cusr  0.11 csys =  0.93 CPU)

So maybe at least this test should be improved for testing with
debug_parallel_query enabled, if such active use of parallel workers by
pg_amcheck can't be an issue for end users?

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-25%2003%3A23%3A19
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-23%2023%3A01%3A55
[3] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-05-02%2019%3A03%3A08
[4] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-04-30%2014%3A03%3A06

Best regards,
Alexander




Re: Recent 027_streaming_regress.pl hangs

2024-07-24 Thread Alexander Lakhin

Hello Andrew,

04.06.2024 13:00, Alexander Lakhin wrote:

Also, 027_stream_regress still fails due to the same reason:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-05-22%2021%3A55%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-05-22%2021%3A54%3A50
(It's remarkable that these two animals failed at the same time.)



It looks like crake is failing now for other reasons (not just
concurrency): it has produced 10+ failures of the
027_stream_regress test since July 9.

The first such failure on REL_16_STABLE was [1], and that was the first run
with 'PG_TEST_EXTRA' => '... wal_consistency_checking'.

There is one known issue related to wal_consistency_checking [2], but I
see no "incorrect resource manager data checksum" in the failure log...

Moreover, the first such failure on REL_17_STABLE was [3], but that run
was performed without wal_consistency_checking, as far as I can see.

Can that failure also be related to the OS upgrade (I see that back in
June crake was running on Fedora 39, but now it's running on Fedora 40)?

So maybe we have two factors combined or there is another one?

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-17%2014%3A56%3A09
[2] 
https://www.postgresql.org/message-id/055bb33c-bccc-bc1d-c2f8-859853444...@gmail.com
[3] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-09%2021%3A37%3A04

Best regards,
Alexander




Re: Sporadic connection-setup-related test failures on Cygwin in v15-

2024-07-24 Thread Alexander Lakhin

24.07.2024 23:58, Thomas Munro wrote:

+1.  I'm not planning to back-patch that work.  Perhaps lorikeet
could stop testing releases < 16?  They don't work and it's not our
bug[1].  We decided not to drop Cygwin support[2], but I don't think
we're learning anything from investigating that noise in the
known-broken branches.


Yeah, it looks like lorikeet votes +1 for your proposal [1].
(I suppose it failed due to the same signal processing issue, just in another
way.)

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2024-07-24%2008%3A54%3A07

Best regards,
Alexander




Sporadic connection-setup-related test failures on Cygwin in v15-

2024-07-24 Thread Alexander Lakhin

Hello hackers,

A recent lorikeet (a Cygwin animal) failure [1] revealed one more
long-standing (see also [2], [3], [4]) issue related to Cygwin:
 SELECT dblink_connect('dtest1', connection_parameters());
- dblink_connect
-
- OK
-(1 row)
-
+ERROR:  could not establish connection
+DETAIL:  could not connect to server: Connection refused

where inst/logfile contains:
2024-07-16 05:38:21.492 EDT [66963f67.7823:4] LOG:  could not accept new 
connection: Software caused connection abort
2024-07-16 05:38:21.492 EDT [66963f8c.79e5:170] pg_regress/dblink ERROR:  could 
not establish connection
2024-07-16 05:38:21.492 EDT [66963f8c.79e5:171] pg_regress/dblink DETAIL:  
could not connect to server: Connection refused
    Is the server running locally and accepting
    connections on Unix domain socket 
"/home/andrew/bf/root/tmp/buildfarm-DK1yh4/.s.PGSQL.5838"?

I made a standalone reproducing script (assuming the dblink extension
is installed):
numclients=50
for ((i=1;i<=1000;i++)); do
echo "iteration $i"

for ((c=1;c<=numclients;c++)); do
cat << 'EOF' | /usr/local/pgsql/bin/psql >/dev/null 2>&1 &

SELECT 'dbname='|| current_database()||' port='||current_setting('port')
  AS connstr
\gset

SELECT * FROM dblink('service=no_service', 'SELECT 1') AS t(i int);

SELECT * FROM
dblink(:'connstr', 'SELECT 1') AS t1(i int),
dblink(:'connstr', 'SELECT 2') AS t2(i int),
dblink(:'connstr', 'SELECT 3') AS t3(i int),
dblink(:'connstr', 'SELECT 4') AS t4(i int),
dblink(:'connstr', 'SELECT 5') AS t5(i int);
EOF
done
wait

grep -A1 "Software caused connection abort" server.log && break;
done

which fails for me as below:
iteration 318
2024-07-24 04:19:46.511 PDT [29062:6][postmaster][:0] LOG:  could not accept new connection: Software caused connection 
abort

2024-07-24 04:19:46.512 PDT [25312:8][client backend][36/1996:0] ERROR:  could 
not establish connection

The important fact here is that this failure is not reproduced after
7389aad63 (in v16), so it seems that it's somehow related to signal
processing. Given that, I'm inclined to stop here, without digging deeper,
at least until there are plans to backport that fix or something...

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2024-07-16%2009%3A18%3A31
 (REL_13_STABLE)
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2022-07-21%2000%3A36%3A44
 (REL_14_STABLE)
[3] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2023-07-06%2009%3A19%3A36
 (REL_12_STABLE)
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2022-02-12%2001%3A40%3A56 (REL_13_STABLE, 
postgres_fdw)


Best regards,
Alexander




Re: race condition when writing pg_control

2024-07-24 Thread Alexander Lakhin

Hello Thomas,

15.07.2024 06:44, Thomas Munro wrote:

I'm going to upgrade this to a proposal:

https://commitfest.postgresql.org/49/5124/

I wonder how often this happens in the wild.


Please look at a recent failure [1], produced by buildfarm animal
culicidae, which tests EXEC_BACKEND. I guess it's caused by the issue
discussed.

Maybe it would make sense to construct a reliable reproducer for the
issue (I could not find a ready-to-use recipe in this thread)...

What do you think?

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-07-24%2004%3A08%3A23

Best regards,
Alexander




The 031_recovery_conflict.pl test might fail due to late pgstat entries flushing

2024-07-23 Thread Alexander Lakhin

Hello hackers,

A recent buildfarm failure [1] shows that the test 031_recovery_conflict.pl
can fail yet another way:
 23/296 postgresql:recovery / recovery/031_recovery_conflict ERROR    
11.55s   exit status 1

[07:58:53.979](0.255s) ok 11 - tablespace conflict: logfile contains terminated 
connection due to recovery conflict
[07:58:54.058](0.080s) not ok 12 - tablespace conflict: stats show conflict on 
standby
[07:58:54.059](0.000s) #   Failed test 'tablespace conflict: stats show 
conflict on standby'
#   at 
/home/bf/bf-build/rorqual/REL_17_STABLE/pgsql/src/test/recovery/t/031_recovery_conflict.pl
 line 332.
[07:58:54.059](0.000s) #  got: '0'
# expected: '1'

I managed to reproduce a similar failure by running multiple test instances
in parallel on a slow VM. With extra logging added, I see in a failed run
log:
10
10  #   Failed test 'startup deadlock: stats show conflict on standby'
10  #   at t/031_recovery_conflict.pl line 368.
10  #  got: '0'
10  # expected: '1'
10  [19:48:19] t/031_recovery_conflict.pl ..
10  Dubious, test returned 1 (wstat 256, 0x100)

2024-07-23 19:48:13.966 UTC [1668402:12][client backend][1/2:0] LOG:  !!!pgstat_report_recovery_conflict| reason: 13, 
pgstat_track_counts: 1 at character 15

2024-07-23 19:48:13.966 UTC [1668402:13][client backend][1/2:0] STATEMENT:  
SELECT * FROM test_recovery_conflict_table2;
2024-07-23 19:48:13.966 UTC [1668402:14][client backend][1/2:0] ERROR:  canceling statement due to conflict with 
recovery at character 15
2024-07-23 19:48:13.966 UTC [1668402:15][client backend][1/2:0] DETAIL:  User transaction caused buffer deadlock with 
recovery.

...
2024-07-23 19:48:14.129 UTC [1668805:8][client backend][5/2:0] LOG: statement: SELECT confl_deadlock FROM 
pg_stat_database_conflicts WHERE datname='test_db';

...
2024-07-23 19:48:14.148 UTC [1668402:16][client backend][1/0:0] LOG:  
!!!pgstat_database_flush_cb| nowait: 0

This failure can be reproduced easily with a sleep added as below:
@@ -514,6 +514,7 @@ pgstat_shutdown_hook(int code, Datum arg)
 if (OidIsValid(MyDatabaseId))
 pgstat_report_disconnect(MyDatabaseId);

+if (rand() % 5 == 0) pg_usleep(10);
 pgstat_report_stat(true);

By running the test in a loop, I get miscellaneous
"stats show conflict on standby" failures, including:
iteration 19
# +++ tap check in src/test/recovery +++
t/031_recovery_conflict.pl .. 1/?
#   Failed test 'buffer pin conflict: stats show conflict on standby'
#   at t/031_recovery_conflict.pl line 332.
#  got: '0'
# expected: '1'
t/031_recovery_conflict.pl .. 17/? # Looks like you failed 1 test of 18.
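
(A sketch of such a loop, assuming the sleep above is applied and the command
is run from the top of the source tree:
for i in $(seq 1 50); do
    echo "iteration $i"
    make -s check -C src/test/recovery \
        PROVE_TESTS="t/031_recovery_conflict.pl" || break
done
)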

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-23%2007%3A56%3A35

Best regards,
Alexander




Re: xid_wraparound tests intermittent failure.

2024-07-21 Thread Alexander Lakhin

Hello,

21.07.2024 20:34, Tom Lane wrote:

Andrew Dunstan  writes:

I noticed this when working on the PostgreSQL::Test::Session project I
have in hand. All the tests pass except occasionally the xid_wraparound
tests fail. It's not always the same test script that fails either. I
tried everything but couldn't make the failure stop. So then I switched
out my patch so it's running on plain master and set things running in a
loop. Lo and behold it can be relied on to fail after only a few
iterations.

I have been noticing xid_wraparound failures in the buildfarm too.
They seemed quite infrequent, but it wasn't till just now that
I realized that xid_wraparound is not run by default.  (You have to
put "xid_wraparound" in PG_TEST_EXTRA to enable it.)  AFAICS the
only buildfarm animals that have enabled it are dodo and perentie.
dodo is failing this test fairly often:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=dodo&br=HEAD


I think this failure is counted at [1]. Please look at the linked message
[2], where I described what makes the test fail.

[1] 
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#001_emergency_vacuum.pl_fails_to_wait_for_datfrozenxid_advancing

[2] 
https://www.postgresql.org/message-id/5811175c-1a31-4869-032f-7af5e3e45...@gmail.com

Best regards,
Alexander




Re: race condition in pg_class

2024-07-20 Thread Alexander Lakhin

Hello Noah,

28.06.2024 08:13, Noah Misch wrote:

Pushed.


A recent buildfarm test failure [1] showed that the
intra-grant-inplace-db.spec test added with 0844b3968 may fail
on a slow machine (per my understanding):

test intra-grant-inplace-db   ... FAILED 4302 ms

@@ -21,8 +21,7 @@
 WHERE datname = current_catalog
 AND age(datfrozenxid) > (SELECT min(age(x)) FROM frozen_witness);

-?column?
---
-datfrozenxid retreated
-(1 row)
+?column?
+
+(0 rows)

whilst the previous (successful) run shows much shorter duration:
test intra-grant-inplace-db   ... ok  540 ms

I reproduced this failure on a VM slowed down so that the test duration
reached 4+ seconds, with 100 "test: intra-grant-inplace-db" lines in
isolation_schedule (see the sketch after the timings below):
test intra-grant-inplace-db   ... ok 4324 ms
test intra-grant-inplace-db   ... FAILED 4633 ms
test intra-grant-inplace-db   ... ok 4649 ms
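
(A sketch of that setup, assuming it's acceptable to overwrite the schedule
file in the working tree for the experiment:
for i in $(seq 100); do echo "test: intra-grant-inplace-db"; done \
    > src/test/isolation/isolation_schedule
make -s check -C src/test/isolation
)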

But as the test is going to be modified by the inplace110-successors-v8.patch,
and the modified test (with all three latest patches applied) passes
reliably under the same conditions, maybe this failure doesn't deserve a
deeper exploration.

What do you think?

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08

Best regards,
Alexander




Re: Should consider materializing the cheapest inner path in consider_parallel_nestloop()

2024-07-18 Thread Alexander Lakhin

Hello Richard,

18.07.2024 17:30, Richard Guo wrote:

The problemed plan is a non-parallel nestloop join.  It's just chance
which join order the planner will pick, and slight variations in
underlying statistics could result in a different displayed plan.
 From the two verbose plans, we can see slight variations in the
statistics for the parallel seqscan of tenk1.

->  Parallel Seq Scan on public.tenk1 t1  (cost=0.00..370.00 rows=2500
width=244)

VS.

->  Parallel Seq Scan on public.tenk1 t1  (cost=0.00..369.99 rows=2499
width=244)

I have no idea why the underlying statistics changed, but it seems
that this slight change is sufficent to result in a different plan.


I think it could have the same cause as [1], and I can quite easily
(without multiple instances/loops, just with `make check`) reproduce
the failure with cranky-ConditionalLockBufferForCleanup.patch (but
targeted at "VACUUM ANALYZE tenk1;").


According to the discussion in [1], I think what we wanted to test
with this query is that parallel nestloop join is not generated if the
inner path is not parallel-safe.  Therefore, I modified this test case
to use a lateral join, rendering the inner path not parallel-safe
while also enforcing the join order.  Please see attached.


The modified test survives my testing procedure. Thank you for the patch!

[1] 
https://www.postgresql.org/message-id/flat/66eb9a6e-fc67-a230-c5b1-2a741e8b88c6%40gmail.com

Best regards,
Alexander




Re: Should consider materializing the cheapest inner path in consider_parallel_nestloop()

2024-07-18 Thread Alexander Lakhin

Hello Richard,

12.07.2024 05:29, Richard Guo wrote:

On Sat, Jul 6, 2024 at 5:32 PM Richard Guo  wrote:

Here is a new rebase.

I'm planning to push it next week, barring any objections.

Pushed.


Please look at a recent buildfarm failure [1], which shows some
instability of that test addition:
 -- the joinrel is not parallel-safe due to the OFFSET clause in the subquery
 explain (costs off)
 select * from tenk1 t1, (select * from tenk2 t2 offset 0) t2 where t1.two 
> t2.two;
-    QUERY PLAN

+   QUERY PLAN
+-
  Nested Loop
    Join Filter: (t1.two > t2.two)
-   ->  Gather
- Workers Planned: 4
- ->  Parallel Seq Scan on tenk1 t1
+   ->  Seq Scan on tenk2 t2
    ->  Materialize
- ->  Seq Scan on tenk2 t2
+ ->  Gather
+   Workers Planned: 4
+   ->  Parallel Seq Scan on tenk1 t1
 (7 rows)

I've managed to reproduce this plan change when running
multiple 027_stream_regress.pl instances simultaneously, with
parallel_schedule reduced to:
test: test_setup
test: create_misc
test: create_index
test: sanity_check
test: select_parallel

I've added the following to the test and got two verbose plans for
comparison (see the attachment).
 -- the joinrel is not parallel-safe due to the OFFSET clause in the subquery
 explain (costs off)
 select * from tenk1 t1, (select * from tenk2 t2 offset 0) t2 where t1.two 
> t2.two;
+\o plan.txt
+explain (verbose)
+    select * from tenk1 t1, (select * from tenk2 t2 offset 0) t2 where t1.two 
> t2.two;
+\o
 alter table tenk2 reset (parallel_workers);

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-07-17%2017%3A12%3A53

Best regards,
Alexander

   QUERY PLAN   



 Nested Loop  (cost=0.00..1500840.00 rows= width=488)
   Output: t1.unique1, t1.unique2, t1.two, t1.four, t1.ten, t1.twenty, 
t1.hundred, t1.thousand, t1.twothousand, t1.fivethous, t1.tenthous, t1.odd, 
t1.even, t1.stringu1, t1.stringu2, t1.string4, t2.unique1, t2.unique2, t2.two, 
t2.four, t2.ten, t2.twenty, t2.hundred, t2.thousand, t2.twothousand, 
t2.fivethous, t2.tenthous, t2.odd, t2.even, t2.stringu1, t2.stringu2, t2.string4
   Join Filter: (t1.two > t2.two)
   ->  Gather  (cost=0.00..370.00 rows=1 width=244)
 Output: t1.unique1, t1.unique2, t1.two, t1.four, t1.ten, t1.twenty, 
t1.hundred, t1.thousand, t1.twothousand, t1.fivethous, t1.tenthous, t1.odd, 
t1.even, t1.stringu1, t1.stringu2, t1.string4
 Workers Planned: 4
 ->  Parallel Seq Scan on public.tenk1 t1  (cost=0.00..370.00 rows=2500 
width=244)
   Output: t1.unique1, t1.unique2, t1.two, t1.four, t1.ten, 
t1.twenty, t1.hundred, t1.thousand, t1.twothousand, t1.fivethous, t1.tenthous, 
t1.odd, t1.even, t1.stringu1, t1.stringu2, t1.string4
   ->  Materialize  (cost=0.00..495.00 rows=1 width=244)
 Output: t2.unique1, t2.unique2, t2.two, t2.four, t2.ten, t2.twenty, 
t2.hundred, t2.thousand, t2.twothousand, t2.fivethous, t2.tenthous, t2.odd, 
t2.even, t2.stringu1, t2.stringu2, t2.string4
 ->  Seq Scan on public.tenk2 t2  (cost=0.00..445.00 rows=1 
width=244)
   Output: t2.unique1, t2.unique2, t2.two, t2.four, t2.ten, 
t2.twenty, t2.hundred, t2.thousand, t2.twothousand, t2.fivethous, t2.tenthous, 
t2.odd, t2.even, t2.stringu1, t2.stringu2, t2.string4
(12 rows)



   QUERY PLAN   



 Nested Loop  (cost=0.00..1500089.98 rows=33316667

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-07-16 Thread Alexander Lakhin

Hello,

25.06.2024 11:24, Jelte Fennema-Nio wrote:

My expectation is that that should remove all failure cases. If it
doesn't, I think our best bet is removing the test again.


It looks like that test eventually showed what could be called a virtue.
Please take a look at a recent BF failure [1]:
timed out after 10800 secs
...
# +++ regress install-check in contrib/postgres_fdw +++
# using postmaster on /home/andrew/bf/root/tmp/buildfarm-e2ahpQ, port 5878

So the postgres_fdw test hung for several hours while running on the
Cygwin animal lorikeet.

I've managed to reproduce this issue in my Cygwin environment by running
the postgres_fdw test in a loop (10 iterations are enough to get the
described effect). And what I'm seeing is that a query-cancelling backend
is stuck inside pgfdw_xact_callback() -> pgfdw_abort_cleanup() ->
pgfdw_cancel_query() -> pgfdw_cancel_query_begin() -> libpqsrv_cancel() ->
WaitLatchOrSocket() -> WaitEventSetWait() -> WaitEventSetWaitBlock() ->
poll().

The timeout value (approximately 30 seconds), which is passed to poll(),
is effectively ignored by this call; the wait is unbounded.

This definitely is caused by 2466d6654. (I applied the test change from that
commit to 2466d6654~1 and saw no issue when running the same test in a
loop.)
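
(A sketch of such a loop, using the in-tree temporary-instance variant of the
test and a timeout wrapper so a hang shows up as a failed iteration instead of
blocking forever; the timeout value is arbitrary:
for i in $(seq 1 10); do
    echo "iteration $i"
    timeout 600 make -s check -C contrib/postgres_fdw || break
done
)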

With gdb attached to a hanging backend, I see the following stack trace:
#0  0x7ffb7f70d5e4 in ntdll!ZwWaitForSingleObject () from 
/cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x7ffb7d2e920e in WaitForSingleObjectEx () from 
/cygdrive/c/Windows/System32/KERNELBASE.dll
#2  0x7ffb5ce78862 in fhandler_socket_wsock::evaluate_events (this=0x800126968, event_mask=50, events=@0x7b208: 
0, erase=erase@entry=false)

    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/fhandler/socket_inet.cc:268
#3  0x7ffb5cdef0f5 in peek_socket (me=0xa001a43c0) at 
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:1771
#4  0x7ffb5cdf211e in select_stuff::poll (this=this@entry=0x7b300, readfds=0x7b570, 
readfds@entry=0x8, writefds=0x7b560, writefds@entry=0x7b5c0, exceptfds=0x7b550,
    exceptfds@entry=0x7ffb5cdf2c97 ) at 
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:554
#5  0x7ffb5cdf257e in select (maxfds=maxfds@entry=45, readfds=0x8, writefds=0x7b5c0, 
exceptfds=0x7ffb5cdf2c97 , us=4308570016,

    us@entry=29973000) at 
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:204
#6  0x7ffb5cdf2927 in pselect (maxfds=45, readfds=0x7b570, writefds=0x7b560, exceptfds=0x7b550, 
to=, to@entry=0x7b500, set=, set@entry=0x0)

    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:120
#7  0x7ffb5cdf2c97 in cygwin_select (maxfds=, readfds=, writefds=, 
exceptfds=, to=0x7b5b0)

    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:147
#8  0x7ffb5cddc112 in poll (fds=, nfds=, timeout=) at 
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/poll.cc:83

...
and socket_inet.c:268 ([2]) indeed contains an infinite wait call
(LOCK_EVENTS; / WaitForSingleObject (wsock_mtx, INFINITE)).

So it looks like a Cygwin bug, but maybe something should be done on our side
too, at least to prevent such lorikeet failures.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2024-07-12%2010%3A05%3A27
[2] 
https://www.cygwin.com/cgit/newlib-cygwin/tree/winsup/cygwin/fhandler/socket_inet.cc?h=cygwin-3.5.3

Best regards,
Alexander




Re: Internal error codes triggered by tests

2024-07-14 Thread Alexander Lakhin

Hello Daniel and Michael,

12.07.2024 23:41, Daniel Gustafsson wrote:

On 10 Jul 2024, at 06:42, Michael Paquier  wrote:

The rest mentioned upthread seems either not worth the effort or are likely to
be bugs warranting proper investigation.



I've filed a bug report about the "WITH RECURSIVE" anomaly [1], but what
I wanted to understand when presenting the different error kinds is what
definition XX000 errors could have in principle.

It seems to me that we can't define them as indicators of unexpected (from
the server's POV) conditions, similar to assertion failures (but produced
with no asserts enabled), which should be and mostly get fixed.

If the next thing to do is to get backtrace_on_internal_error back and
that GUC is mainly intended for developers, then maybe having clean (or
containing expected backtraces only) regression test logs is a final goal
and we should stop here. But if it's expected that that GUC could be
helpful for users to analyze such errors in production and thus pay extra
attention to them, maybe having XX000 status for presumably
unreachable conditions only is desirable...

[1] 
https://www.postgresql.org/message-id/18536-0a342ec07901203e%40postgresql.org

Best regards,
Alexander




Re: speed up a logical replica setup

2024-07-11 Thread Alexander Lakhin

Hello Amit and Hou-San,

11.07.2024 13:21, Amit Kapila wrote:


We don't wait for the xmin to catch up corresponding to this insert
and I don't know if there is a way to do that. So, we should move this
Insert to after the call to pg_sync_replication_slots(). It won't
impact the general test of pg_createsubscriber.

Thanks to Hou-San for helping me in the analysis of this BF failure.


Thank you for investigating that issue!

May I ask you to look at another failure of the test that occurred today [1]?
The failure log contains:
recovery_target_lsn = '0/30098D0'
pg_createsubscriber: starting the subscriber
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
2024-07-11 07:40:10.837 UTC [2948531][postmaster][:0] LOG:  received fast 
shutdown request

Though what I'm seeing after a successful run is:
recovery_target_lsn = '0/3009860'
pg_createsubscriber: starting the subscriber
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
2024-07-11 15:19:40.733 UTC [207517] 040_pg_createsubscriber.pl LOG:  
statement: SELECT pg_catalog.pg_is_in_recovery()
2024-07-11 15:19:41.635 UTC [207514] LOG:  recovery stopping after WAL location (LSN) 
"0/3009860"
2024-07-11 15:19:41.635 UTC [207514] LOG:  redo done at 0/3009860 system usage: CPU: user: 0.00 s, system: 0.00 s, 
elapsed: 21.00 s


I've managed to reproduce the failure locally. With the bgwriter mod:
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 150

and wal_debug=on, when running 5 test instances with parallel, I get the
failure with the following log:
recovery_target_lsn = '0/3009A20'
pg_createsubscriber: starting the subscriber

2024-07-11 14:40:04.551 UTC [205589:72][startup][33/0:0] LOG:  REDO @ 0/30099E8; LSN 0/3009A20: prev 0/30099B0; xid 0; 
len 24 - Standby/RUNNING_XACTS: nextXid 747 latestCompletedXid 746 oldestRunningXid 747

# ^^^ the last REDO record in the log
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
pg_createsubscriber: server was stopped
pg_createsubscriber: error: recovery timed out
...
[14:43:06.011](181.800s) not ok 30 - run pg_createsubscriber on node S
[14:43:06.012](0.000s)
[14:43:06.012](0.000s) #   Failed test 'run pg_createsubscriber on node S'
#   at t/040_pg_createsubscriber.pl line 356.

$ pg_waldump -p src/bin/pg_basebackup_1/tmp_check/t_040_pg_createsubscriber_node_s_data/pgdata/pg_wal/ 
00010003 00010003 | tail -2
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099B0, prev 0/03009948, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099E8, prev 0/030099B0, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747


Whilst
$ pg_waldump -p src/bin/pg_basebackup_1/tmp_check/t_040_pg_createsubscriber_node_p_data/pgdata/pg_wal/ 
00010003 00010003 | tail
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099B0, prev 0/03009948, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099E8, prev 0/030099B0, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Heap2   len (rec/tot): 60/    60, tx:    747, lsn: 0/03009A20, prev 0/030099E8, desc: NEW_CID rel: 
1663/16384/6104, tid: 0/1, cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap    len (rec/tot): 54/    54, tx:    747, lsn: 0/03009A60, prev 0/03009A20, desc: DELETE xmax: 
747, off: 1, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/16384/6104 blk 0
rmgr: Transaction len (rec/tot): 78/    78, tx:    747, lsn: 0/03009A98, prev 0/03009A60, desc: INVALIDATION ; 
inval msgs: catcache 49 catcache 46 relcache 0
rmgr: Transaction len (rec/tot): 98/    98, tx:    747, lsn: 0/03009AE8, prev 0/03009A98, desc: COMMIT 
2024-07-11 14:43:05.994561 UTC; relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 49 catcache 46 
relcache 0
rmgr: Heap2   len (rec/tot): 60/    60, tx:    748, lsn: 0/03009B50, prev 0/03009AE8, desc: NEW_CID rel: 
1663/16385/6104, tid: 0/1, cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap    len (rec/tot): 54/    54, tx:    748, lsn: 0/03009B90, prev 0/03009B50, desc: DELETE xmax: 
748, off: 1, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/16385/6104 blk 0
rmgr: Transaction len (rec/tot): 78/    78, tx:    748, lsn: 0/03009BC8, prev 0/03009B90, desc: INVALIDATION ; 
inval msgs: catcache 49 catcache 46 relcache 0
rmgr: Transaction len (rec/tot): 98/    98, tx:    748, lsn: 0/03009C18, prev 0/03009BC8, desc: COMMIT 
2024-07-11 14:43:06.008619

Re: speed up a logical replica setup

2024-07-09 Thread Alexander Lakhin

Hello Amit and Kuroda-san,

03.07.2024 14:02, Amit Kapila wrote:

Pushed 0002 and 0003. Let's wait for a discussion on 0001.


Please look at another failure of the test [1]:
[13:28:05.647](2.460s) not ok 26 - failover slot is synced
[13:28:05.648](0.001s) #   Failed test 'failover slot is synced'
#   at 
/home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl
 line 307.
[13:28:05.648](0.000s) #  got: ''
# expected: 'failover_slot'

with 040_pg_createsubscriber_node_s.log containing:
2024-07-08 13:28:05.369 UTC [3985464][client backend][0/2:0] LOG: statement: 
SELECT pg_sync_replication_slots()
2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] LOG: could not sync slot "failover_slot" as remote slot 
precedes local slot
2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] DETAIL:  Remote slot has LSN 0/30047B8 and catalog xmin 
743, but local slot has LSN 0/30047B8 and catalog xmin 744.


I could not reproduce it locally, but I've discovered that that subtest
somehow depends on pg_createsubscriber executed for the
'primary contains unmet conditions on node P' check. For example with this
test modification:
@@ -249,7 +249,7 @@ command_fails(
 $node_p->connstr($db1), '--socket-directory',
 $node_s->host, '--subscriber-port',
 $node_s->port, '--database',
-    $db1, '--database',
+    'XXX', '--database',
 $db2
 ],
 'primary contains unmet conditions on node P');

I see the same failure:
2024-07-09 10:19:43.284 UTC [938890] 040_pg_createsubscriber.pl LOG:  
statement: SELECT pg_sync_replication_slots()
2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl LOG:  could not sync slot "failover_slot" as remote slot 
precedes local slot
2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl DETAIL:  Remote slot has LSN 0/3004780 and catalog xmin 
743, but local slot has LSN 0/3004780 and catalog xmin 744.


Thus maybe even a normal pg_createsubscriber run can affect the primary
server (its catalog xmin) differently?

One difference I found in the logs is that the skink failure's
regress_log_040_pg_createsubscriber contains:
pg_createsubscriber: error: publisher requires 2 wal sender processes, but only 
1 remain

Though for a successful run I see locally (I can't find logs of
successful test runs on skink):
pg_createsubscriber: error: publisher requires 2 wal sender processes, but only 
0 remain

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-07-08%2013%3A16%3A35

Best regards,
Alexander




Re: ssl tests fail due to TCP port conflict

2024-07-08 Thread Alexander Lakhin

Hello,

07.06.2024 17:25, Tom Lane wrote:

Andrew Dunstan  writes:

I still think my patch to force TCP mode for the SSL test makes sense as
well.

+1 to both things.  If that doesn't get the failure rate down to an
acceptable level, we can look at the retry idea.


I'd like to add that the kerberos/001_auth test also suffers from the port
conflict, but slightly differently. Look for example at [1]:
krb5kdc.log contains:
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](info): setting up 
network...
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Cannot bind server socket 
on 127.0.0.1.55853

Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Failed 
setting up a UDP socket (for 127.0.0.1.55853)
Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address 
already in use - Error setting up network

As far as I can see, the port for kdc is chosen by
PostgreSQL::Test::Kerberos, via
PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP
port availability (with can_bind()), but not for UDP, so this increases
the probability of the conflict for this test (a similar failure: [2]).
Although we can also find a failure with TCP: [3]

(It's not clear to me what processes could be using UDP ports during testing,
but maybe those buildfarm animals are running on the same logical
machine simultaneously?)
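
(A crude way to check whether a port handed out by get_free_port() is really
free for both protocols, assuming iproute2's ss is available on the host:
$ port=55853
$ ss -tuln | grep ":$port " && echo "port $port is already in use (TCP or UDP)"
)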

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-02%2009%3A27%3A15
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-05-15%2001%3A25%3A07
[3] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-07-04%2008%3A28%3A19

Best regards,
Alexander




Re: Internal error codes triggered by tests

2024-07-08 Thread Alexander Lakhin

Hello Michael,

05.07.2024 03:57, Michael Paquier wrote:

On Thu, Jul 04, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote:

Could you please share your thoughts regarding other error cases, which is
not triggered by existing tests, but still can be easily reached by users?

For example:
SELECT satisfies_hash_partition(1, 1, 0, 0);

ERROR:  XX000: could not open relation with OID 1
LOCATION:  relation_open, relation.c:61

or:
CREATE TABLE t (b bytea);
INSERT INTO t SELECT ''::bytea;
CREATE INDEX brinidx ON t USING brin
  (b bytea_bloom_ops(n_distinct_per_range = -1.0));

ERROR:  XX000: the bloom filter is too large (44629 > 8144)
LOCATION:  bloom_init, brin_bloom.c:344

Should such cases be corrected too?

This is a case-by-case.  satisfies_hash_partition() is undocumented,
so doing nothing is fine by me.  The second one, though is something
taht can be triggered with rather normal DDL sequences.  That's more
annoying.


Thank you for the answer!

Let me show you other error types for discussion/classification:
SELECT pg_describe_object(1::regclass, 0, 0);

ERROR:  XX000: unsupported object class: 1
LOCATION:  getObjectDescription, objectaddress.c:4016
or
SELECT pg_identify_object_as_address('1'::regclass, 1, 1);

ERROR:  XX000: unsupported object class: 1
LOCATION:  getObjectTypeDescription, objectaddress.c:4597

--
SELECT format('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET TRANSACTION SNAPSHOT ''%s''', repeat('-', 1000))
\gexec
ERROR:  XX000: could not open file "pg_snapshots/-...---" for reading: File 
name too long
LOCATION:  ImportSnapshot, snapmgr.c:1428

--
CREATE OPERATOR === (leftarg = int4, rightarg = int4, procedure = int4eq,
  commutator = ===, hashes);

CREATE TABLE t1 (a int);
ANALYZE t1;
CREATE TABLE t2 (a int);

SELECT * FROM t1, t2 WHERE t1.a === t2.a;

ERROR:  XX000: could not find hash function for hash operator 16385
LOCATION:  ExecHashTableCreate, nodeHash.c:560

--
WITH RECURSIVE oq(x) AS (
    WITH iq as (SELECT * FROM oq)
    SELECT * FROM iq
    UNION
    SELECT * from iq
)
SELECT * FROM oq;

ERROR:  XX000: missing recursive reference
LOCATION:  checkWellFormedRecursion, parse_cte.c:896
(commented as "should not happen", but still...)

--
CREATE EXTENSION ltree;
SELECT '1' ::ltree @ (repeat('!', 100)) ::ltxtquery;

ERROR:  XX000: stack too short
LOCATION:  makepol, ltxtquery_io.c:252

--
There are also a couple of dubious ereport(DEBUG1,
(errcode(ERRCODE_INTERNAL_ERROR), ...) calls like:
    /*
 * User-defined picksplit failed to create an actual split, ie it put
 * everything on the same side.  Complain but cope.
 */
    ereport(DEBUG1,
    (errcode(ERRCODE_INTERNAL_ERROR),
 errmsg("picksplit method for column %d of index \"%s\" failed",
    attno + 1, RelationGetRelationName(r)),

I'm not mentioning errors that require more analysis and maybe correcting
the surrounding logic, not just the ERRCODE.

Maybe it makes sense to separate the discussion of such errors, which are
not triggered by tests/not covered; I'm just not sure how to handle them
efficiently.

Best regards,
Alexander




Re: Use generation memory context for tuplestore.c

2024-07-05 Thread Alexander Lakhin

05.07.2024 07:57, David Rowley wrote:

Thanks for the report. I've just pushed a fix in 53abb1e0e.


Thank you, David!

Please look at another anomaly introduced with 590b045c3:
echo "
CREATE TABLE t(f int, t int);
INSERT INTO t VALUES (1, 1);

WITH RECURSIVE sg(f, t) AS (
SELECT * FROM t t1
UNION ALL
SELECT t2.* FROM t t2, sg WHERE t2.f = sg.t
) SEARCH DEPTH FIRST BY f, t SET seq
SELECT * FROM sg;
" | timeout 60 psql

triggers
TRAP: failed Assert("chunk->requested_size < oldsize"), File: "generation.c", 
Line: 842, PID: 830294

Best regards,
Alexander




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2024-07-05 Thread Alexander Lakhin

Hello Andres,

12.03.2023 02:41, Andres Freund wrote:

CI now finished the tests as well:
https://cirrus-ci.com/build/6675457702100992

So I'll go ahead and push that.


As I mentioned at [1], `meson test` fails on the Windows x86 platform during
the test pg_amcheck/004_verify_heapam (I'm using VS 2022 Version 17.9.7):
meson setup build --wipe -Dcassert=true
cd build & ninja & meson test

... postgresql:pg_amcheck / pg_amcheck/004_verify_heapam ERROR 
6.95s   exit status 25

004_verify_heapam_test.log contains:
TRAP: failed Assert("FullTransactionIdIsNormal(fxid)"), File: 
"../contrib/amcheck/verify_heapam.c", Line: 1915, PID: 2560
2024-07-04 20:56:54.592 PDT [9780] LOG:  server process (PID 2560) was 
terminated by exception 0xC409
2024-07-04 20:56:54.592 PDT [9780] DETAIL:  Failed process was running: SELECT v.blkno, v.offnum, v.attnum, v.msg FROM 
pg_catalog.pg_class c, "public".verify_heapam(

    relation := c.oid, on_error_stop := false, check_toast := true, skip := 
'none'
    ) v WHERE c.oid = 16438 AND c.relpersistence != 't'

`git bisect` for this anomaly pointed at 4f5d461e0.
(I couldn't compile Postgres on that commit, but with
`git show 53ea2b7ad | git apply` (see also [2]) it's possible.)

The Assert in question is:
    else
    fxid = FullTransactionIdFromU64(nextfxid_i - diff);

    Assert(FullTransactionIdIsNormal(fxid));

It was not clear to me how fxid could turn out to be non-normal until I
looked at the disassembly:
    else
    fxid = FullTransactionIdFromU64(nextfxid_i - diff);
751812D2  sub ebx,eax
751812D4  sbb edi,edx

    Assert(FullTransactionIdIsNormal(fxid));
751812D6  jne FullTransactionIdFromXidAndCtx+0E6h (751812F6h)
751812D8  jb  FullTransactionIdFromXidAndCtx+0CFh (751812DFh)
751812DA  cmp ebx,3
751812DD  jae FullTransactionIdFromXidAndCtx+0E6h (751812F6h)
751812DF  push    77Bh
751812E4  push    offset string "../contrib/amcheck/verify_heapa@"... 
(7518C4A4h)
751812E9  push    offset string "FullTransactionIdIsNormal(fxid)" 
(7518DB04h)
751812EE  call    _ExceptionalCondition (75189FFEh)

The same code fragment for your convenience:
https://ideone.com/8wiGRY

Could you please look at this?

[1] 
https://www.postgresql.org/message-id/72705e42-42d1-ac6e-e7d5-4baec8a0d2af%40gmail.com
[2] https://postgr.es/m/17967-cd21e34a31414...@postgresql.org

Best regards,
Alexander
