Re: WAL Insertion Lock Improvements

2023-04-09 Thread Michael Paquier
On Mon, Feb 20, 2023 at 09:49:48PM -0800, Nathan Bossart wrote:
> I'm marking this as ready-for-committer.  I think a couple of the comments
> could use some small adjustments, but that probably doesn't need to hold up
> this patch.

Apologies.  I was planning to have a thorough look at this patch but
life got in the way and I have not been able to study what's happening
on this thread this close to the feature freeze.

Anyway, I am attaching two modules I have written for the sake of this
thread while beginning my review of the patch:
- lwlock_test.tar.gz, a validation module for LWLocks with variable
waits.  This module can be loaded with shared_preload_libraries to
create two LWLocks and two variables in shmem, then have two backends
play ping-pong with each other's locks.  An isolation test may be
possible, though I have not thought hard about it.  Just use a SQL
sequence like this, with N > 1 (see README):
Backend 1: SELECT lwlock_test_acquire();
Backend 2: SELECT lwlock_test_wait(N);
Backend 1: SELECT lwlock_test_update(N);
Backend 1: SELECT lwlock_test_release();
- custom_wal.tar.gz, thin wrapper for LogLogicalMessage() able to
generate N records of size M bytes in a single SQL call.  This can be
used to generate records of various sizes for benchmarking, limiting
the overhead of individual calls to pg_logical_emit_message_bytea().
I have begun gathering numbers with WAL records of various sizes and
counts, using pgbench like this:
$ cat script.sql
\set record_size 1
\set record_number 5000
SELECT custom_wal(:record_size, :record_number);
$ pgbench -n -c 500 -t 100 -f script.sql
This keeps the overhead of parsing, planning, and most of the INSERT
logic to a minimum.

I have been trying to get some reproducible numbers, but I think that
I am going to need a bigger machine than what I have been using for
the last few days, which went up to 400 connections.  It is worth
noting that 00d1e02b may influence the results a bit, so we may want
more numbers with that in place, particularly for INSERTs; one of the
tests used upthread uses single-row INSERTs.

Another question I had: would it be worth having some tests with
pg_wal/ mounted on a tmpfs so that I/O would not be a bottleneck?  It
should be instructive to get more measurements with a fixed number of
transactions and a rather high number of concurrent connections (1k at
least?), where the contention would be on the variable waits.  My
first impression is that records should not be too small if you want
to see the effects of this patch more clearly, either.

Looking at the patch...  LWLockConflictsWithVar() and
LWLockReleaseClearVar() are the trivial bits.  These are OK.

+* NB: Note the use of pg_atomic_exchange_u64 as opposed to just
+* pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 is
+* a full barrier, we're guaranteed that the subsequent shared memory
+* reads/writes, if any, happen after we reset the lock variable.

This mentions that the subsequent read/write operations are safe, so
this refers to anything happening after the variable is reset.  As
this is a full barrier, should we also mention that it is also ordered
with respect to anything that the caller did before clearing the
variable?  From this perspective, using pg_atomic_exchange_u64() makes
sense to me in LWLockReleaseClearVar().

+* XXX: Use of a spinlock at the beginning of this function to read
+* current insert position implies memory ordering. That means that
+* the immediate loads and stores to shared memory (for instance,
+* in LWLockUpdateVar called via LWLockWaitForVar) don't need an
+* explicit memory barrier as far as the current usage is
+* concerned. But that might not be safe in general.
 */
What's the part where this is not safe?  Based on what I see, this
code path is safe because of the previous spinlock.  This is the same
comment as at the beginning of LWLockConflictsWithVar().  Is that
something that we ought to document at the top of LWLockWaitForVar()
as well?  We have one caller of this function currently, but there may
be more in the future.

- * you're about to write out.
+ * you're about to write out. Using an atomic variable for insertingAt avoids
+ * taking any explicit lock for reads and writes.

Hmm.  Not sure that we need this comment at all.

-LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
+LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
[...]
Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
 
-   /* Update the lock's value */
-   *valptr = val;

The sensitive change is in LWLockUpdateVar().  I am not completely
sure I understand this removal, though.  Does that influence the case
where there are waiters?

Another thing I was wondering about: how much does the fast path used
in LWLockUpdateVar() influence the performance numbers?  Am I right to
guess that it accounts for most of the gain seen?

Re: PGDOCS - function pg_get_publication_tables is not documented?

2023-04-09 Thread Amit Kapila
On Sun, Apr 9, 2023 at 7:35 AM Yu Shi (Fujitsu)  wrote:
>
> On Fri, Mar 24, 2023 6:26 AM Tom Lane  wrote:
> >
> > I do see a docs change that I think would be worth making: get
> > rid of the explicit mention of it in create_subscription.sgml
> > in favor of using that view.
> >
>
> I agree and I tried to modify the query to use the view.
> Please see the attached patch.
>

I am wondering whether we need to take the publication name as input
to find tables that can include non-local origins. I think anyway
users need to separately query publication names to give that input.

-- 
With Regards,
Amit Kapila.




Re: CREATE SUBSCRIPTION -- add missing tab-completes

2023-04-09 Thread Amit Kapila
On Sun, Apr 9, 2023 at 11:33 AM Gregory Stark (as CFM)
 wrote:
>
> On Fri, 7 Apr 2023 at 01:28, Amit Kapila  wrote:
> >
> > On Wed, Apr 5, 2023 at 5:58 AM Peter Smith  wrote:
> >
> > > PSA patches to add those tab completions.
> >
> > LGTM, so pushed.
>
> I moved this to the next CF but actually I just noticed the thread
> starts with the original patch being pushed. Maybe we should just
> close the CF entry? Is this further change relevant?
>

I have closed the CF entry as the patch for which this entry was
created is already committed. If anything comes as a result of further
discussion, I'll take care of it, or if required we can start a
separate thread.

-- 
With Regards,
Amit Kapila.




Re: Direct I/O

2023-04-09 Thread Andres Freund
Hi,

On 2023-04-10 00:17:12 +1200, Thomas Munro wrote:
> I think there are two separate bad phenomena.
> 
> 1.  A concurrent modification of the user space buffer while writing
> breaks the checksum so you can't read the data back in.  I can
> reproduce that with a stand-alone program, attached.  The "verifier"
> process occasionally reports EIO while reading, unless you comment out
> the "scribbler" process's active line.  The system log/dmesg gets some
> warnings.

I think we really need to think about whether we eventually want to do
something to avoid modifying pages while IO is in progress. The only
alternative is for filesystems to make copies of everything in the IO path,
which is far from free (and obviously prevents using DMA for the whole
IO). The copy we do to avoid the same problem when checksums are enabled
shows up quite prominently in write-heavy profiles, so there's a "purely
postgres" reason to avoid these issues too.


> 2.  The crake-style failure doesn't involve any reported checksum
> failures or errors, and I'm not sure if another process is even
> involved.  I attach a complete syscall trace of a repro session.  (I
> tried to get strace to dump 8192 byte strings, but then it doesn't
> repro, so we have only the start of the data transferred for each
> page.)  Working back from the error message,
> 
> ERROR:  invalid page in block 78 of relation base/5/16384,
> 
> we have a page at offset 638976, and we can find all system calls that
> touched that offset:
> 
> [pid 26031] 23:26:48.521123 pwritev(50,
> [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=8192}], 1, 638976) = 8192
> 
> [pid 26040] 23:26:48.568975 pwrite64(5,
> "\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
> \0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192
> 
> [pid 26040] 23:26:48.593157 pread64(6,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 8192, 638976) = 8192
> 
> In between the write of non-zeros and the read of zeros, nothing seems
> to happen that could justify that, as far as I can tell, but perhaps
> someone else will see something that I'm missing.  We pretty much just
> have the parallel worker scanning the table, and writing stuff out as
> it does it.  This was obtained with:

Have you tried to write a reproducer for this that doesn't involve postgres?
It'd certainly be interesting to know the precise conditions for this. E.g.,
can this also happen without O_DIRECT, if cache pressure is high enough for
the page to get evicted soon after (potentially simulated with fadvise or
such)?

We should definitely let the btrfs folks know of this issue... It's even
possible that this bug was recently introduced. What kernel version did you
repro this on, Thomas?

I wonder if we should have a postgres-io-torture program in our tree for some
of these things. We've found issues with our assumptions on several operating
systems and filesystems, without systematically looking, or even stressing IO
all that hard in our tests.

Greetings,

Andres Freund




RE: Fix the description of GUC "max_locks_per_transaction" and "max_pred_locks_per_transaction" in guc_table.c

2023-04-09 Thread Wei Wang (Fujitsu)
On Sat, Apr 8, 2023 at 1:32 AM Tom Lane  wrote:
> "wangw.f...@fujitsu.com"  writes:
> > On Tues, Apr 4, 2023 at 23:48 PM Tom Lane  wrote:
> >> I like the "per eligible process" wording, at least for guc_tables.c;
> >> or maybe it could be "per server process"?  That would be more
> >> accurate and not much longer than what we have now.
> 
> > Thanks both for sharing your opinions.
> > I agree that verbose descriptions make maintenance difficult.
> > For consistency, I unified the formulas in guc_tables.c and pg-doc into the 
> > same
> > suggested short formula. Attach the new patch.
> 
> After studying this for awhile, I decided "server process" is probably
> the better term --- people will have some idea what that means, while
> "eligible process" is not a term we use anywhere else.  Pushed with
> that change and some minor other wordsmithing.

Makes sense to me.
Thanks for pushing.

Regards,
Wang Wei


RE: Partial aggregates pushdown

2023-04-09 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr. Momjian, Mr. Lane, Mr. Freund,

Thank you for the advice.

> From: Bruce Momjian 
> > > > 2. Automation of creating definition of partialaggfuncs In
> > > > development of v17, I manually create definition of
> > > > partialaggfuncs for avg, min, max, sum,
> > > count.
> > > > I am concerned that this may be undesirable.
> > > > So I am thinking that v17 should be modified to automate creating
> > > > definition of partialaggfuncs for all built-in aggregate functions.
> > >
> > > Are there any other builtin functions that need this?  I think we
> > > can just provide documentation for extensions on how to do this.
> > For practical purposes, it is sufficient if partial aggregate for the
> > above functions can be pushed down.
> > I think you are right, it would be sufficient to document how to
> > achieve  partial aggregate pushdown for other built-in functions.
> 
> Uh, we actually want the patch to implement partial aggregate pushdown for all
> builtin data types that can support it.  Is that done?  I think it is only 
> extension
> aggregates, which we do not control, that need this documentation.
The latest version of this patch can't push down partial aggregates for all
builtin aggregate functions that can support it.
I will improve the patch to push down partial aggregates for all builtin
aggregate functions that can support it.

There is one more thing I would like your opinion on.
As the major version of PostgreSQL increases, new builtin aggregate
functions may be added to newer versions of PostgreSQL.
This patch assumes that the aggpartialfns definitions exist in the BKI files.
Due to this assumption, someone has to add the aggpartialfns definitions of
new builtin aggregate functions to the BKI files.
There are two possible ways to address this issue. Would way 1 be sufficient,
or would way 2 be preferable?
  way 1) Adding documentation for how to add these definitions to the BKI files.
  way 2) Improving this patch to automatically add these definitions to the
BKI files with some tool such as initdb.

> From: Bruce Momjian 
> On Fri, Apr  7, 2023 at 09:16:14PM -0700, Andres Freund wrote:
> > On 2023-04-07 22:53:53 -0400, Bruce Momjian wrote:
> > > > postgres_fdw has no business pushing down calls to non-builtin
> > > > functions unless the user has explicitly authorized that via the
> > > > existing whitelisting mechanism.  I think you're reinventing the
> > > > wheel, and not very well.
> > >
> > > The patch has you assign an option at CREATE AGGREGATE time if it
> > > can do push down, and postgres_fdw checks that.  What whitelisting
> > > mechanism are you talking about?  async_capable?
> >
> > extensions (string)
> >
> > This option is a comma-separated list of names of PostgreSQL
> extensions that are installed, in compatible versions, on both the local and
> remote servers. Functions and operators that are immutable and belong to a
> listed extension will be considered shippable to the remote server. This 
> option
> can only be specified for foreign servers, not per-table.
> >
> > When using the extensions option, it is the user's responsibility that 
> > the
> listed extensions exist and behave identically on both the local and remote
> servers. Otherwise, remote queries may fail or behave unexpectedly.
> 
> Okay, this is very helpful --- it is exactly the issue we are dealing with 
> --- how
> can we know if partial aggregate functions exists on the remote server.  (I
> knew I was going to need API help on this.)
> 
> So, let's remove the PARTIALAGG_MINVERSION option from the patch and just
> make it automatic --- we push down builtin partial aggregates if the remote
> server is the same or newer _major_ version than the sending server.  For
> extensions, if people have older extensions on the same or newer foreign
> servers, they can adjust 'extensions' above.
Okay, I understand. I will remove the PARTIALAGG_MINVERSION option from the
patch and add a check of whether aggpartialfn depends on some extension
contained in the extensions list of the postgres_fdw foreign server.
In the next version of this patch,
we can push down partial aggregates for a user-defined aggregate function
only when the function passes this check.

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center, Mitsubishi Electric Corporation




Re: Commitfest 2023-03 starting tomorrow!

2023-04-09 Thread Tom Lane
Greg Stark  writes:
> Not added:
> * Fix improper qual pushdown after applying outer join identity 3

I already made an open item for that one.

regards, tom lane




DecodeInterval fixes

2023-04-09 Thread Joseph Koshakow
Hi all,

This patch does three things in the DecodeInterval function:

1) Removes dead code for handling unit type RESERVE. There used to be
a unit called "invalid" that was of type RESERVE. At some point that
unit was removed and there were no more units of type RESERVE.
Therefore, the code for RESERVE unit handling is unreachable.

2) Restrict the unit "ago" to only appear at the end of the
interval. According to the docs [0], this is the only valid place to
put it, but we allowed it multiple times at any point in the input.

3) Error when the user has multiple consecutive units or a unit without
an accompanying value. I spent a lot of time trying to come up with
robust ways to detect this and ultimately settled on using the "type"
field. I'm not entirely happy with this approach, because it involves
having to look ahead to the next field in a couple of places. The other
approach I was considering was to introduce a boolean flag called
"unhandled_unit". After parsing a unit it would be set to true, after
applying the unit to a number it would be set to false. If it was true
right before parsing a unit, then we would error. Please let me know
if you have any suggestions here.

There is one more problem I noticed, but didn't fix. We allow multiple
"@" to be sprinkled anywhere in the input, even though the docs [0]
only allow it to appear at the beginning of the input. For example,
the following query works fine:

# SELECT INTERVAL '1 @ year @ @ @ 6 days @ 10 @ minutes';
        interval
-------------------------
 1 year 6 days 00:10:00
(1 row)

Unfortunately, this can't be fixed in DecodeInterval, because all of
the "@" fields are filtered out before this function is called.
Additionally, I believe this means that the lines

 if (type == IGNORE_DTF)
 continue;

in DecodeInterval, which appear right after decoding the units, are
unreachable, since "@" is the only unit of type IGNORE_DTF.  Since "@"
is filtered out, we'll never decode a unit of type IGNORE_DTF.

For reference, I previously merged a couple similar patches to this
one, but for other date-time types [1], [2].

Thanks,
Joe Koshakow

[0]
https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b3c5953553bb9fb0b171abc6041e7c7e9ca5b4d
[2]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=bcc704b52490492e6bd73c056b3e9644504d
From 4c5641f15e5409ef5973a5f305352018f06da538 Mon Sep 17 00:00:00 2001
From: Joseph Koshakow 
Date: Sun, 9 Apr 2023 20:37:27 -0400
Subject: [PATCH] Fix interval decode handling of invalid intervals

This patch does three things in the DecodeInterval function:

1) Removes dead code for handling unit type RESERVE. There used to be
a unit called "invalid" that was of type RESERVE. At some point that
unit was removed and there were no more units of type RESERVE.
Therefore, the code for RESERVE unit handling is unreachable.

2) Restrict the unit "ago" to only appear at the end of the
interval. According to the docs [0], this is the only valid place to
put it, but we allowed it multiple times at any point in the input.

3) Error when the user has multiple consecutive units or a unit without
an accompanying value.

[0] https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
---
 src/backend/utils/adt/datetime.c   | 55 +++---
 src/test/regress/expected/interval.out | 18 +
 src/test/regress/sql/interval.sql  |  7 
 3 files changed, 65 insertions(+), 15 deletions(-)

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index be2e55bb29..17fc0d45ea 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -3335,7 +3335,8 @@ DecodeInterval(char **field, int *ftype, int nf, int range,
 if (force_negative &&
 	itm_in->tm_usec > 0)
 	itm_in->tm_usec = -itm_in->tm_usec;
-type = DTK_DAY;
+if (i != 0 && ftype[i - 1] != DTK_STRING && ftype[i - 1] != DTK_SPECIAL)
+	type = DTK_DAY;
 break;
 
 			case DTK_TZ:
@@ -3372,7 +3373,8 @@ DecodeInterval(char **field, int *ftype, int nf, int range,
 	 * specified. This handles the case of '1 +02:03' since we
 	 * are reading right to left.
 	 */
-	type = DTK_DAY;
+	if (i != 0 && ftype[i - 1] != DTK_STRING && ftype[i - 1] != DTK_SPECIAL)
+		type = DTK_DAY;
 	break;
 }
 
@@ -3475,12 +3477,14 @@ DecodeInterval(char **field, int *ftype, int nf, int range,
 		if (!AdjustMicroseconds(val, fval, 1, itm_in))
 			return DTERR_FIELD_OVERFLOW;
 		tmask = DTK_M(MICROSECOND);
+		type = IGNORE_DTF;
 		break;
 
 	case DTK_MILLISEC:
 		if (!AdjustMicroseconds(val, fval, 1000, itm_in))
 			return DTERR_FIELD_OVERFLOW;
 		tmask = DTK_M(MILLISECOND);
+		type = IGNORE_DTF;
 		break;
 
 	case DTK_SECOND:
@@ -3495,19 +3499,24 @@ DecodeInterval(char **field, int 

Re: Commitfest 2023-03 starting tomorrow!

2023-04-09 Thread Greg Stark
So here's the list of CF entries that I thought *might* still get
committed either because they're an Open Issue or they're one of those
other categories. I had intended to go through and add them all to the
Open Issues but it turns out I only feel confident in a couple of them
qualifying for that:

Already added:
* Default to ICU during initdb
* Assertion failure with parallel full hash join
* RecoveryConflictInterrupt() is unsafe in a signal handler
* pg_visibility's pg_check_visible() yields false positive when
working in parallel with autovacuum

Not added:
* Fix bogus error emitted by pg_recvlogical when interrupted
* clean up permission checks after 599b33b94
* Fix assertion failure with next_phase_at in snapbuild.c
* Fix assertion failure in SnapBuildInitialSnapshot()
* Fix improper qual pushdown after applying outer join identity 3
* Add document is_superuser
* Improve doc for autovacuum on partitioned tables
* Create visible links for HTML elements that have an id to make them
discoverable via the web interface
* Fix inconsistency in reporting checkpointer stats
* pg_rewind WAL deletion pitfall
* Update relfrozenxmin when truncating temp tables
* Testing autovacuum wraparound
* Add TAP tests for psql \g piped into program

I'll move these CF entries to the next CF now. I think they all are
arguably open issues though of varying severity.

-- 
greg




Re: Show various offset arrays for heap WAL records

2023-04-09 Thread Peter Geoghegan
On Fri, Apr 7, 2023 at 4:46 PM Peter Geoghegan  wrote:
> Pushed that one too.

I noticed that the nbtree VACUUM and DELETE record types have their
update/xl_btree_update arrays output incorrectly. We cannot use the
generic array_desc() approach with xl_btree_update elements, because
they're variable-width elements. The problem is that array_desc() only deals
with fixed-width elements.

I also changed some of the details around whitespace in arrays in the
fixup patch (though I didn't do the same with objects). It doesn't
seem useful to use so much whitespace for long arrays of integers
(really page offset numbers). And I brought a few nbtree desc routines
that still used ";" characters as punctuation in line with the new
convention.

Finally, the patch revises the guidelines written for rmgr desc
routine authors. I don't think that we need to describe how to handle
outputting whitespace in detail. It'll be quite natural for other
rmgrs to use existing facilities such as array_desc() themselves,
which makes whitespace type inconsistencies unlikely. I've tried to
make the limits of the guidelines clear. The main goal is to avoid
gratuitous inconsistencies, and to provide a standard way of doing
things that many different rmgrs are likely to want to do, again and
again. But individual rmgrs still have a certain amount of discretion,
which seems like a good thing to me (the alternative requires that we
fix at least a couple of things in nbtdesc.c and in heapdesc.c, which
doesn't seem useful to me).

--
Peter Geoghegan
From 44bfa575f5f28d1a8093e02b4a7993c1368cbeb5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sat, 8 Apr 2023 18:59:09 -0700
Subject: [PATCH v1] Fix WAL description of posting list updates.

---
 src/include/access/rmgrdesc_utils.h  |  41 ++-
 src/backend/access/rmgrdesc/heapdesc.c   |   2 +-
 src/backend/access/rmgrdesc/nbtdesc.c| 121 ++-
 src/backend/access/rmgrdesc/rmgrdesc_utils.c |  43 +--
 doc/src/sgml/pgwalinspect.sgml   |  34 +++---
 5 files changed, 123 insertions(+), 118 deletions(-)

diff --git a/src/include/access/rmgrdesc_utils.h b/src/include/access/rmgrdesc_utils.h
index 13552d6f3..16916e3e8 100644
--- a/src/include/access/rmgrdesc_utils.h
+++ b/src/include/access/rmgrdesc_utils.h
@@ -12,12 +12,49 @@
 #ifndef RMGRDESC_UTILS_H_
 #define RMGRDESC_UTILS_H_
 
+/*
+ * Guidelines for rmgr descriptor routine authors:
+ *
+ * The goal of these guidelines is to avoid gratuitous inconsistencies across
+ * each rmgr, and to allow users to parse desc output strings without too much
+ * difficulty.  This is not an API specification or an interchange format.
+ * (Only heapam and nbtree desc routines follow these guidelines at present,
+ * in any case.)
+ *
+ * Record descriptions are similar to JSON style key/value objects.  However,
+ * there is no explicit "string" type/string escaping.  Top-level { } brackets
+ * should be omitted.  For example:
+ *
+ * snapshotConflictHorizon: 0, flags: 0x03
+ *
+ * Record descriptions may contain variable-length arrays.  For example:
+ *
+ * nunused: 5, unused: [1, 2, 3, 4, 5]
+ *
+ * Nested objects are supported via { } brackets.  They generally appear
+ * inside variable-length arrays.  For example:
+ *
+ * ndeleted: 0, nupdated: 1, deleted: [], updated: [{ off: 45, nptids: 1, ptids: [0] }]
+ *
+ * Try to output things in an order that faithfully represents the order of
+ * things in the physical WAL record struct.  It's a good idea if the number
+ * of items in the array appears before the array.
+ *
+ * It's okay for individual WAL record types to invent their own conventions.
+ * For example, heapam's PRUNE records output the following representation of
+ * redirects:
+ *
+ * ... redirected: [39->46], ...
+ *
+ * Arguably the PRUNE desc routine should be using object notation instead.
+ * This ad-hoc representation of redirects has the advantage of being terse in
+ * a context where that might matter a lot.
+ */
 extern void array_desc(StringInfo buf, void *array, size_t elem_size, int count,
 	   void (*elem_desc) (StringInfo buf, void *elem, void *data),
 	   void *data);
 extern void offset_elem_desc(StringInfo buf, void *offset, void *data);
 extern void redirect_elem_desc(StringInfo buf, void *offset, void *data);
-extern void relid_desc(StringInfo buf, void *relid, void *data);
-extern void uint16_elem_desc(StringInfo buf, void *value, void *data);
+extern void oid_elem_desc(StringInfo buf, void *relid, void *data);
 
 #endif			/* RMGRDESC_UTILS_H */
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 6dfd7d7d1..3bd083875 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -127,7 +127,7 @@ heap_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, ", nrelids: %u", xlrec->nrelids);
 		appendStringInfoString(buf, ", relids:");
 		array_desc(buf, xlrec->relids, 

Re: Commitfest 2023-03 starting tomorrow!

2023-04-09 Thread Michael Paquier
On Sat, Apr 08, 2023 at 09:50:49PM -0400, Tom Lane wrote:
> I think that's largely independent.  We don't look back at closed-out CFs
> as a kind of TODO list; anything that's left behind there is basically
> never going to be seen again, until the author makes a new CF entry.
> 
> Anything that's to be treated as an open item for v16 should get added
> to the wiki page at
> 
> https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items
> 
> It's not necessary that a patch exist to do that.

And patches that are marked as bug fixes in the CF app had better not
be lost in the long run, even if they are not listed as a live open
item.  Sometimes people apply "Bug Fix" as a category, which would
imply a backpatch in most cases, but looking at the patch can prove
that wrong.
--
Michael




Re: Direct I/O

2023-04-09 Thread Thomas Munro
On Mon, Apr 10, 2023 at 8:43 AM Tom Lane  wrote:
> Boy, it's hard to look at that trace and not call it a filesystem bug.

Agreed.

> Given the apparent dependency on COW, I wonder if this has something
> to do with getting confused about which copy is current?

Yeah, I suppose it would require bogus old page versions (or I guess
alternatively completely mixed up page offsets) rather than bogus
zeroed pages to explain the too-high count observed in one of crake's
failed runs: I guess it counted some pre-updated tuples that were
supposed to be deleted and then counted the post-updated tuples on
later pages (insert joke about the Easter variant of the Halloween
problem).  It's just that in the runs I've managed to observe and
analyse, the previous version always happened to be zeros.




Re: Commitfest 2023-03 starting tomorrow!

2023-04-09 Thread Michael Paquier
On Fri, Apr 07, 2023 at 10:40:01PM -0400, Kirk Wolak wrote:
>   I got no response to my point that the backquote solution is cumbersome
> because I have to use psql in both Windows
> and Linux environments (realizing I am the odd duck in this group).
> But my fall back was a common script file.  Then I shared my
> psqlrc file with a co-worker, and they ran into the missing script file.
> [ie, the same command does not work in both systems].
> 
>   I won't argue beyond this point, I'd just like to hear that you
> considered this final point...
> and I can move on.

FYI, this specific patch has been moved to the next commit fest of
2023-07:
https://commitfest.postgresql.org/43/4227/

This implies that this discussion will be considered for the
development cycle of v17, planned to begin in July.
--
Michael




Re: Direct I/O

2023-04-09 Thread Noah Misch
On Sun, Apr 09, 2023 at 02:45:16PM -0700, Andres Freund wrote:
> On 2023-04-08 21:29:54 -0700, Noah Misch wrote:
> > On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:
> > > On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
> > > > If you look at log_newpage_range(), it's not surprising that we get 
> > > > this error
> > > > - it pins up to 32 buffers at once.
> > > > 
> > > > Afaics log_newpage_range() originates in 9155580fd5fc, but this caller 
> > > > is from
> > > > c6b92041d385.
> > 
> > > > Do we care about fixing this in the backbranches? Probably not, given 
> > > > there
> > > > haven't been user complaints?
> > 
> > I would not.  This is only going to come up where the user goes out of the 
> > way
> > to use near-minimum shared_buffers.
> 
> It's not *just* that scenario. With a few concurrent connections you can get
> into problematic territory even with halfway reasonable shared buffers.

I am not familiar with such cases.  You could get there with 64MB shared
buffers and 256 simultaneous commits of new-relfilenode-creating transactions,
but I'd still file that under going out of one's way to use tiny shared
buffers relative to the write activity.  What combination did you envision?




Re: Parallel Full Hash Join

2023-04-09 Thread Michael Paquier
On Sat, Apr 08, 2023 at 02:19:54PM -0400, Melanie Plageman wrote:
> Another worker attached to the batch barrier, saw that it was in
> PHJ_BATCH_SCAN, marked it done and detached. This is fine.
> BarrierArriveAndDetachExceptLast() is meant to ensure no one waits
> (deadlock hazard) and that at least one worker stays to do the unmatched
> scan. It doesn't hurt anything for another worker to join and find out
> there is no work to do.
> 
> We should simply delete this assertion.

I have added an open item about that.  This had better be tracked.
--
Michael




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2023-04-09 Thread Michael Paquier
On Sat, Apr 08, 2023 at 04:24:35PM +0200, Matthias van de Meent wrote:
> Thanks a lot! I'll post the separation of record construction and
> write-out to xlog in a future thread for 17.

Thanks!  Creating a new thread makes sense.

> One remaining question: Considering that the changes and checks of
> that commit are mostly internal to xloginsert.c (or xlog.c in older
> releases), and that no special public-facing changes were made, would
> it be safe to backport this to older releases?

The routine changes done in ffd1b6b cannot be backpatched on ABI
grounds; still, you propose adding protection around needs_data as
well as around the whole record length.

> PostgreSQL 15 specifically would benefit from this as it supports
> external rmgrs which may generate WAL records and would benefit from
> these additional checks, but all supported releases of PostgreSQL have
> pg_logical_emit_message and are thus easily subject to the issue of
> writing oversized WAL records and subsequent recovery- and replication
> stream failures.

Custom RMGRs are a good argument, though I don't really see an urgent
case for doing something in REL_15_STABLE.  For one, it would
mean more backpatching conflicts with ~14.  Another argument is that
XLogRecordMaxSize is not an exact science, either.  In ~15, a record
with a total size between XLogRecordMaxSize and
DecodeXLogRecordRequiredSpace(MaxAllocSize) would work, though it
would not in 16~ because we have the 4MB margin given as room for the
per-record allocation in the XLogReader.  A record of such a size
would not be generated anymore after a minor release update of 15.3~
if we were to do something about that by May on REL_15_STABLE.
--
Michael


signature.asc
Description: PGP signature


Re: Add index scan progress to pg_stat_progress_vacuum

2023-04-09 Thread Michael Paquier
On Fri, Apr 07, 2023 at 12:01:17PM -0700, Andres Freund wrote:
> Why would it mean that? Parallel workers are updated together with the leader,
> so there's no compatibility issue?

My point is that the callback system would still need to be maintained
in a stable branch, and, while useful, it could be used for much more
than what it was originally written for.  I guess that it could be
used in custom nodes with their own custom parallel nodes.
--
Michael


signature.asc
Description: PGP signature


Re: Add index scan progress to pg_stat_progress_vacuum

2023-04-09 Thread Michael Paquier
On Fri, Apr 07, 2023 at 07:27:17PM +, Imseih (AWS), Sami wrote:
> If called by a worker, it will send a 'P' message to the front end,
> passing both the progress index, i.e. PROGRESS_VACUUM_INDEXES_PROCESSED,
> and the value to increment by, i.e. 1 for index vacuum progress.
> 
> With that, the additional shared memory counters in ParallelVacuumState
> are not needed, and the poke of the worker to the leader goes directly
> through a generic backend_progress API.

Thanks for the new version.  This has unfortunately not been able to
make the cut for v16, but let's see it done at the beginning of the
v17 cycle.

+void
+pgstat_progress_parallel_incr_param(int index, int64 incr)
+{
+   /*
+* Parallel workers notify a leader through a 'P'
+* protocol message to update progress, passing the
+* progress index and increment value. Leaders can
+* just call pgstat_progress_incr_param directly.
+*/
+   if (IsParallelWorker())
+   {
+   static StringInfoData progress_message;
+
+   initStringInfo(&progress_message);
+
+   pq_beginmessage(&progress_message, 'P');
+   pq_sendint32(&progress_message, index);
+   pq_sendint64(&progress_message, incr);
+   pq_endmessage(&progress_message);
+   }
+   else
+   pgstat_progress_incr_param(index, incr);
+}

I see.  You need to handle both the leader and worker case because
parallel_vacuum_process_one_index() can be called by either of them.

+   case 'P':   /* Parallel progress reporting */

Perhaps this comment should say that this is only about incremental
progress reporting, for the moment.

+* Increase and report the number of index scans. Also, we reset the
+* progress counters.

The counters being reset are the two index counts; perhaps the comment
should mention that fact.

+   /* update progress */
+   int index = pq_getmsgint(msg, 4);
+   int incr = pq_getmsgint(msg, 1);
[...]
+   pq_beginmessage(&progress_message, 'P');
+   pq_sendint32(&progress_message, index);
+   pq_sendint64(&progress_message, incr);
+   pq_endmessage(&progress_message);

It seems to me that the receiver side is missing one pq_getmsgend()?
incr is defined and sent as an int64 on the sender side, hence the
receiver should use pq_getmsgint64(), no?  pq_getmsgint(msg, 1) means
to receive only one byte, see pqformat.c.  And the order is reversed?

There may be a case in the future about making 'P' more complicated
with more arguments, but what you have here should be sufficient for
your use-case.  Were there plans to increment more data for some
different and/or new progress indexes in the VACUUM path, by the way?
Most of that looked a bit tricky to me as this was AM-dependent, but I
may have missed something.
--
Michael


signature.asc
Description: PGP signature


Re: Direct I/O

2023-04-09 Thread Andres Freund
Hi,

On 2023-04-08 21:29:54 -0700, Noah Misch wrote:
> On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:
> > On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
> > > There were some failures in CI (e.g. [1] (and perhaps also bf, didn't yet
> > > check), about "no unpinned buffers available".  I was worried for a moment
> > > that this could actually be relation to the bulk extension patch.
> > > 
> > > But it looks like it's older - and not caused by direct_io support
> > > (except by way of the test existing). I reproduced the issue locally
> > > by setting s_b even lower, to 16 and made the ERROR a PANIC.
> > >
> > > [backtrace]
> 
> I get an ERROR, not a PANIC:

What I meant is that I changed the code to use PANIC, to make it easier to get
a backtrace.


> > > If you look at log_newpage_range(), it's not surprising that we get this 
> > > error
> > > - it pins up to 32 buffers at once.
> > > 
> > > Afaics log_newpage_range() originates in 9155580fd5fc, but this
> > > caller is from c6b92041d385.
> 
> > > Do we care about fixing this in the backbranches? Probably not,
> > > given there haven't been user complaints?
> 
> I would not.  This is only going to come up where the user goes out of the way
> to use near-minimum shared_buffers.

It's not *just* that scenario. With a few concurrent connections you can get
into problematic territory even with halfway reasonable shared buffers.


> > Here's a quick prototype of this approach.
> 
> This looks fine.  I'm not enthusiastic about incurring post-startup cycles to
> cater to allocating less than 512k*max_connections of shared buffers, but I
> expect the cycles in question are negligible here.

Yea, I can't imagine it'd matter, compared to the other costs. Arguably it'd
allow us to crank up the maximum batch size further, even.

Greetings,

Andres Freund




Re: differential code coverage

2023-04-09 Thread Andres Freund
Hi,

On 2023-04-04 09:03:45 -0700, Andres Freund wrote:
> For quite a while I'd been wishing to see *differential* code coverage, to see
> what changed in code coverage between two major releases. Unfortunately lcov
> didn't provide that. A few months ago a PR for it has been merged into lcov
> ([1]). There hasn't yet been a release though. And the feature definitely has
> some rough edges, still.
> 
> I'm planning to generate the 15->16 differential code coverage, once the
> feature freeze has been reached.
> 
> Another nice thing provided by the in-development lcov is hierarchical
> output. I find that to be much easier to navigate than the current flat
> output, as e.g. used on coverage.postgresql.org.

I've generated the output for 15 vs HEAD, now that we're past feature freeze.

Simpler version:
https://anarazel.de/postgres/cov/15-vs-HEAD-01/

Version that tries to date the source lines using information from git:
https://anarazel.de/postgres/cov/15-vs-HEAD-02/

The various abbreviations are explained (somewhat) if you hover over them.


There are a few interesting bits of new code not covered by tests that
should be easy to cover:

https://anarazel.de/postgres/cov/15-vs-HEAD-01/src/bin/pg_dump/pg_dumpall.c.gcov.html#1014

https://anarazel.de/postgres/cov/15-vs-HEAD-01/src/backend/utils/adt/array_userfuncs.c.gcov.html#675
https://anarazel.de/postgres/cov/15-vs-HEAD-01/src/backend/utils/adt/array_userfuncs.c.gcov.html#921

https://anarazel.de/postgres/cov/15-vs-HEAD-01/src/backend/utils/adt/pg_locale.c.gcov.html#1964
https://anarazel.de/postgres/cov/15-vs-HEAD-01/src/backend/utils/adt/pg_locale.c.gcov.html#2219

https://anarazel.de/postgres/cov/15-vs-HEAD-01/src/backend/parser/parse_expr.c.gcov.html#3108
https://anarazel.de/postgres/cov/15-vs-HEAD-01/src/backend/nodes/nodeFuncs.c.gcov.html#1480
https://anarazel.de/postgres/cov/15-vs-HEAD-01/src/backend/nodes/nodeFuncs.c.gcov.html#1652

https://anarazel.de/postgres/cov/15-vs-HEAD-01/src/backend/optimizer/util/plancat.c.gcov.html#348

https://anarazel.de/postgres/cov/15-vs-HEAD-01/contrib/postgres_fdw/connection.c.gcov.html#1316
https://anarazel.de/postgres/cov/15-vs-HEAD-01/contrib/postgres_fdw/deparse.c.gcov.html#399

Greetings,

Andres Freund




Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-04-09 Thread Sandro Santilli
On Mon, Apr 03, 2023 at 09:26:25AM +0700, Yurii Rashkovskii wrote:
> I want to chime in on the issue of lower-number releases that are released
> after higher-number releases. The way I see this particular problem is that
> we always put upgrade SQL files in release "packages," and they obviously
> become static resources.
> 
> While I [intentionally] overlook some details here, what if (as a
> convention, for projects where it matters) we shipped extensions with
> non-upgrade SQL files only, and upgrades were available as separate
> downloads? This way, we're not tying releases themselves to upgrade paths.
> This also requires no changes to Postgres.

This is actually something that's on the plate, and we recently
added a --disable-extension-upgrades-install configure switch
and an `install-extension-upgrades-from-known-versions` make target
in PostGIS to help move in that direction.  I guess the ball would
now be in the hands of packagers.

> I know this may be a big delivery layout departure for well-established
> projects; I also understand that this won't solve the problem of having to
> have these files in the first place (though in many cases, they can be
> automatically generated once, I suppose, if they are trivial).

We will now also be providing a `postgis` script for administration
that among other things will support a `install-extension-upgrades`
command to install upgrade paths from specific old versions, but will
not hard-code any list of "known" old versions as such a list will
easily become outdated.

Note that all upgrade scripts installed by the Makefile target or by
the `postgis` script will only be empty upgrade paths from the source
version to the fake "ANY" version, as the ANY--<target version> upgrade
path will still be the "catch-all" upgrade script.

--strk;

> On Tue, Jan 10, 2023 at 5:52 AM Tom Lane  wrote:
> 
> > I continue to think that this is a fundamentally bad idea.  It creates
> > all sorts of uncertainties about what is a valid update path and what
> > is not.  Restrictions like
> >
> > + Such wildcard update
> > + scripts will only be used when no explicit path is found from
> > + old to target version.
> >
> > are just band-aids to try to cover up the worst problems.
> >
> > Have you considered the idea of instead inventing a "\include" facility
> > for extension scripts?  Then, if you want to use one-monster-script
> > to handle different upgrade cases, you still need one script file for
> > each supported upgrade step, but those can be one-liners including the
> > common script file.  Plus, such a facility could be of use to people
> > who want intermediate factorization solutions (that is, some sharing
> > of code without buying all the way into one-monster-script).
> >
> > regards, tom lane
> >
> >
> >
> 
> -- 
> Y.




Re: Direct I/O

2023-04-09 Thread Tom Lane
Thomas Munro  writes:
> we have a page at offset 638976, and we can find all system calls that
> touched that offset:

> [pid 26031] 23:26:48.521123 pwritev(50,
> [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=8192}], 1, 638976) = 8192

> [pid 26040] 23:26:48.568975 pwrite64(5,
> "\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
> \0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192

> [pid 26040] 23:26:48.593157 pread64(6,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 8192, 638976) = 8192

Boy, it's hard to look at that trace and not call it a filesystem bug.
Given the apparent dependency on COW, I wonder if this has something
to do with getting confused about which copy is current?

Another thing that struck me is that the two calls from pid 26040
are issued on different FDs.  I checked the strace log and verified
that these do both refer to "base/5/16384".  It looks like there was
a cache flush at about 23:26:48.575023 that caused 26040 to close
and later reopen all its database relation FDs.  Maybe that is
somehow contributing to the filesystem's confusion?  And more to the
point, could that explain why other O_DIRECT users aren't up in arms
over this bug?  Maybe they don't switch FDs as readily as we do.

regards, tom lane




Re: cataloguing NOT NULL constraints

2023-04-09 Thread Tom Lane
Alvaro Herrera  writes:
>> I'm inclined to think that this idea of suppressing the implied
>> NOT NULL from PRIMARY KEY is a nonstarter and we should just
>> go ahead and make such a constraint.  Another idea could be for
>> pg_dump to emit the NOT NULL, load data, do the ALTER ADD PRIMARY
>> KEY, and then ALTER DROP NOT NULL.

> I like that second idea, yeah.  It might be tough to make it work, but
> I'll try.

Yeah, I've been thinking more about it, and this might also yield a
workable solution for the TestUpgrade breakage.  The idea would be,
roughly, for pg_dump to emit NOT NULL column decoration in all the
same places it does now, and then to drop it again immediately after
doing ADD PRIMARY KEY if it judges that there was no other reason
to have it.  This gets rid of the inconsistency for --binary-upgrade
which I think is what is causing the breakage.

I also ran into something else I didn't much care for:

regression=# create table foo(f1 int primary key, f2 int);
CREATE TABLE
regression=# create table foochild() inherits(foo);
CREATE TABLE
regression=# alter table only foo alter column f2 set not null;
ERROR:  cannot add constraint only to table with inheritance children
HINT:  Do not specify the ONLY keyword.

Previous versions accepted this case, and I don't really see why
we can't do so with this new implementation -- isn't this exactly
what pg_constraint.connoinherit was invented to represent?  Moreover,
existing pg_dump files can contain precisely this construct, so
blowing it off isn't going to be zero-cost.

regards, tom lane




Re: cataloguing NOT NULL constraints

2023-04-09 Thread Alvaro Herrera
On 2023-Apr-09, Tom Lane wrote:

> In the new dispensation, pg_dump omits the NOT NULL clauses.
> Great, you say, that makes the output more like what the user wrote.
> I'm not so sure.  This means that the ALTER TABLE will be compelled
> to perform a full-table scan to verify that there are no nulls in the
> already-loaded data before it can add the missing NOT NULL constraint.

Yeah, I agree that this unintended consequence isn't very palatable.  I
think the other pg_upgrade problem is easily fixed (haven't tried yet),
but having to rethink the pg_dump representation would likely take
longer than we'd like.

> I'm inclined to think that this idea of suppressing the implied
> NOT NULL from PRIMARY KEY is a nonstarter and we should just
> go ahead and make such a constraint.  Another idea could be for
> pg_dump to emit the NOT NULL, load data, do the ALTER ADD PRIMARY
> KEY, and then ALTER DROP NOT NULL.

I like that second idea, yeah.  It might be tough to make it work, but
I'll try.

> In any case, I wonder whether that's the sort of redesign we should
> be doing post-feature-freeze.  It might be best to revert and try
> again in v17.

Yeah, sounds like reverting for now and retrying in v17 with the
discussed changes might be better.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La espina, desde que nace, ya pincha" (Proverbio africano)




Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-04-09 Thread Alvaro Herrera
On 2023-Apr-09, Noah Misch wrote:

> On Thu, Mar 30, 2023 at 01:27:05AM -0500, Karl O. Pinc wrote:

> > Your point seems valid but this is above my station.
> > I have no idea as to how to best resolve this, or even how to make the
> > resolution happen now that the change has been committed.
> 
> I'm inclined to just remove .  While intagg is
> indeed obsolete, having a one-entry list seems like undue weight.

I agree, let's just remove that.  The list of trusted modules is clearly
useful, but the list of obsoletes one isn't terribly interesting.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)




Re: Doc: Rework contrib appendix -- informative titles, tweaked sentences

2023-04-09 Thread Noah Misch
On Thu, Mar 30, 2023 at 01:27:05AM -0500, Karl O. Pinc wrote:
> On Wed, 29 Mar 2023 21:32:05 -0700
> Noah Misch  wrote:
> 
> > On Sun, Jan 22, 2023 at 02:42:46PM -0600, Karl O. Pinc wrote:
> > > v10-0001-List-trusted-and-obsolete-extensions.patch  
> > 
> > > + 
> > > +   These modules and extensions are obsolete:
> > > +
> > > +   
> > > + 
> > > + 
> > > +   
> > > +   
> > 
> > Commit a013738 incorporated this change.  Since xml2 is the only
> > in-tree way to use XSLT from SQL, I think xml2 is not obsolete.  Some
> > individual functions, e.g. xml_valid(), are obsolete.  (There are
> > years-old threats to render the module obsolete, but this has never
> > happened.)
> 
> Your point seems valid but this is above my station.
> I have no idea as to how to best resolve this, or even how to make the
> resolution happen now that the change has been committed.

I'm inclined to just remove .  While intagg is
indeed obsolete, having a one-entry list seems like undue weight.




Re: Direct I/O

2023-04-09 Thread Andrew Dunstan


On 2023-04-09 Su 09:14, Andrew Dunstan wrote:



On 2023-04-09 Su 08:39, Thomas Munro wrote:

On Sun, Apr 9, 2023 at 11:25 PM Andrew Dunstan  wrote:

Didn't seem to make any difference.

Thanks for testing.  I think it's COW (and I think that implies also
checksums?) that needs to be turned off, at least based on experiments
here.




Googling agrees with you about checksums.  But I'm still wondering if 
we shouldn't disable COW for the build directory etc. It is suggested 
at [1]:



Recommend to set nodatacow – turn cow off – for the files that
require fast IO and tend to get very big and get alot of random
writes: such VMDK (vm disks) files and the like.


I'll give it a whirl.




with COW disabled, I can no longer generate a failure with the test.


cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: cataloguing NOT NULL constraints

2023-04-09 Thread Tom Lane
Andres Freund  writes:
> I think
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2023-04-07%2021%3A16%3A04
> might point out a problem with the pg_dump or pg_upgrade backward compat
> paths:

Yeah, this patch has broken every single upgrade-from-back-branch test.

I think there's a second problem, though: even without considering
back branches, this has changed pg_dump output in a way that
I fear is unacceptable.  Consider for instance this table definition
(from rules.sql):

create table rule_and_refint_t1 (
id1a integer,
id1b integer,
primary key (id1a, id1b)
);

This used to be dumped as

CREATE TABLE public.rule_and_refint_t1 (
id1a integer NOT NULL,
id1b integer NOT NULL
);
...
... load data ...
...
ALTER TABLE ONLY public.rule_and_refint_t1
ADD CONSTRAINT rule_and_refint_t1_pkey PRIMARY KEY (id1a, id1b);

In the new dispensation, pg_dump omits the NOT NULL clauses.
Great, you say, that makes the output more like what the user wrote.
I'm not so sure.  This means that the ALTER TABLE will be compelled
to perform a full-table scan to verify that there are no nulls in the
already-loaded data before it can add the missing NOT NULL constraint.
The old dump output was carefully designed to avoid the need for that
scan.  Admittedly, we have to do a scan anyway to build the index,
so this is strictly less than a 2X penalty on the ALTER, but is
that acceptable?  It might be all right in the context of regular
dump/restore, where we're surely doing a lot of per-row work anyway
to load the data and make the index.  In the context of pg_upgrade,
though, it seems absolutely disastrous: there will now be a per-row
cost where there was none before, and that is surely a deal-breaker.

BTW, I note from testing that the NOT NULL clauses *are* still
emitted in at least some cases when doing --binary-upgrade from an old
version.  (This may be directly related to the buildfarm failures,
not sure.)  That's no solution though, because now what you get in
pg_constraint will differ depending on which way you upgraded,
which seems unacceptable too.

I'm inclined to think that this idea of suppressing the implied
NOT NULL from PRIMARY KEY is a nonstarter and we should just
go ahead and make such a constraint.  Another idea could be for
pg_dump to emit the NOT NULL, load data, do the ALTER ADD PRIMARY
KEY, and then ALTER DROP NOT NULL.

In any case, I wonder whether that's the sort of redesign we should
be doing post-feature-freeze.  It might be best to revert and try
again in v17.

regards, tom lane




Re: Direct I/O

2023-04-09 Thread Andrew Dunstan


On 2023-04-09 Su 08:39, Thomas Munro wrote:

On Sun, Apr 9, 2023 at 11:25 PM Andrew Dunstan  wrote:

Didn't seem to make any difference.

Thanks for testing.  I think it's COW (and I think that implies also
checksums?) that needs to be turned off, at least based on experiments
here.




Googling agrees with you about checksums.  But I'm still wondering if we 
shouldn't disable COW for the build directory etc. It is suggested at [1]:



   Recommend to set nodatacow – turn cow off – for the files that
   require fast IO and tend to get very big and get alot of random
   writes: such VMDK (vm disks) files and the like.


I'll give it a whirl.


cheers


andrew


[1] 

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Direct I/O

2023-04-09 Thread Thomas Munro
On Sun, Apr 9, 2023 at 11:25 PM Andrew Dunstan  wrote:
> Didn't seem to make any difference.

Thanks for testing.  I think it's COW (and I think that implies also
checksums?) that needs to be turned off, at least based on experiments
here.




Re: Direct I/O

2023-04-09 Thread Thomas Munro
On Sun, Apr 9, 2023 at 4:52 PM Thomas Munro  wrote:
> Here, btrfs seems to be taking a different path that I can't quite
> make out...  I see no warning/error about a checksum failure like [1],
> and we apparently managed to read something other than a mix of the
> old and new page contents (which, based on your hypothesis, should
> just leave it indeterminate whether the hint bit changes were captured
> or not, and the rest of the page should be stable, right).  It's like
> the page time-travelled or got scrambled in some other way, but it
> didn't tell us?  I'll try to dig further...

I think there are two separate bad phenomena.

1.  A concurrent modification of the user-space buffer while writing
breaks the checksum, so you can't read the data back in.  I can
reproduce that with a stand-alone program, attached.  The "verifier"
process occasionally reports EIO while reading, unless you comment out
the "scribbler" process's active line.  The system log/dmesg gets some
warnings.

2.  The crake-style failure doesn't involve any reported checksum
failures or errors, and I'm not sure if another process is even
involved.  I attach a complete syscall trace of a repro session.  (I
tried to get strace to dump 8192 byte strings, but then it doesn't
repro, so we have only the start of the data transferred for each
page.)  Working back from the error message,

ERROR:  invalid page in block 78 of relation base/5/16384,

we have a page at offset 638976, and we can find all system calls that
touched that offset:

[pid 26031] 23:26:48.521123 pwritev(50,
[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=8192}], 1, 638976) = 8192

[pid 26040] 23:26:48.568975 pwrite64(5,
"\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
\0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192

[pid 26040] 23:26:48.593157 pread64(6,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192, 638976) = 8192

In between the write of non-zeros and the read of zeros, nothing that
I can grok seems to happen that could justify that, but perhaps
someone else will see something that I'm missing.  We pretty much just
have the parallel worker scanning the table, and writing stuff out as
it does it.  This was obtained with:

strace -f --absolute-timestamps=time,us ~/install/bin/postgres -D
pgdata -c io_direct=data -c shared_buffers=256kB -c wal_level=minimal
-c max_wal_senders=0 2>&1 | tee trace.log

The repro is just:

set debug_parallel_query=regress;
drop table if exists t;
create table t as select generate_series(1, 1);
update t set generate_series = 1;
select count(*) from t;

Occasionally it fails in a different way: after create table t, later
references to t can't find it in the catalogs, but there is no invalid
page error.  Perhaps the freaky zeros are happening one 4k page at a
time, and if you get two in a row it might look like an empty catalog
page and pass validation.


repro-strace.log.gz
Description: application/gzip
#define _GNU_SOURCE

#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define FLAGS (O_RDWR | O_CLOEXEC | O_DIRECT)
#define BUFFER_SIZE 8192
#define ALIGN 4096
#define LOOPS 100

#define PROCESS_WRITER 1
#define PROCESS_SCRIBBLER 2
#define PROCESS_VERIFIER 3

struct shared_data {
	uint8_t write_buffer[BUFFER_SIZE];
	int blockno;
	pthread_barrier_t barrier;
};

static int
run_process(int which, struct shared_data *shared)
{
	int fd;

	fd = open("file", FLAGS, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (int i = 0; i < LOOPS; i++) {

		if (which == PROCESS_WRITER) {
			shared->blockno = rand() % 88;
			memset(shared->write_buffer, 0x00, sizeof(shared->write_buffer));
			if (pwrite(fd, shared->write_buffer, BUFFER_SIZE, BUFFER_SIZE * shared->blockno) < BUFFER_SIZE)
perror("writer: initialization pwrite failed or short");
			memset(shared->write_buffer, 0xff, BUFFER_SIZE);
		}

		pthread_barrier_wait(&shared->barrier);

		if (which == PROCESS_WRITER) {
			if (pwrite(fd, shared->write_buffer, BUFFER_SIZE, BUFFER_SIZE * shared->blockno) < BUFFER_SIZE)
perror("writer: main pwrite failed or short");
		} else if (which == PROCESS_SCRIBBLER) {
			for (int j = 0; j < 3; j++)
shared->write_buffer[rand() % BUFFER_SIZE] = 0x42;
		}

		pthread_barrier_wait(&shared->barrier);

		if (which == PROCESS_VERIFIER) {
			uint8_t read_buffer[BUFFER_SIZE];

			if (pread(fd, read_buffer, BUFFER_SIZE, BUFFER_SIZE * shared->blockno) < BUFFER_SIZE) {
perror("pread failed or short while verifying");
			} else {
for (int i = 0; i < BUFFER_SIZE; i++) {
	if (read_buffer[i] == 0xff)
		continue;
	if (read_buffer[i] == 0x42)
		continue;
	printf("got unexpected byte %d at position %d in block %d (stopped looking after that)\n",
		read_buffer[i], i, shared->blockno);
	break;
}
			}
		}

		pthread_barrier_wait(&shared->barrier);
	}

	return 0;
}

int
main(int argc, char *argv[])
{
	pid_t pids[2];
	struct shared_data *shared;
	pthread_barrierattr_t attr;
	int fd;

	/* Create and size the test file. */
	fd = open("file", FLAGS | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ftruncate(fd, BUFFER_SIZE * 88) < 0) {
		perror("ftruncate");
		return 1;
	}
	close(fd);

	/* Shared write buffer and barrier, visible to all three processes.
	 * mmap returns page-aligned memory, satisfying O_DIRECT alignment. */
	shared = mmap(NULL, sizeof(*shared), PROT_READ | PROT_WRITE,
				  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (shared == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	pthread_barrierattr_init(&attr);
	pthread_barrierattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_barrier_init(&shared->barrier, &attr, 3);

	if ((pids[0] = fork()) == 0)
		return run_process(PROCESS_SCRIBBLER, shared);
	if ((pids[1] = fork()) == 0)
		return run_process(PROCESS_VERIFIER, shared);
	run_process(PROCESS_WRITER, shared);
	waitpid(pids[0], NULL, 0);
	waitpid(pids[1], NULL, 0);
	return 0;
}

Re: Direct I/O

2023-04-09 Thread Andrew Dunstan


On 2023-04-08 Sa 18:50, Thomas Munro wrote:

On Sun, Apr 9, 2023 at 10:17 AM Andrew Dunstan  wrote:

I can run the test in isolation, and it's get an error reliably.

Random idea: it looks like you have compression enabled.  What if you
turn it off in the directory where the test runs?  Something like
btrfs property set  compression ... according to the
intergoogles.  (I have never used btrfs before 6 minutes ago but I
can't seem to repro this with basic settings in a loopback btrfs
filesystems).



Didn't seem to make any difference.


cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: CREATE SUBSCRIPTION -- add missing tab-completes

2023-04-09 Thread Gregory Stark (as CFM)
On Fri, 7 Apr 2023 at 01:28, Amit Kapila  wrote:
>
> On Wed, Apr 5, 2023 at 5:58 AM Peter Smith  wrote:
>
> > PSA patches to add those tab completions.
>
> LGTM, so pushed.

I moved this to the next CF but actually I just noticed the thread
starts with the original patch being pushed. Maybe we should just
close the CF entry? Is this further change relevant?

-- 
Gregory Stark
As Commitfest Manager