Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-08-09 Thread Thomas Munro
On Wed, Aug 10, 2022 at 5:35 PM Thomas Munro  wrote:
> 2021

Or, rather, 14 days into 2022 :-)




Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-08-09 Thread Thomas Munro
On Wed, Aug 10, 2022 at 5:28 PM Andres Freund  wrote:
> I don't think it's a safe assumption that nobody would hold a pin on such a
> page during recovery. While not the case here, somebody else could have used
> pg_prewarm to read it in.
>
> But also, the checkpointer or bgwriter could have it temporarily pinned, to
> write it out, or another backend could try to write it out as a victim buffer
> and have it temporarily pinned.

Right, of course.  So it's just that hash indexes didn't get xlog'd
until 2017, and still aren't very popular, and then recovery didn't
get regression tested until 2021, so nobody ever hit it.




Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 20:21:19 -0700, Mark Dilger wrote:
> > On Aug 9, 2022, at 7:26 PM, Andres Freund  wrote:
> >
> > The relevant code triggering it:
> >
> > newbuf = XLogInitBufferForRedo(record, 1);
> > _hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
> >   xlrec->new_bucket_flag, true);
> > if (!IsBufferCleanupOK(newbuf))
> > elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire 
> > cleanup lock");
> >
> > Why do we just crash if we don't already have a cleanup lock? That can't be
> > right. Or is there supposed to be a guarantee this can't happen?
>
> Perhaps the code assumes that when xl_hash_split_allocate_page record was
> written, the new_bucket field referred to an unused page, and so during
> replay it should also refer to an unused page, and being unused, that nobody
> will have it pinned.  But at least in heap we sometimes pin unused pages
> just long enough to examine them and to see that they are unused.  Maybe
> something like that is happening here?

I don't think it's a safe assumption that nobody would hold a pin on such a
page during recovery. While not the case here, somebody else could have used
pg_prewarm to read it in.

But also, the checkpointer or bgwriter could have it temporarily pinned, to
write it out, or another backend could try to write it out as a victim buffer
and have it temporarily pinned.


static int
SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
{
...
/*
 * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
 * buffer is clean by the time we've locked it.)
 */
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);


As you can see we acquire a pin without holding a lock on the page (and that
can't be changed!).


I assume this is trying to defend against some sort of deadlock by not
actually taking a cleanup lock the normal way (i.e. by passing
get_cleanup_lock = true to XLogReadBufferForRedoExtended()).

I don't think it's possible to rely on a dirty page to never be pinned by
another backend. All you can rely on with a cleanup lock is that there's no
*prior* references to the buffer, and thus it's safe to reorganize the buffer,
because the pin-holder hasn't yet gotten a lock on the page.


> I'd be curious to see the count returned by
> BUF_STATE_GET_REFCOUNT(LockBufHdr(newbuf)) right before this panic.  If it's
> just 1, then it's not another backend, but our own, and we'd want to debug
> why we're pinning the same page twice (or more) while replaying wal.

This was the first time in a couple hundred runs that I have seen this, so
I don't think it's that easily debuggable for me.


> Otherwise, maybe it's a race condition with some other process that
> transiently pins a buffer and occasionally causes this code to panic?

As pointed out above, it's legal to have a transient pin on a page, so this
just looks like a bad assumption in the hash code to me.
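
To make that concrete, one way the redo routine could wait for such a
transient pin to go away, instead of PANICking, is sketched below. This is
illustrative only (not a committed fix, and it ignores whatever deadlock
concern may have motivated the current coding); it reuses the record, xlrec
and newbuf variables from the quoted code:

	/* wait for a cleanup lock while zeroing the new bucket's page */
	XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_AND_CLEANUP_LOCK,
								  true, &newbuf);
	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
				  xlrec->new_bucket_flag, true);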

Greetings,

Andres Freund




Blocking the use of TRIGGER privilege

2022-08-09 Thread Simon Riggs
The separate TRIGGER privilege is considered obsolescent. It is not
heavily used and exists mainly to facilitate trigger-based replication
in a multi-user system.
i.e.
GRANT TRIGGER ON foo TO bob;

Since the logical replication documentation recommends "Limit ownership and
TRIGGER privilege on such tables to trusted roles.", it would be useful
to have a way to restrict the granting of the TRIGGER privilege.

We might suggest removing it completely, but it does appear to be a
part of the SQL Standard, T211-07, so that is not an option. In any
case, such a move would need us to do a lengthy deprecation dance
across multiple releases.

But we can just have an option to prevent the TRIGGER privilege being granted.

allow_trigger_privilege = off (new default in PG16) | on
shown in postgresql.conf, only settable at server start so that it
even blocks superusers and special roles.

Existing usage of the trigger privilege would not be touched, only new usage.

(No, this does not mean I want to ban triggers, only the trigger privilege).
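
For concreteness, the enforcement could be a check as small as the sketch
below somewhere in the GRANT code path. The GUC name follows the proposal
above; the exact location, variable names and messages are assumptions, not
a worked-out patch:

	/* reject new grants of TRIGGER while the GUC is off */
	if (!allow_trigger_privilege &&
		(this_privileges & ACL_TRIGGER) != 0)
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("granting the TRIGGER privilege is disabled"),
				 errhint("The server was started with allow_trigger_privilege = off.")));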

Thoughts?

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2022-08-09 Thread Dilip Kumar
On Sun, Aug 7, 2022 at 9:47 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-08-07 09:24:40 +0530, Dilip Kumar wrote:
> > On Sat, Aug 6, 2022 at 9:36 PM Tom Lane  wrote:
> > >
> > > Dilip Kumar  writes:
> > > > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar  
> > > > wrote:
> > > >> Yeah maybe it is not necessary to close as these unowned smgr will
> > > >> automatically get closed on the transaction end.
> > >
> > > I do not think this is a great idea for the per-relation smgrs created
> > > during RelationCopyStorageUsingBuffer.  Yeah, they'll be mopped up at
> > > transaction end, but that doesn't mean that creating possibly tens of
> > > thousands of transient smgrs isn't going to cause performance issues.
>
> I was assuming that the files would get reopened at the end of the transaction
> anyway, but it looks like that's not the case, unless wal_level=minimal.
>
> Hm. CreateAndCopyRelationData() calls RelationCreateStorage() with
> register_delete = false, which is ok because createdb_failure_callback will
> clean things up. But that's another thing that's not great for a routine with
> a general name...
>
>
> > Okay, so for that we can simply call smgrcloserellocator(rlocator);
> > before exiting the RelationCopyStorageUsingBuffer() right?
>
> Yea, I think so.

Done. Along with that, I have also brought back the hunk of smgropen and
smgrclose in ScanSourceDatabasePgClass() which I had in the v1 patch [1].
Here we do not want to reuse the smgr of pg_class again, so instead of
closing it at the end with smgrcloserellocator() we can just keep the smgr
reference and close it immediately after getting the number of blocks.
Whereas in CreateAndCopyRelationData() and RelationCopyStorageUsingBuffer()
we use the smgrs of the source and destination relations multiple times, so
it makes sense not to close them immediately; instead we close them while
exiting the function with smgrcloserellocator().

[1]
+ smgr = smgropen(rlocator, InvalidBackendId);
+ nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
+ smgrclose(smgr);



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v3-0002-Optimize-copy-storage-from-source-to-destination.patch
Description: Binary data


v3-0001-Avoid-setting-the-fake-relcache-entry-as-smgr-own.patch
Description: Binary data


Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-08-09 Thread Thomas Munro
On Wed, Aug 10, 2022 at 3:21 PM Mark Dilger
 wrote:
> > On Aug 9, 2022, at 7:26 PM, Andres Freund  wrote:
> > The relevant code triggering it:
> >
> >   newbuf = XLogInitBufferForRedo(record, 1);
> >   _hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
> > xlrec->new_bucket_flag, true);
> >   if (!IsBufferCleanupOK(newbuf))
> >   elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire 
> > cleanup lock");
> >
> > Why do we just crash if we don't already have a cleanup lock? That can't be
> > right. Or is there supposed to be a guarantee this can't happen?
>
> Perhaps the code assumes that when xl_hash_split_allocate_page record was 
> written, the new_bucket field referred to an unused page, and so during 
> replay it should also refer to an unused page, and being unused, that nobody 
> will have it pinned.  But at least in heap we sometimes pin unused pages just 
> long enough to examine them and to see that they are unused.  Maybe something 
> like that is happening here?

Here's an email about that:

https://www.postgresql.org/message-id/CAE9k0P=OXww6RQCGrmDNa8=L3EeB01SGbYuP23y-qZJ=4td...@mail.gmail.com

> I'd be curious to see the count returned by 
> BUF_STATE_GET_REFCOUNT(LockBufHdr(newbuf)) right before this panic.  If it's 
> just 1, then it's not another backend, but our own, and we'd want to debug 
> why we're pinning the same page twice (or more) while replaying wal.  
> Otherwise, maybe it's a race condition with some other process that 
> transiently pins a buffer and occasionally causes this code to panic?

But which backend could that be?  We aren't starting any at that point
in the test.

Someone might wonder if it's the startup process itself via the new
WAL prefetching machinery, but that doesn't pin pages, it only probes
the buffer mapping table to see if future pages are cached already
(see bufmgr.c PrefetchSharedBuffer()).  (This is a topic I've thought
about a bit because I have another installment of recovery prefetching
in development using real AIO that *does* pin pages in advance, and
has to deal with code that wants cleanup locks like this...)

It's possible that git log src/backend/access/hash/ can explain a
behaviour change, as there were some recent changes there, but it's
not jumping out at me.  Maybe 4f1f5a7f "Remove fls(), use
pg_leftmost_one_pos32() instead." has a maths error, but I don't see
it.  Maybe e09d7a12 "Improve speed of hash index build." accidentally
reaches a new state and triggers a latent bug.  Maybe a latent bug
showed up now just because we started testing recovery not too long
ago... but all of that still needs another backend involved.  We can
see which blocks the startup process has pinned, 23 != 45.  Hmmm.




Re: dropping datumSort field

2022-08-09 Thread Zhihong Yu
On Tue, Aug 9, 2022 at 9:04 PM David Rowley  wrote:

> On Wed, 10 Aug 2022 at 03:16, Zhihong Yu  wrote:
> > On Tue, Aug 9, 2022 at 8:01 AM Robert Haas 
> wrote:
> >>
> >> One problem with this patch is that, if I apply it, PostgreSQL does not
> compile:
> >>
> >> nodeSort.c:197:6: error: use of undeclared identifier 'tupDesc'
> >> if (tupDesc->natts == 1)
> >> ^
> >> 1 error generated.
> >>
> >> Leaving that aside, I don't really see any advantage in this sort of
> change.
>
> I'm with Robert on this one. If you look at the discussion for that
> commit, you'll find quite a bit of benchmarking was done around this
> [1].  The "if" test here is in quite a hot code path, so we want to
> ensure whatever is being referenced here causes the least amount of
> CPU cache misses.  Volcano executors which process a single row at a
> time are not exactly an ideal workload for a modern processor due to
> the bad L1i and L1d hit ratios. This becomes worse with more plan
> nodes as these caches are more likely to get flushed of useful cache
> lines when more nodes are visited. Various other fields in 'node' have
> just been accessed in the code leading up to the "if
> (node->datumSort)" check, so we're probably not going to encounter any
> CPU pipeline stalls waiting for cache lines to be loaded into L1d.  It
> seems you're proposing to change this and have offered no evidence of
> no performance regressions from doing so.  Going by the compilation
> error above, it seems unlikely that you've given performance any
> consideration at all.
>
> I mean this in the best way possible; for the future, I really
> recommend arriving with ideas that are well researched and tested.
> When you can, arrive with evidence to prove your change is good.  For
> this case, evidence would be benchmark results.  The problem is if you
> were to arrive with patches such as this too often then you'll start
> to struggle to get attention from anyone, let alone a committer. You
> don't want to build a reputation for bad quality work as it's likely
> most committers will steer clear of your work.  If you want a good
> recent example of a good proposal, have a look at Yuya Watari's
> write-up at [2] and [3]. There was a clear problem statement there and
> a patch that was clearly a proof of concept only, so the author was
> under no illusion that what he had was ideal.  Now, some other ideas
> were suggested on that thread to hopefully simplify the task and
> provide even better performance. Yuya went off and did that and
> arrived back armed with further benchmark results. I was quite
> impressed with this considering he's not posted to -hackers very
> often.  Now, if you were a committer, would you be looking at the
> patches from the person who sends in half-thought-through ideas, or
> patches from someone that has clearly put a great deal of effort into
> first clearly stating the problem and then proving that the problem is
> solved by the given patch?
>
> David
>
> [1]
> https://www.postgresql.org/message-id/CAApHDvrWV%3Dv0qKsC9_BHqhCn9TusrNvCaZDz77StCO--fmgbKA%40mail.gmail.com
> [2]
> https://www.postgresql.org/message-id/caj2pmkzncgoukse+_5lthd+kbxkvq6h2hqn8esxpxd+cxmg...@mail.gmail.com
> [3]
> https://www.postgresql.org/message-id/caj2pmkzkfvmphovyyuebpwb_nyyvk2+gadqgzxzvnjkvxgt...@mail.gmail.com


Hi, David:
Thanks for your detailed response.


Re: dropping datumSort field

2022-08-09 Thread David Rowley
On Wed, 10 Aug 2022 at 03:16, Zhihong Yu  wrote:
> On Tue, Aug 9, 2022 at 8:01 AM Robert Haas  wrote:
>>
>> One problem with this patch is that, if I apply it, PostgreSQL does not 
>> compile:
>>
>> nodeSort.c:197:6: error: use of undeclared identifier 'tupDesc'
>> if (tupDesc->natts == 1)
>> ^
>> 1 error generated.
>>
>> Leaving that aside, I don't really see any advantage in this sort of change.

I'm with Robert on this one. If you look at the discussion for that
commit, you'll find quite a bit of benchmarking was done around this
[1].  The "if" test here is in quite a hot code path, so we want to
ensure whatever is being referenced here causes the least amount of
CPU cache misses.  Volcano executors which process a single row at a
time are not exactly an ideal workload for a modern processor due to
the bad L1i and L1d hit ratios. This becomes worse with more plan
nodes as these caches are more likely to get flushed of useful cache
lines when more nodes are visited. Various other fields in 'node' have
just been accessed in the code leading up to the "if
(node->datumSort)" check, so we're probably not going to encounter any
CPU pipeline stalls waiting for cache lines to be loaded into L1d.  It
seems you're proposing to change this and have offered no evidence of
no performance regressions from doing so.  Going by the compilation
error above, it seems unlikely that you've given performance any
consideration at all.

I mean this in the best way possible; for the future, I really
recommend arriving with ideas that are well researched and tested.
When you can, arrive with evidence to prove your change is good.  For
this case, evidence would be benchmark results.  The problem is if you
were to arrive with patches such as this too often then you'll start
to struggle to get attention from anyone, let alone a committer. You
don't want to build a reputation for bad quality work as it's likely
most committers will steer clear of your work.  If you want a good
recent example of a good proposal, have a look at Yuya Watari's
write-up at [2] and [3]. There was a clear problem statement there and
a patch that was clearly a proof of concept only, so the author was
under no illusion that what he had was ideal.  Now, some other ideas
were suggested on that thread to hopefully simplify the task and
provide even better performance. Yuya went off and did that and
arrived back armed with further benchmark results. I was quite
impressed with this considering he's not posted to -hackers very
often.  Now, if you were a committer, would you be looking at the
patches from the person who sends in half-thought-through ideas, or
patches from someone that has clearly put a great deal of effort into
first clearly stating the problem and then proving that the problem is
solved by the given patch?

David

[1] 
https://www.postgresql.org/message-id/CAApHDvrWV%3Dv0qKsC9_BHqhCn9TusrNvCaZDz77StCO--fmgbKA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/caj2pmkzncgoukse+_5lthd+kbxkvq6h2hqn8esxpxd+cxmg...@mail.gmail.com
[3] 
https://www.postgresql.org/message-id/caj2pmkzkfvmphovyyuebpwb_nyyvk2+gadqgzxzvnjkvxgt...@mail.gmail.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-09 Thread John Naylor
On Wed, Aug 10, 2022 at 7:13 AM Nathan Bossart  wrote:
>
> On Tue, Aug 09, 2022 at 01:00:37PM -0700, Nathan Bossart wrote:
> > Your adjustments in 0002 seem reasonable to me.  I think it makes sense to
> > ensure there is test coverage for pg_lfind32(), but I don't know if that
> > syscache code is the right choice.  For non-USE_SSE2 builds, it might make
> > these lookups more expensive.

Yeah.

On Wed, Aug 10, 2022 at 9:25 AM Masahiko Sawada  wrote:
> I think that for non-USE_SSE2 builds, there is no additional overhead
> as all assertion-related code in pg_lfind32 depends on USE_SSE2.

Nathan is referring to RelationSupportsSysCache() and
RelationHasSysCache(). They currently use binary search and using
linear search on non-x86-64 platforms is probably slower.

[Nathan again]
> One option might be to create a small test module for pg_lfind32().  Here
> is an example.

LGTM, let's see what the buildfarm thinks of 0001.
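
For anyone following along, the call site that 0001's commit message is
ultimately aimed at would look roughly like this (a simplified sketch of the
XidInMVCCSnapshot() change, not the committed patch; the snapshot fields are
from SnapshotData):

#include "port/pg_lfind.h"

	/* replace the open-coded loop over the snapshot's xip array */
	if (pg_lfind32(xid, snapshot->xip, snapshot->xcnt))
		return true;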

--
John Naylor
EDB: http://www.enterprisedb.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-08-09 Thread Amit Kapila
On Tue, Aug 9, 2022 at 5:39 PM Amit Kapila  wrote:
>
> On Tue, Aug 9, 2022 at 11:09 AM Dilip Kumar  wrote:
> >
> > Some more comments
> >
> > +/*
> > + * Exit if any relation is not in the READY state and if any worker is
> > + * handling the streaming transaction at the same time. Because for
> > + * streaming transactions that is being applied in apply background
> > + * worker, we cannot decide whether to apply the change for a relation
> > + * that is not in the READY state (see should_apply_changes_for_rel) 
> > as we
> > + * won't know remote_final_lsn by that time.
> > + */
> > +if (list_length(ApplyBgworkersFreeList) !=
> > list_length(ApplyBgworkersList) &&
> > +!AllTablesyncsReady())
> > +{
> > +ereport(LOG,
> > +(errmsg("logical replication apply workers for
> > subscription \"%s\" will restart",
> > +MySubscription->name),
> > + errdetail("Cannot handle streamed replication
> > transaction by apply "
> > +   "background workers until all tables are
> > synchronized")));
> > +
> > +proc_exit(0);
> > +}
> >
> > How this situation can occur? I mean while starting a background
> > worker itself we can check whether all tables are sync ready or not
> > right?
> >
>
> We are already checking at the start in apply_bgworker_can_start(), but
> I think it is required to check at a later point in time as well,
> because new rels can be added to pg_subscription_rel via Alter
> Subscription ... Refresh. I feel if that reasoning is correct then we
> can probably expand the comments to make it clear.
>
> > +/* Check the status of apply background worker if any. */
> > +apply_bgworker_check_status();
> > +
> >
> > What is the need to check each worker status on every commit?  I
> > mean if there are a lot of small transactions along with some
> > streaming transactions,
> > then it will affect the apply performance for those small transactions?
> >
>
> I don't think performance will be a concern because this won't do any
> costly operation unless invalidation happens in which case it will
> access system catalogs. However, if my above understanding is correct
> that new tables can be added during the apply process then not sure
> doing it at commit time is sufficient/correct because it can change
> even during the transaction.
>

One idea that may handle it cleanly is to check for the
SUBREL_STATE_SYNCDONE state in should_apply_changes_for_rel() and
error out in the apply background worker case. For the SUBREL_STATE_READY state, it
should return true and for any other state, it can return false. The
one advantage of this approach could be that the parallel apply worker
will give an error only if the corresponding transaction has performed
any operation on the relation that has reached the SYNCDONE state.
OTOH, checking at each transaction end can also lead to erroring out
of workers even if the parallel apply transaction doesn't perform any
operation on the relation which is not in the READY state.
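
A rough sketch of that idea (not the patch itself; am_apply_bgworker() and
the error wording are assumptions here):

static bool
should_apply_changes_for_rel(LogicalRepRelMapEntry *rel)
{
	if (am_tablesync_worker())
		return MyLogicalRepWorker->relid == rel->localreloid;

	if (am_apply_bgworker())	/* assumed helper from the patch */
	{
		if (rel->state == SUBREL_STATE_READY)
			return true;

		/*
		 * SYNCDONE cannot be handled here because remote_final_lsn is not
		 * known in the background worker, so error out only when the
		 * streamed transaction actually touches such a relation.
		 */
		if (rel->state == SUBREL_STATE_SYNCDONE)
			ereport(ERROR,
					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
					 errmsg("cannot apply streamed changes for relation \"%s.%s\" until it is in READY state",
							rel->remoterel.nspname, rel->remoterel.relname)));

		return false;
	}

	return (rel->state == SUBREL_STATE_READY ||
			(rel->state == SUBREL_STATE_SYNCDONE &&
			 rel->statelsn <= remote_final_lsn));
}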

-- 
With Regards,
Amit Kapila.




Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-08-09 Thread Mark Dilger



> On Aug 9, 2022, at 7:26 PM, Andres Freund  wrote:
> 
> The relevant code triggering it:
> 
>   newbuf = XLogInitBufferForRedo(record, 1);
>   _hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
> xlrec->new_bucket_flag, true);
>   if (!IsBufferCleanupOK(newbuf))
>   elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire 
> cleanup lock");
> 
> Why do we just crash if we don't already have a cleanup lock? That can't be
> right. Or is there supposed to be a guarantee this can't happen?

Perhaps the code assumes that when xl_hash_split_allocate_page record was 
written, the new_bucket field referred to an unused page, and so during replay 
it should also refer to an unused page, and being unused, that nobody will have 
it pinned.  But at least in heap we sometimes pin unused pages just long enough 
to examine them and to see that they are unused.  Maybe something like that is 
happening here?

I'd be curious to see the count returned by 
BUF_STATE_GET_REFCOUNT(LockBufHdr(newbuf)) right before this panic.  If it's 
just 1, then it's not another backend, but our own, and we'd want to debug why 
we're pinning the same page twice (or more) while replaying wal.  Otherwise, 
maybe it's a race condition with some other process that transiently pins a 
buffer and occasionally causes this code to panic?
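
For what it's worth, a quick sketch of that debugging aid (illustrative only;
hash_xlog.c would have to peek at buf_internals.h for this, which access
method code normally doesn't do):

	if (!IsBufferCleanupOK(newbuf))
	{
		BufferDesc *buf_hdr = GetBufferDescriptor(newbuf - 1);
		uint32		buf_state = LockBufHdr(buf_hdr);
		uint32		refcount = BUF_STATE_GET_REFCOUNT(buf_state);

		UnlockBufHdr(buf_hdr, buf_state);
		elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock (refcount %u)",
			 refcount);
	}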

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: shared-memory based stats collector - v70

2022-08-09 Thread Kyotaro Horiguchi
At Tue, 9 Aug 2022 09:53:19 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2022-08-09 17:24:35 +0900, Kyotaro Horiguchi wrote:
> > If I'm not missing something, it's strange that pgstat_lock_entry()
> > only takes LW_EXCLUSIVE.
> 
> I think it makes some sense, given that there's a larger number of callers for
> that in various stats-emitting code. Perhaps we could just add a separate
> function with a _shared() suffix?

Sure. That was an alternative I had in my mind.

> > The attached changes the interface of
> > pgstat_lock_entry(), but there's only one user, since the other read of
> > a shared stats entry does not use a reference. Thus the interface change
> > might be too much. If I just add bare LWLockAcquire/Release() to
> > pgstat_fetch_entry, the amount of the patch will be reduced.
> 
> Could you try the pgstat_lock_entry_shared() approach?

Of course. Please find the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From f406fd384d7ca145e37810a2f6a7a410f74d5795 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 10 Aug 2022 10:53:19 +0900
Subject: [PATCH] Take lock while building stats snapshot

It got lost somewhere while developing shared memory stats collector
but it is undoubtedly required.
---
 src/backend/utils/activity/pgstat.c   |  8 
 src/backend/utils/activity/pgstat_shmem.c | 12 
 src/include/utils/pgstat_internal.h   |  1 +
 3 files changed, 21 insertions(+)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..c23142a670 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 		kind_info->shared_data_len);
+	pgstat_lock_entry_shared(entry_ref, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 
 		entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 		 kind_info->shared_size);
+		/*
+		 * Take lwlock directly instead of using pg_stat_lock_entry_shared()
+		 * which requires a reference.
+		 */
+		LWLockAcquire(&stats_data->lock, LW_SHARED);
 		memcpy(entry->data,
 			   pgstat_get_entry_data(kind, stats_data),
 			   kind_info->shared_size);
+		LWLockRelease(&stats_data->lock);
 	}
 	dshash_seq_term(&hstat);
 
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..0bfa460af1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -579,6 +579,18 @@ pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)
 	return true;
 }
 
+bool
+pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait)
+{
+	LWLock	   *lock = &entry_ref->shared_stats->lock;
+
+	if (nowait)
+		return LWLockConditionalAcquire(lock, LW_SHARED);
+
+	LWLockAcquire(lock, LW_SHARED);
+	return true;
+}
+
 void
 pgstat_unlock_entry(PgStat_EntryRef *entry_ref)
 {
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..901d2041d6 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -581,6 +581,7 @@ extern void pgstat_detach_shmem(void);
 extern PgStat_EntryRef *pgstat_get_entry_ref(PgStat_Kind kind, Oid dboid, Oid objoid,
 			 bool create, bool *found);
 extern bool pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait);
+extern bool pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait);
 extern void pgstat_unlock_entry(PgStat_EntryRef *entry_ref);
 extern bool pgstat_drop_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 extern void pgstat_drop_all_entries(void);
-- 
2.31.1



hash_xlog_split_allocate_page: failed to acquire cleanup lock

2022-08-09 Thread Andres Freund
Hi,

One CI run for the meson branch just failed in a way I hadn't seen before on
windows, when nothing had changed on windows

https://cirrus-ci.com/task/6111743586861056

027_stream_regress.pl ended up failing due to a timeout. Which in turn was
caused by the standby crashing.

2022-08-10 01:46:20.731 GMT [2212][startup] PANIC:  
hash_xlog_split_allocate_page: failed to acquire cleanup lock
2022-08-10 01:46:20.731 GMT [2212][startup] CONTEXT:  WAL redo at 0/7A6EED8 for 
Hash/SPLIT_ALLOCATE_PAGE: new_bucket 31, meta_page_masks_updated F, 
issplitpoint_changed F; blkref #0: rel 1663/16384/24210, blk 23; blkref #1: rel 
1663/16384/24210, blk 45; blkref #2: rel 1663/16384/24210, blk 0
abort() has been called
2022-08-10 01:46:31.919 GMT [7560][checkpointer] LOG:  restartpoint starting: time
2022-08-10 01:46:32.430 GMT [8304][postmaster] LOG:  startup process (PID 2212) 
was terminated by exception 0xC354

stack dump:
https://api.cirrus-ci.com/v1/artifact/task/6111743586861056/crashlog/crashlog-postgres.exe_21c8_2022-08-10_01-46-28-215.txt

The relevant code triggering it:

newbuf = XLogInitBufferForRedo(record, 1);
_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
  xlrec->new_bucket_flag, true);
if (!IsBufferCleanupOK(newbuf))
elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire 
cleanup lock");

Why do we just crash if we don't already have a cleanup lock? That can't be
right. Or is there supposed to be a guarantee this can't happen?


Greetings,

Andres Freund




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-09 Thread Masahiko Sawada
On Wed, Aug 10, 2022 at 5:00 AM Nathan Bossart  wrote:
>
> On Tue, Aug 09, 2022 at 01:21:41PM +0700, John Naylor wrote:
> > I decided I wasn't quite comfortable changing snapshot handling
> > without further guarantees.  To this end, 0002 in the attached v11 is
> > an addendum that adds assert checking (also pgindent and some
> > comment-smithing). As I suspected, make check-world passes even with
> > purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and
> > I verified that sticking in the wrong answer will lead to a crash in
> > assert-enabled builds in short order. I'd kind of like to throw this
> > (or something else suitable) at the build farm first for that reason.
> > It's simpler than the qsort/qunique/binary search that was there
> > before, so that's nice, but I've not tried to test performance.
>
> Your adjustments in 0002 seem reasonable to me.  I think it makes sense to
> ensure there is test coverage for pg_lfind32(), but I don't know if that
> syscache code is the right choice.  For non-USE_SSE2 builds, it might make
> these lookups more expensive.

I think that for non-USE_SSE2 builds, there is no additional overhead
as all assertion-related code in pg_lfind32 depends on USE_SSE2.

> I'll look around to see if there are any
> other suitable candidates.

As you proposed, having a test module for that seems to be a good
idea. We can add test codes for future optimizations that utilize SIMD
operations.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




RE: Perform streaming logical transactions by background workers and parallel apply

2022-08-09 Thread kuroda.hay...@fujitsu.com
Hi Wang,

> 6.a
> 
> It seems that the upper line represents the apply background worker, but I 
> think
> last_msg_send_time and last_msg_receipt_time should be null.
> Is it like initialization mistake?

I checked again about the issue.

Attributes worker->last_send_time, worker->last_recv_time, and 
worker->reply_time
are initialized in logicalrep_worker_launch():

```
...
TIMESTAMP_NOBEGIN(worker->last_send_time);
TIMESTAMP_NOBEGIN(worker->last_recv_time);
worker->reply_lsn = InvalidXLogRecPtr;
TIMESTAMP_NOBEGIN(worker->reply_time);
...
```

And the macro is defined in timestamp.h, and it seems that the values are 
initialized as PG_INT64_MIN.

```
#define DT_NOBEGIN  PG_INT64_MIN
#define DT_NOENDPG_INT64_MAX

#define TIMESTAMP_NOBEGIN(j)\
do {(j) = DT_NOBEGIN;} while (0)
```


However, in pg_stat_get_subscription(), these values are regarded as null if 
they are zero.

```
if (worker.last_send_time == 0)
nulls[4] = true;
else
values[4] = TimestampTzGetDatum(worker.last_send_time);
if (worker.last_recv_time == 0)
nulls[5] = true;
else
values[5] = TimestampTzGetDatum(worker.last_recv_time);
```

I think the above lines are wrong; these values should be compared with
PG_INT64_MIN (DT_NOBEGIN) instead.
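
Something like the following is what I have in mind (a sketch against
pg_stat_get_subscription(), not a polished patch):

```
	/* treat the "never set" sentinel (DT_NOBEGIN) as null, not zero */
	if (TIMESTAMP_IS_NOBEGIN(worker.last_send_time))
		nulls[4] = true;
	else
		values[4] = TimestampTzGetDatum(worker.last_send_time);
	if (TIMESTAMP_IS_NOBEGIN(worker.last_recv_time))
		nulls[5] = true;
	else
		values[5] = TimestampTzGetDatum(worker.last_recv_time);
```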

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Windows now has fdatasync()

2022-08-09 Thread Thomas Munro
David kindly ran some tests of this thing on real hardware.  The
results were mostly in line with expectations, but we learned some new
things.

TL;DR  We probably should consider this as a safer default, but it'd
be good for someone more hands-on with this OS and knowledgeable about
storage to investigate and propose that.  My original goal here was
primarily Unix/Windows harmonisation and cleanup since I'm doing a
bunch of hacking on I/O, but I can't unsee an
unsafe-at-least-on-consumer-gear default now that I've seen it.  The
main thing I'm aware of that we don't know yet is what happens if you
try it on a non-NTFS file system (ReFS? SMB?) -- hopefully it falls
back to fsync behaviour.

Observations from an old Windows 8.1 system with a SATA drive:

1.  So far you can apparently still actually compile and run on 8.1,
despite recent commits to de-support it.
2.  You can use the new wal_sync_method=fdatasync, without error, and
timings are consistent with falling back to full fsync behaviour.
That makes sense, I guess, because the function existed.  It's just a
new flag bit, and the default behaviour for flags == 0 was already
their fsync.  That seems like a good outcome even though 8.1 isn't a
target anymore.

Observations from a current Windows 11 system with an NVMe drive:

1.  fdatasync is faster than fsync, as expected.  Twice as fast with
write cache disabled, a bit faster with write cache enabled.
2.  Timings seem to suggest that open_datasync (the current default)
is not really writing through the drive cache.  I'd previously thought
that was a SATA-only problem based on [1], which said that EIDE/SATA
drivers did not pass through the FUA flag that NTFS sends for
FILE_FLAG_WRITE_THROUGH (= O_DSYNC) on the basis that many drives
ignored it anyway, but these numbers seem to suggest that David's
recent-ish NVMe system has the same problem as the old SATA system.

Generally, Windows' approach seems to be that NTFS
FILE_FLAG_WRITE_THROUGH fires an FUA flag into the storage stack, and
either the driver or the drive is free to fling it out the window, and
it's the user's problem to worry about that, whereas Linux at least
asks nicely if the drive understands FUA and falls back to flushing
the whole cache if not[2].  I also know that Linux has been flaky
around this in the past too, especially on consumer storage, and macOS
and at least some of the older BSD/UFS systems just don't do this
stuff at all for user data (yet) so it's not like there is anything
universal about this topic.  Note that drive caches are enabled by
default in Windows, and our manual does already tell you about this
problem[3].

One thing to note about the numbers below: pg_test_fsync.c's
open_datasync test is also using FILE_FLAG_NO_BUFFERING (= O_DIRECT),
unlike PostgreSQL, which muddies the waters slightly.  (There was a
patch upthread to fix that and report both numbers, I may come back to
that.)

Windows 11, NVMe, write cache enabled:

open_datasync 27306.286 ops/sec  37 usecs/op
fdatasync  3065.428 ops/sec 326 usecs/op
fsync  2577.498 ops/sec 388 usecs/op

Windows 11, NVMe, write cache disabled:

open_datasync  3477.258 ops/sec 288 usecs/op
fdatasync  3263.418 ops/sec 306 usecs/op
fsync  1641.502 ops/sec 609 usecs/op

Windows 8.1, SATA:

open_datasync 19934.532 ops/sec      50 usecs/op
fdatasync       231.429 ops/sec    4321 usecs/op
fsync           240.050 ops/sec    4166 usecs/op

(We couldn't figure out how to disable the write cache on the 8.1
machine -- the usual checkbox had no effect -- but we didn't waste
time investigating that old system beyond the curiosity of checking if
it'd work at all.)

[1] https://devblogs.microsoft.com/oldnewthing/20170510-00/?p=95505
[2] 
https://techcommunity.microsoft.com/t5/sql-server-blog/sql-server-on-linux-forced-unit-access-fua-internals/ba-p/3199102
[3] https://www.postgresql.org/docs/devel/wal-reliability.html




Re: pg_get_constraintdef: Schema qualify foreign tables unless pretty printing is enabled

2022-08-09 Thread Lukas Fittl
On Tue, Aug 9, 2022 at 5:33 PM Tom Lane  wrote:

> I would say that pg_get_indexdef is the one that's out of step.
> I count 11 calls of generate_relation_name in ruleutils.c,
> of which only three have this business of being overridden
> when not-pretty.  What is the rationale for that, and why
> would we move pg_get_constraintdef from one category to the
> other?
>

The overall motivation here is to make it easy to recreate the schema
without having to make the search_path on the importing side identical to
the one on the exporting side. There is a workaround, which is to do a
SET search_path that excludes the referenced schemas before calling these
functions (which I guess is what pg_dump does?).

But I wonder, why do we have an explicit pretty printing flag on these
functions, and PRETTYFLAG_SCHEMA in the code to represent this behavior. If
we don't want pretty printing to affect schema qualification, why does that
flag exist?

Of the other call sites, in terms of using generate_relation_name vs
generate_qualified_relation_name:

* pg_get_triggerdef_worker makes it conditional on pretty=true, but only
for ON, not the FROM (not clear why that difference exists?)
* pg_get_indexdef_worker makes it conditional on prettyFlags &
PRETTYFLAG_SCHEMA for the ON
* pg_get_statisticsobj_worker does not handle pretty printing (always uses
generate_relation_name)
* make_ruledef makes it conditional on prettyFlags & PRETTYFLAG_SCHEMA for
the TO
* get_insert_query_def does not handle pretty printing (always uses
generate_relation_name)
* get_update_query_def does not handle pretty printing (always uses
generate_relation_name)
* get_delete_query_def does not handle pretty printing (always uses
generate_relation_name)
* get_rule_expr does not handle pretty printing (always uses
generate_relation_name)
* get_from_clause_item does not handle pretty printing (always uses
generate_relation_name)

Looking at that, it seems we didn't make the effort for the view related
code with all its complexity, and didn't do it for pg_get_statisticsobjdef
since it doesn't have a pretty flag. Why we didn't do it in
pg_get_triggerdef_worker for FROM isn't clear to me.

If we want to be entirely consistent (and keep supporting
PRETTYFLAG_SCHEMA), that probably means:

* Adding a pretty flag to pg_get_statisticsobjdef
* Teaching get_query_def to pass down prettyFlags to get_*_query_def
functions
* Update pg_get_triggerdef_worker to handle pretty for FROM as well

If that seems like a sensible direction I'd be happy to work on a patch.

Thanks,
Lukas

-- 
Lukas Fittl


Re: pg_get_constraintdef: Schema qualify foreign tables unless pretty printing is enabled

2022-08-09 Thread Tom Lane
Lukas Fittl  writes:
> Whilst debugging an issue with the output of pg_get_constraintdef, we've
> discovered that pg_get_constraintdef doesn't schema qualify foreign tables
> mentioned in the REFERENCES clause, even if pretty printing
> (PRETTYFLAG_SCHEMA) is turned off.

> This is a problem because it means there is no way to get a constraint
> definition that can be recreated on another system when multiple schemas
> are in use, but a different search_path is set. It's also different from
> pg_get_indexdef, where this flag is correctly respected.

I would say that pg_get_indexdef is the one that's out of step.
I count 11 calls of generate_relation_name in ruleutils.c,
of which only three have this business of being overridden
when not-pretty.  What is the rationale for that, and why
would we move pg_get_constraintdef from one category to the
other?

regards, tom lane




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-09 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 01:00:37PM -0700, Nathan Bossart wrote:
> Your adjustments in 0002 seem reasonable to me.  I think it makes sense to
> ensure there is test coverage for pg_lfind32(), but I don't know if that
> syscache code is the right choice.  For non-USE_SSE2 builds, it might make
> these lookups more expensive.  I'll look around to see if there are any
> other suitable candidates.

One option might be to create a small test module for pg_lfind32().  Here
is an example.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 0b15a0569318cc846980b7771ff809ef5d2d505c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:49:04 -0700
Subject: [PATCH v12 1/2] Introduce optimized routine for linear searches
 through an array of integers.

If SSE2 is available, this function uses it to speed up the search.  Otherwise,
it uses a simple 'for' loop.  This is a prerequisite for a follow-up commit
that will use this function to optimize [sub]xip lookups in
XidInMVCCSnapshot(), but it can be used anywhere that might benefit from such
an optimization.

It might be worthwhile to add an ARM-specific code path to this function in the
future.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor, Bharath Rupireddy, Masahiko Sawada
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/include/port/pg_lfind.h   | 103 ++
 src/test/modules/Makefile |   1 +
 src/test/modules/test_lfind/.gitignore|   4 +
 src/test/modules/test_lfind/Makefile  |  23 
 .../test_lfind/expected/test_lfind.out|  12 ++
 .../modules/test_lfind/sql/test_lfind.sql |   8 ++
 .../modules/test_lfind/test_lfind--1.0.sql|   8 ++
 src/test/modules/test_lfind/test_lfind.c  |  52 +
 .../modules/test_lfind/test_lfind.control |   4 +
 9 files changed, 215 insertions(+)
 create mode 100644 src/include/port/pg_lfind.h
 create mode 100644 src/test/modules/test_lfind/.gitignore
 create mode 100644 src/test/modules/test_lfind/Makefile
 create mode 100644 src/test/modules/test_lfind/expected/test_lfind.out
 create mode 100644 src/test/modules/test_lfind/sql/test_lfind.sql
 create mode 100644 src/test/modules/test_lfind/test_lfind--1.0.sql
 create mode 100644 src/test/modules/test_lfind/test_lfind.c
 create mode 100644 src/test/modules/test_lfind/test_lfind.control

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
new file mode 100644
index 00..fb125977b2
--- /dev/null
+++ b/src/include/port/pg_lfind.h
@@ -0,0 +1,103 @@
+/*-
+ *
+ * pg_lfind.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/include/port/pg_lfind.h
+ *
+ *-
+ */
+#ifndef PG_LFIND_H
+#define PG_LFIND_H
+
+#include "port/simd.h"
+
+/*
+ * pg_lfind32
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
+{
+	uint32		i = 0;
+
+	/* Use SIMD intrinsics where available. */
+#ifdef USE_SSE2
+
+	/*
+	 * A 16-byte register only has four 4-byte lanes. For better
+	 * instruction-level parallelism, each loop iteration operates on a block
+	 * of four registers. Testing has showed this is ~40% faster than using a
+	 * block of two registers.
+	 */
+	const		__m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+
+#if defined(USE_ASSERT_CHECKING)
+	bool		assert_result = false;
+
+	/* pre-compute the result for assert checking */
+	for (i = 0; i < nelem; i++)
+	{
+		if (key == base[i])
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif
+
+	for (i = 0; i < iterations; i += 16)
+	{
+		/* load the next block into 4 registers holding 4 values each */
+		const		__m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
+		const		__m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
+		const		__m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
+		const		__m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
+
+		/* compare each value to the key */
+		const		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		const		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		const		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		const		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* combine the results into a single variable */
+		const		__m128i tmp1 = _mm_or_si128(result1, result2);
+		const		__m128i tmp2 = _mm_or_si128(result3, result4);
+		const		__m128i result = _mm_or_si128(tmp1, tmp2);
+
+		/* see if there was a match */
+		if (_mm_movemask_epi8(result) != 0)
+		{
+#if defined(USE_ASSERT_CHECKING)
+			Assert(assert_result == true);
+#endif
+			return true;
+		

pg_get_constraintdef: Schema qualify foreign tables unless pretty printing is enabled

2022-08-09 Thread Lukas Fittl
Hi hackers,

Whilst debugging an issue with the output of pg_get_constraintdef, we've
discovered that pg_get_constraintdef doesn't schema qualify foreign tables
mentioned in the REFERENCES clause, even if pretty printing
(PRETTYFLAG_SCHEMA) is turned off.

This is a problem because it means there is no way to get a constraint
definition that can be recreated on another system when multiple schemas
are in use, but a different search_path is set. It's also different from
pg_get_indexdef, where this flag is correctly respected.

I assume this is an oversight, since the fix is pretty straightforward, see
attached patch. I'll register the patch for the next commitfest.

Here is a test case from my colleague Maciek showing this difference:

create schema s;
create table s.foo(a int primary key);
create table s.bar(a int primary key, b int references s.foo(a));

select pg_get_indexdef(indexrelid, 0, false) from pg_index order by
indexrelid desc limit 3;

pg_get_indexdef

---
 CREATE UNIQUE INDEX bar_pkey ON s.bar USING btree (a)
 CREATE UNIQUE INDEX foo_pkey ON s.foo USING btree (a)
 CREATE UNIQUE INDEX pg_toast_13593_index ON pg_toast.pg_toast_13593 USING
btree (chunk_id, chunk_seq)
(3 rows)

select pg_get_constraintdef(oid, false) from pg_constraint order by oid
desc limit 3;
   pg_get_constraintdef
---
 FOREIGN KEY (b) REFERENCES foo(a)
 PRIMARY KEY (a)
 PRIMARY KEY (a)
(3 rows)

Thanks,
Lukas

-- 
Lukas Fittl
From 83a1ab6081f70d0eed820b2b13bc3ab6e93af8ec Mon Sep 17 00:00:00 2001
From: Lukas Fittl 
Date: Tue, 9 Aug 2022 16:55:35 -0700
Subject: [PATCH v1] pg_get_constraintdef: Schema qualify foreign tables by
 default

This matches pg_get_constraintdef to behave the same way as
pg_get_indexdef, which is to schema qualify referenced objects in the
definition, unless pretty printing is explicitly requested.

For pretty printing the previous behaviour is retained, which is to only
include the schema information if the referenced object is not in the
current search path.
---
 src/backend/utils/adt/ruleutils.c | 5 +++--
 src/test/regress/expected/foreign_key.out | 8 
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index d575aa0066..b7a2a356b4 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -2249,8 +2249,9 @@ pg_get_constraintdef_worker(Oid constraintId, bool fullCommand,
 
 /* add foreign relation name */
 appendStringInfo(&buf, ") REFERENCES %s(",
- generate_relation_name(conForm->confrelid,
-		NIL));
+ (prettyFlags & PRETTYFLAG_SCHEMA) ?
+ generate_relation_name(conForm->confrelid, NIL) :
+ generate_qualified_relation_name(conForm->confrelid));
 
 /* Fetch and build referenced-column list */
 val = SysCacheGetAttr(CONSTROID, tup,
diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out
index da26f083bc..a70f7c2491 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -775,10 +775,10 @@ CREATE TABLE FKTABLE (
   FOREIGN KEY (tid, fk_id_del_set_default) REFERENCES PKTABLE ON DELETE SET DEFAULT (fk_id_del_set_default)
 );
 SELECT pg_get_constraintdef(oid) FROM pg_constraint WHERE conrelid = 'fktable'::regclass::oid ORDER BY oid;
-pg_get_constraintdef
-
- FOREIGN KEY (tid, fk_id_del_set_null) REFERENCES pktable(tid, id) ON DELETE SET NULL (fk_id_del_set_null)
- FOREIGN KEY (tid, fk_id_del_set_default) REFERENCES pktable(tid, id) ON DELETE SET DEFAULT (fk_id_del_set_default)
+   pg_get_constraintdef
+---
+ FOREIGN KEY (tid, fk_id_del_set_null) REFERENCES public.pktable(tid, id) ON DELETE SET NULL (fk_id_del_set_null)
+ FOREIGN KEY (tid, fk_id_del_set_default) REFERENCES public.pktable(tid, id) ON DELETE SET DEFAULT (fk_id_del_set_default)
 (2 rows)
 
 INSERT INTO PKTABLE VALUES (1, 0), (1, 1), (1, 2);
-- 
2.34.0



Re: Data is copied twice when specifying both child and parent table in publication

2022-08-09 Thread Peter Smith
Here are some more review comments for the HEAD_v8 patch:

==

1. Commit message

If there are two publications, one of them publish a parent table with
(publish_via_partition_root = true) and another publish child table,
subscribing to both publications from one subscription results in two initial
replications. It should only be copied once.

~

I took a 2nd look at that commit message and it seemed slightly
backwards to me - e.g. don't you really mean for the
'publish_via_partition_root' parameter to be used when publishing the
*child* table?

SUGGESTION
If there are two publications, one of them publishing a parent table
directly, and the other publishing a child table with
publish_via_partition_root = true, then subscribing to both those
publications from one subscription results in two initial
replications. It should only be copied once.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Reducing the chunk header sizes on all memory context types

2022-08-09 Thread David Rowley
Thanks for giving this a look.

On Wed, 10 Aug 2022 at 02:37, Robert Haas  wrote:
> # We also add a restriction that block sizes for all 3 of the memory
> # allocators cannot be 1GB or larger.  We would be unable to store the
> # number of bytes that the block is offset from the chunk stored beyond this
> # 1GB boundary on any block that was larger than 1GB.
>
> Earlier in the commit message, you say that allocations of 1GB or more
> are stored in dedicated blocks. But here you say that blocks can't be
> more than 1GB. Those statements seem to contradict each other. I guess
> you mean block sizes for blocks that contain chunks, or something like
> that?

I'll update that so it's more clear.

But, just to clarify here first, the 1GB restriction is just in
regards to the maxBlockSize parameter when creating a context.
Anything over set->allocChunkLimit goes on a dedicated block and there
is no 1GB size restriction on those dedicated blocks.

David




Re: Reducing the chunk header sizes on all memory context types

2022-08-09 Thread Tom Lane
David Rowley  writes:
> On Wed, 10 Aug 2022 at 09:23, Tom Lane  wrote:
>> Hmm, I suppose you mean we could reduce 4) if we needed to.  Yeah, that
>> seems like a reasonable place to buy more bits later if we run out of
>> MemoryContextMethodIDs.  Should be fine then.

> I think he means 3).  If 4) was reduced then that would further reduce
> the maxBlockSize we could pass when creating a context.  At least for
> aset.c and generation.c, we don't really need 3) to be 30-bits wide as
> the set->allocChunkLimit is almost certainly much smaller than that.

Oh, I see: we'd just be further constraining the size of chunk that
has to be pushed out as an "external" chunk.  Got it.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-09 Thread David Rowley
On Wed, 10 Aug 2022 at 09:23, Tom Lane  wrote:
>
> Andres Freund  writes:
> > On 2022-08-09 15:21:57 -0400, Tom Lane wrote:
> >> Do we really need it to be that tight?  I know we only have 3 methods 
> >> today,
> >> but 8 doesn't seem that far away.  If there were six bits reserved for
> >> this I'd be happier.
>
> > We only have so many bits available, so that'd have to come from some other
> > resource.  The current division is:
>
> > + * 1.  3-bits to indicate the MemoryContextMethodID
> > + * 2.  1-bit to indicate if the chunk is externally managed (see below)
> > + * 3.  30-bits for the amount of memory which was reserved for the chunk
> > + * 4.  30-bits for the number of bytes that must be subtracted from the chunk
> > + *     to obtain the address of the block that the chunk is stored on.
>
> > I suspect we could reduce 3) here a bit, which I think would end up with 
> > slab
> > context's max chunkSize shrinking further. Which should still be fine.
>
> Hmm, I suppose you mean we could reduce 4) if we needed to.  Yeah, that
> seems like a reasonable place to buy more bits later if we run out of
> MemoryContextMethodIDs.  Should be fine then.

I think he means 3).  If 4) was reduced then that would further reduce
the maxBlockSize we could pass when creating a context.  At least for
aset.c and generation.c, we don't really need 3) to be 30-bits wide as
the set->allocChunkLimit is almost certainly much smaller than that.
Allocations bigger than allocChunkLimit use a dedicated block with an
external chunk. External chunks don't use 3) or 4).

David




Re: Reducing the chunk header sizes on all memory context types

2022-08-09 Thread Tom Lane
Andres Freund  writes:
> On 2022-08-09 15:21:57 -0400, Tom Lane wrote:
>> Do we really need it to be that tight?  I know we only have 3 methods today,
>> but 8 doesn't seem that far away.  If there were six bits reserved for
>> this I'd be happier.

> We only have so many bits available, so that'd have to come from some other
> resource.  The current division is:

> + * 1.  3-bits to indicate the MemoryContextMethodID
> + * 2.  1-bit to indicate if the chunk is externally managed (see below)
> + * 3.  30-bits for the amount of memory which was reserved for the chunk
> + * 4.  30-bits for the number of bytes that must be subtracted from the chunk
> + *     to obtain the address of the block that the chunk is stored on.

> I suspect we could reduce 3) here a bit, which I think would end up with slab
> context's max chunkSize shrinking further. Which should still be fine.

Hmm, I suppose you mean we could reduce 4) if we needed to.  Yeah, that
seems like a reasonable place to buy more bits later if we run out of
MemoryContextMethodIDs.  Should be fine then.

regards, tom lane




Re: Reducing the chunk header sizes on all memory context types

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 15:21:57 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I think it's fine, given that we can change this at any time, but it's
> > probably worth to explicitly agree that this will for now restrict us to 8
> > context methods?
> 
> Do we really need it to be that tight?  I know we only have 3 methods today,
> but 8 doesn't seem that far away.  If there were six bits reserved for
> this I'd be happier.

We only have so many bits available, so that'd have to come from some other
resource.  The current division is:

+ * 1.  3-bits to indicate the MemoryContextMethodID
+ * 2.  1-bit to indicate if the chunk is externally managed (see below)
+ * 3.  30-bits for the amount of memory which was reserved for the chunk
+ * 4.  30-bits for the number of bytes that must be subtracted from the chunk
+ * to obtain the address of the block that the chunk is stored on.

I suspect we could reduce 3) here a bit, which I think would end up with slab
context's max chunkSize shrinking further. Which should still be fine.

But also, we could defer that to later, this is a limit that we can easily
change.
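
To illustrate the bit budget being discussed, here is a purely illustrative
encoding (the field order, names and helpers below are made up for this
sketch and are not the patch's actual layout):

typedef uint64 MemoryChunkHeader;	/* 8-byte header stored before each chunk */

#define CHUNK_METHODID_MASK		UINT64CONST(0x7)		/* 1. 3 bits */
#define CHUNK_EXTERNAL_BIT		(UINT64CONST(1) << 3)	/* 2. 1 bit */
#define CHUNK_VALUE_MASK		UINT64CONST(0x3FFFFFFF)	/* 30-bit fields */
#define CHUNK_SIZE_SHIFT		4						/* 3. reserved size, < 1GB */
#define CHUNK_BLOCKOFF_SHIFT	34						/* 4. offset back to the block, < 1GB */

static inline MemoryChunkHeader
chunk_header_encode(uint32 method_id, bool external,
					uint64 size, uint64 block_offset)
{
	return ((uint64) method_id & CHUNK_METHODID_MASK) |
		(external ? CHUNK_EXTERNAL_BIT : 0) |
		((size & CHUNK_VALUE_MASK) << CHUNK_SIZE_SHIFT) |
		((block_offset & CHUNK_VALUE_MASK) << CHUNK_BLOCKOFF_SHIFT);
}

static inline uint64
chunk_header_block_offset(MemoryChunkHeader hdr)
{
	return (hdr >> CHUNK_BLOCKOFF_SHIFT) & CHUNK_VALUE_MASK;
}

Widening 1. to six bits would have to come out of 3. or 4., shrinking the
corresponding sub-1GB limit, which is exactly the trade-off above.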


> >> # We also add a restriction that block sizes for all 3 of the memory
> >> # allocators cannot be 1GB or larger.  We would be unable to store the
> >> # number of bytes that the block is offset from the chunk stored beyond 
> >> this
> >> # 1GB boundary on any block that was larger than 1GB.
> 
> Losing MemoryContextAllocHuge would be very bad, so I assume this comment
> is not telling the full truth.

It's just talking about chunked allocation (except for slab, which doesn't
have anything else, but as David pointed out, it makes no sense to use slab
for allocations that large). I.e. it's the max you can pass to
AllocSetContextCreate()'s and GenerationContextCreate()'s maxBlockSize, and to
SlabContextCreate()'s chunkSize.   I don't think we have any code that
currently sets a bigger limit than 8MB.

Greetings,

Andres Freund




Re: SQL/JSON features for v15

2022-08-09 Thread Jonathan S. Katz

On 8/9/22 4:58 PM, Jonathan S. Katz wrote:

We're looking for additional input on what makes sense as a best course 
of action, given what is presented in[3].


Missed adding Amit on the CC.

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


SQL/JSON features for v15

2022-08-09 Thread Jonathan S. Katz

Hi,

(Personal hat, not RMT hat unless otherwise noted).

This thread[1] raised some concerns around the implementation of the 
SQL/JSON features that are slated for v15, which includes an outstanding 
open item[2]. Given the current state of the discussion, when the RMT 
met on Aug 8, they several options, readable here[3]. Given we are now 
into the later part of the release cycle, we need to make some decisions 
on how to proceed with this feature given the concerns raised.


Per additional discussion on the thread, the group wanted to provide 
more visibility into the discussion to get opinions on how to proceed 
for the v15 release.


Without rehashing the thread, the options presented were:

1. Fixing the concerns addressed in the thread around the v15 SQL/JSON 
features implementation, noting that this would likely entail at least 
one more beta release and would push the GA date past our normal timeframe.


2. Try to commit a subset of the features that caused less debate. This 
was ruled out.


3. Revert the v15 SQL/JSON features work.


Based on the current release timing and the open issues presented on the 
thread, the RMT has recommended reverting, but prefers to first drive 
consensus on next steps.



From a release advocacy standpoint, I need about 6 weeks lead time to 
put together the GA launch. We're at the point where I typically deliver 
a draft release announcement. From this, given this involves a high 
visibility feature, I would want some clarity on what option we would 
like to pursue. Once the announcement translation process has begun (and 
this is when we have consensus on the release announcement), it becomes 
more challenging to change things out.


From a personal standpoint (restating from[3]), I would like to see 
what we could do to include support for this batch of the SQL/JSON 
features in v15. What is included looks like it closes most of the gap 
on what we've been missing syntactically since the standard was adopted, 
and the JSON_TABLE work is very convenient for converting JSON data into 
a relational format. I believe having this feature set is important for 
maintaining standards compliance, interoperability, tooling support, and 
general usability. Plus, JSON still seems to be pretty popular.


We're looking for additional input on what makes sense as a best course 
of action, given what is presented in[3].


Thanks,

Jonathan

[1] 
https://www.postgresql.org/message-id/flat/20220616233130.rparivafipt6doj3%40alap3.anarazel.de

[2] https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items
[3] 
https://www.postgresql.org/message-id/787cef45-15de-8f1d-ed58-a1c1435bfc0e%40postgresql.org




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Andrew Dunstan


On 2022-08-09 Tu 16:19, Jonathan S. Katz wrote:
> On 8/9/22 4:15 PM, Andrew Dunstan wrote:
>>
>> On 2022-08-09 Tu 15:50, Jonathan S. Katz wrote:
>>> On 8/9/22 3:22 PM, Andres Freund wrote:
>
>>>
 OTOH, it's not a great sign  this is around json again...
>>>
>>> Yeah, I was thinking about that too.
>>
>>
>> Ouch :-(
>>
>> I think after 10 years of being involved with our JSON features, I'm
>> going to take a substantial break on that front.
>
> I hope that wasn't taken as a slight, but just an observation. There
> are other feature areas where I can make similar observations. All
> this work around a database system is challenging as there are many
> considerations that need to be made.
>
> You've done an awesome job driving the JSON work forward and it is
> greatly appreciated.
>
>


Thanks, I appreciate that (and I wasn't fishing for compliments). It's
more that I feel a bit tired of it ... some of my colleagues will
confirm that I've been saying this for a while, so it's not spurred by
this setback.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Jonathan S. Katz

On 8/9/22 4:15 PM, Andrew Dunstan wrote:


On 2022-08-09 Tu 15:50, Jonathan S. Katz wrote:

On 8/9/22 3:22 PM, Andres Freund wrote:





OTOH, it's not a great sign  this is around json again...


Yeah, I was thinking about that too.



Ouch :-(

I think after 10 years of being involved with our JSON features, I'm
going to take a substantial break on that front.


I hope that wasn't taken as a slight, but just an observation. There 
are other feature areas where I can make similar observations. All this 
work around a database system is challenging as there are many 
considerations that need to be made.


You've done an awesome job driving the JSON work forward and it is 
greatly appreciated.


Thanks,

Jonathan




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Andrew Dunstan


On 2022-08-09 Tu 15:50, Jonathan S. Katz wrote:
> On 8/9/22 3:22 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2022-08-09 15:17:44 -0400, Tom Lane wrote:
>>> We have delayed releases for $COOL_FEATURE in the past, and I think
>>> our batting average on that is still .000: not once has it worked out
>>> well.
>>
>> I think it semi worked when jsonb (?) first went in - it took a while
>> and a
>> lot of effort from a lot of people, but in the end we made it work,
>> and it was
>> a success from our user's perspectives, I think. 
>
> Yeah, this was the example I was thinking of. To continue with the
> baseball analogy, it was a home-run from a PR perspective, and I can
> say as a power user at the time, the 9.4 JSONB representation worked
> well for my use case. Certainly newer functionality has made JSON
> easier to work with in PG.
>
> (I can't remember what the 9.5 hold up was).
>
> The cases where we either delayed/punted on $COOL_FEATURE that cause
> me concern are the ones where we say "OK, well fix this in the next
> release" and we are then waiting, 2, 3, 4 releases for the work to be
> completed. And to be clear, I'm thinking of this as "known issues" vs.
> "iterating towards the whole solution".


Where we ended up with jsonb was a long way from where we started, but
technical difficulties were largely confined because it didn't involve
anything like the parser or the expression evaluation code. Here the SQL
Standards Committee has imposed a pretty substantial technical burden on
us and the code that Andres complains of is attempting to deal with that.


>
>> OTOH, it's not a great sign  this is around json again...
>
> Yeah, I was thinking about that too.


Ouch :-(

I think after 10 years of being involved with our JSON features, I'm
going to take a substantial break on that front.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: optimize lookups in snapshot [sub]xip arrays

2022-08-09 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 01:21:41PM +0700, John Naylor wrote:
> I decided I wasn't quite comfortable changing snapshot handling
> without further guarantees.  To this end, 0002 in the attached v11 is
> an addendum that adds assert checking (also pgindent and some
> comment-smithing). As I suspected, make check-world passes even with
> purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and
> I verified that sticking in the wrong answer will lead to a crash in
> assert-enabled builds in short order. I'd kind of like to throw this
> (or something else suitable) at the build farm first for that reason.
> It's simpler than the qsort/qunique/binary search that was there
> before, so that's nice, but I've not tried to test performance.

Your adjustments in 0002 seem reasonable to me.  I think it makes sense to
ensure there is test coverage for pg_lfind32(), but I don't know if that
syscache code is the right choice.  For non-USE_SSE2 builds, it might make
these lookups more expensive.  I'll look around to see if there are any
other suitable candidates.
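
For reference, a scalar sketch of the contract pg_lfind32() provides
(illustrative only, not the in-tree code, which adds a USE_SSE2 path that
compares several elements per loop iteration):

#include <stdbool.h>
#include <stdint.h>

/* Return true if key appears in the first nelem elements of base. */
static inline bool
lfind32_scalar(uint32_t key, const uint32_t *base, uint32_t nelem)
{
    for (uint32_t i = 0; i < nelem; i++)
    {
        if (base[i] == key)
            return true;
    }
    return false;
}

Presumably the interesting cases for test coverage are small arrays and
element counts that aren't a multiple of the vector width, since that is
where the SIMD tail handling differs from the plain loop.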

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Jonathan S. Katz

On 8/9/22 2:57 PM, Andres Freund wrote:

Hi,

On 2022-08-09 14:04:48 -0400, Jonathan S. Katz wrote:



    2. Recommend holding up the v15 release to allow for the code to be
redesigned and fixed (as based on Andres' estimates, this would push
the release out several months).


Obviously that's a question of the resources brought to bear.

 From my angle: I've obviously some of my own work to take care of as well, but
it's also just hard to improve something that includes a lot of undocumented
design decisions.


*nod*


    4. Revert the feature set and redesign and try to include for v16



Unless we decide on 4 immediately, I think it might be worth starting a
separate thread to get more attention. The subject doesn't necessarily have
everyone follow along.


*nod* I'll do this shortly.



Rereading the thread for the umpteenth time, I have seen Amit working
through Andres' concerns. From my read, the ones that seem pressing are:

* Lack of design documentation, which may be leading to some of the general
design concerns



* The use of substransactions within the executor, though docs explaining
the decisions on that could alleviate it (I realize this is a big topic and
any summary I give won't capture the full nuance)


I don't think subtransactions per-se are a fundamental problem. I think the
error handling implementation is ridiculously complicated, and while I started
to hack on improving it, I stopped when I really couldn't understand what
errors it actually needs to handle when and why.


Ah, thanks for the clarification. That makes sense.


* Debate on how to handle the type coercions


That's partially related to the error handling issue above.

One way this code could be drastically simplified is to force all
type-coercions to go through the "io coercion" path, which could be
implemented as a single execution step (which thus could trivially
start/finish a subtransaction) and would remove a lot of the complicated code
around coercions.


If we went down this path, would this make you feel more comfortable 
with including this work in the v15 release?


Thanks,

Jonathan




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Jonathan S. Katz

On 8/9/22 3:22 PM, Andres Freund wrote:

Hi,

On 2022-08-09 15:17:44 -0400, Tom Lane wrote:

We have delayed releases for $COOL_FEATURE in the past, and I think
our batting average on that is still .000: not once has it worked out
well.


I think it semi worked when jsonb (?) first went in - it took a while and a
lot of effort from a lot of people, but in the end we made it work, and it was
a success from our user's perspectives, I think. 


Yeah, this was the example I was thinking of. To continue with the 
baseball analogy, it was a home-run from a PR perspective, and I can say 
as a power user at the time, the 9.4 JSONB representation worked well 
for my use case. Certainly newer functionality has made JSON easier to 
work with in PG.


(I can't remember what the 9.5 hold up was).

The cases where we either delayed/punted on $COOL_FEATURE that cause me 
concern are the ones where we say "OK, well fix this in the next 
release" and we are then waiting, 2, 3, 4 releases for the work to be 
completed. And to be clear, I'm thinking of this as "known issues" vs. 
"iterating towards the whole solution".



OTOH, it's not a great sign  this is around json again...


Yeah, I was thinking about that too.

Per Andres comment upthread, let's open a new thread to discuss the 
SQL/JSON + v15 topic to improve visibility and get more feedback. I can 
do that shortly.


We can continue with the technical discussion in here.

Jonathan




Re: Reducing the chunk header sizes on all memory context types

2022-08-09 Thread Alvaro Herrera
On 2022-Aug-09, Andres Freund wrote:

> Mildly wondering whether we ought to use designated initializers instead,
> given we're whacking it around already. Too easy to get the order wrong when
> adding new members, and we might want to have optional callbacks too.

Strong +1.  It makes code much easier to navigate (see XmlTableRoutine
and compare with heapam_methods, for example).
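
For anyone following along, the difference being suggested looks roughly like
this (illustrative names only, not the actual mcxt.c declarations):

#include <stddef.h>

typedef struct ExampleMethods
{
    void   *(*alloc) (void *context, size_t size);
    void    (*free_p) (void *pointer);
    void   *(*realloc) (void *pointer, size_t size);
    void    (*check) (void *context);       /* optional callback */
} ExampleMethods;

extern void *ExampleAlloc(void *context, size_t size);
extern void ExampleFree(void *pointer);
extern void *ExampleRealloc(void *pointer, size_t size);

static const ExampleMethods example_methods[] = {
    /* positional style: silently breaks if members are reordered or added */
    /* {ExampleAlloc, ExampleFree, ExampleRealloc, NULL}, */

    /* designated style: order-independent, omitted members default to NULL */
    {
        .alloc = ExampleAlloc,
        .free_p = ExampleFree,
        .realloc = ExampleRealloc,
        /* .check intentionally left unset */
    },
};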

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Subversion to GIT: the shortest path to happiness I've ever heard of
(Alexey Klyukin)




Re: moving basebackup code to its own directory

2022-08-09 Thread Robert Haas
On Tue, Aug 9, 2022 at 2:40 PM Justin Pryzby  wrote:
> It looks like this updates the header comments in the .h files but not the .c
> files.
>
> Personally, I find these to be silly boilerplate ..

Here is a version with some updates to the silly boilerplate.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Move-basebackup-code-to-new-directory-src-backend.patch
Description: Binary data


Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 15:17:44 -0400, Tom Lane wrote:
> We have delayed releases for $COOL_FEATURE in the past, and I think
> our batting average on that is still .000: not once has it worked out
> well.

I think it semi worked when jsonb (?) first went in - it took a while and a
lot of effort from a lot of people, but in the end we made it work, and it was
a success from our user's perspectives, I think. OTOH, it's not a great sign
this is around json again...

- Andres




Re: Reducing the chunk header sizes on all memory context types

2022-08-09 Thread Tom Lane
Andres Freund  writes:
> I think it's fine, given that we can change this at any time, but it's
> probably worth explicitly agreeing that this will for now restrict us to 8
> context methods?

Do we really need it to be that tight?  I know we only have 3 methods today,
but 8 doesn't seem that far away.  If there were six bits reserved for
this I'd be happier.

>> # We also add a restriction that block sizes for all 3 of the memory
>> # allocators cannot be 1GB or larger.  We would be unable to store the
>> # number of bytes that the block is offset from the chunk stored beyond this
>> #1GB boundary on any block that was larger than 1GB.

Losing MemoryContextAllocHuge would be very bad, so I assume this comment
is not telling the full truth.

regards, tom lane




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Tom Lane
"Jonathan S. Katz"  writes:
> Speaking personally, I would like to see what we could do to include 
> support for this batch of the SQL/JSON features in v15. What is included 
> looks like it closes most of the gap on what we've been missing 
> syntactically since the standard was adopted, and the JSON_TABLE work is 
> very convenient for converting JSON data into a relational format. I 
> believe having this feature set is important for maintaining standards 
> compliance, interoperability, tooling support, and general usability. 
> Plus, JSON still seems to be pretty popular :)
> ...
> I hope that these can be addressed satisfactorily in a reasonable (read: 
> now a much shorter) timeframe so we can include the SQL/JSON work in v15.

We have delayed releases for $COOL_FEATURE in the past, and I think
our batting average on that is still .000: not once has it worked out
well.  I think we're better off getting the pain over with quickly,
so I regretfully vote for revert.  And for a full redesign/rewrite
before we try again; based on Andres' comments, it needs that.

regards, tom lane




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 14:04:48 -0400, Jonathan S. Katz wrote:
> On 8/9/22 11:03 AM, Andrew Dunstan wrote:
> > 
> > On 2022-08-09 Tu 09:59, Jonathan S. Katz wrote:
> 
> > > The RMT met today to discuss the state of this open item surrounding
> > > the SQL/JSON feature set. We discussed the specific concerns raised
> > > about the code and debated four different options:
> > > 
> > >    1. Do nothing and continue with the community process of stabilizing
> > > the code without significant redesign
> > > 
> > >    2. Recommend holding up the v15 release to allow for the code to be
> > > redesigned and fixed (as based on Andres' estimates, this would push
> > > the release out several months).

Obviously that's a question of the resources brought to bear.

From my angle: I've obviously some of my own work to take care of as well, but
it's also just hard to improve something that includes a lot of undocumented
design decisions.


> > >    3. Revert the problematic parts of the code but try to include some
> > > of the features in the v15 release (e.g. JSON_TABLE)

> > I very much doubt option 3 is feasible. The parts that are controversial
> > go back at least in part to the first patches of the series. Trying to
> > salvage anything would almost certainly be more disruptive than trying
> > to fix it.

Agreed.


> > >    4. Revert the feature set and redesign and try to include for v16
> > > 
> > > Based on the concerns raised, the RMT is recommending option #4, to
> > > revert the SQL/JSON changes for v15, and come back with a redesign for
> > > v16.
> > > 
> > > If folks think there are some bits we can include in v15, we can
> > > consider option #3. (Personally, I would like to see if we can keep
> > > JSON_TABLE, but if there is too much overlap with the problematic
> > > portions of the code I am fine with waiting for v16).
> > > 
> > > At this stage in the release process coupled with the concerns, we're
> > > a bit worried about introducing changes that are unpredictable in
> > > terms of stability and maintenance. We also do not want to hold up the
> > > release while this feature set goes through a redesign without
> > > agreement on what such a design would look like as well as a timeline.
> > > 
> > > Note that the above are the RMT's recommendations; while the RMT can
> > > explicitly call for a revert, we want to first guide the discussion on
> > > the best path forward knowing the challenges for including these
> > > features in v15.

Unless we decide on 4 immediately, I think it might be worth starting a
separate thread to get more attention. The subject doesn't necessarily have
everyone follow along.



> > I'm not sure what the danger is to stability, performance or correctness
> > in applying the changes Amit has proposed for release 15. But if that
> > danger is judged to be too great then I agree we should revert.

My primary problem is that as-is the code is nearly unmaintainable. It's hard
for Amit to fix that, given that he's not one of the original authors.


> Rereading the thread for the umpteenth time, I have seen Amit working
> through Andres' concerns. From my read, the ones that seem pressing are:
> 
> * Lack of design documentation, which may be leading to some of the general
> design concerns

> * The use of substransactions within the executor, though docs explaining
> the decisions on that could alleviate it (I realize this is a big topic and
> any summary I give won't capture the full nuance)

I don't think subtransactions per-se are a fundamental problem. I think the
error handling implementation is ridiculously complicated, and while I started
to hack on improving it, I stopped when I really couldn't understand what
errors it actually needs to handle when and why.


> * Debate on how to handle the type coercions

That's partially related to the error handling issue above.

One way this code could be drastically simplified is to force all
type-coercions to go through the "io coercion" path, which could be
implemented as a single execution step (which thus could trivially
start/finish a subtransaction) and would remove a lot of the complicated code
around coercions.
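
To sketch what that could look like (a rough illustration of the idea, not
the patch's code, with error handling reduced to "coercion failure becomes
NULL"):

#include "postgres.h"

#include "access/xact.h"
#include "fmgr.h"
#include "utils/lsyscache.h"
#include "utils/resowner.h"

/*
 * Coerce 'srcval' of type 'srctype' to 'dsttype' via the text I/O functions,
 * inside a subtransaction, so that a failed coercion yields NULL instead of
 * raising an error.  A single execution step could call something shaped
 * like this.
 */
static Datum
coerce_via_io_safe(Datum srcval, Oid srctype,
                   Oid dsttype, int32 dsttypmod, bool *isnull)
{
    Oid         outfunc;
    bool        outisvarlena;
    Oid         infunc;
    Oid         intypioparam;
    char       *str;
    Datum       result = (Datum) 0;
    MemoryContext oldcxt = CurrentMemoryContext;
    ResourceOwner oldowner = CurrentResourceOwner;

    getTypeOutputInfo(srctype, &outfunc, &outisvarlena);
    getTypeInputInfo(dsttype, &infunc, &intypioparam);

    str = OidOutputFunctionCall(outfunc, srcval);

    BeginInternalSubTransaction(NULL);
    MemoryContextSwitchTo(oldcxt);

    PG_TRY();
    {
        result = OidInputFunctionCall(infunc, str, intypioparam, dsttypmod);
        *isnull = false;
        ReleaseCurrentSubTransaction();
    }
    PG_CATCH();
    {
        /* discard the error and map it to SQL NULL */
        MemoryContextSwitchTo(oldcxt);
        FlushErrorState();
        RollbackAndReleaseCurrentSubTransaction();
        *isnull = true;
    }
    PG_END_TRY();

    MemoryContextSwitchTo(oldcxt);
    CurrentResourceOwner = oldowner;

    return result;
}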


> > I should add that I'm very grateful to Amit for his work, and I'm sure
> > it isn't wasted effort, whatever the decision.
> 
> +1. While I've been quiet on this thread to date, I have definitely seen
> Amit working hard on addressing the concerns.

+1

Greetings,

Andres Freund




Re: moving basebackup code to its own directory

2022-08-09 Thread David Steele

On 8/9/22 14:40, Justin Pryzby wrote:

On Tue, Aug 09, 2022 at 01:32:49PM -0400, Robert Haas wrote:

On Tue, Aug 9, 2022 at 12:43 PM Magnus Hagander  wrote:

So maybe src/backend/backup? Or is that too grandiose for the amount
of stuff we have here?


+1 for src/backend/backup. I'd also be happy to see the start/stop code
move here at some point.


Yeah, sounds reasonable. There's never an optimal source code layout, but I 
agree this one is better than putting it under replication.


OK, here's a patch.


It looks like this updates the header comments in the .h files but not the .c
files.

Personally, I find these to be silly boilerplate ..


Good catch. I did not notice that just looking at the diff.

Definitely agree that repeating the filename in the top comment is 
mostly useless, but that seems like a separate conversation.


-David




Re: Reducing the chunk header sizes on all memory context types

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 10:36:57 -0400, Robert Haas wrote:
> On Tue, Aug 9, 2022 at 8:53 AM David Rowley  wrote:
> > I think the patch is now starting to take shape.  I've added it to the
> > September commitfest [1].
> 
> This is extremely cool. The memory savings are really nice.

+1


> And I also like this:
> 
> # Additionally, this commit completely changes the rule that pointers must
> # be directly prefixed by the owning memory context and instead, we now
> # insist that they're directly prefixed by an 8-byte value where the least
> # significant 3-bits are set to a value to indicate which type of memory
> # context the pointer belongs to.  Using those 3 bits as an index to a new
> # array which stores the methods for each memory context type, we're now
> # able to pass the pointer given to functions such as pfree and repalloc to
> # the function specific to that context implementation to allow them to
> # devise their own methods of finding the memory context which owns the
> # given allocated chunk of memory.
> 
> That seems like a good system.

I'm obviously biased, but I agree.
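
In other words, a freeing path along these lines (hypothetical names, just to
illustrate the dispatch; the real lookup lives in mcxt.c):

#include <stdint.h>

typedef struct ExampleMcxtMethods
{
    void    (*free_p) (void *pointer);
    /* ... other callbacks ... */
} ExampleMcxtMethods;

/* one slot per 3-bit MemoryContextMethodID */
extern const ExampleMcxtMethods example_mcxt_methods[8];

static inline void
example_pfree(void *pointer)
{
    /* the 8-byte value directly precedes the chunk the caller was given */
    uint64_t    hdr = ((const uint64_t *) pointer)[-1];
    unsigned    method_id = (unsigned) (hdr & 0x7);   /* low 3 bits */

    example_mcxt_methods[method_id].free_p(pointer);
}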

I think it's fine, given that we can change this at any time, but it's
probably worth explicitly agreeing that this will for now restrict us to 8
context methods?


> This part of the commit message might need to be clarified:
> 
> # We also add a restriction that block sizes for all 3 of the memory
> # allocators cannot be 1GB or larger.  We would be unable to store the
> # number of bytes that the block is offset from the chunk stored beyond this
> #1GB boundary on any block that was larger than 1GB.
> 
> Earlier in the commit message, you say that allocations of 1GB or more
> are stored in dedicated blocks. But here you say that blocks can't be
> more than 1GB. Those statements seem to contradict each other. I guess
> you mean block sizes for blocks that contain chunks, or something like
> that?

I would guess so as well.


> diff --git a/src/include/utils/memutils_internal.h 
> b/src/include/utils/memutils_internal.h
> new file mode 100644
> index 00..2dcfdd7ec3
> --- /dev/null
> +++ b/src/include/utils/memutils_internal.h
> @@ -0,0 +1,117 @@
> +/*-
> + *
> + * memutils_internal.h
> + * This file contains declarations for memory allocation utility
> + * functions for internal use.
> + *
> + *
> + * Portions Copyright (c) 2022, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + * src/include/utils/memutils_internal.h
> + *
> + *-
> + */
> +
> +#ifndef MEMUTILS_INTERNAL_H
> +#define MEMUTILS_INTERNAL_H
> +
> +#include "utils/memutils.h"
> +
> +extern void *AllocSetAlloc(MemoryContext context, Size size);
> +extern void AllocSetFree(void *pointer);
> [much more]

I really wish I knew of a technique to avoid this kind of thing, something
that would allow filling a constant array from different translation units...
On the linker level
that should be trivial, but I don't think there's a C way to reach that.


> +/*
> + * MemoryContextMethodID
> + *   A unique identifier for each MemoryContext implementation which
> + *   indicates the index into the mcxt_methods[] array. See mcxt.c.
> + */
> +typedef enum MemoryContextMethodID
> +{
> + MCTX_ASET_ID = 0,

Is there a reason to reserve 0 here? Practically speaking the 8-byte header
will always contain not just zeroes, but I don't think the patch currently
enforces that.  It's probably not worth wasting a "valuable" entry here...


> diff --git a/src/include/utils/memutils_memorychunk.h 
> b/src/include/utils/memutils_memorychunk.h
> new file mode 100644
> index 00..6239cf9008
> --- /dev/null
> +++ b/src/include/utils/memutils_memorychunk.h
> @@ -0,0 +1,185 @@
> +/*-
> + *
> + * memutils_memorychunk.h
> + * Here we define a struct named MemoryChunk which implementations of
> + * MemoryContexts may use as a header for chunks of memory they allocate.
> + *
> + * A MemoryChunk provides a lightweight header which a MemoryContext can use
> + * to store the size of an allocation and a reference back to the block which
> + * the given chunk is allocated on.
> + *
> + * Although MemoryChunks are used by each of our MemoryContexts, other
> + * implementations may choose to implement their own method for storing chunk
> + * headers.  The only requirement is that the header end with an 8-byte value
> + * which the least significant 3-bits of are set to the MemoryContextMethodID
> + * of the given context.

Well, there can't be implementations other than ours. So maybe phrase it
as "future implementations"?


> + * By default, a MemoryChunk is 8 bytes in size, however when
> + * MEMORY_CONTEXT_CHECKING is defined the header becomes 16 bytes in size due
> + * to the 

Re: moving basebackup code to its own directory

2022-08-09 Thread Justin Pryzby
On Tue, Aug 09, 2022 at 01:32:49PM -0400, Robert Haas wrote:
> On Tue, Aug 9, 2022 at 12:43 PM Magnus Hagander  wrote:
> >> > So maybe src/backend/backup? Or is that too grandiose for the amount
> >> > of stuff we have here?
> >>
> >> +1 for src/backend/backup. I'd also be happy to see the start/stop code
> >> move here at some point.
> >
> > Yeah, sounds reasonable. There's never an optimal source code layout, but I 
> > agree this one is better than putting it under replication.
> 
> OK, here's a patch.

It looks like this updates the header comments in the .h files but not the .c
files.

Personally, I find these to be silly boilerplate ..

-- 
Justin




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-08-09 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 05:26:39PM +1200, Thomas Munro wrote:
> The changes were not from elog-LOG to elog-ERROR, they were changes
> from silently eating errors, but I take your (plural) general point.

I think this was in reference to 0002.  My comment was, at least.

> OK, so there aren't many places in 0001 that error out where we
> previously did not.
> 
> For the one in CheckPointLogicalRewriteHeap(), if it fails, you don't
> just want to skip -- it might be in the range that we need to fsync().
> So what if we just deleted that get_dirent_type() != PGFILETYPE_REG
> check altogether?  Depending on the name range check, we'd either
> attempt to unlink() and LOG if that fails, or we'd attempt to
> open()-or-ERROR and then fsync()-or-PANIC.

It looks like CheckPointLogicalRewriteHeap() presently ERRORs when unlink()
fails.  However, CheckPointSnapBuild() only LOGs when unlink() fails.
Since mappings files contain transaction IDs, persistent unlink() failures
might lead to wraparound, but I don't think failing checkpoints will help
improve matters.

Bharath's original patch changed CheckPointLogicalRewriteHeap() to LOG on
sscanf() and unlink() failures.  IIUC things would work the way you suggest
if that was applied.

> What functionality have we
> lost without the DT_REG check?  Well, now if someone ill-advisedly
> puts an important character device, socket, symlink (ie any kind of
> non-DT_REG) into that directory with a name conforming to the
> unlinkable range, we'll try to unlink it and possibly succeed, where
> previously we'd have skipped, and if it's in the checkpointable range,
> we'd try to fsync() it and likely fail, which is good because this is
> supposed to be a checkpoint.

I don't know that I agree that the last part is good.  Presumably we don't
want checkpointing to fail forever because there's a random non-regular
file that can't be fsync'd.  But it's not clear why such files would exist
in the first place.  Presumably this isn't the sort of thing that Postgres
should be expected to support.

> In other words, if we're going to LOG and press on, maybe we should
> just press on anyway.  Would the error message be any less
> informative?  Would the risk of unlinking a character device that you
> accidentally stored in
> /clusters/pg16-main/pgdata/pg_logical/mappings/map-1234-5678 be a
> problem for anyone?

Probably not.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Jonathan S. Katz

On 8/9/22 11:03 AM, Andrew Dunstan wrote:


On 2022-08-09 Tu 09:59, Jonathan S. Katz wrote:



The RMT met today to discuss the state of this open item surrounding
the SQL/JSON feature set. We discussed the specific concerns raised
about the code and debated four different options:

   1. Do nothing and continue with the community process of stabilizing
the code without significant redesign

   2. Recommend holding up the v15 release to allow for the code to be
redesigned and fixed (as based on Andres' estimates, this would push
the release out several months).

   3. Revert the problematic parts of the code but try to include some
of the features in the v15 release (e.g. JSON_TABLE)

   4. Revert the feature set and redesign and try to include for v16

Based on the concerns raised, the RMT is recommending option #4, to
revert the SQL/JSON changes for v15, and come back with a redesign for
v16.

If folks think there are some bits we can include in v15, we can
consider option #3. (Personally, I would like to see if we can keep
JSON_TABLE, but if there is too much overlap with the problematic
portions of the code I am fine with waiting for v16).

At this stage in the release process coupled with the concerns, we're
a bit worried about introducing changes that are unpredictable in
terms of stability and maintenance. We also do not want to hold up the
release while this feature set goes through a redesign without
agreement on what such a design would look like as well as a timeline.

Note that the above are the RMT's recommendations; while the RMT can
explicitly call for a revert, we want to first guide the discussion on
the best path forward knowing the challenges for including these
features in v15.




I very much doubt option 3 is feasible. The parts that are controversial
go back at least in part to the first patches of the series. Trying to
salvage anything would almost certainly be more disruptive than trying
to fix it.

I'm not sure what the danger is to stability, performance or correctness
in applying the changes Amit has proposed for release 15. But if that
danger is judged to be too great then I agree we should revert.


Speaking personally, I would like to see what we could do to include 
support for this batch of the SQL/JSON features in v15. What is included 
looks like it closes most of the gap on what we've been missing 
syntactically since the standard was adopted, and the JSON_TABLE work is 
very convenient for converting JSON data into a relational format. I 
believe having this feature set is important for maintaining standards 
compliance, interoperability, tooling support, and general usability. 
Plus, JSON still seems to be pretty popular :)


Rereading the thread for the umpteenth time, I have seen Amit working 
through Andres' concerns. From my read, the ones that seem pressing are:


* Lack of design documentation, which may be leading to some of the 
general design concerns
* The use of substransactions within the executor, though docs 
explaining the decisions on that could alleviate it (I realize this is a 
big topic and any summary I give won't capture the full nuance)

* Debate on how to handle the type coercions

(Please correct me if I've missed anything).

I hope that these can be addressed satisfactorily in a reasonable (read: 
now a much shorter) timeframe so we can include the SQL/JSON work in v15.


With my RMT hat on, the issue is that we're now at beta 3 and we still 
have not reached a resolution on this open item. Even if it were 
committed tomorrow, we would definitely need a beta 4, and we would want 
to let the code bake a bit more to ensure we get adequate test coverage 
on it. This would likely put the release date into October, presuming we 
have not found any other issues that could cause a release delay.


With my advocacy hat on, we're at the point in the release cycle where I 
kick off the GA release process (e.g. announcement drafting). Not 
knowing the final status of a feature that's likely to be highlighted 
makes it difficult to write said announcement as well as kick off the other 
machinery (e.g. translations). If there is at least a decision on next 
steps, I can adjust the GA release process timeline.



I should add that I'm very grateful to Amit for his work, and I'm sure
it isn't wasted effort, whatever the decision.


+1. While I've been quiet on this thread to date, I have definitely seen 
Amit working hard on addressing the concerns.


Thanks,

Jonathan




Re: moving basebackup code to its own directory

2022-08-09 Thread David Steele

On 8/9/22 13:32, Robert Haas wrote:

On Tue, Aug 9, 2022 at 12:43 PM Magnus Hagander  wrote:

So maybe src/backend/backup? Or is that too grandiose for the amount
of stuff we have here?


+1 for src/backend/backup. I'd also be happy to see the start/stop code
move here at some point.


Yeah, sounds reasonable. There's never an optimal source code layout, but I 
agree this one is better than putting it under replication.


OK, here's a patch.


This looks good to me.

Regards,
-David




Re: dropping datumSort field

2022-08-09 Thread Robert Haas
On Tue, Aug 9, 2022 at 11:42 AM Zhihong Yu  wrote:
> Though the datumSort field may be harmless for now, in the future, the (auto) 
> alignment padding may not work if more fields are added to the struct.
> I think we should always leave some room in structs for future expansion.

I doubt the size of this struct is particularly important, unlike
ExprEvalStep which needs to be small. But if it turns out in the
future that we need to try to squeeze this struct into fewer bytes, we
can always do something like this then. Right now there's no obvious
point to it.

Sure, it might be valuable *if* we add more fields to the struct and
*if* that means that the byte taken up by this flag actually makes the
struct bigger and *if* the size of the struct is demonstrated to be a
performance problem. But right now none of that has happened, and
maybe none of it will ever happen.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: moving basebackup code to its own directory

2022-08-09 Thread Robert Haas
On Tue, Aug 9, 2022 at 12:43 PM Magnus Hagander  wrote:
>> > So maybe src/backend/backup? Or is that too grandiose for the amount
>> > of stuff we have here?
>>
>> +1 for src/backend/backup. I'd also be happy to see the start/stop code
>> move here at some point.
>
> Yeah, sounds reasonable. There's never an optimal source code layout, but I 
> agree this one is better than putting it under replication.

OK, here's a patch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Move-basebackup-code-to-new-directory-src-backend.patch
Description: Binary data


Re: shared-memory based stats collector - v70

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 17:24:35 +0900, Kyotaro Horiguchi wrote:
> At Mon, 8 Aug 2022 11:53:19 -0400, Greg Stark  wrote in
> > I'm trying to wrap my head around the shared memory stats collector
> > infrastructure from
> > <20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in
> > 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.
> >
> > I have one question about locking -- afaics there's nothing protecting
> > reading the shared memory stats. There is an lwlock protecting
> > concurrent updates of the shared memory stats, but that lock isn't
> > taken when we read the stats. Are we ok relying on atomic 64-bit reads
> > or is there something else going on that I'm missing?

Yes, that's not right. Not sure how it ended up that way. There was a lot of
refactoring and pushing down the locking into different places, I guess it got
lost somewhere on the way :(. It's unlikely to be a large problem, but we
should fix it.


> If I'm not missing something, it's strange that pgstat_lock_entry()
> only takes LW_EXCLUSIVE.

I think it makes some sense, given that there's a larger number of callers for
that in various stats-emitting code. Perhaps we could just add a separate
function with a _shared() suffix?


> The attached changes the interface of
> pgstat_lock_entry() but there's only one user since another read of
> shared stats entry is not using reference. Thus the interface change
> might be too much. If I just add bare LWLockAcquire/Release() to
> pgstat_fetch_entry, the amount of the patch will be reduced.

Could you try the pgstat_lock_entry_shared() approach?
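
A sketch of what such a variant could look like (the member names
shared_stats and lock are assumptions about the pgstat_internal.h structs,
so adjust to whatever the entry ref actually exposes):

#include "postgres.h"

#include "storage/lwlock.h"
#include "utils/pgstat_internal.h"

/* Acquire the stats entry's lock in shared mode, for readers. */
bool
pgstat_lock_entry_shared(PgStat_EntryRef *entry_ref, bool nowait)
{
    LWLock     *lock = &entry_ref->shared_stats->lock;

    if (nowait)
        return LWLockConditionalAcquire(lock, LW_SHARED);

    LWLockAcquire(lock, LW_SHARED);
    return true;
}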

Greetings,

Andres Freund




Re: shared-memory based stats collector - v70

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 12:00:46 -0400, Greg Stark wrote:
> I was more aiming at a C function that extensions could use directly
> rather than an SQL function -- though I suppose having the former it
> would be simple enough to implement the latter using it. (though it
> would have to be one for each stat type I guess)

I think such a C extension could exist today, without patching core code? It'd
be a bit ugly to include pgstat_internal.h, I guess, but other than that...


> The reason I want a C function is I'm trying to get as far as I can
> without a connection to a database, without a transaction, without
> accessing the catalog, and as much as possible without taking locks.

I assume you don't include lwlocks under locks?


> I think this is important for making monitoring highly reliable and low
> impact on production.

I'm doubtful about that, but whatever.


> The main problem with my current code is that I'm accessing the shared
> memory hash table directly. This means the I'm possibly introducing
> locking contention on the shared memory hash table.

I don't think that's a large enough issue to worry about unless you're
polling at a very high rate, which'd be a bad idea in itself. If a backend
can't get the lock for some stats change it'll defer flushing the stats a bit,
so it'll not cause a lot of other problems.


> I'm thinking of separating the shared memory hash scan from the metric scan
> so the list can be quickly built minimizing the time the lock is held.

I'd really really want to see some evidence that any sort of complexity here
is worth it.


> I have a few things I would like to suggest for future improvements to
> this infrastructure. I haven't polished the details of it yet but the
> main thing I think I'm missing is the catalog name for the object. I
> don't want to have to fetch it from the catalog and in any case I
> think it would generally be useful and might regularize the
> replication slot handling too.

I'm *dead* set against including catalog names in shared memory stats. That'll
add a good amount of memory usage and complexity, without any sort of
commensurate gain.


> I also think it would be nice to have a change counter for every stat
> object, or perhaps a change time. Prometheus wouldn't be able to make
> use of it but other monitoring software might be able to receive only
> metrics that have changed since the last update which would really
> help on databases with large numbers of mostly static objects.

I think you're proposing adding overhead that doesn't even have a real user.

Greetings,

Andres Freund




Re: moving basebackup code to its own directory

2022-08-09 Thread Magnus Hagander
On Tue, Aug 9, 2022 at 6:41 PM David Steele  wrote:

> On 8/9/22 12:34, Robert Haas wrote:
> > On Tue, Aug 9, 2022 at 12:12 PM Magnus Hagander 
> wrote:
> >
> >> Anyway, I have no objection. If there'd been that many files, or plans
> to have it, in the beginning we probably would've put them in
> replication/basebackup or something like that from the beginning. I'm not
> sure how much it's worth doing wrt effects on backpatching etc, but if
> we're planning to add even more files in the future, the pain will just
> become bigger once we eventually do it...
> >
> > Right.
> >
> > It's not exactly clear to me what the optimal source code layout is
> > here. I think the placement here is under src/backend/replication
> > because the functionality is accessed via the replication protocol,
> > but I'm not sure if all backup-related code we ever add will be
> > related to the replication protocol. As a thought experiment, imagine
> > a background worker that triggers a backup periodically, or a
> > monitoring view that tells you about the status of your last 10 backup
> > attempts, or an in-memory hash table that tracks which files have been
> > modified since the last backup. I'm not planning on implementing any
> > of those things specifically, but I guess I'm a little concerned that
> > if we just do the obvious thing of src/backend/replication/backup it's
> > going to end up being a little awkward if I or anyone else want to
> > add backup-related code that isn't specifically about the replication
> > protocol.
> >
> > So maybe src/backend/backup? Or is that too grandiose for the amount
> > of stuff we have here?
>
> +1 for src/backend/backup. I'd also be happy to see the start/stop code
> move here at some point.
>

Yeah, sounds reasonable. There's never an optimal source code layout, but I
agree this one is better than putting it under replication.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: moving basebackup code to its own directory

2022-08-09 Thread David Steele

On 8/9/22 12:34, Robert Haas wrote:

On Tue, Aug 9, 2022 at 12:12 PM Magnus Hagander  wrote:


Anyway, I have no objection. If there'd been that many files, or plans to have 
it, in the beginning we probably would've put them in replication/basebackup or 
something like that from the beginning. I'm not sure how much it's worth doing 
wrt effects on backpatching etc, but if we're planning to add even more files 
in the future, the pain will just become bigger once we eventually do it...


Right.

It's not exactly clear to me what the optimal source code layout is
here. I think the placement here is under src/backend/replication
because the functionality is accessed via the replication protocol,
but I'm not sure if all backup-related code we ever add will be
related to the replication protocol. As a thought experiment, imagine
a background worker that triggers a backup periodically, or a
monitoring view that tells you about the status of your last 10 backup
attempts, or an in-memory hash table that tracks which files have been
modified since the last backup. I'm not planning on implementing any
of those things specifically, but I guess I'm a little concerned that
if we just do the obvious thing of src/backend/replication/backup it's
going to end up being a little awkward if I or anyone else want to 
add backup-related code that isn't specifically about the replication
protocol.

So maybe src/backend/backup? Or is that too grandiose for the amount
of stuff we have here?


+1 for src/backend/backup. I'd also be happy to see the start/stop code 
move here at some point.


Regards,
-David




Re: shared-memory based stats collector - v70

2022-08-09 Thread Andres Freund
Hi,

On 2022-08-09 12:18:47 +0200, Drouvot, Bertrand wrote:
> What do you think about adding a function in core PG to provide such
> functionality? (means being able to retrieve all the stats (+ eventually add
> some filtering) without the need to connect to each database).

I'm not that convinced by the use case, but I think it's also low cost to add
and maintain, so if somebody cares enough to write something...

The only thing I would "request" is that such a function requires more
permissions than the default accessors do. I think it's a minor problem that
we allow so much access within a database right now, regardless of object
permissions, but it'd not be a great idea to expand that to other databases,
in bulk?

Greetings,

Andres Freund




Re: moving basebackup code to its own directory

2022-08-09 Thread David Steele

On 8/9/22 12:12, Magnus Hagander wrote:
On Tue, Aug 9, 2022 at 6:08 PM Robert Haas > wrote:


Hi,

I was thinking that it might make sense, to reduce clutter, to move
*backup*.c from src/backend/replication to a new directory, perhaps
src/backend/replication/backup or src/backend/backup.

There's no particular reason we *have* to do this, but there are 21 C
files in that directory and 11 of them are basebackup-related, so
maybe it's time, especially because I think we might end up adding
more basebackup-related stuff.

Thoughts?


Those 11 files are mostly your fault, of course ;)

Anyway, I have no objection. If there'd been that many files, or plans 
to have it, in the beginning we probably would've put them in 
replication/basebackup or something like that from the beginning. I'm 
not sure how much it's worth doing wrt effects on backpatching etc, but 
if we're planning to add even more files in the future, the pain will 
just become bigger once we eventually do it...


There are big changes all around for PG15 so back-patching will be 
complicated no matter what.


+1 from me and it would be great if we can get this into the PG15 branch 
as well.


Regards,
-David




Re: moving basebackup code to its own directory

2022-08-09 Thread Robert Haas
On Tue, Aug 9, 2022 at 12:12 PM Magnus Hagander  wrote:
> Those 11 files are mostly your fault, of course ;)

They are. I tend to prefer smaller source files than many developers,
because I find them easier to understand and maintain. If you only
include  in basebackup_gzip.c, then you can be pretty sure
nothing else involved with basebackup is accidentally depending on it.
Similarly with static variables. If you just have one giant file, it's
harder to be sure about that sort of thing.

> Anyway, I have no objection. If there'd been that many files, or plans to 
> have it, in the beginning we probably would've put them in 
> replication/basebackup or something like that from the beginning. I'm not 
> sure how much it's worth doing wrt effects on backpatching etc, but if we're 
> planning to add even more files in the future, the pain will just become 
> bigger once we eventually do it...

Right.

It's not exactly clear to me what the optimal source code layout is
here. I think the placement here is under src/backend/replication
because the functionality is accessed via the replication protocol,
but I'm not sure if all backup-related code we ever add will be
related to the replication protocol. As a thought experiment, imagine
a background worker that triggers a backup periodically, or a
monitoring view that tells you about the status of your last 10 backup
attempts, or an in-memory hash table that tracks which files have been
modified since the last backup. I'm not planning on implementing any
of those things specifically, but I guess I'm a little concerned that
if we just do the obvious thing of src/backend/replication/backup it's
going to end up being a little awkward if I or anyone else want to
add backup-related code that isn't specifically about the replication
protocol.

So maybe src/backend/backup? Or is that too grandiose for the amount
of stuff we have here?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: remove useless comments

2022-08-09 Thread Junwang Zhao
Fair enough, the rephrased version of the comments is in the attachment,
please take a look.

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1583,8 +1583,8 @@ getInstallationPaths(const char *argv0)
  FreeDir(pdir);

  /*
- * XXX is it worth similarly checking the share/ directory?  If the lib/
- * directory is there, then share/ probably is too.
+ * It's not worth checking the share/ directory.  If the lib/ directory
+ * is there, then share/ probably is too.
  */
 }

On Tue, Aug 9, 2022 at 11:50 PM Tom Lane  wrote:
>
> Junwang Zhao  writes:
> > The comment about checking the share/ directory has been there for
> > almost 13 years, yet nobody has ever tried to add the check, and
> > there has never been any trouble from not checking it, so I think
> > we should remove those comments.
>
> I think that comment is valuable.  It shows that checking the
> sibling directories was considered and didn't seem worthwhile.
> Perhaps it should be rephrased in a more positive way (without XXX),
> but merely deleting it is a net negative because future hackers
> would have to reconstruct that reasoning.
>
> BTW, we're working in a 30+-year-old code base, so the mere fact
> that a comment has been there a long time does not make it bad.
>
> regards, tom lane



-- 
Regards
Junwang Zhao


0001-comments-rephrase-the-comment-in-a-more-positive-way.patch
Description: Binary data


Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-08-09 Thread Simon Riggs
On Tue, 9 Aug 2022 at 12:39, Dilip Kumar  wrote:

> > This short patchset compiles and passes make check-world, with lengthy 
> > comments.
>
> Does this patch set work independently or it has dependency on the
> patches on the other thread "Smoothing the subtrans performance
> catastrophe"?

Patches 001 and 002 are common elements of a different patch,
"Smoothing the subtrans performance catastrophe", but other than that,
the two patches are independent of each other.

i.e. there are common elements in both patches
001 puts all subxid data in a snapshot (up to a limit of 64 xids per
topxid), even if one or more xids overflows.

> Because in this patch I see no code where we are
> changing anything to control the access of SubTransGetParent() from
> SubTransGetTopmostTransactionId()?

Those calls are unaffected, i.e. they both still work.

Right now, we register all subxids in subtrans. But not all xids are
subxids, so in fact, subtrans has many "holes" in it, where if you
look up the parent for an xid it will just return
InvalidTransactionId. There is a protection against that causing a
problem because if you call TransactionIdDidCommit/Abort you can get a
WARNING, or if you call SubTransGetTopmostTransaction() you can get an
ERROR, but it is possible if you do a lookup for an inappropriate xid.
i.e. if you call TransactionIdDidCommit() without first calling
TransactionIdIsInProgress() as you are supposed to do.
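
For illustration, the expected calling order looks like this (a simplified
sketch of the protocol, not the full visibility logic):

#include "postgres.h"

#include "access/transam.h"
#include "storage/procarray.h"

/*
 * Only after TransactionIdIsInProgress() has said "no" is it valid to
 * consult clog via TransactionIdDidCommit(); skipping the first check is
 * what makes lookups of unregistered (sub)xids unsafe.
 */
static bool
xid_known_committed(TransactionId xid)
{
    if (TransactionIdIsInProgress(xid))
        return false;               /* still running, or a live subxact */

    if (TransactionIdDidCommit(xid))
        return true;

    return false;                   /* aborted, or crashed without commit */
}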

What this patch does is increase the number of "holes" in subtrans,
reducing the overhead and making the subtrans data structure more
amenable to using a dense structure rather than a sparse structure as
we do now, which then leads to I/O overheads. But in this patch, we
only have holes when we can prove that the subxid's parent will never
be requested.

Specifically, with this patch, running PL/pgSQL with a few
subtransactions in it will cause those subxids to be logged in subtrans
about 1% as often as they are now, so greatly reducing the number of
subtrans calls.

Happy to provide more detailed review thoughts, so please keep asking questions.

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: moving basebackup code to its own directory

2022-08-09 Thread Magnus Hagander
On Tue, Aug 9, 2022 at 6:08 PM Robert Haas  wrote:

> Hi,
>
> I was thinking that it might make sense, to reduce clutter, to move
> *backup*.c from src/backend/replication to a new directory, perhaps
> src/backend/replication/backup or src/backend/backup.
>
> There's no particular reason we *have* to do this, but there are 21 C
> files in that directory and 11 of them are basebackup-related, so
> maybe it's time, especially because I think we might end up adding
> more basebackup-related stuff.
>
> Thoughts?
>
>
Those 11 files are mostly your fault, of course ;)

Anyway, I have no objection. If there'd been that many files, or plans to
have it, in the beginning we probably would've put them in
replication/basebackup or something like that from the beginning. I'm not
sure how much it's worth doing wrt effects on backpatching etc, but if
we're planning to add even more files in the future, the pain will just
become bigger once we eventually do it...

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


moving basebackup code to its own directory

2022-08-09 Thread Robert Haas
Hi,

I was thinking that it might make sense, to reduce clutter, to move
*backup*.c from src/backend/replication to a new directory, perhaps
src/backend/replication/backup or src/backend/backup.

There's no particular reason we *have* to do this, but there are 21 C
files in that directory and 11 of them are basebackup-related, so
maybe it's time, especially because I think we might end up adding
more basebackup-related stuff.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fast COPY FROM based on batch insert

2022-08-09 Thread Zhihong Yu
On Tue, Aug 9, 2022 at 4:45 AM Etsuro Fujita 
wrote:

> On Tue, Jul 26, 2022 at 7:19 PM Etsuro Fujita 
> wrote:
> > Yeah, I think the patch is getting better, but I noticed some issues,
> > so I'm working on them.  I think I can post a new version in the next
> > few days.
>
> * When running AFTER ROW triggers in CopyMultiInsertBufferFlush(), the
> patch uses the slots passed to ExecForeignBatchInsert(), not the ones
> returned by the callback function, but I don't think that that is
> always correct, as the documentation about the callback function says:
>
>  The return value is an array of slots containing the data that was
>  actually inserted (this might differ from the data supplied, for
>  example as a result of trigger actions.)
>  The passed-in slots can be re-used for this
> purpose.
>
> postgres_fdw re-uses the passed-in slots, but other FDWs might not, so
> I modified the patch to reference the returned slots when running the
> AFTER ROW triggers.  I also modified the patch to initialize the
> tts_tableOid.  Attached is an updated patch, in which I made some
> minor adjustments to CopyMultiInsertBufferFlush() as well.
>
> * The patch produces incorrect error context information:
>
> create extension postgres_fdw;
> create server loopback foreign data wrapper postgres_fdw options
> (dbname 'postgres');
> create user mapping for current_user server loopback;
> create table t1 (f1 int, f2 text);
> create foreign table ft1 (f1 int, f2 text) server loopback options
> (table_name 't1');
> alter table t1 add constraint t1_f1positive check (f1 >= 0);
> alter foreign table ft1 add constraint ft1_f1positive check (f1 >= 0);
>
> — single insert
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> -1 foo
> >> 1 bar
> >> \.
> ERROR:  new row for relation "t1" violates check constraint "t1_f1positive"
> DETAIL:  Failing row contains (-1, foo).
> CONTEXT:  remote SQL command: INSERT INTO public.t1(f1, f2) VALUES ($1, $2)
> COPY ft1, line 1: "-1 foo"
>
> — batch insert
> alter server loopback options (add batch_size '2');
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> -1 foo
> >> 1 bar
> >> \.
> ERROR:  new row for relation "t1" violates check constraint "t1_f1positive"
> DETAIL:  Failing row contains (-1, foo).
> CONTEXT:  remote SQL command: INSERT INTO public.t1(f1, f2) VALUES
> ($1, $2), ($3, $4)
> COPY ft1, line 3
>
> In single-insert mode the error context information is correct, but in
> batch-insert mode it isn’t (i.e., the line number isn’t correct).
>
> The error occurs on the remote side, so I'm not sure if there is a
> simple fix.  What I came up with is to just suppress error context
> information other than the relation name, like the attached.  What do
> you think about that?
>
> (In CopyMultiInsertBufferFlush() your patch sets cstate->cur_lineno to
> buffer->linenos[i] even when running AFTER ROW triggers for the i-th
> row returned by ExecForeignBatchInsert(), but that wouldn’t always be
> correct, as the i-th returned row might not correspond to the i-th row
> originally stored in the buffer as the callback function returns only
> the rows that were actually inserted on the remote side.  I think the
> proposed fix would address this issue as well.)
>
> * The patch produces incorrect row count in cases where some/all of
> the rows passed to ExecForeignBatchInsert() weren’t inserted on the
> remote side:
>
> create function trig_null() returns trigger as $$ begin return NULL;
> end $$ language plpgsql;
> create trigger trig_null before insert on t1 for each row execute
> function trig_null();
>
> -- single insert
> alter server loopback options (drop batch_size);
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> 0 foo
> >> 1 bar
> >> \.
> COPY 0
>
> -- batch insert
> alter server loopback options (add batch_size '2');
> copy ft1 from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> 0 foo
> >> 1 bar
> >> \.
> COPY 2
>
> The row count is correct in single-insert mode, but isn’t in batch-insert
> mode.
>
> The reason is that in batch-insert mode the row counter is updated
> immediately after adding the row to the buffer, not after doing
> ExecForeignBatchInsert(), which might ignore the row.  To fix, I
> modified the patch to delay updating the row counter (and the progress
> of the COPY command) until after doing the callback function.  For
> consistency, I also modified the patch to delay it even when batching
> into plain tables.  I think that would be more consistent
> with the single-insert mode, as in that mode we update them after
> writing the tuple out to the table or sending it to the remote side.

Re: shared-memory based stats collector - v70

2022-08-09 Thread Greg Stark
On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand  wrote:
>
>
> What do you think about adding a function in core PG to provide such
> functionality? (means being able to retrieve all the stats (+ eventually
> add some filtering) without the need to connect to each database).

I'm working on it myself too. I'll post a patch for discussion in a bit.

I was more aiming at a C function that extensions could use directly
rather than an SQL function -- though I suppose having the former it
would be simple enough to implement the latter using it. (though it
would have to be one for each stat type I guess)

The reason I want a C function is I'm trying to get as far as I can
without a connection to a database, without a transaction, without
accessing the catalog, and as much as possible without taking locks. I
think this is important for making monitoring highly reliable and low
impact on production. It's also kind of fundamental to accessing stats
for objects from other databases since we won't have easy access to
the catalogs for the other databases.

The main problem with my current code is that I'm accessing the shared
memory hash table directly. This means I'm possibly introducing
locking contention on the shared memory hash table. I'm thinking of
separating the shared memory hash scan from the metric scan so the
list can be built quickly, minimizing the time the lock is held. We
could possibly also rebuild that list only at a lower frequency than
the metrics gathering, so new objects might not show up instantly.

I have a few things I would like to suggest for future improvements to
this infrastructure. I haven't polished the details of it yet but the
main thing I think I'm missing is the catalog name for the object. I
don't want to have to fetch it from the catalog and in any case I
think it would generally be useful and might regularize the
replication slot handling too.

I also think it would be nice to have a change counter for every stat
object, or perhaps a change time. Prometheus wouldn't be able to make
use of it but other monitoring software might be able to receive only
metrics that have changed since the last update which would really
help on databases with large numbers of mostly static objects. Even on
typical databases there are tons of builtin objects (especially
functions) that are probably never getting updates.

-- 
greg




Re: Generalize ereport_startup_progress infrastructure

2022-08-09 Thread Bharath Rupireddy
On Tue, Aug 9, 2022 at 6:05 PM Robert Haas  wrote:
>
> On Mon, Aug 8, 2022 at 12:29 AM Bharath Rupireddy
>  wrote:
> > Here's v2 patch, passing progress report interval as an input to
> > begin_progress_report_phase() so that the processes can use their own
> > intervals(hard-coded or GUC) if they wish to not use the generic GUC
> > log_progress_report_interval.
>
> I don't think we should rename the GUC to be more generic. I like it
> the way that it is.

Done.

> I also think you should extend this patch series with 1 or 2
> additional patches showing where else you think we should be using
> this infrastructure.
>
> If no such places exist, this is pointless.

I'm attaching 0002 for reporting removal of temp files and temp
relation files by postmaster.

If this looks okay, I can code 0003 for reporting processing of
snapshot, mapping and old WAL files by checkpointer.

Thoughts?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/


v3-0001-Generalize-ereport_startup_progress-infrastructur.patch
Description: Binary data


v3-0002-Progress-report-removal-of-temp-files-and-temp-re.patch
Description: Binary data


Re: remove useless comments

2022-08-09 Thread Tom Lane
Junwang Zhao  writes:
> The comment about checking the share/ directory has been there for
> almost 13 years, yet nobody has ever tried to add that check, and
> there never seems to have been any trouble from not checking it, so
> I think we should remove those comments.

I think that comment is valuable.  It shows that checking the
sibling directories was considered and didn't seem worthwhile.
Perhaps it should be rephrased in a more positive way (without XXX),
but merely deleting it is a net negative because future hackers
would have to reconstruct that reasoning.

BTW, we're working in a 30+-year-old code base, so the mere fact
that a comment has been there a long time does not make it bad.

regards, tom lane




remove useless comments

2022-08-09 Thread Junwang Zhao
The comment about checking the share/ directory has been there for
almost 13 years, yet nobody has ever tried to add that check, and
there never seems to have been any trouble from not checking it, so
I think we should remove those comments.

diff --git a/src/backend/postmaster/postmaster.c
b/src/backend/postmaster/postmaster.c
index 81cb585891..ecdc59ce5e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1581,11 +1581,6 @@ getInstallationPaths(const char *argv0)
 errhint("This may indicate an
incomplete PostgreSQL installation, or that the file \"%s\" has been
moved away from its proper location.",
 my_exec_path)));
FreeDir(pdir);
-
-   /*
-* XXX is it worth similarly checking the share/ directory?  If the lib/
-* directory is there, then share/ probably is too.
-*/
 }

-- 
Regards
Junwang Zhao


0001-comments-remove-useless-comments.patch
Description: Binary data


Re: dropping datumSort field

2022-08-09 Thread Zhihong Yu
On Tue, Aug 9, 2022 at 8:24 AM Robert Haas  wrote:

> On Tue, Aug 9, 2022 at 11:16 AM Zhihong Yu  wrote:
> > tupDesc is declared inside `if (!node->sort_Done)` block whereas the
> last reference to tupDesc is outside the if block.
>
> Yep.
>
> > I take your review comment and will go back to do more homework.
>
> The real point for me here is you haven't offered any reason to make
> this change. The structure member in question is basically free.
> Because of alignment padding it uses no more memory, and it makes the
> intent of the code clearer.
>
> Let's not change things just because we could.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com


I should have provided motivation for the patch.

I was aware of recent changes around ExprEvalStep. e.g.

Remove unused fields from ExprEvalStep

Though the datumSort field may be harmless for now, in the future, the
(auto) alignment padding may not work if more fields are added to the
struct.
I think we should always leave some room in structs for future expansion.

As for making the intent of the code clearer, the datumSort field is only
used in one method.
My previous patch removed a comment that should have been shifted into
this method.
In my opinion, localizing the check in a single method is easier to
understand than resorting to an additional struct field.

Cheers


Re: dropping datumSort field

2022-08-09 Thread Robert Haas
On Tue, Aug 9, 2022 at 11:16 AM Zhihong Yu  wrote:
> tupDesc is declared inside `if (!node->sort_Done)` block whereas the last 
> reference to tupDesc is outside the if block.

Yep.

> I take your review comment and will go back to do more homework.

The real point for me here is you haven't offered any reason to make
this change. The structure member in question is basically free.
Because of alignment padding it uses no more memory, and it makes the
intent of the code clearer.

Let's not change things just because we could.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-08-09 Thread Robert Haas
On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar  wrote:
> I think even if we start the range from 4 billion we cannot avoid
> keeping two separate ranges for system and user tables otherwise the
> next upgrade where old and new clusters both have 56 bits
> relfilenumber will get conflicting files.  And, for the same reason we
> still have to call SetNextRelFileNumber() during upgrade.

Well, my proposal to move everything from the new cluster up to higher
numbers would address this without requiring two ranges.

> So the idea is, we will be having 2 ranges for relfilenumbers, system
> range will start from 4 billion and user range maybe something around
> 4.1 billion (I think we can keep it very small though, just reserve 50k
> relfilenumber for system for future expansion and start user range
> from there).

A disadvantage of this is that it basically means all the file names
in new clusters are going to be 10 characters long. That's not a big
disadvantage, but it's not wonderful. File names that are only 5-7
characters long are common today, and easier to remember.

> So now system tables have no issues and also the user tables from the
> old cluster have no issues.  But pg_largeobject might get conflict
> when both old and new cluster are using 56 bits relfilenumber, because
> it is possible that in the new cluster some other system table gets
> that relfilenumber which is used by pg_largeobject in the old cluster.
>
> This could be resolved if we allocate pg_largeobject's relfilenumber
> from the user range, that means this relfilenumber will always be the
> first value from the user range.  So now if the old and new cluster
> both are using 56bits relfilenumber then pg_largeobject in both
> cluster would have got the same relfilenumber and if the old cluster
> is using the current 32 bits relfilenode system then the whole range
> of the new cluster is completely different than that of the old
> cluster.

I think this can work, but it does rely to some extent on the fact
that there are no other tables which need to be treated like
pg_largeobject. If there were others, they'd need fixed starting
RelFileNumber assignments, or some other trick, like renumbering them
twice in the cluster, first to a known-unused value and then back to
the proper value. You'd have trouble if in the other cluster
pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new
cluster the reverse, without some hackery.

I do feel like your idea here has some advantages - my proposal
requires rewriting all the catalogs in the new cluster before we do
anything else, and that's going to take some time even though they
should be small. But I also feel like it has some disadvantages: it
seems to rely on complicated reasoning and special cases more than I'd
like.

What do other people think?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: dropping datumSort field

2022-08-09 Thread Zhihong Yu
On Tue, Aug 9, 2022 at 8:01 AM Robert Haas  wrote:

> On Mon, Aug 8, 2022 at 5:51 PM Zhihong Yu  wrote:
> > Hi,
> > I was looking at ExecSort() w.r.t. datum sort.
> >
> > I wonder if the datumSort field can be dropped.
> > Here is a patch illustrating the potential simplification.
> >
> > Please take a look.
>
> One problem with this patch is that, if I apply it, PostgreSQL does not
> compile:
>
> nodeSort.c:197:6: error: use of undeclared identifier 'tupDesc'
> if (tupDesc->natts == 1)
> ^
> 1 error generated.
>
> Leaving that aside, I don't really see any advantage in this sort of
> change.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com


Thanks for trying out the patch.

TupleDesc   tupDesc;

tupDesc is declared inside `if (!node->sort_Done)` block whereas the last
reference to tupDesc is outside the if block.

I take your review comment and will go back to do more homework.

Cheers


Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Andrew Dunstan


On 2022-08-09 Tu 09:59, Jonathan S. Katz wrote:
> On 8/5/22 4:36 PM, Andres Freund wrote:
>> Hi,
>>
>> I tried to look into some of the questions from Amit, but I have e.g.
>> no idea
>> what exactly the use of subtransactions tries to achieve - afaics
>> 1a36bc9dba8
>> is the first patch to introduce needing to evaluate parts of expressions
>> in a
>> subtransaction - but there's not a single comment explaining why.
>>
>> It's really hard to understand the new json code. It's a substantial
>> amount of
>> new infrastructure, without any design documentation that I can find.
>> And it's
>> not like it's just code that's equivalent to nearby stuff. To me this
>> falls
>> way below our standards and I think it's *months* of work to fix.
>>
>> Even just the expression evaluation code: EvalJsonPathVar(),
>> ExecEvalJsonConstructor(), ExecEvalJsonExpr(), ExecEvalJson().
>> There's one
>> layer of subtransactions in one of the paths in ExecEvalJsonExpr(),
>> another in
>> ExecEvalJson(). Some paths of ExecEvalJsonExpr() go through
>> subtransactions,
>> others don't.
>>
>> It's one thing for a small set of changes to be of this quality. But
>> this is
>> pretty large - a quick summing of diffstat ends up with about 17k
>> lines added,
>> of which ~2.5k are docs, ~4.8k are tests.
>
> The RMT met today to discuss the state of this open item surrounding
> the SQL/JSON feature set. We discussed the specific concerns raised
> about the code and debated four different options:
>
>   1. Do nothing and continue with the community process of stabilizing
> the code without significant redesign
>
>   2. Recommend holding up the v15 release to allow for the code to be
> redesigned and fixed (as based on Andres' estimates, this would push
> the release out several months).
>
>   3. Revert the problematic parts of the code but try to include some
> of the features in the v15 release (e.g. JSON_TABLE)
>
>   4. Revert the feature set and redesign and try to include for v16
>
> Based on the concerns raised, the RMT is recommending option #4, to
> revert the SQL/JSON changes for v15, and come back with a redesign for
> v16.
>
> If folks think there are some bits we can include in v15, we can
> consider option #3. (Personally, I would like to see if we can keep
> JSON_TABLE, but if there is too much overlap with the problematic
> portions of the code I am fine with waiting for v16).
>
> At this stage in the release process coupled with the concerns, we're
> a bit worried about introducing changes that are unpredictable in
> terms of stability and maintenance. We also do not want to hold up the
> release while this feature set goes through a redesign without
> agreement on what such a design would look like as well as a timeline.
>
> Note that the above are the RMT's recommendations; while the RMT can
> explicitly call for a revert, we want to first guide the discussion on
> the best path forward knowing the challenges for including these
> features in v15.
>
>

I very much doubt option 3 is feasible. The parts that are controversial
go back at least in part to the first patches of the series. Trying to
salvage anything would almost certainly be more disruptive than trying
to fix it.

I'm not sure what the danger is to stability, performance or correctness
in applying the changes Amit has proposed for release 15. But if that
danger is judged to be too great then I agree we should revert.

I should add that I'm very grateful to Amit for his work, and I'm sure
it isn't wasted effort, whatever the decision.


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: dropping datumSort field

2022-08-09 Thread Robert Haas
On Mon, Aug 8, 2022 at 5:51 PM Zhihong Yu  wrote:
> Hi,
> I was looking at ExecSort() w.r.t. datum sort.
>
> I wonder if the datumSort field can be dropped.
> Here is a patch illustrating the potential simplification.
>
> Please take a look.

One problem with this patch is that, if I apply it, PostgreSQL does not compile:

nodeSort.c:197:6: error: use of undeclared identifier 'tupDesc'
if (tupDesc->natts == 1)
^
1 error generated.

Leaving that aside, I don't really see any advantage in this sort of change.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Reducing the chunk header sizes on all memory context types

2022-08-09 Thread Robert Haas
On Tue, Aug 9, 2022 at 8:53 AM David Rowley  wrote:
> I think the patch is now starting to take shape.  I've added it to the
> September commitfest [1].

This is extremely cool. The memory savings are really nice. And I also
like this:

# Additionally, this commit completely changes the rule that pointers must
# be directly prefixed by the owning memory context and instead, we now
# insist that they're directly prefixed by an 8-byte value where the least
# significant 3-bits are set to a value to indicate which type of memory
# context the pointer belongs to.  Using those 3 bits as an index to a new
# array which stores the methods for each memory context type, we're now
# able to pass the pointer given to functions such as pfree and repalloc to
# the function specific to that context implementation to allow them to
# devise their own methods of finding the memory context which owns the
# given allocated chunk of memory.

That seems like a good system.
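
To spell out how I read that, here is a toy standalone sketch of the
dispatch (every name below is made up for illustration; it is not the
patch's actual code):

#include <stdint.h>
#include <string.h>

/* hypothetical identifiers for the context types */
typedef enum
{
    MCTX_ASET_ID = 0,
    MCTX_GENERATION_ID = 1,
    MCTX_SLAB_ID = 2
} MemoryContextMethodID;

/* per-implementation callbacks; only free is shown */
typedef struct
{
    void        (*free_p) (void *pointer);
} MemoryContextMethodsSketch;

/* indexed by the 3 low bits of the chunk header */
static const MemoryContextMethodsSketch *mcxt_methods[8];

#define METHODID_MASK   ((uint64_t) 0x07)

static void
sketch_pfree(void *pointer)
{
    uint64_t    header;
    MemoryContextMethodID id;

    /* the 8 bytes directly preceding the pointer carry the tag bits */
    memcpy(&header, (char *) pointer - sizeof(uint64_t), sizeof(uint64_t));
    id = (MemoryContextMethodID) (header & METHODID_MASK);

    /* let the owning implementation locate its own block and context */
    mcxt_methods[id]->free_p(pointer);
}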

This part of the commit message might need to be clarified:

# We also add a restriction that block sizes for all 3 of the memory
# allocators cannot be 1GB or larger.  We would be unable to store the
# number of bytes that the block is offset from the chunk stored beyond this
# 1GB boundary on any block that was larger than 1GB.

Earlier in the commit message, you say that allocations of 1GB or more
are stored in dedicated blocks. But here you say that blocks can't be
more than 1GB. Those statements seem to contradict each other. I guess
you mean block sizes for blocks that contain chunks, or something like
that?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Materialized view rewrite is broken when there is an event trigger

2022-08-09 Thread Tom Lane
Michael Paquier  writes:
> Please note that we should not add an event in create_am.sql even if
> it is empty, as it gets run in parallel of other tests so there could
> be interferences.  I think that this had better live in
> sql/event_trigger.sql, even if it requires an extra table AM to check
> this specific case.

Agreed this is a bug, but I do not think we should add the proposed
regression test, regardless of where exactly.  It looks like expending
a lot of cycles forevermore to watch for an extremely unlikely thing,
ie that we break this for ALTER MATERIALIZED VIEW and not anything
else.

I think the real problem here is that we don't have any mechanism
for verifying that table_rewrite_ok is correct.  The "cross-check"
in EventTriggerCommonSetup is utterly worthless, as this failure
shows.  Does it give any confidence at all that there are no other
mislabelings?  I sure have none now.  What can we do to verify that
more rigorously?  Or maybe we should find a way to get rid of the
table_rewrite_ok flag altogether?

regards, tom lane




Re: Materialized view rewrite is broken when there is an event trigger

2022-08-09 Thread Önder Kalacı
Hi,

> I should be able to grab a little bit of time next week to look at
> what you have.
>

Thanks!


>
> Please note that we should not add an event in create_am.sql even if
> it is empty, as it gets run in parallel of other tests so there could
> be interferences.  I think that this had better live in
> sql/event_trigger.sql, even if it requires an extra table AM to check
> this specific case.
> --
>


Moved the test to event_trigger.sql.

> parallel group (2 tests, in groups of 1):  event_trigger oidjoins

Though it also seems to run in parallel, I assume the parallel test
already works fine with concurrent event triggers?

Thanks,
Onder


v2-Allow-MATERIALIZED-VIEW-Rewrite-when-event-triggers.patch
Description: Binary data


Re: Temporary file access API

2022-08-09 Thread Robert Haas
On Mon, Aug 8, 2022 at 2:26 PM Antonin Houska  wrote:
> > I don't think that (3) is a separate advantage from (1) and (2), so I
> > don't have anything separate to say about it.
>
> I thought that the uncontrollable buffer size is one of the things you
> complaint about in
>
> https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com

Well, I think that email is mostly complaining about there being no
buffering at all in a situation where it would be advantageous to do
some buffering. But that particular code I believe is gone now because
of the shared-memory stats collector, and when looking through the
patch set, I didn't see that any of the cases that it touched were
similar to that one.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Asking for feedback on Pgperffarm

2022-08-09 Thread Yedil Serzhan
Hi, Mark, really thank you for your feedback.


On Mon, Aug 8, 2022 at 7:06 PM Mark Wong  wrote:

> Hi Yedil,
>
> On Mon, Aug 08, 2022 at 02:50:17PM +0200, Yedil Serzhan wrote:
> > Dear hackers,
> >
> > I'm Yedil. I'm working on the project "Postgres Performance Farm" during
> > GSoC. Pgperffarm is a project like the Postgres build farm but focuses on
> > the performance of the database. Now it has 2 types of benchmarks, pgbench
> > and tpc-h. The website is online here , and the repo is here .
> >
> > I would like you to take a look at our website and, if possible, give some
> > feedback on, for example, what other data should be collected or what
> > other metrics could be used to compare performance.
>
> Nice work!
>
> We need to be careful with how results based on the TPC-H specification
> are presented.  It needs to be changed, but maybe not dramatically.
> Something like "Fair use derivation of TPC-H".  It needs to be clear
> that it's not an official TPC-H result.
>
> I think I've hinted at it in the #perffarm slack channel, that I think
> it would be better if you leveraged one of the already existing TPC-H
> derived kits.  While I'm partial to dbt-3, because I'm trying to
> maintain it and because it sounded like you were starting to do
> something similar to that, I think you can save a good amount of effort
> from reimplementing another kit from scratch.
>
> Regards,
> Mark
>


It makes sense to put it as a "fair use derivation of TPC-H". I also used
the term "composite score" because of your previous feedback on it.

I'll also check out the dbt-3 tool, and if the effort is worth it and
it's necessary, I'll try to switch to it.

This is very valuable feedback; thank you again.

Best,
Yedil


Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2022-08-09 Thread Jonathan S. Katz

On 8/5/22 4:36 PM, Andres Freund wrote:

Hi,

I tried to look into some of the questions from Amit, but I have e.g. no idea
what exactly the use of subtransactions tries to achieve - afaics 1a36bc9dba8
is the first patch to introduce needing to evaluate parts of expressions in a
subtransaction - but there's not a single comment explaining why.

It's really hard to understand the new json code. It's a substantial amount of
new infrastructure, without any design documentation that I can find. And it's
not like it's just code that's equivalent to nearby stuff. To me this falls
way below our standards and I think it's *months* of work to fix.

Even just the expression evaluation code: EvalJsonPathVar(),
ExecEvalJsonConstructor(), ExecEvalJsonExpr(), ExecEvalJson(). There's one
layer of subtransactions in one of the paths in ExecEvalJsonExpr(), another in
ExecEvalJson(). Some paths of ExecEvalJsonExpr() go through subtransactions,
others don't.

It's one thing for a small set of changes to be of this quality. But this is
pretty large - a quick summing of diffstat ends up with about 17k lines added,
of which ~2.5k are docs, ~4.8k are tests.


The RMT met today to discuss the state of this open item surrounding the 
SQL/JSON feature set. We discussed the specific concerns raised about 
the code and debated four different options:


  1. Do nothing and continue with the community process of stabilizing 
the code without significant redesign


  2. Recommend holding up the v15 release to allow for the code to be 
redesigned and fixed (as based on Andres' estimates, this would push the 
release out several months).


  3. Revert the problematic parts of the code but try to include some 
of the features in the v15 release (e.g. JSON_TABLE)


  4. Revert the feature set and redesign and try to include for v16

Based on the concerns raised, the RMT is recommending option #4, to 
revert the SQL/JSON changes for v15, and come back with a redesign for v16.


If folks think there are some bits we can include in v15, we can 
consider option #3. (Personally, I would like to see if we can keep 
JSON_TABLE, but if there is too much overlap with the problematic 
portions of the code I am fine with waiting for v16).


At this stage in the release process coupled with the concerns, we're a 
bit worried about introducing changes that are unpredictable in terms of 
stability and maintenance. We also do not want to hold up the release 
while this feature set goes through a redesign without agreement on 
what such a design would look like as well as a timeline.


Note that the above are the RMT's recommendations; while the RMT can 
explicitly call for a revert, we want to first guide the discussion on 
the best path forward knowing the challenges for including these 
features in v15.


Thanks,

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature


Re: automatically generating node support functions

2022-08-09 Thread Amit Kapila
On Tue, Aug 9, 2022 at 12:14 AM Tom Lane  wrote:
>
> I wrote:
> > Ah.  It'd help if that complaint said what the command input actually
> > is :-(.  But on looking closer, I missed stripping the empty strings
> > that "split" will produce at the ends of the array.  I think the
> > attached will do the trick, and I really do want to get rid of this
> > copy of the file list if possible.
>
> I tried this version on the cfbot, and it seems happy, so pushed.
>

Thank you. I have verified the committed patch and it works.

-- 
With Regards,
Amit Kapila.




Re: Materialized view rewrite is broken when there is an event trigger

2022-08-09 Thread Michael Paquier
On Tue, Aug 09, 2022 at 02:55:06PM +0200, Önder Kalacı wrote:
> Attached a patch to fix as well. If the patch looks good to you, can you
> consider getting this to PG 15?

Thanks, this one is on me so I have added an open item.  I will
unfortunately not be able to address that this week because of life,
but I should be able to grab a little bit of time next week to look at
what you have.

Please note that we should not add an event in create_am.sql even if
it is empty, as it gets run in parallel of other tests so there could
be interferences.  I think that this had better live in
sql/event_trigger.sql, even if it requires an extra table AM to check
this specific case.
--
Michael


signature.asc
Description: PGP signature


Materialized view rewrite is broken when there is an event trigger

2022-08-09 Thread Önder Kalacı
Hi,

Attached a patch to fix as well. If the patch looks good to you, can you
consider getting this to PG 15?

Steps to repro:
-- some basic examples from src/test/regress/sql/create_am.sql
CREATE TABLE heaptable USING heap AS
SELECT a, repeat(a::text, 100) FROM generate_series(1,9) AS a;
CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;
CREATE MATERIALIZED VIEW heapmv USING heap AS SELECT * FROM heaptable;

-- altering MATERIALIZED
ALTER MATERIALIZED VIEW heapmv SET ACCESS METHOD heap2;
ALTER MATERIALIZED VIEW heapmv SET ACCESS METHOD heap;

-- setup event trigger
CREATE OR REPLACE FUNCTION empty_event_trigger()
  RETURNS event_trigger AS $$
DECLARE
BEGIN
END;
$$ LANGUAGE plpgsql;
CREATE EVENT TRIGGER empty_triggger ON sql_drop EXECUTE PROCEDURE
empty_event_trigger();

-- now, after creating an event trigger, ALTER MATERIALIZED VIEW fails
-- unexpectedly
ALTER MATERIALIZED VIEW heapmv SET ACCESS METHOD heap2;
ERROR:  unexpected command tag "ALTER MATERIALIZED VIEW"

Thanks,
Onder Kalaci


v1-Allow-MATERIALIZED-VIEW-Rewrite-when-event-triggers.patch
Description: Binary data


Re: Reducing the chunk header sizes on all memory context types

2022-08-09 Thread David Rowley
On Wed, 13 Jul 2022 at 17:20, David Rowley  wrote:
> I did consider that in all cases where the allocations are above
> allocChunkLimit that the chunk is put on a dedicated block and in
> fact, the blockoffset is always the same for those.  I wondered if we
> could use the full 60 bits for the chunksize for those cases.  The
> reason I didn't pursue that is because:

As it turns out, we don't really need to explicitly store the chunk
size for chunks which are on dedicated blocks.  We can just calculate
this by subtracting the pointer to the memory from the block's endptr.
The block offset is always fixed too, like I mentioned above.

I've now revised the patch to completely get rid of the concept of
"large" chunks and instead memory chunks are always 8 bytes in size.
I've created a struct to this effect and named it MemoryChunk. All
memory contexts make use of this new struct.  The header bit which I
was previously using to denote a "large" chunk now marks if the chunk
is "external", meaning that the MemoryChunk does not have knowledge of
the chunk size and/or block offset. The MemoryContext itself is
expected to manage this when the chunk is external.   I've coded up
aset.c and generation.c to always use these external chunks when size >
set->allocChunkLimit.  There is now a coding pattern like the
following (taken from AllocSetRealloc):

if (MemoryChunkIsExternal(chunk))
{
block = ExternalChunkGetBlock(chunk);
oldsize = block->endptr - (char *) pointer;
}
else
{
block = MemoryChunkGetBlock(chunk);
oldsize = MemoryChunkGetSize(chunk);
}

Here the ExternalChunkGetBlock() macro is just subtracting the
ALLOC_BLOCKHDRSZ from the chunk pointer to obtain the block pointer,
whereas MemoryChunkGetBlock() is subtracting the blockoffset as is
stored in the 30-bits of the chunk header.
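
In other words, roughly the following (simplified; MemoryChunkGetBlockOffset()
is just my shorthand for reading that 30-bit field, and the real macros in
the patch may differ):

/* external chunk: it is always placed directly after the block header */
#define ExternalChunkGetBlock(chunk) \
    ((AllocBlock) ((char *) (chunk) - ALLOC_BLOCKHDRSZ))

/*
 * normal chunk: the chunk header's 30-bit field records how far the chunk
 * sits from the start of its block, so we just walk back by that offset
 */
#define MemoryChunkGetBlock(chunk) \
    ((void *) ((char *) (chunk) - MemoryChunkGetBlockOffset(chunk)))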

Andres and I had a conversation off-list about the storage of freelist
pointers.  Currently I have these stored in the actual allocated
memory.  The minimum allocation size is 8-bytes, which is big enough
for storing sizeof(void *). Andres suggested that it might be safer to
store this link at the end of the chunk rather than at the start.
I've not quite done that in the attached, but doing this should just
be a matter of adjusting the GetFreeListLink() macro to add the
chunksize - sizeof(AllocFreelistLink).
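
Concretely, that adjustment would look roughly like this (illustrative only;
this is not what's in the attached patch, and the names are just my
shorthand):

typedef struct AllocFreeListLink
{
    MemoryChunk *next;
} AllocFreeListLink;

/* as attached: the link lives at the start of the chunk's user memory */
#define GetFreeListLink(chkptr) \
    ((AllocFreeListLink *) ((char *) (chkptr) + ALLOC_CHUNKHDRSZ))

/* suggested: store the link at the end of the chunk instead, which might
 * be a bit safer against writes made just after a pfree */
#define GetFreeListLinkAtEnd(chkptr, chunksize) \
    ((AllocFreeListLink *) ((char *) (chkptr) + ALLOC_CHUNKHDRSZ + \
                            (chunksize) - sizeof(AllocFreeListLink)))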

I did a little bit of benchmarking of the attached with a scale 1 pgbench.

master = 0b039e3a8

master
drowley@amd3990x:~$ pgbench -c 156 -j 156 -T 60 -S postgres
tps = 1638436.367793 (without initial connection time)
tps = 1652688.009579 (without initial connection time)
tps = 1671281.281856 (without initial connection time)

master + v2 patch
drowley@amd3990x:~$ pgbench -c 156 -j 156 -T 60 -S postgres
tps = 1825346.734486 (without initial connection time)
tps = 1824539.294142 (without initial connection time)
tps = 1807359.758822 (without initial connection time)

~10% faster.

There are a couple of things to note that might require discussion:

1. I've added a new restriction that block sizes cannot be above 1GB.
This is because the 30-bits in the MemoryChunk used for storing the
offset between the chunk and the block wouldn't be able to store the
offset if the chunk was offset more than 1GB from the block. I used
debian code search to see if I could find any user code that used
blocks this big. I found nothing.
2. slab.c has a new restriction that the chunk size cannot be >= 1GB.
I'm not at all concerned about this. I think if someone needed chunks
that big there'd be no benefits from slab context over an aset
context.
3. As mentioned above, aset.c now stores freelist pointers in the
allocated chunk's memory.  This allows us to get the header down to 8
bytes instead of today's 16 bytes. There's an increased danger that
buggy code that writes to memory after a pfree could stomp on this.

I think the patch is now starting to take shape.  I've added it to the
September commitfest [1].

David

[1] https://commitfest.postgresql.org/39/3810/
From 13034cd62f47eb26aca717e0c65a21f8c976477e Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Thu, 9 Jun 2022 20:09:22 +1200
Subject: [PATCH v2] Improve performance of and reduce overheads of memory
 management

Whenever we palloc a chunk of memory, traditionally, we prefix the
returned pointer with a pointer to the memory context to which the chunk
belongs.  This is required so that we're able to easily determine the
owning context when performing operations such as pfree() and repalloc().

For the AllocSet context, prior to this commit we additionally prefixed
the pointer to the owning context with the size of the chunk.  This made
the header 16 bytes in size.  This 16-byte overhead was required for all
AllocSet allocations regardless of the allocation size.

For the generation context, the problem was worse; in addition to the
pointer to the owning context and chunk size, we also stored a pointer to
the owning block so that we could track the number of freed chunks on a

Re: [RFC] building postgres with meson

2022-08-09 Thread Andrew Dunstan


On 2022-08-09 Tu 03:10, Andres Freund wrote:
> Hi,
>
> I was looking at re-unifying gendef2.pl that the meson patchset had introduced
> for temporary ease during hacking with gendef.pl. Testing that I noticed that
> either I and my machine is very confused, or gendef.pl's check whether it can
> skip work is bogus.
>
> I noticed that, despite having code to avoid rerunning when the input files
> are older than the .def file, it always runs.
>
> # if the def file exists and is newer than all input object files, skip
> # its creation
> if (-f $deffile
> && (-M $deffile > max(map { -M } <$ARGV[0]/*.obj>)))
> {
> print "Not re-generating $defname.DEF, file already exists.\n";
> exit(0);
> }
>
> My understanding of -M is that it returns the time delta between the file
> modification and the start of the script. Which makes the use of max() bogus,
> since it'll return the oldest time any input has been modified, not the
> newest. And the condition needs to be inverted, because we want to skip the
> work if $deffile is *newer*, right?
>
> Am I missing something here?


No, you're right, this is bogus. Reversing the test and using min
instead of max is the obvious fix.


> I'm tempted to just remove the not-regenerating logic - gendef.pl shouldn't
> run if there's nothing to do, and it'll e.g. not notice if there's an
> additional input that wasn't there during the last invocation of gendef.pl.
>

Maybe, need to think about that more.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Generalize ereport_startup_progress infrastructure

2022-08-09 Thread Robert Haas
On Mon, Aug 8, 2022 at 12:29 AM Bharath Rupireddy
 wrote:
> Here's v2 patch, passing progress report interval as an input to
> begin_progress_report_phase() so that the processes can use their own
> intervals(hard-coded or GUC) if they wish to not use the generic GUC
> log_progress_report_interval.

I don't think we should rename the GUC to be more generic. I like it
the way that it is.

I also think you should extend this patch series with 1 or 2
additional patches showing where else you think we should be using
this infrastructure.

If no such places exist, this is pointless.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: support for MERGE

2022-08-09 Thread Justin Pryzby
On Tue, Aug 09, 2022 at 11:48:23AM +0200, Álvaro Herrera wrote:
> On 2022-Aug-01, Álvaro Herrera wrote:
> 
> > > >  If MERGE attempts an INSERT
> > > >  and a unique index is present and a duplicate row is concurrently
> > > > +inserted, then a uniqueness violation error is raised;
> > > > +MERGE does not attempt to avoid such
> > > > +errors by evaluating MATCHED conditions.
> > > 
> > > This was a portion of a change that was committed as eebf2.
> > > 
> > > But I don't understand why this changed from "does not attempt to avoid 
> > > the
> > > error by executing an UPDATE." to "...by evaluating
> > > MATCHED conditions."
> > > 
> > > Maybe it means to say "..by re-starting evaluation of match conditions".
> > 
> > Yeah, my thought there is that it may also be possible that the action
> > that would run if the conditions are re-run is a DELETE or a WHEN
> > MATCHED THEN DO NOTHING; so saying "by executing an UPDATE" it leaves
> > out those possibilities.  IOW if we're evaluating NOT MATCHED INSERT and
> > we find a duplicate, we do not go back to MATCHED.
> 
> So I propose to leave it as
> 
>If MERGE attempts an INSERT
>and a unique index is present and a duplicate row is concurrently
>inserted, then a uniqueness violation error is raised;
>MERGE does not attempt to avoid such
>errors by restarting evaluation of MATCHED
>  conditions.

I think by "leave it as" you mean "change it to".
(Meaning, without referencing UPDATE).

> (Is "re-starting" better than "restarting"?)

"re-starting" doesn't currently exist in the docs, so I guess not.
You could also say "starting from scratch the evaluation of MATCHED conditions".

Note that I proposed two other changes in the other thread ("MERGE and parsing
with prepared statements").

  - remove the sentence with "automatic type conversion will be attempted";
  - make examples more similar to emphasize their differences;

-- 
Justin




Re: Perform streaming logical transactions by background workers and parallel apply

2022-08-09 Thread Amit Kapila
On Tue, Aug 9, 2022 at 11:09 AM Dilip Kumar  wrote:
>
> Some more comments
>
> +/*
> + * Exit if any relation is not in the READY state and if any worker is
> + * handling the streaming transaction at the same time. Because for
> + * streaming transactions that is being applied in apply background
> + * worker, we cannot decide whether to apply the change for a relation
> + * that is not in the READY state (see should_apply_changes_for_rel) as 
> we
> + * won't know remote_final_lsn by that time.
> + */
> +if (list_length(ApplyBgworkersFreeList) !=
> list_length(ApplyBgworkersList) &&
> +!AllTablesyncsReady())
> +{
> +ereport(LOG,
> +(errmsg("logical replication apply workers for
> subscription \"%s\" will restart",
> +MySubscription->name),
> + errdetail("Cannot handle streamed replication
> transaction by apply "
> +   "background workers until all tables are
> synchronized")));
> +
> +proc_exit(0);
> +}
>
> How can this situation occur? I mean, while starting a background
> worker itself we can check whether all tables are sync ready or not,
> right?
>

We are already checking at the start in apply_bgworker_can_start(), but
I think it is required to check at a later point in time as well,
because new rels can be added to pg_subscription_rel via Alter
Subscription ... Refresh. I feel that if that reasoning is correct then
we can probably expand the comments to make it clear.

> +/* Check the status of apply background worker if any. */
> +apply_bgworker_check_status();
> +
>
> What is the need to check each worker's status on every commit?  I
> mean if there are a lot of small transactions along with some
> streaming transactions
> then it will affect the apply performance for those small transactions?
>

I don't think performance will be a concern because this won't do any
costly operation unless invalidation happens, in which case it will
access system catalogs. However, if my above understanding is correct
that new tables can be added during the apply process, then I'm not sure
doing it at commit time is sufficient/correct because it can change
even during the transaction.

-- 
With Regards,
Amit Kapila.




Re: Fast COPY FROM based on batch insert

2022-08-09 Thread Etsuro Fujita
On Tue, Jul 26, 2022 at 7:19 PM Etsuro Fujita  wrote:
> Yeah, I think the patch is getting better, but I noticed some issues,
> so I'm working on them.  I think I can post a new version in the next
> few days.

* When running AFTER ROW triggers in CopyMultiInsertBufferFlush(), the
patch uses the slots passed to ExecForeignBatchInsert(), not the ones
returned by the callback function, but I don't think that that is
always correct, as the documentation about the callback function says:

 The return value is an array of slots containing the data that was
 actually inserted (this might differ from the data supplied, for
 example as a result of trigger actions.)
 The passed-in slots can be re-used for this purpose.

postgres_fdw re-uses the passed-in slots, but other FDWs might not, so
I modified the patch to reference the returned slots when running the
AFTER ROW triggers.  I also modified the patch to initialize the
tts_tableOid.  Attached is an updated patch, in which I made some
minor adjustments to CopyMultiInsertBufferFlush() as well.

* The patch produces incorrect error context information:

create extension postgres_fdw;
create server loopback foreign data wrapper postgres_fdw options
(dbname 'postgres');
create user mapping for current_user server loopback;
create table t1 (f1 int, f2 text);
create foreign table ft1 (f1 int, f2 text) server loopback options
(table_name 't1');
alter table t1 add constraint t1_f1positive check (f1 >= 0);
alter foreign table ft1 add constraint ft1_f1positive check (f1 >= 0);

-- single insert
copy ft1 from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> -1 foo
>> 1 bar
>> \.
ERROR:  new row for relation "t1" violates check constraint "t1_f1positive"
DETAIL:  Failing row contains (-1, foo).
CONTEXT:  remote SQL command: INSERT INTO public.t1(f1, f2) VALUES ($1, $2)
COPY ft1, line 1: "-1 foo"

-- batch insert
alter server loopback options (add batch_size '2');
copy ft1 from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> -1 foo
>> 1 bar
>> \.
ERROR:  new row for relation "t1" violates check constraint "t1_f1positive"
DETAIL:  Failing row contains (-1, foo).
CONTEXT:  remote SQL command: INSERT INTO public.t1(f1, f2) VALUES
($1, $2), ($3, $4)
COPY ft1, line 3

In single-insert mode the error context information is correct, but in
batch-insert mode it isn’t (i.e., the line number isn’t correct).

The error occurs on the remote side, so I'm not sure if there is a
simple fix.  What I came up with is to just suppress error context
information other than the relation name, like the attached.  What do
you think about that?

(In CopyMultiInsertBufferFlush() your patch sets cstate->cur_lineno to
buffer->linenos[i] even when running AFTER ROW triggers for the i-th
row returned by ExecForeignBatchInsert(), but that wouldn’t always be
correct, as the i-th returned row might not correspond to the i-th row
originally stored in the buffer as the callback function returns only
the rows that were actually inserted on the remote side.  I think the
proposed fix would address this issue as well.)

* The patch produces incorrect row count in cases where some/all of
the rows passed to ExecForeignBatchInsert() weren’t inserted on the
remote side:

create function trig_null() returns trigger as $$ begin return NULL;
end $$ language plpgsql;
create trigger trig_null before insert on t1 for each row execute
function trig_null();

-- single insert
alter server loopback options (drop batch_size);
copy ft1 from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 0 foo
>> 1 bar
>> \.
COPY 0

-- batch insert
alter server loopback options (add batch_size '2');
copy ft1 from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 0 foo
>> 1 bar
>> \.
COPY 2

The row count is correct in single-insert mode, but isn’t in batch-insert mode.

The reason is that in batch-insert mode the row counter is updated
immediately after adding the row to the buffer, not after doing
ExecForeignBatchInsert(), which might ignore the row.  To fix, I
modified the patch to delay updating the row counter (and the progress
of the COPY command) until after doing the callback function.  For
consistency, I also modified the patch to delay it even when batching
into plain tables.  I think that would be more consistent
with the single-insert mode, as in that mode we update them after
writing the tuple out to the table or sending it to the remote side.
(A rough sketch of the intended flow follows after the last item below.)

* I modified the patch so that when batching into foreign tables we
skip useless steps in CopyMultiInsertBufferInit() and
CopyMultiInsertBufferCleanup().
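
To make the intended flow concrete, here is a simplified sketch of the
flush logic described above (made-up function name and simplified
arguments; this is not the actual CopyMultiInsertBufferFlush() code, and
it assumes the usual PostgreSQL headers):

/* Rough sketch only: flush one buffered batch to a foreign table. */
static void
flush_batch_sketch(CopyFromState cstate, EState *estate, ResultRelInfo *rri,
                   TupleTableSlot **bufslots, int nbuffered, uint64 *processed)
{
    int         ninserted = nbuffered;
    TupleTableSlot **rslots;

    /* The FDW may insert fewer rows and may hand back different slots. */
    rslots = rri->ri_FdwRoutine->ExecForeignBatchInsert(estate, rri,
                                                        bufslots, NULL,
                                                        &ninserted);

    /* Run AFTER ROW triggers on the slots actually inserted. */
    for (int i = 0; i < ninserted; i++)
        ExecARInsertTriggers(estate, rri, rslots[i], NIL, NULL);

    /*
     * Update the row count and the COPY progress only now, so that rows
     * the remote side skipped are not counted.
     */
    *processed += ninserted;
    pgstat_progress_update_param(PROGRESS_COPY_TUPLES_PROCESSED, *processed);
}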

That’s all I have for now.  Sorry for the delay.

Best regards,
Etsuro Fujita



Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-08-09 Thread Dilip Kumar
On Mon, Aug 8, 2022 at 6:41 PM Simon Riggs  wrote:
>
> On Thu, 4 Aug 2022 at 13:11, Simon Riggs  wrote:
> >
> > On Wed, 3 Aug 2022 at 20:18, Andres Freund  wrote:
> >
> > > I think we should consider redesigning subtrans more substantially - even 
> > > with
> > > the changes you propose here, there's still plenty ways to hit really bad
> > > performance. And there's only so much we can do about that without more
> > > fundamental design changes.
> >
> > I completely agree - you will be glad to hear that I've been working
> > on a redesign of the subtrans module.
> ...
> > I will post my patch, when complete, in a different thread.
>
> The attached patch reduces the overhead of SUBTRANS by minimizing the
> number of times SubTransSetParent() is called, to below 1% of the
> current rate in common cases.
>
> Instead of blindly calling SubTransSetParent() for every subxid, this
> proposal only calls SubTransSetParent() when that information will be
> required for later use. It does this by analyzing all of the callers
> of SubTransGetParent() and uses these pre-conditions to filter out
> calls/subxids that will never be required, for various reasons. It
> redesigns the way XactLockTableWait() calls
> SubTransGetTopmostTransactionId() to allow this.
>
> This short patchset compiles and passes make check-world, with lengthy 
> comments.

Does this patch set work independently, or does it depend on the
patches in the other thread "Smoothing the subtrans performance
catastrophe"?  Because in this patch I see no code where we are
changing anything to control the access of SubTransGetParent() from
SubTransGetTopmostTransactionId()?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Fix inconsistencies GUC categories

2022-08-09 Thread Michael Paquier
On Sat, Aug 06, 2022 at 09:54:36PM +0900, Michael Paquier wrote:
> Yep.  This change sounds right as well. 

Done as of 0b039e3.  Thanks, Kato-san.
--
Michael


signature.asc
Description: PGP signature


Re: Perform streaming logical transactions by background workers and parallel apply

2022-08-09 Thread Amit Kapila
On Thu, Aug 4, 2022 at 12:10 PM wangw.f...@fujitsu.com
 wrote:
>
> On Mon, Jul 25, 2022 at 21:50 PM Amit Kapila  wrote:
> > Few comments on 0001:
> > ==
>
> Thanks for your comments.
>

Review comments on v20-0001-Perform-streaming-logical-transactions-by-backgr
===
1.
+ 
+  If set to on, the incoming changes are written to
+  temporary files and then applied only after the transaction is
+  committed on the publisher.

It is not very clear that the transaction is applied when the commit
is received by the subscriber. Can we slightly change it to: "If set
to on, the incoming changes are written to
temporary files and then applied only after the transaction is
committed on the publisher and received by the subscriber."

2.
/* First time through, initialize apply workers hashtable */
+ if (ApplyBgworkersHash == NULL)
+ {
+ HASHCTL ctl;
+
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(TransactionId);
+ ctl.entrysize = sizeof(ApplyBgworkerEntry);
+ ctl.hcxt = ApplyContext;
+
+ ApplyBgworkersHash = hash_create("logical apply workers hash", 8, &ctl,
+HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

I think it would be better if we start with probably a 16-element hash
table; 8 seems to be on the lower side.

3.
+/*
+ * Try to look up worker assigned before (see function apply_bgworker_get_free)
+ * inside ApplyBgworkersHash for requested xid.
+ */
+ApplyBgworkerState *
+apply_bgworker_find(TransactionId xid)

The above comment is not very clear. There doesn't seem to be any
function named apply_bgworker_get_free in the patch. Can we write this
comment as: "Find the previously assigned worker for the given
transaction, if any."

4.
/*
+ * Push apply error context callback. Fields will be filled applying a
+ * change.
+ */

/Fields will be filled applying a change./Fields will be filled while
applying a change.

5.
+void
+ApplyBgworkerMain(Datum main_arg)
+{
...
...
+ StartTransactionCommand();
+ oldcontext = MemoryContextSwitchTo(ApplyContext);
+
+ MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
+ if (!MySubscription)
+ {
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription %u will not "
+ "start because the subscription was removed during startup",
+ MyLogicalRepWorker->subid)));
+ proc_exit(0);
+ }
+
+ MySubscriptionValid = true;
+ MemoryContextSwitchTo(oldcontext);
+
+ /* Setup synchronous commit according to the user's wishes */
+ SetConfigOption("synchronous_commit", MySubscription->synccommit,
+ PGC_BACKEND, PGC_S_OVERRIDE);
+
+ /* Keep us informed about subscription changes. */
+ CacheRegisterSyscacheCallback(SUBSCRIPTIONOID,
+   subscription_change_cb,
+   (Datum) 0);
+
+ CommitTransactionCommand();
...

This part of the code appears to be the same as what we have in
ApplyWorkerMain(), except that the patch doesn't check whether the
subscription is enabled. Is there a reason not to have that check here
as well? Also, in ApplyWorkerMain() we LOG the type of worker, which
is missing here. Unless there is a specific reason to have
different code here, we should move this part to a common function and
call it from both ApplyWorkerMain() and ApplyBgworkerMain().

6. I think the code in ApplyBgworkerMain() to set
session_replication_role, search_path, and connect to the database
also appears to be the same in ApplyWorkerMain(). If so, that can also
be moved to the common function mentioned in the previous point.

7. I think we need to register for subscription rel map invalidation
(invalidate_syncing_table_states) in ApplyBgworkerMain similar to
ApplyWorkerMain. The reason is that we check the table state after
processing a commit or similar change record via a call to
process_syncing_tables.

8. In apply_bgworker_setup_dsm(), we should have handling related to
dsm_create failure due to max_segments reached as we have in
InitializeParallelDSM(). We can follow the regular path of streaming
transactions in case we are not able to create DSM instead of
parallelizing it.

9.
+ shm_toc_initialize_estimator();
+ shm_toc_estimate_chunk(, sizeof(ApplyBgworkerShared));
+ shm_toc_estimate_chunk(, (Size) queue_size);
+
+ shm_toc_estimate_keys(, 1 + 1);

Here, you can directly write 2 instead of (1 + 1) stuff. It is quite
clear that we need two keys here.

10.
apply_bgworker_wait_for()
{
...
+ /* Wait to be signalled. */
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
+   WAIT_EVENT_LOGICAL_APPLY_BGWORKER_STATE_CHANGE);
...
}

Typecast with the void, if we don't care for the return value.

11.
+static void
+apply_bgworker_shutdown(int code, Datum arg)
+{
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->status = APPLY_BGWORKER_EXIT;
+ SpinLockRelease(&MyParallelShared->mutex);

Is there a reason to not use apply_bgworker_set_status() directly?

12.
+ * Special case is if the first change comes from subtransaction, then
+ * we check that current_xid differs from stream_xid.
+ 

Re: Checking pgwin32_is_junction() errors

2022-08-09 Thread r . zharkov

On 2022-08-09 05:44, Thomas Munro wrote:
On Tue, Aug 9, 2022 at 8:30 AM Thomas Munro  
wrote:

On Mon, Aug 8, 2022 at 8:23 PM  wrote:
> "C:/HOME" is the junction point to the second volume on my hard drive -
> "\??\Volume{GUID}\" which name pgreadlink() erroneously strips here:
> 
https://github.com/postgres/postgres/blob/7e29a79a46d30dc236d097825ab849158929d977/src/port/dirmod.c#L357.



... Would it
be better to say: if it doesn't begin with "\??\X:", where X could be
any letter, then don't modify it?


Concretely, I wonder if this is a good fix at least in the short term.
Does this work for you, and do the logic and explanation make sense?


Yes, this patch works well with my junction points.
I checked a few variants:

21.07.2022  15:11 HOME [\??\Volume{GUID}\]
09.08.2022  15:06 Test1 [\\?\Volume{GUID}\]
09.08.2022  15:06 Test2 [\\.\Volume{GUID}\]
09.08.2022  15:17 Test3 [\??\Volume{GUID}\]
09.08.2022  15:27 Test4 [C:\temp\1]
09.08.2022  15:28 Test5 [C:\HOME\Temp\1]

After hours of reading the documentation and debugging, it seems to me
we can use the REPARSE_GUID_DATA_BUFFER structure instead of our
REPARSE_JUNCTION_DATA_BUFFER [1]. DataBuffer doesn't contain any prefixes,
so we don't need to strip them. But we still need to construct a correct
volume name if a junction point is a volume mount point. Is it worth
checking this idea?

[1] 
https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-reparse_guid_data_buffer





Re: Using each rel as both outer and inner for JOIN_ANTI

2022-08-09 Thread Alvaro Herrera
Just for kicks, I ran the query in your original post under EXPLAIN ANALYZE
in both patched and unpatched with this last version.  I got this (best
of three):

Unpatched:
55432 16devel 437532=# explain (analyze, buffers) select * from foo left join 
bar on foo.a = bar.c where bar.c is null;
 QUERY PLAN 


 Hash Anti Join  (cost=159039.00..183457.23 rows=10 width=20) (actual 
time=482.788..483.182 rows=10 loops=1)
   Hash Cond: (foo.a = bar.c)
   Buffers: shared hit=161 read=21964, temp read=8 written=8
   ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual 
time=0.020..0.022 rows=10 loops=1)
 Buffers: shared hit=1
   ->  Hash  (cost=72124.00..72124.00 rows=500 width=12) (actual 
time=482.128..482.129 rows=0 loops=1)
 Buckets: 262144  Batches: 64  Memory Usage: 2048kB
 Buffers: shared hit=160 read=21964
 ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=12) 
(actual time=0.092..237.431 rows=500 loops=1)
   Buffers: shared hit=160 read=21964
 Planning Time: 0.182 ms
 Execution Time: 483.248 ms


Patched:
  QUERY PLAN
  
──
 Hash Right Anti Join  (cost=1.23..90875.24 rows=10 width=20) (actual 
time=457.654..457.658 rows=10 loops=1)
   Hash Cond: (bar.c = foo.a)
   Buffers: shared hit=33 read=22092
   ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=12) (actual 
time=0.020..229.097 rows=500 loops=1)
 Buffers: shared hit=32 read=22092
   ->  Hash  (cost=1.10..1.10 rows=10 width=8) (actual time=0.011..0.012 
rows=10 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 9kB
 Buffers: shared hit=1
 ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual 
time=0.006..0.007 rows=10 loops=1)
   Buffers: shared hit=1
 Planning Time: 0.067 ms
 Execution Time: 457.679 ms



I suppose this looks good as far as the plan goes, but the cost estimation
might be a little bit too optimistic: it is reporting that the new plan
costs 50% of the original, yet the execution time is only 5% lower.

I wonder where the time goes (in unpatched) after seqscanning finishes
and before hashing starts.

(I had to disable JIT for the first one, as it insisted on JITting tuple
deforming.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.  (Don Knuth)




Re: Checking pgwin32_is_junction() errors

2022-08-09 Thread r . zharkov

On 2022-08-09 03:30, Thomas Munro wrote:


Then our readlink() emulation removes it again, but in the case of
your \??\Volume{GUID} path, created by you, not our symlink()
emulation, removing "\??\" apparently makes it unopenable with
CreateFile() (I guess that's what fails?  What's the error?).  So your
patch just says: don't strip "\??\" if it's followed by "Volume".


Sorry, I wrongly assumed that everyone could see the backtrace on my screen.
The CreateFile() function fails with fileName = "Volume{GUID}\" at [1],
and GetLastError() returns 2 (ERROR_FILE_NOT_FOUND).

Call Stack:
initdb.exe!pgwin32_open_handle(const char * fileName, ...) Line 111 C
initdb.exe!_pglstat64(const char * name, stat * buf) Line 128   C
initdb.exe!_pgstat64(const char * name, stat * buf) Line 221C
initdb.exe!pg_mkdir_p(char * path, int omode) Line 123  C
initdb.exe!create_data_directory() Line 2537C
initdb.exe!initialize_data_directory() Line 2696C
initdb.exe!main(int argc, char * * argv) Line 3102  C


I don't understand all the kinds of DOS, Windows and NT paths (let me
take a moment to say how much I love Unix), but here's a guess: could
it be that NT "\??\C:\foo" = DOS "C:\foo", but NT "\??\Volume..." =
DOS "\Volume..."?  In other words, if it hasn't got a drive letter,
maybe it still needs an initial "\" (or if not that, then *something*
special, because otherwise it looks like a relative path).


It seems to me that when we call CreateFile(), the Windows Object Manager
searches for DOS devices (drive letters in our case) in the DOS device
namespaces. But it doesn't search for the "Volume{GUID}" devices, which
must be named "\\?\Volume{GUID}\" [2].


Would it be better to say: if it doesn't begin with "\??\X:", where X
could be any letter, then don't modify it?



I think that would be better.

[1] 
https://github.com/postgres/postgres/blob/7e29a79a46d30dc236d097825ab849158929d977/src/port/open.c#L86
[2] 
https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-volume
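
Just to sketch the check being proposed (the helper name is made up; this is
not the actual patch): only treat the junction target as strippable when it
begins with "\??\" followed by a drive letter and a colon.

#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does the NT path start with "\??\X:" for some drive
 * letter X?  Only such targets would have the "\??\" prefix stripped;
 * volume GUID targets like "\??\Volume{...}\" would be left untouched. */
static bool
is_nt_drive_letter_path(const char *path)
{
	return strncmp(path, "\\??\\", 4) == 0 &&
		isalpha((unsigned char) path[4]) &&
		path[5] == ':';
}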





Re: support for MERGE

2022-08-09 Thread Álvaro Herrera
On 2022-Aug-01, Álvaro Herrera wrote:

> > >  If MERGE attempts an INSERT
> > >  and a unique index is present and a duplicate row is concurrently
> > > +inserted, then a uniqueness violation error is raised;
> > > +MERGE does not attempt to avoid such
> > > +errors by evaluating MATCHED conditions.
> > 
> > This was a portion of a change that was committed as eebf2.
> > 
> > But I don't understand why this changed from "does not attempt to avoid the
> > error by executing an UPDATE." to "...by evaluating
> > MATCHED conditions."
> > 
> > Maybe it means to say "..by re-starting evaluation of match conditions".
> 
> Yeah, my thought there is that it may also be possible that the action
> that would run if the conditions are re-run is a DELETE or a WHEN
> MATCHED THEN DO NOTHING; so saying "by executing an UPDATE" it leaves
> out those possibilities.  IOW if we're evaluating NOT MATCHED INSERT and
> we find a duplicate, we do not go back to MATCHED.

So I propose to leave it as

   If MERGE attempts an INSERT
   and a unique index is present and a duplicate row is concurrently
   inserted, then a uniqueness violation error is raised;
   MERGE does not attempt to avoid such
   errors by restarting evaluation of MATCHED
   conditions.

(Is "re-starting" better than "restarting"?)

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)




Re: [PATCH] Expose port->authn_id to extensions and triggers

2022-08-09 Thread Michael Paquier
On Mon, Aug 08, 2022 at 12:43:14PM +0200, Drouvot, Bertrand wrote:
> but I'm not sure we should do it as a first step (given the fact that this
> is not Port->authn_id that is passed down to the parallel workers in the
> SYSTEM_USER patch).
> 
> What do you think about working on both (aka a) v11-002 only
> ClientConnectionInfo and b) SYSTEM_USER) in parallel?

It seems to me that completing ClientConnectionInfo first has the
advantage of not having to tweak the interface we are going to use for
passing down the full structure to the workers twice, so I would choose
to do it first (with one field for the authn ID, and a second field for
the auth method so that the workers can build SYSTEM_USER by themselves
when required).
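
For illustration only, the shape being discussed might look something like
this minimal sketch; the struct and field names here are assumptions based on
this thread, not a committed definition:

#include "postgres.h"

#include "libpq/hba.h"			/* for UserAuth */

/*
 * Sketch: the connection information that would be shared with parallel
 * workers - the authenticated identity plus the method that produced it -
 * so that a worker can rebuild SYSTEM_USER ("auth_method:authn_id") itself.
 */
typedef struct ClientConnectionInfo
{
	char	   *authn_id;		/* authenticated identity, or NULL if none */
	UserAuth	auth_method;	/* how the connection was authenticated */
} ClientConnectionInfo;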
--
Michael




RE: Perform streaming logical transactions by background workers and parallel apply

2022-08-09 Thread kuroda.hay...@fujitsu.com
Dear Wang,

Thanks for updating the patch set! The following are comments about v20-0001.

1. config.sgml

```
   
Specifies maximum number of logical replication workers. This includes
both apply workers and table synchronization workers.
   
```

I think you can add a description in the above paragraph, like
" This includes apply main workers, apply background workers, and table 
synchronization workers."

2. logical-replication.sgml

2.a Configuration Settings

```
   max_logical_replication_workers must be set to at least
   the number of subscriptions, again plus some reserve for the table
   synchronization.
```

I think you can add a description in the above paragraph, like
"... the number of subscriptions, plus some reserve for the table 
synchronization
 and the streaming transaction."

2.b Monitoring

```
  
   Normally, there is a single apply process running for an enabled
   subscription.  A disabled subscription or a crashed subscription will have
   zero rows in this view.  If the initial data synchronization of any
   table is in progress, there will be additional workers for the tables
   being synchronized.
  
```

I think you can add a sentence to the above paragraph, like
"... synchronized. Moreover, if the streaming transaction is applied in
parallel, there will be additional workers."

3. launcher.c

```
+   /* Sanity check : we don't support table sync in subworker. */
```

I think "Sanity check :" should be "Sanity check:", per other files.

4. worker.c

4.a handle_streamed_transaction()

```
-   /* not in streaming mode */
-   if (!in_streamed_transaction)
+   /* Not in streaming mode */
+   if (!(in_streamed_transaction || am_apply_bgworker()))
```

I think the comment should also mention the apply background worker case.

4.b handle_streamed_transaction()

```
-   Assert(stream_fd != NULL);
```

I think this assertion is still reasonable in the case of stream='on'.
Could you revive it and move it to a later part of the function, like after
subxact_info_add(current_xid)?

4.c apply_handle_prepare_internal()

```
 * BeginTransactionBlock is necessary to balance the EndTransactionBlock
 * called within the PrepareTransactionBlock below.
 */
-   BeginTransactionBlock();
+   if (!IsTransactionBlock())
+   BeginTransactionBlock();
+
```

I think the comment should be "We must be in a transaction block to balance...".

4.d apply_handle_stream_prepare()

```
- *
- * Logic is in two parts:
- * 1. Replay all the spooled operations
- * 2. Mark the transaction as prepared
  */
 static void
 apply_handle_stream_prepare(StringInfo s)
```

I think these comments are useful when stream='on',
so they should be moved to a later part.

5. applybgworker.c

5.a apply_bgworker_setup()

```
+   elog(DEBUG1, "setting up apply worker #%u", 
list_length(ApplyBgworkersList) + 1); 
```

"apply worker" should be "apply background worker".

5.b LogicalApplyBgwLoop()

```
+   elog(DEBUG1, "[Apply BGW #%u] ended processing 
streaming chunk,"
+"waiting on shm_mq_receive", 
shared->worker_id);
```

A space is needed after the comma. I checked the server log, and the message
is output like this:

```
[Apply BGW #1] ended processing streaming chunk,waiting on shm_mq_receive
```
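
For clarity, the corrected call would presumably just add the missing space,
e.g.:

```
			elog(DEBUG1, "[Apply BGW #%u] ended processing streaming chunk,"
				 " waiting on shm_mq_receive",
				 shared->worker_id);
```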

6.

When I started up the apply background worker and did `SELECT * from
pg_stat_subscription`, I got the following lines:

```
postgres=# select * from pg_stat_subscription;
 subid | subname |  pid  | relid | received_lsn |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
-------+---------+-------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
 16400 | sub     | 22383 |       |              | -infinity                     | -infinity                     |                | -infinity
 16400 | sub     | 22312 |       | 0/6734740    | 2022-08-09 07:40:19.367676+00 | 2022-08-09 07:40:19.375455+00 | 0/6734740      | 2022-08-09 07:40:19.367676+00
(2 rows)
```


6.a

It seems that the upper line represents the apply background worker, but I
think last_msg_send_time and last_msg_receipt_time should be null.
Is this an initialization mistake?

```
$ ps aux | grep 22383
... postgres: logical replication apply background worker for subscription 16400
```

6.b

Currently, the documentation doesn't explain how to determine the type of a
logical replication worker.
Could you add a description of that?
I think adding a "subworker" column is an alternative approach.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-08-09 Thread Bharath Rupireddy
On Tue, Aug 9, 2022 at 12:42 PM Kyotaro Horiguchi
 wrote:
>
> > Can you please explain more about your idea, I may be missing something?
>
> (I'm not sure I understand the requirements here..)

I've explained the problem with the current HA setup with synchronous
replication upthread at [1]. Let me reiterate it here.

When a query is cancelled (a simple stroke of CTRL+C or a
pg_cancel_backend() call) while the txn is waiting for an ack in
SyncRepWaitForLSN(), then for the client the txn is committed
(locally-committed-but-not-yet-replicated to all of the sync standbys)
like any other txn. A warning is emitted into the server logs, but it is
of no use to the client (think of clients as applications). There can be
many such txns waiting for an ack in SyncRepWaitForLSN(), and a query
cancel can be issued on all of those sessions. The problem is that
subsequent reads will then be able to see all of those
locally-committed-but-not-yet-replicated txns' data - this is what I call
an inconsistency (can we call it a read-after-write inconsistency?)
caused by the lack of proper query cancel handling. And if the sync
standbys are down or unable to come up for some reason, then until they
do, the primary will be serving clients with this inconsistent data. BTW,
I found a report of this problem here [2].

The solution proposed for the above problem is to just 'not honor
query cancels at all while waiting for ack in SyncRepWaitForLSN()'.

When a proc die is pending, there can also be txns that are
locally-committed-but-not-yet-replicated to all of the sync standbys.
Typically, there are two choices for the clients: 1) reuse the primary
instance after restart, or 2) fail over to one of the sync standbys. For
case (1), there might be read-after-write inconsistency as explained
above. For case (2), those txns might get lost completely if the failover
target sync standby or the new primary didn't receive them, and the other
sync standbys that have received them are now ahead and need special
treatment (running pg_rewind) to be able to connect to the new primary.

The solution proposed for case (1) of the above problem is to 'process
the ProcDiePending immediately and upon restart the first backend can
wait until the sync standbys are caught up to ensure no inconsistent
reads'.
The solution proposed for case (2) of the above problem is to 'either
run pg_rewind for the sync standbys that are ahead or use the idea
proposed at [3]'.

I hope the above explanation helps.

[1] 
https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
[2] 
https://stackoverflow.com/questions/42686097/how-to-disable-uncommited-reads-in-postgres-synchronous-replication
[3] 
https://www.postgresql.org/message-id/CALj2ACX-xO-ZenQt1MWazj0Z3ziSXBMr24N_X2c0dYysPQghrw%40mail.gmail.com

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: shared-memory based stats collector - v70

2022-08-09 Thread Kyotaro Horiguchi
At Mon, 8 Aug 2022 11:53:19 -0400, Greg Stark  wrote in 
> I'm trying to wrap my head around the shared memory stats collector
> infrastructure from
> <20220406030008.2qxipjxo776dw...@alap3.anarazel.de> committed in
> 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.
> 
> I have one question about locking -- afaics there's nothing protecting
> reading the shared memory stats. There is an lwlock protecting
> concurrent updates of the shared memory stats, but that lock isn't
> taken when we read the stats. Are we ok relying on atomic 64-bit reads
> or is there something else going on that I'm missing?
> 
> In particular I'm looking at pgstat.c:847 in pgstat_fetch_entry()
> which does this:
> 
> memcpy(stats_data,
>pgstat_get_entry_data(kind, entry_ref->shared_stats),
>kind_info->shared_data_len);
> 
> stats_data is the returned copy of the stats entry with all the
> statistics in it. But it's copied from the shared memory location
> directly using memcpy and there's no locking or change counter or
> anything protecting this memcpy that I can see.

We take LW_SHARED while creating a snapshot of the fixed-numbered stats,
but we don't for variable-numbered stats.  I agree with you that we need
that for variable-numbered stats as well.

If I'm not missing something, it's strange that pgstat_lock_entry()
only takes LW_EXCLUSIVE. The attached patch changes the interface of
pgstat_lock_entry(), but there's only one user of the new mode since the
other read of a shared stats entry doesn't go through an entry reference.
Thus the interface change might be too much. If I just add a bare
LWLockAcquire/Release() to pgstat_fetch_entry, the patch gets smaller.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..7c4e5f0238 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -844,9 +844,11 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	else
 		stats_data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 		kind_info->shared_data_len);
+	pgstat_lock_entry(entry_ref, LW_SHARED, false);
 	memcpy(stats_data,
 		   pgstat_get_entry_data(kind, entry_ref->shared_stats),
 		   kind_info->shared_data_len);
+	pgstat_unlock_entry(entry_ref);
 
 	if (pgstat_fetch_consistency > PGSTAT_FETCH_CONSISTENCY_NONE)
 	{
@@ -983,9 +985,15 @@ pgstat_build_snapshot(void)
 
 		entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
 		 kind_info->shared_size);
+		/*
+		 * We're directly accessing the shared stats entry here, not going
+		 * through an entry reference, so pgstat_lock_entry() cannot be used.
+		 */
+		LWLockAcquire(&stats_data->lock, LW_SHARED);
 		memcpy(entry->data,
 			   pgstat_get_entry_data(kind, stats_data),
 			   kind_info->shared_size);
+		LWLockRelease(&stats_data->lock);
 	}
 	dshash_seq_term(&hstat);
 
diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
index d9275611f0..fdf4d022c1 100644
--- a/src/backend/utils/activity/pgstat_database.c
+++ b/src/backend/utils/activity/pgstat_database.c
@@ -364,7 +364,7 @@ pgstat_database_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 	pendingent = (PgStat_StatDBEntry *) entry_ref->pending;
 	sharedent = (PgStatShared_Database *) entry_ref->shared_stats;
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 #define PGSTAT_ACCUM_DBCOUNT(item)		\
diff --git a/src/backend/utils/activity/pgstat_function.c b/src/backend/utils/activity/pgstat_function.c
index 427d8c47fc..318db0b3c8 100644
--- a/src/backend/utils/activity/pgstat_function.c
+++ b/src/backend/utils/activity/pgstat_function.c
@@ -200,7 +200,7 @@ pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 
 	/* localent always has non-zero content */
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 	shfuncent->stats.f_numcalls += localent->f_counts.f_numcalls;
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..98dda726db 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -782,7 +782,7 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 		return true;
 	}
 
-	if (!pgstat_lock_entry(entry_ref, nowait))
+	if (!pgstat_lock_entry(entry_ref, LW_EXCLUSIVE, nowait))
 		return false;
 
 	/* add the values to the shared entry. */
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 89060ef29a..fdd20d80a1 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -568,14 +568,14 @@ pgstat_release_entry_ref(PgStat_HashKey key, PgStat_EntryRef *entry_ref,
 }
 
 bool
-pgstat_lock_entry(PgStat_EntryRef *entry_ref, bool nowait)

Re: Using each rel as both outer and inner for JOIN_ANTI

2022-08-09 Thread Richard Guo
On Tue, Aug 2, 2022 at 3:13 PM Richard Guo  wrote:

> On Sun, Jul 31, 2022 at 12:07 AM Tom Lane  wrote:
>
>> I took a quick look through this.  The executor changes are indeed
>> impressively short, but that's largely because you've paid zero
>> attention to updating obsoleted comments.  For example, in
>> nodeHashjoin.c there are lots of references to right/full joins
>> that likely now need to cover right-anti.  I'm not sure that the
>> empty-rel startup optimizations are correct for this case, either.
>
>
> Thanks for the review! Yeah, you're right. I neglected to update the
> related comments. Will do that in the new patch. For the empty-rel
> startup optimizations, since the right-anti join also does null-fill on
> inner relation (the HJ_FILL_INNER case), I think we cannot skip building
> the hash table even when the outer rel is completely empty.
>

Here is the new patch which addresses the obsoleted comments.

Thanks
Richard


v5-0001-Using-each-rel-as-both-outer-and-inner-for-anti-j.patch
Description: Binary data


Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-08-09 Thread Bharath Rupireddy
On Mon, Aug 8, 2022 at 6:10 PM Bharath Rupireddy
 wrote:
>
> I played with a simple insert use-case [1] that generates ~380 WAL
> files, with different block sizes. To my surprise, I have not seen any
> improvement with larger block sizes. I may be doing something wrong
> here, suggestions on how to test and see the benefits are welcome.
>
> > > I think this should also handle the remainder after processing whole
> > > blocks, just for completeness.  If I call the code as presented with size
> > > 8193, I think this code will only write 8192 bytes.
> >
> > Hm, I will fix it.
>
> Fixed.
>
> I'm attaching v5 patch-set. I've addressed review comments received so
> far and fixed a compiler warning that CF bot complained about.
>
> Please review it further.

I tried to vary the zero buffer size to see if there's any huge
benefit for the WAL-generating queries. Unfortunately, I didn't see
any benefit on my dev system (16 vcores, 512GB SSD, 32GB RAM). The use
case I tried is at [1] and the results are at [2].

Having said that, the use of pg_pwritev_with_retry() in walmethods.c
will definitely reduce the number of system calls - on HEAD,
dir_open_for_write() makes pad_to_size/XLOG_BLCKSZ, i.e. 16MB/8KB =
2,048, write() calls, whereas with the patch it makes only 64
pg_pwritev_with_retry() calls with an XLOG_BLCKSZ-sized zero buffer. The
proposed patches thus provide a straight 32x reduction in system calls
(for pg_receivewal and pg_basebackup), apart from the safety against
partial writes.
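
To show the shape of that change, here is a minimal sketch of zero-filling a
segment this way. It is an illustration rather than the patch itself: the
helper name is made up, and it assumes pad_to_size is a whole number of
XLOG_BLCKSZ blocks (the real code also has to handle a remainder).

#include "postgres_fe.h"

#include "common/file_utils.h"
#include "port/pg_iovec.h"

static bool
zero_fill_segment(int fd, size_t pad_to_size)
{
	static PGAlignedXLogBlock zbuffer;	/* static, so all zeroes */
	struct iovec iov[PG_IOV_MAX];
	size_t		remaining = pad_to_size;
	off_t		offset = 0;

	/* Point every iovec at the same zeroed block. */
	for (int i = 0; i < PG_IOV_MAX; i++)
	{
		iov[i].iov_base = zbuffer.data;
		iov[i].iov_len = XLOG_BLCKSZ;
	}

	/* Each call writes up to PG_IOV_MAX blocks, retrying partial writes. */
	while (remaining > 0)
	{
		int			blocks = (int) Min(remaining / XLOG_BLCKSZ,
									   (size_t) PG_IOV_MAX);
		ssize_t		written = pg_pwritev_with_retry(fd, iov, blocks, offset);

		if (written < 0)
			return false;		/* caller reports errno */
		offset += written;
		remaining -= written;
	}
	return true;
}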

[1]
/* built source code with release flags */
./configure --with-zlib --enable-depend --prefix=$PWD/inst/
--with-openssl --with-readline --with-perl --with-libxml CFLAGS='-O2'
> install.log && make -j 8 install > install.log 2>&1 &

\q
./pg_ctl -D data -l logfile stop
rm -rf data

/* ensured that nothing exists in OS page cache */
free -m
sudo su
sync; echo 3 > /proc/sys/vm/drop_caches
exit
free -m

./initdb -D data
./pg_ctl -D data -l logfile start
./psql -d postgres -c 'ALTER SYSTEM SET max_wal_size = "64GB";'
./psql -d postgres -c 'ALTER SYSTEM SET shared_buffers = "8GB";'
./psql -d postgres -c 'ALTER SYSTEM SET work_mem = "16MB";'
./psql -d postgres -c 'ALTER SYSTEM SET checkpoint_timeout = "1d";'
./pg_ctl -D data -l logfile restart
./psql -d postgres -c 'create table foo(bar int);'
./psql -d postgres
\timing
insert into foo select * from generate_series(1, 1);   /* this
query generates about 385 WAL files, no checkpoint hence no recycle of
old WAL files, all new WAL files */

[2]
HEAD
Time: 84249.535 ms (01:24.250)

HEAD with wal_init_zero off
Time: 75086.300 ms (01:15.086)

#define PWRITEV_BLCKSZ XLOG_BLCKSZ
Time: 85254.302 ms (01:25.254)

#define PWRITEV_BLCKSZ (4 * XLOG_BLCKSZ)
Time: 83542.885 ms (01:23.543)

#define PWRITEV_BLCKSZ (16 * XLOG_BLCKSZ)
Time: 84035.770 ms (01:24.036)

#define PWRITEV_BLCKSZ (64 * XLOG_BLCKSZ)
Time: 84749.021 ms (01:24.749)

#define PWRITEV_BLCKSZ (256 * XLOG_BLCKSZ)
Time: 84273.466 ms (01:24.273)

#define PWRITEV_BLCKSZ (512 * XLOG_BLCKSZ)
Time: 84233.576 ms (01:24.234)

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: Data is copied twice when specifying both child and parent table in publication

2022-08-09 Thread Peter Smith
Here are some review comments for the HEAD_v8 patch:

==

1. Commit message

If there are two publications, one of them publish a parent table with
(publish_via_partition_root = true) and another publish child table,
subscribing to both publications from one subscription results in two initial
replications. It should only be copied once.

SUGGESTION (Note**)
If there are two publications - one of them publishing a parent table
(using publish_via_partition_root = true) and the other is publishing
one of the parent's child tables - then subscribing to both
publications from one subscription results in the same initial child
data being copied twice. It should only be copied once.


Note** - I've only reworded the original commit message slightly but
otherwise left it saying the same thing. But I still have doubts whether
this message actually covers all the cases the patch is trying to
address, e.g. the comment (see below) in the 'fetch_table_list'
function seemed to cover more cases than what this commit message
describes.
/*
* Get the list of tables from publisher, the partition table whose
* ancestor is also in this list will be ignored, otherwise the initial
* data in the partition table would be replicated twice.
*/


==

2. src/backend/catalog/pg_publication.c - pg_get_publication_tables

2a.
 /*
- * Returns information of tables in a publication.
+ * Returns information of the tables in the given publication array.
  */

What does "information of the tables" actually mean? Is it just the
names of the tables; is it more than that? IMO a longer, more
explanatory comment will be better here instead of a brief ambiguous
one.


2b.
+ *results = NIL;

This variable name 'results' is too generic, so it is not helpful when
trying to understand the subsequent code logic. Please give this a
meaningful name/description.

2c.
/* Deconstruct the parameter into elements. */

SUGGESTION
Deconstruct the parameter into elements where each element is a
publication name.

2d.
+ List *current_tables = NIL;

I think this is the tables only on the current pub_elem, so I thought
'pub_elem_tables' might make it easier to understand this list's
meaning.

2e.
+ /* Now sort and de-duplicate the result list */
+ list_sort(tables, list_oid_cmp);
+ list_deduplicate_oid(tables);

IMO this comment is confusing because there is another list that is
called the 'results' list, but that is not the same list you are
processing here. Also, it does not really add anything helpful to just
have comments saying almost the same as the function names
(sort/de-duplicate), so please give longer comments to say the reason
*why* the logic does this rather than just describing the steps.

2f.
+ /* Filter by final published table */
+ foreach(lc, results)

Please make this comment more descriptive to explain why/what the
logic is doing.

==

3. src/backend/commands/subscriptioncmds.c - fetch_table_list

3a.
+ bool check_columnlist = (server_version >= 15);

Given the assignment, maybe 'columnlist_supported' is a better name?

3b.
+ /* Get information of the tables belonging to the specified publications */

For  "Get information of..." can you elaborate *what* table
information this is getting and why?

3c.
+ if (server_version >= 16)
+ appendStringInfo(&cmd, "SELECT DISTINCT N.nspname, C.relname,\n"
+ "  ( CASE WHEN (array_length(GPT.attrs, 1) = C.relnatts)\n"
+ " THEN NULL ELSE GPT.attrs END\n"
+ "  ) AS attnames\n"
+ " FROM pg_class C\n"
+ "   JOIN pg_namespace N ON N.oid = C.relnamespace\n"
+ "   JOIN ( SELECT (pg_get_publication_tables(VARIADIC
array_agg(pubname::text))).*\n"
+ "  FROM pg_publication\n"
+ "  WHERE pubname IN ( %s )) as GPT\n"
+ "   ON GPT.relid = C.oid\n",
+ pub_names.data);

AFAICT the main reason for this check was to decide if you can use the
new version of 'pg_get_publication_tables' that supports the VARIADIC
array of pub names or not. If that is correct, then maybe the comment
should describe that reason, or maybe add another bool var similar to
the 'check_columnlist' one for this.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-08-09 Thread Kyotaro Horiguchi
At Mon, 8 Aug 2022 19:13:25 +0530, Bharath Rupireddy 
 wrote in 
> On Fri, Aug 5, 2022 at 8:19 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 26 Apr 2022 08:26:59 +0200, Laurenz Albe  
> > wrote in
> > > While this may mitigate the problem, I don't think it will deal with
> > > all the cases which could cause a transaction to end up committed locally,
> > > but not on the synchronous standby.  I think that only using the full
> > > power of two-phase commit can make this bulletproof.
> > >
> > > Is it worth adding additional complexity that is not a complete solution?
> >
> > I would agree to this. Likewise 2PC, whatever we do to make it
> > perfect, we're left with unresolvable problems at least for now.
> >
> > Doesn't it meet your requirements if we have a means to know the last
> > transaction on the current session is locally committed or aborted?
> >
> > We are already internally managing last committed LSN. I think we can
> > do the same thing about transaction abort and last inserted LSN and we
> > can expose them any way. This is way simpler than the (maybe)
> > uncompletable attempt to fill up the deep gap.
> 
> There can be more txns that are
> locally-committed-but-not-yet-replicated. Even if we have that
> information stored somewhere, what do we do with it? Those txns are
> committed from the client perspective but not committed from the
> server's perspective.
> 
> Can you please explain more about your idea, I may be missing something?

(I'm not sure I understand the requirements here..)

I understand that this is about query cancellation.  In the case of a
primary crash/termination, the client cannot even know whether the commit
of the ongoing transaction, if any, has been recorded.  There is no way
other than to somehow confirm, after restart, that the change made by the
transaction is actually there.  I believe that is the standard practice
for applications that work on HA clusters.

The same is true in the case of query cancellation, since the commit
response doesn't reach the client there either.  But even in this case,
if we had functions/views that tell us the
last-committed/last-aborted/last-inserted LSN of a session, we could know
whether the last transaction has been committed, along with its commit
LSN, perhaps more easily.

# In fact, I see those functions rather as a means to know whether a
# change by the last transaction on a session is available on some
# replica.

For example, the heavily simplified pseudo code below might show how the
functions (if they were functions) would work.

  try {
s.execute("INSERT ..");
c.commit();
  } catch (Exception x) {
c.commit();
if (s.execute("SELECT pg_session_last_committed_lsn() = "
   "pg_session_last_inserted_lsn()"))
{
  /* the transaction has been locally committed */
  if (s.execute("SELECT replay_lsn >= pg_session_last_committed_lsn() "
   "FROM pg_stat_replication where ")
   /* the commit has been replicated to xxx, LSN is known */
} else {
  /* the transaction has not been locally committed */
  
}
  }


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



