Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Kyotaro Horiguchi
At Mon, 14 Mar 2022 09:39:48 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I'll examine the possibility to resolve this...

The existence of nfree and nalloced made me confused, and I found the
reason.

In the case where a partition collects many REUSE-ASSIGN-REMOVEd
elements from other partitions, nfree gets larger than nalloced.  This
is the strange point about the two counters.  nalloced is only ever
referred to as sum(nalloced[]).  So we don't need nalloced on a
per-partition basis, and the formula to calculate the number of used
elements would be as follows.

 sum(nalloced[] - nfree[])
 = <global nalloced> - sum(nfree[])

We rarely create fresh elements in shared hashes, so I don't think
there's additional contention on the <global nalloced> even if it were
a global atomic.

So, the remaining issue is the possible imbalance among partitions.
On second thought, with the current way, if there's a bad deviation in
partition usage, a heavily hit partition eventually collects elements
via get_hash_entry().  With the patch's way, a similar thing happens
via the REUSE-ASSIGN-REMOVE sequence.  Buffers once used for something
won't be freed until buffer invalidation, but bulk buffer invalidation
won't distribute the freed buffers unevenly among partitions.  So I
conclude for now that this is a non-issue.

So my opinion on the counters is:

I'd like to ask you to remove nalloced from the partitions and add a
global atomic for the same use.

No need to do something for the possible deviation issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-13 Thread Bharath Rupireddy
On Mon, Mar 14, 2022 at 10:45 AM Michael Paquier  wrote:
>
> On Sun, Mar 13, 2022 at 02:58:58PM -0700, Nathan Bossart wrote:
> > On Sun, Mar 13, 2022 at 01:54:10PM +0530, Bharath Rupireddy wrote:
> >> Another thing I added in v2 is to not emit snapshot and mapping files
> >> stats in case of restartpoint as logical decoding isn't supported on
> >> standbys, so it doesn't make sense to emit the stats there and cause
> >> server log to grow unnecessarily. Having said that, I added a note
> >> there to change it whenever logical decoding on standbys is supported.
> >
> > I think we actually do want to include this information for restartpoints
> > since some files might be left over from the snapshot that was used to
> > create the standby.  Also, presumably these functions could do some work
> > during recovery on a primary.
>
> Yes, I would agree that consistency makes sense here, and it is not
> complicated to add the code to support this code path anyway.  There
> is a risk that folks working on logical decoding on standbys overlook
> this code path, instead.

I agree to be consistent and emit the message even in case of restartpoint.

> > Another problem I see is that this patch depends on the return value of the
> > lstat() calls that we are trying to remove in 0001 from another thread [0].
> > I think the size of the removed/sync'd files is somewhat useful for
> > understanding disk space usage, but I suspect the time spent performing
> > these tasks is more closely related to the number of files.  Do you think
> > reporting the sizes is worth the extra system call?
>
> We are not talking about files that are large either, are we?
>
> Another thing I am a bit annoyed with in this patch is the fact that
> the size of the ereport() call is doubled.  The LOG currently
> generated is already bloated, and this does not arrange things.

Yes, this is a concern. Also, when there are no logical replication
slots on a plain server, or the server has removed or cleaned up all
the snapshot/mapping files, why would anyone want to have these
messages with all 0s in the server log?

Here's what I'm thinking:

Leave the existing "checkpoint/restartpoint complete" messages intact,
and add the following in LogCheckpointEnd:

if (CheckpointStats.repl_map_files_rmvd_cnt ||
    CheckpointStats.repl_map_files_syncd_cnt ||
    CheckpointStats.repl_snap_files_rmvd_cnt)
{
    ereport(LOG,
            (errmsg("snapbuild snapshot file(s) removed=" UINT64_FORMAT
                    ", size=%zu bytes, time=%ld.%03d s, cutoff LSN=%X/%X; "
                    "logical rewrite mapping file(s) removed=" UINT64_FORMAT
                    ", size=%zu bytes, synced=" UINT64_FORMAT
                    ", size=%zu bytes, time=%ld.%03d s, cutoff LSN=%X/%X",
                    CheckpointStats.repl_snap_files_rmvd_cnt,
                    CheckpointStats.repl_snap_files_rmvd_sz,
                    repl_snap_msecs / 1000, (int) (repl_snap_msecs % 1000),
                    LSN_FORMAT_ARGS(CheckpointStats.repl_snap_cutoff_lsn),
                    CheckpointStats.repl_map_files_rmvd_cnt,
                    CheckpointStats.repl_map_files_rmvd_sz,
                    CheckpointStats.repl_map_files_syncd_cnt,
                    CheckpointStats.repl_map_files_syncd_sz,
                    repl_map_msecs / 1000, (int) (repl_map_msecs % 1000),
                    LSN_FORMAT_ARGS(CheckpointStats.repl_map_cutoff_lsn))));
}

Thoughts?

Regards,
Bharath Rupireddy.




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-13 Thread Michael Paquier
On Sun, Mar 13, 2022 at 02:58:58PM -0700, Nathan Bossart wrote:
> On Sun, Mar 13, 2022 at 01:54:10PM +0530, Bharath Rupireddy wrote:
>> Another thing I added in v2 is to not emit snapshot and mapping files
>> stats in case of restartpoint as logical decoding isn't supported on
>> standbys, so it doesn't make sense to emit the stats there and cause
>> server log to grow unnecessarily. Having said that, I added a note
>> there to change it whenever logical decoding on standbys is supported.
> 
> I think we actually do want to include this information for restartpoints
> since some files might be left over from the snapshot that was used to
> create the standby.  Also, presumably these functions could do some work
> during recovery on a primary.

Yes, I would agree that consistency makes sense here, and it is not
complicated to add the code to support this code path anyway.  There 
is a risk that folks working on logical decoding on standbys overlook
this code path, instead.

> Another problem I see is that this patch depends on the return value of the
> lstat() calls that we are trying to remove in 0001 from another thread [0].
> I think the size of the removed/sync'd files is somewhat useful for
> understanding disk space usage, but I suspect the time spent performing
> these tasks is more closely related to the number of files.  Do you think
> reporting the sizes is worth the extra system call?

We are not talking about files that are large either, are we?

Another thing I am a bit annoyed with in this patch is the fact that
the size of the ereport() call is doubled.  The LOG currently
generated is already bloated, and this does not arrange things.
--
Michael




Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2022-03-13 Thread Michael Paquier
On Wed, Mar 09, 2022 at 10:50:45AM -0600, Justin Pryzby wrote:
> I also changed pg_ls_dir_recurse() to handle concurrent removal of a dir,
> which I noticed caused an infrequent failure on CI.  However I'm not
> including that here, since the 2nd half of the patch set seems like it
> isn't ready due to lstat() on windows.

lstat() has been a subject of many issues over the years with our
internal emulation and issues related to its concurrency, but we use
it in various areas of the in-core code, so that does not sound like
an issue to me.  It depends on what you want to do with it in
genfile.c and which data you'd expect, in addition to the detection of
junction points for WIN32, I guess.  v34 has no references to
pg_ls_dir_recurse(), but that can be done with a WITH RECURSIVE, so we
would not really need it, would we?

@@ -27618,7 +27618,7 @@ SELECT 
convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 Returns a record containing the file's size, last access time stamp,
 last modification time stamp, last file status change time stamp (Unix
 platforms only), file creation time stamp (Windows only), and a flag
-indicating if it is a directory.
+indicating if it is a directory (or a symbolic link to a directory).


 This function is restricted to superusers by default, but other users

This is from 0001, and this addition in the documentation is not
completely right.  As pg_stat_file() uses stat() to get back the
information of a file/directory, we'd just follow the link if
specifying one in the input argument.  We could say instead, if we
were to improve the docs, that "If filename is a link, this function
returns information about the file or directory the link refers to."
I would put that as a different paragraph.

+select * from pg_ls_archive_statusdir() limit 0;
+ name | size | modification 
+--+--+--
+(0 rows)

FWIW, this one is fine thanks to ValidateXLOGDirectoryStructure(),
which makes sure that archive_status exists before any connection to
the cluster is attempted.

> +select * from pg_ls_logdir() limit 0;

This test on pg_ls_logdir() would fail if running installcheck on a
cluster that has logging_collector disabled.  So this cannot be
included.

+select * from pg_ls_logicalmapdir() limit 0;
+select * from pg_ls_logicalsnapdir() limit 0;
+select * from pg_ls_replslotdir('') limit 0;
+select * from pg_ls_tmpdir() limit 0;
+select * from pg_ls_waldir() limit 0;
+select * from pg_stat_file('.') limit 0;

The rest of the patch set should be stable AFAIK; there are various
steps when running a checkpoint that make sure that any of these
exist, without caring about the value of wal_level.

+   
+For each file in the specified directory, list the file and its
+metadata.
+Restricted to superusers by default, but other users can be granted
+EXECUTE to run the function.
+   

What is metadata in this case?  (I have read the code and know what
you mean, but folks only looking at the documentation may be puzzled
by that.)  It could be cleaner to use the same tupledesc for all
callers of this function, and return NULL in the cases where these are
not applicable.

+   /* check the optional arguments */
+   if (PG_NARGS() == 3)
+   {
+   if (!PG_ARGISNULL(1))
+   {
+   if (PG_GETARG_BOOL(1))
+   flags |= LS_DIR_MISSING_OK;
+   else
+   flags &= ~LS_DIR_MISSING_OK;
+   }
+
+   if (!PG_ARGISNULL(2))
+   {
+   if (PG_GETARG_BOOL(2))
+   flags &= ~LS_DIR_SKIP_DOT_DIRS;
+   else
+   flags |= LS_DIR_SKIP_DOT_DIRS;
+   }
+   }

The subtle difference between the false and true code paths of those
arguments 1 and 2 had better be explained.  The bit-wise operations
are slightly different in the two cases, so it is not clear which part
does what, or what the purpose of this logic is.
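For illustration, here is one way the intent could be spelled out — a
hypothetical standalone restatement, not the patch's actual code.  The
asymmetry exists because argument 1 (missing_ok) has the same polarity as
LS_DIR_MISSING_OK, while argument 2 (include_dot_dirs) is the inverse of
LS_DIR_SKIP_DOT_DIRS:

```c
#include <assert.h>
#include <stdbool.h>

#define LS_DIR_MISSING_OK      (1 << 0)
#define LS_DIR_SKIP_DOT_DIRS   (1 << 1)

/*
 * Hypothetical restatement of the argument handling: each non-NULL
 * boolean argument overrides the corresponding default.  Argument 2 is
 * inverted because the SQL-level argument is "include_dot_dirs" while
 * the internal flag means "skip dot dirs".
 */
static int
compute_flags(int defaults,
              bool missing_ok_isnull, bool missing_ok,
              bool include_dot_dirs_isnull, bool include_dot_dirs)
{
    int flags = defaults;

    if (!missing_ok_isnull)
    {
        if (missing_ok)
            flags |= LS_DIR_MISSING_OK;
        else
            flags &= ~LS_DIR_MISSING_OK;
    }

    if (!include_dot_dirs_isnull)
    {
        if (include_dot_dirs)
            flags &= ~LS_DIR_SKIP_DOT_DIRS;     /* include => don't skip */
        else
            flags |= LS_DIR_SKIP_DOT_DIRS;
    }

    return flags;
}
```

Factoring it this way makes the polarity inversion visible in one place
instead of being implied by which branch sets and which branch clears.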

-   SetSingleFuncCall(fcinfo, 0);
+   /* isdir depends on metadata */
+   Assert(!(flags&LS_DIR_ISDIR) || (flags&LS_DIR_METADATA));
+   /* Unreasonable to show isdir and skip dirs */
+   Assert(!(flags&LS_DIR_ISDIR) || !(flags&LS_DIR_SKIP_DIRS));

Incorrect code format.  Spaces required.

+-- This tests the missing_ok parameter, which causes pg_ls_tmpdir to succeed even if the tmpdir doesn't exist yet
+-- The name='' condition is never true, so the function runs to completion but returns zero rows.
+-- The query is written to ERROR if the tablespace doesn't exist, rather than silently failing to call pg_ls_tmpdir()
+SELECT c.* FROM (SELECT oid FROM pg_tablespace b WHERE b.spcname='regress_tblspace' UNION SELECT 0 ORDER BY 1 DESC LIMIT 1) AS b , pg_ls_tmpdir(oid) AS c WHERE c.name='Does not exist';

So, here, we have a test that may not actually test what we want to
test.

Hmm.  I am not convinced that we need a new set of SQL functions as
presented in 0003 (addition of meta-data in pg_ls_dir()), and
extensions of 0004 (do the same but for pg_ls_tmpdir) and 

Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-03-13 Thread Peter Geoghegan
On Fri, Feb 25, 2022 at 5:52 PM Peter Geoghegan  wrote:
> There is an important practical way in which it makes sense to treat
> 0001 as separate to 0002. It is true that 0001 is independently quite
> useful. In practical terms, I'd be quite happy to just get 0001 into
> Postgres 15, without 0002. I think that that's what you meant here, in
> concrete terms, and we can agree on that now.

Attached is v10. While this does still include the freezing patch,
it's not in scope for Postgres 15. As I've said, I still think that it
makes sense to maintain the patch series with the freezing stuff,
since it's structurally related. So, to be clear, the first two
patches from the patch series are in scope for Postgres 15. But not
the third.

Highlights:

* Changes to terminology and commit messages along the lines suggested
by Andres.

* Bug fixes to heap_tuple_needs_freeze()'s MultiXact handling. My
testing strategy here still needs work.

* Expanded refactoring by v10-0002 patch.

The v10-0002 patch (which appeared for the first time in v9) was
originally all about fixing a case where non-aggressive VACUUMs were
at a gratuitous disadvantage (relative to aggressive VACUUMs) around
advancing relfrozenxid -- very much like the lazy_scan_noprune work
from commit 44fa8488. And that is still its main purpose. But the
refactoring now seems related to Andres' idea of making non-aggressive
VACUUMs decides to scan a few extra all-visible pages in order to be
able to advance relfrozenxid.

The code that sets up skipping the visibility map is made a lot
clearer by v10-0002. That patch moves a significant amount of code
from lazy_scan_heap() into a new helper routine (so it continues the
trend started by the Postgres 14 work that added lazy_scan_prune()).
Now skipping a range of visibility map pages is fundamentally based on
setting up the range up front, and then using the same saved details
about the range thereafter -- we don't have anymore ad-hoc
VM_ALL_VISIBLE()/VM_ALL_FROZEN() calls for pages from a range that we
already decided to skip (so no calls to those routines from
lazy_scan_heap(), at least not until after we finish processing in
lazy_scan_prune()).

This is more or less what we were doing all along for one special
case: aggressive VACUUMs. We had to make sure to either increment
frozenskipped_pages or increment scanned_pages for every page from
rel_pages -- this issue is described by lazy_scan_heap() comments on
HEAD that begin with "Tricky, tricky." (these date back to the freeze
map work from 2016). Anyway, there is no reason to not go further with
that: we should make whole ranges the basic unit that we deal with
when skipping. It's a lot simpler to think in terms of entire ranges
(not individual pages) that are determined to be all-visible or
all-frozen up-front, without needing to recheck anything (regardless
of whether it's an aggressive VACUUM).

We don't need to track frozenskipped_pages this way. And it's much
more obvious that it's safe for more complicated cases, in particular
for aggressive VACUUMs.

This kind of approach seems necessary to make non-aggressive VACUUMs
do a little more work opportunistically, when they realize that they
can advance relfrozenxid relatively easily that way (which I believe
Andres favors as part of overhauling freezing). That becomes a lot
more natural when you have a clear and unambiguous separation between
deciding what range of blocks to skip, and then actually skipping. I
can imagine the new helper function added by v10-0002 (which I've
called lazy_scan_skip_range()) eventually being taught to do these
kinds of tricks.

In general I think that all of the details of what to skip need to be
decided up front. The loop in lazy_scan_heap() should execute skipping
based on the instructions it receives from the new helper function, in
the simplest way possible. The helper function can become more
intelligent about the costs and benefits of skipping in the future,
without that impacting lazy_scan_heap().
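A toy model of that structure (hypothetical names, not the patch's code;
aggressive-mode handling and the SKIP_PAGES_THRESHOLD heuristic are
deliberately omitted): the helper decides a whole skippable range up front
from the visibility map, and the main loop just consumes that decision
instead of re-checking pages ad hoc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Toy stand-in for VM_ALL_VISIBLE(): a plain bitmap of all-visible pages. */
static bool
vm_all_visible(const bool *vm, BlockNumber blk)
{
    return vm[blk];
}

/*
 * Hypothetical lazy_scan_skip_range(): starting at next_block, find the
 * range [*start, *end) of consecutive all-visible pages that can be
 * skipped.  Returns false if no skippable range starts here.
 */
static bool
lazy_scan_skip_range(const bool *vm, BlockNumber rel_pages,
                     BlockNumber next_block,
                     BlockNumber *start, BlockNumber *end)
{
    BlockNumber blk = next_block;

    if (blk >= rel_pages || !vm_all_visible(vm, blk))
        return false;

    *start = blk;
    while (blk < rel_pages && vm_all_visible(vm, blk))
        blk++;
    *end = blk;
    return true;
}

/* The main loop only executes the decision; it never re-checks the VM
 * for pages inside a range it already decided to skip. */
static BlockNumber
count_scanned(const bool *vm, BlockNumber rel_pages)
{
    BlockNumber scanned = 0;

    for (BlockNumber blk = 0; blk < rel_pages;)
    {
        BlockNumber start, end;

        if (lazy_scan_skip_range(vm, rel_pages, blk, &start, &end))
            blk = end;          /* whole range skipped as a unit */
        else
        {
            scanned++;          /* "scan" this page */
            blk++;
        }
    }
    return scanned;
}
```

With ranges as the unit, every page is accounted for exactly once as
either scanned or part of a skipped range, which is the invariant the
"Tricky, tricky." comments currently enforce by hand.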

--
Peter Geoghegan
From 43ab00609392ed7ad31be491834bdac348e13653 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v10 3/3] Make page-level characteristics drive freezing.

Teach VACUUM to freeze all of the tuples on a page whenever it notices
that it would otherwise mark the page all-visible, without also marking
it all-frozen.  VACUUM typically won't freeze _any_ tuples on the page
unless _all_ tuples (that remain after pruning) are all-visible.  This
makes the overhead of vacuuming much more predictable over time.  We
avoid the need for large balloon payments during aggressive VACUUMs
(typically anti-wraparound autovacuums).  Freezing is proactive, so
we're much less likely to get into "freezing debt".

The new approach to freezing also enables relfrozenxid advancement in
non-aggressive VACUUMs, which might be enough to avoid aggressive
VACUUMs altogether (with many individual tables/workloads).  While the
non-aggressive 

Re: Allow async standbys wait for sync replication

2022-03-13 Thread Kyotaro Horiguchi
At Mon, 14 Mar 2022 11:30:02 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Sat, 12 Mar 2022 14:33:32 -0800, Nathan Bossart  
> wrote in 
> > On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote:
> > > To me it's architecturally the completely wrong direction. We should move
> > > in the *other* direction, i.e. allow WAL to be sent to standbys before the
> > > primary has finished flushing it locally. Which requires similar
> > > infrastructure to what we're discussing here.
> > 
> > I think this is a good point.  After all, WALRead() has the following
> > comment:
> > 
> >  * XXX probably this should be improved to suck data directly from the
> >  * WAL buffers when possible.
> > 
> > Once you have all the infrastructure for that, holding back WAL replay on
> > async standbys based on synchronous replication might be relatively easy.

Just to make sure, and a bit off the topic: I think the optimization
itself is quite promising and I want to have it.
> That is, (as I understand it) async standbys would be required to
> allow overwriting existing unreplayed records after reconnection.
> But, putting aside how to remember that LSN, if that happens at a
> segment boundary, the async replica may run into a situation similar
> to the missing-contrecord case.  But a standby cannot insert any
> original record to get out of that situation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Allow async standbys wait for sync replication

2022-03-13 Thread Kyotaro Horiguchi
At Sat, 12 Mar 2022 14:33:32 -0800, Nathan Bossart  
wrote in 
> On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote:
> > To me it's architecturally the completely wrong direction. We should move in
> > the *other* direction, i.e. allow WAL to be sent to standbys before the
> > primary has finished flushing it locally. Which requires similar
> > infrastructure to what we're discussing here.
> 
> I think this is a good point.  After all, WALRead() has the following
> comment:
> 
>  * XXX probably this should be improved to suck data directly from the
>  * WAL buffers when possible.
> 
> Once you have all the infrastructure for that, holding back WAL replay on
> async standbys based on synchronous replication might be relatively easy.

That is, (as I understand it) async standbys would be required to
allow overwriting existing unreplayed records after reconnection.
But, putting aside how to remember that LSN, if that happens at a
segment boundary, the async replica may run into a situation similar
to the missing-contrecord case.  But a standby cannot insert any
original record to get out of that situation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: On login trigger: take three

2022-03-13 Thread Andres Freund
On 2022-03-13 20:35:44 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I was thinking that the way to use it would be to specify it as a client
> > option. Like PGOPTIONS='-c ignore_event_trigger=login' psql.
> 
> Ugh ... that would allow people (at least superusers) to bypass
> the login trigger at will, which seems to me to break a lot of
> the use-cases for the feature.  I supposed we'd want this to be a
> PGC_POSTMASTER setting for security reasons.

Shrug. This doesn't seem to add actual security to me.




Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-13 Thread Kyotaro Horiguchi
At Fri, 11 Mar 2022 15:39:13 -0500, Robert Haas  wrote 
in 
> On Thu, Mar 10, 2022 at 9:38 PM Kyotaro Horiguchi
>  wrote:
> > It seems to me too rigorous that pg_get_wal_records_info/stats()
> > reject future LSNs as end-LSN and I think WARNING or INFO and stop at
> > the real end-of-WAL is more kind to users.  I think the same with the
> > restriction that start and end LSN are required to be different.
> 
> In his review just yesterday, Jeff suggested this: "Don't issue
> WARNINGs or other messages for ordinary situations, like when
> pg_get_wal_records_info() hits the end of WAL." I think he's entirely
> right, and I don't think any patch that does otherwise should get

It depends on what we think is "ordinary" here.  If we expect that the
specified LSN range may not be fully filled, the case above is
ordinary and needs no WARNING nor INFO.  I'm fine with that definition
here.

> committed. It is worth remembering that the results of queries are
> often examined by something other than a human being sitting at a psql
> terminal. Any tool that uses this is going to want to understand what
> happened from the result set, not by parsing strings that may show up
> inside warning messages.

Right. I don't think it is right that a WARNING is required to
evaluate the result.  And I think that a WARNING like 'reached
end-of-wal before end LSN' is the kind that is not required for
evaluating the result, since each WAL row contains at least its start
LSN.

> I think that the right answer here is to have a function that returns
> one row per record parsed, and each row should also include the start
> and end LSN of the record. If for some reason the WAL records return
> start after the specified start LSN (e.g. because we skip over a page
> header) or end before the specified end LSN (e.g. because we reach
> end-of-WAL) the user can figure it out from looking at the LSNs in the
> output rows and comparing them to the LSNs provided as input.

I agree with you here.
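The consumer-side check being agreed on could look like this (a
hypothetical sketch; the struct and function names are invented, not
pg_walinspect's API): compare the LSNs in the returned rows against the
requested bounds instead of parsing any warning text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* One output row: the start and end LSN of a parsed record. */
typedef struct RecordRow
{
    XLogRecPtr start_lsn;
    XLogRecPtr end_lsn;
} RecordRow;

/*
 * True if the result set stopped short of the requested end LSN,
 * e.g. because the reader hit end-of-WAL before reaching it.
 */
static bool
stopped_before_end(const RecordRow *rows, int nrows, XLogRecPtr req_end)
{
    if (nrows == 0)
        return true;
    return rows[nrows - 1].end_lsn < req_end;
}
```

A tool consuming the function's rows can make this determination purely
from the result set, which is the point of including both LSNs per row.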

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add index scan progress to pg_stat_progress_vacuum

2022-03-13 Thread Masahiko Sawada
On Sat, Mar 12, 2022 at 4:00 PM Imseih (AWS), Sami  wrote:
>
> > nitpick: Can we remove the extra spaces in the parentheses?
>
> fixed
>
> > What does it mean if there isn't an entry in the map?  Is this actually
> > expected, or should we ERROR instead?
>
> I cleaned up the code here and added comments.
>
> > I think the number of entries should be shared between
> > VacuumWorkerProgressShmemInit() and VacuumWorkerProgressShmemSize().
> > Otherwise, we might update one and not the other.
>
> Fixed
>
> > I think we should elaborate a bit more in this comment.  It's difficult to
> > follow what this is doing without referencing the comment above
> > set_vacuum_worker_progress().
>
> More comments added
>
> I also simplified the 0002 patch as well.

I'm still unsure whether the current design of the 0001 patch is
better than the other approaches we've discussed. Even users who don't
use parallel vacuum are forced to allocate shared memory for index
vacuum progress, with GetMaxBackends() entries from the beginning.
Also, we're likely to extend the progress tracking feature to other
parallel operations in the future, but I think the current design is
not extensible. If we want to do that, we will end up creating similar
things for each of them, or re-creating the index vacuum progress
tracking feature while creating a common infrastructure. It might not
be a problem as of now, but I'm concerned that introducing a feature
that is not extensible and forces users to allocate additional shmem
might become a blocker in the future. Looking at a precedent example:
when we introduced the progress tracking feature, we implemented it in
an extensible way. On the other hand, others in this thread seem to
agree with this approach, so I'd like to leave it to the committers.

Anyway, here are some comments on v5-0001 patch:

+/* in commands/vacuumparallel.c */
+extern void VacuumWorkerProgressShmemInit(void);
+extern Size VacuumWorkerProgressShmemSize(void);
+extern void vacuum_worker_end(int leader_pid);
+extern void vacuum_worker_update(int leader_pid);
+extern void vacuum_worker_end_callback(int code, Datum arg);
+extern void set_vaccum_worker_progress(Datum *values);

These functions' bodies are not in vacuumparallel.c.  As the comment
says, I think these functions should be implemented in vacuumparallel.c.

---
+/*
+ * set_vaccum_worker_progress --- updates the number of indexes that have been
+ * vacuumed or cleaned up in a parallel vacuum.
+ */
+void
+set_vaccum_worker_progress(Datum *values)

s/vaccum/vacuum/

---
+void
+set_vaccum_worker_progress(Datum *values)
+{
+VacProgressEntry *entry;
+int leader_pid = values[0];

I think we should use DatumGetInt32().

---
+entry = (VacProgressEntry *) hash_search(VacuumWorkerProgressHash,
+ &leader_pid, HASH_ENTER_NULL,
+ &found);
+
+if (!entry)
+elog(ERROR, "cannot allocate shared memory for vacuum worker progress");

Since we raise an error in case of out of memory, I think we can use
HASH_ENTER instead of HASH_ENTER_NULL. Or do we want to emit a
detailed error message here?

---
+   VacuumWorkerProgressHash = NULL;

This line is not necessary.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Kyotaro Horiguchi
At Fri, 11 Mar 2022 12:34:32 +0300, Yura Sokolov  
wrote in 
> On Fri, 11/03/2022 at 15:49 +0300, Kyotaro Horiguchi wrote:
> > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi 
>  > BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
> > 
> > BufTableDelete considers both reuse and !reuse cases but
> > BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> > odd. We should use HASH_ENTER here.  Thus I think it is more
> > reasonable that HASH_ENTRY uses the stashed entry if exists and
> > needed, or returns it to freelist if exists but not needed.
> > 
> > What do you think about this?
> 
> Well... I don't like it but I don't mind either.
> 
> Code in HASH_ENTER and HASH_ASSIGN cases differs much.
> On the other hand, probably it is possible to merge it carefuly.
> I'll try.

Honestly, I'm not sure it wins on performance basis. It just came from
interface consistency (mmm. a bit different, maybe.. convincibility?).

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Kyotaro Horiguchi
At Fri, 11 Mar 2022 11:30:27 +0300, Yura Sokolov  
wrote in 
> On Fri, 11/03/2022 at 15:30 +0900, Kyotaro Horiguchi wrote:
> > At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov  
> > wrote in 
> > > On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:
> > > > Ok, here is v4.
> > > 
> > > And here is v5.
> > > 
> > > First, there was compilation error in Assert in dynahash.c .
> > > Excuse me for not checking before sending previous version.
> > > 
> > > Second, I add third commit that reduces HASHHDR allocation
> > > size for non-partitioned dynahash:
> > > - moved freeList to last position
> > > - alloc and memset offset(HASHHDR, freeList[1]) for
> > >   non-partitioned hash tables.
> > > I didn't benchmark it, but I will be surprised if it
> > > matters much in a performance sense.
> > > 
> > > Third, I put all three commits into single file to not
> > > confuse commitfest application.
> > 
> > Thanks!  I looked into dynahash part.
> > 
> >  struct HASHHDR
> >  {
> > -   /*
> > -* The freelist can become a point of contention in high-concurrency hash
> > 
> > Why did you move around the freeList?
> > 
> > 
> > -   long        nentries;   /* number of entries in associated buckets */
> > +   long        nfree;      /* number of free entries in the list */
> > +   long        nalloced;   /* number of entries initially allocated for
> > 
> > Why do we need nfree?  HASH_ASSIGN should do the same thing as
> > HASH_REMOVE.  Maybe the reason is the code tries to put the detached
> > bucket on a different free list, but we can just remember the
> > freelist_idx for the detached bucket as we do for hashp.  I think that
> > should largely reduce the footprint of this patch.
> 
> If we keep nentries, then we need to fix nentries in both the old
> "freeList" partition and the new one. That is two
> freeList[partition]->mutex lock+unlock pairs.
>
> But the count of free elements doesn't change, so if we change
> nentries to nfree, then there is no need to fix the
> freeList[partition]->nfree counters, and no need to lock+unlock.

Ah, okay. I missed that bucket reuse changes the key in most cases.

But I still don't think it's good to move entries around partition
freelists, for another reason.  I'm afraid that the freelists could
get into an imbalanced state.  get_hash_entry prefers main shmem
allocation over other freelists, so that could lead to freelist bloat,
or worse contention than the traditional way involving more than two
partitions.
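The locking difference under discussion can be sketched like this (a
hypothetical struct, not dynahash's real one; lock/unlock are trivial
stand-ins for the partition mutex): with per-partition nentries, a reuse
whose new key lands in another partition must fix two counters under two
mutexes, while with nfree the number of free entries is unchanged by the
move, so neither partition's counter needs touching.

```c
#include <assert.h>

typedef struct Freelist
{
    int  mutex_locked;  /* stand-in for the partition mutex */
    long nentries;      /* scheme A: live entries per partition */
    long nfree;         /* scheme B: free entries per partition */
} Freelist;

static void lock_fl(Freelist *fl)   { fl->mutex_locked = 1; }
static void unlock_fl(Freelist *fl) { fl->mutex_locked = 0; }

/* Scheme A: reusing a bucket under a new key in another partition
 * requires two lock/unlock pairs to keep both nentries correct. */
static void
reuse_with_nentries(Freelist *from, Freelist *to)
{
    lock_fl(from);
    from->nentries--;
    unlock_fl(from);

    lock_fl(to);
    to->nentries++;
    unlock_fl(to);
}

/* Scheme B: the entry stays allocated and stays non-free throughout the
 * REUSE-ASSIGN-REMOVE sequence, so no free counter changes at all. */
static void
reuse_with_nfree(Freelist *from, Freelist *to)
{
    (void) from;
    (void) to;          /* nothing to update, nothing to lock */
}
```

The flip side, as noted above, is that scheme B silently migrates
elements between partition freelists, which is where the imbalance
concern comes from.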

I'll examine the possibility to resolve this...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: On login trigger: take three

2022-03-13 Thread Tom Lane
Andres Freund  writes:
> I was thinking that the way to use it would be to specify it as a client
> option. Like PGOPTIONS='-c ignore_event_trigger=login' psql.

Ugh ... that would allow people (at least superusers) to bypass
the login trigger at will, which seems to me to break a lot of
the use-cases for the feature.  I supposed we'd want this to be a
PGC_POSTMASTER setting for security reasons.

regards, tom lane




Re: On login trigger: take three

2022-03-13 Thread Andres Freund
Hi,

On 2022-03-13 19:57:08 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-03-13 23:31:03 +0100, Daniel Gustafsson wrote:
> >> Something like a '-c ignore_event_trigger=' GUC perhaps?
>
> > Did you mean login instead of event?
>
> > Something like it would work for me. It probably should be
> > GUC_DISALLOW_IN_FILE?
>
> Why?  Inserting such a setting in postgresql.conf and restarting
> would be the first thing I'd think of if I needed to get out
> of a problem.  The only other way is to set it on the postmaster
> command line, which is going to be awkward-to-impossible with
> most system-provided start scripts.

I was thinking that the way to use it would be to specify it as a client
option. Like PGOPTIONS='-c ignore_event_trigger=login' psql.

Greetings,

Andres Freund




Re: On login trigger: take three

2022-03-13 Thread Tom Lane
Andres Freund  writes:
> On 2022-03-13 23:31:03 +0100, Daniel Gustafsson wrote:
>> Something like a '-c ignore_event_trigger=' GUC perhaps? 

> Did you mean login instead of event?

> Something like it would work for me. It probably should be
> GUC_DISALLOW_IN_FILE?

Why?  Inserting such a setting in postgresql.conf and restarting
would be the first thing I'd think of if I needed to get out
of a problem.  The only other way is to set it on the postmaster
command line, which is going to be awkward-to-impossible with
most system-provided start scripts.

regards, tom lane




Re: Tab completion not listing schema list for create/alter publication for all tables in schema

2022-03-13 Thread Tom Lane
vignesh C  writes:
> Here "pg_%" should be "pg_%%".

Right you are.  Patch pushed, thanks!

regards, tom lane




Re: On login trigger: take three

2022-03-13 Thread Andres Freund
Hi,

On 2022-03-13 23:31:03 +0100, Daniel Gustafsson wrote:
> > On 12 Mar 2022, at 03:46, Andres Freund  wrote:
> 
> >> +   
> >> + The login event occurs when a user logs in to the
> >> + system.
> >> + Any bugs in a trigger procedure for this event may prevent successful
> >> + login to the system. Such bugs may be fixed after first restarting the
> >> + system in single-user mode (as event triggers are disabled in this mode).
> >> + See the  reference page for details about
> >> + using single-user mode.
> >> +   
> > 
> > I'm strongly against adding any new dependencies on single user mode.
> > 
> > A saner approach might be a superuser-only GUC that can be set as part of 
> > the
> > connection data (e.g. PGOPTIONS='-c ignore_login_event_trigger=true').
> 
> This, and similar approaches, has been discussed in this thread.  I'm not a 
> fan
> of holes punched with GUC's like this, but if you plan on removing single-user
> mode (as I recall seeing in an initdb thread) altogether then that kills the
> discussion. So.

Even if we end up not removing single user mode, I think it's not OK to add
new failure modes that require single user mode to resolve after not-absurd
operations (I'm ok with needing single user mode if somebody does delete from
pg_authid). It's too hard to reach for most.

We already have GUCs like row_security, so it doesn't seem insane to add one
that disables login event triggers. What is the danger that you see?


> Since we already recommend single-user mode to handle broken event triggers, 
> we
> should IMO do something to cover all of them rather than risk ending up with
> one disabling GUC per each event type.  Right now we have this on the CREATE
> EVENT TRIGGER reference page:

IMO the other types of event triggers make it a heck of a lot harder to get
yourself into a situation that you can't get out of...


> "Event triggers are disabled in single-user mode (see postgres).  If an
> erroneous event trigger disables the database so much that you can't even
> drop the trigger, restart in single-user mode and you'll be able to do
> that."
> Something like a '-c ignore_event_trigger=' GUC perhaps? 

Did you mean login instead of event?

Something like it would work for me. It probably should be
GUC_DISALLOW_IN_FILE?

Greetings,

Andres Freund




Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Zhihong Yu
On Sun, Mar 13, 2022 at 3:27 PM Yura Sokolov 
wrote:

> В Вс, 13/03/2022 в 07:05 -0700, Zhihong Yu пишет:
> >
> > Hi,
> > In the description:
> >
> > There is no need to hold both lock simultaneously.
> >
> > both lock -> both locks
>
> Thanks.
>
> > +* We also reset the usage_count since any recency of use of the old
> >
> > recency of use -> recent use
>
> Thanks.
>
> > +BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
> >
> > Later on, there is code:
> >
> > +   reuse ? HASH_REUSE : HASH_REMOVE,
> >
> > Can flag (such as HASH_REUSE) be passed to BufTableDelete() instead of
> bool ? That way, flag can be used directly in the above place.
>
> No.
> The BufTable* functions exist to abstract the buffer table from dynahash.
> Passing HASH_REUSE directly would break that abstraction.
>
> > +   long        nalloced;   /* number of entries initially allocated for
> >
> > nalloced isn't very long. I think it would be better to name the field
> > 'nallocated'.
>
> It is debatable.
> Why not num_allocated? allocated_count? number_of_allocations?
> Same points for nfree.
> `nalloced` is recognizable and unambiguous. And there are a lot
> of `*alloced` in the PostgreSQL source, so this one will not
> be unusual.
>
> I don't see the need to make it longer.
>
> But if someone supports your point, I will not mind changing
> the name.
>
> > +   sum += hashp->hctl->freeList[i].nalloced;
> > +   sum -= hashp->hctl->freeList[i].nfree;
> >
> > I think it would be better to calculate the difference between nalloced
> and nfree first, then add the result to sum (to avoid overflow).
>
> Doesn't really matter much, because the calculation must be valid
> even if all nfree==0.
>
> I'd rather debate the use of 'long' in dynahash at all: 'long' is
> 32bit on 64bit Windows. It is better to use 'Size' here.
>
> But 'nelements' was 'long', so I didn't change things. I think
> that is a subject for another patch.
>
> (On the other hand, a dynahash with 2**31 elements is at least
> 512GB RAM... we would hardly trigger the problem before the OOM
> killer comes. Does Windows have an OOM killer?)
>
> > Subject: [PATCH 3/3] reduce memory allocation for non-partitioned
> dynahash
> >
> > memory allocation -> memory allocations
>
> For each dynahash instance a single allocation was reduced.
> I think 'memory allocation' is correct.
>
> The plural would be
> reduce memory allocations for non-partitioned dynahashes
> i.e. both 'allocations' and 'dynahashes'.
> Am I wrong?
>

Hi,
bq. reduce memory allocation for non-partitioned dynahash

It seems the following is clearer:

reduce one memory allocation for every non-partitioned dynahash

Cheers


Re: On login trigger: take three

2022-03-13 Thread Daniel Gustafsson
> On 12 Mar 2022, at 03:46, Andres Freund  wrote:

>> +   
>> + The login event occurs when a user logs in to the
>> + system.
>> + Any bugs in a trigger procedure for this event may prevent successful
>> + login to the system. Such bugs may be fixed after first restarting the
>> + system in single-user mode (as event triggers are disabled in this 
>> mode).
>> + See the  reference page for details about
>> + using single-user mode.
>> +   
> 
> I'm strongly against adding any new dependencies on single user mode.
> 
> A saner approach might be a superuser-only GUC that can be set as part of the
> connection data (e.g. PGOPTIONS='-c ignore_login_event_trigger=true').

This, and similar approaches, has been discussed in this thread.  I'm not a fan
of holes punched with GUC's like this, but if you plan on removing single-user
mode (as I recall seeing in an initdb thread) altogether then that kills the
discussion. So.

Since we already recommend single-user mode to handle broken event triggers, we
should IMO do something to cover all of them rather than risk ending up with
one disabling GUC per each event type.  Right now we have this on the CREATE
EVENT TRIGGER reference page:

"Event triggers are disabled in single-user mode (see postgres).  If an
erroneous event trigger disables the database so much that you can't even
drop the trigger, restart in single-user mode and you'll be able to do
that."

Something like a '-c ignore_event_trigger=' GUC perhaps? 

--
Daniel Gustafsson   https://vmware.com/





Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Yura Sokolov
В Вс, 13/03/2022 в 07:05 -0700, Zhihong Yu пишет:
> 
> Hi,
> In the description:
> 
> There is no need to hold both lock simultaneously. 
> 
> both lock -> both locks

Thanks.

> +* We also reset the usage_count since any recency of use of the old
> 
> recency of use -> recent use

Thanks.

> +BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
> 
> Later on, there is code:
> 
> +   reuse ? HASH_REUSE : HASH_REMOVE,
> 
> Can flag (such as HASH_REUSE) be passed to BufTableDelete() instead of bool ? 
> That way, flag can be used directly in the above place.

No.
The BufTable* functions exist to abstract the buffer table from dynahash.
Passing HASH_REUSE directly would break that abstraction.

> +   long        nalloced;   /* number of entries initially allocated for
>
> nalloced isn't very long. I think it would be better to name the field
> 'nallocated'.

It is debatable.
Why not num_allocated? allocated_count? number_of_allocations?
Same points for nfree.
`nalloced` is recognizable and unambiguous. And there are a lot
of `*alloced` in the PostgreSQL source, so this one will not
be unusual.

I don't see the need to make it longer.

But if someone supports your point, I will not mind changing
the name.

> +   sum += hashp->hctl->freeList[i].nalloced;
> +   sum -= hashp->hctl->freeList[i].nfree;
> 
> I think it would be better to calculate the difference between nalloced and 
> nfree first, then add the result to sum (to avoid overflow).

Doesn't really matter much, because the calculation must be valid
even if all nfree==0.

I'd rather debate the use of 'long' in dynahash at all: 'long' is
32bit on 64bit Windows. It is better to use 'Size' here.

But 'nelements' was 'long', so I didn't change things. I think
that is a subject for another patch.

(On the other hand, a dynahash with 2**31 elements is at least
512GB RAM... we would hardly trigger the problem before the OOM
killer comes. Does Windows have an OOM killer?)

> Subject: [PATCH 3/3] reduce memory allocation for non-partitioned dynahash
> 
> memory allocation -> memory allocations

For each dynahash instance a single allocation was reduced.
I think 'memory allocation' is correct.

The plural would be
reduce memory allocations for non-partitioned dynahashes
i.e. both 'allocations' and 'dynahashes'.
Am I wrong?


--

regards
Yura Sokolov
From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/3] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring
the new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 198 ++--
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..f7dbfc90aaa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		}
 
 		/*
-		 * To change the association of a valid buffer, we'll need to have
-		 * exclusive lock on both the old and new mapping partitions.
+		 * To change the association of a valid buffer, we'll need to reset
+		 * tag first, so we need to have exclusive lock on the old mapping
+		 * partitions.
 		 */
 		if (oldFlags & BM_TAG_VALID)
 		{
@@ -1289,93 +1290,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode();
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * 

Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-13 Thread Nathan Bossart
On Sun, Mar 13, 2022 at 01:54:10PM +0530, Bharath Rupireddy wrote:
> Thanks for reviewing this. I agree with all of the above suggestions
> and incorporated them in the v2 patch.

Thanks for the new patch.

> Another thing I added in v2 is to not emit snapshot and mapping files
> stats in case of restartpoint as logical decoding isn't supported on
> standbys, so it doesn't make sense to emit the stats there and cause
> server log to grow unnecessarily. Having said that, I added a note
> there to change it whenever logical decoding on standbys is supported.

I think we actually do want to include this information for restartpoints
since some files might be left over from the snapshot that was used to
create the standby.  Also, presumably these functions could do some work
during recovery on a primary.

Another problem I see is that this patch depends on the return value of the
lstat() calls that we are trying to remove in 0001 from another thread [0].
I think the size of the removed/sync'd files is somewhat useful for
understanding disk space usage, but I suspect the time spent performing
these tasks is more closely related to the number of files.  Do you think
reporting the sizes is worth the extra system call?

[0] https://postgr.es/m/20220215231123.GA2584239%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: SQL/JSON: JSON_TABLE

2022-03-13 Thread Andrew Dunstan


On 2/9/22 08:22, Himanshu Upadhyaya wrote:
> On Wed, Feb 2, 2022 at 12:44 AM Andrew Dunstan  wrote:
>>
>> rebased with some review comments attended to.
> I am in process of reviewing these patches, initially, have started
> with 0002-JSON_TABLE-v55.patch.
> Tested many different scenarios with various JSON messages and these
> all are working as expected. Just one question on the below output.
>
> ‘postgres[1406146]=#’SELECT * FROM JSON_TABLE(jsonb '1', '$' COLUMNS
> (a int PATH '$.a' ERROR ON EMPTY)) jt;
>  a
> ---
>
> (1 row)
>
> ‘postgres[1406146]=#’SELECT * FROM JSON_TABLE(jsonb '1', '$' COLUMNS
> (a int PATH '$.a' ERROR ON ERROR)) jt;
>  a
> ---
>
> (1 row)
>
> is not "ERROR ON ERROR" is expected to give error?


I think I understand what's going on here. In the first example 'ERROR
ON EMPTY' causes an error condition, but as the default action for an
error condition is to return null that's what happens. To get an error
raised you would need to say 'ERROR ON EMPTY ERROR ON ERROR'. I don't
know if that's according to spec. It seems kinda screwy, arguably a POLA
violation, although that would hardly be a first for the SQL Standards
body.  But I'm speculating here, I'm not a standards lawyer.

In the second case it looks like there isn't really an error. There
would be if you used 'strict' in the path expression.


This whole area needs more documentation.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Issue with pg_stat_subscription_stats

2022-03-13 Thread Melanie Plageman
On Sat, Mar 12, 2022 at 3:15 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-03-12 08:28:35 +0530, Amit Kapila wrote:
> > On Sat, Mar 12, 2022 at 2:14 AM Melanie Plageman
> >  wrote:
> > >
> > > So, I noticed that pg_stat_reset_subscription_stats() wasn't working
> > > properly, and, upon further investigation, I'm not sure the view
> > > pg_stat_subscription_stats is being properly populated.
> > >
> >
> > I have tried the below scenario based on this:
> > Step:1 Create some data that generates conflicts and lead to apply
> > failures and then check in the view:
>
> I think the problem is present when there was *no* conflict
> previously. Because nothing populates the stats entry without an error, the
> reset doesn't have anything to set the stats_reset field in, which then means
> that the stats_reset field is NULL even though stats have been reset.

Yes, this is what I meant. stats_reset is not initialized, and without
any conflict having populated the stats, the field still does not get
set after resetting the stats. I think this is a bit unexpected.

psql (15devel)
Type "help" for help.

mplageman=# select * from pg_stat_subscription_stats ;
 subid | subname | apply_error_count | sync_error_count | stats_reset
---+-+---+--+-
 16398 | mysub   | 0 |0 |
(1 row)

mplageman=# select pg_stat_reset_subscription_stats(16398);
 pg_stat_reset_subscription_stats
--

(1 row)

mplageman=# select * from pg_stat_subscription_stats ;
 subid | subname | apply_error_count | sync_error_count | stats_reset
---+-+---+--+-
 16398 | mysub   | 0 |0 |
(1 row)

- Melanie




Tab completion not listing schema list for create/alter publication for all tables in schema

2022-03-13 Thread vignesh C
Hi,

I noticed that the following commands "CREATE PUBLICATION pub1 FOR ALL
TABLES IN SCHEMA" and "ALTER PUBLICATION pub1 ADD ALL TABLES IN
SCHEMA" do not complete with the schema list. I feel this is because
of the following code in tab-complete.c:
.
COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_schemas
" AND nspname NOT LIKE E'pg_%'",
"CURRENT_SCHEMA");
.
Here "pg_%" should be "pg_%%".
Attached a patch to handle this.
Thoughts?

Regards,
Vignesh
From 4321bafb2b7594f6c1af5d02f64e934fdba9c2ef Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Sun, 13 Mar 2022 22:09:56 +0530
Subject: [PATCH] Tab completion not listing schema list for create/alter
 publication.

Tab completion not listing schema list for create/alter publication.
---
 src/bin/psql/tab-complete.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6957567264..6d5c928c10 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1811,7 +1811,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("(", "ALL TABLES IN SCHEMA", "TABLE");
 	else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD|DROP|SET", "ALL", "TABLES", "IN", "SCHEMA"))
 		COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_schemas
- " AND nspname NOT LIKE E'pg_%'",
+ " AND nspname NOT LIKE E'pg_%%'",
  "CURRENT_SCHEMA");
 	/* ALTER PUBLICATION  SET ( */
 	else if (HeadMatches("ALTER", "PUBLICATION", MatchAny) && TailMatches("SET", "("))
@@ -2956,7 +2956,7 @@ psql_completion(const char *text, int start, int end)
 	 */
 	else if (Matches("CREATE", "PUBLICATION", MatchAny, "FOR", "ALL", "TABLES", "IN", "SCHEMA"))
 		COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_schemas
- " AND nspname NOT LIKE E'pg_%'",
+ " AND nspname NOT LIKE E'pg_%%'",
  "CURRENT_SCHEMA");
 	else if (Matches("CREATE", "PUBLICATION", MatchAny, "FOR", "ALL", "TABLES", "IN", "SCHEMA", MatchAny) && (!ends_with(prev_wd, ',')))
 		COMPLETE_WITH("WITH (");
-- 
2.32.0



Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Zhihong Yu
On Sun, Mar 13, 2022 at 3:25 AM Yura Sokolov 
wrote:

> В Пт, 11/03/2022 в 17:21 +0900, Kyotaro Horiguchi пишет:
> > At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi <
> horikyota@gmail.com> wrote in
> > > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi <
> horikyota@gmail.com> wrote in
> > > > Thanks!  I looked into dynahash part.
> > > >
> > > >  struct HASHHDR
> > > >  {
> > > > -   /*
> > > > -* The freelist can become a point of contention in
> high-concurrency hash
> > > >
> > > > Why did you move around the freeList?
>
> This way it is possible to allocate just the first partition, not all
> 32 partitions.
>
> >
> > Then I looked into bufmgr part.  It looks fine to me but I have some
> > comments on code comments.
> >
> > >* To change the association of a valid buffer, we'll
> need to have
> > >* exclusive lock on both the old and new mapping
> partitions.
> > >   if (oldFlags & BM_TAG_VALID)
> >
> > We don't take lock on the new mapping partition here.
>
> Thx, fixed.
>
> > +* Clear out the buffer's tag and flags.  We must do this to
> ensure that
> > +* linear scans of the buffer array don't think the buffer is
> valid. We
> > +* also reset the usage_count since any recency of use of the
> old content
> > +* is no longer relevant.
> > +*
> > +* We are single pinner, we hold buffer header lock and exclusive
> > +* partition lock (if tag is valid). Given these statements it
> is safe to
> > +* clear tag since no other process can inspect it to the moment.
> >
> > This comment is a merger of the comments from InvalidateBuffer and
> > BufferAlloc.  But I think what we need to explain here is why we
> > invalidate the buffer here despite the fact that we are going to reuse it
> > soon.  And I think we need to state that the old buffer is now safe to use
> > for the new tag here.  I'm not sure the statement is really correct
> > but clearing it out actually looks safer.
>
> I've tried to reformulate the comment block.
>
> >
> > > Now it is safe to use victim buffer for new tag.  Invalidate the
> > > buffer before releasing header lock to ensure that linear scans of
> > > the buffer array don't think the buffer is valid.  It is safe
> > > because it is guaranteed that we're the single pinner of the buffer.
> > > That pin also prevents the buffer from being stolen by others until
> > > we reuse it or return it to freelist.
> >
> > So I want to revise the following comment.
> >
> > -* Now it is safe to use victim buffer for new tag.
> > +* Now reuse victim buffer for new tag.
> > >* Make sure BM_PERMANENT is set for buffers that must be
> written at every
> > >* checkpoint.  Unlogged buffers only need to be written at
> shutdown
> > >* checkpoints, except for their "init" forks, which need to be
> treated
> > >* just like permanent relations.
> > >*
> > >* The usage_count starts out at 1 so that the buffer can
> survive one
> > >* clock-sweep pass.
> >
> > But if you think the current comment is fine, I don't insist on the
> > comment changes.
>
> Used suggestion.
>
> Fr, 11/03/22 Yura Sokolov wrote:
> > В Пт, 11/03/2022 в 15:49 +0900, Kyotaro Horiguchi пишет:
> > > BufTableDelete considers both reuse and !reuse cases but
> > > BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> > > odd. We should use HASH_ENTER here.  Thus I think it is more
> > > reasonable that HASH_ENTRY uses the stashed entry if exists and
> > > needed, or returns it to freelist if exists but not needed.
> > >
> > > What do you think about this?
> >
> > Well... I don't like it but I don't mind either.
> >
> > The code in the HASH_ENTER and HASH_ASSIGN cases differs a lot.
> > On the other hand, it is probably possible to merge them carefully.
> > I'll try.
>
> I've merged HASH_ASSIGN into HASH_ENTER.
>
> As in the previous letter, the three commits are concatenated into one file
> and can be applied with `git am`.
>
> ---
>
> regards
>
> Yura Sokolov
> Postgres Professional
> y.soko...@postgrespro.ru
> funny.fal...@gmail.com


Hi,
In the description:

There is no need to hold both lock simultaneously.

both lock -> both locks

+* We also reset the usage_count since any recency of use of the old

recency of use -> recent use

+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

Later on, there is code:

+   reuse ? HASH_REUSE : HASH_REMOVE,

Can a flag (such as HASH_REUSE) be passed to BufTableDelete() instead of a
bool? That way, the flag can be used directly in the above place.

+   long        nalloced;   /* number of entries initially allocated for

nalloced isn't very long. I think it would be better to name the
field 'nallocated'.

+   sum += hashp->hctl->freeList[i].nalloced;
+   sum -= hashp->hctl->freeList[i].nfree;

I think it would be better to calculate the difference between nalloced and
nfree first, then add the result to sum (to avoid overflow).

Re: Support logical replication of DDLs

2022-03-13 Thread Dilip Kumar
On Mon, Feb 21, 2022 at 9:43 PM Zheng Li  wrote:
>
> Hello,
>
> One of the most frequently requested improvements from our customers
> is to reduce downtime associated with software updates (both major and
> minor versions). To do this, we have reviewed potential contributions to
> improving logical replication.
>
> I’m working on a patch to support logical replication of data
> definition language statements (DDLs). This is a useful feature when a
> database in logical replication has lots of tables, functions and
> other objects that change over time, such as in online cross major
> version upgrade.

+1

> I put together a prototype that replicates DDLs using the generic
> messages for logical decoding. The idea is to log the candidate DDL
> string in ProcessUtilitySlow() using LogLogicalMessge() with a new
> flag in WAL record type xl_logical_message indicating it’s a DDL
> message. The xl_logical_message record is decoded and sent to the
> subscriber via pgoutput. The logical replication worker process is
> dispatched for this new DDL message type and executes the command
> accordingly.

If you don't mind, would you like to share the POC or the branch for this work?

> However, there are still many edge cases to sort out because not every
> DDL statement can/should be replicated. Some of these include:

> 3. CREATE TABLE AS and SELECT INTO, For example:
>
> CREATE TABLE foo AS
> SELECT field_1, field_2 FROM bar;
>
> There are a few issues that can occur here. For one, it’s possible
> that table bar doesn't exist on the subscriber. Even if “bar” does
> exist, it may not be fully up-to-date with the publisher, which would
> cause a data mismatch on “foo” between the publisher and subscriber.

In such cases, why don't we just log the table-creation WAL for the DDL
instead of the complete statement which creates the table and inserts
the tuples?  We are already WAL-logging individual inserts, so
once you make sure the table creation is replicated, I think the exact
data insertion on the subscriber side will be taken care of by the
insert WAL records, no?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Add id's to various elements in protocol.sgml

2022-03-13 Thread Brar Piening

On 09.03.2022 at 20:43, Brar Piening wrote:

Attached is a pretty huge patch that adds ids to all sections and all
the varlistentries where the containing variablelist already had at
least one id (plus a few additional ones that I stumbled upon and
deemed useful). It also adds html links next to the respective heading
in the html documentation and emits a build message and a comment when
a section or a relevant (see before) varlistentry doesn't have an id.


I have uploaded a doc build with the patch applied to
https://pgdocs.piening.info/ to make it easier for you all to review the
results and see what is there and what isn't and how it feels UI-wise.

You may want to look at https://pgdocs.piening.info/app-psql.html where
the patch adds ids and links to all varlistentries but doesn't do so for
the headings (because they are refsect1 headings not sect1 headings).

https://pgdocs.piening.info/protocol-flow.html is pretty much the
opposite. The patch adds ids and links to the headings (they are sect2
headings) but doesn't add them to the varlistentries (yet - because I
mostly stuck to the algorithm suggested at
https://www.postgresql.org/message-id/621FAF40.5070507%40anastigmatix.net
to contain the workload).







Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Yura Sokolov
В Пт, 11/03/2022 в 17:21 +0900, Kyotaro Horiguchi пишет:
> At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > Thanks!  I looked into dynahash part.
> > > 
> > >  struct HASHHDR
> > >  {
> > > -   /*
> > > -* The freelist can become a point of contention in 
> > > high-concurrency hash
> > > 
> > > Why did you move around the freeList?

This way it is possible to allocate just the first partition, not all 32 partitions.

> 
> Then I looked into bufmgr part.  It looks fine to me but I have some
> comments on code comments.
> 
> >* To change the association of a valid buffer, we'll need to 
> > have
> >* exclusive lock on both the old and new mapping partitions.
> >   if (oldFlags & BM_TAG_VALID)
> 
> We don't take lock on the new mapping partition here.

Thx, fixed.

> +* Clear out the buffer's tag and flags.  We must do this to ensure 
> that
> +* linear scans of the buffer array don't think the buffer is valid. 
> We
> +* also reset the usage_count since any recency of use of the old 
> content
> +* is no longer relevant.
> +*
> +* We are single pinner, we hold buffer header lock and exclusive
> +* partition lock (if tag is valid). Given these statements it is 
> safe to
> +* clear tag since no other process can inspect it to the moment.
> 
> This comment is a merger of the comments from InvalidateBuffer and
> BufferAlloc.  But I think what we need to explain here is why we
> invalidate the buffer here despite the fact that we are going to reuse it
> soon.  And I think we need to state that the old buffer is now safe to use
> for the new tag here.  I'm not sure the statement is really correct
> but clearing it out actually looks safer.

I've tried to reformulate the comment block.

> 
> > Now it is safe to use victim buffer for new tag.  Invalidate the
> > buffer before releasing header lock to ensure that linear scans of
> > the buffer array don't think the buffer is valid.  It is safe
> > because it is guaranteed that we're the single pinner of the buffer.
> > That pin also prevents the buffer from being stolen by others until
> > we reuse it or return it to freelist.
> 
> So I want to revise the following comment.
> 
> -* Now it is safe to use victim buffer for new tag.
> +* Now reuse victim buffer for new tag.
> >* Make sure BM_PERMANENT is set for buffers that must be written at 
> > every
> >* checkpoint.  Unlogged buffers only need to be written at shutdown
> >* checkpoints, except for their "init" forks, which need to be 
> > treated
> >* just like permanent relations.
> >*
> >* The usage_count starts out at 1 so that the buffer can survive one
> >* clock-sweep pass.
> 
> But if you think the current comment is fine, I don't insist on the
> comment changes.

Used suggestion.

Fr, 11/03/22 Yura Sokolov wrote:
> В Пт, 11/03/2022 в 15:49 +0900, Kyotaro Horiguchi пишет:
> > BufTableDelete considers both reuse and !reuse cases but
> > BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> > odd. We should use HASH_ENTER here.  Thus I think it is more
> > reasonable that HASH_ENTRY uses the stashed entry if exists and
> > needed, or returns it to freelist if exists but not needed.
> > 
> > What do you think about this?
> 
> Well... I don't like it but I don't mind either.
> 
> The code in the HASH_ENTER and HASH_ASSIGN cases differs a lot.
> On the other hand, it is probably possible to merge them carefully.
> I'll try.

I've merged HASH_ASSIGN into HASH_ENTER.

As in the previous letter, the three commits are concatenated into one file
and can be applied with `git am`.

---

regards

Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From fbec0dd7d9f11aeaeb8f141ad3dedab7178aeb2e Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/3] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring
the new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 198 ++--
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..63824b15686 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		}
 
 		/*
-		 * To change the association of a valid 

Re: Defer selection of asynchronous subplans until the executor initialization stage

2022-03-13 Thread Etsuro Fujita
Hi Alexander,

On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov
 wrote:
> Etsuro Fujita писал 2021-08-30 12:52:
> > To allow async execution in a bit more cases, I modified the patch a
> > bit further: a ProjectionPath put directly above an async-capable
> > ForeignPath would also be considered async-capable as ForeignScan can
> > project and no separate Result is needed in that case, so I modified
> > mark_async_capable_plan() as such, and added test cases to the
> > postgres_fdw regression test.  Attached is an updated version of the
> > patch.

> The patch looks good to me and seems to work as expected.

Thanks for reviewing!  I’m planning to commit the patch.

Sorry for the long delay.

Best regards,
Etsuro Fujita




Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit

2022-03-13 Thread Etsuro Fujita
On Sat, Mar 12, 2022 at 10:02 AM David Zhang  wrote:
> Applied patches v6-0002 and v6-0003 to master branch, and the `make check` 
> test is ok.
>
> Here is my test result in 10 times average on 3 virtual machines:
> before the patches:
>
> abort.1 = 2.5473 ms
>
> abort.2 = 4.1572 ms
>
> after the patches with OPTIONS (ADD parallel_abort 'true'):
>
> abort.1 = 1.7136 ms
>
> abort.2 = 2.5833 ms
>
> Overall, it shows about a 32-37% improvement in my testing environment.

I think that is a great improvement.  Thanks for testing!

Best regards,
Etsuro Fujita




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-03-13 Thread Bharath Rupireddy
On Sat, Mar 12, 2022 at 1:35 AM Nathan Bossart  wrote:
>
> +CheckpointStats.repl_map_cutoff_lsn = cutoff;
>
> Could we set repl_map_cutoff_lsn closer to where it is calculated?  Right
> now, it's set at the bottom of the function just before the directory is
> freed.  Is there a strong reason to do it there?
>
> +"logical rewrite mapping file(s) removed/synced=" 
> UINT64_FORMAT ", size=%zu bytes, time=%ld.%03d s, cutoff LSN=%X/%X",
>
> Can we split out the "removed" versus "synced" files?  Otherwise, I think
> this is basically just telling us how many files are in the mappings
> directory.
>
> +CheckpointStats.repl_snap_cutoff_lsn = cutoff;
>
> I have the same note for this one as I have for repl_map_cutoff_lsn.

Thanks for reviewing this. I agree with all of the above suggestions
and incorporated them in the v2 patch.

Another change in v2: the snapshot and mapping file stats are not emitted
during a restartpoint, since logical decoding isn't supported on standbys;
emitting them there would only grow the server log unnecessarily. Having
said that, I added a note to change this whenever logical decoding on
standbys is supported.

Please review v2.

Regards,
Bharath Rupireddy.


v2-0001-add-checkpoint-stats-of-snapshot-and-mapping-file.patch
Description: Binary data