Re: BufferAlloc: don't take two simultaneous locks
At Mon, 14 Mar 2022 09:39:48 +0900 (JST), Kyotaro Horiguchi wrote in
> I'll examine the possibility to resolve this...

The existence of nfree and nalloced confused me, and I found the reason. In the case where a partition collects many REUSE-ASSIGN-REMOVEd elements from other partitions, nfree gets larger than nalloced. This is the strange point about the two counters.

nalloced is only ever referred to as sum(nalloced[]). So we don't need nalloced on a per-partition basis, and the formula for the number of used elements would be:

    sum(nalloced[i] - nfree[i]) = nalloced_total - sum(nfree[i])

We rarely create fresh elements in shared hashes, so I don't think there would be additional contention on the counter even if it were a global atomic.

So, the remaining issue is the possible imbalance among partitions. On second thought, with the current way, if there's a bad deviation in partition usage, a heavily hit partition eventually collects elements via get_hash_entry(). With the patch's way, a similar thing happens via the REUSE-ASSIGN-REMOVE sequence. Buffers once used for something won't be freed until buffer invalidation, but bulk buffer invalidation won't distribute the freed buffers unevenly among partitions. So I conclude for now that this is a non-issue.

So my opinion on the counters is: I'd like to ask you to remove nalloced from the partitions and add a global atomic for the same use. No need to do anything about the possible deviation issue.

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
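The counter scheme being proposed can be sketched in a few lines of standalone C. This is a toy model, not dynahash code: the names (`alloc_entry`, `release_entry`, `used_entries`) are hypothetical, and plain C11 atomics stand in for PostgreSQL's pg_atomic API. The point it illustrates is that fresh allocation touches one global counter while free-list traffic stays per-partition, so an element released into a different partition never makes the accounting go negative.

```c
#include <assert.h>
#include <stdatomic.h>

#define NUM_PARTITIONS 4

/* One global allocation counter instead of per-partition nalloced[] */
static atomic_long nalloced;            /* entries ever created, all partitions */
static long nfree[NUM_PARTITIONS];      /* free-list length per partition */

static void alloc_entry(int part)
{
    if (nfree[part] > 0)
        nfree[part]--;                  /* reuse from this partition's free list */
    else
        atomic_fetch_add(&nalloced, 1); /* fresh element: rare in shared hashes */
}

static void release_entry(int part)
{
    /* An element may be released to a partition other than its origin,
     * so nfree[part] can legitimately exceed what "part" allocated. */
    nfree[part]++;
}

/* used elements = nalloced - sum(nfree[]) */
static long used_entries(void)
{
    long used = atomic_load(&nalloced);

    for (int i = 0; i < NUM_PARTITIONS; i++)
        used -= nfree[i];
    return used;
}
```

Note that `used_entries` stays correct even after an element migrates between partitions, which is exactly the case where per-partition nalloced/nfree pairs became confusing.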
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Mon, Mar 14, 2022 at 10:45 AM Michael Paquier wrote:
> On Sun, Mar 13, 2022 at 02:58:58PM -0700, Nathan Bossart wrote:
> > On Sun, Mar 13, 2022 at 01:54:10PM +0530, Bharath Rupireddy wrote:
> >> Another thing I added in v2 is to not emit snapshot and mapping files
> >> stats in case of restartpoint as logical decoding isn't supported on
> >> standbys, so it doesn't make sense to emit the stats there and cause
> >> server log to grow unnecessarily. Having said that, I added a note
> >> there to change it whenever logical decoding on standbys is supported.
> >
> > I think we actually do want to include this information for restartpoints
> > since some files might be left over from the snapshot that was used to
> > create the standby. Also, presumably these functions could do some work
> > during recovery on a primary.
>
> Yes, I would agree that consistency makes sense here, and it is not
> complicated to add the code to support this code path anyway. There
> is a risk that folks working on logical decoding on standbys overlook
> this code path, instead.

I agree that it is better to be consistent and emit the message even in the case of a restartpoint.

> > Another problem I see is that this patch depends on the return value of the
> > lstat() calls that we are trying to remove in 0001 from another thread [0].
> > I think the size of the removed/sync'd files is somewhat useful for
> > understanding disk space usage, but I suspect the time spent performing
> > these tasks is more closely related to the number of files. Do you think
> > reporting the sizes is worth the extra system call?
>
> We are not talking about files that are large either, are we?
>
> Another thing I am a bit annoyed with in this patch is the fact that
> the size of the ereport() call is doubled. The LOG currently
> generated is already bloated, and this does not arrange things.

Yes, this is a concern.
Also, when there are no logical replication slots on a server, or the server has removed or cleaned up all the snapshot/mapping files, why would anyone want these messages with all 0s in the server log?

Here's what I'm thinking: leave the existing "checkpoint/restartpoint complete" messages intact, and add the following in LogCheckpointEnd:

if (CheckpointStats.repl_map_files_rmvd_cnt ||
    CheckpointStats.repl_map_files_syncd_cnt ||
    CheckpointStats.repl_snap_files_rmvd_cnt)
{
    ereport(LOG,
            (errmsg("snapbuild snapshot file(s) removed=" UINT64_FORMAT ", size=%zu bytes, "
                    "time=%ld.%03d s, cutoff LSN=%X/%X; "
                    "logical rewrite mapping file(s) removed=" UINT64_FORMAT ", size=%zu bytes, "
                    "synced=" UINT64_FORMAT ", size=%zu bytes, time=%ld.%03d s, cutoff LSN=%X/%X",
                    CheckpointStats.repl_snap_files_rmvd_cnt,
                    CheckpointStats.repl_snap_files_rmvd_sz,
                    repl_snap_msecs / 1000, (int) (repl_snap_msecs % 1000),
                    LSN_FORMAT_ARGS(CheckpointStats.repl_snap_cutoff_lsn),
                    CheckpointStats.repl_map_files_rmvd_cnt,
                    CheckpointStats.repl_map_files_rmvd_sz,
                    CheckpointStats.repl_map_files_syncd_cnt,
                    CheckpointStats.repl_map_files_syncd_sz,
                    repl_map_msecs / 1000, (int) (repl_map_msecs % 1000),
                    LSN_FORMAT_ARGS(CheckpointStats.repl_map_cutoff_lsn))));
}

Thoughts?

Regards, Bharath Rupireddy.
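The `%ld.%03d s` idiom in that message is how the checkpoint log renders a millisecond duration as seconds with three decimals. A standalone sketch of just that split (the helper name `format_msecs` is invented for illustration; the real code inlines the division into the errmsg arguments):

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Render a millisecond count as "S.mmm s", the same integer split the
 * checkpoint logging code uses: whole seconds via division, the
 * remainder zero-padded to three digits via %03d. */
static void format_msecs(char *buf, size_t len, long msecs)
{
    snprintf(buf, len, "%ld.%03d s", msecs / 1000, (int) (msecs % 1000));
}
```

The `%03d` padding matters: without it, 7 ms would print as "0.7 s" instead of "0.007 s".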
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Sun, Mar 13, 2022 at 02:58:58PM -0700, Nathan Bossart wrote:
> On Sun, Mar 13, 2022 at 01:54:10PM +0530, Bharath Rupireddy wrote:
>> Another thing I added in v2 is to not emit snapshot and mapping files
>> stats in case of restartpoint as logical decoding isn't supported on
>> standbys, so it doesn't make sense to emit the stats there and cause
>> server log to grow unnecessarily. Having said that, I added a note
>> there to change it whenever logical decoding on standbys is supported.
>
> I think we actually do want to include this information for restartpoints
> since some files might be left over from the snapshot that was used to
> create the standby. Also, presumably these functions could do some work
> during recovery on a primary.

Yes, I would agree that consistency makes sense here, and it is not complicated to add the code to support this code path anyway. There is a risk that folks working on logical decoding on standbys overlook this code path, instead.

> Another problem I see is that this patch depends on the return value of the
> lstat() calls that we are trying to remove in 0001 from another thread [0].
> I think the size of the removed/sync'd files is somewhat useful for
> understanding disk space usage, but I suspect the time spent performing
> these tasks is more closely related to the number of files. Do you think
> reporting the sizes is worth the extra system call?

We are not talking about files that are large either, are we?

Another thing I am a bit annoyed with in this patch is the fact that the size of the ereport() call is doubled. The LOG currently generated is already bloated, and this does not arrange things.

-- Michael

signature.asc Description: PGP signature
Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)
On Wed, Mar 09, 2022 at 10:50:45AM -0600, Justin Pryzby wrote:
> I also changed pg_ls_dir_recurse() to handle concurrent removal of a dir, which
> I noticed caused an infrequent failure on CI. However I'm not including that
> here, since the 2nd half of the patch set isn't ready due to lstat() on
> windows.

lstat() has been the subject of many issues over the years with our internal emulation and issues related to its concurrency, but we use it in various areas of the in-core code, so that does not sound like an issue to me. It depends on what you want to do with it in genfile.c and which data you'd expect, in addition to the detection of junction points for WIN32, I guess. v34 has no references to pg_ls_dir_recurse(), but that's a WITH RECURSIVE, so we would not really need it, would we?

@@ -27618,7 +27618,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 Returns a record containing the file's size, last access time stamp,
 last modification time stamp, last file status change time stamp (Unix
 platforms only), file creation time stamp (Windows only), and a flag
-indicating if it is a directory.
+indicating if it is a directory (or a symbolic link to a directory).
 This function is restricted to superusers by default, but other users

This is from 0001, and this addition in the documentation is not completely right. As pg_stat_file() uses stat() to get back the information of a file/directory, we'd just follow the link if specifying one in the input argument. We could say instead, if we were to improve the docs, that "If filename is a link, this function returns information about the file or directory the link refers to." I would put that as a different paragraph.

+select * from pg_ls_archive_statusdir() limit 0;
+ name | size | modification
+------+------+--------------
+(0 rows)

FWIW, this one is fine because ValidateXLOGDirectoryStructure() makes sure archive_status exists before any connection is attempted to the cluster.
> +select * from pg_ls_logdir() limit 0;

This test on pg_ls_logdir() would fail if running installcheck on a cluster that has logging_collector disabled. So this cannot be included.

+select * from pg_ls_logicalmapdir() limit 0;
+select * from pg_ls_logicalsnapdir() limit 0;
+select * from pg_ls_replslotdir('') limit 0;
+select * from pg_ls_tmpdir() limit 0;
+select * from pg_ls_waldir() limit 0;
+select * from pg_stat_file('.') limit 0;

The rest of the patch set should be stable AFAIK; there are various steps when running a checkpoint that make sure that any of these exist, without caring about the value of wal_level.

+
+For each file in the specified directory, list the file and its
+metadata.
+Restricted to superusers by default, but other users can be granted
+EXECUTE to run the function.
+

What is "metadata" in this case? (I have read the code and know what you mean, but folks only looking at the documentation may be puzzled by that.) It could be cleaner to use the same tupledesc for any callers of this function, and return NULL in cases where these are not adapted.

+ /* check the optional arguments */
+ if (PG_NARGS() == 3)
+ {
+     if (!PG_ARGISNULL(1))
+     {
+         if (PG_GETARG_BOOL(1))
+             flags |= LS_DIR_MISSING_OK;
+         else
+             flags &= ~LS_DIR_MISSING_OK;
+     }
+
+     if (!PG_ARGISNULL(2))
+     {
+         if (PG_GETARG_BOOL(2))
+             flags &= ~LS_DIR_SKIP_DOT_DIRS;
+         else
+             flags |= LS_DIR_SKIP_DOT_DIRS;
+     }
+ }

The subtle difference between the false and true code paths of arguments 1 and 2 had better be explained. The bit-wise operations are slightly different in both cases, so it is not clear which part does what, and what's the purpose of this logic.

- SetSingleFuncCall(fcinfo, 0);
+ /* isdir depends on metadata */
+ Assert(!(flags&LS_DIR_ISDIR) || (flags&LS_DIR_METADATA));
+ /* Unreasonable to show isdir and skip dirs */
+ Assert(!(flags&LS_DIR_ISDIR) || !(flags&LS_DIR_SKIP_DIRS));

Incorrect code format. Spaces required.
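The asymmetry being questioned comes from the second argument having the opposite sense of the flag it controls: `missing_ok = true` sets its flag, while `include_dot_dirs = true` clears `LS_DIR_SKIP_DOT_DIRS`. A standalone sketch of just that mapping (the flag values and the helper `apply_optional_args` are invented for illustration; they mirror, not reproduce, the patch's code):

```c
#include <assert.h>

/* Hypothetical flag bits mirroring the patch's LS_DIR_* constants */
#define LS_DIR_MISSING_OK    (1 << 0)
#define LS_DIR_SKIP_DOT_DIRS (1 << 1)

static int apply_optional_args(int flags,
                               int missing_ok,       /* argument 1 */
                               int include_dot_dirs) /* argument 2 */
{
    /* arg 1 maps directly: true sets MISSING_OK, false clears it */
    if (missing_ok)
        flags |= LS_DIR_MISSING_OK;
    else
        flags &= ~LS_DIR_MISSING_OK;

    /* arg 2 is inverted relative to its flag: "include dot dirs"
     * means clearing the SKIP_DOT_DIRS bit, and vice versa */
    if (include_dot_dirs)
        flags &= ~LS_DIR_SKIP_DOT_DIRS;
    else
        flags |= LS_DIR_SKIP_DOT_DIRS;

    return flags;
}
```

Spelling it out this way makes the mirrored `|=`/`&= ~` pairs obvious, which is the comment the reviewer is asking the patch to add.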
+-- This tests the missing_ok parameter, which causes pg_ls_tmpdir to succeed even if the tmpdir doesn't exist yet
+-- The name='' condition is never true, so the function runs to completion but returns zero rows.
+-- The query is written to ERROR if the tablespace doesn't exist, rather than silently failing to call pg_ls_tmpdir()
+SELECT c.* FROM (SELECT oid FROM pg_tablespace b WHERE b.spcname='regress_tblspace' UNION SELECT 0 ORDER BY 1 DESC LIMIT 1) AS b, pg_ls_tmpdir(oid) AS c WHERE c.name='Does not exist';

So, here, we have a test that may not actually test what we want to test. Hmm. I am not convinced that we need a new set of SQL functions as presented in 0003 (addition of meta-data in pg_ls_dir()), and extensions of 0004 (do the same but for pg_ls_tmpdir) and
Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
On Fri, Feb 25, 2022 at 5:52 PM Peter Geoghegan wrote:
> There is an important practical way in which it makes sense to treat
> 0001 as separate to 0002. It is true that 0001 is independently quite
> useful. In practical terms, I'd be quite happy to just get 0001 into
> Postgres 15, without 0002. I think that that's what you meant here, in
> concrete terms, and we can agree on that now.

Attached is v10. While this does still include the freezing patch, it's not in scope for Postgres 15. As I've said, I still think that it makes sense to maintain the patch series with the freezing stuff, since it's structurally related. So, to be clear, the first two patches from the patch series are in scope for Postgres 15. But not the third.

Highlights:

* Changes to terminology and commit messages along the lines suggested by Andres.

* Bug fixes to heap_tuple_needs_freeze()'s MultiXact handling. My testing strategy here still needs work.

* Expanded refactoring in the v10-0002 patch.

The v10-0002 patch (which appeared for the first time in v9) was originally all about fixing a case where non-aggressive VACUUMs were at a gratuitous disadvantage (relative to aggressive VACUUMs) around advancing relfrozenxid -- very much like the lazy_scan_noprune work from commit 44fa8488. And that is still its main purpose. But the refactoring now seems related to Andres' idea of making non-aggressive VACUUMs decide to scan a few extra all-visible pages in order to be able to advance relfrozenxid.

The code that sets up skipping with the visibility map is made a lot clearer by v10-0002. That patch moves a significant amount of code from lazy_scan_heap() into a new helper routine (so it continues the trend started by the Postgres 14 work that added lazy_scan_prune()).
Now skipping a range of visibility map pages is fundamentally based on setting the range up front, and then using the same saved details about the range thereafter -- there are no more ad-hoc VM_ALL_VISIBLE()/VM_ALL_FROZEN() calls for pages from a range that we already decided to skip (so no calls to those routines from lazy_scan_heap(), at least not until after we finish processing in lazy_scan_prune()).

This is more or less what we were doing all along for one special case: aggressive VACUUMs. We had to make sure to either increment frozenskipped_pages or increment scanned_pages for every page from rel_pages -- this issue is described by lazy_scan_heap() comments on HEAD that begin with "Tricky, tricky." (these date back to the freeze map work from 2016). Anyway, there is no reason not to go further with that: we should make whole ranges the basic unit that we deal with when skipping.

It's a lot simpler to think in terms of entire ranges (not individual pages) that are determined to be all-visible or all-frozen up front, without needing to recheck anything (regardless of whether it's an aggressive VACUUM). We don't need to track frozenskipped_pages this way. And it's much more obvious that it's safe for more complicated cases, in particular for aggressive VACUUMs.

This kind of approach seems necessary to make non-aggressive VACUUMs do a little more work opportunistically, when they realize that they can advance relfrozenxid relatively easily that way (which I believe Andres favors as part of overhauling freezing). That becomes a lot more natural when you have a clear and unambiguous separation between deciding what range of blocks to skip, and then actually skipping. I can imagine the new helper function added by v10-0002 (which I've called lazy_scan_skip_range()) eventually being taught to do these kinds of tricks. In general I think that all of the details of what to skip need to be decided up front.
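The "decide the whole range up front" idea can be sketched in miniature. This is a deliberately tiny, standalone model (the struct and function names are invented, and a plain int array stands in for the visibility map): the helper computes the endpoints of a contiguous skippable range once, and the caller then treats [start, end) as one unit without re-testing individual pages.

```c
#include <assert.h>

typedef struct SkipRange
{
    long start;     /* first skippable block */
    long end;       /* first block past the range */
} SkipRange;

/* Decide, up front, the contiguous all-visible range beginning at "from".
 * The caller skips [start, end) as one unit and never rechecks pages
 * inside it; an empty range (start == end) means "scan this page". */
static SkipRange
next_skip_range(const int *all_visible, long rel_pages, long from)
{
    SkipRange r;

    r.start = from;
    r.end = from;
    while (r.end < rel_pages && all_visible[r.end])
        r.end++;
    return r;
}
```

Any future cost/benefit intelligence (e.g. scanning a few extra all-visible pages to enable relfrozenxid advancement) would live inside the helper, leaving the scan loop unchanged.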
The loop in lazy_scan_heap() should execute skipping based on the instructions it receives from the new helper function, in the simplest way possible. The helper function can become more intelligent about the costs and benefits of skipping in the future, without that impacting lazy_scan_heap().

-- Peter Geoghegan

From 43ab00609392ed7ad31be491834bdac348e13653 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v10 3/3] Make page-level characteristics drive freezing.

Teach VACUUM to freeze all of the tuples on a page whenever it notices that it would otherwise mark the page all-visible, without also marking it all-frozen. VACUUM typically won't freeze _any_ tuples on the page unless _all_ tuples (that remain after pruning) are all-visible. This makes the overhead of vacuuming much more predictable over time. We avoid the need for large balloon payments during aggressive VACUUMs (typically anti-wraparound autovacuums). Freezing is proactive, so we're much less likely to get into "freezing debt".

The new approach to freezing also enables relfrozenxid advancement in non-aggressive VACUUMs, which might be enough to avoid aggressive VACUUMs altogether (with many individual tables/workloads). While the non-aggressive
Re: Allow async standbys wait for sync replication
At Mon, 14 Mar 2022 11:30:02 +0900 (JST), Kyotaro Horiguchi wrote in
> At Sat, 12 Mar 2022 14:33:32 -0800, Nathan Bossart wrote in
> > On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote:
> > > To me it's architecturally the completely wrong direction. We should move in
> > > the *other* direction, i.e. allow WAL to be sent to standbys before the
> > > primary has finished flushing it locally. Which requires similar
> > > infrastructure to what we're discussing here.
> >
> > I think this is a good point. After all, WALRead() has the following
> > comment:
> >
> > * XXX probably this should be improved to suck data directly from the
> > * WAL buffers when possible.
> >
> > Once you have all the infrastructure for that, holding back WAL replay on
> > async standbys based on synchronous replication might be relatively easy.

Just to be clear, and a bit off-topic: I think the optimization itself is quite promising and one I want to have.

> That is, (as I understand it) async standbys would be required to allow
> overwriting existing unreplayed records after reconnection. But,
> putting aside how to remember that LSN, if that happens at a segment
> boundary, the async replica may run into a situation similar to the
> missing-contrecord case. But a standby cannot insert any record of its
> own to get out of that situation.

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
Re: Allow async standbys wait for sync replication
At Sat, 12 Mar 2022 14:33:32 -0800, Nathan Bossart wrote in
> On Tue, Mar 08, 2022 at 06:01:23PM -0800, Andres Freund wrote:
> > To me it's architecturally the completely wrong direction. We should move in
> > the *other* direction, i.e. allow WAL to be sent to standbys before the
> > primary has finished flushing it locally. Which requires similar
> > infrastructure to what we're discussing here.
>
> I think this is a good point. After all, WALRead() has the following
> comment:
>
> * XXX probably this should be improved to suck data directly from the
> * WAL buffers when possible.
>
> Once you have all the infrastructure for that, holding back WAL replay on
> async standbys based on synchronous replication might be relatively easy.

That is, (as I understand it) async standbys would be required to allow overwriting existing unreplayed records after reconnection. But, putting aside how to remember that LSN, if that happens at a segment boundary, the async replica may run into a situation similar to the missing-contrecord case. But a standby cannot insert any record of its own to get out of that situation.

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
Re: On login trigger: take three
On 2022-03-13 20:35:44 -0400, Tom Lane wrote:
> Andres Freund writes:
> > I was thinking that the way to use it would be to specify it as a client
> > option. Like PGOPTIONS='-c ignore_event_trigger=login' psql.
>
> Ugh ... that would allow people (at least superusers) to bypass
> the login trigger at will, which seems to me to break a lot of
> the use-cases for the feature. I supposed we'd want this to be a
> PGC_POSTMASTER setting for security reasons.

Shrug. This doesn't seem to add actual security to me.
Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
At Fri, 11 Mar 2022 15:39:13 -0500, Robert Haas wrote in
> On Thu, Mar 10, 2022 at 9:38 PM Kyotaro Horiguchi wrote:
> > It seems to me too rigorous that pg_get_wal_records_info/stats()
> > reject future LSNs as end-LSN, and I think a WARNING or INFO and stopping at
> > the real end-of-WAL is kinder to users. I think the same about the
> > restriction that start and end LSN are required to be different.
>
> In his review just yesterday, Jeff suggested this: "Don't issue
> WARNINGs or other messages for ordinary situations, like when
> pg_get_wal_records_info() hits the end of WAL." I think he's entirely
> right, and I don't think any patch that does otherwise should get

It depends on what we think is "ordinary" here. If we accept that the specified LSN range may not be fully filled, the case above is ordinary and there is no need for any WARNING or INFO. I'm fine with that definition here.

> committed. It is worth remembering that the results of queries are
> often examined by something other than a human being sitting at a psql
> terminal. Any tool that uses this is going to want to understand what
> happened from the result set, not by parsing strings that may show up
> inside warning messages.

Right. I don't think it is right that a WARNING should be required to evaluate the result. And a WARNING like 'reached end-of-wal before end LSN' is of the kind that is not required to evaluate the result, since each output row contains at least its start LSN.

> I think that the right answer here is to have a function that returns
> one row per record parsed, and each row should also include the start
> and end LSN of the record. If for some reason the WAL records return
> start after the specified start LSN (e.g. because we skip over a page
> header) or end before the specified end LSN (e.g. because we reach
> end-of-WAL) the user can figure it out from looking at the LSNs in the
> output rows and comparing them to the LSNs provided as input.

I agree with you here.

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add index scan progress to pg_stat_progress_vacuum
On Sat, Mar 12, 2022 at 4:00 PM Imseih (AWS), Sami wrote:
> > nitpick: Can we remove the extra spaces in the parentheses?
>
> fixed
>
> > What does it mean if there isn't an entry in the map? Is this actually
> > expected, or should we ERROR instead?
>
> I cleaned up the code here and added comments.
>
> > I think the number of entries should be shared between
> > VacuumWorkerProgressShmemInit() and VacuumWorkerProgressShmemSize().
> > Otherwise, we might update one and not the other.
>
> Fixed
>
> > I think we should elaborate a bit more in this comment. It's difficult to
> > follow what this is doing without referencing the comment above
> > set_vacuum_worker_progress().
>
> More comments added
>
> I also simplified the 0002 patch as well.

I'm still unsure whether the current design of the 0001 patch is better than the other approaches we've discussed. Even users who don't use parallel vacuum are forced to allocate shared memory for index vacuum progress, with GetMaxBackends() entries, from the beginning. Also, it's likely that we will extend the progress tracking feature to other parallel operations in the future, but I think the current design is not extensible. If we want to do that, we will end up creating similar things for each of them, or re-creating the index vacuum progress tracking feature while creating a common infrastructure. It might not be a problem as of now, but I'm concerned that introducing a feature that is not extensible and forces users to allocate additional shmem might become a blocker in the future. Looking at the precedent, when we introduced the progress tracking feature, we implemented it in an extensible way. On the other hand, others in this thread seem to agree with this approach, so I'd like to leave it to the committers.
Anyway, here are some comments on the v5-0001 patch:

+/* in commands/vacuumparallel.c */
+extern void VacuumWorkerProgressShmemInit(void);
+extern Size VacuumWorkerProgressShmemSize(void);
+extern void vacuum_worker_end(int leader_pid);
+extern void vacuum_worker_update(int leader_pid);
+extern void vacuum_worker_end_callback(int code, Datum arg);
+extern void set_vaccum_worker_progress(Datum *values);

These functions' bodies are not in vacuumparallel.c. As the comment says, I think these functions should be implemented in vacuumparallel.c.

---
+/*
+ * set_vaccum_worker_progress --- updates the number of indexes that have been
+ * vacuumed or cleaned up in a parallel vacuum.
+ */
+void
+set_vaccum_worker_progress(Datum *values)

s/vaccum/vacuum/

---
+void
+set_vaccum_worker_progress(Datum *values)
+{
+    VacProgressEntry *entry;
+    int leader_pid = values[0];

I think we should use DatumGetInt32().

---
+entry = (VacProgressEntry *) hash_search(VacuumWorkerProgressHash, &leader_pid, HASH_ENTER_NULL, &found);
+
+if (!entry)
+    elog(ERROR, "cannot allocate shared memory for vacuum worker progress");

Since we raise an error in case of out of memory, I think we can use HASH_ENTER instead of HASH_ENTER_NULL. Or do we want to emit a detailed error message here?

---
+ VacuumWorkerProgressHash = NULL;

This line is not necessary.

Regards,

-- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: BufferAlloc: don't take two simultaneous locks
At Fri, 11 Mar 2022 12:34:32 +0300, Yura Sokolov wrote in
> В Пт, 11/03/2022 в 15:49 +0900, Kyotaro Horiguchi пишет:
> > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi
> > > BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
> >
> > BufTableDelete considers both the reuse and !reuse cases, but
> > BufTableInsert doesn't, and always does HASH_ASSIGN. That looks
> > odd. We should use HASH_ENTER here. Thus I think it is more
> > reasonable that HASH_ENTER uses the stashed entry if it exists and
> > is needed, or returns it to the freelist if it exists but is not needed.
> >
> > What do you think about this?
>
> Well... I don't like it, but I don't mind either.
>
> Code in the HASH_ENTER and HASH_ASSIGN cases differs much.
> On the other hand, it is probably possible to merge them carefully.
> I'll try.

Honestly, I'm not sure it wins on a performance basis. It just came from interface consistency (mmm, a bit different, maybe... convincibility?).

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
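The stash-and-consume protocol under discussion can be modeled in a few lines of standalone C. This is a toy, not dynahash: the names (`table_delete`, `table_insert`, `stashed`, `freelist_len`) are invented, and a counter stands in for the real free list. What it shows is the shape of the proposal: a delete with reuse=true parks the element instead of returning it to the free list, and the next insert consumes the stash if one exists, falling back to the free list otherwise.

```c
#include <assert.h>
#include <stddef.h>

static void *stashed;       /* at most one parked element */
static long  freelist_len;  /* stand-in for the real free list */

static void table_delete(void *entry, int reuse)
{
    if (reuse)
        stashed = entry;    /* park it for the expected follow-up insert */
    else
        freelist_len++;     /* normal HASH_REMOVE-style path */
}

/* Returns the parked element if one exists; returns NULL to mean
 * "taken from the free list" (a real implementation would return
 * the freshly acquired element instead). */
static void *table_insert(void)
{
    if (stashed)
    {
        void *e = stashed;  /* insert consumes the stash if present */

        stashed = NULL;
        return e;
    }
    freelist_len--;         /* no stash: fall back to the free list */
    return NULL;
}
```

A real version would also need the "return an unused stash to the freelist" path mentioned above, e.g. at transaction end, so a reuse-delete that is never followed by an insert does not leak the element.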
Re: BufferAlloc: don't take two simultaneous locks
At Fri, 11 Mar 2022 11:30:27 +0300, Yura Sokolov wrote in
> В Пт, 11/03/2022 в 15:30 +0900, Kyotaro Horiguchi пишет:
> > At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov wrote in
> > > В Вт, 01/03/2022 в 10:24 +0300, Yura Sokolov пишет:
> > > > Ok, here is v4.
> > >
> > > And here is v5.
> > >
> > > First, there was a compilation error in an Assert in dynahash.c.
> > > Excuse me for not checking before sending the previous version.
> > >
> > > Second, I add a third commit that reduces the HASHHDR allocation
> > > size for non-partitioned dynahash:
> > > - moved freeList to the last position
> > > - alloc and memset offset(HASHHDR, freeList[1]) for
> > >   non-partitioned hash tables.
> > > I didn't benchmark it, but I will be surprised if it
> > > matters much in a performance sense.
> > >
> > > Third, I put all three commits into a single file so as not to
> > > confuse the commitfest application.
> >
> > Thanks! I looked into the dynahash part.
> >
> > struct HASHHDR
> > {
> > - /*
> > -  * The freelist can become a point of contention in high-concurrency hash
> >
> > Why did you move around the freeList?
> >
> > - long nentries; /* number of entries in associated buckets */
> > + long nfree;    /* number of free entries in the list */
> > + long nalloced; /* number of entries initially allocated for
> >
> > Why do we need nfree? HASH_ASSIGN should do the same thing as
> > HASH_REMOVE. Maybe the reason is that the code tries to put the detached
> > bucket on a different free list, but we can just remember the
> > freelist_idx for the detached bucket as we do for hashp. I think that
> > should largely reduce the footprint of this patch.
>
> If we keep nentries, then we need to fix nentries in both the old
> "freeList" partition and the new one. That is two freeList[partition]->mutex
> lock+unlock pairs.
>
> But the count of free elements doesn't change, so if we change nentries
> to nfree, then there is no need to fix the freeList[partition]->nfree
> counters, and no need to lock+unlock.

Ah, okay.
I missed that bucket reuse changes the key in most cases. But still I don't think it's good to move entries around partition freelists, for another reason: I'm afraid that the freelists could get into an imbalanced state. get_hash_entry prefers main shmem allocation over other freelists, so that could lead to freelist bloat, or worse contention than the traditional way involving more than two partitions.

I'll examine the possibility to resolve this...

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
Re: On login trigger: take three
Andres Freund writes:
> I was thinking that the way to use it would be to specify it as a client
> option. Like PGOPTIONS='-c ignore_event_trigger=login' psql.

Ugh ... that would allow people (at least superusers) to bypass the login trigger at will, which seems to me to break a lot of the use-cases for the feature. I supposed we'd want this to be a PGC_POSTMASTER setting for security reasons.

regards, tom lane
Re: On login trigger: take three
Hi,

On 2022-03-13 19:57:08 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2022-03-13 23:31:03 +0100, Daniel Gustafsson wrote:
> >> Something like a '-c ignore_event_trigger=' GUC perhaps?
>
> > Did you mean login instead of event?
>
> > Something like it would work for me. It probably should be
> > GUC_DISALLOW_IN_FILE?
>
> Why? Inserting such a setting in postgresql.conf and restarting
> would be the first thing I'd think of if I needed to get out
> of a problem. The only other way is to set it on the postmaster
> command line, which is going to be awkward-to-impossible with
> most system-provided start scripts.

I was thinking that the way to use it would be to specify it as a client option. Like PGOPTIONS='-c ignore_event_trigger=login' psql.

Greetings, Andres Freund
Re: On login trigger: take three
Andres Freund writes:
> On 2022-03-13 23:31:03 +0100, Daniel Gustafsson wrote:
>> Something like a '-c ignore_event_trigger=' GUC perhaps?

> Did you mean login instead of event?

> Something like it would work for me. It probably should be
> GUC_DISALLOW_IN_FILE?

Why? Inserting such a setting in postgresql.conf and restarting would be the first thing I'd think of if I needed to get out of a problem. The only other way is to set it on the postmaster command line, which is going to be awkward-to-impossible with most system-provided start scripts.

regards, tom lane
Re: Tab completion not listing schema list for create/alter publication for all tables in schema
vignesh C writes: > Here "pg_%" should be "pg_%%". Right you are. Patch pushed, thanks! regards, tom lane
Re: On login trigger: take three
Hi,

On 2022-03-13 23:31:03 +0100, Daniel Gustafsson wrote:
> > On 12 Mar 2022, at 03:46, Andres Freund wrote:
> >> + The login event occurs when a user logs in to the
> >> + system.
> >> + Any bugs in a trigger procedure for this event may prevent successful
> >> + login to the system. Such bugs may be fixed after first restarting the
> >> + system in single-user mode (as event triggers are disabled in this mode).
> >> + See the reference page for details about
> >> + using single-user mode.
> >
> > I'm strongly against adding any new dependencies on single user mode.
> >
> > A saner approach might be a superuser-only GUC that can be set as part of the
> > connection data (e.g. PGOPTIONS='-c ignore_login_event_trigger=true').
>
> This, and similar approaches, has been discussed in this thread. I'm not a fan
> of holes punched with GUC's like this, but if you plan on removing single-user
> mode (as I recall seeing in an initdb thread) altogether then that kills the
> discussion.

So. Even if we end up not removing single-user mode, I think it's not OK to add new failure modes that require single-user mode to resolve after not-absurd operations (I'm OK with needing single-user mode if somebody does DELETE FROM pg_authid). It's too hard to reach for most.

We already have GUCs like row_security, so it doesn't seem insane to add one that disables login event triggers. What is the danger that you see?

> Since we already recommend single-user mode to handle broken event triggers, we
> should IMO do something to cover all of them rather than risk ending up with
> one disabling GUC per event type. Right now we have this on the CREATE
> EVENT TRIGGER reference page:

IMO the other types of event triggers make it a heck of a lot harder to get yourself into a situation that you can't get out of...

> "Event triggers are disabled in single-user mode (see postgres).
> If an erroneous event trigger disables the database so much that you
> can't even drop the trigger, restart in single-user mode and you'll be
> able to do that."
>
> Something like a '-c ignore_event_trigger=' GUC perhaps?

Did you mean login instead of event? Something like it would work for me.
It probably should be GUC_DISALLOW_IN_FILE?

Greetings,

Andres Freund
Re: BufferAlloc: don't take two simultaneous locks
On Sun, Mar 13, 2022 at 3:27 PM Yura Sokolov wrote: > В Вс, 13/03/2022 в 07:05 -0700, Zhihong Yu пишет: > > > > Hi, > > In the description: > > > > There is no need to hold both lock simultaneously. > > > > both lock -> both locks > > Thanks. > > > +* We also reset the usage_count since any recency of use of the old > > > > recency of use -> recent use > > Thanks. > > > +BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse) > > > > Later on, there is code: > > > > + reuse ? HASH_REUSE : HASH_REMOVE, > > > > Can flag (such as HASH_REUSE) be passed to BufTableDelete() instead of > bool ? That way, flag can be used directly in the above place. > > No. > BufTable* functions are created to abstract Buffer Table from dynahash. > Pass of HASH_REUSE directly will break abstraction. > > > + longnalloced; /* number of entries initially allocated > for > > > > nallocated isn't very long. I think it would be better to name the field > nallocated 'nallocated'. > > It is debatable. > Why not num_allocated? allocated_count? number_of_allocations? > Same points for nfree. > `nalloced` is recognizable and unambiguous. And there are a lot > of `*alloced` in the postgresql's source, so this one will not > be unusual. > > I don't see the need to make it longer. > > But if someone supports your point, I will not mind to changing > the name. > > > + sum += hashp->hctl->freeList[i].nalloced; > > + sum -= hashp->hctl->freeList[i].nfree; > > > > I think it would be better to calculate the difference between nalloced > and nfree first, then add the result to sum (to avoid overflow). > > Doesn't really matter much, because calculation must be valid > even if all nfree==0. > > I'd rather debate use of 'long' in dynahash at all: 'long' is > 32bit on 64bit Windows. It is better to use 'Size' here. > > But 'nelements' were 'long', so I didn't change things. I think > it is place for another patch. > > (On the other hand, dynahash with 2**31 elements is at least > 512GB RAM... 
we doubtfully trigger problem before OOM killer > came. Does Windows have an OOM killer?) > > > Subject: [PATCH 3/3] reduce memory allocation for non-partitioned > dynahash > > > > memory allocation -> memory allocations > > For each dynahash instance single allocation were reduced. > I think, 'memory allocation' is correct. > > Plural will be > reduce memory allocations for non-partitioned dynahashes > ie both 'allocations' and 'dynahashes'. > Am I wrong? > > Hi, bq. reduce memory allocation for non-partitioned dynahash It seems the following is clearer: reduce one memory allocation for every non-partitioned dynahash Cheers
Re: On login trigger: take three
> On 12 Mar 2022, at 03:46, Andres Freund wrote: >> + >> + The login event occurs when a user logs in to the >> + system. >> + Any bugs in a trigger procedure for this event may prevent successful >> + login to the system. Such bugs may be fixed after first restarting the >> + system in single-user mode (as event triggers are disabled in this >> mode). >> + See the reference page for details about >> + using single-user mode. >> + > > I'm strongly against adding any new dependencies on single user mode. > > A saner approach might be a superuser-only GUC that can be set as part of the > connection data (e.g. PGOPTIONS='-c ignore_login_event_trigger=true'). This, and similar approaches, has been discussed in this thread. I'm not a fan of holes punched with GUC's like this, but if you plan on removing single-user mode (as I recall seeing in an initdb thread) altogether then that kills the discussion. So. Since we already recommend single-user mode to handle broken event triggers, we should IMO do something to cover all of them rather than risk ending up with one disabling GUC per each event type. Right now we have this on the CREATE EVENT TRIGGER reference page: "Event triggers are disabled in single-user mode (see postgres). If an erroneous event trigger disables the database so much that you can't even drop the trigger, restart in single-user mode and you'll be able to do that." Something like a '-c ignore_event_trigger=' GUC perhaps? -- Daniel Gustafsson https://vmware.com/
Re: BufferAlloc: don't take two simultaneous locks
On Sun, 13/03/2022 at 07:05 -0700, Zhihong Yu wrote:
>
> Hi,
> In the description:
>
>     There is no need to hold both lock simultaneously.
>
> both lock -> both locks

Thanks.

> +* We also reset the usage_count since any recency of use of the old
>
> recency of use -> recent use

Thanks.

> +BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
>
> Later on, there is code:
>
> +    reuse ? HASH_REUSE : HASH_REMOVE,
>
> Can flag (such as HASH_REUSE) be passed to BufTableDelete() instead of
> bool? That way, flag can be used directly in the above place.

No. The BufTable* functions are created to abstract the Buffer Table from
dynahash. Passing HASH_REUSE directly would break the abstraction.

> +    long    nalloced;    /* number of entries initially allocated for
>
> nallocated isn't very long. I think it would be better to name the field
> 'nallocated'.

It is debatable. Why not num_allocated? allocated_count?
number_of_allocations? Same points for nfree.

`nalloced` is recognizable and unambiguous. And there are a lot of
`*alloced` in the postgresql's source, so this one will not be unusual.
I don't see the need to make it longer.

But if someone supports your point, I will not mind changing the name.

> +    sum += hashp->hctl->freeList[i].nalloced;
> +    sum -= hashp->hctl->freeList[i].nfree;
>
> I think it would be better to calculate the difference between nalloced
> and nfree first, then add the result to sum (to avoid overflow).

Doesn't really matter much, because the calculation must be valid even if
all nfree==0.

I'd rather debate the use of 'long' in dynahash at all: 'long' is 32bit
on 64bit Windows. It is better to use 'Size' here. But 'nelements' was
'long', so I didn't change things. I think it is a place for another
patch.

(On the other hand, a dynahash with 2**31 elements is at least 512GB
RAM... we would hardly trigger the problem before the OOM killer came.
Does Windows have an OOM killer?)
> Subject: [PATCH 3/3] reduce memory allocation for non-partitioned dynahash
>
> memory allocation -> memory allocations

For each dynahash instance a single allocation was reduced.
I think 'memory allocation' is correct.

The plural would be

    reduce memory allocations for non-partitioned dynahashes

ie both 'allocations' and 'dynahashes'.
Am I wrong?

--
regards
Yura Sokolov

From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/3] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to complex dependency chain
that hurts at high concurrency level.

There is no need to hold both locks simultaneously. Buffer is
pinned so other processes could not select it for eviction. If tag
is cleared and buffer removed from old partition other processes
will not find it. Therefore it is safe to release old partition
lock before acquiring new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 198 ++--
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..f7dbfc90aaa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	}
 
 	/*
-	 * To change the association of a valid buffer, we'll need to have
-	 * exclusive lock on both the old and new mapping partitions.
+	 * To change the association of a valid buffer, we'll need to reset
+	 * tag first, so we need to have exclusive lock on the old mapping
+	 * partitions.
 	 */
 	if (oldFlags & BM_TAG_VALID)
 	{
@@ -1289,93 +1290,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		oldHash = BufTableHashCode();
 		oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-		/*
-		 * Must lock the lower-numbered partition first to avoid
-		 * deadlocks.
-		 */
-		if (oldPartitionLock < newPartitionLock)
-		{
-			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-		}
-		else if (oldPartitionLock > newPartitionLock)
-		{
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-		}
-		else
-		{
-			/* only one partition, only one lock */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-		}
+		LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 	}
 	else
 	{
-		/* if it wasn't valid, we need only the new partition */
-		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 		/* remember we have no old-partition lock or tag */
 		oldPartitionLock = NULL;
 		/* keep the compiler quiet about uninitialized variables */
 		oldHash = 0;
 	}
 
-	/*
-	 * Try to make a hashtable entry for the buffer under its new tag.
-	 * This could fail because while we were writing someone else
-	 *
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Sun, Mar 13, 2022 at 01:54:10PM +0530, Bharath Rupireddy wrote: > Thanks for reviewing this. I agree with all of the above suggestions > and incorporated them in the v2 patch. Thanks for the new patch. > Another thing I added in v2 is to not emit snapshot and mapping files > stats in case of restartpoint as logical decoding isn't supported on > standbys, so it doesn't make sense to emit the stats there and cause > server log to grow unnecessarily. Having said that, I added a note > there to change it whenever logical decoding on standbys is supported. I think we actually do want to include this information for restartpoints since some files might be left over from the snapshot that was used to create the standby. Also, presumably these functions could do some work during recovery on a primary. Another problem I see is that this patch depends on the return value of the lstat() calls that we are trying to remove in 0001 from another thread [0]. I think the size of the removed/sync'd files is somewhat useful for understanding disk space usage, but I suspect the time spent performing these tasks is more closely related to the number of files. Do you think reporting the sizes is worth the extra system call? [0] https://postgr.es/m/20220215231123.GA2584239%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: SQL/JSON: JSON_TABLE
On 2/9/22 08:22, Himanshu Upadhyaya wrote:
> On Wed, Feb 2, 2022 at 12:44 AM Andrew Dunstan wrote:
>>
>> rebased with some review comments attended to.
> I am in process of reviewing these patches, initially, have started
> with 0002-JSON_TABLE-v55.patch.
> Tested many different scenarios with various JSON messages and these
> all are working as expected. Just one question on the below output.
>
> ‘postgres[1406146]=#’SELECT * FROM JSON_TABLE(jsonb '1', '$' COLUMNS
> (a int PATH '$.a' ERROR ON EMPTY)) jt;
>  a
> ---
>
> (1 row)
>
> ‘postgres[1406146]=#’SELECT * FROM JSON_TABLE(jsonb '1', '$' COLUMNS
> (a int PATH '$.a' ERROR ON ERROR)) jt;
>  a
> ---
>
> (1 row)
>
> is not "ERROR ON ERROR" is expected to give error?

I think I understand what's going on here.

In the first example 'ERROR ON EMPTY' causes an error condition, but as
the default action for an error condition is to return null that's what
happens. To get an error raised you would need to say 'ERROR ON EMPTY
ERROR ON ERROR'. I don't know if that's according to spec. It seems kinda
screwy, arguably a POLA violation, although that would hardly be a first
for the SQL Standards body. But I'm speculating here, I'm not a standards
lawyer.

In the second case it looks like there isn't really an error. There would
be if you used 'strict' in the path expression.

This whole area needs more documentation.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: Issue with pg_stat_subscription_stats
On Sat, Mar 12, 2022 at 3:15 PM Andres Freund wrote:
>
> Hi,
>
> On 2022-03-12 08:28:35 +0530, Amit Kapila wrote:
> > On Sat, Mar 12, 2022 at 2:14 AM Melanie Plageman wrote:
> > >
> > > So, I noticed that pg_stat_reset_subscription_stats() wasn't working
> > > properly, and, upon further investigation, I'm not sure the view
> > > pg_stat_subscription_stats is being properly populated.
> >
> > I have tried the below scenario based on this:
> > Step:1 Create some data that generates conflicts and lead to apply
> > failures and then check in the view:
>
> I think the problem is present when there was *no* conflict
> previously. Because nothing populates the stats entry without an error,
> the reset doesn't have anything to set the stats_reset field in, which
> then means that the stats_reset field is NULL even though stats have
> been reset.

Yes, this is what I meant. stats_reset is not initialized and without any
conflict happening to populate the stats, after resetting the stats, the
field still does not get populated. I think this is a bit unexpected.

psql (15devel)
Type "help" for help.

mplageman=# select * from pg_stat_subscription_stats ;
 subid | subname | apply_error_count | sync_error_count | stats_reset
-------+---------+-------------------+------------------+-------------
 16398 | mysub   |                 0 |                0 |
(1 row)

mplageman=# select pg_stat_reset_subscription_stats(16398);
 pg_stat_reset_subscription_stats
----------------------------------

(1 row)

mplageman=# select * from pg_stat_subscription_stats ;
 subid | subname | apply_error_count | sync_error_count | stats_reset
-------+---------+-------------------+------------------+-------------
 16398 | mysub   |                 0 |                0 |
(1 row)

- Melanie
Tab completion not listing schema list for create/alter publication for all tables in schema
Hi,

I noticed that the following commands "CREATE PUBLICATION pub1 FOR ALL
TABLES IN SCHEMA" and "ALTER PUBLICATION pub1 ADD ALL TABLES IN SCHEMA"
do not get completed with the schema list. I feel this is because of the
following code in tab-complete.c:

    COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_schemas
                             " AND nspname NOT LIKE E'pg_%'",
                             "CURRENT_SCHEMA");

Here "pg_%" should be "pg_%%". Attached a patch to handle this.
Thoughts?

Regards,
Vignesh

From 4321bafb2b7594f6c1af5d02f64e934fdba9c2ef Mon Sep 17 00:00:00 2001
From: Vigneshwaran C
Date: Sun, 13 Mar 2022 22:09:56 +0530
Subject: [PATCH] Tab completion not listing schema list for create/alter
 publication.

Tab completion not listing schema list for create/alter publication.
---
 src/bin/psql/tab-complete.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6957567264..6d5c928c10 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1811,7 +1811,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("(", "ALL TABLES IN SCHEMA", "TABLE");
 	else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD|DROP|SET", "ALL", "TABLES", "IN", "SCHEMA"))
 		COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_schemas
-								 " AND nspname NOT LIKE E'pg_%'",
+								 " AND nspname NOT LIKE E'pg_%%'",
 								 "CURRENT_SCHEMA");
 	/* ALTER PUBLICATION SET ( */
 	else if (HeadMatches("ALTER", "PUBLICATION", MatchAny) && TailMatches("SET", "("))
@@ -2956,7 +2956,7 @@ psql_completion(const char *text, int start, int end)
 	 */
 	else if (Matches("CREATE", "PUBLICATION", MatchAny, "FOR", "ALL", "TABLES", "IN", "SCHEMA"))
 		COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_schemas
-								 " AND nspname NOT LIKE E'pg_%'",
+								 " AND nspname NOT LIKE E'pg_%%'",
 								 "CURRENT_SCHEMA");
 	else if (Matches("CREATE", "PUBLICATION", MatchAny, "FOR", "ALL", "TABLES", "IN", "SCHEMA", MatchAny) && (!ends_with(prev_wd, ',')))
 		COMPLETE_WITH("WITH (");
-- 
2.32.0
Re: BufferAlloc: don't take two simultaneous locks
On Sun, Mar 13, 2022 at 3:25 AM Yura Sokolov wrote: > В Пт, 11/03/2022 в 17:21 +0900, Kyotaro Horiguchi пишет: > > At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi < > horikyota@gmail.com> wrote in > > > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi < > horikyota@gmail.com> wrote in > > > > Thanks! I looked into dynahash part. > > > > > > > > struct HASHHDR > > > > { > > > > - /* > > > > -* The freelist can become a point of contention in > high-concurrency hash > > > > > > > > Why did you move around the freeList? > > This way it is possible to allocate just first partition, not all 32 > partitions. > > > > > Then I looked into bufmgr part. It looks fine to me but I have some > > comments on code comments. > > > > >* To change the association of a valid buffer, we'll > need to have > > >* exclusive lock on both the old and new mapping > partitions. > > > if (oldFlags & BM_TAG_VALID) > > > > We don't take lock on the new mapping partition here. > > Thx, fixed. > > > +* Clear out the buffer's tag and flags. We must do this to > ensure that > > +* linear scans of the buffer array don't think the buffer is > valid. We > > +* also reset the usage_count since any recency of use of the > old content > > +* is no longer relevant. > > +* > > +* We are single pinner, we hold buffer header lock and exclusive > > +* partition lock (if tag is valid). Given these statements it > is safe to > > +* clear tag since no other process can inspect it to the moment. > > > > This comment is a merger of the comments from InvalidateBuffer and > > BufferAlloc. But I think what we need to explain here is why we > > invalidate the buffer here despite of we are going to reuse it soon. > > And I think we need to state that the old buffer is now safe to use > > for the new tag here. I'm not sure the statement is really correct > > but clearing-out actually looks like safer. > > I've tried to reformulate the comment block. 
> > > > > > Now it is safe to use victim buffer for new tag. Invalidate the > > > buffer before releasing header lock to ensure that linear scans of > > > the buffer array don't think the buffer is valid. It is safe > > > because it is guaranteed that we're the single pinner of the buffer. > > > That pin also prevents the buffer from being stolen by others until > > > we reuse it or return it to freelist. > > > > So I want to revise the following comment. > > > > -* Now it is safe to use victim buffer for new tag. > > +* Now reuse victim buffer for new tag. > > >* Make sure BM_PERMANENT is set for buffers that must be > written at every > > >* checkpoint. Unlogged buffers only need to be written at > shutdown > > >* checkpoints, except for their "init" forks, which need to be > treated > > >* just like permanent relations. > > >* > > >* The usage_count starts out at 1 so that the buffer can > survive one > > >* clock-sweep pass. > > > > But if you think the current commet is fine, I don't insist on the > > comment chagnes. > > Used suggestion. > > Fr, 11/03/22 Yura Sokolov wrote: > > В Пт, 11/03/2022 в 15:49 +0900, Kyotaro Horiguchi пишет: > > > BufTableDelete considers both reuse and !reuse cases but > > > BufTableInsert doesn't and always does HASH_ASSIGN. That looks > > > odd. We should use HASH_ENTER here. Thus I think it is more > > > reasonable that HASH_ENTRY uses the stashed entry if exists and > > > needed, or returns it to freelist if exists but not needed. > > > > > > What do you think about this? > > > > Well... I don't like it but I don't mind either. > > > > Code in HASH_ENTER and HASH_ASSIGN cases differs much. > > On the other hand, probably it is possible to merge it carefuly. > > I'll try. > > I've merged HASH_ASSIGN into HASH_ENTER. > > As in previous letter, three commits are concatted to one file > and could be applied with `git am`. 
> ---
> regards
> Yura Sokolov
> Postgres Professional
> y.soko...@postgrespro.ru
> funny.fal...@gmail.com

Hi,
In the description:

    There is no need to hold both lock simultaneously.

both lock -> both locks

    +* We also reset the usage_count since any recency of use of the old

recency of use -> recent use

    +BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)

Later on, there is code:

    + reuse ? HASH_REUSE : HASH_REMOVE,

Can flag (such as HASH_REUSE) be passed to BufTableDelete() instead of
bool? That way, flag can be used directly in the above place.

    + long		nalloced;	/* number of entries initially allocated for

nallocated isn't very long. I think it would be better to name the field
'nallocated'.

    + sum += hashp->hctl->freeList[i].nalloced;
    + sum -= hashp->hctl->freeList[i].nfree;

I think it would be better to calculate
Re: Support logical replication of DDLs
On Mon, Feb 21, 2022 at 9:43 PM Zheng Li wrote: > > Hello, > > One of the most frequently requested improvements from our customers > is to reduce downtime associated with software updates (both major and > minor versions). To do this, we have reviewed potential contributions to > improving logical replication. > > I’m working on a patch to support logical replication of data > definition language statements (DDLs). This is a useful feature when a > database in logical replication has lots of tables, functions and > other objects that change over time, such as in online cross major > version upgrade. +1 > I put together a prototype that replicates DDLs using the generic > messages for logical decoding. The idea is to log the candidate DDL > string in ProcessUtilitySlow() using LogLogicalMessge() with a new > flag in WAL record type xl_logical_message indicating it’s a DDL > message. The xl_logical_message record is decoded and sent to the > subscriber via pgoutput. The logical replication worker process is > dispatched for this new DDL message type and executes the command > accordingly. If you don't mind, would you like to share the POC or the branch for this work? > However, there are still many edge cases to sort out because not every > DDL statement can/should be replicated. Some of these include: > 3. CREATE TABLE AS and SELECT INTO, For example: > > CREATE TABLE foo AS > SELECT field_1, field_2 FROM bar; > > There are a few issues that can occur here. For one, it’s possible > that table bar doesn't exist on the subscriber. Even if “bar” does > exist, it may not be fully up-to-date with the publisher, which would > cause a data mismatch on “foo” between the publisher and subscriber. In such cases why don't we just log the table creation WAL for DDL instead of a complete statement which creates the table and inserts the tuple? 
Because we are already WAL-logging the individual inserts, and once you
make sure of replicating the table creation, I think the exact data
insertion on the subscriber side will be taken care of by the insert
WALs, no?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Add id's to various elements in protocol.sgml
On 09.03.2022 at 20:43, Brar Piening wrote:

Attached is a pretty huge patch that adds ids to all sections and all the
varlistentries where the containing variablelist already had at least one
id (plus a few additional ones that I stumbled upon and deemed useful).
It also adds html links next to the respective heading in the html
documentation and emits a build message and a comment when a section or a
relevant (see before) varlistentry doesn't have an id.

I have uploaded a doc build with the patch applied to
https://pgdocs.piening.info/ to make it easier for you all to review the
results and see what is there and what isn't and how it feels UI-wise.

You may want to look at https://pgdocs.piening.info/app-psql.html where
the patch adds ids and links to all varlistentries but doesn't do so for
the headings (because they are refsect1 headings, not sect1 headings).
https://pgdocs.piening.info/protocol-flow.html is pretty much the
opposite: the patch adds ids and links to the headings (they are sect2
headings) but doesn't add them to the varlistentries (yet - because I
mostly stuck to the algorithm suggested at
https://www.postgresql.org/message-id/621FAF40.5070507%40anastigmatix.net
to contain the workload).
Re: BufferAlloc: don't take two simultaneous locks
В Пт, 11/03/2022 в 17:21 +0900, Kyotaro Horiguchi пишет: > At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi > wrote in > > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi > > wrote in > > > Thanks! I looked into dynahash part. > > > > > > struct HASHHDR > > > { > > > - /* > > > -* The freelist can become a point of contention in > > > high-concurrency hash > > > > > > Why did you move around the freeList? This way it is possible to allocate just first partition, not all 32 partitions. > > Then I looked into bufmgr part. It looks fine to me but I have some > comments on code comments. > > >* To change the association of a valid buffer, we'll need to > > have > >* exclusive lock on both the old and new mapping partitions. > > if (oldFlags & BM_TAG_VALID) > > We don't take lock on the new mapping partition here. Thx, fixed. > +* Clear out the buffer's tag and flags. We must do this to ensure > that > +* linear scans of the buffer array don't think the buffer is valid. > We > +* also reset the usage_count since any recency of use of the old > content > +* is no longer relevant. > +* > +* We are single pinner, we hold buffer header lock and exclusive > +* partition lock (if tag is valid). Given these statements it is > safe to > +* clear tag since no other process can inspect it to the moment. > > This comment is a merger of the comments from InvalidateBuffer and > BufferAlloc. But I think what we need to explain here is why we > invalidate the buffer here despite of we are going to reuse it soon. > And I think we need to state that the old buffer is now safe to use > for the new tag here. I'm not sure the statement is really correct > but clearing-out actually looks like safer. I've tried to reformulate the comment block. > > > Now it is safe to use victim buffer for new tag. Invalidate the > > buffer before releasing header lock to ensure that linear scans of > > the buffer array don't think the buffer is valid. 
It is safe > > because it is guaranteed that we're the single pinner of the buffer. > > That pin also prevents the buffer from being stolen by others until > > we reuse it or return it to freelist. > > So I want to revise the following comment. > > -* Now it is safe to use victim buffer for new tag. > +* Now reuse victim buffer for new tag. > >* Make sure BM_PERMANENT is set for buffers that must be written at > > every > >* checkpoint. Unlogged buffers only need to be written at shutdown > >* checkpoints, except for their "init" forks, which need to be > > treated > >* just like permanent relations. > >* > >* The usage_count starts out at 1 so that the buffer can survive one > >* clock-sweep pass. > > But if you think the current commet is fine, I don't insist on the > comment chagnes. Used suggestion. Fr, 11/03/22 Yura Sokolov wrote: > В Пт, 11/03/2022 в 15:49 +0900, Kyotaro Horiguchi пишет: > > BufTableDelete considers both reuse and !reuse cases but > > BufTableInsert doesn't and always does HASH_ASSIGN. That looks > > odd. We should use HASH_ENTER here. Thus I think it is more > > reasonable that HASH_ENTRY uses the stashed entry if exists and > > needed, or returns it to freelist if exists but not needed. > > > > What do you think about this? > > Well... I don't like it but I don't mind either. > > Code in HASH_ENTER and HASH_ASSIGN cases differs much. > On the other hand, probably it is possible to merge it carefuly. > I'll try. I've merged HASH_ASSIGN into HASH_ENTER. As in previous letter, three commits are concatted to one file and could be applied with `git am`. --- regards Yura Sokolov Postgres Professional y.soko...@postgrespro.ru funny.fal...@gmail.com From fbec0dd7d9f11aeaeb8f141ad3dedab7178aeb2e Mon Sep 17 00:00:00 2001 From: Yura Sokolov Date: Mon, 21 Feb 2022 08:49:03 +0300 Subject: [PATCH 1/3] bufmgr: do not acquire two partition locks. Acquiring two partition locks leads to complex dependency chain that hurts at high concurrency level. 
There is no need to hold both lock simultaneously. Buffer is
pinned so other processes could not select it for eviction. If tag
is cleared and buffer removed from old partition other processes
will not find it. Therefore it is safe to release old partition
lock before acquiring new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 198 ++--
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..63824b15686 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	}
 
 	/*
-	 * To change the association of a valid
Re: Defer selection of asynchronous subplans until the executor initialization stage
Hi Alexander,

On Wed, Sep 15, 2021 at 3:40 PM Alexander Pyhalov wrote:
> Etsuro Fujita wrote on 2021-08-30 12:52:
> > To allow async execution in a bit more cases, I modified the patch a
> > bit further: a ProjectionPath put directly above an async-capable
> > ForeignPath would also be considered async-capable as ForeignScan can
> > project and no separate Result is needed in that case, so I modified
> > mark_async_capable_plan() as such, and added test cases to the
> > postgres_fdw regression test. Attached is an updated version of the
> > patch.
> The patch looks good to me and seems to work as expected.

Thanks for reviewing! I’m planning to commit the patch. Sorry for the
long delay.

Best regards,
Etsuro Fujita
Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit
On Sat, Mar 12, 2022 at 10:02 AM David Zhang wrote: > Applied patches v6-0002 and v6-0003 to master branch, and the `make check` > test is ok. > > Here is my test result in 10 times average on 3 virtual machines: > before the patches: > > abort.1 = 2.5473 ms > > abort.2 = 4.1572 ms > > after the patches with OPTIONS (ADD parallel_abort 'true'): > > abort.1 = 1.7136 ms > > abort.2 = 2.5833 ms > > Overall, it has about 32 ~ 37 % improvement in my testing environment. I think that is a great improvement. Thanks for testing! Best regards, Etsuro Fujita
Re: add checkpoint stats of snapshot and mapping files of pg_logical dir
On Sat, Mar 12, 2022 at 1:35 AM Nathan Bossart wrote: > > +CheckpointStats.repl_map_cutoff_lsn = cutoff; > > Could we set repl_map_cutoff_lsn closer to where it is calculated? Right > now, it's set at the bottom of the function just before the directory is > freed. Is there a strong reason to do it there? > > +"logical rewrite mapping file(s) removed/synced=" > UINT64_FORMAT ", size=%zu bytes, time=%ld.%03d s, cutoff LSN=%X/%X", > > Can we split out the "removed" versus "synced" files? Otherwise, I think > this is basically just telling us how many files are in the mappings > directory. > > +CheckpointStats.repl_snap_cutoff_lsn = cutoff; > > I have the same note for this one as I have for repl_map_cutoff_lsn. Thanks for reviewing this. I agree with all of the above suggestions and incorporated them in the v2 patch. Another thing I added in v2 is to not emit snapshot and mapping files stats in case of restartpoint as logical decoding isn't supported on standbys, so it doesn't make sense to emit the stats there and cause server log to grow unnecessarily. Having said that, I added a note there to change it whenever logical decoding on standbys is supported. Please review v2. Regards, Bharath Rupireddy. v2-0001-add-checkpoint-stats-of-snapshot-and-mapping-file.patch Description: Binary data