Re: Protocol problem with GSSAPI encryption?
On Fri, Feb 21, 2020 at 12:35:03AM +, Andrew Gierth wrote:
> > "Stephen" == Stephen Frost writes:
>
> >> I figure something along these lines for the fix. Anyone in a
> >> position to test this?
>
> Stephen> At least at first blush, I tend to agree with your analysis
> Stephen> and patch.
>
> Stephen> I'll see about getting this actually set up and tested in the
> Stephen> next week or so (and maybe there's some way to also manage to
> Stephen> have a regression test for it..).
>
> *poke*

Second *poke*.

--
Michael
001_rep_changes.pl stalls
Executive summary: the "MyWalSnd->write < sentPtr" test in WalSndWaitForWal() is important for promptly updating pg_stat_replication. When caught up, we should impose that logic before every sleep. The one-line fix is to sleep in WalSndLoop() only when pq_is_send_pending(), not when caught up.

On my regular development machine, src/test/subscription/t/001_rep_changes.pl stalls for ~10s at this wait_for_catchup:

$node_publisher->safe_psql('postgres', "DELETE FROM tab_rep");

# Restart the publisher and check the state of the subscriber which
# should be in a streaming state after catching up.
$node_publisher->stop('fast');
$node_publisher->start;

$node_publisher->wait_for_catchup('tap_sub');

That snippet emits three notable physical WAL records: a Transaction/COMMIT at the end of the DELETE, an XLOG/CHECKPOINT_SHUTDOWN, and an XLOG/FPI_FOR_HINT. The buildfarm has stalled there, but it happens probably less than half the time. Examples[1] showing the stall:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mandrill&dt=2020-03-20%2017%3A09%3A53&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=thorntail&dt=2020-03-22%2019%3A51%3A38&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-19%2003%3A35%3A01&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-20%2015%3A15%3A01&stg=subscription-check

Here's the most-relevant walsender call tree:

WalSndLoop
  XLogSendLogical (caller invokes once per loop iteration, via send_data callback)
    XLogReadRecord (caller invokes once)
      ReadPageInternal (caller invokes twice in this test; more calls are possible)
        logical_read_xlog_page (caller skips when page is same as last call, else
            invokes 1-2 times via state->read_page() callback, registered in
            StartLogicalReplication)
          WalSndWaitForWal (caller invokes once; has fast path)

The cause is a race involving the flow of reply messages (send_feedback() messages) from the logical apply worker to the walsender. Here are two sequencing patterns; the more-indented parts are what differ.

Stalling pattern:

  sender reads Transaction/COMMIT and sends the changes
    receiver applies the changes
    receiver send_feedback() reports progress up to Transaction/COMMIT
    sender accepts the report
  sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are
    no-ops for logical rep
  sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
  receiver wal_receiver_status_interval elapses; receiver reports progress up
    to Transaction/COMMIT
  sender wakes up, accepts the report
  sender calls WalSndWaitForWal(), which sends a keepalive due to
    "MyWalSnd->write < sentPtr"
  receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT

Non-stalling pattern (more prevalent with lower machine performance):

  sender reads Transaction/COMMIT and sends the changes
  sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are
    no-ops for logical rep
  sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
    receiver applies the changes
    receiver send_feedback() reports progress up to Transaction/COMMIT
  sender wakes up, accepts the report
  sender calls WalSndWaitForWal(), which sends a keepalive due to
    "MyWalSnd->write < sentPtr"
  receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT

The fix is to test "MyWalSnd->write < sentPtr" before more sleeps. The test is unnecessary when sleeping due to pq_is_send_pending(); in that case, the receiver is not idle and will reply before idling. I changed WalSndLoop() to sleep only for pq_is_send_pending(). For all other sleep reasons, the sleep will happen in WalSndWaitForWal(). Attached.

I don't know whether this is important outside of testing scenarios. I lean against back-patching, but I will back-patch if someone thinks this qualifies as a performance bug.

Thanks,
nm

[1] I spot-checked only my animals, since I wanted to experiment on an affected animal.

Author: Noah Misch
Commit: Noah Misch

    When WalSndCaughtUp, sleep only in WalSndWaitForWal().

    Before sleeping, WalSndWaitForWal() sends a keepalive if
    MyWalSnd->write < sentPtr.  That is important in logical replication.
    When the latest physical LSN yields no logical replication messages (a
    common case), that keepalive elicits a reply, and processing the reply
    updates pg_stat_replication.replay_lsn.  WalSndLoop() lacks that; when
    WalSndLoop() slept, replay_lsn advancement could stall until
    wal_receiver_status_interval elapsed.  This sometimes stalled
    src/test/subscription/t/001_rep_changes.pl for up to 10s.

    Reviewed by FIXME.

    Discussion: https://postgr.es/m/FIXME

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9e56115..d9c
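[Editorial note: to make the shape of the one-line fix concrete, here is a minimal C sketch of the WalSndLoop() sleep decision before and after, with all walsender internals reduced to stub declarations. Names mirror walsender.c, but this is a simplified model, not the committed diff.]

#include <stdbool.h>

extern bool WalSndCaughtUp;           /* set once the sender has no more WAL to send */
extern bool pq_is_send_pending(void); /* libpq output buffer not fully flushed? */
extern void WalSndWait(void);         /* stub for the latch/socket wait */

static void
wal_snd_loop_sleep_decision(void)
{
#ifdef BEFORE_FIX
	/*
	 * Before: sleep when caught up OR when the output buffer is full.
	 * Sleeping on "caught up" here skips the keepalive that
	 * WalSndWaitForWal() would send when MyWalSnd->write < sentPtr, so
	 * replay_lsn can stall until wal_receiver_status_interval elapses.
	 */
	if (WalSndCaughtUp || pq_is_send_pending())
		WalSndWait();
#else
	/*
	 * After: only sleep here for a pending send; every other sleep happens
	 * in WalSndWaitForWal(), which sends the keepalive first.
	 */
	if (pq_is_send_pending())
		WalSndWait();
#endif
}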
Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)
On Tue, 31 Mar 2020 at 14:13, Masahiko Sawada wrote:
> On Tue, 31 Mar 2020 at 12:58, Amit Kapila wrote:
> > On Mon, Mar 30, 2020 at 12:31 PM Masahiko Sawada wrote:
> > >
> > > The patch for vacuum conflicts with recent changes in vacuum. So I've
> > > attached a rebased one.
> >
> > + /*
> > + * Next, accumulate buffer usage. (This must wait for the workers to
> > + * finish, or we might get incomplete data.)
> > + */
> > + for (i = 0; i < nworkers; i++)
> > + InstrAccumParallelQuery(&lps->buffer_usage[i]);
> > +
> >
> > This should be done for launched workers aka
> > lps->pcxt->nworkers_launched. I think a similar problem exists in the
> > create index related patch.
>
> You're right. Fixed in the new patches.
>
> On Mon, 30 Mar 2020 at 17:00, Julien Rouhaud wrote:
> >
> > Just minor nitpicking:
> >
> > + int i;
> >
> > Assert(!IsParallelWorker());
> > Assert(ParallelVacuumIsActive(lps));
> > @@ -2166,6 +2172,13 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
> > /* Wait for all vacuum workers to finish */
> > WaitForParallelWorkersToFinish(lps->pcxt);
> >
> > + /*
> > + * Next, accumulate buffer usage. (This must wait for the workers to
> > + * finish, or we might get incomplete data.)
> > + */
> > + for (i = 0; i < nworkers; i++)
> > + InstrAccumParallelQuery(&lps->buffer_usage[i]);
> >
> > We now allow declaring a variable in those loops, so it may be better to
> > avoid declaring i outside the for scope?
>
> We can do that but I was not sure if it's good since other code around
> there doesn't do that. So I'd like to leave it for committers. It's a
> trivial change.

I've updated the buffer usage patch for parallel index creation, as the previous patch conflicts with commit df3b181499b40.

This comment in commit df3b181499b40 seems to be the comment that Amit had replaced with a better sentence when introducing buffer usage to parallel vacuum:

+ /*
+ * Estimate space for WalUsage -- PARALLEL_KEY_WAL_USAGE
+ *
+ * WalUsage during execution of maintenance command can be used by an
+ * extension that reports the WAL usage, such as pg_stat_statements. We
+ * have no way of knowing whether anyone's looking at pgWalUsage, so do it
+ * unconditionally.
+ */

Would the following sentence from vacuumlazy.c also be better for parallel create index?

 * If there are no extensions loaded that care, we could skip this. We
 * have no way of knowing whether anyone's looking at pgBufferUsage or
 * pgWalUsage, so do it unconditionally.

The attached patch changes to the above comment and removes the code that was used to un-support only buffer usage accumulation.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: bufferusage_create_index_v3.patch
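[Editorial note: the accumulation pattern under discussion, as a self-contained C sketch with the parallel machinery stubbed out. The key points from the review: workers record their own BufferUsage into shared memory, the leader must wait for them to finish before summing, and the loop bound must be nworkers_launched (workers that actually started), not the number planned.]

#include <stddef.h>

typedef struct BufferUsage
{
	long		shared_blks_hit;
	long		shared_blks_read;
	/* ... more counters in the real struct ... */
} BufferUsage;

typedef struct ParallelContext
{
	int			nworkers;			/* planned */
	int			nworkers_launched;	/* actually started */
} ParallelContext;

extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
extern void InstrAccumParallelQuery(BufferUsage *worker_usage);

static void
accumulate_worker_buffer_usage(ParallelContext *pcxt, BufferUsage *buffer_usage)
{
	/* Workers may still be writing their counters until this returns. */
	WaitForParallelWorkersToFinish(pcxt);

	/*
	 * Loop bound is nworkers_launched: slots beyond it were never filled
	 * if fewer workers started than were planned.
	 */
	for (int i = 0; i < pcxt->nworkers_launched; i++)
		InstrAccumParallelQuery(&buffer_usage[i]);
}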
Re: 2pc leaks fds
Hi,

On 2020-04-06 14:26:48 +0900, Michael Paquier wrote:
> 2PC shines with the code of xlogreader.c in this case because it keeps
> opening and closing XLogReaderState for a short amount of time. So it
> is not surprising to me to see this error only months after the fact
> because recovery or pg_waldump just use one XLogReaderState.

Well, it doesn't exactly signal that people (including me, up to just now) are testing their changes all that carefully...

> From what I can see, the error is that the code only bothers closing
> WALOpenSegment->seg when switching to a new segment, but we need also
> to close it when finishing the business in XLogReaderFree().

Yea, I came to the same conclusion and locally fixed it the same way (except having the close a bit earlier in XLogReaderFree()).

> diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
> index f3fea5132f..7e25e2050a 100644
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -144,6 +144,9 @@ XLogReaderFree(XLogReaderState *state)
>  	if (state->main_data)
>  		pfree(state->main_data);
>
> +	if (state->seg.ws_file >= 0)
> +		close(state->seg.ws_file);
> +
>  	pfree(state->errormsg_buf);
>  	if (state->readRecordBuf)
>  		pfree(state->readRecordBuf);

But I'm not sure it's quite the right idea. I'm not sure I fully understand the design of 0dc8ead46, but it looks to me like it's intended to allow users of the interface to have different ways of opening files. If we just close() the fd that'd be a bit more limited. OTOH, I think all but one (XLogPageRead()) of the current users of XLogReader use WALRead(), which also close()s the fd (before calling the WALSegmentOpen callback).

The XLogReader code flow has gotten quite complicated :(. XLogReaderReadRecord() -> state->read_page() -> logical_read_xlog_page etc -> WALRead() -> wal_segment_open callback etc. There's been a fair bit of change, making the interface more generic / powerful / reducing duplication, but not a lot of added / adapted comments in the header...

Greetings,

Andres Freund
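[Editorial note: one way to preserve the pluggable-open design Andres mentions would be a matching close callback. The sketch below is hypothetical -- the names WALSegmentCloseCB and segment_close are illustrative assumptions, and the fix actually proposed in this thread simply close()s the fd in XLogReaderFree().]

#include <unistd.h>

struct XLogReaderState;

typedef void (*WALSegmentCloseCB) (struct XLogReaderState *state);

typedef struct XLogReaderState
{
	int					ws_file;		/* open segment fd, or -1 */
	WALSegmentCloseCB	segment_close;	/* plugin-provided close; may be NULL */
} XLogReaderState;

/*
 * Hypothetical helper: let the counterpart of the plugin's open callback
 * clean up, falling back to a plain close() when none is registered.
 */
static void
xlogreader_close_segment(XLogReaderState *state)
{
	if (state->ws_file < 0)
		return;

	if (state->segment_close != NULL)
		state->segment_close(state);
	else
		close(state->ws_file);

	state->ws_file = -1;
}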
Re: 2pc leaks fds
On Sun, Apr 05, 2020 at 07:56:51PM -0700, Andres Freund wrote:
> I found this while trying to benchmark the effect of my snapshot changes
> on 2pc. I just used the attached pgbench file.
>
> I've not yet reviewed the change sufficiently to pinpoint the issue.

Indeed. It takes seconds to show up.

> It's a bit sad that nobody has hit this in the last few months :(.

2PC shines with the code of xlogreader.c in this case because it keeps opening and closing XLogReaderState for a short amount of time. So it is not surprising to me to see this error only months after the fact, because recovery or pg_waldump just use one XLogReaderState.

From what I can see, the error is that the code only bothers closing WALOpenSegment->seg when switching to a new segment, but we need also to close it when finishing the business in XLogReaderFree(). I am adding an open item, and attached is a patch to take care of the problem. Thoughts?

--
Michael

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f3fea5132f..7e25e2050a 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -144,6 +144,9 @@ XLogReaderFree(XLogReaderState *state)
 	if (state->main_data)
 		pfree(state->main_data);

+	if (state->seg.ws_file >= 0)
+		close(state->seg.ws_file);
+
 	pfree(state->errormsg_buf);
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
Re: backup manifests and contemporaneous buildfarm failures
Hello,

> > Do I need to precede those with some recursive chmod commands? Perhaps
> > the client should refuse to run if there is still something left after
> > these.
>
> I think the latter would be a very good idea, just so that this sort of
> failure is less obscure. Not sure about whether a recursive chmod is
> really going to be worth the cycles. (On the other hand, the normal case
> should be that there's nothing there anyway, so maybe it's not going to
> be costly.)

Could it be a two-stage process to minimize cost but still be resilient?

  rmtree
  if (-d $DIR) {
    emit warning
    chmodtree
    rmtree again
    if (-d $DIR)
      emit error
  }

--
Fabien.
SyncRepLock acquired exclusively in default configuration
Hi,

Due to the change below, when using the default postgres configuration of synchronous_commit = on, max_wal_senders = 10, every commit will now acquire a new exclusive lwlock after writing a commit record.

commit 48c9f4926562278a2fd2b85e7486c6d11705f177
Author: Simon Riggs
Date:   2017-12-29 14:30:33 +

    Fix race condition when changing synchronous_standby_names

    A momentary window exists when synchronous_standby_names
    changes that allows commands issued after the change to
    continue to act as async until the change becomes visible.
    Remove the race by using more appropriate test in syncrep.c

    Author: Asim Rama Praveen and Ashwin Agrawal
    Reported-by: Xin Zhang, Ashwin Agrawal, and Asim Rama Praveen
    Reviewed-by: Michael Paquier, Masahiko Sawada

As far as I can tell there was no discussion about the added contention due to this change in the relevant thread [1].

The default configuration has an empty synchronous_standby_names. Before this change we'd fall out of SyncRepWaitForLSN() before acquiring SyncRepLock in exclusive mode. Now we don't anymore. I'm really not ok with unnecessarily adding an exclusive lock acquisition to such a crucial path.

Greetings,

Andres Freund

[1] https://postgr.es/m/CABrsG8j3kPD%2Bkbbsx_isEpFvAgaOBNGyGpsqSjQ6L8vwVUaZAQ%40mail.gmail.com
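[Editorial note: the kind of cheap pre-check that avoids the exclusive acquisition on the common empty-synchronous_standby_names path can be sketched as below. The flag name and the lock stubs are illustrative assumptions, not the eventual fix; the point is that a stale read on the fast path only briefly delays or extends synchronous behavior around a configuration change, and the slow path re-checks under the lock.]

#include <stdbool.h>

typedef struct WalSndCtlData
{
	/* maintained under SyncRepLock when synchronous_standby_names changes */
	volatile bool sync_standbys_defined;
} WalSndCtlData;

extern WalSndCtlData *WalSndCtl;
extern void SyncRepLockAcquireExclusive(void);	/* stub */
extern void SyncRepLockRelease(void);			/* stub */

static void
sync_rep_wait_for_lsn_sketch(void)
{
	/*
	 * Fast exit without touching SyncRepLock when no synchronous standbys
	 * are configured -- the common default-configuration case.
	 */
	if (!WalSndCtl->sync_standbys_defined)
		return;

	SyncRepLockAcquireExclusive();
	/* ... re-check state, queue ourselves, wait for standby confirmation ... */
	SyncRepLockRelease();
}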
Re: Reinitialize stack base after fork (for the benefit of rr)?
On Sun, Apr 5, 2020 at 8:56 PM Andres Freund wrote:
> Perhaps put it on a wiki page?

I added a new major section to the "getting a stack trace" wiki page:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Recording_Postgres_using_rr_Record_and_Replay_Framework

Feel free to add to and edit this section yourself.

> Were you doing this because of occasional failures in autovacuum
> workers? If so, that shouldn't be necessary after the stack base change
> (previously workers IIRC also could start with the wrong stack base -
> but didn't end up checking stack depth except for expression indexes).

No, just a personal preference for things like this.

--
Peter Geoghegan
Re: Index Skip Scan
On Sun, Apr 5, 2020 at 9:39 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> >
> > I was just wondering how the distinct will work with the "skip scan"
> > if we have some filter? I mean every time we select the unique row
> > based on the prefix key and that might get rejected by an external
> > filter right?
>
> Not exactly. In the case of index-only scan, we are skipping to the first
> unique position, and then use already existing functionality
> (_bt_readpage with stepping to the next pages) to filter out those
> records that do not pass the condition.

I agree, but that will work only if we have a valid index clause; a "b%100=0" condition will not create an index clause, right? However, if we change the query to "select distinct(a) from t where b=100" then it works fine, because this condition will create an index clause.

> There are even a couple of tests
> in the patch for this. In case of index scan, when there are some
> conditions, the current implementation does not consider skipping.
>
> > So I tried an example to check this.
>
> Can you tell on which version of the patch you were testing?

I have tested on v33.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Reinitialize stack base after fork (for the benefit of rr)?
Hi,

On 2020-04-05 20:35:50 -0700, Peter Geoghegan wrote:
> I have found that it's useful to use rr to debug Postgres by following
> certain recipes. I'll share some of the details now, in case anybody
> else wants to start using rr and isn't sure where to start.

Perhaps put it on a wiki page?

> I have a script that records a postgres session using rr with these options:
>
> rr record -M /code/postgresql/$BRANCH/install/bin/postgres \
>   -D /code/postgresql/$BRANCH/data \
>   --log_line_prefix="%m %p " \
>   --autovacuum=off \

Were you doing this because of occasional failures in autovacuum workers? If so, that shouldn't be necessary after the stack base change (previously workers IIRC also could start with the wrong stack base - but didn't end up checking stack depth except for expression indexes).

Greetings,

Andres Freund
Re: Reinitialize stack base after fork (for the benefit of rr)?
On Sun, Apr 5, 2020 at 6:54 PM Andres Freund wrote:
> I just pushed that.

Great!

I have found that it's useful to use rr to debug Postgres by following certain recipes. I'll share some of the details now, in case anybody else wants to start using rr and isn't sure where to start.

I have a script that records a postgres session using rr with these options:

rr record -M /code/postgresql/$BRANCH/install/bin/postgres \
  -D /code/postgresql/$BRANCH/data \
  --log_line_prefix="%m %p " \
  --autovacuum=off \
  --effective_cache_size=1GB \
  --random_page_cost=4.0 \
  --work_mem=4MB \
  --maintenance_work_mem=64MB \
  --fsync=off \
  --log_statement=all \
  --log_min_messages=DEBUG5 \
  --max_connections=50 \
  --shared_buffers=32MB

Most of these settings were taken from a similar script that I use to run Valgrind, so the particulars may not matter much -- though it's useful to make the server logs as verbose as possible (you'll see why in a minute).

I find it quite practical to run "make installcheck" against the server, recording the entire execution. I find that it's not that much slower than just running the tests against a regular debug build of Postgres. It's still much faster than Valgrind, for example. (Replaying the recording seems to be where having a high end machine helps a lot.)

Once the tests are done, I stop Postgres in the usual way (Ctrl + C). The recording is saved to the $HOME/.local/share/rr/ directory on my Linux distro -- rr creates a directory for each distinct recording in this parent directory. rr also maintains a symlink (latest-trace) that points to the latest recording directory, which I rely on most of the time when replaying a recording (it's the default). I am careful to not leave too many recordings around, since they're large enough that that could become a concern.

The record/Postgres terminal has output that looks like this:

[rr 1786705 1241867]2020-04-04 21:55:05.018 PDT 1786705 DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 63992/1/2
[rr 1786705 1241898]2020-04-04 21:55:05.019 PDT 1786705 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
[rr 1786705 1241902]2020-04-04 21:55:05.019 PDT 1786705 LOG: statement: CREATE TYPE test_type_empty AS ();
[rr 1786705 1241906]2020-04-04 21:55:05.020 PDT 1786705 DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 63993/1/1
[rr 1786705 1241936]2020-04-04 21:55:05.020 PDT 1786705 DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
[rr 1786705 1241940]2020-04-04 21:55:05.020 PDT 1786705 LOG: statement: DROP TYPE test_type_empty;
[rr 1786705 1241944]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: drop auto-cascades to composite type test_type_empty
[rr 1786705 1241948]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: drop auto-cascades to type test_type_empty[]
[rr 1786705 1241952]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: MultiXact: setting OldestMember[2] = 9
[rr 1786705 1241956]2020-04-04 21:55:05.021 PDT 1786705 DEBUG: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 63994/1/3

The part of each log line in square brackets comes from rr (since we used -M when recording) -- the first number is a PID, the second an event number. I usually don't care about the PIDs, though, since the event number alone unambiguously identifies a particular "event" in a particular backend (rr recordings are single threaded, even when there are multiple threads or processes).

Suppose I want to get to the "CREATE TYPE test_type_empty AS ();" query -- I can get to the end of the query by replaying the recording with this option:

$ rr replay -M -g 1241902

Replaying the recording like this takes me to the point where the Postgres backend prints the log message at the end of executing the query I mentioned -- I get a familiar gdb debug server (rr implements a gdb backend). This isn't precisely the point of execution that interests me, but it's close enough. I can easily set a breakpoint to the precise function I'm interested in, and then "reverse-continue" to get there by going backwards.

I can also find the point where a particular backend starts by using the fork option instead. So for the PID 1786705, that would look like:

$ rr replay -M -f 1786705

(Don't try to use the similar -p option, since that starts a debug server when the pid has been exec'd.)

rr really shines when debugging things like tap tests, where there is complex scaffolding that may run multiple Postgres servers. You can run an entire "rr record make check", without having to worry about how that scaffolding works. Once you have useful event numbers to work off of, it doesn't take too long to get an interactive debugging session in the backend of interest by applying the same techniques.

Note that saving the output of a recording using standard tools like
Re: WAL usage calculation patch
On Sat, Apr 4, 2020 at 2:50 PM Julien Rouhaud wrote:
> On Sat, Apr 04, 2020 at 02:39:32PM +0530, Amit Kapila wrote:
> > On Sat, Apr 4, 2020 at 2:24 PM Julien Rouhaud wrote:
> > >
> > > > We can add if we want but I am not able to convince myself for that.
> > > > Do you have any use case in mind? I think in most of the cases
> > > > (except for hint-bit WAL) it will be zero. If we are not sure of this
> > > > we can also discuss it separately in a new thread once this
> > > > patch-series is committed and see if anybody else sees the value of it
> > > > and if so adding the code should be easy.
> > >
> > > I'm mostly thinking of people trying to investigate possible slowdowns on a
> > > hot-standby replica with a primary without wal_log_hints. If they explicitly
> > > ask for WAL information, we should provide them, even if it's quite
> > > unlikely to happen.
> >
> > Yeah, possible but I am not completely sure. I would like to hear the
> > opinion of others if any before adding code for this. How about if we
> > first commit pg_stat_statements and wait for this till Monday and if
> > nobody responds we can commit the current patch but would start a new
> > thread and try to get the opinion of others?
>
> I'm fine with it.

I have pushed the pg_stat_statements and Explain related patches. I am now looking into the (auto)vacuum patch and have a few comments.

@@ -614,6 +616,9 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,

 	TimestampDifference(starttime, endtime, &secs, &usecs);

+	memset(&walusage, 0, sizeof(WalUsage));
+	WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
+
 	read_rate = 0;
 	write_rate = 0;
 	if ((secs > 0) || (usecs > 0))
@@ -666,7 +671,13 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			 (long long) VacuumPageDirty);
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 							 read_rate, write_rate);
-			appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
+			appendStringInfo(&buf, _("system usage: %s\n"), pg_rusage_show(&ru0));
+			appendStringInfo(&buf,
+							 _("WAL usage: %ld records, %ld full page writes, "
+							   UINT64_FORMAT " bytes"),
+							 walusage.wal_records,
+							 walusage.wal_num_fpw,
+							 walusage.wal_bytes);

Here, we are not displaying Buffers related data, so why do we think it is important to display WAL data? I see some point in displaying Buffers and WAL data in a vacuum (verbose), but I feel it is better to make a case for both the statistics together rather than just displaying one and leaving the other.

I think the other change related to autovacuum stats seems okay to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
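[Editorial note: for readers tracing the quoted hunk, walusage_start is a copy of pgWalUsage taken when the vacuum started, and the helper computes the delta into a zeroed struct. A standalone C sketch of those semantics, with field types simplified:]

typedef struct WalUsage
{
	long			wal_records;	/* # of WAL records produced */
	long			wal_num_fpw;	/* # of WAL full page writes produced */
	unsigned long	wal_bytes;		/* size of WAL produced, in bytes */
} WalUsage;

/* Accumulate (add - sub) into dst, mirroring the helper used in the patch. */
static void
wal_usage_accum_diff(WalUsage *dst, const WalUsage *add, const WalUsage *sub)
{
	dst->wal_records += add->wal_records - sub->wal_records;
	dst->wal_num_fpw += add->wal_num_fpw - sub->wal_num_fpw;
	dst->wal_bytes += add->wal_bytes - sub->wal_bytes;
}

With dst zeroed beforehand (the memset in the hunk), the result is simply the WAL generated during the vacuum: pgWalUsage minus the starting snapshot.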
2pc leaks fds
Hi,

Using 2PC with master very quickly leads to:

2020-04-05 19:42:18.368 PDT [2298126][5/2009:0] LOG: out of file descriptors: Too many open files; release and retry
2020-04-05 19:42:18.368 PDT [2298126][5/2009:0] STATEMENT: COMMIT PREPARED 'ptx_2';

This started with:

commit 0dc8ead46363fec6f621a12c7e1f889ba73b55a9 (HEAD -> master)
Author: Alvaro Herrera
Date:   2019-11-25 15:04:54 -0300

    Refactor WAL file-reading code into WALRead()

I found this while trying to benchmark the effect of my snapshot changes on 2pc. I just used the attached pgbench file.

andres@awork3:~/build/postgres/dev-assert/vpath$ pgbench -n -s 500 -c 4 -j 4 -T 10 -P1 -f ~/tmp/pgbench-write-2pc.sql
progress: 1.0 s, 3723.8 tps, lat 1.068 ms stddev 0.305
client 2 script 0 aborted in command 8 query 0: ERROR: could not seek to end of file "base/14036/16396": Too many open files
client 1 script 0 aborted in command 8 query 0: ERROR: could not seek to end of file "base/14036/16396": Too many open files
client 3 script 0 aborted in command 8 query 0: ERROR: could not seek to end of file "base/14036/16396": Too many open files
client 0 script 0 aborted in command 8 query 0: ERROR: could not seek to end of file "base/14036/16396": Too many open files
transaction type: /home/andres/tmp/pgbench-write-2pc.sql

I've not yet reviewed the change sufficiently to pinpoint the issue. It's a bit sad that nobody has hit this in the last few months :(.

Greetings,

Andres Freund

Attachment: pgbench-write-2pc.sql
Re: ALTER tbl rewrite loses CLUSTER ON index
On Fri, Apr 03, 2020 at 03:54:38PM +0900, Michael Paquier wrote:
> Now, would it be better to apply the refactoring patch for HEAD before
> feature freeze, or are people fine if this is committed a bit after?
> Patch 0002 is neither a new feature, nor an actual bug, and just some
> code cleanup, but I am a bit worried about applying that cleanup on
> HEAD just after the freeze.

I have worked more on this one this morning and just applied the bug fix down to 9.5, and the refactoring on HEAD. Thanks!

--
Michael
Re: Reinitialize stack base after fork (for the benefit of rr)?
On 2020-04-04 21:02:56 -0700, Peter Geoghegan wrote:
> On Fri, Mar 27, 2020 at 11:22 AM Andres Freund wrote:
> > I've found rr [1] very useful to debug issues in postgres. The ability
> > to hit a bug, and then e.g. identify a pointer with problematic
> > contents, set a watchpoint on its contents, and reverse-continue is
> > extremely powerful.
>
> I agree that rr is very useful. It would be great if we had a totally
> smooth workflow for debugging using rr.

I just pushed that.
Re: [HACKERS] WAL logging problem in 9.4.3?
At Sat, 4 Apr 2020 15:32:12 -0700, Noah Misch wrote in
> On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> > Shouldn't the CF entry get closed?
>
> Once the buildfarm is clean for a day, sure. The buildfarm has already
> revealed a missing perl2host call.

Thank you for (re-)committing this and the following fix. I hope this doesn't bring in another failure.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: nbtree: assertion failure in _bt_killitems() for posting tuple
On Tue, Mar 24, 2020 at 1:00 AM Anastasia Lubennikova wrote:
> > I think you're right. However, it still seems like we should check
> > that "kitem->indexOffset" is consistent among all of the BTScanPosItem
> > entries that we have for each TID that we believe to be from the same
> > posting list tuple.

The assertion failure happens in the logical replication worker because it uses a dirty snapshot, which cannot release the pin per commit 2ed5b87f. This means that the leaf page can change between the time that we observe an item is dead, and the time we reach _bt_killitems(), even though _bt_killitems() does get to kill items.

I am thinking about pushing a fix along the lines of the attached patch. This preserves the assertion, while avoiding the check in cases where it doesn't apply, such as when a dirty snapshot is in use.

--
Peter Geoghegan

Attachment: v1-0001-Fix-assertion-failure-in-_bt_killitems.patch
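[Editorial note: without the attached patch in hand, the general shape of such a guard can be sketched as below; the condition and function names here are illustrative assumptions, not the actual fix. The idea is that the offset-consistency assertion is only meaningful when the leaf page provably did not change between the scan and _bt_killitems().]

#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative sketch only: restrict the offset-consistency check for a
 * posting list's remembered TIDs to the cases where it is guaranteed to
 * hold.  With a dirty snapshot the page may have changed in the meantime,
 * so the check must be skipped rather than asserted.
 */
static void
check_posting_offsets(bool page_provably_unchanged,
					  int first_index_offset, int member_index_offset)
{
	if (page_provably_unchanged)
		assert(first_index_offset == member_index_offset);
}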
Re: CLOBBER_CACHE_ALWAYS regression instability
Hi,

On 2020-04-05 19:54:19 -0400, Alvaro Herrera wrote:
> Isn't it the case that you can create an inner block with a constant
> whose size is determined by a containing block's variable? I mean as in
> the attached, which refuses to compile because of our -Werror=vla -- but
> if I remove it, it compiles fine and works in my system.

IIRC msvc doesn't support VLAs. And there's generally a slow push towards deprecating them (they've e.g. been moved to optional in C11).

They don't tend to make a lot of sense for sizes that aren't tightly bound. In contrast to palloc etc, there's no good way to catch allocation errors. Most of the time you'll just get a SIGBUS or such, but sometimes you'll just end up overwriting data (if the allocation is large enough to not touch the guard pages).

Both alloca/VLAs also add some per-call overhead. Allocating the common size on-stack, and the uncommon ones on heap, should be cheaper, and handles the case of large allocations much better.

Greetings,

Andres Freund
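[Editorial note: the pattern Andres describes, as a self-contained C sketch: a fixed on-stack buffer for the common small case, with a heap fallback for rare large cases. PostgreSQL code would use palloc/pfree; malloc/free keep the example standalone. The threshold and element type are arbitrary.]

#include <stdlib.h>
#include <string.h>

#define COMMON_NARGS 8

static void
process_args(const int *args, int nargs)
{
	int			stack_buf[COMMON_NARGS];
	int		   *buf = stack_buf;

	if (nargs > COMMON_NARGS)
	{
		/* uncommon: fall back to the heap instead of a VLA/alloca */
		buf = malloc(nargs * sizeof(int));
		if (buf == NULL)
			abort();			/* palloc would ereport(ERROR) instead */
	}

	memcpy(buf, args, nargs * sizeof(int));
	/* ... work with buf[0 .. nargs-1] ... */

	if (buf != stack_buf)
		free(buf);
}

Unlike a VLA, the failure mode for a huge nargs is a catchable allocation error rather than silent stack corruption, and the common case still pays no heap-allocation cost.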
archive recovery fetching wrong segments
Hello, hackers!

I'm investigating a complaint from our clients about archive recovery speed being very slow, and I've noticed a really strange and, I think, very dangerous recovery behavior. When running multi-timeline archive recovery, for every requested segno the startup process iterates through every timeline in the restore target timeline history, starting from the highest timeline and ending in the current one, and tries to fetch the segno in question from this timeline.

Consider the following example. Timelines:

ARCHIVE INSTANCE 'node'

 TLI  Parent TLI  Switchpoint  Min Segno         Max Segno         N segments  Size   Zratio  N backups  Status
 3    2           0/AEFFEDE0   000300AE  000300D5  40          41MB   15.47   0          OK
 2    1           0/A08768D0   000200A0  000200AE  15          14MB   17.24   0          OK
 1    0           0/0          00010001  000100BB  187         159MB  18.77   1          OK

Backup:

 Instance  Version  ID      Recovery Time           Mode  WAL Mode  TLI  Time  Data  WAL   Zratio  Start LSN  Stop LSN  Status
 node      11       Q8C8IH  2020-04-06 02:13:31+03  FULL  ARCHIVE   1/0  3s    23MB  16MB  1.00    0/228      0/3B8     OK

So when we are trying to restore this backup, located on Timeline 1, to the restore target on Timeline 3, we get this in the PostgreSQL log:

2020-04-05 23:24:36 GMT [28508]: [5-1] LOG: restored log file "0003.history" from archive
INFO: PID [28511]: pg_probackup archive-get WAL file: 00030002, remote: none, threads: 1/1, batch: 20
ERROR: PID [28511]: pg_probackup archive-get failed to deliver WAL file 00030002, prefetched: 0/20, time elapsed: 0ms
INFO: PID [28512]: pg_probackup archive-get WAL file: 00020002, remote: none, threads: 1/1, batch: 20
ERROR: PID [28512]: pg_probackup archive-get failed to deliver WAL file 00020002, prefetched: 0/20, time elapsed: 0ms
INFO: PID [28513]: pg_probackup archive-get WAL file: 00010002, remote: none, threads: 1/1, batch: 20
INFO: PID [28513]: pg_probackup archive-get copied WAL file 00010002
2020-04-05 23:24:36 GMT [28508]: [6-1] LOG: restored log file "00010002" from archive
...

Before requesting 00010002, recovery tries to fetch 00030002 and 00020002, and that goes for every segment restored from the archive. This tremendously slows down recovery speed, especially if the archive is located on a remote machine with a high-latency network. It also may lead to feeding recovery the wrong WAL segment, located on the next timeline. Is there a reason behind this behavior?

Also I've attached a patch which fixed this issue for me, but I'm not sure that the chosen approach is sound and didn't break something.

--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6cd4fde2b..b4ddb246c2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3714,12 +3714,30 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 	foreach(cell, tles)
 	{
 		TimeLineID	tli = ((TimeLineHistoryEntry *) lfirst(cell))->tli;
+		XLogRecPtr	switchpoint = ((TimeLineHistoryEntry *) lfirst(cell))->begin;

 		if (tli < curFileTLI)
 			break;				/* don't bother looking at too-old TLIs */

 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
 		{
+			/* If current timeline is from the future ... */
+			if (tli > curFileTLI && switchpoint != InvalidXLogRecPtr)
+			{
+				XLogSegNo	switchSegno;
+
+				XLByteToSeg(switchpoint, switchSegno, wal_segment_size);
+
+				/* ... and requested segno is not the segment with switchpoint, then
+				 * skip current timeline */
+				if (segno < switchSegno)
+				{
+					elog(WARNING, "Skipping timeline: %u, switch segno: %u, current segno: %u", tli, switchSegno, segno);
+					continue;
+				}
+			}
+
+
 			fd = XLogFileRead(segno, emode, tli, XLOG_FROM_ARCHIVE, true);
 			if (fd != -1)
Re: CLOBBER_CACHE_ALWAYS regression instability
On 2020-Apr-05, Tom Lane wrote:
> What I wish we had was alloca(), so you don't need a FUNC_MAX_ARGS-sized
> array to parse a two-argument function call. Too bad C99 didn't add
> that. (But some sniffing around suggests that an awful lot of systems
> have it anyway ... even MSVC. Hmmm.)

Isn't it the case that you can create an inner block with a constant whose size is determined by a containing block's variable? I mean as in the attached, which refuses to compile because of our -Werror=vla -- but if I remove it, it compiles fine and works in my system.

--
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9c3b6ad916..415ea17c68 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -103,7 +103,6 @@ ParseFuncOrColumn(ParseState *pstate, List *funcname, List *fargs,
 	Node	   *first_arg = NULL;
 	int			nargs;
 	int			nargsplusdefs;
-	Oid			actual_arg_types[FUNC_MAX_ARGS];
 	Oid		   *declared_arg_types;
 	List	   *argnames;
 	List	   *argdefaults;
@@ -115,6 +114,9 @@ ParseFuncOrColumn(ParseState *pstate, List *funcname, List *fargs,
 	char		aggkind = 0;
 	ParseCallbackState pcbstate;

+	{
+	Oid			actual_arg_types[list_length(fargs)];
+
 	/*
 	 * If there's an aggregate filter, transform it using transformWhereClause
 	 */
@@ -888,6 +890,7 @@ ParseFuncOrColumn(ParseState *pstate, List *funcname, List *fargs,
 	/* if it returns a set, remember it for error checks at higher levels */
 	if (retset)
 		pstate->p_last_srf = retval;
+	}

 	return retval;
 }
Re: Make MemoryContextMemAllocated() more precise
Hi,

On 2020-03-27 17:21:10 -0700, Jeff Davis wrote:
> Attached refactoring patch. There's enough in here that warrants
> discussion that I don't think this makes sense for v13 and I'm adding
> it to the July commitfest.

IDK, adding a commit to v13 that we know we should do architecturally differently in v14, when the difference in complexity between the two patches isn't actually *that* big... I'd like to see others jump in here...

> I still think we should do something for v13, such as the originally-
> proposed patch[1]. It's not critical, but it simply reports a better
> number for memory consumption. Currently, the memory usage appears to
> jump, often right past work mem (by a reasonable but noticable amount),
> which could be confusing.

Is that really a significant issue for most work mem sizes? Shouldn't the way we increase sizes lead to the max difference between the measurements to be somewhat limited?

> * there's a new MemoryContextCount() that simply calculates the
>   statistics without printing anything, and returns a struct
>   - it supports flags to indicate which stats should be
>     calculated, so that some callers can avoid walking through
>     blocks/freelists
> * it adds a new statistic for "new space" (i.e. untouched)
> * it eliminates specialization of the memory context printing
>   - the only specialization was for generation.c to output the
>     number of chunks, which can be done easily enough for the
>     other types, too

That sounds like a good direction.

> +	if (flags & MCXT_STAT_NBLOCKS)
> +		counters.nblocks = nblocks;
> +	if (flags & MCXT_STAT_NCHUNKS)
> +		counters.nchunks = set->nChunks;
> +	if (flags & MCXT_STAT_FREECHUNKS)
> +		counters.freechunks = freechunks;
> +	if (flags & MCXT_STAT_TOTALSPACE)
> +		counters.totalspace = set->memAllocated;
> +	if (flags & MCXT_STAT_FREESPACE)
> +		counters.freespace = freespace;
> +	if (flags & MCXT_STAT_NEWSPACE)
> +		counters.newspace = set->blocks->endptr - set->blocks->freeptr;

I'd spec it so that context implementations are allowed to unconditionally fill fields, even when the flag isn't specified. The branches quoted don't buy us anything...

> diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
> index c9f2bbcb367..cc545852968 100644
> --- a/src/include/nodes/memnodes.h
> +++ b/src/include/nodes/memnodes.h
> @@ -29,11 +29,21 @@
>  typedef struct MemoryContextCounters
>  {
>  	Size		nblocks;		/* Total number of malloc blocks */
> +	Size		nchunks;		/* Total number of chunks (used+free) */
>  	Size		freechunks;		/* Total number of free chunks */
>  	Size		totalspace;		/* Total bytes requested from malloc */
>  	Size		freespace;		/* The unused portion of totalspace */
> +	Size		newspace;		/* Allocated but never held any chunks */

I'd add some reasoning as to why this is useful.

>  } MemoryContextCounters;
>
> +#define MCXT_STAT_NBLOCKS		(1 << 0)
> +#define MCXT_STAT_NCHUNKS		(1 << 1)
> +#define MCXT_STAT_FREECHUNKS	(1 << 2)
> +#define MCXT_STAT_TOTALSPACE	(1 << 3)
> +#define MCXT_STAT_FREESPACE		(1 << 4)
> +#define MCXT_STAT_NEWSPACE		(1 << 5)

s/MCXT_STAT/MCXT_STAT_NEED/?

> +#define MCXT_STAT_ALL			((1 << 6) - 1)

Hm, why not go for ~0 or such?

Greetings,

Andres Freund
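[Editorial note: a hypothetical caller of the proposed API, to show how the flags would be used. The exact signature of MemoryContextCount() is a guess from the discussion, not part of the quoted patch.]

/*
 * Request only the stats we need, letting the implementation skip
 * walking blocks/freelists for the rest (the optimization the flags
 * are meant to enable).
 */
MemoryContextCounters counters;

counters = MemoryContextCount(CurrentMemoryContext,
							  MCXT_STAT_TOTALSPACE | MCXT_STAT_FREESPACE);

elog(DEBUG1, "context uses %zu bytes, %zu free",
	 counters.totalspace, counters.freespace);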
WAL page magic errors (and plenty others) got hard to debug.
Hi,

When starting on a data directory with an older WAL page magic, we currently make that hard to debug. E.g.:

2020-04-05 15:31:04.314 PDT [1896669][:0] LOG: database system was shut down at 2020-04-05 15:24:56 PDT
2020-04-05 15:31:04.314 PDT [1896669][:0] LOG: invalid primary checkpoint record
2020-04-05 15:31:04.314 PDT [1896669][:0] PANIC: could not locate a valid checkpoint record
2020-04-05 15:31:04.315 PDT [1896668][:0] LOG: startup process (PID 1896669) was terminated by signal 6: Aborted
2020-04-05 15:31:04.315 PDT [1896668][:0] LOG: aborting startup due to startup process failure
2020-04-05 15:31:04.316 PDT [1896668][:0] LOG: database system is shut down

As far as I can tell this is not just the case for a wrong page magic, but for all page level validation errors.

I think this largely originates in:

commit 0668719801838aa6a8bda330ff9b3d20097ea844
Author: Heikki Linnakangas
Date:   2018-05-05 01:34:53 +0300

    Fix scenario where streaming standby gets stuck at a continuation record.

    If a continuation record is split so that its first half has already been
    removed from the master, and is only present in pg_wal, and there is a
    recycled WAL segment in the standby server that looks like it would
    contain the second half, recovery would get stuck. The code in
    XLogPageRead() incorrectly started streaming at the beginning of the
    WAL record, even if we had already read the first page.

    Backpatch to 9.4. In principle, older versions have the same problem,
    but without replication slots, there was no straightforward mechanism
    to prevent the master from recycling old WAL that was still needed by
    standby. Without such a mechanism, I think it's reasonable to assume
    that there's enough slack in how many old segments are kept around to
    not run into this, or you have a WAL archive.

    Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI,
    with some extra comments by me.

    Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com

which added the following to XLogPageRead():

+	/*
+	 * Check the page header immediately, so that we can retry immediately if
+	 * it's not valid. This may seem unnecessary, because XLogReadRecord()
+	 * validates the page header anyway, and would propagate the failure up to
+	 * ReadRecord(), which would retry. However, there's a corner case with
+	 * continuation records, if a record is split across two pages such that
+	 * we would need to read the two pages from different sources. For
+	 * example, imagine a scenario where a streaming replica is started up,
+	 * and replay reaches a record that's split across two WAL segments. The
+	 * first page is only available locally, in pg_wal, because it's already
+	 * been recycled in the master. The second page, however, is not present
+	 * in pg_wal, and we should stream it from the master. There is a recycled
+	 * WAL segment present in pg_wal, with garbage contents, however. We would
+	 * read the first page from the local WAL segment, but when reading the
+	 * second page, we would read the bogus, recycled, WAL segment. If we
+	 * didn't catch that case here, we would never recover, because
+	 * ReadRecord() would retry reading the whole record from the beginning.
+	 *
+	 * Of course, this only catches errors in the page header, which is what
+	 * happens in the case of a recycled WAL segment. Other kinds of errors or
+	 * corruption still has the same problem. But this at least fixes the
+	 * common case, which can happen as part of normal operation.
+	 *
+	 * Validating the page header is cheap enough that doing it twice
+	 * shouldn't be a big deal from a performance point of view.
+	 */
+	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+	{
+		/* reset any error XLogReaderValidatePageHeader() might have set */
+		xlogreader->errormsg_buf[0] = '\0';
+		goto next_record_is_invalid;
+	}
+

I really can't follow the logic of just intentionally and silently throwing the error message away here. Isn't this basically hiding *all* page level error messages? And even in the scenarios where this were the right thing, I feel like not even outputting a debugging message makes debugging situations in which this is encountered unnecessarily hard.

Greetings,

Andres Freund
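[Editorial note: one minimal direction matching that complaint would be to surface the discarded message at DEBUG level before retrying. This is a sketch against the quoted hunk, not committed code:]

	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
	{
		/*
		 * Sketch: report why the header failed validation before discarding
		 * the message, so the retry loop is no longer silent.
		 */
		if (xlogreader->errormsg_buf[0] != '\0')
			elog(DEBUG1, "page header validation failed, retrying: %s",
				 xlogreader->errormsg_buf);

		/* reset any error XLogReaderValidatePageHeader() might have set */
		xlogreader->errormsg_buf[0] = '\0';
		goto next_record_is_invalid;
	}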
Re: d25ea01275 and partitionwise join
Amit Langote writes:
> Okay, I tried that in the updated patch. I didn't try to come up with
> examples that might break it though.

I looked through this.

* Wasn't excited about inventing makeCoalesceExpr(); the fact that it only had two potential call sites seemed to make it not worth the trouble. Plus, as defined, it could not handle the general case of COALESCE, which can have N arguments; so that seemed like a recipe for confusion.

* I think your logic for building the coalesce combinations was just wrong. We need combinations of left-hand inputs with right-hand inputs, not left-hand with left-hand or right-hand with right-hand. Also the nesting should already have happened in the inputs, we don't need to try to generate it here. The looping code was pretty messy as well.

* I don't think using partopcintype is necessarily right; that could be a polymorphic type, for instance. Safer to copy the type of the input expressions. Likely we could have got away with using partcollation, but for consistency I copied that as well.

* You really need to update the data structure definitional comments when you make a change like this.

* I did not like putting a test case that requires enable_partitionwise_aggregate in the partition_join test; that seems misplaced. But it's not necessary to the test, is it?

* I did not follow the point of your second test case. The WHERE constraint on p1.a allows the planner to strength-reduce the joins, which is why there's no full join in that explain result, but then we aren't going to get to this code at all.

I repaired the above in the attached. I'm actually sort of pleasantly surprised that this worked; I was not sure that building COALESCEs like this would provide the result we needed. But it seems okay -- it does fix the behavior in this 3-way test case, as well as the 4-way join you showed at the top of the thread.

It's fairly dependent on the fact that the planner won't rearrange full joins; otherwise, the COALESCE structures we build here might not match those made at parse time. But that's not likely to change anytime soon; and this is hardly the only place that would break, so I'm not sweating about it. (I have some vague ideas about getting rid of the COALESCEs as part of the Var redefinition I've been muttering about, anyway; there might be a cleaner fix for this if that happens.)

Anyway, I think this is probably OK for now. Given that the nullable_partexprs lists are only used in one place, it's pretty hard to see how it would break anything.

			regards, tom lane

diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index af1fb48..e1cc11c 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -17,6 +17,7 @@
 #include

 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/appendinfo.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
@@ -1890,7 +1891,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel,
 								RelOptInfo *outer_rel, RelOptInfo *inner_rel,
 								JoinType jointype)
 {
-	int			partnatts = joinrel->part_scheme->partnatts;
+	PartitionScheme part_scheme = joinrel->part_scheme;
+	int			partnatts = part_scheme->partnatts;

 	joinrel->partexprs = (List **) palloc0(sizeof(List *) * partnatts);
 	joinrel->nullable_partexprs =
@@ -1899,7 +1901,8 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel,
 	/*
 	 * The joinrel's partition expressions are the same as those of the input
 	 * rels, but we must properly classify them as nullable or not in the
-	 * joinrel's output.
+	 * joinrel's output.  (Also, we add some more partition expressions if
+	 * it's a FULL JOIN.)
 	 */
 	for (int cnt = 0; cnt < partnatts; cnt++)
 	{
@@ -1910,6 +1913,7 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel,
 		const List *inner_null_expr = inner_rel->nullable_partexprs[cnt];
 		List	   *partexpr = NIL;
 		List	   *nullable_partexpr = NIL;
+		ListCell   *lc;

 		switch (jointype)
 		{
@@ -1969,6 +1973,31 @@ set_joinrel_partition_key_exprs(RelOptInfo *joinrel,
 												outer_null_expr);
 				nullable_partexpr = list_concat(nullable_partexpr,
 												inner_null_expr);
+
+				/*
+				 * Also add CoalesceExprs corresponding to each possible
+				 * full-join output variable (that is, left side coalesced to
+				 * right side), so that we can match equijoin expressions
+				 * using those variables.  We assume no type coercions are
+				 * needed to make the join outputs.
+				 */
+				foreach(lc, list_concat_copy(outer_expr, outer_null_expr))
+				{
+					Node	   *larg = (Node *) lfirst(lc);
+					ListCell   *lc2;
+
+					foreach(lc2, list_concat_copy(inner_expr, inner_null_expr))
+					{
+						Node	   *rarg = (Node *) lfirst(lc2);
+						CoalesceExpr *c = makeNode(CoalesceExpr);
+
+						c->coalescetype = exprType(larg);
+						c->coalescecollid = exprCollation(larg);
+						c->args = list_make2(larg, rarg);
Re: range_agg
v17 is a rebase fixing a minor parse_coerce.c edit; v16 lasted little :-(

I chose to change the wording of the conflicting comment in enforce_generic_type_consistency():

 * 3) Similarly, if return type is ANYRANGE or ANYMULTIRANGE, and any
 *    argument is ANYRANGE or ANYMULTIRANGE, use that argument's
 *    actual type, range type or multirange type as the function's
 *    return type.

This wording is less precise, in that it doesn't say exactly which of the three types is the actual result for each of the possible four cases (r->r, r->m, m->m, m->r), but I think it should be straightforward.

--
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Add A Glossary
On 2020-Apr-05, Fabien COELHO wrote:
> > > As the definitions are short and to the point, maybe the HTML display
> > > could (also) "hover" the definitions when the mouse passes over the word,
> > > using the "title" attribute?
> >
> > I like that idea, if it doesn't conflict with accessibility standards
> > (maybe that's just titles on images, not sure).
>
> The following worked fine:
>
>   Title Tag Test
>   The ACID property is great.

I don't see myself patching the stylesheet as would be needed to do this.

> > I suggest we pursue this idea in another thread, as we'd probably want to
> > do it for acronyms as well.
>
> Or not. I'd test committer temperature before investing time because it
> would mean that backpatching the doc would be a little harder.

TBH I can't get very excited about this idea. Maybe other documentation champions would be happier about doing that.

--
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: CLOBBER_CACHE_ALWAYS regression instability
On 2020-04-05 15:38:29 -0400, Tom Lane wrote:
> Andres Freund writes:
> > We could print the error using :LAST_ERROR_MESSAGE after removing a
> > potential trailing "at character ..." if we're worried about the loss of
> > specificity.
>
> Oh, actually it seems that :LAST_ERROR_MESSAGE is already just the
> primary message, without any "at character N" addon, so this would be
> a very easy way to ameliorate that complaint. ("at character N" is
> added by libpq's pqBuildErrorMessage3 in TERSE mode, but psql does
> not use that when filling LAST_ERROR_MESSAGE.)

Heh. I thought it worked differently because I had just typed some gibberish and got an error in :LAST_ERROR_MESSAGE that ended in "at or near ...". But that's scan.l adding that explicitly...
Re: VACUUM memory management
On Wed, Dec 11, 2019 at 09:29:17PM +0500, Ibrar Ahmed wrote:
> > Did you modify Claudio's patch or write a totally new one?
>
> I wrote a completely new patch. I tried multiple techniques, like using a
> list instead of a fixed size array, which I thought was most suitable here,
> but left that because of a conflict with Parallel Vacuum.

Using a list will hardly work, or certainly not well, since it needs to be searched by the ambulkdelete callback.

> > > If you wrote a totally new one, have you compared your work with
> > > Claudio's, to see if he covered anything you might need to cover?
>
> I checked the patch, and it does not do anything special which my patch is
> not doing except one thing. The patch is claiming to increase the limit of
> 1GB along with that, but I have not touched that. In my case, we are still
> under the limit of maintenance_work_mem but allocate memory in chunks. In
> that case, you have the leverage to set a big value of maintenance_work_mem
> (even if you don't need that) because it will not allocate all the memory
> at the start.

After spending a bunch of time comparing them, I disagree. Claudio's patch does these:

- avoid using multiple chunks if there are no indexes (and therefore no costly index-scan lookups to worry about);
- rather than doing an index scan for each chunk (bad), the callback function lazy_tid_reaped() does a custom binary search *over* chunks of different sizes and then *within* each chunk. That's maybe slightly over-engineered, and I'm not convinced it's needed (but I thought it was pretty clever); someone thought that was important.
- properly keep track of the *total* number of dead tuples, e.g. for progress reporting, and for prev_dead_count for pages with no dead tuples;
- lazy_record_dead_tuple() doubles the allocation when running out of space for dead tuples; some people disagree with that (myself included) but I'm including it here since that's what it does.

This still seems nontrivial (to me) to adapt to work with parallel query.

--
Justin
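[Editorial note: a self-contained C sketch of the two-level lookup described above: first binary-search the chunk directory by each chunk's first TID, then binary-search within the chosen chunk. ItemPointer comparison is reduced to plain integers here to keep the example standalone.]

#include <stdbool.h>
#include <stdlib.h>

typedef struct DeadTuplesChunk
{
	long	   *tids;		/* sorted dead-tuple "TIDs" (simplified to long) */
	int			ntids;		/* chunks may have different sizes */
} DeadTuplesChunk;

static int
cmp_long(const void *a, const void *b)
{
	long		la = *(const long *) a;
	long		lb = *(const long *) b;

	return (la > lb) - (la < lb);
}

static bool
lazy_tid_reaped_sketch(const DeadTuplesChunk *chunks, int nchunks, long tid)
{
	int			lo = 0,
				hi = nchunks - 1,
				target = -1;

	/* Outer search: last chunk whose first TID is <= tid. */
	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;

		if (chunks[mid].ntids > 0 && chunks[mid].tids[0] <= tid)
		{
			target = mid;
			lo = mid + 1;
		}
		else
			hi = mid - 1;
	}
	if (target < 0)
		return false;

	/* Inner search within the chosen chunk. */
	return bsearch(&tid, chunks[target].tids, chunks[target].ntids,
				   sizeof(long), cmp_long) != NULL;
}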
Re: backup manifests and contemporaneous buildfarm failures
Andrew Dunstan writes:
> Hmm, the buildfarm client does this at the beginning of each run to
> remove anything that might be left over from a previous run:
>
>   rmtree("inst");
>   rmtree("$pgsql") unless ($from_source && !$use_vpath);

Right, the point is precisely that some versions of rmtree() fail to remove a mode-0 subdirectory.

> Do I need to precede those with some recursive chmod commands? Perhaps
> the client should refuse to run if there is still something left after
> these.

I think the latter would be a very good idea, just so that this sort of failure is less obscure. Not sure about whether a recursive chmod is really going to be worth the cycles. (On the other hand, the normal case should be that there's nothing there anyway, so maybe it's not going to be costly.)

			regards, tom lane
Re: backup manifests and contemporaneous buildfarm failures
On 4/5/20 9:10 AM, Mikael Kjellström wrote:
> On 2020-04-04 04:43, Robert Haas wrote:
> > I think I've done about as much as I can do for tonight, though. Most
> > things are green now, and the ones that aren't are failing because of
> > stuff that is at least plausibly fixed. By morning it should be
> > clearer how much broken stuff is left, although that will be somewhat
> > complicated by at least sidewinder and seawasp needing manual
> > intervention to get back on track.
>
> I fixed sidewinder I think. Should clear up the next time it runs.
>
> It was the mode on the directory it couldn't handle. A regular rm -rf
> didn't work; I had to do a chmod -R 700 on all directories to be able
> to manually remove it.

Hmm, the buildfarm client does this at the beginning of each run to remove anything that might be left over from a previous run:

  rmtree("inst");
  rmtree("$pgsql") unless ($from_source && !$use_vpath);

Do I need to precede those with some recursive chmod commands? Perhaps the client should refuse to run if there is still something left after these.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Thoughts on "killed tuples" index hint bits support on standby
Hi Michail,

On Fri, Jan 24, 2020 at 6:16 AM Michail Nikolaev wrote:
> Some of the issues you mentioned (reporting feedback to another
> cascade standby, processing queries after restart and a newer xid
> already reported) could be fixed in the provided design, but your
> intention to have an "independent correctness backstop" is the right
> thing to do.

Attached is a very rough POC patch of my own, which makes item deletion occur "non-opportunistically" in unique indexes. The idea is that we exploit the uniqueness property of unique indexes to identify "version churn" from non-HOT updates. If any single value on a leaf page has several duplicates, then there is a good chance that we can safely delete some of them. It's worth going to the heap to check whether that's safe selectively, at the point where we'd usually have to split the page. We only have to free one or two items to avoid splitting the page. If we can avoid splitting the page immediately, we may well avoid splitting it indefinitely, or forever.

This approach seems to be super effective. It can leave the PK on pgbench_accounts in pristine condition (no page splits) after many hours with a pgbench-like workload that makes all updates non-HOT updates. Even with many clients, and a skewed distribution. Provided the index isn't tiny to begin with, we can always keep up with controlling index bloat -- once the client backends themselves begin to take active responsibility for garbage collection, rather than just treating it as a nice-to-have. I'm pretty sure that I'm going to be spending a lot of time developing this approach, because it really works.

This seems fairly relevant to what you're doing. It makes almost all index cleanup occur using the new delete infrastructure for some of the most interesting workloads where deletion takes place in client backends. In practice, a standby will almost be in the same position as the primary in a workload that this approach really helps with, since setting the LP_DEAD bit itself doesn't really need to happen (we can go straight to deleting the items in the new deletion path).

To address the questions you've asked: I don't really like the idea of introducing new rules around tuple visibility and WAL logging to set more LP_DEAD bits like this at all. It seems very complicated. I suspect that we'd be better off introducing ways of making the actual deletes occur sooner on the primary, possibly much sooner, avoiding any need for special infrastructure on the standby. This is probably not limited to the special unique index case that my patch focuses on -- we can probably push this general approach forward in a number of different ways. I just started with unique indexes because that seemed most promising. I have only worked on the project for a few days. I don't really know how it will evolve.

--
Peter Geoghegan

Attachment: 0001-Non-opportunistically-delete-B-Tree-items.patch
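[Editorial note: a very rough C sketch of the control flow Peter describes, with all nbtree details reduced to hypothetical stubs. The stub names are illustrative assumptions, not functions from the POC patch.]

#include <stdbool.h>

struct LeafPage;

extern bool page_has_duplicates(struct LeafPage *page);	/* stub */
extern int	delete_dead_duplicates(struct LeafPage *page);	/* stub: visits the heap */
extern bool insert_fits(struct LeafPage *page);			/* stub */
extern void split_page(struct LeafPage *page);			/* stub */

static void
insert_would_split(struct LeafPage *page)
{
	/*
	 * In a unique index, duplicates of a single value are almost certainly
	 * dead versions left behind by non-HOT updates, so the selective heap
	 * check is usually worth the trouble; freeing even one or two items
	 * avoids the split.
	 */
	if (page_has_duplicates(page) &&
		delete_dead_duplicates(page) > 0 &&
		insert_fits(page))
		return;					/* split avoided, perhaps indefinitely */

	split_page(page);
}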
Re: CLOBBER_CACHE_ALWAYS regression instability
Andres Freund writes: > On 2020-04-05 15:04:30 -0400, Tom Lane wrote: >> That would only reduce the chance of getting a stack overflow there, >> and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal >> which is going to be doing catalog accesses inside there too. > It'd certainly not be bullet proof. But I don't think we ever were? If I > understood you correctly we were just not noticing the stack overflow > danger before? We did catalog accesses from within there before too, > that's not changed by the addition of equal(), no? Ah, you're right, the CCA aspect is not such a problem as long as there are not check_stack_depth() calls inside the code that's run to load a syscache or relcache entry. Which there probably aren't, at least not for system catalogs. The reason we're seeing this on a CCA animal is simply that a cache flush has occurred to force recomputeNamespacePath to do some work. (In theory it could happen on a non-CCA animal, given unlucky timing of an sinval overrun.) My point here though is that it's basically been blind luck that we've not seen this before. There's certainly no good reason to assume that a check_stack_depth() call shouldn't happen while parsing a function call, or within some other chunk of the parser that happens to set up a transient error-position callback. And it's only going to get more likely in future, seeing for example my ambitions to extend the executor so that run-time expression failures can also report error cursors. So I think that we should be looking for a permanent fix, not a reduce-the-odds band-aid. >> Seems like that adds a lot of potential for memory leakage? > Depends on the case, I'd say. Certainly might be useful to add a helper > for a corresponding conditional free. > For parsing cases like this it could be better to bulk free at the > end. Compared to the memory needed for all the transformed arguments etc > it'd probably not matter in the short term (especially if only done for > 4+ args). What I wish we had was alloca(), so you don't need a FUNC_MAX_ARGS-sized array to parse a two-argument function call. Too bad C99 didn't add that. (But some sniffing around suggests that an awful lot of systems have it anyway ... even MSVC. Hmmm.) regards, tom lane
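To put numbers on the stack-consumption point, here is a toy illustration (not PostgreSQL code; FUNC_MAX_ARGS defaults to 100) of the difference between the fixed-size pattern and the alloca() pattern:

	#include <alloca.h>
	#include <stdio.h>

	#define FUNC_MAX_ARGS 100		/* PostgreSQL's default */

	/* The fixed-size pattern reserves worst-case stack every time: */
	static void
	parse_call_fixed(int nargs)
	{
		unsigned int actual_arg_types[FUNC_MAX_ARGS];	/* 400 bytes, always */

		printf("fixed: %zu bytes for %d args\n",
			   sizeof(actual_arg_types), nargs);
	}

	/* With alloca(), the array is sized to the actual argument count: */
	static void
	parse_call_alloca(int nargs)
	{
		unsigned int *actual_arg_types = alloca(nargs * sizeof(unsigned int));

		printf("alloca: %zu bytes for %d args\n",
			   nargs * sizeof(unsigned int), nargs);
		(void) actual_arg_types;
	}

	int
	main(void)
	{
		parse_call_fixed(2);		/* 400 bytes of stack */
		parse_call_alloca(2);		/* 8 bytes of stack */
		return 0;
	}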
Re: CLOBBER_CACHE_ALWAYS regression instability
Andres Freund writes: > On 2020-04-05 14:33:19 -0400, Tom Lane wrote: >> 1. Change the test to do "\set VERBOSITY sqlstate" so that all that >> is printed is >> ERROR: 54001 >> ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that >> this wouldn't be too much of a loss of specificity. (Or we could >> give stack overflow its very own ERRCODE.) > We could print the error using :LAST_ERROR_MESSAGE after removing a > potential trailing "at character ..." if we're worried about the loss of > specificity. Oh, actually it seems that :LAST_ERROR_MESSAGE is already just the primary message, without any "at character N" addon, so this would be a very easy way to ameliorate that complaint. ("at character N" is added by libpq's pqBuildErrorMessage3 in TERSE mode, but psql does not use that when filling LAST_ERROR_MESSAGE.) regards, tom lane
Re: backup manifests
Hi, On 2020-04-03 15:22:23 -0400, Robert Haas wrote: > I've pushed all the patches. Seeing new warnings in an optimized build /home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c: In function 'json_manifest_object_end': /home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:591:2: warning: 'end_lsn' may be used uninitialized in this function [-Wmaybe-uninitialized] 591 | context->perwalrange_cb(context, tli, start_lsn, end_lsn); | ^ /home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:567:5: note: 'end_lsn' was declared here 567 | end_lsn; | ^~~ /home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:591:2: warning: 'start_lsn' may be used uninitialized in this function [-Wmaybe-uninitialized] 591 | context->perwalrange_cb(context, tli, start_lsn, end_lsn); | ^ /home/andres/src/postgresql-master/src/bin/pg_validatebackup/parse_manifest.c:566:13: note: 'start_lsn' was declared here 566 | XLogRecPtr start_lsn, | ^ The warnings don't seem too unreasonable. The compiler can't see that the error_cb inside json_manifest_parse_failure() is not expected to return. Probably worth adding a wrapper around the calls to context->error_cb and mark that as noreturn. - Andres
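The suggested wrapper would look something like this. This is a sketch against the parse_manifest.c structure described above, assuming the callback takes a plain message string; it is not the committed fix.

	/*
	 * Hedged sketch of the suggested wrapper, not committed code.
	 * Marking it pg_attribute_noreturn() lets the compiler see that
	 * json_manifest_object_end() never reads start_lsn/end_lsn after a
	 * parse failure.
	 */
	static void json_manifest_parse_failure(JsonManifestParseContext *context,
											char *msg) pg_attribute_noreturn();

	static void
	json_manifest_parse_failure(JsonManifestParseContext *context, char *msg)
	{
		context->error_cb(context, msg);

		/* error_cb is required not to return; make that explicit */
		pg_unreachable();
	}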
Re: CLOBBER_CACHE_ALWAYS regression instability
Hi, On 2020-04-05 15:04:30 -0400, Tom Lane wrote: > Andres Freund writes: > > Another avenue could be to make ParseFuncOrColumn et al use less stack, > > and hope that it avoids the problem. It's a bit insane that we use this > > much. > > That would only reduce the chance of getting a stack overflow there, > and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal > which is going to be doing catalog accesses inside there too. It'd certainly not be bullet proof. But I don't think we ever were? If I understood you correctly we were just not noticing the stack overflow danger before? We did catalog accesses from within there before too, that's not changed by the addition of equal(), no? Reminds me: I'll try to dust off my patch to make cache invalidation processing non-recursive for 14 (I wrote an initial version as part of a bugfix that we ended up fixing differently). Besides making CLOBBER_CACHE_ALWAYS vastly less expensive, it also reduces the cost of logical decoding substantially. > > We don't have to go there in this case, but I've before wondered about > > adding helpers that use an on-stack var for small allocations, and falls > > back to palloc otherwise. Something boiling down to: > > Seems like that adds a lot of potential for memory leakage? Depends on the case, I'd say. Certainly might be useful to add a helper for a corresponding conditional free. For parsing cases like this it could be better to bulk free at the end. Compared to the memory needed for all the transformed arguments etc., it'd probably not matter in the short term (especially if only done for 4+ args). Greetings, Andres Freund
Re: CLOBBER_CACHE_ALWAYS regression instability
Andres Freund writes: > Another avenue could be to make ParseFuncOrColumn et al use less stack, > and hope that it avoids the problem. It's a bit insane that we use this > much. That would only reduce the chance of getting a stack overflow there, and not by that much, especially not for a CLOBBER_CACHE_ALWAYS animal which is going to be doing catalog accesses inside there too. > We don't have to go there in this case, but I've before wondered about > adding helpers that use an on-stack var for small allocations, and falls > back to palloc otherwise. Something boiling down to: Seems like that adds a lot of potential for memory leakage? regards, tom lane
Re: Add A Glossary
On 2020-Apr-05, Jürgen Purtz wrote: > a) Some rearrangements of the sequence of terms to meet alphabetical order. Thanks, will get this pushed. > b) --> in > two cases. Or should it be a ? Ah, yeah, those should be linkend. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: CLOBBER_CACHE_ALWAYS regression instability
Hi, On 2020-04-05 14:33:19 -0400, Tom Lane wrote: > It's a bit surprising perhaps that we run out of stack here and not > somewhere else; but ParseFuncOrColumn and its subroutines consume > quite a lot of stack, because of FUNC_MAX_ARGS-sized local arrays, > so it's not *that* surprising. > > So, what to do to re-stabilize the regression tests? Even if > we wanted to revert 8f59f6b9c0, that'd be kind of a band-aid fix, > because there are lots of other possible ways that a parser error > callback could be active at the point of the error. A few other > possibilities are: > > 1. Change the test to do "\set VERBOSITY sqlstate" so that all that > is printed is > ERROR: 54001 > ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that > this wouldn't be too much of a loss of specificity. (Or we could > give stack overflow its very own ERRCODE.) We could print the error using :LAST_ERROR_MESSAGE after removing a potential trailing "at character ..." if we're worried about the loss of specificity. > On the whole I find #1 the least unpleasant, as well as the most > likely to forestall future variants of this issue. It won't dodge > the PPC64 problem, but I'm willing to keep living with that one > for now. Another avenue could be to make ParseFuncOrColumn et al use less stack, and hope that it avoids the problem. It's a bit insane that we use this much. We don't have to go there in this case, but I've before wondered about adding helpers that use an on-stack var for small allocations, and fall back to palloc otherwise. Something boiling down to: Oid actual_arg_types_stack[3]; Oid *actual_arg_types; if (list_length(fargs) <= lengthof(actual_arg_types_stack)) actual_arg_types = actual_arg_types_stack; else actual_arg_types = palloc(sizeof(*actual_arg_types) * list_length(fargs)) Greetings, Andres Freund
CLOBBER_CACHE_ALWAYS regression instability
Over in the thread at [1] we were wondering how come buildfarm member hyrax has suddenly started to fail like this: diff -U3 /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/expected/errors.out /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/errors.out --- /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/expected/errors.out 2018-11-21 13:48:48.34000 -0500 +++ /home/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/errors.out 2020-04-04 04:48:16.704699045 -0400 @@ -446,4 +446,4 @@ 'select infinite_recurse()' language sql; \set VERBOSITY terse select infinite_recurse(); -ERROR: stack depth limit exceeded +ERROR: stack depth limit exceeded at character 8 I've now looked into this, and found that it's not at all hard to duplicate; compile HEAD with -DCLOBBER_CACHE_ALWAYS, and run "select infinite_recurse()", and you'll likely get the changed error message. (The lack of other buildfarm failures is probably just because we have so few animals doing CLOBBER_CACHE_ALWAYS builds frequently.) The issue seems indeed to have been triggered by 8f59f6b9c0, because that inserted an equal() call into recomputeNamespacePath(). equal() includes a check_stack_depth() call. We get the observed message if this call is the one where the stack limit is hit, and it is invoked inside ParseFuncOrColumn(), which has set up a parser error callback to point at the infinite_recurse() call that it's trying to resolve. That callback's been there a long time of course, so we may conclude that no other code path reached from ParseFuncOrColumn contains a stack depth check, or we'd likely have seen this before. It's a bit surprising perhaps that we run out of stack here and not somewhere else; but ParseFuncOrColumn and its subroutines consume quite a lot of stack, because of FUNC_MAX_ARGS-sized local arrays, so it's not *that* surprising. So, what to do to re-stabilize the regression tests? Even if we wanted to revert 8f59f6b9c0, that'd be kind of a band-aid fix, because there are lots of other possible ways that a parser error callback could be active at the point of the error. A few other possibilities are: 1. Change the test to do "\set VERBOSITY sqlstate" so that all that is printed is ERROR: 54001 ERRCODE_STATEMENT_TOO_COMPLEX is used in few enough places that this wouldn't be too much of a loss of specificity. (Or we could give stack overflow its very own ERRCODE.) 2. Hack pcb_error_callback so that it suppresses the error position report for ERRCODE_STATEMENT_TOO_COMPLEX, as it already does for ERRCODE_QUERY_CANCELED. That seems pretty unpleasant though. 3. Create a separate expected-file to match the variant output. This would be a maintenance problem, but we could ameliorate that by moving the test to its own regression script, which was something that'd already been proposed to get around the PPC64 Linux kernel signal-handling bug that's been causing intermittent failures on most of the PPC64 buildfarm animals [2]. On the whole I find #1 the least unpleasant, as well as the most likely to forestall future variants of this issue. It won't dodge the PPC64 problem, but I'm willing to keep living with that one for now. Thoughts? regards, tom lane [1] https://www.postgresql.org/message-id/CA%2BTgmoaUOS5X64nKgFxNV7JHN4sRkNAJYW2gHz-LMb0Ej4xHig%40mail.gmail.com [2] https://www.postgresql.org/message-id/27924.1571068231%40sss.pgh.pa.us
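For what option 2 would look like in practice, here is a sketch based on the existing ERRCODE_QUERY_CANCELED special case in src/backend/parser/parse_node.c, shown for concreteness even though #1 is the option favored above:

	static void
	pcb_error_callback(void *arg)
	{
		ParseCallbackState *pcbstate = (ParseCallbackState *) arg;

		/*
		 * The existing code skips the position report for query cancel;
		 * option 2 would add stack overflow to the list.
		 */
		if (geterrcode() != ERRCODE_QUERY_CANCELED &&
			geterrcode() != ERRCODE_STATEMENT_TOO_COMPLEX)
			(void) parser_errposition(pcbstate->pstate, pcbstate->location);
	}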
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM
Hello, Peter. > I added > something about this to the nbtree README in commit 9f83468b353. I have added some updates to your notes in the updated patch version. I was also trying to keep the original wrapping of the paragraph, so the patch looks wordier than it really is. Thanks, Michail. btree-race-and-docs.patch Description: Binary data
Re: WIP: BRIN multi-range indexes
On Sun, Apr 5, 2020 at 8:00 PM Tomas Vondra wrote: > On Sun, Apr 05, 2020 at 07:33:40PM +0300, Alexander Korotkov wrote: > >On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra > > wrote: > >> On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote: > >> >On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra > >> > wrote: > >> >> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote: > >> >> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote: > >> >> >> Yeah, the opclass params patches got broken by 773df883e adding enum > >> >> >> reloptions. The breakage is somewhat extensive so I'll leave it up to > >> >> >> Nikita to fix it in [1]. Until that happens, apply the patches on > >> >> >> top of caba97a9d9 for review. > >> >> > > >> >> >This has been close to two months now, so I have the patch as RwF. > >> >> >Feel free to update if you think that's incorrect. > >> >> > >> >> I see the opclass parameters patch got committed a couple days ago, so > >> >> I've rebased the patch series on top of it. The pach was marked RwF > >> >> since 2019-11, so I'll add it to the next CF. > >> > > >> >I think this patchset was marked RwF mainly because slow progress on > >> >opclass parameters. Now we got opclass parameters committed, and I > >> >think this patchset is in a pretty good shape. Moreover, opclass > >> >parameters patch comes with very small examples. This patchset would > >> >be great showcase for opclass parameters. > >> > > >> >I'd like to give this patchset a chance for v13. I'm going to make > >> >another pass trough this patchset. If I wouldn't find serious issues, > >> >I'm going to commit it. Any objections? > >> > > >> > >> I'm an author of the patchset and I'd love to see it committed, but I > >> think that might be a bit too rushed and unfair (considering it was not > >> included in the current CF at all). > >> > >> I think the code is correct and I'm not aware of any bugs, but I'm not > >> sure there was sufficient discussion about things like costing, choosing > >> parameter values (e.g. number of values in the multi-minmax or bloom > >> filter parameters). > > > >Ok! > > > >> That being said, I think the first couple of patches (that modify how > >> BRIN deals with multi-key scans and IS NULL clauses) are simple enough > >> and non-controversial, so maybe we could get 0001-0003 committed, and > >> leave the bloom/multi-minmax opclasses for v14. > > > >Regarding 0001-0003 I've couple of notes: > >1) They should revise BRIN extensibility documentation section. > >2) I think 0002 and 0003 should be merged. NULL ScanKeys should be > >still passed to consistent function when oi_regular_nulls == false. > > > >Assuming we're not going to get 0001-0003 into v13, I'm not so > >inclined to rush on these three as well. But you're willing to commit > >them, you can count round of review on me. > > > > I have no intention to get 0001-0003 committed. I think those changes > are beneficial on their own, but the primary reason was to support the > new opclasses (which require those changes). And those parts are not > going to make it into v13 ... OK, no problem. Let's do this for v14. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: WIP: BRIN multi-range indexes
On Sun, Apr 05, 2020 at 07:33:40PM +0300, Alexander Korotkov wrote: On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra wrote: On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote: >On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra > wrote: >> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote: >> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote: >> >> Yeah, the opclass params patches got broken by 773df883e adding enum >> >> reloptions. The breakage is somewhat extensive so I'll leave it up to >> >> Nikita to fix it in [1]. Until that happens, apply the patches on >> >> top of caba97a9d9 for review. >> > >> >This has been close to two months now, so I have the patch as RwF. >> >Feel free to update if you think that's incorrect. >> >> I see the opclass parameters patch got committed a couple days ago, so >> I've rebased the patch series on top of it. The pach was marked RwF >> since 2019-11, so I'll add it to the next CF. > >I think this patchset was marked RwF mainly because slow progress on >opclass parameters. Now we got opclass parameters committed, and I >think this patchset is in a pretty good shape. Moreover, opclass >parameters patch comes with very small examples. This patchset would >be great showcase for opclass parameters. > >I'd like to give this patchset a chance for v13. I'm going to make >another pass trough this patchset. If I wouldn't find serious issues, >I'm going to commit it. Any objections? > I'm an author of the patchset and I'd love to see it committed, but I think that might be a bit too rushed and unfair (considering it was not included in the current CF at all). I think the code is correct and I'm not aware of any bugs, but I'm not sure there was sufficient discussion about things like costing, choosing parameter values (e.g. number of values in the multi-minmax or bloom filter parameters). Ok! That being said, I think the first couple of patches (that modify how BRIN deals with multi-key scans and IS NULL clauses) are simple enough and non-controversial, so maybe we could get 0001-0003 committed, and leave the bloom/multi-minmax opclasses for v14. Regarding 0001-0003 I've couple of notes: 1) They should revise BRIN extensibility documentation section. 2) I think 0002 and 0003 should be merged. NULL ScanKeys should be still passed to consistent function when oi_regular_nulls == false. Assuming we're not going to get 0001-0003 into v13, I'm not so inclined to rush on these three as well. But you're willing to commit them, you can count round of review on me. I have no intention to get 0001-0003 committed. I think those changes are beneficial on their own, but the primary reason was to support the new opclasses (which require those changes). And those parts are not going to make it into v13 ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: segmentation fault using currtid and partitioned tables
Jaime Casanova writes: > Another one caught by sqlsmith, on the regression database run this > query (using any non-partitioned table works fine): > select currtid('pagg_tab'::regclass::oid, '(0,156)'::tid) >= '(1,158)'::tid; Hm, so (1) currtid_byreloid and currtid_byrelname lack any check to see if they're dealing with a relkind that lacks storage. (2) The proximate cause of the crash is that rd_tableam is zero, so that the interface functions in tableam.h just crash hard. This seems like a pretty bad thing; does anyone want to promise that there are no other oversights of the same ilk elsewhere, and never will be? I think it might be a good idea to make relations-without-storage set up rd_tableam as a vector of dummy functions that will throw some suitable complaint about "relation lacks storage". NULL is a horrible default for this. regards, tom lane
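A rough sketch of that suggestion (names are mine, not a patch; one erroring stub per TableAmRoutine callback, shown here for scan_begin only):

	static TableScanDesc
	no_storage_scan_begin(Relation rel, Snapshot snapshot,
						  int nkeys, struct ScanKeyData *key,
						  ParallelTableScanDesc pscan, uint32 flags)
	{
		ereport(ERROR,
				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
				 errmsg("relation \"%s\" lacks storage",
						RelationGetRelationName(rel))));
		return NULL;			/* keep the compiler quiet */
	}

	static const TableAmRoutine no_storage_methods = {
		.type = T_TableAmRoutine,
		.scan_begin = no_storage_scan_begin,
		/* ... a similar stub for every other callback ... */
	};

	/*
	 * The relcache would then set rel->rd_tableam = &no_storage_methods
	 * for relkinds without storage, instead of leaving it NULL.
	 */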
Re: SQL/JSON: functions
On Mon, Mar 23, 2020 at 8:28 PM Nikita Glukhov wrote: > Attached 47th version of the patches. > > On 21.03.2020 22:38, Pavel Stehule wrote: > > > On 21. 3. 2020 v 11:07 Nikita Glukhov wrote: >> >> Attached 46th version of the patches. >> >> On 20.03.2020 22:34, Pavel Stehule wrote: >> >> >> On 19.03.2020 23:57 Nikita Glukhov wrote: >>> >>> Attached 45th version of the patches. >>> >>> Nodes JsonFormat, JsonReturning, JsonPassing, JsonBehavior were fixed. >>> >>> On 17.03.2020 21:35, Pavel Stehule wrote: User functions json[b]_build_object_ext() and json[b]_build_array_ext() also can be easily removed. But it seems harder to remove new aggregate functions json[b]_objectagg() and json[b]_agg_strict(), because they can't be called directly from JsonCtorExpr node. >>> >>> >>> I don't see reasons for another reduction now. Can be great if you can >>> finalize work what you plan for pg13. >>> >> I have removed json[b]_build_object_ext() and json[b]_build_array_ext(). >> >> But json[b]_objectagg() and json[b]_agg_strict() are still present. >> It seems that removing them requires majors refactoring of the execution >> of Aggref and WindowFunc nodes. > > I have replaced aggregate function > > json[b]_objectagg(key any, val any, absent_on_null boolean, unique_keys > boolean) > > with three separate functions: > > json[b]_object_agg_strict(any, any) > json[b]_object_agg_unique(any, any) > json[b]_object_agg_unique_strict(any, any) > > > This should be more correct than single aggregate with additional parameters. I have the following notes about this patchset. 1) Uniqueness checks using JsonbUniqueCheckContext and JsonUniqueCheckContext have quadratic complexity over the number of keys. That doesn't look good, especially for jsonb, which sorts object keys before object serialization anyway. 2) We have two uniqueness checks for the json type, which use JsonbUniqueCheckContext and JsonUniqueState. JsonUniqueState uses a stack of hashes, while JsonbUniqueCheckContext has just a plain array of keys. I think we can make JsonUniqueState use a single hash, where object identities would be part of the hash key. And we should replace JsonbUniqueCheckContext with JsonUniqueState. That would eliminate extra entities and provide reasonable complexity for the cases that now use JsonbUniqueCheckContext. 3) New SQL/JSON clauses don't use the timezone and are considered immutable, assuming all the children are immutable. Immutability is good, but ignoring the timezone in all cases is plain wrong. The first thing we can do is to use the timezone and make SQL/JSON clauses stable. But that limits their usage in functional and partial indexes. I see a couple of things we can do next (one of them, or probably both). 3.1) Provide the user a way to specify that we should ignore the timezone in a particular case (an IGNORE TIMEZONE clause or something like that). Then the SQL/JSON clause will be considered immutable. 3.2) Automatically detect whether a jsonpath might use the timezone. If the jsonpath doesn't use the .datetime() method, it doesn't need the timezone for sure. Also, from the datetime format specifiers we can tell that we don't compare non-timezoned values to timezoned values. So, if we detect that a jsonpath never uses the timezone, we can consider the SQL/JSON clause immutable. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
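To make note 2 concrete, the single-hash scheme could look something like this. This is my sketch, not code from the patchset; the entry layout and names are assumptions.

	/* One dynahash per document, keyed by (containing object, key name),
	 * replacing the per-object O(n^2) array scans -- hedged sketch. */
	typedef struct JsonUniqueHashEntry
	{
		const char *key;		/* key name, not owned by the hash */
		int			key_len;
		int			object_id;	/* identity of the containing object */
	} JsonUniqueHashEntry;

	static uint32
	json_unique_hash(const void *key, Size keysize)
	{
		const JsonUniqueHashEntry *entry = (const JsonUniqueHashEntry *) key;

		return hash_bytes_uint32((uint32) entry->object_id) ^
			hash_bytes((const unsigned char *) entry->key, entry->key_len);
	}

	static int
	json_unique_hash_match(const void *key1, const void *key2, Size keysize)
	{
		const JsonUniqueHashEntry *a = (const JsonUniqueHashEntry *) key1;
		const JsonUniqueHashEntry *b = (const JsonUniqueHashEntry *) key2;

		if (a->object_id != b->object_id || a->key_len != b->key_len)
			return 1;
		return memcmp(a->key, b->key, a->key_len);
	}

With hash_create(..., HASH_ELEM | HASH_FUNCTION | HASH_COMPARE) and hash_search(..., HASH_ENTER, &found), a duplicate key is simply an entry with found == true, which is O(1) per key rather than a scan of all the keys seen so far.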
Re: WIP: BRIN multi-range indexes
On Sun, Apr 5, 2020 at 6:53 PM Tomas Vondra wrote: > On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote: > >On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra > > wrote: > >> On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote: > >> >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote: > >> >> Yeah, the opclass params patches got broken by 773df883e adding enum > >> >> reloptions. The breakage is somewhat extensive so I'll leave it up to > >> >> Nikita to fix it in [1]. Until that happens, apply the patches on > >> >> top of caba97a9d9 for review. > >> > > >> >This has been close to two months now, so I have the patch as RwF. > >> >Feel free to update if you think that's incorrect. > >> > >> I see the opclass parameters patch got committed a couple days ago, so > >> I've rebased the patch series on top of it. The pach was marked RwF > >> since 2019-11, so I'll add it to the next CF. > > > >I think this patchset was marked RwF mainly because slow progress on > >opclass parameters. Now we got opclass parameters committed, and I > >think this patchset is in a pretty good shape. Moreover, opclass > >parameters patch comes with very small examples. This patchset would > >be great showcase for opclass parameters. > > > >I'd like to give this patchset a chance for v13. I'm going to make > >another pass trough this patchset. If I wouldn't find serious issues, > >I'm going to commit it. Any objections? > > > > I'm an author of the patchset and I'd love to see it committed, but I > think that might be a bit too rushed and unfair (considering it was not > included in the current CF at all). > > I think the code is correct and I'm not aware of any bugs, but I'm not > sure there was sufficient discussion about things like costing, choosing > parameter values (e.g. number of values in the multi-minmax or bloom > filter parameters). Ok! > That being said, I think the first couple of patches (that modify how > BRIN deals with multi-key scans and IS NULL clauses) are simple enough > and non-controversial, so maybe we could get 0001-0003 committed, and > leave the bloom/multi-minmax opclasses for v14. Regarding 0001-0003, I have a couple of notes: 1) They should revise the BRIN extensibility documentation section. 2) I think 0002 and 0003 should be merged. NULL ScanKeys should still be passed to the consistent function when oi_regular_nulls == false. Assuming we're not going to get 0001-0003 into v13, I'm not so inclined to rush these three either. But if you're willing to commit them, you can count on a round of review from me. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: WIP: BRIN multi-range indexes
On Sun, Apr 5, 2020 at 6:51 PM Tom Lane wrote: > Alexander Korotkov writes: > > I'd like to give this patchset a chance for v13. I'm going to make > > another pass trough this patchset. If I wouldn't find serious issues, > > I'm going to commit it. Any objections? > > I think it is way too late to be reviving major features that nobody > has been looking at for months, that indeed were never even listed > in the final CF. At this point in the cycle I think we should just be > trying to get small stuff over the line, not shove in major patches > and figure they can be stabilized later. > > In this particular case, the last serious work on the patchset seems > to have been Tomas' revision of 2019-09-14, and he specifically stated > then that the APIs still needed work. That doesn't sound like > "it's about ready to commit" to me. OK, got it. Thank you for the feedback. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Index Skip Scan
> On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote: > > I was just wondering how the distinct will work with the "skip scan" > if we have some filter? I mean every time we select the unique row > based on the prefix key and that might get rejected by an external > filter right? Not exactly. In the case of an index-only scan, we skip to the first unique position, and then use already existing functionality (_bt_readpage with stepping to the next pages) to filter out those records that do not pass the condition. There are even a couple of tests in the patch for this. In the case of an index scan, when there are some conditions, the current implementation does not consider skipping. > So I tried an example to check this. Can you tell me which version of the patch you were testing?
Re: WIP: BRIN multi-range indexes
On Sun, Apr 05, 2020 at 06:29:15PM +0300, Alexander Korotkov wrote: Hi! On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra wrote: On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote: >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote: >> Yeah, the opclass params patches got broken by 773df883e adding enum >> reloptions. The breakage is somewhat extensive so I'll leave it up to >> Nikita to fix it in [1]. Until that happens, apply the patches on >> top of caba97a9d9 for review. > >This has been close to two months now, so I have the patch as RwF. >Feel free to update if you think that's incorrect. I see the opclass parameters patch got committed a couple days ago, so I've rebased the patch series on top of it. The pach was marked RwF since 2019-11, so I'll add it to the next CF. I think this patchset was marked RwF mainly because slow progress on opclass parameters. Now we got opclass parameters committed, and I think this patchset is in a pretty good shape. Moreover, opclass parameters patch comes with very small examples. This patchset would be great showcase for opclass parameters. I'd like to give this patchset a chance for v13. I'm going to make another pass trough this patchset. If I wouldn't find serious issues, I'm going to commit it. Any objections? I'm an author of the patchset and I'd love to see it committed, but I think that might be a bit too rushed and unfair (considering it was not included in the current CF at all). I think the code is correct and I'm not aware of any bugs, but I'm not sure there was sufficient discussion about things like costing, choosing parameter values (e.g. number of values in the multi-minmax or bloom filter parameters). That being said, I think the first couple of patches (that modify how BRIN deals with multi-key scans and IS NULL clauses) are simple enough and non-controversial, so maybe we could get 0001-0003 committed, and leave the bloom/multi-minmax opclasses for v14. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WIP: BRIN multi-range indexes
Alexander Korotkov writes: > I'd like to give this patchset a chance for v13. I'm going to make > another pass trough this patchset. If I wouldn't find serious issues, > I'm going to commit it. Any objections? I think it is way too late to be reviving major features that nobody has been looking at for months, that indeed were never even listed in the final CF. At this point in the cycle I think we should just be trying to get small stuff over the line, not shove in major patches and figure they can be stabilized later. In this particular case, the last serious work on the patchset seems to have been Tomas' revision of 2019-09-14, and he specifically stated then that the APIs still needed work. That doesn't sound like "it's about ready to commit" to me. regards, tom lane
Re: WIP: BRIN multi-range indexes
Hi! On Thu, Apr 2, 2020 at 5:29 AM Tomas Vondra wrote: > On Sun, Dec 01, 2019 at 10:55:02AM +0900, Michael Paquier wrote: > >On Thu, Sep 26, 2019 at 09:01:48PM +0200, Tomas Vondra wrote: > >> Yeah, the opclass params patches got broken by 773df883e adding enum > >> reloptions. The breakage is somewhat extensive so I'll leave it up to > >> Nikita to fix it in [1]. Until that happens, apply the patches on > >> top of caba97a9d9 for review. > > > >This has been close to two months now, so I have the patch as RwF. > >Feel free to update if you think that's incorrect. > > I see the opclass parameters patch got committed a couple days ago, so > I've rebased the patch series on top of it. The pach was marked RwF > since 2019-11, so I'll add it to the next CF. I think this patchset was marked RwF mainly because of slow progress on opclass parameters. Now we've got opclass parameters committed, and I think this patchset is in pretty good shape. Moreover, the opclass parameters patch comes with very small examples. This patchset would be a great showcase for opclass parameters. I'd like to give this patchset a chance for v13. I'm going to make another pass through this patchset. If I don't find serious issues, I'm going to commit it. Any objections? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Consider low startup cost in add_partial_path
Hi, For the record, here is the relevant part of the Incremental Sort patch series, updating add_partial_path and add_partial_path_precheck to also consider startup cost. The changes in the first two patches are pretty straight-forward, plus there's a proposed optimization in the precheck function to only run compare_pathkeys if entirely necessary. I'm currently evaluating those changes and I'll post the results to the incremental sort thread. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 761b935584229243ecc6fd47d83e86d6b1b382c7 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Sun, 28 Jul 2019 15:55:54 +0200 Subject: [PATCH 1/5] Consider low startup cost when adding partial path 45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with the comment: > Neither do we need to consider startup costs: > parallelism is only used for plans that will be run to completion. > Therefore, this routine is much simpler than add_path: it needs to > consider only pathkeys and total cost. I'm not entirely sure if that is still true or not--I can't easily come up with a scenario in which it's not, but I also can't come up with an inherent reason why such a scenario cannot exist. Regardless, the in-progress incremental sort patch uncovered a new case where it definitely no longer holds, and, as a result a higher cost plan ends up being chosen because a low startup cost partial path is ignored in favor of a lower total cost partial path and a limit is a applied on top of that which would normal favor the lower startup cost plan. --- src/backend/optimizer/util/pathnode.c | 65 +-- 1 file changed, 31 insertions(+), 34 deletions(-) diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 8ba8122ee2..b570bfd3be 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -733,10 +733,11 @@ add_path_precheck(RelOptInfo *parent_rel, * * Because we don't consider parameterized paths here, we also don't * need to consider the row counts as a measure of quality: every path will - * produce the same number of rows. Neither do we need to consider startup - * costs: parallelism is only used for plans that will be run to completion. - * Therefore, this routine is much simpler than add_path: it needs to - * consider only pathkeys and total cost. + * produce the same number of rows. It may however matter how much the + * path ordering matches the final ordering, needed by upper parts of the + * plan. Because that will affect how expensive the incremental sort is, + * we need to consider both the total and startup path, in addition to + * pathkeys. * * As with add_path, we pfree paths that are found to be dominated by * another partial path; this requires that there be no other references to @@ -774,44 +775,40 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path) /* Compare pathkeys. */ keyscmp = compare_pathkeys(new_path->pathkeys, old_path->pathkeys); - /* Unless pathkeys are incompatible, keep just one of the two paths. */ + /* +* Unless pathkeys are incompatible, see if one of the paths dominates +* the other (both in startup and total cost). It may happen that one +* path has lower startup cost, the other has lower total cost. +* +* XXX Perhaps we could do this only when incremental sort is enabled, +* and use the simpler version (comparing just total cost) otherwise? 
+*/ if (keyscmp != PATHKEYS_DIFFERENT) { - if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR) - { - /* New path costs more; keep it only if pathkeys are better. */ - if (keyscmp != PATHKEYS_BETTER1) - accept_new = false; - } - else if (old_path->total_cost > new_path->total_cost -* STD_FUZZ_FACTOR) + PathCostComparison costcmp; + + /* +* Do a fuzzy cost comparison with standard fuzziness limit. +*/ + costcmp = compare_path_costs_fuzzily(new_path, old_path, + STD_FUZZ_FACTOR); + + if (costcmp == COSTS_BETTER1) { - /* Old path costs more; keep it only if pathkeys are better. */ - if (keyscmp !=
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Sun, Apr 05, 2020 at 03:01:10PM +0200, Tomas Vondra wrote: On Thu, Apr 02, 2020 at 09:40:45PM -0400, James Coleman wrote: On Thu, Apr 2, 2020 at 8:46 PM James Coleman wrote: On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra wrote: ... 5) Overall, I think the costing is OK. I'm sure we'll find cases that will need improvements, but that's fine. However, we now have - cost_tuplesort (used to be cost_sort) - cost_full_sort - cost_incremental_sort - cost_sort I find it a bit confusing that we have cost_sort and cost_full_sort. Why don't we just keep using the dummy path in label_sort_with_costsize? That seems to be the only external caller outside costsize.c. Then we could either make cost_full_sort static or get rid of it entirely. This is another area of the patch I haven't really modified. See attached for a cleanup of this; it removed cost_full_sort so label_sort_with_costsize is back to how it was. I've directly merged this into the patch series; if you'd like to see the diff I can send that along. Thanks. Attached is v54 of the patch, with some minor changes. The two main changes are in add_partial_path_precheck(): first, to also consider startup_cost, as discussed before. The second change (in 0003) is a bit of an experiment to make add_partial_path_precheck() cheaper by calling compare_pathkeys only after checking the costs (the cost comparison should be cheaper than the function call). add_path_precheck already does it in that order anyway. Oh, I forgot to mention a change in add_partial_path - I've removed the reference/dependency on enable_incrementalsort. It seemed rather ugly, and the results without it seem fine (I'm benchmarking only the case with incremental sort enabled anyway). I also plan to look at the other optimization we bolted on last week, i.e. checking the length of pathkeys. I'll see if that actually makes a measurable difference. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: backup manifests and contemporaneous buildfarm failures
On 2020-04-04 04:43, Robert Haas wrote: I think I've done about as much as I can do for tonight, though. Most things are green now, and the ones that aren't are failing because of stuff that is at least plausibly fixed. By morning it should be clearer how much broken stuff is left, although that will be somewhat complicated by at least sidewinder and seawasp needing manual intervention to get back on track. I fixed sidewinder, I think. It should clear up the next time it runs. It was the mode on the directory it couldn't handle. A regular rm -rf didn't work; I had to do a chmod -R 700 on all directories to be able to manually remove it. /Mikael
Re: pgbench - add pseudo-random permutation function
Attached is an attempt at improving things. I have added an explicit note and hijacked an existing example to better illustrate the purpose of the function. A significant part of the complexity of the patch is the overflow-handling implementation of (a * b % c) for 64-bit integers. However, this implementation is probably never used, because either 128-bit integers are available and the one-line implementation takes precedence, or the size is small enough (less than 48 bits) that there are no overflows with the primes involved, so the operation is done directly on 64-bit integers. I could remove the implementation and replace it with a "not available on this architecture" message, which would seldom be triggered: you would have to use a 32-bit arch (?) and test against a table with about 262 Trows, which I guess does not exist anywhere. This approach would remove about 40% of the code & comments. Thoughts? -- Fabien.
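For reference, the 128-bit one-liner in question is essentially the following (a sketch; it assumes the compiler provides __int128):

	/* Overflow-safe (a * b) % c on 64-bit operands via the 128-bit path
	 * mentioned above; the removable fallback is what replaces this when
	 * __int128 is unavailable. */
	static uint64
	mul_mod64(uint64 a, uint64 b, uint64 c)
	{
		return (uint64) ((unsigned __int128) a * b % c);
	}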
Re: Improving connection scalability: GetSnapshotData()
On Sun, 1 Mar 2020 at 21:47, Andres Freund wrote:
> On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
> > conns   tps master         tps pgxact-split
> >
> > 1       26842.492845       26524.194821
> > 10      246923.158682      249224.782661
> > 50      695956.539704      709833.746374
> > 100     1054727.043139     1903616.306028
> > 200     964795.282957      1949200.338012
> > 300     906029.377539      1927881.231478
> > 400     845696.690912      1911065.369776
> > 500     812295.222497      1926237.255856
> > 600     888030.104213      1903047.236273
> > 700     866896.532490      1886537.202142
> > 800     863407.341506      1883768.592610
> > 900     871386.608563      1874638.012128
> > 1000    887668.277133      1876402.391502
> > 1500    860051.361395      1815103.564241
> > 2000    890900.098657      1775435.271018
> > 3000    874184.980039      1653953.817997
> > 4000    845023.080703      1582582.316043
> > 5000    817100.195728      1512260.802371
> >
> > I think these are pretty nice results.

FWIW, I took this for a spin on an AMD 3990x:

# setup
pgbench -i postgres

# benchmark
#!/bin/bash
for i in 1 10 50 100 200 300 400 500 600 700 800 900 1000 1500 2000 3000 4000 5000; do
echo Testing with $i connections >> bench.log
pgbench2 -M prepared -c $i -j $i -S -n -T 60 postgres >> bench.log
done

pgbench2 is your patched version of pgbench.

I got some pretty strange results with the unpatched version. Up to about 50 million tps for "excluding connection establishing", which seems pretty farfetched.

connections   Unpatched      Patched
1             49062.24413    49834.64983
10            428673.1027    453290.5985
50            1552413.084    1849233.821
100           2039675.027    2261437.1
200           3139648.845    3632008.991
300           3091248.316    3597748.942
400           3056453.5      3567888.293
500           3019571.47     3574009.053
600           2991052.393    3537518.903
700           2952484.763    3553252.603
800           2910976.875    3539404.865
900           2873929.989    3514353.776
1000          2846859.499    3490006.026
1500          2540003.038    3370093.934
2000          2361799.107    3197556.738
3000          2056973.778    2949740.692
4000          1751418.117    2627174.81
5000          1464786.461    2334586.042

> Attached as a graph as well.

Likewise.

David
Re: Online checksums verification in the backend
On Sun, Apr 05, 2020 at 08:01:36PM +0900, Masahiko Sawada wrote: > On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud wrote: > > > > On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote: > > > > > > Why do we need two rows in the doc? For instance, replication slot > > > functions have some optional arguments but there is only one row in > > > the doc. So I think we don't need to change the doc from the previous > > > version patch. > > > > > > > I thought that if we document the function as pg_check_relation(regclass [, > > fork]) users could think that the 2nd argument is optional, so that > > pg_check_relation('something', NULL) could be a valid alias for the > > 1-argument > > form, which it isn't. After checking, I see that e.g. current_setting has > > the > > same semantics and is documented the way you suggest, so fixed back to > > previous > > version. > > > > > And I think these are not necessary as we already defined in > > > include/catalog/pg_proc.dat: > > > > > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > > + IN relation regclass, > > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > > + OUT expected_checksum integer, OUT found_checksum integer) > > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS > > > 'pg_check_relation' > > > + PARALLEL RESTRICTED; > > > + > > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > > + IN relation regclass, IN fork text, > > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > > + OUT expected_checksum integer, OUT found_checksum integer) > > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal > > > + AS 'pg_check_relation_fork' > > > + PARALLEL RESTRICTED; > > > > > > > Oh right this isn't required since there's no default value anymore, fixed. > > > > v9 attached. > > Thank you for updating the patch! The patch looks good to me. > > I've marked this patch as Ready for Committer. I hope this patch will > get committed to PG13. > Thanks a lot!
Re: Online checksums verification in the backend
On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud wrote: > > On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote: > > > > Why do we need two rows in the doc? For instance, replication slot > > functions have some optional arguments but there is only one row in > > the doc. So I think we don't need to change the doc from the previous > > version patch. > > > > I thought that if we document the function as pg_check_relation(regclass [, > fork]) users could think that the 2nd argument is optional, so that > pg_check_relation('something', NULL) could be a valid alias for the 1-argument > form, which it isn't. After checking, I see that e.g. current_setting has the > same semantics and is documented the way you suggest, so fixed back to > previous > version. > > > And I think these are not necessary as we already defined in > > include/catalog/pg_proc.dat: > > > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > + IN relation regclass, > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > + OUT expected_checksum integer, OUT found_checksum integer) > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS > > 'pg_check_relation' > > + PARALLEL RESTRICTED; > > + > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > + IN relation regclass, IN fork text, > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > + OUT expected_checksum integer, OUT found_checksum integer) > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal > > + AS 'pg_check_relation_fork' > > + PARALLEL RESTRICTED; > > > > Oh right this isn't required since there's no default value anymore, fixed. > > v9 attached. Thank you for updating the patch! The patch looks good to me. I've marked this patch as Ready for Committer. I hope this patch will get committed to PG13. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Index Skip Scan
On Wed, Mar 25, 2020 at 2:19 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote: > > > > Seems like you forgot to add the uniquekey.c file in the > > v33-0001-Unique-key.patch. > > Oh, you're right, thanks. Here is the corrected patch.

I was just wondering how DISTINCT will work with the "skip scan" if we have some filter? I mean, every time we select the unique row based on the prefix key, it might get rejected by an external filter, right? So I tried an example to check this.

postgres[50006]=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Indexes:
    "idx" btree (a, b)

postgres[50006]=# insert into t select 2, i from generate_series(1, 200)i;
INSERT 0 200
postgres[50006]=# insert into t select 1, i from generate_series(1, 200)i;
INSERT 0 200
postgres[50006]=# set enable_indexskipscan =off;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
 1
 2
(2 rows)

postgres[50006]=# set enable_indexskipscan =on;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
(0 rows)

postgres[50006]=# explain select distinct(a) from t where b%100=0;
                            QUERY PLAN
------------------------------------------------------------------
 Index Only Scan using idx on t  (cost=0.15..1.55 rows=10 width=4)
   Skip scan: true
   Filter: ((b % 100) = 0)
(3 rows)

I think in such cases we should not select the skip scan. This should behave like we have a filter on a non-index field. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Allow cluster owner to bypass authentication
On 2020-03-27 15:58, David Steele wrote: Hi Peter, On 12/27/19 3:22 PM, Stephen Frost wrote: * Tom Lane (t...@sss.pgh.pa.us) wrote: I think it'd be great if this behavior could be implemented within the notation, because we could then just set up a non-empty default pg_ident.conf with useful behavioral examples in the form of prefab maps. In particular, we should think about how hard it is to do "I want the default behavior plus allow joe to connect as charlie". If the default is a one-liner that you can copy and add to, that's a lot better than if you have to reverse-engineer what to write. This direction certainly sounds more appealing to me. Any thoughts on the discussion between Stephen and Tom? It appears that the whole discussion of what a new default security configuration could or should be hasn't really moved to a new consensus, so given the time, I think it's best that we leave things as they are and continue the exploration at some future time. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Yet another fast GiST build
Hello Yuri, PDEP is indeed the first thing that comes up when you start googling z-curve and bit interleaving :) We had z-curve code generating the PDEP instruction in PostGIS, and have since dropped it. In sorting, we now utilize sort support / prefix search, with the key generated as a Hilbert curve, fine-tuned for different projections' geometric properties. From this patch the most valuable thing for us is the sorting build infrastructure itself. Maybe, to make it a bit more understandable for people not deep in geometry, it makes sense to first expose the btree_gist datatypes to this? So that a btree_gist index on integer will be built exactly the same way a btree index on integer is built. This will also give everyone a reference point on the bottlenecks and optimality of the patch. On Fri, Apr 3, 2020 at 10:56 AM Yuri Astrakhan wrote: > > Awesome addition! Would it make sense to use x86's BMI2's PDEP instruction, > or is the interleave computation too small of a percentage to introduce > not-so-easy-to-port code? Also, I think it needs a bit more documentation to > explain the logic, i.e. a link to > https://stackoverflow.com/questions/39490345/interleave-bits-efficiently ? > Thx for making it faster :) -- Darafei Praliaskouski Support me: http://patreon.com/komzpa
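For readers following along, the PDEP-based interleave Yuri refers to boils down to this (a sketch; x86-only, requires BMI2 and compiling with -mbmi2, which is exactly the portability concern):

	#include <stdint.h>
	#include <immintrin.h>

	/* Interleave the bits of x (even positions) and y (odd positions)
	 * into a z-order key using BMI2's parallel-deposit instruction. */
	static inline uint64_t
	zorder_pdep(uint32_t x, uint32_t y)
	{
		return _pdep_u64(x, 0x5555555555555555ULL) |
			_pdep_u64(y, 0xAAAAAAAAAAAAAAAAULL);
	}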
Re: Online checksums verification in the backend
On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote: > > Why do we need two rows in the doc? For instance, replication slot > functions have some optional arguments but there is only one row in > the doc. So I think we don't need to change the doc from the previous > version patch. > I thought that if we document the function as pg_check_relation(regclass [, fork]) users could think that the 2nd argument is optional, so that pg_check_relation('something', NULL) could be a valid alias for the 1-argument form, which it isn't. After checking, I see that e.g. current_setting has the same semantics and is documented the way you suggest, so fixed back to previous version. > And I think these are not necessary as we already defined in > include/catalog/pg_proc.dat: > > +CREATE OR REPLACE FUNCTION pg_check_relation( > + IN relation regclass, > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > + OUT expected_checksum integer, OUT found_checksum integer) > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS > 'pg_check_relation' > + PARALLEL RESTRICTED; > + > +CREATE OR REPLACE FUNCTION pg_check_relation( > + IN relation regclass, IN fork text, > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > + OUT expected_checksum integer, OUT found_checksum integer) > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal > + AS 'pg_check_relation_fork' > + PARALLEL RESTRICTED; > Oh right this isn't required since there's no default value anymore, fixed. v9 attached. >From 66e0ed1c3b12c42c4d52b2b27e79e16e82730b4b Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Mon, 4 Nov 2019 08:40:23 +0100 Subject: [PATCH v9] Add a pg_check_relation() function. This functions checks the validity of the checksums for all non-dirty blocks of a given relation, and optionally a given fork, and returns the list of all blocks that don't match, along with the expected and found checksums. Author: Julien Rouhaud Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com --- doc/src/sgml/config.sgml | 85 + doc/src/sgml/func.sgml| 51 +++ src/backend/storage/page/checksum.c | 318 +- src/backend/utils/adt/Makefile| 1 + src/backend/utils/adt/checksumfuncs.c | 217 src/backend/utils/init/globals.c | 7 + src/backend/utils/misc/guc.c | 33 ++ src/backend/utils/misc/postgresql.conf.sample | 6 + src/include/catalog/pg_proc.dat | 16 + src/include/miscadmin.h | 7 + src/include/storage/checksum.h| 13 + src/include/utils/guc_tables.h| 1 + src/test/Makefile | 3 +- src/test/check_relation/.gitignore| 2 + src/test/check_relation/Makefile | 23 ++ src/test/check_relation/README| 23 ++ .../check_relation/t/01_checksums_check.pl| 276 +++ 17 files changed, 1078 insertions(+), 4 deletions(-) create mode 100644 src/backend/utils/adt/checksumfuncs.c create mode 100644 src/test/check_relation/.gitignore create mode 100644 src/test/check_relation/Makefile create mode 100644 src/test/check_relation/README create mode 100644 src/test/check_relation/t/01_checksums_check.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 114db38116..a20501bb85 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2038,6 +2038,91 @@ include_dir 'conf.d' + + Cost-based Checksum Verification Delay + + + During the execution of + function, the system maintains an internal counter that keeps track of + the estimated cost of the various I/O operations that are performed. 
+ When the accumulated cost reaches a limit (specified by + checksum_cost_limit), the process performing the + operation will sleep for a short period of time, as specified by + checksum_cost_delay. Then it will reset the counter + and continue execution. + + + + This feature is disabled by default. To enable it, set the + checksum_cost_delay variable to a nonzero + value. + + + + + checksum_cost_delay (floating point) + +checksum_cost_delay configuration parameter + + + + + The amount of time that the process will sleep + when the cost limit has been exceeded. + If this value is specified without units, it is taken as milliseconds. + The default value is zero, which disables the cost-based checksum + verification delay feature. Positive values enable cost-based + checksum verification. + + + + + + checksum_cost_page (integer) + +ch
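The throttling the documentation above describes follows the familiar vacuum pattern; roughly, it works like this (a sketch modeled on vacuum_delay_point(), using the ChecksumCost* globals the patch adds, but not the patch's own code):

	static void
	checksum_delay_point(void)
	{
		/* nothing to do when cost accounting is disabled */
		if (!ChecksumCostActive || InterruptPending)
			return;

		if (ChecksumCostBalance >= ChecksumCostLimit)
		{
			/* sleep for checksum_cost_delay milliseconds... */
			pg_usleep((long) (ChecksumCostDelay * 1000L));

			/* ...then reset the counter and continue */
			ChecksumCostBalance = 0;
		}
	}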
Re: Online checksums verification in the backend
On Sun, 5 Apr 2020 at 17:44, Julien Rouhaud wrote: > > On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote: > > > > Thank you for updating the patch! The patch looks good to me. Here are > > some random comments mostly about cosmetic changes. > > > > Thanks a lot for the review! Thank you for updating the patch. > > > > > 1. > > I think we can have two separate SQL functions: > > pg_check_relation(regclass, text) and pg_check_relation(regclass), > > instead of setting NULL by default to the second argument. > > > > I'm fine with it, so implemented this way with the required documentation > changes. Why do we need two rows in the doc? For instance, replication slot functions have some optional arguments but there is only one row in the doc. So I think we don't need to change the doc from the previous version patch. And I think these are not necessary as we already defined in include/catalog/pg_proc.dat: +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation' + PARALLEL RESTRICTED; + +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass, IN fork text, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal + AS 'pg_check_relation_fork' + PARALLEL RESTRICTED; > > > > > 2. > > + * Check data sanity for a specific block in the given fork of the given > > + * relation, always retrieved locally with smgrred even if a version > > exists in > > > > s/smgrred/smgrread/ > > Fixed. > > > > > 3. > > + /* The buffer will have to check checked. */ > > + Assert(checkit); > > > > Should it be "The buffer will have to be checked"? > > > > Oops indeed, fixed. > > > > > 4. > > + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) > > + ereport(ERROR, > > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > > +errmsg("only superuser or a member of the > > pg_read_server_files role may use this function"))); > > > > Looking at the definition of pg_stat_read_server_files role, this role > > seems to be for operations that could read non-database files such as > > csv files. Therefore, currently this role is used by file_fdw and COPY > > command. I personally think pg_stat_scan_tables would be more > > appropriate for this function but I'm not sure. > > > > That's a very good point, especially since the documentation of this default > role is quite relevant for those functions: > > "Execute monitoring functions that may take ACCESS SHARE locks on tables, > potentially for a long time." > > So changed! > > > > > 5. > > + /* Set cost-based vacuum delay */ > > + ChecksumCostActive = (ChecksumCostDelay > 0); > > + ChecksumCostBalance = 0; > > > > s/vacuum/checksum verification/ > > > > Fixed. > > > > > 6. > > + ereport(WARNING, > > + (errcode(ERRCODE_DATA_CORRUPTED), > > +errmsg("invalid page in block %u of relation %s", > > + blkno, > > + relpath(relation->rd_smgr->smgr_rnode, forknum; > > > > I think it's better to show the relation name instead of the relation path > > here. > > > > I'm here using the same pattern as what ReadBuffer_common() would display if a > corrupted block is read. I think it's better to keep the format for both, so > any existing log analyzer will keep working with those new functions. Ok, I agree with you. 
Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
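[As background for comment 5 above: the ChecksumCost* variables quoted from the
patch mirror PostgreSQL's vacuum cost-based delay machinery. Below is a minimal
sketch of how such throttling is typically wired up, modeled on
vacuum_delay_point() in src/backend/commands/vacuum.c. ChecksumCostLimit and
the function name checksum_delay_point() are assumed names for illustration,
not confirmed parts of the patch.]

/*
 * Sketch only: cost-based throttling for checksum verification, modeled on
 * vacuum_delay_point().  ChecksumCostActive/Balance/Delay appear in the
 * quoted patch hunks; ChecksumCostLimit and checksum_delay_point() are
 * hypothetical names used for this illustration.
 */
#include "postgres.h"
#include "miscadmin.h"

static void
checksum_delay_point(void)
{
	/* Always give interrupts a chance, as vacuum_delay_point() does. */
	CHECK_FOR_INTERRUPTS();

	if (!ChecksumCostActive || InterruptPending)
		return;

	/* Nap once the accumulated cost crosses the limit. */
	if (ChecksumCostBalance >= ChecksumCostLimit)
	{
		double		msec;

		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
		if (msec > ChecksumCostDelay * 4)
			msec = ChecksumCostDelay * 4;

		pg_usleep((long) (msec * 1000));

		ChecksumCostBalance = 0;	/* start accumulating again */
	}
}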
Re: Online checksums verification in the backend
On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote:
>
> Thank you for updating the patch! The patch looks good to me. Here are
> some random comments mostly about cosmetic changes.

Thanks a lot for the review!

> 1.
> I think we can have two separate SQL functions:
> pg_check_relation(regclass, text) and pg_check_relation(regclass),
> instead of setting NULL by default to the second argument.

I'm fine with it, so implemented this way with the required documentation
changes.

> 2.
> + * Check data sanity for a specific block in the given fork of the given
> + * relation, always retrieved locally with smgrred even if a version exists
> in
>
> s/smgrred/smgrread/

Fixed.

> 3.
> + /* The buffer will have to check checked. */
> + Assert(checkit);
>
> Should it be "The buffer will have to be checked"?

Oops indeed, fixed.

> 4.
> + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
> +     ereport(ERROR,
> +             (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +              errmsg("only superuser or a member of the
> pg_read_server_files role may use this function")));
>
> Looking at the definition of the pg_read_server_files role, this role
> seems to be for operations that could read non-database files such as
> csv files. Therefore, currently this role is used by file_fdw and the
> COPY command. I personally think pg_stat_scan_tables would be more
> appropriate for this function but I'm not sure.

That's a very good point, especially since the documentation of this default
role is quite relevant for those functions:

"Execute monitoring functions that may take ACCESS SHARE locks on tables,
potentially for a long time."

So changed!

> 5.
> + /* Set cost-based vacuum delay */
> + ChecksumCostActive = (ChecksumCostDelay > 0);
> + ChecksumCostBalance = 0;
>
> s/vacuum/checksum verification/

Fixed.

> 6.
> + ereport(WARNING,
> +         (errcode(ERRCODE_DATA_CORRUPTED),
> +          errmsg("invalid page in block %u of relation %s",
> +                 blkno,
> +                 relpath(relation->rd_smgr->smgr_rnode, forknum))));
>
> I think it's better to show the relation name instead of the relation path
> here.

I'm here using the same pattern as what ReadBuffer_common() would display if a
corrupted block is read. I think it's better to keep the format for both, so
any existing log analyzer will keep working with those new functions.

> 7.
> + ereport(ERROR,
> +         (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +          errmsg("relation \"%s\" does not have storage to be checked",
> +                 quote_qualified_identifier(
> +                     get_namespace_name(get_rel_namespace(relid)),
> +                     get_rel_name(relid)))));
>
> Looking at other similar error messages we don't show the qualified
> relation name but the relation name gotten by
> RelationGetRelationName(relation). Can we do that here as well for
> consistency?

Indeed, fixed.

> 8.
> + if (!(rsinfo->allowedModes & SFRM_Materialize))
> +     ereport(ERROR,
> +             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +              errmsg("materialize mode required, but it is not " \
> +                     "allowed in this context")));
>
> I think it's better to have this error message in one line for easy grepping.

Fixed. I also fixed a missing leading tab in the perl TAP tests.

From a9634bf02c0e7e1f9b0e8cf4e1ad79a72ea80ec8 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v8] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks
of a given relation, and optionally a given fork, and returns the list of
all blocks that don't match, along with the expected and found checksums.
Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  85 +
 doc/src/sgml/func.sgml                        |  60
 src/backend/catalog/system_views.sql          |  15 +
 src/backend/storage/page/checksum.c           | 318 +-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         | 217
 src/backend/utils/init/globals.c              |   7 +
 src/backend/utils/misc/guc.c                  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat               |  16 +
 src/include/miscadmin.h                       |   7 +
 src/include/storage/checksum.h                |  13 +
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/check_relation/.gitignore            |   2 +
 src/test/check_relation/Makefile              |  23 ++
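[To make the mechanism concrete for readers skimming the thread: the heart of
such a function is a per-block comparison against the backend's existing
pg_checksum_page() from storage/checksum.h. The sketch below is an
illustration, not code from the patch; check_one_block() is a made-up name,
and it assumes data checksums are enabled cluster-wide.]

/*
 * Sketch only: verify one block's checksum.  pg_checksum_page() recomputes
 * the checksum with the stored pd_checksum field masked out, so comparing
 * its result against pd_checksum detects corruption.  Assumes the cluster
 * was initialized with data checksums (DataChecksumsEnabled()).
 */
#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"

static bool
check_one_block(char *buffer, BlockNumber blkno)
{
	Page		page = (Page) buffer;

	/* New (never-initialized) pages carry no checksum to verify. */
	if (PageIsNew(page))
		return true;

	return ((PageHeader) page)->pd_checksum == pg_checksum_page(buffer, blkno);
}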
Re: Add A Glossary
a) Some rearrangements of the sequence of terms to meet alphabetical order.
b) --> in two cases. Or should it be a ?

Kind regards,
Jürgen

diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 8c6cb6e942..25762b7c3a 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -208,15 +208,6 @@
-
-   Checkpointer (process)
-
-   A specialized process responsible for executing checkpoints.
-
   Checkpoint
@@ -244,6 +235,15 @@
+
+   Checkpointer (process)
+
+   A specialized process responsible for executing checkpoints.
+
   Class (archaic)
@@ -761,25 +761,6 @@
-
-   Logger (process)
-
-   If activated, the Logger process writes information about database
-   events into the current log file. When reaching certain time- or
-   volume-dependent criteria, a new log file is created. Also called
-   syslogger.
-
-   For more information, see .
-
   Log Record
@@ -803,6 +784,25 @@
+
+   Logger (process)
+
+   If activated, the Logger process writes information about database
+   events into the current log file. When reaching certain time- or
+   volume-dependent criteria, a new log file is created. Also called
+   syslogger.
+
+   For more information, see .
+
   Master (server)
@@ -1651,6 +1651,11 @@
+
+   WAL
+
   WAL Archiver (process)
@@ -1696,11 +1701,6 @@
-
-   WAL
-
   WAL Record
@@ -1728,8 +1728,8 @@
    A process that writes WAL records
-   from shared memory to
-   WAL files.
+   from shared memory to
+   WAL files.
    For more information, see
segmentation fault using currtid and partitioned tables
Hi,

Another one caught by sqlsmith. On the regression database, run this query
(using any non-partitioned table instead works fine):

"""
select currtid('pagg_tab'::regclass::oid, '(0,156)'::tid) >= '(1,158)'::tid;
"""

This works on 11 (well, it gives an error because the file doesn't exist) but
crashes the server on 12+. Attached is the stack trace from master.

--
Jaime Casanova
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#0  0x00a3a3f1 in table_beginscan_tid (rel=0x7ff8ad4d01b8, snapshot=0x1336430)
    at ../../../../src/include/access/tableam.h:842
        flags = 8
#1  0x00a3b1e3 in currtid_byreloid (fcinfo=0x130c298) at tid.c:384
        reloid = 37534
        tid = 0x128f498
        result = 0x135b960
        rel = 0x7ff8ad4d01b8
        aclresult = ACLCHECK_OK
        snapshot = 0x1336430
        scan = 0x10
#2  0x006fc889 in ExecInterpExpr (state=0x130c100, econtext=0x130be00,
    isnull=0x7fff1f9659e7) at execExprInterp.c:698
        fcinfo = 0x130c298
        args = 0x130c2b8
        nargs = 2
        d = 140733723337632
        op = 0x130c2f0
        resultslot = 0x130c008
        innerslot = 0x0
        outerslot = 0x0
        scanslot = 0x0
        dispatch_table = {0x6fc140, 0x6fc165, 0x6fc19b, 0x6fc1d4, 0x6fc20d,
          0x6fc296, 0x6fc31f, 0x6fc3a8, 0x6fc3d7, 0x6fc406, 0x6fc435,
          0x6fc463, 0x6fc50a, 0x6fc5b1, 0x6fc658, 0x6fc6b3, 0x6fc756,
          0x6fc78c, 0x6fc7f9, 0x6fc8c8, 0x6fc8f6, 0x6fc924, 0x6fc92f,
          0x6fc997, 0x6fc9f0, 0x6fc9fb, 0x6fca63, 0x6fcabc, 0x6fcaf4,
          0x6fcb69, 0x6fcb94, 0x6fcbdf, 0x6fcc2d, 0x6fcc88, 0x6fccca,
          0x6fcd0f, 0x6fcd3d, 0x6fcd6b, 0x6fcda5, 0x6fce08, 0x6fce6b,
          0x6fcea5, 0x6fced3, 0x6fcf01, 0x6fcf31, 0x6fd01d, 0x6fd073,
          0x6fd25b, 0x6fd34e, 0x6fd432, 0x6fd50e, 0x6fd535, 0x6fd55c,
          0x6fd583, 0x6fd5aa, 0x6fd5d8, 0x6fd5ff, 0x6fd745, 0x6fd849,
          0x6fd870, 0x6fd89e, 0x6fd8cc, 0x6fd8fa, 0x6fd950, 0x6fd977,
          0x6fd99e, 0x6fcfa7, 0x6fda1a, 0x6fda41, 0x6fd9c5, 0x6fd9f3,
          0x6fda68, 0x6fda8f, 0x6fdb2d, 0x6fdb54, 0x6fdbf2, 0x6fdc20,
          0x6fdc4e, 0x6fdc89, 0x6fdd3c, 0x6fddca, 0x6fde51, 0x6fdede,
          0x6fdffe, 0x6fe0e8, 0x6fe1bc, 0x6fe2d9, 0x6fe3c0, 0x6fe491,
          0x6fe4bf, 0x6fe4ed}
#3  0x006fe563 in ExecInterpExprStillValid (state=0x130c100, econtext=0x130be00,
    isNull=0x7fff1f9659e7) at execExprInterp.c:1807
No locals.
#4  0x0074a22c in ExecEvalExprSwitchContext (state=0x130c100, econtext=0x130be00,
    isNull=0x7fff1f9659e7) at ../../../src/include/executor/executor.h:313
        retDatum = 18446462598752812812
        oldContext = 0x130b990
#5  0x0074a295 in ExecProject (projInfo=0x130c0f8)
    at ../../../src/include/executor/executor.h:347
        econtext = 0x130be00
        state = 0x130c100
        slot = 0x130c008
        isnull = false
#6  0x0074a47b in ExecResult (pstate=0x130bce8) at nodeResult.c:136
        node = 0x130bce8
        outerTupleSlot = 0x779642
        outerPlan = 0x0
        econtext = 0x130be00
#7  0x00712656 in ExecProcNodeFirst (node=0x130bce8) at execProcnode.c:444
No locals.
#8  0x00707085 in ExecProcNode (node=0x130bce8)
    at ../../../src/include/executor/executor.h:245
No locals.
#9  0x00709b39 in ExecutePlan (estate=0x130bab0, planstate=0x130bce8,
    use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true,
    numberTuples=0, direction=ForwardScanDirection, dest=0x12b30c0,
    execute_once=true) at execMain.c:1646
        slot = 0x0
        current_tuple_count = 0
#10 0x007076a9 in standard_ExecutorRun (queryDesc=0x12ae280,
    direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
        estate = 0x130bab0
        operation = CMD_SELECT
        dest = 0x12b30c0
        sendTuples = true
        oldcontext = 0x12ae160
        __func__ = "standard_ExecutorRun"
#11 0x007074d1 in ExecutorRun (queryDesc=0x12ae280,
    direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:308
No locals.
#12 0x00921d77 in PortalRunSelect (portal=0x12f3e70, forward=true, count=0,
    dest=0x12b30c0) at pquery.c:912
        queryDesc = 0x12ae280
        direction = ForwardScanDirection
        nprocessed = 0
        __func__ = "PortalRunSelect"
#13 0x00921a30 in PortalRun (portal=0x12f3e70, count=9223372036854775807,
    isTopLevel=true, run_once=true, dest=0x12b30c0, altdest=0x12b30c0,
    qc=0x7fff1f965d60) at pquery.c:756
        _save_exception_stack = 0x7fff1f965e70
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{__jmpbuf = {0, 60194379429867500, 4706656,
          140733723337632, 0, 0, 60194379352272876, -59702997874741268},
          __mask_was_saved = 0, __saved_mask =
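[Frame #1 shows currtid_byreloid() reaching table_beginscan_tid() for a
partitioned table, which has no storage and no table access method
(rel->rd_tableam is NULL), hence the crash. One plausible shape for a guard,
offered here only as a sketch and not necessarily the committed fix:]

/*
 * Sketch only: in currtid_byreloid()/currtid_byrelname(), refuse relations
 * without storage before starting a TID scan.  Partitioned tables have no
 * table AM, which is what table_beginscan_tid() trips over above.
 */
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
	ereport(ERROR,
			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
			 errmsg("cannot look up tuple location in partitioned table \"%s\"",
					RelationGetRelationName(rel))));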
Re: potential stuck lock in SaveSlotToPath()
On 2020-04-02 08:21, Michael Paquier wrote:
> On Wed, Apr 01, 2020 at 04:26:25PM +0200, Peter Eisentraut wrote:
>> Good catch. How about the attached patch?
>
> WFM. Another trick would be to call LWLockRelease() after generating the
> log, but I find your patch more consistent with the surroundings.

Done.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
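[For readers without the patch at hand: the hazard is an early return taken in
SaveSlotToPath() while the slot's io_in_progress_lock is still held, so the
next save of that slot blocks forever. A hedged sketch of the shape of the
fix; the exact committed hunk may differ:]

/*
 * Sketch only: error path in SaveSlotToPath().  The bug class is returning
 * early with the LWLock still held; releasing it before bailing out (or,
 * as Michael notes, after emitting the log) avoids the stuck lock.
 */
fd = OpenTransientFile(tmppath, O_CREAT | O_EXCL | O_WRONLY | PG_BINARY);
if (fd < 0)
{
	LWLockRelease(&slot->io_in_progress_lock);	/* release before returning */
	ereport(elevel,
			(errcode_for_file_access(),
			 errmsg("could not create file \"%s\": %m", tmppath)));
	return;
}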
Re: Add A Glossary
Hi Corey,

>> ISTM that occurrences of these words elsewhere in the documentation
>> should link to the glossary definitions?
>
> Yes, that's a big project. I was considering writing a script to compile
> all the terms as search terms, paired with their glossary ids, and then
> invoke git grep to identify all pages that have term FOO but don't have
> glossary-foo. We would then go about gloss-linking those pages as
> appropriate, but only a few pages at a time to keep scope sane.

I'd go for scripting the thing.

Should the glossary be backpatched, to possibly ease doc backpatches?

> Also, I'm unclear about the circumstances under which we should _not_
> tag a term.

At least when they are explained locally.

> I remember hearing that we should only tag it on the first usage, but is
> that per section or per page?

Page?

>> As the definitions are short and to the point, maybe the HTML display
>> could (also) "hover" the definitions when the mouse passes over the
>> word, using the "title" attribute?
>
> I like that idea, if it doesn't conflict with accessibility standards
> (maybe that's just titles on images, not sure). The following worked
> fine:
>
> Title Tag Test The ACID property is great.

So basically the def can be put on the glossary link; however, retrieving
the definition should be automatic.

> I suspect we would want to just carry over the first sentence or so with
> a ... to avoid cluttering the screen with my overblown definition of a
> sequence.

Dunno. The definitions are quite short, maybe they can fit whole.

> I suggest we pursue this idea in another thread, as we'd probably want
> to do it for acronyms as well.

Or not. I'd test committer temperature before investing time, because it
would mean that backpatching the doc would be a little harder.

>> Entries could link to relevant wikipedia pages, like the acronyms
>> section does?
>
> They could. I opted not to do that because each external link invites
> debate about how authoritative that link is, which is easier to do with
> acronyms. Now that the glossary is a reality, it's easier to have those
> discussions.

Ok.

--
Fabien.
Comment explaining why vacuum needs to push snapshot seems insufficient.
Hi,

vacuum_rel() has the following comment:

    /*
     * Functions in indexes may want a snapshot set.  Also, setting a snapshot
     * ensures that RecentGlobalXmin is kept truly recent.
     */
    PushActiveSnapshot(GetTransactionSnapshot());

which was added quite a while ago in

    commit d53a56687f3d4772d17ffa0013a33231b7163731
    Author: Alvaro Herrera
    Date:   2008-09-11 14:01:10 +

But to me that's understating the issue. Don't we e.g. need a snapshot to
ensure that pg_subtrans won't be truncated away? I thought xlog.c doesn't
pass PROCARRAY_FLAGS_VACUUM to GetOldestXmin when truncating?

    TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));

It's fine for rows that vacuum could see according to its xmin to be pruned
away, since that won't happen while it has a page locked. But we can't just
have a pg_subtrans access error out, and there's no page-level interlock
against that.

Also, without an xmin set, it seems possible that vacuum could see some of
the transaction ids it uses (in local memory) wrap around. While not likely,
it doesn't seem that unlikely either, since autovacuum will be running full
throttle if there's a 2 billion xid old transaction hanging around. And if
that super old transaction finishes, e.g. vacuum's OldestXmin value could
end up being in the future in the middle of vacuuming the table (if that
table has a new relfrozenxid).

How about replacing it with something like

    /*
     * Need to acquire a snapshot to prevent pg_subtrans from being truncated,
     * cutoff xids in local memory wrapping around, and to have updated xmin
     * horizons.
     */

Greetings,

Andres Freund
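[Put together, the proposal would make vacuum_rel() read roughly as below.
This is a sketch for orientation only: relative to current code, only the
comment text changes, and PopActiveSnapshot() is the existing matching call
later in the function.]

/*
 * Need to acquire a snapshot to prevent pg_subtrans from being truncated,
 * cutoff xids in local memory wrapping around, and to have updated xmin
 * horizons.
 */
PushActiveSnapshot(GetTransactionSnapshot());

/* ... existing vacuum/analyze work on the relation ... */

PopActiveSnapshot();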