Re: Assertion failure with replication origins and PREPARE TRANSACTIOn
On Mon, Dec 13, 2021 at 12:44 PM Michael Paquier wrote: > > Hi all, > (CCing some folks who worked on this area lately) > > The following sequence of commands generates an assertion failure, as > of $subject: > select pg_replication_origin_create('popo'); > select pg_replication_origin_session_setup('popo'); > begin; > select txid_current(); > prepare transaction 'popo'; -- assertion fails > > The problem originates from 1eb6d65, down to 11, where we finish by > triggering this assertion before replorigin_session_origin_lsn is not > valid: > + if (replorigin) > + { > + Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr); > + hdr->origin_lsn = replorigin_session_origin_lsn; > + hdr->origin_timestamp = replorigin_session_origin_timestamp; > + } > > As far as I understand this code and based on the docs, > pg_replication_origin_xact_setup(), that would set of the session > origin LSN and timestamp, is an optional choise. > replorigin_session_advance() would be a no-op for remote_lsn, and > local_lsn requires an update. Now please note that I am really fluent > with this stuff, so feel free to correct me. The intention of the > code also seems that XACT_XINFO_HAS_ORIGIN should still be set, but > with no data. > > At the end, it seems to me that the assertion could just be dropped as > per the attached, as we'd finish by setting up origin_lsn and > origin_timestamp in the 2PC file header with some invalid data. Why do we check if replorigin_session_origin_lsn is not invalid data only when PREPARE TRANSACTION? Looking at commit and rollback code, we don't have assertions or checks that check replorigin_session_origin_lsn/timestamp is valid data. So it looks like we accept also invalid data in those cases since replorigin_advance doesn’t move LSN backward while applying a commit or rollback record even if LSN is invalid. The same is true for PREPARE records, i.g., even if replication origin LSN is invalid, it doesn't go backward. If replication origin LSN and timestamp in commit record and rollback record must be valid data too, I think we should similar checks for commit and rollback code and I think the assertions will fail in the case I reported before[1]. Regards, [1] https://www.postgresql.org/message-id/CAD21AoAMuTrXezGqaV1Q5H-Hf%2BzKqGbU8XuCZk9iQMYueF3%2B8w%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: parallel vacuum comments
On Mon, Dec 13, 2021 at 2:09 PM Amit Kapila wrote: > > On Mon, Dec 13, 2021 at 10:33 AM Masahiko Sawada > wrote: > > > > On Fri, Dec 10, 2021 at 9:08 PM Amit Kapila wrote: > > > > > > On Thu, Dec 9, 2021 at 6:05 PM Masahiko Sawada > > > wrote: > > > > > > > > On Thu, Dec 9, 2021 at 7:44 PM Amit Kapila > > > > wrote: > > > > > > > > Agreed with the above two points. > > > > > > > > I've attached updated patches that incorporated the above comments > > > > too. Please review them. > > > > > > > > > > I have made the following minor changes to the 0001 patch: (a) An > > > assert was removed from dead_items_max_items() which I added back. (b) > > > Removed an unnecessary semicolon from one of the statements in > > > compute_parallel_vacuum_workers(). (c) Changed comments at a few > > > places. (d) moved all parallel_vacuum_* related functions together. > > > (e) ran pgindent and slightly modify the commit message. > > > > > > Let me know what you think of the attached? > > > > Thank you for updating the patch! > > > > The patch also moves some functions, e.g., update_index_statistics() > > is moved without code changes. I agree to move functions for > > consistency but that makes the review hard and the patch complicated. > > I think it's better to do improving the parallel vacuum code and > > moving functions in separate patches. > > > > Okay, I thought it might be better to keep all parallel_vacuum_* > related functions together but we can keep that in a separate patch > Feel free to submit without those changes. I've attached the patch. I've just moved some functions back but not done other changes. > In fact, if we go for your > current 0002 then that might not be even required as we move all those > functions to a new file. Right. So it seems not necessary. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ v9-0001-Improve-parallel-vacuum-implementation.patch Description: Binary data
Re: Add client connection check during the execution of the query
On Mon, Dec 13, 2021 at 5:51 PM Thomas Munro wrote: > [...] Everywhere else calls > with nevents == 1, so that's hypothetical. Erm, I forgot about ExecAppendAsyncEventWait(), so I'd have to update the commit message on that point, but it's hard to worry too much about that case -- it's creating and destroying a WES every time. I'd rather worry about fixing that problem.
Re: parallel vacuum comments
On Mon, Dec 13, 2021 at 10:33 AM Masahiko Sawada wrote: > > On Fri, Dec 10, 2021 at 9:08 PM Amit Kapila wrote: > > > > On Thu, Dec 9, 2021 at 6:05 PM Masahiko Sawada > > wrote: > > > > > > On Thu, Dec 9, 2021 at 7:44 PM Amit Kapila > > > wrote: > > > > > > Agreed with the above two points. > > > > > > I've attached updated patches that incorporated the above comments > > > too. Please review them. > > > > > > > I have made the following minor changes to the 0001 patch: (a) An > > assert was removed from dead_items_max_items() which I added back. (b) > > Removed an unnecessary semicolon from one of the statements in > > compute_parallel_vacuum_workers(). (c) Changed comments at a few > > places. (d) moved all parallel_vacuum_* related functions together. > > (e) ran pgindent and slightly modify the commit message. > > > > Let me know what you think of the attached? > > Thank you for updating the patch! > > The patch also moves some functions, e.g., update_index_statistics() > is moved without code changes. I agree to move functions for > consistency but that makes the review hard and the patch complicated. > I think it's better to do improving the parallel vacuum code and > moving functions in separate patches. > Okay, I thought it might be better to keep all parallel_vacuum_* related functions together but we can keep that in a separate patch Feel free to submit without those changes. In fact, if we go for your current 0002 then that might not be even required as we move all those functions to a new file. -- With Regards, Amit Kapila.
Re: parallel vacuum comments
On Fri, Dec 10, 2021 at 9:08 PM Amit Kapila wrote: > > On Thu, Dec 9, 2021 at 6:05 PM Masahiko Sawada wrote: > > > > On Thu, Dec 9, 2021 at 7:44 PM Amit Kapila wrote: > > > > Agreed with the above two points. > > > > I've attached updated patches that incorporated the above comments > > too. Please review them. > > > > I have made the following minor changes to the 0001 patch: (a) An > assert was removed from dead_items_max_items() which I added back. (b) > Removed an unnecessary semicolon from one of the statements in > compute_parallel_vacuum_workers(). (c) Changed comments at a few > places. (d) moved all parallel_vacuum_* related functions together. > (e) ran pgindent and slightly modify the commit message. > > Let me know what you think of the attached? Thank you for updating the patch! The patch also moves some functions, e.g., update_index_statistics() is moved without code changes. I agree to move functions for consistency but that makes the review hard and the patch complicated. I think it's better to do improving the parallel vacuum code and moving functions in separate patches. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: parallel vacuum comments
On Mon, Dec 13, 2021 at 12:03 PM Amit Kapila wrote: > > On Sat, Dec 11, 2021 at 8:30 PM Masahiko Sawada wrote: > > > > On Sat, Dec 11, 2021 at 2:32 PM Andres Freund wrote: > > > > > > Hi, > > > > > > On 2021-10-30 14:21:01 -0700, Andres Freund wrote: > > > > Due to bug #17245: [1] I spent a considerably amount of time looking at > > > > vacuum > > > > related code. And I found a few things that I think could stand > > > > improvement: > > > > Thank you for the comments. > > > > > > > > While working on the fix for #17255 (more specifically some cleanup that > > > Peter > > > suggested in the context), I noticed another thing: Initializing > > > parallelism > > > as part of dead_items_alloc() is a bad idea. Even if there are comments > > > noting > > > that oddity. > > > > > > I don't really see why we should do it this way? There's no > > > "no-parallelism" > > > path in begin_parallel_vacuum() besides > > > compute_parallel_vacuum_workers(). So > > > it's not like we might just discover the inability to do parallelism > > > during > > > parallel initialization? > > > > Right. Also, in parallel vacuum case, it allocates the space not only > > for dead items but also other data required to do parallelism like > > shared bulkdeletion results etc. Originally, in PG13, > > begin_parallel_vacuum() was called by lazy_scan_heap() but in PG14 it > > became part of dead_items_alloc() (see b4af70cb2). I agree to change > > this part so that lazy_scan_heap() calls begin_parallel_vacuum() > > (whatever we rename it). I'll incorporate this change in the > > refactoring patch barring any objections. > > > > > > > > It's also not particularly helpful to have a begin_parallel_vacuum() that > > > might not actually begin a parallel vacuum... > > > > During the development, I found that we have some begin_* functions > > that don't start the actual parallel job but prepare state data for > > starting parallel job and referred to _bt_begin_parallel() so I named > > begin_parallel_vacuum(). But I admit that considering what the > > function actually does, something like > > create_parallel_vacuum_context() would be clearer. > > > > How about if we name it as parallel_vacuum_init() which will be > similar InitializeParallelDSM, ExecInitParallelPlan(). parallel_vacuum_init() sounds better. > Now, I see > there is some reasoning to keep it in dead_items_alloc as both > primarily allocate memory for vacuum but maybe we should name the > function vacuum_space_alloc instead of dead_items_alloc and similarly > rename dead_items_cleanup to vacuum_space_free. The other idea could > be to bring begin_parallel_vacuum() back in lazy_scan_heap() but I > personally prefer the idea to keep it where it is but improve function > names. Will it be better to do this as a separate patch as 0002 > because this might require some change in the vacuum code path? Yeah, if we do just renaming functions, I think we can do that in 0001 patch. On the other hand, if we need to change the logic, it's better to do that in a separate patch. > > > > > > > Minor nit: > > > > > > begin_parallel_vacuum()'s comment says: > > > * On success (when we can launch one or more workers), will set > > > dead_items and > > > * lps in vacrel for caller. > > > > > > But it actually doesn't know whether we can start workers. It just checks > > > max_parallel_maintenance_workers, no? > > > > Yes, we cannot know whether we can actually start workers when > > starting parallel index vacuuming. It returns non-NULL if we request > > one or more workers. > > > > So can we adjust the comments? I think the part of the sentence "when > we can launch one or more workers" seems to be the cause of confusion, > can we remove it? Yes, we can remove it. Or replace "can launch" with "request". Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Add client connection check during the execution of the query
On Sat, Dec 11, 2021 at 7:09 PM Thomas Munro wrote: > On Sat, Dec 11, 2021 at 6:11 PM Andres Freund wrote: > > Yuck. Is there really no better way to deal with this? What kind of errors > > is > > this trying to handle transparently? Afaics this still changes when we'd > > e.g. detect postmaster death. > > The problem is that WaitEventSetWait() only reports the latch, if it's > set, so I removed it from the set (setting it to NULL), and then undo > that afterwards. Perhaps we could fix that root problem instead. > That is, we could make it so that latches aren't higher priority in > that way, ie don't hide other events[1]. Then I wouldn't have to > modify the WES here, I could just ignore it in the output event list > (and make sure that's big enough for all possible events, as I had it > in the last version). I'll think about that. I tried that. It seems OK, and gets rid of the PG_FINALLY(), which is nice. Latches still have higher priority, and still have the fast return if already set and you asked for only one event, but now if you ask for nevents > 1 we'll make the syscall too so we'll see the WL_SOCKET_CLOSED. It's possible that someone might want the old behaviour one day (fast return even for nevents > 1, hiding other events), and then we'd have to come up with some way to request that, which is the type of tricky policy question that had put me off "fixing" this before. But I guess we could cross that bridge if we come to it. Everywhere else calls with nevents == 1, so that's hypothetical. Better? From 6b994807c29157ac7053ec347a51268d60f2b3b4 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 30 Apr 2021 10:38:40 +1200 Subject: [PATCH v4 1/3] Add WL_SOCKET_CLOSED for socket shutdown events. Provide a way for WaitEventSet to report that the remote peer has shut down its socket, independently of whether there is any buffered data remaining to be read. This works only on systems where the kernel exposes that information, namely: * WAIT_USE_POLL builds using POLLRDHUP, if available * WAIT_USE_EPOLL builds using EPOLLRDHUP * WAIT_USE_KQUEUE builds using EV_EOF Reviewed-by: Zhihong Yu Reviewed-by: Maksim Milyutin Discussion: https://postgr.es/m/77def86b27e41f0efcba411460e929ae%40postgrespro.ru --- src/backend/storage/ipc/latch.c | 79 + src/include/storage/latch.h | 6 ++- 2 files changed, 74 insertions(+), 11 deletions(-) diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 1d893cf863..54e928c564 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -841,6 +841,7 @@ FreeWaitEventSet(WaitEventSet *set) * - WL_SOCKET_CONNECTED: Wait for socket connection to be established, * can be combined with other WL_SOCKET_* events (on non-Windows * platforms, this is the same as WL_SOCKET_WRITEABLE) + * - WL_SOCKET_CLOSED: Wait for socket to be closed by remote peer. * - WL_EXIT_ON_PM_DEATH: Exit immediately if the postmaster dies * * Returns the offset in WaitEventSet->events (starting from 0), which can be @@ -1043,12 +1044,16 @@ WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action) else { Assert(event->fd != PGINVALID_SOCKET); - Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)); + Assert(event->events & (WL_SOCKET_READABLE | +WL_SOCKET_WRITEABLE | +WL_SOCKET_CLOSED)); if (event->events & WL_SOCKET_READABLE) epoll_ev.events |= EPOLLIN; if (event->events & WL_SOCKET_WRITEABLE) epoll_ev.events |= EPOLLOUT; + if (event->events & WL_SOCKET_CLOSED) + epoll_ev.events |= EPOLLRDHUP; } /* @@ -1087,12 +1092,18 @@ WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event) } else { - Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)); + Assert(event->events & (WL_SOCKET_READABLE | +WL_SOCKET_WRITEABLE | +WL_SOCKET_CLOSED)); pollfd->events = 0; if (event->events & WL_SOCKET_READABLE) pollfd->events |= POLLIN; if (event->events & WL_SOCKET_WRITEABLE) pollfd->events |= POLLOUT; +#ifdef POLLRDHUP + if (event->events & WL_SOCKET_CLOSED) + pollfd->events |= POLLRDHUP; +#endif } Assert(event->fd != PGINVALID_SOCKET); @@ -1165,7 +1176,9 @@ WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int old_events) Assert(event->events != WL_LATCH_SET || set->latch != NULL); Assert(event->events == WL_LATCH_SET || event->events == WL_POSTMASTER_DEATH || - (event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))); + (event->events & (WL_SOCKET_READABLE | + WL_SOCKET_WRITEABLE | + WL_SOCKET_CLOSED))); if (event->events == WL_POSTMASTER_DEATH) { @@ -1188,9 +1201,9 @@ WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int old_events) * old event mask to the new event mask, since kevent treats readable * and writable as separate events. */ - if (old_events & WL_SOCKET_READABLE) + if (old_event
Re: Skipping logical replication transactions on subscriber side
On Mon, Dec 13, 2021 at 8:28 AM Masahiko Sawada wrote: > > On Sat, Dec 11, 2021 at 3:29 PM Amit Kapila wrote: > > > > 3. > > + * Also, we don't skip receiving the changes in streaming cases, > > since we decide > > + * whether or not to skip applying the changes when starting to apply > > changes. > > > > But why so? Can't we even skip streaming (and writing to file all such > > messages)? If we can do this then we can avoid even collecting all > > messages in a file. > > IIUC in streaming cases, a transaction can be sent to the subscriber > while splitting into multiple chunks of changes. In the meanwhile, > skip_xid can be changed. If the user changed or cleared skip_xid after > the subscriber skips some streamed changes, the subscriber won't able > to have complete changes of the transaction. > Yeah, I think if we want we can handle this by writing into the stream xid file whether the changes need to be skipped and then the consecutive streams can check that in the file or may be in some way don't allow skip_xid to be changed in worker if it is already skipping some xact. If we don't want to do anything for this then it is better to at least reflect this reasoning in the comments. > > > > 4. > > + * Also, one might think that we can skip preparing the skipped > > transaction. > > + * But if we do that, PREPARE WAL record won’t be sent to its physical > > + * standbys, resulting in that users won’t be able to find the prepared > > + * transaction entry after a fail-over. > > + * > > .. > > + */ > > + if (skipping_changes) > > + stop_skipping_changes(false); > > > > Why do we need such a Prepare's entry either at current subscriber or > > on its physical standby? I think it is to allow Commit-prepared. If > > so, how about if we skip even commit prepared as well? Even on > > physical standby, we would be having the value of skip_xid which can > > help us to skip there as well after failover. > > It's true that skip_xid would be set also on physical standby. When it > comes to preparing the skipped transaction on the current subscriber, > if we want to skip commit-prepared I think we need protocol changes in > order for subscribers to know prepare_lsn and preppare_timestampso > that it can lookup the prepared transaction when doing > commit-prepared. I proposed this idea before. This change would be > benefical as of now since the publisher sends even empty transactions. > But considering the proposed patch[1] that makes the puslisher not > send empty transaction, this protocol change would be an optimization > only for this feature. > I was thinking to compare the xid received as part of the commit_prepared message with the value of skip_xid to skip the commit_prepared but I guess the user would change it between prepare and commit prepare and then we won't be able to detect it, right? I think we can handle this and the streaming case if we disallow users to change the value of skip_xid when we are already skipping changes or don't let the new skip_xid to reflect in the apply worker if we are already skipping some other transaction. What do you think? -- With Regards, Amit Kapila.
Assertion failure with replication origins and PREPARE TRANSACTIOn
Hi all, (CCing some folks who worked on this area lately) The following sequence of commands generates an assertion failure, as of $subject: select pg_replication_origin_create('popo'); select pg_replication_origin_session_setup('popo'); begin; select txid_current(); prepare transaction 'popo'; -- assertion fails The problem originates from 1eb6d65, down to 11, where we finish by triggering this assertion before replorigin_session_origin_lsn is not valid: + if (replorigin) + { + Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr); + hdr->origin_lsn = replorigin_session_origin_lsn; + hdr->origin_timestamp = replorigin_session_origin_timestamp; + } As far as I understand this code and based on the docs, pg_replication_origin_xact_setup(), that would set of the session origin LSN and timestamp, is an optional choise. replorigin_session_advance() would be a no-op for remote_lsn, and local_lsn requires an update. Now please note that I am really fluent with this stuff, so feel free to correct me. The intention of the code also seems that XACT_XINFO_HAS_ORIGIN should still be set, but with no data. At the end, it seems to me that the assertion could just be dropped as per the attached, as we'd finish by setting up origin_lsn and origin_timestamp in the 2PC file header with some invalid data. The else block could just be removed, but then one would need to guess from origin.c that both replorigin_session_* may not be set. I am not completely sure that this is the right move, though, as pg_replication_origin_status has a *very* limited amount of tests. Adding a test to replorigin.sql with 2PC transactions able to trigger the problem is easy once we rely on a different slot to not influence the existing tests, as the expected contents of pg_replication_origin_status could be messed up in the follow-up tests. Thoughts? -- Michael diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 28b153abc3..7518f83925 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1127,9 +1127,9 @@ EndPrepare(GlobalTransaction gxact) replorigin = (replorigin_session_origin != InvalidRepOriginId && replorigin_session_origin != DoNotReplicateId); - if (replorigin) + if (replorigin && + replorigin_session_origin_lsn != InvalidXLogRecPtr) { - Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr); hdr->origin_lsn = replorigin_session_origin_lsn; hdr->origin_timestamp = replorigin_session_origin_timestamp; } diff --git a/contrib/test_decoding/expected/replorigin.out b/contrib/test_decoding/expected/replorigin.out index 8077318755..895783d917 100644 --- a/contrib/test_decoding/expected/replorigin.out +++ b/contrib/test_decoding/expected/replorigin.out @@ -181,3 +181,65 @@ SELECT pg_replication_origin_drop('regress_test_decoding: regression_slot'); (1 row) +-- Transactions with no origin LSN and commit timestamps set for this +-- session yet. +SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot_no_lsn', 'test_decoding'); + ?column? +-- + init +(1 row) + +SELECT pg_replication_origin_create('regress_test_decoding: regression_slot_no_lsn'); + pg_replication_origin_create +-- +1 +(1 row) + +-- mark session as replaying +SELECT pg_replication_origin_session_setup('regress_test_decoding: regression_slot_no_lsn'); + pg_replication_origin_session_setup +- + +(1 row) + +-- Normal transaction +BEGIN; +INSERT INTO target_tbl(data) + SELECT data FROM +pg_logical_slot_get_changes('regression_slot_no_lsn', NULL, NULL, 'include-xids', '0'); +COMMIT; +-- 2PC transaction +BEGIN; +INSERT INTO target_tbl(data) + SELECT data FROM +pg_logical_slot_get_changes('regression_slot_no_lsn', NULL, NULL, 'include-xids', '0'); +PREPARE TRANSACTION 'replorigin_prepared'; +COMMIT PREPARED 'replorigin_prepared'; +SELECT local_id, external_id, + remote_lsn <> '0/0' AS valid_remote_lsn, + local_lsn <> '0/0' AS valid_local_lsn + FROM pg_replication_origin_status; + local_id | external_id | valid_remote_lsn | valid_local_lsn +--+---+--+- +1 | regress_test_decoding: regression_slot_no_lsn | f| t +(1 row) + +-- This ensures that everything can be dropped. +SELECT pg_replication_origin_session_reset(); + pg_replication_origin_session_reset +- + +(1 row) + +SELECT pg_drop_replication_slot('regression_slot_no_lsn'); + pg_drop_replication_slot +-- + +(1 row) + +SELECT pg_replication_origin_drop('regress_test_decoding: regression_slot_no_lsn'); + pg_replication_origin_drop + + +(1 row) + diff --git a/contrib/test_decoding/sql/replorigin.sql b/contri
Re: Skipping logical replication transactions on subscriber side
On Fri, Dec 10, 2021 at 4:44 PM Masahiko Sawada wrote: > > I've attached an updated patch. The new syntax is like "ALTER > SUBSCRIPTION testsub SKIP (xid = '123')". > I have some review comments: (1) Patch comment - some suggested wording improvements BEFORE: If incoming change violates any constraint, logical replication stops AFTER: If an incoming change violates any constraint, logical replication stops BEFORE: The user can specify XID by ALTER SUBSCRIPTION ... SKIP (xid = XXX), updating pg_subscription.subskipxid field, telling the apply worker to skip the transaction. AFTER: The user can specify the XID of the transaction to skip using ALTER SUBSCRIPTION ... SKIP (xid = XXX), updating the pg_subscription.subskipxid field, telling the apply worker to skip the transaction. src/sgml/logical-replication.sgml (2) Some suggested wording improvements (i) Missing "the" BEFORE: + the existing data. When a conflict produce an error, it is shown in AFTER: + the existing data. When a conflict produce an error, it is shown in the (ii) Suggest starting a new sentence BEFORE: + and it is also shown in subscriber's server log as follows: AFTER: + The error is also shown in the subscriber's server log as follows: (iii) Context message should say "at ..." instead of "with commit timestamp ...", to match the actual output from the current code BEFORE: +CONTEXT: processing remote data during "INSERT" for replication target relation "public.test" in transaction 716 with commit timestamp 2021-09-29 15:52:45.165754+00 AFTER: +CONTEXT: processing remote data during "INSERT" for replication target relation "public.test" in transaction 716 at 2021-09-29 15:52:45.165754+00 (iv) The following paragraph seems out of place, with the information presented in the wrong order: + + In this case, you need to consider changing the data on the subscriber so that it + doesn't conflict with incoming changes, or dropping the conflicting constraint or + unique index, or writing a trigger on the subscriber to suppress or redirect + conflicting incoming changes, or as a last resort, by skipping the whole transaction. + They skip the whole transaction, including changes that may not violate any + constraint. They may easily make the subscriber inconsistent, especially if + a user specifies the wrong transaction ID or the position of origin. + How about rearranging it as follows: + + These methods skip the whole transaction, including changes that may not violate + any constraint. They may easily make the subscriber inconsistent, especially if + a user specifies the wrong transaction ID or the position of origin, and should + be used as a last resort. + Alternatively, you might consider changing the data on the subscriber so that it + doesn't conflict with incoming changes, or dropping the conflicting constraint or + unique index, or writing a trigger on the subscriber to suppress or redirect + conflicting incoming changes. + doc/src/sgml/ref/alter_subscription.sgml (3) (i) Doc needs clarification BEFORE: + the whole transaction. The logical replication worker skips all data AFTER: + the whole transaction. For the latter case, the logical replication worker skips all data (ii) "Setting -1 means to reset the transaction ID" Shouldn't it be explained what resetting actually does and when it can be, or is needed to be, done? Isn't it automatically reset? I notice that negative values (other than -1) seem to be regarded as valid - is that right? Also, what happens if this option is set multiple times? Does it just override and use the latest setting? (other option handling errors out with errorConflictingDefElem()). e.g. alter subscription sub skip (xid = 721, xid = 722); src/backend/replication/logical/worker.c (4) Shouldn't the "done skipping logical replication transaction" message also include the skipped XID value at the end? src/test/subscription/t/027_skip_xact.pl (5) Some suggested wording improvements (i) BEFORE: +# Test skipping the transaction. This function must be called after the caller +# inserting data that conflict with the subscriber. After waiting for the +# subscription worker stats are updated, we skip the transaction in question +# by ALTER SUBSCRIPTION ... SKIP. Then, check if logical replication can continue +# working by inserting $nonconflict_data on the publisher. AFTER: +# Test skipping the transaction. This function must be called after the caller +# inserts data that conflicts with the subscriber. After waiting for the +# subscription worker stats to be updated, we skip the transaction in question +# by ALTER SUBSCRIPTION ... SKIP. Then, check if logical replication can continue +# working by inserting $nonconflict_data on the publisher. (ii) BEFORE: +# will conflict with the data replicated from publisher later. AFTER: +# will conflict with the data replicated later from the publisher. Regards, Greg Nancarrow Fujitsu Australi
Re: Replication slot drop message is sent after pgstats shutdown.
At Fri, 10 Dec 2021 18:13:31 +0900, Masahiko Sawada wrote in > I agreed with Andres and Horiguchi-san and attached an updated patch. Thanks for the new version. It seems fine, but I have some comments from a cosmetic viewpoint. + /* +* Register callback to make sure cleanup and releasing the replication +* slot on exit. +*/ + ReplicationSlotInitialize(); Actually all the function does is that but it looks slightly inconsistent with the function name. I think we can call it just "initialization" here. I think we don't need to change the function comment the same way but either will do for me. +ReplicationSlotBeforeShmemExit(int code, Datum arg) The name looks a bit too verbose. Doesn't just "ReplicationSlotShmemExit" work? -* so releasing here is fine. There's another cleanup in ProcKill() -* ensuring we'll correctly cleanup on FATAL errors as well. +* so releasing here is fine. There's another cleanup in +* ReplicationSlotBeforeShmemExit() callback ensuring we'll correctly +* cleanup on FATAL errors as well. I'd prefer "during before_shmem_exit()" than "in ReplicationSlotBeforeShmemExit() callback" here. (But the current wording is also fine by me.) The attached detects that bug, but I'm not sure it's worth expending test time, or this might be in the server test suit. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl index 90da1662e3..0fb4e67697 100644 --- a/src/bin/pg_basebackup/t/030_pg_recvlogical.pl +++ b/src/bin/pg_basebackup/t/030_pg_recvlogical.pl @@ -5,7 +5,8 @@ use strict; use warnings; use PostgreSQL::Test::Utils; use PostgreSQL::Test::Cluster; -use Test::More tests => 20; +use Test::More tests => 25; +use IPC::Run qw(pump finish timer); program_help_ok('pg_recvlogical'); program_version_ok('pg_recvlogical'); @@ -106,3 +107,44 @@ $node->command_ok( '--start', '--endpos', "$nextlsn", '--no-loop', '-f', '-' ], 'replayed a two-phase transaction'); + +## Check for a crash bug caused by replication-slot cleanup after +## pgstat shutdown. +#fire up an interactive psql session +my $in = ''; +my $out = ''; +my $timer = timer(5); +my $h = $node->interactive_psql('postgres', \$in, \$out, $timer); +like($out, qr/psql/, "print startup banner"); + +# open a transaction +$out = ""; +$in .= "BEGIN;\nCREATE TABLE a (a int);\n"; +pump $h until ($out =~ /CREATE TABLE/ || timer->is_expired); +ok(!timer->is_expired, 'background CREATE TABLE passed'); + +# this recvlogical waits for the transaction ends +ok(open(my $recvlogical, '-|', + 'pg_recvlogical', '--create-slot', '-S', 'test2', + '-d', $node->connstr('postgres')), + 'launch background pg_recvlogical'); + +$node->poll_query_until('postgres', + qq{SELECT count(*) > 0 FROM pg_stat_activity + WHERE backend_type='walsender' + AND query like 'CREATE_REPLICATION_SLOT %';}); +# stop server while it hangs. This shouldn't crash server. +$node->stop; +ok(open(my $cont, '-|', 'pg_controldata', $node->data_dir), + 'run pg_controldata'); +my $stop_result = ''; +while (<$cont>) +{ + if (/^Database cluster state: *([^ ].*)$/) + { + $stop_result = $1; + last; + } +} + +is($stop_result, 'shut down', 'server is properly shut down');
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
+ /* +* If the relation is from the default tablespace then we need to +* create it in the destinations db's default tablespace. Otherwise, +* we need to create in the same tablespace as it is in the source +* database. +*/ This comment looks a bit confusing to me especially because when we say destination db's default tablespace people may think of pg_default tablespace (at least I think so). Basically what you are trying to say here - "If the relation exists in the same tablespace as the src database, then in the destination db also it should be the same or something like that.. " So, why not put it that way instead of referring to it as the default tablespace. It's just my view. If you disagree you can ignore it. -- + else if (src_dboid == dst_dboid) + continue; + else + dstrnode.spcNode = srcrnode.spcNode;; There is an extra semicolon here. -- With Regards, Ashutosh Sharma. On Sun, Dec 12, 2021 at 1:39 PM Dilip Kumar wrote: > On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma > wrote: > >> > >> Logfile Snippet: > >> 2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file > "base/116398/116400": No such file or directory > >> 2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID > 18151) was terminated by signal 6: Aborted > >> 2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active > server processes > > > > > > This is different from the issue you raised earlier. As Dilip said, we > need to unregister sync requests for files that got successfully copied to > the target database, but the overall alter database statement failed. We > are doing this when the database is created successfully, but not when it > fails. > > Probably doing the same inside the cleanup function > movedb_failure_callback() should fix the problem. > > Correct, I have done this cleanup, apart from this we have dropped the > fsyc request in create database failure case as well and also need to > drop buffer in error case of creatdb as well as movedb. I have also > fixed the other issue for which you gave the patch (a bit differently) > basically, in case of movedb the source and destination dboid are same > so we don't need an additional parameter and also readjusted the > conditions to avoid nested if. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com >
Re: parallel vacuum comments
On Sat, Dec 11, 2021 at 8:30 PM Masahiko Sawada wrote: > > On Sat, Dec 11, 2021 at 2:32 PM Andres Freund wrote: > > > > Hi, > > > > On 2021-10-30 14:21:01 -0700, Andres Freund wrote: > > > Due to bug #17245: [1] I spent a considerably amount of time looking at > > > vacuum > > > related code. And I found a few things that I think could stand > > > improvement: > > Thank you for the comments. > > > > > While working on the fix for #17255 (more specifically some cleanup that > > Peter > > suggested in the context), I noticed another thing: Initializing parallelism > > as part of dead_items_alloc() is a bad idea. Even if there are comments > > noting > > that oddity. > > > > I don't really see why we should do it this way? There's no "no-parallelism" > > path in begin_parallel_vacuum() besides compute_parallel_vacuum_workers(). > > So > > it's not like we might just discover the inability to do parallelism during > > parallel initialization? > > Right. Also, in parallel vacuum case, it allocates the space not only > for dead items but also other data required to do parallelism like > shared bulkdeletion results etc. Originally, in PG13, > begin_parallel_vacuum() was called by lazy_scan_heap() but in PG14 it > became part of dead_items_alloc() (see b4af70cb2). I agree to change > this part so that lazy_scan_heap() calls begin_parallel_vacuum() > (whatever we rename it). I'll incorporate this change in the > refactoring patch barring any objections. > > > > > It's also not particularly helpful to have a begin_parallel_vacuum() that > > might not actually begin a parallel vacuum... > > During the development, I found that we have some begin_* functions > that don't start the actual parallel job but prepare state data for > starting parallel job and referred to _bt_begin_parallel() so I named > begin_parallel_vacuum(). But I admit that considering what the > function actually does, something like > create_parallel_vacuum_context() would be clearer. > How about if we name it as parallel_vacuum_init() which will be similar InitializeParallelDSM, ExecInitParallelPlan(). Now, I see there is some reasoning to keep it in dead_items_alloc as both primarily allocate memory for vacuum but maybe we should name the function vacuum_space_alloc instead of dead_items_alloc and similarly rename dead_items_cleanup to vacuum_space_free. The other idea could be to bring begin_parallel_vacuum() back in lazy_scan_heap() but I personally prefer the idea to keep it where it is but improve function names. Will it be better to do this as a separate patch as 0002 because this might require some change in the vacuum code path? > > > > Minor nit: > > > > begin_parallel_vacuum()'s comment says: > > * On success (when we can launch one or more workers), will set dead_items > > and > > * lps in vacrel for caller. > > > > But it actually doesn't know whether we can start workers. It just checks > > max_parallel_maintenance_workers, no? > > Yes, we cannot know whether we can actually start workers when > starting parallel index vacuuming. It returns non-NULL if we request > one or more workers. > So can we adjust the comments? I think the part of the sentence "when we can launch one or more workers" seems to be the cause of confusion, can we remove it? -- With Regards, Amit Kapila.
Re: Skipping logical replication transactions on subscriber side
On Sat, Dec 11, 2021 at 3:29 PM Amit Kapila wrote: > > On Fri, Dec 10, 2021 at 11:14 AM Masahiko Sawada > wrote: > > > > On Thu, Dec 9, 2021 at 6:16 PM Amit Kapila wrote: > > > > > > On Thu, Dec 9, 2021 at 2:24 PM Masahiko Sawada > > > wrote: > > > > > > > > On Thu, Dec 9, 2021 at 11:47 AM Amit Kapila > > > > wrote: > > > > > > > > > > I am thinking that we can start a transaction, update the catalog, > > > > > commit that transaction. Then start a new one to update > > > > > origin_lsn/timestamp, finishprepared, and commit it. Now, if it > > > > > crashes after the first transaction, only commit prepared will be > > > > > resent again and this time we don't need to update the catalog as that > > > > > entry would be already cleared. > > > > > > > > Sounds good. In the crash case, it should be fine since we will just > > > > commit an empty transaction. The same is true for the case where > > > > skip_xid has been changed after skipping and preparing the transaction > > > > and before handling commit_prepared. > > > > > > > > Regarding the case where the user specifies XID of the transaction > > > > after it is prepared on the subscriber (i.g., the transaction is not > > > > empty), we won’t skip committing the prepared transaction. But I think > > > > that we don't need to support skipping already-prepared transaction > > > > since such transaction doesn't conflict with anything regardless of > > > > having changed or not. > > > > > > > > > > Yeah, this makes sense to me. > > > > > > > I've attached an updated patch. The new syntax is like "ALTER > > SUBSCRIPTION testsub SKIP (xid = '123')". > > > > I’ve been thinking we can do something safeguard for the case where > > the user specified the wrong xid. For example, can we somewhat use the > > stats in pg_stat_subscription_workers? An idea is that logical > > replication worker fetches the xid from the stats when reading the > > subscription and skips the transaction if the xid matches to > > subskipxid. That is, the worker checks the error reported by the > > worker previously working on the same subscription. The error could > > not be a conflict error (e.g., connection error etc.) or might have > > been cleared by the reset function, But given the worker is in an > > error loop, the worker can eventually get xid in question. We can > > prevent an unrelated transaction from being skipped unexpectedly. It > > seems not a stable solution though. Or it might be enough to warn > > users when they specified an XID that doesn’t match to last_error_xid. > > > > I think the idea is good but because it is not predictable as pointed > by you so we might want to just issue a LOG/WARNING. If not already > mentioned, then please do mention in docs the possibility of skipping > non-errored transactions. > > Few comments/questions: > = > 1. > + Specifies the ID of the transaction whose application is to > be skipped > + by the logical replication worker. Setting -1 means to reset the > + transaction ID. > > Can we change it to something like: "Specifies the ID of the > transaction whose changes are to be skipped by the logical replication > worker. " > Agreed. > 2. > @@ -104,6 +104,16 @@ GetSubscription(Oid subid, bool missing_ok) > Assert(!isnull); > sub->publications = textarray_to_stringlist(DatumGetArrayTypeP(datum)); > > + /* Get skip XID */ > + datum = SysCacheGetAttr(SUBSCRIPTIONOID, > + tup, > + Anum_pg_subscription_subskipxid, > + &isnull); > + if (!isnull) > + sub->skipxid = DatumGetTransactionId(datum); > + else > + sub->skipxid = InvalidTransactionId; > > Can't we assign it as we do for other fixed columns like subdbid, > subowner, etc.? > Yeah, I think we can use InvalidTransactionId as the initial value instead of setting NULL. Then, we can change this code. > 3. > + * Also, we don't skip receiving the changes in streaming cases, > since we decide > + * whether or not to skip applying the changes when starting to apply > changes. > > But why so? Can't we even skip streaming (and writing to file all such > messages)? If we can do this then we can avoid even collecting all > messages in a file. IIUC in streaming cases, a transaction can be sent to the subscriber while splitting into multiple chunks of changes. In the meanwhile, skip_xid can be changed. If the user changed or cleared skip_xid after the subscriber skips some streamed changes, the subscriber won't able to have complete changes of the transaction. > > 4. > + * Also, one might think that we can skip preparing the skipped transaction. > + * But if we do that, PREPARE WAL record won’t be sent to its physical > + * standbys, resulting in that users won’t be able to find the prepared > + * transaction entry after a fail-over. > + * > .. > + */ > + if (skipping_changes) > + stop_skipping_changes(false); > > Why do we need such a Prepare's entry either at current subscriber or > on its physical standby? I think it is to allow Commi
Re: Use extended statistics to estimate (Var op Var) clauses
On Sun, Dec 12, 2021 at 6:04 PM Tomas Vondra wrote: > On 8/31/21 00:14, Zhihong Yu wrote: > > Hi, > > For patch 0002, > > > > + s1 = statext_clauselist_selectivity(root, clauses, > > varRelid, > > + jointype, > > sjinfo, rel, > > + > > &estimatedclauses, false); > > + > > + estimated = (bms_num_members(estimatedclauses) == 1); > > > > I took a look at clauselist_apply_dependencies() (called by > > statext_clauselist_selectivity) where estimatedclauses is modified. > > Since the caller would not use the returned Selectivity if number of > > elements in estimatedclauses is greater than 1, I wonder > > if a parameter can be added to clauselist_apply_dependencies() which > > indicates early return if the second element is added > to estimatedclauses. > > > > Hmmm, I'm not sure I understand your point. Are you suggesting there's a > bug in not updating the bitmap, or would this be an optimization? Can > you give an example of a query affected by this? > > Hi, My previous comment was from 3 months ago - let me see if I can come up with an example. Cheers > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Re: Use extended statistics to estimate (Var op Var) clauses
Hi, I finally got around to this patch again, focusing mostly on the first part that simply returns either 1.0 or 0.0 for Var op Var conditions (i.e. the part not really using extended statistics). I have been unhappy about using examine_variable, which does various expensive things like searching for statistics (which only got worse because now we're also looking for expression stats). But we don't really need the stats - we just need to check the Vars match (same relation, same attribute). So 0002 fixes this. Which got me thinking that maybe we don't need to restrict this to Var nodes only. We can just as easily compare arbitrary expressions, provided it's for the same relation and there are no volatile functions. So 0003 does this. Conditions with the same complex expression on each side of an operator are probably fairly rare, but it's cheap so why not. 0004 and 0005 parts are unchanged. The next steps is adding some tests to the first parts, and extending the tests in the main patch (to also use more complex expressions, if 0003 gets included). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 91371c3e4252580a30530d5b625f72d5525c5c73 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Thu, 26 Aug 2021 23:01:04 +0200 Subject: [PATCH 1/5] Improve estimates for Var op Var with the same Var When estimating (Var op Var) conditions, we can treat the case with the same Var on both sides as a special case, and we can provide better selectivity estimate than for the generic case. For example for (a = a) we know it's 1.0, because all rows are expected to match. Similarly for (a != a) , wich has selectivity 0.0. And the same logic can be applied to inequality comparisons, like (a < a) etc. In principle, those clauses are a bit strange and queries are unlikely to use them. But query generators sometimes do silly things, and these checks are quite cheap so it's likely a win. --- src/backend/utils/adt/selfuncs.c | 77 +++- 1 file changed, 76 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 10895fb287..59038895ec 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -210,7 +210,8 @@ static bool get_actual_variable_endpoint(Relation heapRel, MemoryContext outercontext, Datum *endpointDatum); static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids); - +static bool matching_restriction_variables(PlannerInfo *root, List *args, + int varRelid); /* * eqsel - Selectivity of "=" for any data types. @@ -256,6 +257,14 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate) } } + /* + * It it's (variable = variable) with the same variable on both sides, it's + * a special case and we know it's not expected to filter anything, so we + * estimate the selectivity as 1.0 (or 0.0 if it's negated). + */ + if (matching_restriction_variables(root, args, varRelid)) + return (negate) ? 0.0 : 1.0; + /* * If expression is not variable = something or something = variable, then * punt and return a default estimate. @@ -1408,6 +1417,22 @@ scalarineqsel_wrapper(PG_FUNCTION_ARGS, bool isgt, bool iseq) Oid consttype; double selec; + /* + * Handle (variable < variable) and (variable <= variable) with the same + * variable on both sides as a special case. The strict inequality should + * not match any rows, hence selectivity is 0.0. The other case is about + * the same as equality, so selectivity is 1.0. + */ + if (matching_restriction_variables(root, args, varRelid)) + { + /* The case with equality matches all rows, so estimate it as 1.0. */ + if (iseq) + PG_RETURN_FLOAT8(1.0); + + /* Strict inequality matches nothing, so selectivity is 0.0. */ + PG_RETURN_FLOAT8(0.0); + } + /* * If expression is not variable op something or something op variable, * then punt and return a default estimate. @@ -4871,6 +4896,56 @@ get_restriction_variable(PlannerInfo *root, List *args, int varRelid, return false; } + +/* + * matching_restriction_variable + * Examine the args of a restriction clause to see if it's of the + * form (variable op variable) with the same variable on both sides. + * + * Inputs: + * root: the planner info + * args: clause argument list + * varRelid: see specs for restriction selectivity functions + * + * Returns true if the same variable is on both sides, otherwise false. + */ +static bool +matching_restriction_variables(PlannerInfo *root, List *args, int varRelid) +{ + Node *left, + *right; + VariableStatData ldata; + VariableStatData rdata; + bool res = false; + + /* Fail if not a binary opclause (probably shouldn't happen) */ + if (list_length(args) != 2) + return false; + + left = (Node *) linitial(args); + right = (Node *) lsecond(args); + + /* + * Examine both sides. Note that when varRelid is nonzero, Va
Re: Make pg_waldump report replication origin ID, LSN, and timestamp.
On Thu, Dec 09, 2021 at 05:42:56PM +0900, Masahiko Sawada wrote: > Thank you for updating the patch. The patch looks good to me. Done this way. Please note that while testing this patch, I have found a completely different issue. I'll spawn a thread about that in a minute.. -- Michael signature.asc Description: PGP signature
Re: Use extended statistics to estimate (Var op Var) clauses
On 8/31/21 00:14, Zhihong Yu wrote: Hi, For patch 0002, + s1 = statext_clauselist_selectivity(root, clauses, varRelid, + jointype, sjinfo, rel, + &estimatedclauses, false); + + estimated = (bms_num_members(estimatedclauses) == 1); I took a look at clauselist_apply_dependencies() (called by statext_clauselist_selectivity) where estimatedclauses is modified. Since the caller would not use the returned Selectivity if number of elements in estimatedclauses is greater than 1, I wonder if a parameter can be added to clauselist_apply_dependencies() which indicates early return if the second element is added to estimatedclauses. Hmmm, I'm not sure I understand your point. Are you suggesting there's a bug in not updating the bitmap, or would this be an optimization? Can you give an example of a query affected by this? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Defining (and possibly skipping) useless VACUUM operations
Robert Haas has written on the subject of useless vacuuming, here: http://rhaas.blogspot.com/2020/02/useless-vacuuming.html I'm sure at least a few of us have thought about the problem at some point. I would like to discuss how we can actually avoid useless vacuuming, and what our goals should be. I am currently working on decoupling advancing relfrozenxid from tuple freezing [1]. That is, I'm teaching VACUUM to keep track of information that it uses to generate an "optimal value" for the table's final relfrozenxid: the most recent XID value that might still be in the table. This patch is based on the observation that we don't actually have to use the FreezeLimit cutoff for our new pg_class.relfrozenxid. We need only obey the basic relfrozenxid invariant, which is that the final value must be <= any extant XID in the table. Using FreezeLimit is needlessly conservative. My draft patch to implement the optimization (which builds on the patches already posted to [1]) will reliably set pg_class.relfrozenxid to the same VACUUM's precise original OldestXmin once certain conditions are met -- reasonably common conditions. For example, the same precise OldestXmin XID is used for relfrozenxid in the event of a manual VACUUM (without FREEZE) on a table that was just bulk-loaded, assuming the system is otherwise idle. Setting relfrozenxid to the precise lowest safe value happens on a best-effort basis, without needlessly tying that to things like when or how we freeze tuples. It now occurs to me to push this patch in another direction, on top of all that: the OldestXmin behavior hints at a precise, robust way of defining "useless vacuuming". We can condition skipping a VACUUM (i.e. whether a VACUUM is considered "definitely won't be useful if allowed to execute") on whether or not our preexisting pg_class.relfrozenxid precisely equals our newly-acquired OldestXmin for an about-to-begin VACUUM operation. (We'd also want to add an "unchangeable pg_class.relminmxid" test, I think.) This definition does seem to be close to ideal: We're virtually assured that there will be no more useful work for us, in a way that is grounded in theory but still quite practical. But it's not a slam dunk. A person could still argue that we shouldn't cancel the VACUUM before it has begun, even when all these conditions have been met. This would not be a particularly strong argument, mind you, but it's still worth taking seriously. We need an exact problem statement that justifies whatever definition of "useless VACUUM" we settle on. Here are arguments *against* the skipping behavior I sketched out: * An aborted transaction might need to be cleaned up, which should be able to go ahead despite the unchanged OldestXmin. (I think that this is the argument with the most merit, by quite a bit.) * In general index AMs may want to do deferred cleanup, say to place previously deleted pages in the FSM. Although in practice the criteria for recycling safety used by nbtree and GiST will make that impossible, there is no fundamental reason why they need to work that way (XIDs are used, but only because they provide a conveniently available notion of "logical time" that is sufficient to implement what Lanin & Shasha call "the drain technique"). Plus GIN really could do real work in amvacuumcleanup, for the pending list. There are bound to be a handful of marginal things like this. * Who are we to intervene like this, anyway? (Makes much more sense if we don't limit ourselves to autovacuum worker operations.) Offhand, I suspect that we should only consider skipping "useless" anti-wraparound autovacuums (not other kinds of autovacuums, not manual VACUUMs). The arguments against skipping are weakest for the anti-wraparound case. And the arguments in favor are particularly strong: we should specifically avoid starting a useless (and possibly time-consuming) anti-wraparound autovacuum, because that could easily block an actually-useful autovacuum launched some time later. We should aim to be in a position to launch an anti-wraparound autovacuum that can actually advance relfrozenxid as soon as that becomes possible (e.g. when the DBA drops an old replication slot that was holding back each VACUUM's OldestXmin). And so "skipping" makes us much more responsive, which seems like it might matter a lot in practice. It minimizes the risk of wraparound failure. There is also a strong argument for logging our failure to clean up anything in any autovacuum -- we don't do nearly enough alerting when stuff like this happens (possibly because "useless" is such a squishy concept right now?). Just logging something still requires defining "useless VACUUM operation" in a way that is both reliable and proportionate. So just logging something necessitates solving that hard problem. [1] https://commitfest.postgresql.org/36/3433/ -- Peter Geoghegan
Re: Atomic rename feature for Windows.
On Thu, Dec 09, 2021 at 11:33:17PM -0500, Tom Lane wrote: > My general approach to platform compatibility is that when we > break compatibility with old versions of something, we should do so > because it will bring concrete benefits. If we can plausibly > drop support for Windows versions that don't have POSIX rename > semantics, I'm 100% for that. I'm not for dropping support for > some platform just because it's old. I'd agree with that. Now, I would also say if we need something that depends on a newer version of _WIN32_WINNT that proves to be trickier or even not possible for older versions, there could be an argument for dropping older versions, even in the back-branches, if the problem to-be-fixed is bad enough. In short history, we've never had to go down to that AFAIK, though. -- Michael signature.asc Description: PGP signature
Re: Logical replication error "no record found" /* shouldn't happen */
On 2021-Jul-23, Andrey Borodin wrote: > Hi! > > From time to time I observe $subj on clusters using logical replication. > I most of cases there are a lot of other errors. Probably $subj condition > should be kind of impossible without other problems. > I propose to enhance error logging of XLogReadRecord() in ReadPageInternal(). Hmm. A small problem in this patch is that XLogReaderValidatePageHeader already sets errormsg_buf; you're overwriting that. I suggest to leave that untouched. There are other two cases where the problem occurs in page_read() callback; ReadPageInternal explicitly documents that it doesn't set the error in that case. We have two options to deal with that: 1. change all existing callbacks to set the errormsg_buf depending on what actually fails, and then if they return failure without an error message, add something like your proposed message. 2. throw error directly in the callback rather than returning. I don't think this strategy actually works I attach a cut-down patch that doesn't deal with the page_read callbacks issue, just added stub comments in xlog.c where something should be done. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "I am amazed at [the pgsql-sql] mailing list for the wonderful support, and lack of hesitasion in answering a lost soul's question, I just wished the rest of the mailing list could be like this." (Fotis) (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d894af310a..83976cb014 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -12467,6 +12467,7 @@ retry: private->replayTLI, xlogreader->EndRecPtr)) { + /* XXX should this path set errormsg_buf? */ if (readFile >= 0) close(readFile); readFile = -1; @@ -12598,7 +12599,10 @@ next_record_is_invalid: if (StandbyMode) goto retry; else + { + /* XXX should set errormsg_buf here */ return -1; + } } /* diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 3a7de02565..5b61593820 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -650,14 +650,22 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) state->currRecPtr, state->readBuf); if (readLen < 0) + { + report_invalid_record(state, + "attempt to read page of next segment failed at %X/%X", + LSN_FORMAT_ARGS(targetSegmentPtr)); goto err; + } /* we can be sure to have enough WAL available, we scrolled back */ Assert(readLen == XLOG_BLCKSZ); if (!XLogReaderValidatePageHeader(state, targetSegmentPtr, state->readBuf)) + { + /* XLogReaderValidatePageHeader sets errormsg_buf */ goto err; + } } /* @@ -668,13 +676,18 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) state->currRecPtr, state->readBuf); if (readLen < 0) - goto err; + goto err; /* XXX errmsg? */ Assert(readLen <= XLOG_BLCKSZ); /* Do we have enough data to check the header length? */ if (readLen <= SizeOfXLogShortPHD) + { + report_invalid_record(state, + "unable to read short header of %d bytes at %X/%X", + readLen, LSN_FORMAT_ARGS(pageptr)); goto err; + } Assert(readLen >= reqLen); @@ -687,14 +700,17 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) state->currRecPtr, state->readBuf); if (readLen < 0) - goto err; + goto err; /* XXX errmsg */ } /* * Now that we know we have the full header, validate it. */ if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr)) + { + /* XLogReaderValidatePageHeader sets errormsg_buf */ goto err; + } /* update read state information */ state->seg.ws_segno = targetSegNo;
Re: [PATCH] pg_stat_toast
Hi, On 2021-12-13 00:00:23 +0100, Gunnar "Nick" Bluth wrote: > Regarding stats size; it adds one PgStat_BackendToastEntry > (PgStat_BackendAttrIdentifier + PgStat_ToastCounts, should be 56-64 bytes or > something in that ballpark) per TOASTable attribute, I can't see that make > any system break sweat ;-) That's actually a lot. The problem is that all the stats data for a database is loaded into private memory for each connection to that database, and that the stats collector regularly writes out all the stats data for a database. > A quick run comparing 1.000.000 INSERTs (2 TOASTable columns each) with and > without "pgstat_track_toast" resulted in 12792.882 ms vs. 12810.557 ms. So > at least the call overhead seems to be neglectible. Yea, you'd probably need a few more tables and a few more connections for it to have a chance of mattering meaningfully. Greetings, Andres Freund
Re: can we add subscription TAP test option "vcregress subscriptioncheck" for MSVC builds?
On Thu, Oct 21, 2021 at 7:21 AM Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > > Here's the documentation v1 patch that I've come up with. Please review it. > > There's a typo: + To run an arbitrary TAP test set, run vcregress taptest + comamnd. For example, use the following command for running subcription TAP + tests: s/comamnd/command/ But also the wording, I like better what vcregress prints as help, so something like: + You can use vcregress taptest TEST_DIR to run an + arbitrary TAP test set, where TEST_DIR is a required argument pointing to + the directory where the tests reside. For example, use the following + command for running the subcription TAP tests: Regards, Juan José Santamaría Flecha
Re: [PATCH] pg_stat_toast
Am 12.12.21 um 22:52 schrieb Andres Freund: Hi, On 2021-12-12 17:20:58 +0100, Gunnar "Nick" Bluth wrote: Please have a look at the attached patch, which implements some statistics for TOAST. The idea (and patch) have been lurking here for quite a while now, so I decided to dust it off, rebase it to HEAD and send it out for review today. A big shoutout to Georgios Kokolatos, who gave me a crash course in PG hacking, some very useful hints and valueable feedback early this year. I'd like to get some feedback about the general idea, approach, naming etc. before refining this further. I'm worried about the additional overhead this might impose. For some workload it'll substantially increase the amount of stats traffic. Have you tried to qualify the overheads? Both in stats size and in stats management overhead? I'd lie if I claimed so... Regarding stats size; it adds one PgStat_BackendToastEntry (PgStat_BackendAttrIdentifier + PgStat_ToastCounts, should be 56-64 bytes or something in that ballpark) per TOASTable attribute, I can't see that make any system break sweat ;-) A quick run comparing 1.000.000 INSERTs (2 TOASTable columns each) with and without "pgstat_track_toast" resulted in 12792.882 ms vs. 12810.557 ms. So at least the call overhead seems to be neglectible. Obviously, this was really a quick run and doesn't reflect real life. I'll have the machine run some reasonable tests asap, also looking at stat size, of course! Best regards, -- Gunnar "Nick" Bluth Eimermacherweg 106 D-48159 Münster Mobil +49 172 8853339 Email: gunnar.bl...@pro-open.de __ "Ceterum censeo SystemD esse delendam" - Cato
Re: extended stats on partitioned tables
On 12/12/21 22:32, Justin Pryzby wrote: On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote: The one thing bugging me a bit is that the regression test checks only a GROUP BY query. It'd be nice to add queries testing MCV/dependencies too, but that seems tricky because most queries will use per-partitions stats. You mean because the quals are pushed down to the scan node. Does that indicate a deficiency ? If extended stats are collected for a parent table, selectivity estimates based from the parent would be better; but instead we use uncorrected column estimates from the child tables. From what I see, we could come up with a way to avoid the pushdown, involving volatile functions/foreign tables/RLS/window functions/SRF/wholerow vars/etc. > But would it be better if extended stats objects on partitioned tables were to collect stats for both parent AND CHILD ? I'm not sure. Maybe that's the wrong solution, but maybe we should still document that extended stats on (empty) parent tables are often themselves not used/useful for selectivity estimates, and the user should instead (or in addition) create stats on child tables. Or, maybe if there's no extended stats on the child tables, stats on the parent table should be consulted ? Maybe, but that seems like a mostly separate improvement. At this point I'm interested only in testing the behavior implemented in the current patches. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH] pg_stat_toast
Hi, On 2021-12-12 17:20:58 +0100, Gunnar "Nick" Bluth wrote: > Please have a look at the attached patch, which implements some statistics > for TOAST. > > The idea (and patch) have been lurking here for quite a while now, so I > decided to dust it off, rebase it to HEAD and send it out for review today. > > A big shoutout to Georgios Kokolatos, who gave me a crash course in PG > hacking, some very useful hints and valueable feedback early this year. > > I'd like to get some feedback about the general idea, approach, naming etc. > before refining this further. I'm worried about the additional overhead this might impose. For some workload it'll substantially increase the amount of stats traffic. Have you tried to qualify the overheads? Both in stats size and in stats management overhead? Greetings, Andres Freund
Re: extended stats on partitioned tables
On Sun, Dec 12, 2021 at 10:29:39PM +0100, Tomas Vondra wrote: > On 12/12/21 18:52, Justin Pryzby wrote: > That's mostly a conscious choice, so that I don't have to include > parsetree.h. But maybe that'd be better ... > > > The regression tests changed as a result of not populating stx_data; I think > > it's may be better to update like this: > > > > SELECT stxname, stxdndistinct, stxddependencies, stxdmcv, stxoid IS NULL > > FROM pg_statistic_ext s LEFT JOIN pg_statistic_ext_data d > > ON d.stxoid = s.oid > > WHERE s.stxname = 'ab1_a_b_stats'; > > > > Not sure I understand. Why would this be better than inner join? It shows that there's an entry in pg_statistic_ext and not one in ext_data, rather than that it's not in at least one of the catalogs. Which is nice to show since as you say it's no longer 1:1. > Thanks, fixed. Can you read through the commit messages and check the > attribution is correct for all the patches? Seems fine. -- Justin
Re: extended stats on partitioned tables
On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote: > The one thing bugging me a bit is that the regression test checks only a > GROUP BY query. It'd be nice to add queries testing MCV/dependencies > too, but that seems tricky because most queries will use per-partitions > stats. You mean because the quals are pushed down to the scan node. Does that indicate a deficiency ? If extended stats are collected for a parent table, selectivity estimates based from the parent would be better; but instead we use uncorrected column estimates from the child tables. >From what I see, we could come up with a way to avoid the pushdown, involving volatile functions/foreign tables/RLS/window functions/SRF/wholerow vars/etc. But would it be better if extended stats objects on partitioned tables were to collect stats for both parent AND CHILD ? I'm not sure. Maybe that's the wrong solution, but maybe we should still document that extended stats on (empty) parent tables are often themselves not used/useful for selectivity estimates, and the user should instead (or in addition) create stats on child tables. Or, maybe if there's no extended stats on the child tables, stats on the parent table should be consulted ? -- Justin
Re: Windows now has fdatasync()
On Sun, Dec 12, 2021 at 3:48 PM Thomas Munro wrote: > [...] I > tried out a quick POC patch and it runs a bit faster than fsync(), as > expected. I'm not sure if it's worth bothering with or not given the > other options, but figured it was worth sharing. One reason to consider developing this further is the problem discussed in the aptly named thread "Loaded footgun open_datasync on Windows"[1] (not the problem that was fixed in pg_test_fsync, but the problem with cache control, or lack thereof). I saw 10x more TPS with open_datasync than with this experimental fdatasync on my little test VM, which was more than a little fishy, so I turned off the device write caching in "Device Manager" and got about the same number from open_datasync and fdatasync. Clearly you can lose committed transactions on power loss[2] with the default OS settings and default PostgreSQL settings, though perhaps only if you're on SATA storage due to lack of FUA pass-through[3] (?). I didn't try an NVMe stack. That suggests that fdatasync would actually be a better default ... except for the problems already mentioned with versions and not working on non-NTFS (not sure how it fails on non-NTFS, though, maybe it does a full flush, [4] doesn't say). (The patch is a little rough; I couldn't figure out the headers to get that macro. Insert my usual disclaimer that I'm not a Windows guy, this is stuff I'm just figuring out, all clues welcome...) [1] https://www.postgresql.org/message-id/flat/1527846213.2475.31.camel%40cybertec.at [2] https://github.com/MicrosoftDocs/feedback/issues/2747 [3] https://devblogs.microsoft.com/oldnewthing/20170510-00/?p=95505 [4] https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-ntflushbuffersfileex
Re: extended stats on partitioned tables
On 12/12/21 18:52, Justin Pryzby wrote: > + * XXX Can't we simply look at rte->inh? > + */ > + inh = root->append_rel_array == NULL ? false : > + root->append_rel_array[onerel->relid]->parent_relid != 0; > > I think so. That's what I came up with while trying to figured this out, and > it's no great surprise that it needed to be cleaned up - thanks. > OK, fixed. > In your 0003 patch, the "if inh: break" isn't removed from examine_variable(), > but the corresponding thing is removed everywhere else. > Ah, you're right. And it wasn't updated in the 0002 patch either - it should do the relkind check too, to allow partitioned tables. Fixed. > In 0003, mcv_clauselist_selectivity still uses simple_rte_array rather than > rt_fetch. > That's mostly a conscious choice, so that I don't have to include parsetree.h. But maybe that'd be better ... > The regression tests changed as a result of not populating stx_data; I think > it's may be better to update like this: > > SELECT stxname, stxdndistinct, stxddependencies, stxdmcv, stxoid IS NULL > FROM pg_statistic_ext s LEFT JOIN pg_statistic_ext_data d > ON d.stxoid = s.oid > WHERE s.stxname = 'ab1_a_b_stats'; > Not sure I understand. Why would this be better than inner join? > There's this part about documentation for the changes in backbranches: > > On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: >> Also, I think in backbranches we should document what's being stored in >> pg_statistic_ext, since it's pretty unintuitive: >> - noninherted stats (FROM ONLY) for inheritence parents; >> - inherted stats (FROM *) for partitioned tables; > > spellcheck: inheritence should be inheritance. > Thanks, fixed. Can you read through the commit messages and check the attribution is correct for all the patches? > All for now. I'm going to update the regression tests for dependencies and > the > other code paths. > Thanks! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 77bd34660d15d6ba03538c6fe45a196a162c196c Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Sun, 12 Dec 2021 02:28:41 +0100 Subject: [PATCH 4/4] Refactor parent ACL check selfuncs.c is 8k lines long, and this makes it 30 LOC shorter. --- src/backend/utils/adt/selfuncs.c | 140 --- 1 file changed, 52 insertions(+), 88 deletions(-) diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index bf4525392d..e4be9fc95d 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -187,6 +187,8 @@ static char *convert_string_datum(Datum value, Oid typid, Oid collid, bool *failure); static double convert_timevalue_to_scalar(Datum value, Oid typid, bool *failure); +static void recheck_parent_acl(PlannerInfo *root, VariableStatData *vardata, +Oid relid); static void examine_simple_variable(PlannerInfo *root, Var *var, VariableStatData *vardata); static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata, @@ -5153,51 +5155,7 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid, (pg_class_aclcheck(rte->relid, userid, ACL_SELECT) == ACLCHECK_OK); -/* - * If the user doesn't have permissions to - * access an inheritance child relation, check - * the permissions of the table actually - * mentioned in the query, since most likely - * the user does have that permission. Note - * that whole-table select privilege on the - * parent doesn't quite guarantee that the - * user could read all columns of the child. - * But in practice it's unlikely that any - * interesting security violation could result - * from allowing access to the expression - * index's stats, so we allow it anyway. See - * similar code in examine_simple_variable() - * for additional comments. - */ -if (!vardata->acl_ok && - root->append_rel_array != NULL) -{ - AppendRelInfo *appinfo; - Index varno = index->rel->relid; - - appinfo = root->append_rel_array[varno]; - while (appinfo && - planner_rt_fetch(appinfo->parent_relid, - root)->rtekind == RTE_RELATION) - { - varno = appinfo->parent_relid; - appinfo = root->append_rel_array[varno]; - } - if (varno != index->rel->relid) - { - /* Repeat access check on this rel */ - rte = planner_rt_fetch(varno, root); - Assert(rte->rtekind == RTE_RELATION); - - userid = rte->checkAsUser ? rte->checkAsUser : GetUserId(); - - vardata->acl_ok = - rte->securityQuals == NIL && - (pg_class_aclcheck(rte->relid, - userid, - ACL_SELECT) == ACLCHECK_OK); - } -
Re: extended stats on partitioned tables
Tomas Vondra writes: > On 12/12/21 16:37, Zhihong Yu wrote: >> Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking >> inh, I think it would be better if the above style of checking is used >> throughout the patch (without introducing rte variable). > It's mostly a matter of personal taste, but I always found this style of > condition (i.e. dereferencing a pointer returned by a function) much > less readable. It's hard to parse what exactly is happening, what struct > type are we dealing with, etc. YMMV but the separate variable makes it > much clearer for me. And I'd expect the compilers to produce pretty much > the same code too for those cases. FWIW, I agree. Also, it's possible that future patches would create a need to touch the RTE again nearby, in which case having the variable makes it easier to write non-crummy code for that. regards, tom lane
Re: extended stats on partitioned tables
+ * XXX Can't we simply look at rte->inh? + */ + inh = root->append_rel_array == NULL ? false : + root->append_rel_array[onerel->relid]->parent_relid != 0; I think so. That's what I came up with while trying to figured this out, and it's no great surprise that it needed to be cleaned up - thanks. In your 0003 patch, the "if inh: break" isn't removed from examine_variable(), but the corresponding thing is removed everywhere else. In 0003, mcv_clauselist_selectivity still uses simple_rte_array rather than rt_fetch. The regression tests changed as a result of not populating stx_data; I think it's may be better to update like this: SELECT stxname, stxdndistinct, stxddependencies, stxdmcv, stxoid IS NULL FROM pg_statistic_ext s LEFT JOIN pg_statistic_ext_data d ON d.stxoid = s.oid WHERE s.stxname = 'ab1_a_b_stats'; There's this part about documentation for the changes in backbranches: On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote: > Also, I think in backbranches we should document what's being stored in > pg_statistic_ext, since it's pretty unintuitive: > - noninherted stats (FROM ONLY) for inheritence parents; > - inherted stats (FROM *) for partitioned tables; spellcheck: inheritence should be inheritance. All for now. I'm going to update the regression tests for dependencies and the other code paths. -- Justin
Re: extended stats on partitioned tables
On 12/12/21 16:37, Zhihong Yu wrote: Hi, For patch 1, minor comment: + if (planner_rt_fetch(onerel->relid, root)->inh) Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking inh, I think it would be better if the above style of checking is used throughout the patch (without introducing rte variable). It's mostly a matter of personal taste, but I always found this style of condition (i.e. dereferencing a pointer returned by a function) much less readable. It's hard to parse what exactly is happening, what struct type are we dealing with, etc. YMMV but the separate variable makes it much clearer for me. And I'd expect the compilers to produce pretty much the same code too for those cases. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On 12/12/21 14:47, Zhihong Yu wrote: On Sat, Dec 11, 2021 at 9:14 PM Tomas Vondra mailto:tomas.von...@enterprisedb.com>> wrote: > ... > + /* skip statistics with mismatching stxdinherit value */ > + if (stat->inherit != rte->inh) > > Should a log be added for the above case ? > Why should we log this? It's an entirely expected case - there's a mismatch between inheritance for the relation and statistics, simply skipping it is the right thing to do. Hi, I agree that skipping should be fine (to avoid too much logging). I'm not sure it's related to the amount of logging, really. It'd be just noise without any practical use, even for debugging purposes. If you have an inheritance tree, it'll automatically have one set of statistics for inh=true and one for inh=false. And this condition will always skip one of those, depending on what query is being estimated. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[PATCH] pg_stat_toast
Hello -hackers! Please have a look at the attached patch, which implements some statistics for TOAST. The idea (and patch) have been lurking here for quite a while now, so I decided to dust it off, rebase it to HEAD and send it out for review today. A big shoutout to Georgios Kokolatos, who gave me a crash course in PG hacking, some very useful hints and valueable feedback early this year. I'd like to get some feedback about the general idea, approach, naming etc. before refining this further. I'm not a C person and I s**k at git, so please be kind with me! ;-) Also, I'm not subscribed here, so a CC would be much appreciated! Why gather TOAST statistics? TOAST is transparent and opaque at the same time. Whilst we know that it's there and we know _that_ it works, we cannot generally tell _how well_ it works. What we can't answer (easily) are questions like e.g. - how many datums have been externalized? - how many datums have been compressed? - how often has a compression failed (resulted in no space saving)? - how effective is the compression algorithm used on a column? - how much time did the DB spend compressing/decompressing TOAST values? The patch adds some functionality that will eventually be able to answer these (and probably more) questions. Currently, #1 - #4 can be answered based on the view contained in "pg_stats_toast.sql": postgres=# CREATE TABLE test (i int, lz4 text COMPRESSION lz4, std text); postgres=# INSERT INTO test SELECT i,repeat(md5(i::text),100),repeat(md5(i::text),100) FROM generate_series(0,10) x(i); postgres=# SELECT * FROM pg_stat_toast WHERE schemaname = 'public'; -[ RECORD 1 ]+-- schemaname | public reloid | 16829 attnum | 2 relname | test attname | lz4 externalizations | 0 compressions | 11 compressionsuccesses | 11 compressionsizesum | 6299710 originalsizesum | 320403204 -[ RECORD 2 ]+-- schemaname | public reloid | 16829 attnum | 3 relname | test attname | std externalizations | 0 compressions | 11 compressionsuccesses | 11 compressionsizesum | 8198819 originalsizesum | 320403204 Implementation == I added some callbacks in backend/access/table/toast_helper.c to "pgstat_report_toast_activity" in backend/postmaster/pgstat.c. The latter (and the other additions there) are essentially 1:1 copies of the function statistics. Those were the perfect template, as IMHO the TOAST activities (well, what we're interested in at least) are very much comparable to function calls: a) It doesn't really matter if the TOASTed data was committed, as "the damage is done" (i.e. CPU cycles were used) anyway b) The information can (thus/best) be stored on DB level, no need to touch the relation or attribute statistics I didn't find anything that could have been used as a hash key, so the PgStat_StatToastEntry uses the shiny new PgStat_BackendAttrIdentifier (containing relid Oid, attr int). For persisting in the statsfile, I chose the identifier 'O' (as 'T' was taken). What's working? === - Gathering of TOAST externalization and compression events - collecting the sizes before and after compression - persisting in statsfile - not breaking "make check" - not crashing anything (afaict) What's missing (yet)? === - proper definition of the "pgstat_track_toast" GUC - Gathering of times (for compression [and decompression?]) - improve "pg_stat_toast" view and include it in the catalog - documentation (obviously) - proper naming (of e.g. the hash key type, functions, view columns etc.) - would it be necessary to implement overflow protection for the size & time sums? Thanks in advance & best regards, -- Gunnar "Nick" Bluth Eimermacherweg 106 D-48159 Münster Mobil +49 172 8853339 Email: gunnar.bl...@pro-open.de __ "Ceterum censeo SystemD esse delendam" - CatoFrom 05f81229fd2e81b8674649c8fd1e3857e2406fcd Mon Sep 17 00:00:00 2001 From: "Gunnar \"Nick\" Bluth" Date: Sun, 12 Dec 2021 15:35:35 +0100 Subject: [PATCH] initial patch of pg_stat_toast for -hackers --- pg_stat_toast.sql | 19 ++ src/backend/access/table/toast_helper.c | 19 ++ src/backend/postmaster/pgstat.c | 311 +++- src/backend/utils/adt/pgstatfuncs.c | 60 + src/include/catalog/pg_proc.dat | 21 ++ src/include/pgstat.h| 109 + 6 files changed, 533 insertions(+), 6 deletions(-) create mode 100644 pg_stat_toast.sql diff --git a/pg_stat_toast.sql b/pg_stat_toast.sql new file mode 100644 index 00..1c653254ab --- /dev/null +++ b/pg_stat_toast.sql @@ -0,0 +1,19 @@ +-- This creates a useable view, but the offset of 1 is annoying. +-- T
Re: extended stats on partitioned tables
On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra wrote: > Hi, > > Attached is a rebased and cleaned-up version of these patches, with more > comments, refactorings etc. Justin and Zhihong, can you take a look? > > > 0001 - Ignore extended statistics for inheritance trees > > 0002 - Build inherited extended stats on partitioned tables > > Those are mostly just Justin's patches, with more detailed comments and > updated commit message. I've considered moving the rel->inh check to > statext_clauselist_selectivity, and then removing the check from > dependencies and MCV. But I decided no to do that, because someone might > be calling those functions directly (even if that's very unlikely). > > The one thing bugging me a bit is that the regression test checks only a > GROUP BY query. It'd be nice to add queries testing MCV/dependencies > too, but that seems tricky because most queries will use per-partitions > stats. > > > 0003 - Add stxdinherit flag to pg_statistic_ext_data > > This is the patch for master, allowing to build stats for both inherits > flag values. It adds the flag to pg_stats_ext_exprs view to, reworked > how we deal with iterating both flags etc. I've adopted most of the > Justin's fixup patches, except that in plancat.c I've refactored how we > load the stats to process keys/expressions just once. > > It has the same issue with regression test using just a GROUP BY query, > but if we add a test to 0001/0002, that'll fix this too. > > > 0004 - Refactor parent ACL check > > Not sure about this - I doubt saving 30 rows in an 8kB file is really > worth it. Maybe it is, but then maybe we should try cleaning up the > other ACL checks in this file too? Seems mostly orthogonal to this > thread, though. > > Hi, For patch 1, minor comment: + if (planner_rt_fetch(onerel->relid, root)->inh) Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking inh, I think it would be better if the above style of checking is used throughout the patch (without introducing rte variable). Cheers > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
Re: extended stats on partitioned tables
On Sat, Dec 11, 2021 at 9:14 PM Tomas Vondra wrote: > > > On 12/12/21 05:38, Zhihong Yu wrote: > > > > > > On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra > > mailto:tomas.von...@enterprisedb.com>> > > wrote: > > > > Hi, > > > > Attached is a rebased and cleaned-up version of these patches, with > more > > comments, refactorings etc. Justin and Zhihong, can you take a look? > > > > > > 0001 - Ignore extended statistics for inheritance trees > > > > 0002 - Build inherited extended stats on partitioned tables > > > > Those are mostly just Justin's patches, with more detailed comments > and > > updated commit message. I've considered moving the rel->inh check to > > statext_clauselist_selectivity, and then removing the check from > > dependencies and MCV. But I decided no to do that, because someone > might > > be calling those functions directly (even if that's very unlikely). > > > > The one thing bugging me a bit is that the regression test checks > only a > > GROUP BY query. It'd be nice to add queries testing MCV/dependencies > > too, but that seems tricky because most queries will use > per-partitions > > stats. > > > > > > 0003 - Add stxdinherit flag to pg_statistic_ext_data > > > > This is the patch for master, allowing to build stats for both > inherits > > flag values. It adds the flag to pg_stats_ext_exprs view to, reworked > > how we deal with iterating both flags etc. I've adopted most of the > > Justin's fixup patches, except that in plancat.c I've refactored how > we > > load the stats to process keys/expressions just once. > > > > It has the same issue with regression test using just a GROUP BY > query, > > but if we add a test to 0001/0002, that'll fix this too. > > > > > > 0004 - Refactor parent ACL check > > > > Not sure about this - I doubt saving 30 rows in an 8kB file is really > > worth it. Maybe it is, but then maybe we should try cleaning up the > > other ACL checks in this file too? Seems mostly orthogonal to this > > thread, though. > > > > > > Hi, > > For patch 3, in commit message: > > > > and there no clear winner. -> and there is no clear winner. > > > > and it seem wasteful -> and it seems wasteful > > > > The there may be -> There may be > > > > Thanks, will fix. > > > + /* skip statistics with mismatching stxdinherit value */ > > + if (stat->inherit != rte->inh) > > > > Should a log be added for the above case ? > > > > Why should we log this? It's an entirely expected case - there's a > mismatch between inheritance for the relation and statistics, simply > skipping it is the right thing to do. > Hi, I agree that skipping should be fine (to avoid too much logging). Thanks > > +* Determine if we'redealing with inheritance tree. > > > > There should be a space between re and dealing. > > > > Thanks, will fix. > > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
回复:Re: Re: 回复:Re: Is it worth pushing conditions to sublink/subplan?
--原始邮件 -- 发件人:Zhihong Yu 发送时间:Sun Dec 12 01:13:11 2021 收件人:曾文旌(义从) 抄送:Tomas Vondra , wjzeng , PostgreSQL Hackers , shawn wang , ggys...@gmail.com 主题:Re: Re: 回复:Re: Is it worth pushing conditions to sublink/subplan? On Sat, Dec 11, 2021 at 7:31 AM 曾文旌(义从) wrote: --原始邮件 -- 发件人:Tomas Vondra 发送时间:Wed Dec 8 11:26:35 2021 收件人:曾文旌(义从) , shawn wang , ggys...@gmail.com , PostgreSQL Hackers 抄送:wjzeng 主题:Re: 回复:Re: Is it worth pushing conditions to sublink/subplan? Hi, On 12/7/21 10:44, 曾文旌(义从) wrote: > Hi Hackers > > For my previous proposal, I developed a prototype and passed > regression testing. It works similarly to subquery's qual pushdown. > We know that sublink expands at the beginning of each level of > query. At this stage, The query's conditions and equivalence classes > are not processed. But after generate_base_implied_equalities the > conditions are processed, which is why qual can push down to > subquery but sublink not. > > My POC implementation chose to delay the sublink expansion in the > SELECT clause (targetList) and where clause. Specifically, it is > delayed after generate_base_implied_equalities. Thus, the equivalent > conditions already established in the Up level query can be easily > obtained in the sublink expansion process (make_subplan). > > For example, if the up level query has a.id = 10 and the sublink > query has a.id = b.id, then we get b.id = 10 and push it down to the > sublink quey. If b is a partitioned table and is partitioned by id, > then a large number of unrelated subpartitions are pruned out, This > optimizes a significant amount of Planner and SQL execution time, > especially if the partitioned table has a large number of > subpartitions and is what I want. > > Currently, There were two SQL failures in the regression test, > because the expansion order of sublink was changed, which did not > affect the execution result of SQL. > > Look forward to your suggestions on this proposal. > I took a quick look, and while I don't see / can't think of any problems with delaying it until after generating implied equalities, there seems to be a number of gaps. Thank you for your attention. 1) Are there any regression tests exercising this modified behavior? Maybe there are, but if the only changes are due to change in order of targetlist entries, that doesn't seem like a clear proof. It'd be good to add a couple tests exercising both the positive and negative case (i.e. when we can and can't pushdown a qual). I added several samples to the regress(qual_pushdown_to_sublink.sql). and I used the partition table to show the plan status of qual being pushed down into sublink. Hopefully this will help you understand the details of this patch. Later, I will add more cases. 2) apparently, contrib/postgres_fdw does crash like this: #3 0x0077b412 in adjust_appendrel_attrs_mutator (node=0x13f7ea0, context=0x7fffc3351b30) at appendinfo.c:470 470 Assert(!IsA(node, SubLink)); (gdb) p node $1 = (Node *) 0x13f7ea0 (gdb) p *node $2 = {type = T_SubLink} Backtrace attached. For the patch attached in the last email, I passed all the tests under src/test/regress. As you pointed out, there was a problem with regression under contrib(in contrib/postgres_fdw). This time I fixed it and the current patch (V2) can pass the check-world. 3) various parts of the patch really need at least some comments, like: - try_push_outer_qual_to_sublink_query really needs some docs - new stuff at the end of initsplan.c Ok, I added some comments and will add more. If you have questions about any details, please point them out directly. 4) generate_base_implied_equalities shouldn't this if (ec->ec_processed) ; really be? if (ec->ec_processed) continue; You are right. I fixed it. 5) I'm not sure why we need the new ec_processed flag. I did this to eliminate duplicate equalities from the two generate_base_implied_equalities calls 1) I need the base equivalent expression generated after generate_base_implied_equalities, which is used to pushdown qual to sublink(lazy_process_sublinks) 2) The expansion of sublink may result in an equivalent expression with parameters, such as a = $1, which needs to deal with the equivalence classes again. 3) So, I added ec_processed and asked to process it again (generate_base_implied_equalities) after the equivalence class changed (add_eq_member/process_equivalence). Maybe you have a better suggestion, please let me know. 6) So we now have lazy_process_sublink callback? Does that mean we expand sublinks in two places - sometimes lazily, sometimes not? Yes, not all sublink is delayed. Let me explain this: 1) I added a GUC switch enable_lazy_process_sublink. If it is turned off, all lazy process sublink will not happen, qual pushdown to sublink depend on lazy procee sublink, which m
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera wrote: > > On 2021-Dec-09, Bharath Rupireddy wrote: > > > On Thu, Dec 9, 2021 at 6:00 PM Alvaro Herrera > > wrote: > > > > > > On 2021-Dec-09, Bharath Rupireddy wrote: > > > > > > > I just want to call this out: an insertion of 10 million rows in the > > > > primary generates a bunch of messages in the standby [1] within 40 sec > > > > (size of the standby server log grows by 5K). > > > > > > Hmm, that does sound excessive to me in terms of log bloat increase. > > > Remember the discussion about turning log_checkpoints on by default? > > > > The amount of LOG messages generated when the log_checkpoints GUC is > > set to on, are quite less, hardly 4 messages per-checkpoint (5min). I > > think enabling log_checkpoints is still acceptable as most of the > > hackers agreed in another thread [1] that the advantages with it are > > more and it doesn't seem to be bloating the server logs (in a day at > > max 1152 messages). > > My point is that it was argued, in that thread, that setting > log_checkpoints to on by default would cause too much additional log > volume. That argument was shot down, but in this thread we're arguing > that we want five kilobytes of messages in forty seconds. That sounds > much less acceptable to me, so making it default behavior or > unconditional behavior is not going to fly very high. > > > I'm not sure if it is worth having a GUC log_recovery if enabled the > > recovery messages can be emitted at LOG level otherwise DEBUG1 level. > > Maybe some tunable like > log_wal_traffic = { none, medium, high } > where "none" is current behavior of no noise, "medium" gets (say) once > every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets > you one message per segment. Yeah, that can be an option. Another idea could be to use the infrastructure laid out by the commit 9ce346e [1]. With ereport_startup_progress, we can emit the LOGs(of current recovering WAL file) for every log_startup_progress_interval seconds/milliseconds. One problem is that ereport_startup_progress doesn't work on StandbyMode, maybe we can remove this restriction unless we have a major reason for not allowing it on the standby. /* Prepare to report progress of the redo phase. */ if (!StandbyMode) begin_startup_progress_phase(); Thoughts? [1] commit 9ce346eabf350a130bba46be3f8c50ba28506969 Author: Robert Haas Date: Mon Oct 25 11:51:57 2021 -0400 Report progress of startup operations that take a long time. Discussion: https://postgr.es/m/ca+tgmoahqrgdfobwgy16xcomtxxsrvgfb2jncvb7-ubuee1...@mail.gmail.com Discussion: https://postgr.es/m/camm1awahf7ve69572_olq+mgpt5ruiudgf1x5rrtkjbldpr...@mail.gmail.com Regards, Bharath Rupireddy.
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma wrote: >> >> Logfile Snippet: >> 2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file >> "base/116398/116400": No such file or directory >> 2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151) >> was terminated by signal 6: Aborted >> 2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active >> server processes > > > This is different from the issue you raised earlier. As Dilip said, we need > to unregister sync requests for files that got successfully copied to the > target database, but the overall alter database statement failed. We are > doing this when the database is created successfully, but not when it fails. > Probably doing the same inside the cleanup function movedb_failure_callback() > should fix the problem. Correct, I have done this cleanup, apart from this we have dropped the fsyc request in create database failure case as well and also need to drop buffer in error case of creatdb as well as movedb. I have also fixed the other issue for which you gave the patch (a bit differently) basically, in case of movedb the source and destination dboid are same so we don't need an additional parameter and also readjusted the conditions to avoid nested if. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com From eb27d159bc2aa41011b6352a514c7981a19a64dc Mon Sep 17 00:00:00 2001 From: Dilip Kumar Date: Wed, 1 Sep 2021 14:06:29 +0530 Subject: [PATCH v8 1/7] Refactor relmap load and relmap write functions Currently, write_relmap_file and load_relmap_file are tightly coupled with shared_map and local_map. As part of the higher level patch set we need remap read/write interfaces that are not dependent upon shared_map and local_map, and we should be able to pass map memory as an external parameter instead. --- src/backend/utils/cache/relmapper.c | 163 ++-- 1 file changed, 99 insertions(+), 64 deletions(-) diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c index a6e38ad..bb39632 100644 --- a/src/backend/utils/cache/relmapper.c +++ b/src/backend/utils/cache/relmapper.c @@ -136,6 +136,12 @@ static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode, bool add_okay); static void merge_map_updates(RelMapFile *map, const RelMapFile *updates, bool add_okay); +static void read_relmap_file(char *mapfilename, RelMapFile *map, + bool lock_held); +static void write_relmap_file_internal(char *mapfilename, RelMapFile *newmap, + bool write_wal, bool send_sinval, + bool preserve_files, Oid dbid, Oid tsid, + const char *dbpath); static void load_relmap_file(bool shared, bool lock_held); static void write_relmap_file(bool shared, RelMapFile *newmap, bool write_wal, bool send_sinval, bool preserve_files, @@ -687,36 +693,19 @@ RestoreRelationMap(char *startAddress) } /* - * load_relmap_file -- load data from the shared or local map file + * read_relmap_file -- read data from given mapfilename file. * * Because the map file is essential for access to core system catalogs, * failure to read it is a fatal error. - * - * Note that the local case requires DatabasePath to be set up. */ static void -load_relmap_file(bool shared, bool lock_held) +read_relmap_file(char *mapfilename, RelMapFile *map, bool lock_held) { - RelMapFile *map; - char mapfilename[MAXPGPATH]; pg_crc32c crc; int fd; int r; - if (shared) - { - snprintf(mapfilename, sizeof(mapfilename), "global/%s", - RELMAPPER_FILENAME); - map = &shared_map; - } - else - { - snprintf(mapfilename, sizeof(mapfilename), "%s/%s", - DatabasePath, RELMAPPER_FILENAME); - map = &local_map; - } - - /* Read data ... */ + /* Open the relmap file for reading. */ fd = OpenTransientFile(mapfilename, O_RDONLY | PG_BINARY); if (fd < 0) ereport(FATAL, @@ -779,62 +768,50 @@ load_relmap_file(bool shared, bool lock_held) } /* - * Write out a new shared or local map file with the given contents. - * - * The magic number and CRC are automatically updated in *newmap. On - * success, we copy the data to the appropriate permanent static variable. - * - * If write_wal is true then an appropriate WAL message is emitted. - * (It will be false for bootstrap and WAL replay cases.) - * - * If send_sinval is true then a SI invalidation message is sent. - * (This should be true except in bootstrap case.) - * - * If preserve_files is true then the storage manager is warned not to - * delete the files listed in the map. + * load_relmap_file -- load data from the shared or local map file * - * Because this may be called during WAL replay when MyDatabaseId, - * DatabasePath, etc aren't valid, we require the caller to pass in suitable - * values. The caller is also responsible for being sure no concurrent - * map update could be happening. + * Note that the local case requires DatabasePath t