Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Thu, Jan 18, 2024 at 4:16 AM torikoshia wrote:
> On 2024-01-18 10:10, jian he wrote:
> > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada wrote:
> >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane wrote:
> >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> >> > You will need a separate parameter anyway to specify the destination
> >> > of "log", unless "none" became an illegal table name when I wasn't
> >> > looking.  I don't buy that one parameter that has some special values
> >> > while other values could be names will be a good design.  Moreover,
> >> > what if we want to support (say) log-to-file along with log-to-table?
> >> > Trying to distinguish a file name from a table name without any other
> >> > context seems impossible.
> >>
> >> I've been thinking we can add more values to this option to log errors
> >> not only to the server logs but also to the error table (not sure
> >> details but I imagined an error table is created for each table on
> >> error), without an additional option for the destination name. The
> >> values would be like error_action {error|ignore|save-logs|save-table}.
> >>
> >
> > another idea:
> > on_error {error|ignore|other_future_option}
> > if not specified then by default ERROR.
> > You can also specify ERROR or IGNORE for now.
> >
> > I agree, the parameter "error_action" is better than "location".
>
> I'm not sure whether error_action or on_error is better, but either way
> "error_action error" and "on_error error" seems a bit odd to me.
> I feel "stop" is better for both cases as Tom suggested.

OK. What about this?

    on_error {stop|ignore|other_future_option}

where other_future_option might be compound like "file 'copy.log'" or "table 'copy_log'".

--
Regards,
Alexander Korotkov
Re: Make all Perl warnings fatal
On 16.01.24 12:08, Bharath Rupireddy wrote:

On Fri, Jan 12, 2024 at 9:21 PM Bharath Rupireddy wrote:

On Fri, Jan 12, 2024 at 9:03 PM Peter Eisentraut wrote: I would put this code

    my $core = $ret & 128 ? " (core dumped)" : "";
    die "psql exited with signal "
      . ($ret & 127)
      . "$core: '$$stderr' while running '@psql_params'"
      if $ret & 127;
    $ret = $ret >> 8;

inside a if (defined $ret) block. Then the behavior would be that the whole function returns undef on timeout, which is usefully different from returning 0 (and matches previous behavior).

WFM. I've attached a patch for the above change.

Committed, thanks.
Re: [17] CREATE SUBSCRIPTION ... SERVER
On Tue, 2024-01-16 at 09:23 +0530, Bharath Rupireddy wrote: > 1. > May be a more descriptive note is > worth here instead of just saying "Load the library providing us > libpq calls."? OK, will be in the next patch set. > 2. Why not typedef keyword before the ConnectionOption structure? Agreed. An earlier unpublished iteration had the struct more localized, but here it makes more sense to be typedef'd. > 3. > +static const struct ConnectionOption * > +libpqrcv_conninfo_options(void) > > Why is libpqrcv_conninfo_options returning the const > ConnectionOption? I did that so I could save the result, and each subsequent call would be free (just returning the same pointer). That also means that the caller doesn't need to free the result, which would require another entry point in the API. > Is it that we don't expect callers to modify the result? I think it's > not needed given the fact that PQconndefaults doesn't constify the > return value. The result of PQconndefaults() can change from call to call when the defaults change. libpqrcv_conninfo_options() only depends on the available option names (and dispchar), which should be a static list. > 4. > + /* skip options that must be overridden */ > + if (strcmp(option, "client_encoding") == 0) > + return false; > + > > Options that must be overriden or disallow specifiing > "client_encoding" in the SERVER/USER MAPPING definition (just like > the > dblink)? I'm not quite sure of your question, but I'll try to improve the comment. > 5. > "By using the correct libpq options, it no longer needs to be > deprecated, and can be used by the upcoming pg_connection_fdw." > > Use of postgresql_fdw_validator for pg_connection_fdw seems a bit odd > to me. I don't mind pg_connection_fdw having its own validator > pg_connection_fdw_validator even if it duplicates the code. 
To avoid > code duplication we can move the guts to an internal function in > foreign.c so that both postgresql_fdw_validator and > pg_connection_fdw_validator can use it. This way the code is cleaner > and we can just leave postgresql_fdw_validator as deprecated. Will do so in the next patch set. Thank you for taking a look. Regards, Jeff Davis
Re: Catalog domain not-null constraints
On 17.01.24 13:15, vignesh C wrote:
> One of the test has failed in CFBot at [1] with:
>
> diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/domain.out /tmp/cirrus-ci-build/src/test/regress/results/domain.out
> --- /tmp/cirrus-ci-build/src/test/regress/expected/domain.out 2024-01-14 15:40:01.793434601 +
> +++ /tmp/cirrus-ci-build/src/test/regress/results/domain.out 2024-01-14 15:42:23.013332625 +
> @@ -1271,11 +1271,4 @@
>   FROM information_schema.domain_constraints
>   WHERE domain_name IN ('con', 'dom', 'pos_int', 'things'))
>   ORDER BY constraint_name;
> - constraint_catalog | constraint_schema | constraint_name  | check_clause
> ---------------------+-------------------+------------------+-------------------
> - regression         | public            | con_check        | (VALUE > 0)
> - regression         | public            | meow             | (VALUE < 11)
> - regression         | public            | pos_int_check    | (VALUE > 0)
> - regression         | public            | pos_int_not_null | VALUE IS NOT NULL
> -(4 rows)
> -
> +ERROR:  could not open relation with OID 36379
>
> [1] - https://cirrus-ci.com/task/4536440638406656
> [2] - https://api.cirrus-ci.com/v1/artifact/task/4536440638406656/log/src/test/regress/regression.diffs

Interesting. I couldn't reproduce this locally, even across different operating systems. The cfbot failures appear to be sporadic, but they are happening across multiple systems, so it's clearly not just a local environment failure. Can anyone else perhaps reproduce this locally?
Re: Reduce useless changes before reassembly during logical replication
On Wed, Jan 17, 2024 at 11:45 AM li jie wrote: > > Hi hackers, > > During logical replication, if there is a large write transaction, some > spill files will be written to disk, depending on the setting of > logical_decoding_work_mem. > > This behavior can effectively avoid OOM, but if the transaction > generates a lot of change before commit, a large number of files may > fill the disk. For example, you can update a TB-level table. > > However, I found an inelegant phenomenon. If the modified large table is not > published, its changes will also be written with a large number of spill > files. > Look at an example below: Thanks. I agree that decoding and queuing the changes of unpublished tables' data into reorder buffer is an unnecessary task for walsender. It takes processing efforts (CPU overhead), consumes disk space and uses memory configured via logical_decoding_work_mem for a replication connection inefficiently. > Later you will see a large number of spill files in the > > We can see that table tbl_t1 is not published in mypub. It also won't be sent > downstream because it's not subscribed. > After the transaction is reorganized, the pgoutput decoding plugin filters out > changes to these unpublished relationships when sending logical changes. > See function pgoutput_change. Right. Here's my testing [1]. > Most importantly, if we filter out unpublished relationship-related > changes after constructing the changes but before queuing the changes > into a transaction, will it reduce the workload of logical decoding > and avoid disk > or memory growth as much as possible? Right. It can. > The patch in the attachment is a prototype, which can effectively reduce the > memory and disk space usage during logical replication. > > Design: > 1. Added a callback LogicalDecodeFilterByRelCB for the output plugin. > > 2. Added this callback function pgoutput_table_filter for the pgoutput plugin. 
> Its main implementation is based on the table filter in the > pgoutput_change function. > Its main function is to determine whether the change needs to be published > based > on the parameters of the publication, and if not, filter it. > > 3. After constructing a change and before Queue a change into a transaction, > use RelidByRelfilenumber to obtain the relation associated with the change, > just like obtaining the relation in the ReorderBufferProcessTXN function. > > 4. Relation may be a toast, and there is no good way to get its real > table relation based on toast relation. Here, I get the real table oid > through toast relname, and then get the real table relation. > > 5. This filtering takes into account INSERT/UPDATE/INSERT. Other > changes have not been considered yet and can be expanded in the future. Design of this patch is based on the principle of logical decoding filtering things out early on and looks very similar to filter_prepare_cb_wrapper/pg_decode_filter_prepare and filter_by_origin_cb/pgoutput_origin_filter. Per my understanding this design looks okay unless I'm missing anything. > Test: > 1. Added a test case 034_table_filter.pl > 2. Like the case above, create two tables, the published table tbl_pub and > the non-published table tbl_t1 > 3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use > pg_ls_replslotdir to record the total size of the slot directory > every second. > 4. Compare the size of the slot directory at the beginning of the > transaction(size1), > the size at the end of the transaction (size2), and the average > size of the entire process(size3). > 5. Assert(size1==size2==size3) I bet that the above test with 10K rows is going to take a noticeable time on some buildfarm members (it took 6 seconds on my dev system which is an AWS EC2 instance). And, the above test can get flaky. 
Therefore, IMO, the concrete way of testing this feature is by looking at the server logs for the following message using PostgreSQL::Test::Cluster log_contains(). +filter_done: + +if (result && RelationIsValid(relation)) +elog(DEBUG1, "logical filter change by table %s", RelationGetRelationName(relation)); + Here are some comments on the v1 patch: 1. @@ -1415,9 +1419,6 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, TupleTableSlot *old_slot = NULL; TupleTableSlot *new_slot = NULL; -if (!is_publishable_relation(relation)) -return; - Instead of removing is_publishable_relation from pgoutput_change, I think it can just be turned into an assertion Assert(is_publishable_relation(relation));, no? 2. +switch (change->action) +{ +/* intentionally fall through */ Perhaps, it must use /* FALLTHROUGH */ just like elsewhere in the code, otherwise a warning is thrown. 3. From commit message: Most of the code in the FilterByTable function is transplanted from the ReorderBufferProcessTXN function, which can be called before the ReorderBufferQueueChange function.It is I think the above note can just be
Re: A performance issue with Memoize
Hi Richard,

I can tell this is a real-world problem. I have seen this multiple times in production. The fix seems surprisingly simple.

I hope my questions here aren't completely off. I still struggle to think about the implications. I wonder if there is anything we would break by bluntly forgetting about the subplan params. Maybe some table-valued function scan within a subquery scan? Or something about casts on a join condition that could be performed differently? I wasn't able to construct a problem case. I might just be missing context here, but I am not yet fully convinced this is safe to do in all cases.

Regards
Arne
Re: heavily contended lwlocks with long wait queues scale badly
On Tue, Jan 16, 2024 at 03:11:48PM +0900, Michael Paquier wrote: > I'd like to apply that, just let me know if you have any comments > and/or objections. And done on 12~15. While on it, I have also looked at source code references on github and debian that involve lwWaiting, and all of them rely on lwWaiting when not waiting, making LW_WS_NOT_WAITING an equivalent. -- Michael signature.asc Description: PGP signature
Re: subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo
On Thu, Jan 18, 2024 at 12:55 PM Masahiko Sawada wrote: > ... > > Although we can improve it to handle this case too, I'm not sure it's > a bug. The doc says[1]: > > Specifies whether the subscription should be automatically disabled if > any errors are detected by subscription workers during data > replication from the publisher. > > When an apply worker is trying to establish a connection, it's not > replicating data from the publisher. > > Regards, > > [1] > https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR > > -- > Masahiko Sawada > Amazon Web Services: https://aws.amazon.com Yeah, I had also seen that wording of those docs. And I agree it leaves open some room for doubts because strictly from that wording it can be interpreted that establishing the connection is not actually "data replication from the publisher" in which case maybe there is no bug. OTOH, it was not clear to me if that precise wording was the intention or not. It could have been written as "Specifies whether the subscription should be automatically disabled if any errors are detected by subscription workers." which would mean the same thing 99% of the time except that would mean that the current behaviour is a bug. I tried looking at the original thread where this feature was born [1] but it is still unclear to me if 'disable_on_error' was meant for every kind of error or only data replication errors. Indeed. even the commit message [2] seems to have an each-way bet: * It talks about errors applying changes --- "Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply a change..." * But, it also says it covers any errors --- "When true, both the tablesync worker and apply worker catch any errors thrown..." ~ Maybe we should be asking ourselves how a user intuitively expects this option to behave. 
IMO the answer is right there in the option name - the subscription says 'disable_on_error' and I got an error, so I expected the subscription to be disabled. To wriggle out of it by saying "Ah, but we did not mean _those_ kinds of errors" doesn't quite seem genuine to me. == [1] https://www.postgresql.org/message-id/flat/CAA4eK1KsaVgkO%3DRbjj0bcXZTpeV1QVm0TGkdxZiH73MHfxf6oQ%40mail.gmail.com#d4a0db154fbeca356a494c50ac877ff1 [2] https://github.com/postgres/postgres/commit/705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33 Kind Regards, Peter Smith. Fujitsu Australia
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
Hi, David Rowley writes: > On Thu, 18 Jan 2024 at 14:58, Andy Fan wrote: >> I want to know if "user just want to zero out the flags in bitmapset >> but keeping the memory allocation" is a valid requirement. Commit >> 00b41463c makes it is hard IIUC. > > Looking at your patch, I see: > > +/* > + * does this break commit 00b41463c21615f9bf3927f207e37f9e215d32e6? > + * but I just found alloc memory and free the memory is too bad > + * for this current feature. So let see ...; > + */ > +void > +bms_zero(Bitmapset *a) > > I understand the requirement here, but, to answer the question in the > comment -- Yes, that does violate the requirements for how an empty > set is represented and as of c7e5e994b and possibly earlier, any > subsequent Bitmapset operations will cause an Assert failure due to > the trailing zero word(s) left by bms_zero(). Thanks for the understanding and confirmation. >> Looks like this is another user case of "user just wants to zero out the >> flags in bitmapset but keeping the memory allocation". > > I don't really see a way to have it work the way you want without > violating the new representation of an empty Bitmapset. Per [1], > there's quite a performance advantage to 00b41463c plus the additional > don't allow trailing empty words rule done in a8c09daa8. So I don't > wish the rules were more lax. I agree with this. > > Maybe Bitmapsets aren't the best fit for your need. Maybe it would be > better to have a more simple fixed size bitset that you could allocate > in the same allocation as the TupleTableSlot's tts_null and tts_values > arrays. The slot's TupleDesc should know how many bits are needed. I think this is the direction we have to go. There is no bitset struct yet, so I prefer to design it as below, The APIs are pretty similar with Bitmapset. 
typedef char Bitset;

Bitset *bitset_init(int size);
void bitset_add_member(Bitset *bitset, int x);
void bitset_del_member(Bitset *bitset, int x);
Bitset *bitset_add_members(Bitset *bitset1, Bitset *bitset2);
bool bitset_is_member(Bitset *bitset, int x);
int bitset_next_member(Bitset *bitset, int i);
void bitset_clear(Bitset *bitset);

After this, we may use it for DiscreteKnapsack as well, since num_items is fixed there too and that use case wants the memory allocation done only once.

--
Best Regards
Andy Fan
Re: Adding facility for injection points (or probe points?) for more advanced tests
Hi Michael,

There is some overlap between DTrace functionality and this functionality, but I see differences too. E.g. injection points offer deeper integration, whereas DTrace provides more information to the probe, like the call stack and argument values. We need to assess whether the two can co-exist and whether we need both of them. If the answer to both of those questions is yes, it will be good to add documentation explaining the differences and similarities, and also some guidance on when to use what.

On Fri, Jan 12, 2024 at 9:56 AM Michael Paquier wrote:
>
> On Fri, Jan 12, 2024 at 08:35:42AM +0900, Michael Paquier wrote:
> > It also seems like you've missed this message, where this has been
> > mentioned (spoiler: first version of the patch used an alternate
> > output):
> > https://www.postgresql.org/message-id/zunuzpimkzcov...@paquier.xyz
>
> The refactoring of 0001 has now been applied as of e72a37528dda, and
> the buildfarm looks stable (at least for now).
>
> Here is a rebased patch set of the rest.

+
+#ifdef USE_INJECTION_POINTS
+static bool
+file_exists(const char *name)
+{
+	struct stat st;
+
+	Assert(name != NULL);
+	if (stat(name, &st) == 0)
+		return !S_ISDIR(st.st_mode);
+	else if (!(errno == ENOENT || errno == ENOTDIR))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not access file \"%s\": %m", name)));
+	return false;
+}

Shouldn't this be removed now? The code should use the one from fd.c.

Other code changes look good. I think the documentation and comments need some changes, especially considering the user's point of view. I have attached two patches (0003 and 0004) with those changes, to be applied on top of 0001 and 0002 respectively. Please review them; they might need some wordsmithing and language correction. Attaching the whole patch set to keep cfbot happy.

This is a review of 0001 and 0002 only. Once we take care of these comments, I think those patches will be ready for commit, except one point of contention mentioned in [1].
We haven't heard any third opinion yet. [1] https://www.postgresql.org/message-id/CAExHW5sc_ar7=w9xccc9twyxzf71ghc6poq_+u4hxtxmnb7...@mail.gmail.com -- Best Wishes, Ashutosh Bapat From 7a26f1c345e8a8dcf895dcd5d1d45012f46dd826 Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Wed, 17 Jan 2024 16:59:03 +0530 Subject: [PATCH 2/6] Comment and documentation suggestions. ... reflecting a user's point of view. Ashutosh Bapat --- doc/src/sgml/installation.sgml | 24 doc/src/sgml/xfunc.sgml | 24 +++- src/backend/utils/misc/injection_point.c | 11 +-- 3 files changed, 32 insertions(+), 27 deletions(-) diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml index 51830fc078..f913b6b78f 100644 --- a/doc/src/sgml/installation.sgml +++ b/doc/src/sgml/installation.sgml @@ -1661,12 +1661,12 @@ build-postgresql: Compiles PostgreSQL with support for -injection points in the server. This is valuable to inject -user-defined code to force specific conditions to happen on the -server in pre-defined code paths. This option is disabled by default. -See for more details. -This option is only for developers to test specific concurrency -scenarios. +injection points in the server. Injection points allow to run +user-defined code from within the server in pre-defined code paths. +This helps in testing and investigation of concurrency scenarios in a +controlled fashion. This option is disabled by default. See for more details. This option +is inteded to be used only by developers for testing. @@ -3180,12 +3180,12 @@ ninja install Compiles PostgreSQL with support for -injection points in the server. This is valuable to inject -user-defined code to force specific conditions to happen on the -server in pre-defined code paths. This option is disabled by default. -See for more details. -This option is only for developers to test specific concurrency -scenarios. +injection points in the server. 
Injection points allow to run +user-defined code from within the server in pre-defined code paths. +This helps in testing and investigation of concurrency scenarios in a +controlled fashion. This option is disabled by default. See for more details. This option +is inteded to be used only by developers for testing. diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 9dd9eafd22..59dbe73c67 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3514,27 +3514,24 @@ uint32 WaitEventExtensionNew(const char *wait_event_name) Injection Points - Add-ins can define injection points, that would run user-defined - code when going through a specific code
Re: Fix search_path for all maintenance commands
On Thu, 2024-01-18 at 09:24 +0530, Shubham Khanna wrote: > I tried to apply the patch but it is failing at the Head. It is > giving > the following error: I am withdrawing this patch from the CF until it's more clear that this is what we really want to do. Thank you for looking into it. Regards, Jeff Davis
Re: Synchronizing slots from primary to standby
I have one question about the new code in v63-0002.

==
src/backend/replication/logical/slotsync.c

1. ReplSlotSyncWorkerMain

+	Assert(SlotSyncWorker->pid == InvalidPid);
+
+	/*
+	 * Startup process signaled the slot sync worker to stop, so if meanwhile
+	 * postmaster ended up starting the worker again, exit.
+	 */
+	if (SlotSyncWorker->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncWorker->mutex);
+		proc_exit(0);
+	}

Can we be sure a worker crash can't occur (in ShutDownSlotSync?) in such a way that SlotSyncWorker->stopSignaled was already assigned true, but SlotSyncWorker->pid was not yet reset to InvalidPid? I.e. is the Assert above still OK?

==
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: initdb's -c option behaves wrong way?
Thank you for picking this up.

At Wed, 17 Jan 2024 23:47:41 +0100, Daniel Gustafsson wrote in
> > On 17 Jan 2024, at 21:33, Tom Lane wrote:
> >
> > I wrote:
> >> However ... I don't like the patch much.  It seems to have left
> >> the code in a rather random state.  Why, for example, didn't you
> >> keep all the code that constructs the "newline" value together?
> >
> > After thinking about it a bit more, I don't see why you didn't just
> > s/strncmp/strncasecmp/ and call it good.  The messiness seems to be
> > a result of your choice to replace the GUC's case as shown in the
> > file with the case used on the command line, which is not better
> > IMO.  We don't change our mind about the canonical spelling of a
> > GUC because somebody varied the case in a SET command.
>
> The original patch was preserving the case of the file, throwing away the
> case from the command line.  The attached is a slimmed-down version which
> only moves the assembly of newline to the different cases (replace vs.
> new), keeping the rest of the code intact.  Keeping it in one place gets
> sort of messy too, since it needs to use different values for a
> replacement and a new variable.  Not sure if there is a cleaner approach?

Just to clarify, the current code breaks out after processing the first matching line. I haven't changed that behavior. The reason I moved the rewrite processing code out of the loop was that I wanted to avoid adding more lines that are executed only once into the loop. However, that is in exchange for additional complexity to remember the original spelling of the parameter name. Personally, I believe separating the search and rewrite processing is better, but I'm not particularly attached to the approach, so either way is fine with me.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada wrote: > It seems we cannot make this work nicely. IIUC VacDeadItems is > allocated in DSM and TidStore is embedded there. However, > dead_items->items.area is a local pointer to dsa_area. So we cannot > include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we > would need to pass dsa_area to each interface by callers. Thanks again for exploring this line of thinking! Okay, it seems even if there's a way to make this work, it would be too invasive to justify when compared with the advantage I was hoping for. > > If we can't make this work nicely, I'd be okay with keeping the tid > > store control object. My biggest concern is unnecessary > > double-locking. > > If we don't do any locking stuff in radix tree APIs and it's the > user's responsibility at all, probably we don't need a lock for > tidstore? That is, we expose lock functions as you mentioned and the > user (like tidstore) acquires/releases the lock before/after accessing > the radix tree and num_items. I'm not quite sure what the point of "num_items" is anymore, because it was really tied to the array in VacDeadItems. dead_items->num_items is essential to reading/writing the array correctly. If this number is wrong, the array is corrupt. There is no such requirement for the radix tree. We don't need to know the number of tids to add to it or do a lookup, or anything. There are a number of places where we assert "the running count of the dead items" is the same as "the length of the dead items array", like here: @@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel) BlockNumber threshold; Assert(vacrel->num_index_scans == 0); - Assert(vacrel->lpdead_items == vacrel->dead_items->num_items); + Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items)); As such, in HEAD I'm guessing it's arbitrary which one is used for control flow. Correct me if I'm mistaken. 
If I am wrong for some part of the code, it'd be good to understand when that invariant can't be maintained. @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel) * Do index vacuuming (call each index's ambulkdelete routine), then do * related heap vacuuming */ - if (dead_items->num_items > 0) + if (TidStoreNumTids(dead_items) > 0) lazy_vacuum(vacrel); Like here. In HEAD, could this have used vacrel->dead_items? @@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) * We set all LP_DEAD items from the first heap pass to LP_UNUSED during * the second heap pass. No more, no less. */ - Assert(index > 0); Assert(vacrel->num_index_scans > 1 || -(index == vacrel->lpdead_items && +(TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items && vacuumed_pages == vacrel->lpdead_item_pages)); ereport(DEBUG2, - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", - vacrel->relname, (long long) index, vacuumed_pages))); + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages", + vacrel->relname, TidStoreNumTids(vacrel->dead_items), + vacuumed_pages))); We assert that vacrel->lpdead_items has the expected value, and then the ereport repeats the function call (with a lock) to read the value we just consulted to pass the assert. If we *really* want to compare counts, maybe we could invent a debugging-only function that iterates over the tree and popcounts the bitmaps. That seems too expensive for regular assert builds, though. On the subject of debugging builds, I think it no longer makes sense to have the array for debug checking in tid store, even during development. A few months ago, we had an encoding scheme that looked simple on paper, but its code was fiendishly difficult to follow (at least for me). That's gone. In addition to the debugging count above, we could also put a copy of the key in the BlockTableEntry's header, in debug builds. 
We don't yet need to care about the key size, since we don't (yet) have runtime-embeddable values. > Currently (as of v52 patch) RT_FIND is > doing so, [meaning, there is no internal "automatic" locking here since after we switched to variable-length types, an outstanding TODO] Maybe it's okay to expose global locking for v17. I have one possible alternative: This week I tried an idea to use a callback there so that after internal unlocking, the caller received the value (or whatever else needs to happen, such as lookup an offset in the tid bitmap). I've attached a draft for that that passes radix tree tests. It's a bit awkward, but I'm guessing this would more closely match future internal atomic locking. Let me know what you think of the concept, and then do whichever way you think is best. (using v53 as the basis) I believe this is the only open question remaining. The rest is just polish and testing. > During trying this idea, I realized that there is a visibility problem > in the radix tree template If it's broken even without the embedding I'll look into this (I don't know if this
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
On Thu, 18 Jan 2024 at 14:58, Andy Fan wrote: > I want to know if "user just want to zero out the flags in bitmapset > but keeping the memory allocation" is a valid requirement. Commit > 00b41463c makes it is hard IIUC. Looking at your patch, I see: +/* + * does this break commit 00b41463c21615f9bf3927f207e37f9e215d32e6? + * but I just found alloc memory and free the memory is too bad + * for this current feature. So let see ...; + */ +void +bms_zero(Bitmapset *a) I understand the requirement here, but, to answer the question in the comment -- Yes, that does violate the requirements for how an empty set is represented and as of c7e5e994b and possibly earlier, any subsequent Bitmapset operations will cause an Assert failure due to the trailing zero word(s) left by bms_zero(). > Looks like this is another user case of "user just wants to zero out the > flags in bitmapset but keeping the memory allocation". I don't really see a way to have it work the way you want without violating the new representation of an empty Bitmapset. Per [1], there's quite a performance advantage to 00b41463c plus the additional don't allow trailing empty words rule done in a8c09daa8. So I don't wish the rules were more lax. Maybe Bitmapsets aren't the best fit for your need. Maybe it would be better to have a more simple fixed size bitset that you could allocate in the same allocation as the TupleTableSlot's tts_null and tts_values arrays. The slot's TupleDesc should know how many bits are needed. David [1] https://postgr.es/m/CAJ2pMkYcKHFBD_OMUSVyhYSQU0-j9T6NZ0pL6pwbZsUCohWc7Q%40mail.gmail.com
Re: Fix search_path for all maintenance commands
On Thu, Jan 18, 2024 at 9:19 AM Jeff Davis wrote: > > On Mon, 2023-07-17 at 12:16 -0700, Jeff Davis wrote: > > Based on feedback, I plan to commit soon. > > Attached is a new version. > > Changes: > > * Also switch the search_path during CREATE MATERIALIZED VIEW, so that > it's consistent with REFRESH. As a part of this change, I slightly > reordered things in ExecCreateTableAs() so that the skipData path > returns early without entering the SECURITY_RESTRICTED_OPERATION. I > don't think that's a problem because (a) that is one place where > SECURITY_RESTRICTED_OPERATION is not used for security, but rather for > consistency; and (b) that path doesn't go through rewriter, planner, or > executor anyway so I don't see why it would matter. > > * Use GUC_ACTION_SAVE rather than GUC_ACTION_SET. That was a problem > with the previous patch for index functions executed in parallel > workers, which can happen calling SQL functions from pg_amcheck. > > * I used a wrapper function RestrictSearchPath() rather than calling > set_config_option() directly. That provides a nice place in case we > need to add a compatibility GUC to disable it. > > Question: > > Why do we switch to the table owner and use > SECURITY_RESTRICTED_OPERATION in DefineIndex(), when we will switch in > index_build (etc.) anyway? Similarly, why do we switch in vacuum_rel(), > when it doesn't matter for lazy vacuum and we will switch in > cluster_rel() and do_analyze_rel() anyway? I tried to apply the patch but it is failing at the Head. It is giving the following error: Hunk #7 succeeded at 3772 (offset -12 lines). patching file src/backend/commands/matview.c patching file src/backend/commands/vacuum.c Hunk #2 succeeded at 2169 (offset -19 lines). patching file src/backend/utils/init/usercontext.c patching file src/bin/scripts/t/100_vacuumdb.pl Hunk #1 FAILED at 109. 
1 out of 1 hunk FAILED -- saving rejects to file src/bin/scripts/t/100_vacuumdb.pl.rej patching file src/include/utils/usercontext.h patching file src/test/modules/test_oat_hooks/expected/test_oat_hooks.out patching file src/test/regress/expected/matview.out patching file src/test/regress/expected/privileges.out patching file src/test/regress/expected/vacuum.out patching file src/test/regress/sql/matview.sql patching file src/test/regress/sql/privileges.sql patching file src/test/regress/sql/vacuum.sql Please send a rebased version of the patch. Thanks and Regards, Shubham Khanna.
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
Thanks for having a look at this again. On Thu, 18 Jan 2024 at 15:22, Richard Guo wrote: > Do you think we can use 'memcpy(a, b, BITMAPSET_SIZE(b->nwords))' > directly in the new bms_replace_members() instead of copying the > bitmapwords one by one, like: > > - i = 0; > - do > - { > - a->words[i] = b->words[i]; > - } while (++i < b->nwords); > - > - a->nwords = b->nwords; > + memcpy(a, b, BITMAPSET_SIZE(b->nwords)); > > But I'm not sure if this is an improvement or not. I considered this earlier but felt it was going against the method used in other places in the file. However, on relooking I do see bms_copy() does a memcpy(). I'm still in favour of keeping it the way the v2 patch does it for 2 reasons: 1) Ignoring bms_copy(), we use do/while in all other functions where we operate on all words in the set. 2) memcpy isn't that fast for small numbers of bytes when that number of bytes isn't known at compile-time. The do/while method can take advantage of knowing that the Bitmapset will have at least 1 word allowing a single loop check when the set only has a single word, which I expect most Bitmapsets do. Of course, memcpy() is fewer lines of code and this likely isn't that performance critical, so there's certainly arguments for memcpy(). However, it isn't quite as few lines as the patch you pasted. We still need to overwrite a->nwords to ensure we grow the set or shrink it to trim off any trailing zero words (which I didn't feel any need to actually set to 0). David
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
On Thu, Jan 18, 2024 at 8:35 AM David Rowley wrote: > The functions's header comment mentions "The bitmapsets are all > pre-initialized with an unused high bit so that memory allocation is > done only once.". Ah, I neglected to notice this when reviewing the v1 patch. I guess it's implemented this way due to performance considerations, right? > I've attached a patch which adds bms_replace_members(). It's basically > like bms_copy() but attempts to reuse the member from another set. I > considered if the new function should be called bms_copy_inplace(), > but left it as bms_replace_members() for now. Do you think we can use 'memcpy(a, b, BITMAPSET_SIZE(b->nwords))' directly in the new bms_replace_members() instead of copying the bitmapwords one by one, like: - i = 0; - do - { - a->words[i] = b->words[i]; - } while (++i < b->nwords); - - a->nwords = b->nwords; + memcpy(a, b, BITMAPSET_SIZE(b->nwords)); But I'm not sure if this is an improvement or not. Thanks Richard
Re: Add system identifier to backup manifest
On Wed, Jan 17, 2024 at 08:46:09AM -0500, Robert Haas wrote: > On Wed, Jan 17, 2024 at 6:45 AM Alvaro Herrera > wrote: >> Hmm, okay, but what if I take a full backup from a primary server and >> later I want an incremental from a standby, or the other way around? >> Will this prevent me from using such a combination? > > The system identifier had BETTER match in such cases. If it doesn't, > somebody's run pg_resetwal on your standby since it was created... and > in that case, no incremental backup for you! There is an even stronger check than that at replay as we also store the system identifier in XLogLongPageHeaderData and cross-check it with the contents of the control file. Having a field in the backup manifest makes for a much faster detection, even if that's not the same as replaying things, it can avoid a lot of problems when combining backup pieces. I'm +1 for Amul's patch concept. -- Michael signature.asc Description: PGP signature
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On 2024-01-18 10:10, jian he wrote: On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada wrote: On Thu, Jan 18, 2024 at 6:38 AM Tom Lane wrote: > > Alexander Korotkov writes: > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > > wrote: > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > >> indicate "immediately error out" and 'just ignore the failure' > >> respectively, but these options hardly seem to denote a 'location', > >> and appear more like an 'action'. I somewhat suspect that this > >> parameter name intially conceived with the assupmtion that it would > >> take file names or similar parameters. I'm not sure if others will > >> agree, but I think the parameter name might not be the best > >> choice. For instance, considering the addition of the third value > >> 'log', something like on_error_action (error, ignore, log) would be > >> more intuitively understandable. What do you think? > > > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > > the next word will be location, not action. With some stretch we can > > assume 'error' to be location. I think it would be even more stretchy > > to think that SAVE_ERROR_TO is followed by action. > > The other problem with this terminology is that with 'none', what it > is doing is the exact opposite of "saving" the errors. I agree we > need a better name. Agreed. > > Kyotaro-san's suggestion isn't bad, though I might shorten it to > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > You will need a separate parameter anyway to specify the destination > of "log", unless "none" became an illegal table name when I wasn't > looking. I don't buy that one parameter that has some special values > while other values could be names will be a good design. Moreover, > what if we want to support (say) log-to-file along with log-to-table? > Trying to distinguish a file name from a table name without any other > context seems impossible. 
I've been thinking we can add more values to this option to log errors not only to the server logs but also to the error table (not sure details but I imagined an error table is created for each table on error), without an additional option for the destination name. The values would be like error_action {error|ignore|save-logs|save-table}. another idea: on_error {error|ignore|other_future_option} if not specified then by default ERROR. You can also specify ERROR or IGNORE for now. I agree, the parameter "error_action" is better than "location". I'm not sure whether error_action or on_error is better, but either way "error_action error" and "on_error error" seems a bit odd to me. I feel "stop" is better for both cases as Tom suggested. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Re: Add last_commit_lsn to pg_stat_database
On Sun, Jan 14, 2024 at 6:01 AM vignesh C wrote: > > On Sat, 10 Jun 2023 at 07:57, James Coleman wrote: > > > > I've previously noted in "Add last commit LSN to > > pg_last_committed_xact()" [1] that it's not possible to monitor how > > many bytes of WAL a logical replication slot is behind (computing such > > is obviously trivial for physical slots) because the slot doesn't need > > to replicate beyond the last commit. In some cases it's possible for > > the current WAL location to be far beyond the last commit. A few > > examples: > > > > - An idle server with checkpoint_timeout at a lower value (so lots of > > empty WAL advance). > > - Very large transactions: particularly insidious because committing a > > 1 GB transaction after a small transaction may show almost zero time > > lag even though quite a bit of data needs to be processed and sent > > over the wire (so time to replay is significantly different from > > current lag). > > - A cluster with multiple databases complicates matters further, > > because while physical replication is cluster-wide, the LSNs that > > matter for logical replication are database specific. > > > > Since we don't expose the most recent commit's LSN there's no way to > > say "the WAL is currently 1250, the last commit happened at 1000, the > > slot has flushed up to 800, therefore there are at most 200 bytes > > replication needs to read through to catch up." > > > > In the aforementioned thread [1] I'd proposed a patch that added a SQL > > function pg_last_commit_lsn() to expose the most recent commit's LSN. > > Robert Haas didn't think the initial version's modifications to > > commit_ts made sense, and a subsequent approach adding the value to > > PGPROC didn't have strong objections, from what I can see, but it also > > didn't generate any enthusiasm. > > > > As I was thinking about how to improve things, I realized that this > > information (since it's for monitoring anyway) fits more naturally > > into the stats system. 
I'd originally thought of exposing it in > > pg_stat_wal, but that's per-cluster rather than per-database (indeed, > > this is a flaw I hadn't considered in the original patch), so I think > > pg_stat_database is the correct location. > > > > I've attached a patch to track the latest commit's LSN in pg_stat_database. > > I have changed the status of commitfest entry to "Returned with > Feedback" as Aleksander's comments have not yet been resolved. Please > feel free to post an updated version of the patch and update the > commitfest entry accordingly. Thanks for reminding me; I'd lost track of this patch. Regards, James Coleman
Re: Add last_commit_lsn to pg_stat_database
Hello, Thanks for reviewing! On Tue, Sep 19, 2023 at 8:16 AM Aleksander Alekseev wrote: > > Hi, > > > [...] > > As I was thinking about how to improve things, I realized that this > > information (since it's for monitoring anyway) fits more naturally > > into the stats system. I'd originally thought of exposing it in > > pg_stat_wal, but that's per-cluster rather than per-database (indeed, > > this is a flaw I hadn't considered in the original patch), so I think > > pg_stat_database is the correct location. > > > > I've attached a patch to track the latest commit's LSN in pg_stat_database. > > Thanks for the patch. It was marked as "Needs Review" so I decided to > take a brief look. > > All in all the code seems to be fine but I have a couple of nitpicks: > > - If you are modifying pg_stat_database the corresponding changes to > the documentation are needed. Updated. > - pg_stat_get_db_last_commit_lsn() has the same description as > pg_stat_get_db_xact_commit() which is confusing. I've fixed this. > - pg_stat_get_db_last_commit_lsn() is marked as STABLE which is > probably correct. But I would appreciate a second opinion on this. Sounds good. > - Wouldn't it be appropriate to add a test or two? Added. > - `if (!XLogRecPtrIsInvalid(commit_lsn))` - I suggest adding > XLogRecPtrIsValid() macro for better readability We have 36 usages of !XLogRecPtrIsInvalid(...) already, so I think we should avoid making this change in this patch. v2 is attached. Regards, James Coleman v2-0001-Add-last-commit-s-LSN-to-pg_stat_database.patch Description: Binary data
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
Hi, David Rowley writes: > On Tue, 16 Jan 2024 at 16:32, David Rowley wrote: >> >> >> Now that 00b41463c changed Bitmapset to have NULL be the only valid >> representation of an empty set, this code no longer makes sense. We >> may as well just bms_free() the original set and bms_copy() in the new >> set as the bms_del_members() call will always pfree the set anyway. I want to know if "user just want to zero out the flags in bitmapset but keeping the memory allocation" is a valid requirement. Commit 00b41463c makes it hard, IIUC. The use case I have is that I want to keep the detoasted datum in slot->tts_values[1] so that any further access doesn't need to detoast it again; I used a 'Bitmapset' in TupleTableSlot which records which attributes are detoasted. All of the detoasted values should be pfree'd in ExecClearTuple. However, bms_free'ing the bitmapset every time in ExecClearTuple and allocating the memory again later causes a noticeable performance regression (a 5% difference in my workload). That is still an open item for that patch. > ... > The function's header comment mentions "The bitmapsets are all > pre-initialized with an unused high bit so that memory allocation is > done only once.". > NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33% > NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33% > > and by the looks of the code, the worst case is much worse. > Looks like this is another use case of "user just wants to zero out the flags in bitmapset but keeping the memory allocation". [1] https://www.postgresql.org/message-id/flat/87il4jrk1l@163.com -- Best Regards Andy Fan
Re: subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo
On Thu, Jan 18, 2024 at 8:16 AM Peter Smith wrote: > > Hi, > > I had reported a possible subscription 'disable_on_error' bug found > while reviewing another patch. > > I am including my initial report and Nisha's analysis again here so > that this topic has its own thread. > > == > INITIAL REPORT [1] > == > > ... > I see now that any ALTER of the subscription's connection, even to > some value that fails, will restart a new worker (like ALTER of any > other subscription parameters). For a bad connection, it will continue > to relaunch-worker/ERROR over and over. e.g. > > -- > test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG: logical > replication apply worker for subscription "sub4" has started > 2024-01-17 09:34:28.666 AEDT [11274] ERROR: could not connect to the > publisher: invalid port number: "-1" > 2024-01-17 09:34:28.667 AEDT [928] LOG: background worker "logical > replication apply worker" (PID 11274) exited with exit code 1 > 2024-01-17 09:34:33.669 AEDT [11391] LOG: logical replication apply > worker for subscription "sub4" has started > 2024-01-17 09:34:33.669 AEDT [11391] ERROR: could not connect to the > publisher: invalid port number: "-1" > 2024-01-17 09:34:33.670 AEDT [928] LOG: background worker "logical > replication apply worker" (PID 11391) exited with exit code 1 > etc... > -- > > While experimenting with the bad connection ALTER I also tried setting > 'disable_on_error' like below: > > ALTER SUBSCRIPTION sub4 SET (disable_on_error); > ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1'; > > ...but here the subscription did not become DISABLED as I expected it > would do on the next connection error iteration. It remains enabled > and just continues to loop relaunch/ERROR indefinitely same as before. > > That looks like it may be a bug. Thoughts? Although we can improve it to handle this case too, I'm not sure it's a bug. 
The doc says[1]: Specifies whether the subscription should be automatically disabled if any errors are detected by subscription workers during data replication from the publisher. When an apply worker is trying to establish a connection, it's not replicating data from the publisher. Regards, [1] https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
linux cachestat in file Readv and Prefetch
Hi, I was testing the index prefetch and streamIO patches and I added the cachestat() syscall to get a better view of the prefetching. It's a new Linux syscall; it requires kernel 6.5 and provides a lot of interesting information from the VM for the range of pages examined. It's way faster than the old mincore() and provides much more valuable information: uint64 nr_cache; /* Number of cached pages */ uint64 nr_dirty; /* Number of dirty pages */ uint64 nr_writeback; /* Number of pages marked for writeback. */ uint64 nr_evicted; /* Number of pages evicted from the cache. */ /* * Number of recently evicted pages. A page is recently evicted if its * last eviction was recent enough that its reentry to the cache would * indicate that it is actively being used by the system, and that there * is memory pressure on the system. */ uint64 nr_recently_evicted; While here I also added some quick tweaks to suspend prefetching on memory pressure. It's working, but I have absolutely not checked the performance impact of my additions. Sharing here for others to test, and adding it to the CF in case there is interest in going further in this direction. --- Cédric Villemain +33 (0)6 20 30 22 52 https://Data-Bene.io PostgreSQL Expertise, Support, Training, R&D From 4cdc5e952e0cd729bde472bdc3d7ff7bbc3009ea Mon Sep 17 00:00:00 2001 From: Cédric Villemain Date: Thu, 18 Jan 2024 01:00:37 +0100 Subject: [PATCH] Add cachestat, requires linux 6.5 USE_CACHESTAT is defined and #ifdef'd because the feature is only for testing and discussion at the moment. cachestat provides a struct with the following fields set for the range of pages examined. The syscall is supposed to be fast enough, but maybe not for each PageRead/Prefetch. The very basic logic to suspend prefetching is effective enough for some scenarios (ANALYZE being a good candidate). 
--- src/backend/storage/file/fd.c | 85 +++ 1 file changed, 85 insertions(+) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 43d2b2a156e..ffdbd542373 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -181,6 +181,57 @@ int io_direct_flags; ((void) 0) #endif +/* cachestat - use structs as defined in linux */ +typedef struct +{ + uint64 nr_cache; /* Number of cached pages */ + uint64 nr_dirty; /* Number of dirty pages */ + uint64 nr_writeback; /* Number of pages marked for writeback. */ + uint64 nr_evicted; /* Number of pages evicted from the cache. */ + /* + * Number of recently evicted pages. A page is recently evicted if its + * last eviction was recent enough that its reentry to the cache would + * indicate that it is actively being used by the system, and that there + * is memory pressure on the system. + */ + uint64 nr_recently_evicted; +} cachestat; + +typedef struct +{ + uint64 off; + uint64 len; +} cachestat_range; + +/* + * simple wrapper of linux cachestat syscall + * *cstat point to a member in Vfd. + * XXX define USE_CACHESTAT ? 
+ */ +static inline void vm_cachestat(int fd, cachestat_range csrange, +cachestat *cstat) { + +#define USE_CACHESTAT 1 + +#if defined(USE_CACHESTAT) +#define __NR_cachestat 451 + + long rc; + rc = syscall(__NR_cachestat, fd, &csrange, cstat, 0); + if (rc == -1) + { + if (errno == ENOSYS) + ereport(DEBUG1, (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("sys_cachestat is not available: %m"), + errhint("linux 6.5 minimum is required!"))); + + ereport(ERROR, (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("sys_cachestat returns: %m"))); + } +#endif +} + #define VFD_CLOSED (-1) #define FileIsValid(file) \ @@ -206,6 +257,9 @@ typedef struct vfd /* NB: fileName is malloc'd, and must be free'd when closing the VFD */ int fileFlags; /* open(2) flags for (re)opening the file */ mode_t fileMode; /* mode to pass to open(2) */ +#if defined(USE_CACHESTAT) + cachestat cstat; /* linux cachestat() */ +#endif } Vfd; /* @@ -2090,6 +2144,33 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info) if (returnCode < 0) return returnCode; +#if defined(USE_CACHESTAT) + /* + * last time we visited this file (somewhere), nr_recently_evicted pages of + * the range were just removed from the vm cache; it's a sign of memory pressure, + * so do not prefetch further. + * it is hard to guess if it is always the right choice in absence of more + * information like: + * - prefetching distance expected overall + * - access pattern/backend maybe + */ + if (VfdCache[file].cstat.nr_recently_evicted > 0) + { + DO_DB(elog(LOG, "FilePrefetch: nr_recently_evicted="INT64_FORMAT"\t\t\t\tnoway", + VfdCache[file].cstat.nr_recently_evicted)); + return 0; + } + vm_cachestat(VfdCache[file].fd, (cachestat_range) {offset, amount}, + &VfdCache[file].cstat); + if (VfdCache[file].cstat.nr_cache > 0) + { + DO_DB(elog(LOG, "FilePrefetch:
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 17, 2024 at 11:37 AM John Naylor wrote: > > On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada wrote: > > > > On Wed, Jan 17, 2024 at 9:20 AM John Naylor wrote: > > > > > > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada > > > wrote: > > > > Just changing "items" to be the local tidstore struct could make the > > > > code tricky a bit, since max_bytes and num_items are on the shared > > > > memory while "items" is a local pointer to the shared tidstore. > > > > > > Thanks for trying it this way! I like the overall simplification but > > > this aspect is not great. > > > Hmm, I wonder if that's a side-effect of the "create" functions doing > > > their own allocations and returning a pointer. Would it be less tricky > > > if the structs were declared where we need them and passed to "init" > > > functions? > > > > Seems worth trying. The current RT_CREATE() API is also convenient as > > other data structure such as simplehash.h and dshash.c supports a > > similar > > I don't happen to know if these paths had to solve similar trickiness > with some values being local, and some shared. > > > > That may be a good idea for other reasons. It's awkward that the > > > create function is declared like this: > > > > > > #ifdef RT_SHMEM > > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes, > > > dsa_area *dsa, > > > int tranche_id); > > > #else > > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes); > > > #endif > > > > > > An init function wouldn't need these parameters: it could look at the > > > passed struct to know what to do. > > > > But the init function would initialize leaf_ctx etc,no? Initializing > > leaf_ctx needs max_bytes that is not stored in RT_RADIX_TREE. > > I was more referring to the parameters that were different above > depending on shared memory. My first thought was that the tricky part > is because of the allocation in local memory, but it's certainly > possible I've misunderstood the problem. 
> > > The same > > is true for dsa. I imagined that an init function would allocate a DSA > > memory for the control object. > > Yes: > > ... > // embedded in VacDeadItems > TidStore items; > }; > > // NULL DSA in local case, etc > dead_items->items.area = dead_items_dsa; > dead_items->items.tranche_id = FOO_ID; > > TidStoreInit(_items->items, vac_work_mem); > > That's how I imagined it would work (leaving out some details). I > haven't tried it, so not sure how much it helps. Maybe it has other > problems, but I'm hoping it's just a matter of programming. It seems we cannot make this work nicely. IIUC VacDeadItems is allocated in DSM and TidStore is embedded there. However, dead_items->items.area is a local pointer to dsa_area. So we cannot include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we would need to pass dsa_area to each interface by callers. > > If we can't make this work nicely, I'd be okay with keeping the tid > store control object. My biggest concern is unnecessary > double-locking. If we don't do any locking stuff in radix tree APIs and it's the user's responsibility at all, probably we don't need a lock for tidstore? That is, we expose lock functions as you mentioned and the user (like tidstore) acquires/releases the lock before/after accessing the radix tree and num_items. Currently (as of v52 patch) RT_FIND is doing so, but we would need to change RT_SET() and iteration functions as well. During trying this idea, I realized that there is a visibility problem in the radix tree template especially if we want to embed the radix tree in a struct. Considering a use case where we want to use a radix tree in an exposed struct, we would declare only interfaces in a .h file and define actual implementation in a .c file (FYI TupleHashTableData does a similar thing with simplehash.h). 
The .c file and .h file would be like: in .h file: #define RT_PREFIX local_rt #define RT_SCOPE extern #define RT_DECLARE #define RT_VALUE_TYPE BlocktableEntry #define RT_VARLEN_VALUE #include "lib/radixtree.h" typedef struct TidStore { : local_rt_radix_tree tree; /* embedded */ : } TidStore; in .c file: #define RT_PREFIX local_rt #define RT_SCOPE extern #define RT_DEFINE #define RT_VALUE_TYPE BlocktableEntry #define RT_VARLEN_VALUE #include "lib/radixtree.h" But it doesn't work as the compiler doesn't know the actual definition of local_rt_radix_tree. If the 'tree' is *local_rt_radix_tree, it works. The reason is that with RT_DECLARE but without RT_DEFINE, the radix tree template generates only forward declarations: #ifdef RT_DECLARE typedef struct RT_RADIX_TREE RT_RADIX_TREE; typedef struct RT_ITER RT_ITER; In order to make it work, we need to move the definitions required to expose RT_RADIX_TREE struct to RT_DECLARE part, which actually requires to move RT_NODE, RT_HANDLE, RT_NODE_PTR, RT_SIZE_CLASS_COUNT, and RT_RADIX_TREE_CONTROL etc. However RT_SIZE_CLASS_COUNT, used in RT_RADIX_TREE, could be bothersome. Since it refers to
Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx
On Wed, Jan 17, 2024 at 10:05:33AM +0100, Anthonin Bonnefoy wrote: > > I do realize the same is true for plain \bind, but it seems > > like a bug there too. > > The unscanned bind's parameters are discarded later in the > HandleSlashCmds functions. So adding the ignore_slash_options() for > inactive branches scans and discards them earlier. I will add it to > match what's done in the other commands but I don't think it's > testable as the behaviour is the same unless I miss something. Hmm. So it does not lead to any user-visible changes, right? I can get your argument about being consistent in the code across the board for all the backslash commands, though. > I did add the \bind, \bindx, \close and \parse to the inactive branch > tests to complete the list. Could you split the bits for \bind into a separate patch, please? This requires a separate evaluation, especially if this had better be backpatched. -- Michael
Re: Show WAL write and fsync stats in pg_stat_io
On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote: > I agree with your points. While the other I/O related work is > happening we can discuss what we should do in the variable op_byte > cases. Also, this is happening only for WAL right now but if we try to > extend pg_stat_io in the future, that problem possibly will rise > again. So, it could be good to come up with a general solution, not > only for WAL. Okay, I've marked the patch as RwF for this CF. -- Michael
Re: CI and test improvements
On Wed, Jan 17, 2024 at 05:34:00PM +0530, vignesh C wrote: > Are we planning to do anything more in this thread, the thread has > been idle for more than 7 months. If nothing more is planned for this, > I'm planning to close this commitfest entry in this commitfest. Oops, this fell through the cracks. security was still missing in seg's meson.build, so I've just applied a patch to take care of it. I am not spotting any other holes. -- Michael
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada wrote: > > On Thu, Jan 18, 2024 at 6:38 AM Tom Lane wrote: > > > > Alexander Korotkov writes: > > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > > > wrote: > > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > > >> indicate "immediately error out" and 'just ignore the failure' > > >> respectively, but these options hardly seem to denote a 'location', > > >> and appear more like an 'action'. I somewhat suspect that this > > >> parameter name intially conceived with the assupmtion that it would > > >> take file names or similar parameters. I'm not sure if others will > > >> agree, but I think the parameter name might not be the best > > >> choice. For instance, considering the addition of the third value > > >> 'log', something like on_error_action (error, ignore, log) would be > > >> more intuitively understandable. What do you think? > > > > > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > > > the next word will be location, not action. With some stretch we can > > > assume 'error' to be location. I think it would be even more stretchy > > > to think that SAVE_ERROR_TO is followed by action. > > > > The other problem with this terminology is that with 'none', what it > > is doing is the exact opposite of "saving" the errors. I agree we > > need a better name. > > Agreed. > > > > > Kyotaro-san's suggestion isn't bad, though I might shorten it to > > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > > You will need a separate parameter anyway to specify the destination > > of "log", unless "none" became an illegal table name when I wasn't > > looking. I don't buy that one parameter that has some special values > > while other values could be names will be a good design. Moreover, > > what if we want to support (say) log-to-file along with log-to-table? 
> > Trying to distinguish a file name from a table name without any other > > context seems impossible. > > I've been thinking we can add more values to this option to log errors > not only to the server logs but also to the error table (not sure > details but I imagined an error table is created for each table on > error), without an additional option for the destination name. The > values would be like error_action {error|ignore|save-logs|save-table}. > another idea: on_error {error|ignore|other_future_option} if not specified then by default ERROR. You can also specify ERROR or IGNORE for now. I agree, the parameter "error_action" is better than "location".
Re: Where can I find the doxyfile?
On Tue, 7 Nov 2023 at 12:23, John Morris wrote: > > Another update, this time with an abbreviated Doxyfile. Rather than including > the full Doxyfile, this updated version only includes modified settings. It > is more compact and more likely to survive across future doxygen versions. Meson build is failing in CFbot at [1] and [2] with: Message: Install doxygen if you wish to create Doxygen documentation Program dot found: NO Configuring Doxyfile using configuration doc/doxygen/meson.build:52:0: ERROR: Tried to use not-found external program in "command" [1] - https://cirrus-ci.com/task/5823573617541120 [2] - https://api.cirrus-ci.com/v1/artifact/task/5823573617541120/meson_log/build/meson-logs/meson-log.txt Regards, Vignesh
Re: Add system identifier to backup manifest
I have also done a review of the patch and some testing. The patch looks good, and I agree with Robert's comments. On Wed, Jan 17, 2024 at 8:40 PM Robert Haas wrote: > > On Wed, Jan 17, 2024 at 6:31 AM Amul Sul wrote: > > With the attached patch, the backup manifest will have a new key item as > > "System-Identifier" 64-bit integer whose value is derived from pg_control while > > generating it, and the manifest version bumps to 2. > > > > This helps to identify the correct database server and/or backup for the > > subsequent backup operations. pg_verifybackup validates the manifest system > > identifier against the backup control file and fails if they don’t match. > > Similarly, pg_basebackup increment backup will fail if the manifest system > > identifier does not match with the server system identifier. The > > pg_combinebackup is already a bit smarter -- checks the system identifier from > > the pg_control of all the backups, with this patch the manifest system > > identifier also validated. > > Thanks for working on this. Without this, I think what happens is that > you can potentially take an incremental backup from the "wrong" > server, if the states of the systems are such that all of the other > sanity checks pass. When you run pg_combinebackup, it'll discover the > problem and tell you, but you ideally want to discover such errors at > backup time rather than at restore time. This addresses that. And, > overall, I think it's a pretty good patch. But I nonetheless have a > bunch of comments. > > - The associated value is always the integer 1. > + The associated value is the integer, either 1 or 2. > > is an integer. Beginning in PostgreSQL 17, > it is 2; in older versions, it is 1. > > + context.identity_cb = manifest_process_identity; > > I'm not really on board with calling the system identifier "identity" > throughout the patch. I think it should just say system_identifier. 
If > we were going to abbreviate, I'd prefer something like "sysident" that > looks like it's derived from "system identifier" rather than > "identity" which is a different word altogether. But I don't think we > should abbreviate unless doing so creates *ridiculously* long > identifier names. > > +static void > +manifest_process_identity(JsonManifestParseContext *context, > + int manifest_version, > + uint64 manifest_system_identifier) > +{ > + uint64 system_identifier; > + > + /* Manifest system identifier available in version 2 or later */ > + if (manifest_version == 1) > + return; > > I think you've got the wrong idea here. I think this function would > only get called if System-Identifier is present in the manifest, so if > it's a v1 manifest, this would never get called, so this if-statement > would not ever do anything useful. I think what you should do is (1) > if the client supplies a v1 manifest, reject it, because surely that's > from an older server version that doesn't support incremental backup; > but do that when the version is parsed rather than here; and (2) also > detect and reject the case when it's supposedly a v2 manifest but this > is absent. > > (1) should really be done when the version number is parsed, so I > suspect you may need to add manifest_version_cb. > > +static void > +combinebackup_identity_cb(JsonManifestParseContext *context, > + int manifest_version, > + uint64 manifest_system_identifier) > +{ > + parser_context *private_context = context->private_data; > + uint64 system_identifier = private_context->system_identifier; > + > + /* Manifest system identifier available in version 2 or later */ > + if (manifest_version == 1) > + return; > > Very similar to the above case. Just reject a version 1 manifest as > soon as we see the version number. In this function, set a flag > indicating we saw the system identifier; if at the end of parsing that > flag is not set, kaboom. 
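The flag-based detection being suggested — reject a v1 manifest as soon as the version is parsed, record the system identifier when its callback fires, and complain at end of parse if a v2 manifest never supplied the field — can be sketched with a toy parse state. All names here are hypothetical stand-ins, not the real JsonManifestParseContext API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified parse state -- not the real JsonManifestParseContext */
typedef struct
{
	int			manifest_version;
	bool		saw_system_identifier;
	uint64_t	system_identifier;
} ManifestParseState;

/* Version callback: a v1 manifest cannot carry a system identifier */
static bool
version_cb(ManifestParseState *state, int version)
{
	if (version < 2)
		return false;			/* reject as soon as the version is parsed */
	state->manifest_version = version;
	return true;
}

/* System-Identifier callback: record the value and remember that we saw it */
static void
system_identifier_cb(ManifestParseState *state, uint64_t sysident)
{
	state->system_identifier = sysident;
	state->saw_system_identifier = true;
}

/* End-of-parse check: a v2 manifest that never supplied the field is an error */
static bool
end_of_manifest_cb(const ManifestParseState *state)
{
	return state->saw_system_identifier;
}
```

The point of the shape is that both failure modes (old manifest version, missing field in a new manifest) are caught during parsing rather than at some later semantic-validation step.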
> > - parse_manifest_file(manifest_path, , _wal_range); > + parse_manifest_file(manifest_path, , _wal_range, > + context.backup_directory); > > Don't do this! parse_manifest_file() should just record everything > found in the manifest in the context object. Syntax validation should > happen while parsing the manifest (e.g. "CAT/DOG" is not a valid LSN > and we should reject that at this stage) but semantic validation > should happen later (e.g. "0/0" can't be the correct backup end LSN > but we don't figure that out while parsing, but rather later). I think > you should actually move validation of the system identifier to the > point where the directory walk encounters the control file (and update > the docs and tests to match that decision). Imagine if you wanted to > validate a tar-format backup; then you wouldn't have random access to > the directory. You'd see the manifest file first, and then all the > files in a random order, with one chance to look at each one. > > (This is, in fact, a feature I think we should implement.) > > - if (strcmp(token, "1") != 0) > + parse->manifest_version =
Re: pgsql: Clean up role created in new subscription test.
On Fri, 1 Dec 2023 at 18:23, Daniel Gustafsson wrote: > > > On 1 Dec 2023, at 13:19, Alvaro Herrera wrote: > > > > Isn't it simpler to use DROP OWNED BY in 0001? > > I suppose it is, I kind of like the explicit drops but we do use OWNED BY > quite > a lot in the regression tests so changed to that in the attached v5. There are a lot of failures in CFBot at [1] with: # test failed --- stderr --- # left over global object detected: regress_test_bypassrls # 1 of 2 tests failed. More details of the same are available at [2]. Do we need to clean up the objects leftover for the reported issues in the test? [1] - https://cirrus-ci.com/task/6222185975513088 [2] - https://api.cirrus-ci.com/v1/artifact/task/6222185975513088/meson_log/build/meson-logs/testlog-running.txt Regards, Vignesh
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Thu, Jan 18, 2024 at 6:38 AM Tom Lane wrote: > > Alexander Korotkov writes: > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > > wrote: > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > >> indicate "immediately error out" and 'just ignore the failure' > >> respectively, but these options hardly seem to denote a 'location', > >> and appear more like an 'action'. I somewhat suspect that this > >> parameter name was initially conceived with the assumption that it would > >> take file names or similar parameters. I'm not sure if others will > >> agree, but I think the parameter name might not be the best > >> choice. For instance, considering the addition of the third value > >> 'log', something like on_error_action (error, ignore, log) would be > >> more intuitively understandable. What do you think? > > > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > > the next word will be location, not action. With some stretch we can > > assume 'error' to be location. I think it would be even more stretchy > > to think that SAVE_ERROR_TO is followed by action. > > The other problem with this terminology is that with 'none', what it > is doing is the exact opposite of "saving" the errors. I agree we > need a better name. Agreed. > > Kyotaro-san's suggestion isn't bad, though I might shorten it to > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > You will need a separate parameter anyway to specify the destination > of "log", unless "none" became an illegal table name when I wasn't > looking. I don't buy that one parameter that has some special values > while other values could be names will be a good design. Moreover, > what if we want to support (say) log-to-file along with log-to-table? > Trying to distinguish a file name from a table name without any other > context seems impossible. 
I've been thinking we can add more values to this option to log errors not only to the server logs but also to the error table (not sure details but I imagined an error table is created for each table on error), without an additional option for the destination name. The values would be like error_action {error|ignore|save-logs|save-table}. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Teach predtest about IS [NOT] proofs
On Fri, Dec 22, 2023 at 2:48 PM Tom Lane wrote: > > James Coleman writes: > > I've not yet applied all of your feedback, but I wanted to get an > > initial read on your thoughts on how using switch statements ends up > > looking. Attached is a single (pure refactor) patch that converts the > > various if/else levels that check things like node tag and > > boolean/null test type into switch statements. I've retained 'default' > > keyword usages for now for simplicity (my intuition is that we > > generally prefer to list out all options for compiler safety benefits, > > though I'm not 100% sure that's useful in the outer node tag check > > since it's unlikely that someone adding a new node would modify > > this...). > > > My big question is: are you comfortable with the indentation explosion > > this creates? IMO it's a lot wordier, but it is also more obvious what > > the structural goal is. I'm not sure how we want to make the right > > trade-off though. > > Yeah, I see what you mean. Also, I'd wanted to shove most of > the text in the function header in-line and get rid of the short > restatements of those paras. I carried that through just for > predicate_implied_by_simple_clause, as attached. The structure is > definitely clearer, but we end up with an awful lot of indentation, > which makes the comments less readable than I'd like. (I did some > minor rewording to make them flow better.) > > On balance I think this is probably better than what we have, but > maybe we'd be best off to avoid doubly nested switches? I think > there's a good argument for the outer switch on nodeTag, but > maybe we're getting diminishing returns from an inner switch. > > regards, tom lane > Apologies for the long delay. Attached is a new patch series. 0001 does the initial pure refactor. 0002 makes a lot of modifications to what we can prove about implication and refutation. 
Finally, 0003 isn't intended to be committed, but attempts to validate more holistically that none of the changes creates any invalid proofs beyond the mostly happy-path tests added in 0004. I ended up not tackling changes to how the test_predtest tests run for now. That's plausibly still useful, and I'd be happy to add it if you generally agree with the direction of the patch and with that abstraction being useful. I added some additional verifications to the test_predtest module to prevent other obvious flaws.

Regards, James Coleman

v4-0003-Add-temporary-all-permutations-test.patch Description: Binary data
v4-0001-Use-switch-statements-in-predicate_-implied-refut.patch Description: Binary data
v4-0002-Teach-predtest.c-about-BooleanTest.patch Description: Binary data
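For readers skimming the thread, the doubly nested switch shape under discussion looks roughly like this. This is a toy model, not the actual predtest.c code: an outer switch on a (mini) node tag and an inner switch on the boolean-test type, deciding whether one IS [NOT] test on an argument proves another test on the same argument:

```c
#include <stdbool.h>

/* Toy stand-ins for the real node tags and BoolTestType -- hypothetical */
typedef enum { T_BooleanTest, T_NullTest } MiniNodeTag;
typedef enum { IS_TRUE, IS_NOT_TRUE, IS_FALSE, IS_NOT_FALSE,
			   IS_UNKNOWN, IS_NOT_UNKNOWN } MiniBoolTestType;

typedef struct
{
	MiniNodeTag tag;
	MiniBoolTestType booltesttype;
} MiniClause;

/* Does "x <premise>" prove "x <conclusion>" for the same argument x? */
static bool
mini_implies(const MiniClause *premise, const MiniClause *conclusion)
{
	if (premise->tag != T_BooleanTest || conclusion->tag != T_BooleanTest)
		return false;			/* toy model covers BooleanTest only */

	switch (premise->booltesttype)
	{
		case IS_TRUE:
			switch (conclusion->booltesttype)
			{
				case IS_TRUE:
				case IS_NOT_FALSE:
				case IS_NOT_UNKNOWN:
					return true;
				default:
					return false;
			}
		case IS_FALSE:
			switch (conclusion->booltesttype)
			{
				case IS_FALSE:
				case IS_NOT_TRUE:
				case IS_NOT_UNKNOWN:
					return true;
				default:
					return false;
			}
		case IS_UNKNOWN:
			switch (conclusion->booltesttype)
			{
				case IS_UNKNOWN:
				case IS_NOT_TRUE:
				case IS_NOT_FALSE:
					return true;
				default:
					return false;
			}
		default:
			/* an IS_NOT_* premise only trivially proves itself here */
			return premise->booltesttype == conclusion->booltesttype;
	}
	return false;				/* not reached */
}
```

The indentation cost of the second switch level is visible even at this scale, which is the trade-off Tom raises above: the outer switch on the node tag buys clarity, while the inner one mostly buys exhaustiveness checking.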
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
On Tue, 16 Jan 2024 at 16:32, David Rowley wrote: > > While working on [1], I noticed some strange code in > DiscreteKnapsack() which seems to be aiming to copy the Bitmapset. > > It's not that obvious at a casual glance, but: > > sets[j] = bms_del_members(sets[j], sets[j]); > > this is aiming to zero all the words in the set by passing the same > set in both parameters. > > Now that 00b41463c changed Bitmapset to have NULL be the only valid > representation of an empty set, this code no longer makes sense. We > may as well just bms_free() the original set and bms_copy() in the new > set as the bms_del_members() call will always pfree the set anyway. > > I've done that in the attached. > > I did consider if we might want bms_merge_members() or > bms_exchange_members() or some other suitably named function > to perform a del/add operation, but given the lack of complaints about > any performance regressions here, I think it's not worthwhile. After looking at this again and reading more code, I see that DiscreteKnapsack() goes to some efforts to minimise memory allocations. The function's header comment mentions "The bitmapsets are all pre-initialized with an unused high bit so that memory allocation is done only once.". I tried adding some debugging output to track how many additional allocations we're now causing as a result of 00b41463c. Previously there'd have just been max_weight allocations, but now there's those plus the number that's mentioned for "frees" below. 
NOTICE: DiscreteKnapsack: frees = 373, max_weight = 140, extra = 266.43%
NOTICE: DiscreteKnapsack: frees = 373, max_weight = 140, extra = 266.43%
NOTICE: DiscreteKnapsack: frees = 267, max_weight = 100, extra = 267.00%
NOTICE: DiscreteKnapsack: frees = 267, max_weight = 100, extra = 267.00%
NOTICE: DiscreteKnapsack: frees = 200, max_weight = 140, extra = 142.86%
NOTICE: DiscreteKnapsack: frees = 200, max_weight = 140, extra = 142.86%
NOTICE: DiscreteKnapsack: frees = 30, max_weight = 40, extra = 75.00%
NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33%
NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33%
NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33%
NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33%

and by the looks of the code, the worst case is much worse. Given that the original code was written in a very deliberate way to avoid reallocations, I now think that we should maintain that. I've attached a patch which adds bms_replace_members(). It's basically like bms_copy() but attempts to reuse the memory of the existing set. I considered if the new function should be called bms_copy_inplace(), but left it as bms_replace_members() for now. Now I wonder if this should be backpatched to PG16. 
David

diff --git a/src/backend/lib/knapsack.c b/src/backend/lib/knapsack.c
index 13d800718f..439da1ad70 100644
--- a/src/backend/lib/knapsack.c
+++ b/src/backend/lib/knapsack.c
@@ -89,10 +89,7 @@ DiscreteKnapsack(int max_weight, int num_items,
 		{
 			/* copy sets[ow] to sets[j] without realloc */
 			if (j != ow)
-			{
-				sets[j] = bms_del_members(sets[j], sets[j]);
-				sets[j] = bms_add_members(sets[j], sets[ow]);
-			}
+				sets[j] = bms_replace_members(sets[j], sets[ow]);
 
 			sets[j] = bms_add_member(sets[j], i);
 
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b0f908f978..8be61cd081 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -976,6 +976,50 @@ bms_add_members(Bitmapset *a, const Bitmapset *b)
 	return result;
 }
 
+/*
+ * bms_replace_members
+ *		Remove all existing members from 'a' and repopulate the set with members
+ *		from 'b', recycling 'a', when possible.
+ */
+Bitmapset *
+bms_replace_members(Bitmapset *a, const Bitmapset *b)
+{
+	int			i;
+
+	Assert(bms_is_valid_set(a));
+	Assert(bms_is_valid_set(b));
+
+	if (a == NULL)
+		return bms_copy(b);
+	if (b == NULL)
+	{
+		pfree(a);
+		return NULL;
+	}
+
+	if (a->nwords < b->nwords)
+		a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(b->nwords));
+
+	i = 0;
+	do
+	{
+		a->words[i] = b->words[i];
+	} while (++i < b->nwords);
+
+	a->nwords = b->nwords;
+
+#ifdef REALLOCATE_BITMAPSETS
+
+	/*
+	 * There's no guarantee that the repalloc returned a new pointer, so copy
+	 * and free unconditionally here.
+	 */
+	a = bms_copy_and_free(a);
+#endif
+
+	return a;
+}
+
 /*
  * bms_add_range
  *		Add members in the range of 'lower' to
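The recycling idea is easy to demonstrate outside the backend with a toy flat-array set. The names and structure below are hypothetical simplifications of the real patch (which also carries Assert checks and REALLOCATE_BITMAPSETS handling): reuse the existing allocation when it is big enough, and reallocate only when it is not:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A toy stand-in for Bitmapset: a length-prefixed array of 64-bit words */
typedef struct
{
	int			nwords;
	uint64_t	words[];		/* flexible array member */
} ToySet;

static ToySet *
toy_make(int nwords)
{
	ToySet	   *s = malloc(sizeof(ToySet) + nwords * sizeof(uint64_t));

	s->nwords = nwords;
	memset(s->words, 0, nwords * sizeof(uint64_t));
	return s;
}

/* Copy b into a, reusing a's allocation whenever it is large enough */
static ToySet *
toy_replace_members(ToySet *a, const ToySet *b)
{
	if (b == NULL)				/* replacing with the empty set frees a */
	{
		free(a);
		return NULL;
	}
	if (a == NULL)
		a = toy_make(b->nwords);
	else if (a->nwords < b->nwords)
		a = realloc(a, sizeof(ToySet) + b->nwords * sizeof(uint64_t));
	memcpy(a->words, b->words, b->nwords * sizeof(uint64_t));
	a->nwords = b->nwords;
	return a;
}
```

In the common DiscreteKnapsack() case where the destination already has enough words, no allocator call happens at all, which is exactly the property the original pre-00b41463c code was engineered for.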
linux cachestat in file Readv and Prefetch
Hi,

I was testing the index prefetch and streamIO patches and I added the cachestat() syscall to get a better view of the prefetching. It's a new linux syscall, it requires 6.5, and it provides a lot of interesting information from the VM for the range of pages examined. It's way faster than the old mincore() and provides much more valuable information:

 uint64 nr_cache;            /* Number of cached pages */
 uint64 nr_dirty;            /* Number of dirty pages */
 uint64 nr_writeback;        /* Number of pages marked for writeback. */
 uint64 nr_evicted;          /* Number of pages evicted from the cache. */
 /*
  * Number of recently evicted pages. A page is recently evicted if its
  * last eviction was recent enough that its reentry to the cache would
  * indicate that it is actively being used by the system, and that there
  * is memory pressure on the system.
  */
 uint64 nr_recently_evicted;

While here I also added some quick tweaks to suspend prefetching on memory pressure. It's working but I have absolutely not checked the performance impact of my additions. Sharing here for others to test, and adding it to the CF in case there is interest to go further in this direction.

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

From 4cdc5e952e0cd729bde472bdc3d7ff7bbc3009ea Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?=
Date: Thu, 18 Jan 2024 01:00:37 +0100
Subject: [PATCH] Add cachestat, requires linux 6.5

there is USE_CACHESTAT defined and ifdef as the feature is only for
testing and discussion at the moment.

cachestat provides a struct with those following fields set for the
range of pages examined. The syscall is supposed to be fast enough but
maybe not for each PageRead/Prefetch.

The very basic logic to suspend prefetching is effective enough for
some scenarios (ANALYZE being a good candidate). 
---
 src/backend/storage/file/fd.c | 85 +++
 1 file changed, 85 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 43d2b2a156e..ffdbd542373 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -181,6 +181,57 @@ int io_direct_flags;
 	((void) 0)
 #endif
 
+/* cachestat - use structs as defined in linux */
+typedef struct
+{
+	uint64		nr_cache;		/* Number of cached pages */
+	uint64		nr_dirty;		/* Number of dirty pages */
+	uint64		nr_writeback;	/* Number of pages marked for writeback. */
+	uint64		nr_evicted;		/* Number of pages evicted from the cache. */
+
+	/*
+	 * Number of recently evicted pages. A page is recently evicted if its
+	 * last eviction was recent enough that its reentry to the cache would
+	 * indicate that it is actively being used by the system, and that there
+	 * is memory pressure on the system.
+	 */
+	uint64		nr_recently_evicted;
+} cachestat;
+
+typedef struct
+{
+	uint64		off;
+	uint64		len;
+} cachestat_range;
+
+/*
+ * simple wrapper of linux cachestat syscall
+ * *cstat point to a member in Vfd.
+ * XXX define USE_CACHESTAT ?
+ */
+static inline void
+vm_cachestat(int fd, cachestat_range csrange, cachestat *cstat)
+{
+#define USE_CACHESTAT 1
+
+#if defined(USE_CACHESTAT)
+#define __NR_cachestat 451
+
+	long		rc;
+
+	rc = syscall(__NR_cachestat, fd, &csrange, cstat, 0);
+	if (rc == -1)
+	{
+		if (errno == ENOSYS)
+			ereport(DEBUG1,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg("sys_cachestat is not available: %m"),
+					 errhint("linux 6.5 minimum is required!")));
+
+		ereport(ERROR,
+				(errcode(ERRCODE_INTERNAL_ERROR),
+				 errmsg("sys_cachestat returns: %m")));
+	}
+#endif
+}
+
 #define VFD_CLOSED (-1)
 
 #define FileIsValid(file) \
@@ -206,6 +257,9 @@ typedef struct vfd
 	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
 	int			fileFlags;		/* open(2) flags for (re)opening the file */
 	mode_t		fileMode;		/* mode to pass to open(2) */
+#if defined(USE_CACHESTAT)
+	cachestat	cstat;			/* linux cachestat() */
+#endif
 } Vfd;
 
 /*
@@ -2090,6 +2144,33 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
 	if (returnCode < 0)
 		return returnCode;
 
+#if defined(USE_CACHESTAT)
+	/*
+	 * last time we visit this file (somewhere), nr_recently_evicted pages of
+	 * the range were just removed from vm cache, it's a sign of memory
+	 * pressure. so do not prefetch further.
+	 * it is hard to guess if it is always the right choice in absence of
+	 * more information like:
+	 * - prefetching distance expected overall
+	 * - access pattern/backend maybe
+	 */
+	if (VfdCache[file].cstat.nr_recently_evicted > 0)
+	{
+		DO_DB(elog(LOG, "FilePrefetch: nr_recently_evicted=" INT64_FORMAT "\t\t\t\tnoway",
+				   VfdCache[file].cstat.nr_recently_evicted));
+		return 0;
+	}
+	vm_cachestat(VfdCache[file].fd, (cachestat_range) {offset, amount},
+
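For anyone wanting to poke at the syscall outside the backend, here is a minimal standalone probe. The struct layouts are copied from the patch; the fallback syscall number 451 and the helper names are my own assumptions, and kernels older than 6.5 will simply fail with ENOSYS:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Mirrors of the kernel structs, as declared in the patch */
typedef struct
{
	uint64_t	nr_cache;
	uint64_t	nr_dirty;
	uint64_t	nr_writeback;
	uint64_t	nr_evicted;
	uint64_t	nr_recently_evicted;
} my_cachestat;

typedef struct
{
	uint64_t	off;
	uint64_t	len;
} my_cachestat_range;

#ifndef __NR_cachestat
#define __NR_cachestat 451		/* assumption: unified syscall-table number */
#endif

/* Returns 0 on success, -1 with errno set (ENOSYS on kernels before 6.5) */
static int
probe_cachestat(int fd, uint64_t off, uint64_t len, my_cachestat *cs)
{
	my_cachestat_range range = {off, len};

	return (int) syscall(__NR_cachestat, fd, &range, cs, 0);
}
```

On a 6.5+ kernel the call fills in the page counts for the requested range; on older kernels it fails cleanly, which is the case the patch downgrades to a DEBUG1 message.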
Re: Improve the connection failure error messages
On Wed, Jan 17, 2024 at 7:15 PM Nisha Moond wrote: > > > > > ~~ > > > > BTW, while experimenting with the bad connection ALTER I also tried > > setting 'disable_on_error' like below: > > > > ALTER SUBSCRIPTION sub4 SET (disable_on_error); > > ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1'; > > > > ...but here the subscription did not become DISABLED as I expected it > > would do on the next connection error iteration. It remains enabled > > and just continues to loop relaunch/ERROR indefinitely same as before. > > > > That looks like it may be a bug. Thoughts? > > > Ideally, if the already running apply worker in > "LogicalRepApplyLoop()" has any exception/error it will be handled and > the subscription will be disabled if 'disable_on_error' is set - > > start_apply(XLogRecPtr origin_startpos) > { > PG_TRY(); > { > LogicalRepApplyLoop(origin_startpos); > } > PG_CATCH(); > { > if (MySubscription->disableonerr) > DisableSubscriptionAndExit(); > ... > > What is happening in this case is that the control reaches the function - > run_apply_worker() -> start_apply() -> LogicalRepApplyLoop -> > maybe_reread_subscription() > ... > /* > * Exit if any parameter that affects the remote connection was changed. > * The launcher will start a new worker but note that the parallel apply > * worker won't restart if the streaming option's value is changed from > * 'parallel' to any other value or the server decides not to stream the > * in-progress transaction. > */ > if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 || > ... > > and it sees a change in the parameter and calls apply_worker_exit(). > This will exit the current process, without throwing an exception to > the caller and the postmaster will try to restart the apply worker. > The new apply worker, before reaching the start_apply() [where we > handle exception], will hit the code to establish the connection to > the publisher - > > ApplyWorkerMain() -> run_apply_worker() - > ... 
> LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, > true /* replication */ , > true, > must_use_password, > MySubscription->name, ); > > if (LogRepWorkerWalRcvConn == NULL) > ereport(ERROR, > (errcode(ERRCODE_CONNECTION_FAILURE), >errmsg("could not connect to the publisher: %s", err))); > ... > and due to the bad connection string in the subscription, it will error out. > [28680] ERROR: could not connect to the publisher: invalid port number: "-1" > [3196] LOG: background worker "logical replication apply worker" (PID > 28680) exited with exit code 1 > > Now, the postmaster keeps trying to restart the apply worker and it > will keep failing until the connection string is corrected or the > subscription is disabled manually. > > I think this is a bug that needs to be handled in run_apply_worker() > when disable_on_error is set. > IMO, this bug-fix discussion deserves a separate thread. Thoughts? Hi Nisha, Thanks for your analysis -- it is the same as my understanding. As suggested, I have created a new thread for any further discussion related to this 'disable_on_error' topic [1]. == [1] https://www.postgresql.org/message-id/flat/CAHut%2BPuEsekA3e7ThwzWr%2BUs4x%3DLzkF7DSrED1UsZTUqNrhCUQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo
Hi, I had reported a possible subscription 'disable_on_error' bug found while reviewing another patch. I am including my initial report and Nisha's analysis again here so that this topic has its own thread.

== INITIAL REPORT [1] ==

... I see now that any ALTER of the subscription's connection, even to some value that fails, will restart a new worker (like ALTER of any other subscription parameters). For a bad connection, it will continue to relaunch-worker/ERROR over and over. e.g.

--
test_sub=#
2024-01-17 09:34:28.665 AEDT [11274] LOG: logical replication apply worker for subscription "sub4" has started
2024-01-17 09:34:28.666 AEDT [11274] ERROR: could not connect to the publisher: invalid port number: "-1"
2024-01-17 09:34:28.667 AEDT [928] LOG: background worker "logical replication apply worker" (PID 11274) exited with exit code 1
2024-01-17 09:34:33.669 AEDT [11391] LOG: logical replication apply worker for subscription "sub4" has started
2024-01-17 09:34:33.669 AEDT [11391] ERROR: could not connect to the publisher: invalid port number: "-1"
2024-01-17 09:34:33.670 AEDT [928] LOG: background worker "logical replication apply worker" (PID 11391) exited with exit code 1
etc...
--

While experimenting with the bad connection ALTER I also tried setting 'disable_on_error' like below:

ALTER SUBSCRIPTION sub4 SET (disable_on_error);
ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';

...but here the subscription did not become DISABLED as I expected it would do on the next connection error iteration. It remains enabled and just continues to loop relaunch/ERROR indefinitely same as before.

That looks like it may be a bug. Thoughts? 
= ANALYSIS BY NISHA [2] = Ideally, if the already running apply worker in "LogicalRepApplyLoop()" has any exception/error it will be handled and the subscription will be disabled if 'disable_on_error' is set - start_apply(XLogRecPtr origin_startpos) { PG_TRY(); { LogicalRepApplyLoop(origin_startpos); } PG_CATCH(); { if (MySubscription->disableonerr) DisableSubscriptionAndExit(); ... What is happening in this case is that the control reaches the function - run_apply_worker() -> start_apply() -> LogicalRepApplyLoop -> maybe_reread_subscription() ... /* * Exit if any parameter that affects the remote connection was changed. * The launcher will start a new worker but note that the parallel apply * worker won't restart if the streaming option's value is changed from * 'parallel' to any other value or the server decides not to stream the * in-progress transaction. */ if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 || ... and it sees a change in the parameter and calls apply_worker_exit(). This will exit the current process, without throwing an exception to the caller and the postmaster will try to restart the apply worker. The new apply worker, before reaching the start_apply() [where we handle exception], will hit the code to establish the connection to the publisher - ApplyWorkerMain() -> run_apply_worker() - ... LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, true /* replication */ , true, must_use_password, MySubscription->name, ); if (LogRepWorkerWalRcvConn == NULL) ereport(ERROR, (errcode(ERRCODE_CONNECTION_FAILURE), errmsg("could not connect to the publisher: %s", err))); ... and due to the bad connection string in the subscription, it will error out. 
[28680] ERROR: could not connect to the publisher: invalid port number: "-1" [3196] LOG: background worker "logical replication apply worker" (PID 28680) exited with exit code 1 Now, the postmaster keeps trying to restart the apply worker and it will keep failing until the connection string is corrected or the subscription is disabled manually. I think this is a bug that needs to be handled in run_apply_worker() when disable_on_error is set. == [1] https://www.postgresql.org/message-id/CAHut%2BPvcf7P2dHbCeWPM4jQ%3DyHqf4WFS_C6eVb8V%3DbcZPMMp7A%40mail.gmail.com [2] https://www.postgresql.org/message-id/CABdArM6ORu%2BKpS_kXd-jwwPBqYPo1YqZjwwGnqmVanWgdHCggA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
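The restart behaviour described above can be modelled with a small simulation. This is purely illustrative pseudologic with hypothetical names, not backend code: today a connect failure just exits the worker and the postmaster relaunches it indefinitely, while the proposed fix would disable the subscription on the first failed launch when disable_on_error is set:

```c
#include <stdbool.h>

typedef enum { SUB_ENABLED, SUB_DISABLED, SUB_CONNECTED } SubState;

/*
 * Toy model of the launcher/apply-worker cycle. Each launch tries to
 * connect; on failure, the proposed fix disables the subscription when
 * disable_on_error is set, whereas today's code just exits with ERROR and
 * lets the postmaster relaunch the worker.
 * Returns the number of worker launches; *final reports the end state.
 */
static int
simulate_launches(bool disable_on_error, bool conn_ever_succeeds,
				  int max_launches, SubState *final)
{
	for (int launches = 1; launches <= max_launches; launches++)
	{
		if (conn_ever_succeeds)
		{
			*final = SUB_CONNECTED;
			return launches;
		}
		if (disable_on_error)
		{
			*final = SUB_DISABLED;	/* proposed: handle connect errors too */
			return launches;
		}
		/* today: worker exits with ERROR, postmaster relaunches */
	}
	*final = SUB_ENABLED;		/* still looping when we stopped watching */
	return max_launches;
}
```

The contrast is the point of the bug report: with the fix, a bad conninfo plus disable_on_error converges after one launch instead of looping until someone intervenes.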
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 5:47 PM Melanie Plageman wrote: > > On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan wrote: > > > > On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan wrote: > > > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > > > wrong idea. If it's such a good idea then why not apply it all the > > > time? That is, why not apply it independently of whether nindexes==0 > > > in the current VACUUM operation? (You know, just like with > > > FAILSAFE_EVERY_PAGES.) > > > > Actually, I suppose that we couldn't apply it independently of > > nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our > > second pass over the heap takes place for those LP_DEAD-containing > > heap pages scanned since the last round of index/heap vacuuming took > > place (or since VACUUM began). We need to make sure that the FSM has > > the most recent possible information known to VACUUM, which would > > break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. > > > > Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. > > I now see I misunderstood and my earlier email was wrong. I didn't > notice that we only use VACUUM_FSM_EVERY_PAGES if nindexes ==0. > So, in master, we call FreeSpaceMapVacuumRange() always after a round > of index vacuuming and periodically if there are no indexes. The "nindexes == 0" if() that comes just after our call to lazy_scan_prune() is "the one-pass equivalent of a call to lazy_vacuum()". Though this includes the call to FreeSpaceMapVacuumRange() that immediately follows the two-pass case calling lazy_vacuum(), too. > It seems like you are asking whether or not we should vacuum the FSM at a > different cadence for the no indexes case (and potentially count > blocks actually vacuumed instead of blocks considered). > > And it seems like Robert is asking whether or not we should > FreeSpaceMapVacuumRange() more frequently than after index vacuuming > in the nindexes > 0 case. 
There is no particular reason for the nindexes==0 case to care about how often we'd call FreeSpaceMapVacuumRange() in the counterfactual world where the same VACUUM ran on the same table, except that it was nindexes>1 instead. At least I don't see any. > Other than the overhead of the actual vacuuming of the FSM, what are > the potential downsides of knowing about freespace sooner? It could > change what pages are inserted to. What are the possible undesirable > side effects? The whole VACUUM_FSM_EVERY_PAGES thing comes from commit 851a26e266. The commit message of that work seems to suppose that calling FreeSpaceMapVacuumRange() more frequently is pretty much strictly better than calling it less frequently, at least up to the point where certain more-or-less fixed costs paid once per FreeSpaceMapVacuumRange() start to become a problem. I think that that's probably about right. The commit message also says that we "arbitrarily update upper FSM pages after each 8GB of heap" (in the nindexes==0 case). So VACUUM_FSM_EVERY_PAGES is only very approximately analogous to what we do in the nindexes>1 case. That seems reasonable because these two cases really aren't so comparable in terms of the FSM vacuuming requirements -- the nindexes==0 case legitimately doesn't have the same dependency on heap vacuuming (and index vacuuming) that we have to consider when nindexes>1. -- Peter Geoghegan
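The cadence difference being discussed can be sketched with a toy counter. TOY_FSM_EVERY_PAGES is a tiny stand-in for the real constant (8GB worth of pages), and the two modes correspond to counting all processed pages including VM-skipped ones (master's behaviour) versus scanned pages only (the change Peter suggests above):

```c
#include <stdbool.h>

/* Tiny stand-in for the real VACUUM_FSM_EVERY_PAGES (8GB worth of pages) */
#define TOY_FSM_EVERY_PAGES 4

/*
 * Count how many times FreeSpaceMapVacuumRange() would fire over a scan of
 * rel_pages blocks, where skipped[] marks pages passed over via the
 * visibility map. "scanned_only" selects the suggested accounting.
 */
static int
fsm_vacuum_count(int rel_pages, const bool *skipped, bool scanned_only)
{
	int			triggers = 0;
	int			since_last = 0;

	for (int blk = 0; blk < rel_pages; blk++)
	{
		if (scanned_only && skipped[blk])
			continue;			/* VM-skipped pages involve no physical work */
		if (++since_last >= TOY_FSM_EVERY_PAGES)
		{
			triggers++;
			since_last = 0;
		}
	}
	return triggers;
}
```

With many VM-skipped pages, the processed-pages accounting triggers FSM vacuuming more often even though no extra physical work (or time) has elapsed, which is the loose relationship Peter points out.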
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan wrote: > > On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan wrote: > > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > > wrong idea. If it's such a good idea then why not apply it all the > > time? That is, why not apply it independently of whether nindexes==0 > > in the current VACUUM operation? (You know, just like with > > FAILSAFE_EVERY_PAGES.) > > Actually, I suppose that we couldn't apply it independently of > nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our > second pass over the heap takes place for those LP_DEAD-containing > heap pages scanned since the last round of index/heap vacuuming took > place (or since VACUUM began). We need to make sure that the FSM has > the most recent possible information known to VACUUM, which would > break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. > > Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. I now see I misunderstood and my earlier email was wrong. I didn't notice that we only use VACUUM_FSM_EVERY_PAGES if nindexes == 0. So, in master, we call FreeSpaceMapVacuumRange() always after a round of index vacuuming and periodically if there are no indexes. It seems like you are asking whether or not we should vacuum the FSM at a different cadence for the no indexes case (and potentially count blocks actually vacuumed instead of blocks considered). And it seems like Robert is asking whether or not we should FreeSpaceMapVacuumRange() more frequently than after index vacuuming in the nindexes > 0 case. Other than the overhead of the actual vacuuming of the FSM, what are the potential downsides of knowing about freespace sooner? It could change what pages are inserted to. What are the possible undesirable side effects? - Melanie
Re: initdb's -c option behaves wrong way?
> On 17 Jan 2024, at 21:33, Tom Lane wrote: > > I wrote: >> However ... I don't like the patch much. It seems to have left >> the code in a rather random state. Why, for example, didn't you >> keep all the code that constructs the "newline" value together? > > After thinking about it a bit more, I don't see why you didn't just > s/strncmp/strncasecmp/ and call it good. The messiness seems to be > a result of your choice to replace the GUC's case as shown in the > file with the case used on the command line, which is not better > IMO. We don't change our mind about the canonical spelling of a > GUC because somebody varied the case in a SET command. The original patch preserved the case used in the file, throwing away the case from the command line. The attached is a slimmed-down version which only moves the assembly of "newline" to the different cases (replace vs. new), keeping the rest of the code intact. Keeping it in one place gets sort of messy too since it needs to use different values for a replacement and a new variable. Not sure if there is a cleaner approach? -- Daniel Gustafsson v3-0001-Make-initdb-c-option-case-insensitive.patch Description: Binary data
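The s/strncmp/strncasecmp/ idea amounts to matching the GUC name case-insensitively while leaving the file's canonical spelling untouched. A minimal sketch of the matching side (a hypothetical helper, not the actual initdb code):

```c
#include <stdbool.h>
#include <string.h>
#include <strings.h>			/* strncasecmp (POSIX) */

/*
 * Does this postgresql.conf line set (or comment out) the given GUC?
 * Comparison is case-insensitive, so e.g. "initdb -c dateSTYLE=..." matches
 * the file's "#datestyle = ..." line; a caller rewriting the value would
 * keep the file's own spelling of the name.
 */
static bool
guc_line_matches(const char *line, const char *guc_name)
{
	size_t		namelen = strlen(guc_name);

	if (*line == '#')			/* commented-out default */
		line++;
	if (strncasecmp(line, guc_name, namelen) != 0)
		return false;
	line += namelen;
	while (*line == ' ' || *line == '\t')
		line++;
	return *line == '=';		/* rejects e.g. a longer name sharing a prefix */
}
```

Note the trailing check: requiring whitespace-then-'=' after the name keeps a command-line "datestyle" from matching an unrelated longer variable name that merely starts with the same characters.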
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan wrote: > > On Wed, Jan 17, 2024 at 3:58 PM Robert Haas wrote: > > > Ah, I realize I was not clear. I am now talking about inconsistencies > > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > > > the freespace map during the course of vacuuming the heap relation. > > > > Fair enough, but I'm still not quite sure exactly what the question > > is. It looks to me like the current code, when there are indexes, > > vacuums the FSM after each round of index vacuuming. When there are no > > indexes, doing it after each round of index vacuuming would mean never > > doing it, so instead we vacuum the FSM every ~8GB. I assume what > > happened here is that somebody decided doing it after each round of > > index vacuuming was the "right thing," and then realized that was not > > going to work if no index vacuuming was happening, and so inserted the > > 8GB threshold to cover that case. > > Note that VACUUM_FSM_EVERY_PAGES is applied against the number of > rel_pages "processed" so far -- *including* any pages that were > skipped using the visibility map. It would make a bit more sense if it > was applied against scanned_pages instead (just like > FAILSAFE_EVERY_PAGES has been since commit 07eef53955). In other > words, VACUUM_FSM_EVERY_PAGES is applied against a thing that has only > a very loose relationship with physical work performed/time elapsed. This is a good point. Seems like a very reasonable change to make, as I would think that was the original intent. - Melanie
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 3:58 PM Robert Haas wrote: > > On Wed, Jan 17, 2024 at 3:12 PM Melanie Plageman > wrote: > > > Yes, I also spent some time thinking about this. In master, we do > > always call lazy_scan_new_or_empty() before calling > > lazy_scan_noprune(). The code is aiming to ensure we call > > lazy_scan_new_or_empty() once before calling either of > > lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call > > lazy_scan_new_or_empty() unconditionally first because of the > > different lock types expected. But, your structure has solved that. > > I've used a version of your example code above in attached v9. It is > > much nicer. > > Oh, OK, I see it now. I missed that lazy_scan_new_or_empty() was > called either way. Glad that my proposed restructuring managed to be > helpful despite that confusion, though. :-) > > At a quick glance, I also like the way this looks. I'll review it more > thoroughly later. Does this patch require 0002 and 0003 or could it > equally well go first? I confess that I don't entirely understand why > we want 0002 and 0003. Well, 0002 and 0003 move the updates to the visibility map into lazy_scan_prune(). We only want to update the VM if we called lazy_scan_prune() (i.e. not if lazy_scan_noprune() returned true). We also need the lock on the heap page when updating the visibility map but we want to have released the lock before updating the FSM, so we need to first update the VM then the FSM. The VM update code, in my opinion, belongs in lazy_scan_prune() -- since it is only done when lazy_scan_prune() is called. To keep the VM update code in lazy_scan_heap() and still consolidate the FSM update code, we would have to surround all of the VM update code in a test (if got_cleanup_lock, I suppose). I don't see any advantage in doing that. > > Ah, I realize I was not clear. I am now talking about inconsistencies > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). 
Not updating > > the freespace map during the course of vacuuming the heap relation. > > Fair enough, but I'm still not quite sure exactly what the question > is. It looks to me like the current code, when there are indexes, > vacuums the FSM after each round of index vacuuming. When there are no > indexes, doing it after each round of index vacuuming would mean never > doing it, so instead we vacuum the FSM every ~8GB. I assume what > happened here is that somebody decided doing it after each round of > index vacuuming was the "right thing," and then realized that was not > going to work if no index vacuuming was happening, and so inserted the > 8GB threshold to cover that case. Ah, I see. I understood that we want to update the FSM every 8GB, but I didn't understand that we wanted to check if we were at that 8GB only after a round of index vacuuming. That would explain why we also had to do it in the no indexes case -- because, as you say, there wouldn't be a round of index vacuuming. This does mean that something is not quite right with 0001 as well as 0004. We'd end up checking if we are at 8GB much more often. I should probably find a way to replicate the cadence on master. > I don't really know what to make of > all of this. On a happy PostgreSQL system, doing anything after each > round of index vacuuming means doing it once, because multiple rounds > of indexing vacuum are extremely expensive and we hope that it won't > ever occur. From that point of view, the 8GB threshold is better, > because it means that when we vacuum a large relation, space should > become visible to the rest of the system incrementally without needing > to wait for the entire vacuum to finish. On the other hand, we also > have this idea that we want to record free space in the FSM once, > after the last time we touch the page. 
Given that behavior, vacuuming > the FSM every 8GB when we haven't yet done index vacuuming wouldn't > accomplish much of anything, because we haven't updated it for the > pages we just touched. On the third hand, the current behavior seems > slightly ridiculous, because pruning the page is where we're mostly > going to free up space, so we might be better off just updating the > FSM then instead of waiting. That free space could be mighty useful > during the long wait between pruning and marking line pointers unused. > On the fourth hand, that seems like a significant behavior change that > we might not want to undertake without a bunch of research that we > might not want to do right now -- and if we did do it, should we then > update the FSM a second time after marking line pointers unused? I suspect we'd need to do some testing of various scenarios to justify such a change. - Melanie
modify first-word capitalisation of some messages
Hi. PSA a small patch to adjust the first-word capitalisation of some errmsg/errdetail/errhint so they comply with the guidelines. == Kind Regards, Peter Smith. Fujitsu Australia v1-0001-Error-message-capitalisation.patch Description: Binary data
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Alexander Korotkov writes: > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > wrote: >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which >> indicate "immediately error out" and 'just ignore the failure' >> respectively, but these options hardly seem to denote a 'location', >> and appear more like an 'action'. I somewhat suspect that this >> parameter name was initially conceived with the assumption that it would >> take file names or similar parameters. I'm not sure if others will >> agree, but I think the parameter name might not be the best >> choice. For instance, considering the addition of the third value >> 'log', something like on_error_action (error, ignore, log) would be >> more intuitively understandable. What do you think? > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > the next word will be location, not action. With some stretch we can > assume 'error' to be location. I think it would be even more stretchy > to think that SAVE_ERROR_TO is followed by action. The other problem with this terminology is that with 'none', what it is doing is the exact opposite of "saving" the errors. I agree we need a better name. Kyotaro-san's suggestion isn't bad, though I might shorten it to error_action {error|ignore|log} (or perhaps "stop" instead of "error")? You will need a separate parameter anyway to specify the destination of "log", unless "none" became an illegal table name when I wasn't looking. I don't buy that one parameter that has some special values while other values could be names will be a good design. Moreover, what if we want to support (say) log-to-file along with log-to-table? Trying to distinguish a file name from a table name without any other context seems impossible. regards, tom lane
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan wrote: > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > wrong idea. If it's such a good idea then why not apply it all the > time? That is, why not apply it independently of whether nindexes==0 > in the current VACUUM operation? (You know, just like with > FAILSAFE_EVERY_PAGES.) Actually, I suppose that we couldn't apply it independently of nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our second pass over the heap takes place for those LP_DEAD-containing heap pages scanned since the last round of index/heap vacuuming took place (or since VACUUM began). We need to make sure that the FSM has the most recent possible information known to VACUUM, which would break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. -- Peter Geoghegan
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 3:58 PM Robert Haas wrote: > > Ah, I realize I was not clear. I am now talking about inconsistencies > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > > the freespace map during the course of vacuuming the heap relation. > > Fair enough, but I'm still not quite sure exactly what the question > is. It looks to me like the current code, when there are indexes, > vacuums the FSM after each round of index vacuuming. When there are no > indexes, doing it after each round of index vacuuming would mean never > doing it, so instead we vacuum the FSM every ~8GB. I assume what > happened here is that somebody decided doing it after each round of > index vacuuming was the "right thing," and then realized that was not > going to work if no index vacuuming was happening, and so inserted the > 8GB threshold to cover that case. Note that VACUUM_FSM_EVERY_PAGES is applied against the number of rel_pages "processed" so far -- *including* any pages that were skipped using the visibility map. It would make a bit more sense if it was applied against scanned_pages instead (just like FAILSAFE_EVERY_PAGES has been since commit 07eef53955). In other words, VACUUM_FSM_EVERY_PAGES is applied against a thing that has only a very loose relationship with physical work performed/time elapsed. I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the wrong idea. If it's such a good idea then why not apply it all the time? That is, why not apply it independently of whether nindexes==0 in the current VACUUM operation? (You know, just like with FAILSAFE_EVERY_PAGES.) -- Peter Geoghegan
Re: On login trigger: take three
On Mon, Jan 15, 2024 at 11:29 AM Daniel Gustafsson wrote: > > On 13 Jan 2024, at 17:00, Alexander Lakhin wrote: > > > I suspected that this failure was caused by autovacuum, and I've managed to > > reproduce it without Valgrind or slowing down a machine. > > This might be due to the fact that the cleanup codepath to remove the flag > when > there is no login event trigger doesn't block on locking pg_database to avoid > stalling connections. There are no guarantees when the flag is cleared, so a > test like this will always have potential to be flaky it seems. +1 Thank you, Daniel. I think you described everything absolutely correctly. As I wrote upthread, it doesn't seem much could be done with this, at least within a regression test. So, I just removed this query from the test. -- Regards, Alexander Korotkov
Re: Assertion failure with epoch when replaying standby records for 2PC
Hi, Michael! On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier wrote: > rorqual has failed today with a very interesting failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual=2024-01-17%2005%3A06%3A31 > > This has caused an assertion failure for a 2PC transaction when > replaying one of the tests from the main regression suite: > 2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was > at log time 2024-01-17 05:08:22.920244+00. > TRAP: failed Assert("epoch > 0"), File: > "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610 > postgres: standby_1: startup recovering > 0001000C(ExceptionalCondition+0x83)[0x55746c7838c1] > postgres: standby_1: startup recovering > 0001000C(+0x194f0e)[0x55746c371f0e] > postgres: standby_1: startup recovering > 0001000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120] > postgres: standby_1: startup recovering > 0001000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357] > postgres: standby_1: startup recovering > 0001000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5] > postgres: standby_1: startup recovering > 0001000C(standby_redo+0xe1)[0x55746c621490] > postgres: standby_1: startup recovering > 0001000C(PerformWalRecovery+0xa5e)[0x55746c392404] > postgres: standby_1: startup recovering > 0001000C(StartupXLOG+0x3ac)[0x55746c3862b8] > postgres: standby_1: startup recovering > 0001000C(StartupProcessMain+0xd9)[0x55746c5a60f6] > postgres: standby_1: startup recovering > 0001000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd] > postgres: standby_1: startup recovering > 0001000C(+0x3c4235)[0x55746c5a1235] > postgres: standby_1: startup recovering > 0001000C(PostmasterMain+0x1401)[0x55746c5a5a10] > postgres: standby_1: startup recovering > 0001000C(main+0x835)[0x55746c4e90ce] > /lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785] > postgres: standby_1: startup recovering > 
0001000C(_start+0x21)[0x55746c2b61d1] > > This refers to the following in twophase.c with > AdjustToFullTransactionId(): > nextXid = XidFromFullTransactionId(nextFullXid); > epoch = EpochFromFullTransactionId(nextFullXid); > > if (unlikely(xid > nextXid)) > { > /* Wraparound occurred, must be from a prev epoch. */ > Assert(epoch > 0); > epoch--; > } > > This would mean that we've found a way to get a negative epoch, which > should not be possible. > > Alexander, you have added this code in 5a1dfde8334b when switching the > 2PC file names to use FullTransactionIds. Could you check please? Thank you for reporting! I'm going to look at this in the next couple of days. -- Regards, Alexander Korotkov
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi wrote: > At Wed, 17 Jan 2024 14:38:54 +0900, torikoshia > wrote in > > Hi, > > > > Thanks for applying! > > > > > + errmsg_plural("%zd row were skipped due to data type > > > incompatibility", > > > > Sorry, I just noticed it, but 'were' should be 'was' here? > > > > >> BTW I'm thinking we should add a column to pg_stat_progress_copy that > > >> counts soft errors. I'll suggest this in another thread. > > > Please do! > > > > I've started it here: > > > > https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff...@oss.nttdata.com > > Switching topics, this commit (9e2d870119) adds the following help message: > > > > "COPY { %s [ ( %s [, ...] ) ] | ( %s ) }\n" > > "TO { '%s' | PROGRAM '%s' | STDOUT }\n" > > ... > > "SAVE_ERROR_TO '%s'\n" > > ... > > _("location"), > > On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > indicate "immediately error out" and 'just ignore the failure' > respectively, but these options hardly seem to denote a 'location', > and appear more like an 'action'. I somewhat suspect that this > parameter name was initially conceived with the assumption that it would > take file names or similar parameters. I'm not sure if others will > agree, but I think the parameter name might not be the best > choice. For instance, considering the addition of the third value > 'log', something like on_error_action (error, ignore, log) would be > more intuitively understandable. What do you think? Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes the next word will be location, not action. With some stretch we can assume 'error' to be location. I think it would be even more stretchy to think that SAVE_ERROR_TO is followed by action. Probably, we can replace SAVE_ERROR_TO with another name which could be naturally followed by action, but I don't have something appropriate in mind. However, I'm not a native English speaker and could certainly be missing something. 
-- Regards, Alexander Korotkov
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Wed, Jan 17, 2024 at 7:38 AM torikoshia wrote: > > Hi, > > Thanks for applying! > > > + errmsg_plural("%zd row were skipped due > > to data type incompatibility", > > Sorry, I just noticed it, but 'were' should be 'was' here? Sure, the fix is pushed. -- Regards, Alexander Korotkov
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 3:12 PM Melanie Plageman wrote: > > Reviewing 0001, consider the case where a table has no indexes. > > Pre-patch, PageTruncateLinePointerArray() will happen when > > lazy_vacuum_heap_page() is called; post-patch, it will not occur. > > That's a loss. Should we care? On the plus side, visibility map > > repair, if required, can now take place. That's a gain. > > I thought that this wasn't an issue because heap_page_prune_execute() > calls PageRepairFragmentation() which similarly modifies pd_lower and > sets the hint bit about free line pointers. Ah, OK, I didn't understand that PageRepairFragmentation() does what is also done by PageTruncateLinePointerArray(). > Yes, I also spent some time thinking about this. In master, we do > always call lazy_scan_new_or_empty() before calling > lazy_scan_noprune(). The code is aiming to ensure we call > lazy_scan_new_or_empty() once before calling either of > lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call > lazy_scan_new_or_empty() unconditionally first because of the > different lock types expected. But, your structure has solved that. > I've used a version of your example code above in attached v9. It is > much nicer. Oh, OK, I see it now. I missed that lazy_scan_new_or_empty() was called either way. Glad that my proposed restructuring managed to be helpful despite that confusion, though. :-) At a quick glance, I also like the way this looks. I'll review it more thoroughly later. Does this patch require 0002 and 0003 or could it equally well go first? I confess that I don't entirely understand why we want 0002 and 0003. > Ah, I realize I was not clear. I am now talking about inconsistencies > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > the freespace map during the course of vacuuming the heap relation. Fair enough, but I'm still not quite sure exactly what the question is. 
It looks to me like the current code, when there are indexes, vacuums the FSM after each round of index vacuuming. When there are no indexes, doing it after each round of index vacuuming would mean never doing it, so instead we vacuum the FSM every ~8GB. I assume what happened here is that somebody decided doing it after each round of index vacuuming was the "right thing," and then realized that was not going to work if no index vacuuming was happening, and so inserted the 8GB threshold to cover that case. I don't really know what to make of all of this. On a happy PostgreSQL system, doing anything after each round of index vacuuming means doing it once, because multiple rounds of indexing vacuum are extremely expensive and we hope that it won't ever occur. From that point of view, the 8GB threshold is better, because it means that when we vacuum a large relation, space should become visible to the rest of the system incrementally without needing to wait for the entire vacuum to finish. On the other hand, we also have this idea that we want to record free space in the FSM once, after the last time we touch the page. Given that behavior, vacuuming the FSM every 8GB when we haven't yet done index vacuuming wouldn't accomplish much of anything, because we haven't updated it for the pages we just touched. On the third hand, the current behavior seems slightly ridiculous, because pruning the page is where we're mostly going to free up space, so we might be better off just updating the FSM then instead of waiting. That free space could be mighty useful during the long wait between pruning and marking line pointers unused. On the fourth hand, that seems like a significant behavior change that we might not want to undertake without a bunch of research that we might not want to do right now -- and if we did do it, should we then update the FSM a second time after marking line pointers unused? I'm not sure if any of this is answering your actual question, though. 
-- Robert Haas EDB: http://www.enterprisedb.com
Re: initdb's -c option behaves wrong way?
I wrote: > However ... I don't like the patch much. It seems to have left > the code in a rather random state. Why, for example, didn't you > keep all the code that constructs the "newline" value together? After thinking about it a bit more, I don't see why you didn't just s/strncmp/strncasecmp/ and call it good. The messiness seems to be a result of your choice to replace the GUC's case as shown in the file with the case used on the command line, which is not better IMO. We don't change our mind about the canonical spelling of a GUC because somebody varied the case in a SET command. regards, tom lane
Re: initdb's -c option behaves wrong way?
Robert Haas writes: > On Wed, Jan 17, 2024 at 2:31 PM Daniel Gustafsson wrote: >> Agreed, I think the patch as it stands now where it replaces case >> insensitive, >> while keeping the original casing, is the best path forward. The issue exist >> in 16 as well so I propose a backpatch to there. > I like that approach, too. I could go either way on back-patching. It > doesn't seem important, but it probably won't break anything, either. We just added this switch in 16, so I think backpatching to keep all the branches consistent is a good idea. However ... I don't like the patch much. It seems to have left the code in a rather random state. Why, for example, didn't you keep all the code that constructs the "newline" value together? I'm also unconvinced by the removal of the "assume there's only one match" break --- if we need to support multiple matches I doubt that's sufficient. regards, tom lane
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 12:17 PM Robert Haas wrote: > > On Tue, Jan 16, 2024 at 6:07 PM Melanie Plageman > wrote: > > Attached v8 patch set is rebased over this. > > Reviewing 0001, consider the case where a table has no indexes. > Pre-patch, PageTruncateLinePointerArray() will happen when > lazy_vacuum_heap_page() is called; post-patch, it will not occur. > That's a loss. Should we care? On the plus side, visibility map > repair, if required, can now take place. That's a gain. I thought that this wasn't an issue because heap_page_prune_execute() calls PageRepairFragmentation() which similarly modifies pd_lower and sets the hint bit about free line pointers. > I'm otherwise satisfied with this patch now, except for some extremely > minor nitpicking: > > + * For now, pass mark_unused_now == false regardless of whether > or > > Personally, I would write "pass mark_unused_now as false" here, > because we're not testing equality. Or else "pass mark_unused_now = > false". This is not an equality test. > > + * During pruning, the caller may have passed mark_unused_now == > true, > > Again here, but also, this is referring to the name of a parameter to > a function whose name is not given. I think this should either > talk fully in terms of code ("When heap_page_prune was called, > mark_unused_now may have been passed as true, which allows would-be > LP_DEAD items to be made LP_USED instead.") or fully in English > ("During pruning, we may have decided to mark would-be dead items as > unused."). Fixed both of the above issues as suggested in attached v9 > > In 0004, I've taken the approach you seem to favor and combined the FSM > > updates from the prune and no prune cases in lazy_scan_heap() instead > > of pushing the FSM updates into lazy_scan_prune() and > > lazy_scan_noprune(). > > I do like that approach. > > I think do_prune can be declared one scope inward, in the per-block > for loop. 
I would probably initialize it to true so I could drop the > stubby else block: > > + /* If we do get a cleanup lock, we will definitely prune */ > + else > + do_prune = true; > > And then I'd probably write the call as if (!lazy_scan_noprune()) > doprune = true. > > If I wanted to stick with not initializing do_prune, I'd put the > comment inside as /* We got the cleanup lock, so we will definitely > prune */ and add braces since that makes it a two-line block. If we don't unconditionally set do_prune using the result of lazy_scan_noprune(), then we cannot leave do_prune uninitialized. I preferred having it uninitialized, as it didn't imply that doing pruning was the default. Also, it made it simpler to have that comment about always pruning when we get the cleanup lock. However, with the changes you mentioned below (got_cleanup_lock), this discussion is moot. > > I did not guard against calling lazy_scan_new_or_empty() a second time > > in the case that lazy_scan_noprune() was called. I can do this. I > > mentioned upthread I found it confusing for lazy_scan_new_or_empty() > > to be guarded by if (do_prune). The overhead of calling it wouldn't be > > terribly high. I can change that based on your opinion of what is > > better. > > To me, the relevant question here isn't reader confusion, because that > can be cleared up with a comment explaining why we do or do not test > do_prune. Indeed, I'd say it's a textbook example of when you should > comment a test: when it might look wrong to the reader but you have a > good reason for doing it. > > But that brings me to what I think the real question is here: do we, > uh, have a good reason for doing it? At first blush the structure > looks a bit odd here. lazy_scan_new_or_empty() is intended to handle > PageIsNew() and PageIsEmpty() cases, lazy_scan_noprune() the cases > where we process the page without a cleanup lock, and > lazy_scan_prune() the regular case. 
So you might think that > lazy_scan_new_or_empty() would always be applied *first*, that we > would then conditionally apply lazy_scan_noprune(), and finally > conditionally apply lazy_scan_prune(). Like this: > > bool got_cleanup_lock = ConditionalLockBufferForCleanup(buf); > if (lazy_scan_new_or_empty()) > continue; > if (!got_cleanup_lock && !lazy_scan_noprune()) > { > LockBuffer(buf, BUFFER_LOCK_UNLOCK); > LockBufferForCleanup(buf); > got_cleanup_lock = true; > } > if (got_cleanup_lock) > lazy_scan_prune(); > > The current organization of the code seems to imply that we don't need > to worry about the PageIsNew() and PageIsEmpty() cases before calling > lazy_scan_noprune(), and at the moment I'm not understanding why that > should be the case. I wonder if this is a bug, or if I'm just > confused. Yes, I also spent some time thinking about this. In master, we do always call lazy_scan_new_or_empty() before calling lazy_scan_noprune(). The code is aiming to ensure we call lazy_scan_new_or_empty() once before calling either of lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call lazy_scan_new_or_empty() unconditionally first because of the different lock types expected. But, your structure has solved that. I've used a version of your example code above in attached v9. It is much nicer. - Melanie
Re: cleanup patches for incremental backup
On Wed, Jan 17, 2024 at 1:42 PM Matthias van de Meent wrote: > Sure, added in attached. I think this mostly looks good now but I just realized that I think this needs rephrasing: + To restore incremental backups the tool + is used, which combines incremental backups with a base backup and + WAL to restore a + database cluster to + a consistent state. The way this is worded, at least to me, it makes it sound like pg_combinebackup is going to do the WAL recovery for you, which it isn't. Maybe: To restore incremental backups the tool is used, which combines incremental backups with a base backup. Afterwards, recovery can use WAL to bring the database cluster to a consistent state. -- Robert Haas EDB: http://www.enterprisedb.com
Re: initdb's -c option behaves wrong way?
On Wed, Jan 17, 2024 at 2:31 PM Daniel Gustafsson wrote: > Agreed, I think the patch as it stands now where it replaces case insensitive, > while keeping the original casing, is the best path forward. The issue exist > in 16 as well so I propose a backpatch to there. I like that approach, too. I could go either way on back-patching. It doesn't seem important, but it probably won't break anything, either. -- Robert Haas EDB: http://www.enterprisedb.com
Re: psql JSON output format
On Wed, Jan 17, 2024 at 4:30 AM Laurenz Albe wrote: > As mentioned in my other mail, I was talking about the psql output > format "csv" rather than about COPY. Oh. Well, I think it's sad that the psql format csv has that property. Why doesn't it adopt COPY's handling? > I agree that it is desirable to lose as little information as possible. > But if we want to format query output as JSON, we have a couple of > requirements that cannot all be satisfied: > > 1. lose no information ("round-trip safe") > > 2. don't double quote numbers, booleans and other JSON values > > 3. don't skip any table column in the output > > Christoph's original patch didn't satisfy #2, and his current version > doesn't satisfy #1. Do you think that skipping NULL columns would be > the best solution? We don't do that in the to_json() function, which > also renders SQL NULL as JSON null. Let me start by clarifying that I'm OK with sacrificing round-trippability here as long as we do it thoughtfully. "Round-trippability is important but X is more important and we cannot have both for Y reasons" seems like a potentially fine argument to me; I'm only objecting to an argument of the form "round-trippability doesn't even matter." My previous comment was a bit of a drive-by remark on that specifically rather than a strong opinion about what exactly we ought to do here. I guess the specific issue here is around a json(b) column that is null at the SQL level vs one that contains a JSON null. How do we distinguish those cases? I think entirely omitting null columns could be a way forward, but I don't know if that would cause other problems for users. I'm not quite sure that addresses all the issues, though. For instance, consider that 1.00::numeric and 1.0::numeric are equal but distinguishable. If those get rendered into the JSON unquoted as 1.00 and 1.0, respectively, is that going to round-trip properly? What about float8 values where extra_float_digits=3 is needed to properly round trip? 
If we take PostgreSQL's array data types and turn them into JSON arrays, what happens with non-default bounds? I know how we're going to turn '{1,2}'::int[] into a JSON array, or at least I assume I do, but what in the world are we going to do about '[-3:-2]={1,2}'? As much as I think round-trippability is good, getting it to 100% here is probably a good bit of work. And maybe that work isn't worth doing or involves too much collateral damage. But I do think it has positive value. If we produce output that could be ingested back into PG later with the right tool, that leaves the door open for someone to build the tool later even if we don't have it today. If we produce output that loses information, no tool built later can make up for the loss. -- Robert Haas EDB: http://www.enterprisedb.com
Re: More new SQL/JSON item methods
On 17.01.24 10:03, Jeevan Chalke wrote: I added unary '+' and '-' support as well and thus thought of having separate rules altogether rather than folding those in. Per SQL standard, the precision and scale arguments are unsigned integers, so unary plus and minus signs are not supported. So my patch removes that support, but I didn't adjust the regression tests for that. However, PostgreSQL numeric casting does support a negative scale. Here is an example:

# select '12345'::numeric(4,-2);
 numeric
---------
   12300
(1 row)

And thus thought of supporting those. Do we want this JSON item method to behave differently here? Ok, it would make sense to support this in SQL/JSON as well. I will merge them all into one and will try to keep them in the order specified in sql_features.txt. However, for documentation, it makes more sense to keep them in logical order rather than alphabetical order. What are your views on this? The documentation can be in a different order.
Re: initdb's -c option behaves wrong way?
> On 17 Jan 2024, at 18:05, Tom Lane wrote:
>
> Alvaro Herrera writes:
>> Hmm, how about raising an error if multiple options are given targeting
>> the same GUC?
>
> I don't see any reason to do that. The underlying configuration
> files don't complain about duplicate entries, they just take the
> last setting.

Agreed. I think the patch as it stands now, where it replaces case-insensitively while keeping the original casing, is the best path forward. The issue exists in 16 as well, so I propose backpatching to there.

-- Daniel Gustafsson
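[Editorial illustration; a hypothetical standalone sketch, not initdb's actual code.] The behavior agreed on here — match GUC names case-insensitively, let the last setting win, but preserve the first-seen spelling of the name — can be sketched like this:

```c
#include <assert.h>
#include <string.h>
#include <strings.h>			/* strcasecmp() (POSIX) */

struct guc_opt
{
	const char *name;
	const char *value;
};

/* Record one -c option.  If the GUC was already seen under any casing,
 * overwrite its value (the last setting wins) but keep the spelling of
 * the name from the first occurrence; otherwise append a new entry. */
static void
set_guc(struct guc_opt *opts, int *n, const char *name, const char *value)
{
	for (int i = 0; i < *n; i++)
	{
		if (strcasecmp(opts[i].name, name) == 0)
		{
			opts[i].value = value;
			return;
		}
	}
	opts[*n].name = name;
	opts[*n].value = value;
	(*n)++;
}
```

With Kyotaro's example, `-cwork_mem=100 -cWORK_MEM=1000 -cWork_mem=2000` would leave a single entry spelled "work_mem" with value "2000".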
Re: cleanup patches for incremental backup
On Tue, 16 Jan 2024 at 21:49, Robert Haas wrote: > > On Tue, Jan 16, 2024 at 3:22 PM Matthias van de Meent > wrote: > + A special base > backup > + that for some WAL-logged relations only contains the pages that were > + modified since a previous backup, as opposed to the full relation data > of > + normal base backups. Like base backups, it is generated by the tool > + . > > Could we say "that for some files may contain only those pages that > were modified since a previous backup, as opposed to the full contents > of every file"? Sure, added in attached. > + To restore incremental backups the tool > + is used, which combines the incremental backups with a base backup and > + [...] > I wondered if this needed to be clearer that the chain of backups > could have length > 2. But on further reflection, I think it's fine, > unless you feel otherwise. I removed "the" from the phrasing "the incremental backups", which makes it a bit less restricted. > The rest LGTM. In the latest patch I also fixed the casing of "Incremental Backup" to "... backup", to be in line with most other multi-word items. Thanks! Kind regards, Matthias van de Meent Neon (https://neon.tech) v3-0001-incremental-backups-Add-new-items-to-glossary-mon.patch Description: Binary data
Re: New Table Access Methods for Multi and Single Inserts
On Tue, Aug 1, 2023 at 10:32 PM Jacob Champion wrote: > > On Tue, Aug 1, 2023 at 9:31 AM Bharath Rupireddy > wrote: > > Thanks. Finally, I started to spend time on this. Just curious - may > > I know the discussion in/for which this patch is referenced? What was > > the motive? Is it captured somewhere? > > It may not have been the only place, but we at least touched on it > during the unconference: > > > https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference#Table_AMs > > We discussed two related-but-separate ideas: > 1) bulk/batch operations and > 2) maintenance of TAM state across multiple related operations. Thank you. I'm attaching the v8 patch set here, which includes use of the new insert TAMs for COPY FROM. With this, postgres will not only have the new TAM for inserts, but can also make the following commands faster - CREATE TABLE AS, SELECT INTO, CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW and COPY FROM. I'll perform some testing in the coming days and post the results here; until then, I appreciate any feedback on the patches. I've also added this proposal to CF - https://commitfest.postgresql.org/47/4777/.
-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From cbdf2935be360017c0d62479e879630d4fec8766 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 17 Jan 2024 16:44:19 + Subject: [PATCH v8] New TAMs for inserts --- src/backend/access/heap/heapam.c | 224 +++ src/backend/access/heap/heapam_handler.c | 9 + src/include/access/heapam.h | 49 + src/include/access/tableam.h | 143 +++ 4 files changed, 425 insertions(+) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 707460a536..7df305380e 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -68,6 +68,7 @@ #include "utils/datum.h" #include "utils/inval.h" #include "utils/lsyscache.h" +#include "utils/memutils.h" #include "utils/relcache.h" #include "utils/snapmgr.h" #include "utils/spccache.h" @@ -2446,6 +2447,229 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples, pgstat_count_heap_insert(relation, ntuples); } +/* + * Initialize state required for an insert a single tuple or multiple tuples + * into a heap. 
+ */ +TableInsertState * +heap_insert_begin(Relation rel, CommandId cid, int am_flags, int insert_flags) +{ + TableInsertState *tistate; + + tistate = palloc0(sizeof(TableInsertState)); + tistate->rel = rel; + tistate->cid = cid; + tistate->am_flags = am_flags; + tistate->insert_flags = insert_flags; + + if ((am_flags & TABLEAM_MULTI_INSERTS) != 0 || + (am_flags & TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY)) + tistate->am_data = palloc0(sizeof(HeapInsertState)); + + if ((am_flags & TABLEAM_MULTI_INSERTS) != 0) + { + HeapMultiInsertState *mistate; + + mistate = palloc0(sizeof(HeapMultiInsertState)); + mistate->slots = palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS); + + mistate->context = AllocSetContextCreate(CurrentMemoryContext, + "heap_multi_insert_v2 memory context", + ALLOCSET_DEFAULT_SIZES); + + ((HeapInsertState *) tistate->am_data)->mistate = mistate; + } + + if ((am_flags & TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY) != 0) + ((HeapInsertState *) tistate->am_data)->bistate = GetBulkInsertState(); + + return tistate; +} + +/* + * Insert a single tuple into a heap. + */ +void +heap_insert_v2(TableInsertState * state, TupleTableSlot *slot) +{ + bool shouldFree = true; + HeapTuple tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree); + BulkInsertState bistate = NULL; + + Assert(state->am_data != NULL && + ((HeapInsertState *) state->am_data)->mistate == NULL); + + /* Update tuple with table oid */ + slot->tts_tableOid = RelationGetRelid(state->rel); + tuple->t_tableOid = slot->tts_tableOid; + + if (state->am_data != NULL && + ((HeapInsertState *) state->am_data)->bistate != NULL) + bistate = ((HeapInsertState *) state->am_data)->bistate; + + /* Perform insertion, and copy the resulting ItemPointer */ + heap_insert(state->rel, tuple, state->cid, state->insert_flags, +bistate); + ItemPointerCopy(&tuple->t_self, &slot->tts_tid); + + if (shouldFree) + pfree(tuple); +} + +/* + * Create/return next free slot from multi-insert buffered slots array.
+ */ +TupleTableSlot * +heap_multi_insert_next_free_slot(TableInsertState * state) +{ + TupleTableSlot *slot; + HeapMultiInsertState *mistate; + + Assert(state->am_data != NULL && + ((HeapInsertState *) state->am_data)->mistate != NULL); + + mistate = ((HeapInsertState *) state->am_data)->mistate; + slot = mistate->slots[mistate->cur_slots]; + + if (slot == NULL) + { + slot = table_slot_create(state->rel, NULL); + mistate->slots[mistate->cur_slots] = slot; + } + else + ExecClearTuple(slot); + + return slot; +} + +/* + * Store passed-in tuple into in-memory buffered slots.
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Tue, Jan 16, 2024 at 6:07 PM Melanie Plageman wrote: > Attached v8 patch set is rebased over this. Reviewing 0001, consider the case where a table has no indexes. Pre-patch, PageTruncateLinePointerArray() will happen when lazy_vacuum_heap_page() is called; post-patch, it will not occur. That's a loss. Should we care? On the plus side, visibility map repair, if required, can now take place. That's a gain. I'm otherwise satisfied with this patch now, except for some extremely minor nitpicking: + * For now, pass mark_unused_now == false regardless of whether or Personally, I would write "pass mark_unused_now as false" here, because we're not testing equality. Or else "pass mark_unused_now = false". This is not an equality test. + * During pruning, the caller may have passed mark_unused_now == true, Again here, but also, this is referring to the name of a parameter to a function whose name is not given. I think this should either talk fully in terms of code ("When heap_page_prune was called, mark_unused_now may have been passed as true, which allows would-be LP_DEAD items to be made LP_USED instead.") or fully in English ("During pruning, we may have decided to mark would-be dead items as unused."). > In 0004, I've taken the approach you seem to favor and combined the FSM > updates from the prune and no prune cases in lazy_scan_heap() instead > of pushing the FSM updates into lazy_scan_prune() and > lazy_scan_noprune(). I do like that approach. I think do_prune can be declared one scope inward, in the per-block for loop. I would probably initialize it to true so I could drop the stubby else block: + /* If we do get a cleanup lock, we will definitely prune */ + else + do_prune = true; And then I'd probably write the call as if (!lazy_scan_noprune()) do_prune = true. If I wanted to stick with not initializing do_prune, I'd put the comment inside as /* We got the cleanup lock, so we will definitely prune */ and add braces since that makes it a two-line block.
> I did not guard against calling lazy_scan_new_or_empty() a second time > in the case that lazy_scan_noprune() was called. I can do this. I > mentioned upthread I found it confusing for lazy_scan_new_or_empty() > to be guarded by if (do_prune). The overhead of calling it wouldn't be > terribly high. I can change that based on your opinion of what is > better. To me, the relevant question here isn't reader confusion, because that can be cleared up with a comment explaining why we do or do not test do_prune. Indeed, I'd say it's a textbook example of when you should comment a test: when it might look wrong to the reader but you have a good reason for doing it. But that brings me to what I think the real question is here: do we, uh, have a good reason for doing it? At first blush the structure looks a bit odd here. lazy_scan_new_or_empty() is intended to handle PageIsNew() and PageIsEmpty() cases, lazy_scan_noprune() the cases where we process the page without a cleanup lock, and lazy_scan_prune() the regular case. So you might think that lazy_scan_new_or_empty() would always be applied *first*, that we would then conditionally apply lazy_scan_noprune(), and finally conditionally apply lazy_scan_prune(). Like this:

bool got_cleanup_lock = ConditionalLockBufferForCleanup(buf);
if (lazy_scan_new_or_empty())
    continue;
if (!got_cleanup_lock && !lazy_scan_noprune())
{
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    LockBufferForCleanup(buf);
    got_cleanup_lock = true;
}
if (got_cleanup_lock)
    lazy_scan_prune();

The current organization of the code seems to imply that we don't need to worry about the PageIsNew() and PageIsEmpty() cases before calling lazy_scan_noprune(), and at the moment I'm not understanding why that should be the case. I wonder if this is a bug, or if I'm just confused. > The big open question/functional change is when we consider vacuuming > the FSM. Previously, we only considered vacuuming the FSM in the no > indexes, has dead items case.
After combining the FSM updates from > lazy_scan_prune()'s no indexes/has lpdead items case, > lazy_scan_prune()'s regular case, and lazy_scan_noprune(), all of them > consider vacuuming the FSM. I could guard against this, but I wasn't > sure why we would want to only vacuum the FSM in the no indexes/has > dead items case. I don't get it. Conceptually, I agree that we don't want to be inconsistent here without some good reason. One of the big advantages of unifying different code paths is that you avoid being accidentally inconsistent. If different things are different it shows up as a test in the code instead of just having different code paths in different places that may or may not match. But I thought the whole point of 45d395cd75ffc5b4c824467140127a5d11696d4c was to iron out the existing inconsistencies so that we could unify this code without having to change any more behavior. In particular, I
Re: initdb's -c option behaves wrong way?
Alvaro Herrera writes: > Hmm, how about raising an error if multiple options are given targeting > the same GUC? I don't see any reason to do that. The underlying configuration files don't complain about duplicate entries, they just take the last setting. regards, tom lane
Re: More new SQL/JSON item methods
On 2024-01-17 We 04:03, Jeevan Chalke wrote: On Mon, Jan 15, 2024 at 7:41 PM Peter Eisentraut wrote: Overall, I think it would be better if you combined all three of these patches into one. Right now, you have arranged these as incremental features, and as a result of that, the additions to the JsonPathItemType enum and the grammar keywords etc. are ordered in the way you worked on these features, I guess. It would be good to maintain a bit of sanity to put all of this together and order all the enums and everything else for example in the order they are in the sql_features.txt file (which is alphabetical, I suppose). At this point I suspect we'll end up committing this whole feature set together anyway, so we might as well organize it that way. OK. I will merge them all into one and will try to keep them in the order specified in sql_features.txt. However, for documentation, it makes more sense to keep them in logical order than the alphabetical one. What are your views on this? I agree that we should order the documentation logically. Users don't care how we organize the code etc, but they do care about docs have sensible structure. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: Build the docs if there are changes in docs and don't run other tasks if the changes are only in docs
On 2023-10-06 Fr 10:07, Tom Lane wrote: Peter Eisentraut writes: I don't have a good sense of what you are trying to optimize for. If it's the mainline build-on-every-commit type, then I wonder how many commits would really be affected by this. Like, how many commits touch only a README file. If it's for things like the cfbot, then I think the time-triggered builds would be more frequent than new patch versions, so I don't know if these kinds of optimizations would affect anything. As a quick cross-check, I searched our commit log to see how many README-only commits there were so far this year. I found 11 since January. (Several were triggered by the latest round of pgindent code and process changes, so maybe this is more than typical.) Not sure what that tells us about the value of changing the CI logic, but it does seem like it could be worth the one-liner change needed to teach buildfarm animals to ignore READMEs. - trigger_exclude => qr[^doc/|\.po$], + trigger_exclude => qr[^doc/|/README$|\.po$], I've put that in the sample config file for the next release. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: psql JSON output format
On 2024-01-17 We 03:52, Laurenz Albe wrote: On Tue, 2024-01-16 at 11:49 -0500, Andrew Dunstan wrote: On 2024-01-16 Tu 11:07, Laurenz Albe wrote: On Tue, 2024-01-09 at 16:51 +, Dean Rasheed wrote: On Tue, 9 Jan 2024 at 14:35, Christoph Berg wrote: Getting it print numeric/boolean without quotes was actually easy, as well as json(b). Implemented as the attached v2 patch. But: not quoting json means that NULL and 'null'::json will both be rendered as 'null'. That strikes me as a pretty undesirable conflict. Does the COPY patch also do that? Yes. Perhaps what needs to happen is for a NULL column to be omitted entirely from the output. I think the COPY TO json patch would have to do that if COPY FROM json were to be added later, to make it round-trip safe. I think the behavior is fine as it is. I'd expect both NULL and JSON "null" to be rendered as "null". I think the main use case for a feature like this is people who need the result in JSON for further processing somewhere else. "Round-trip safety" is not so important. If you want to move data from PostgreSQL to PostgreSQL, you use the plain or the binary format. The CSV format by default renders NULL and empty strings identical, and I don't think anybody objects to that. This is absolutely not true. CSV format with default settings is and has been from the beginning designed to be round trippable. Sorry for being unclear. I wasn't talking about COPY, but about the psql output format:

CREATE TABLE xy (a integer, b text);
INSERT INTO xy VALUES (1, 'one'), (2, NULL), (3, '');
\pset format csv
Output format is csv.
TABLE xy;
a,b
1,one
2,
3,

I think the reason nobody's complained about it is quite possibly that very few people have used it. That's certainly the case with me - if I'd noticed it I would have complained. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
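[Editorial illustration; a simplified sketch of the convention, with escaping of ordinary values elided.] The reason COPY's CSV output is round-trippable while the psql output above is not comes down to one encoding choice: a round-trip-safe encoder writes NULL as a completely empty field and the empty string as a quoted "", so the two stay distinguishable:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode one CSV field so that NULL and '' stay distinguishable:
 * NULL -> empty field, '' -> "", anything else verbatim (escaping of
 * embedded quotes/commas is elided in this sketch). */
static void
csv_field(const char *val, char *out, size_t outlen)
{
	if (val == NULL)
		out[0] = '\0';					/* NULL: nothing at all */
	else if (val[0] == '\0')
		snprintf(out, outlen, "\"\"");	/* empty string: quoted */
	else
		snprintf(out, outlen, "%s", val);
}
```

An encoder that emits an empty field for both cases, as the psql csv format above does, cannot be decoded losslessly.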
[pg16]Question about building snapshot in logical decoding
If txn1 begins after SNAPBUILD_BUILDING_SNAPSHOT and commits before SNAPBUILD_FULL_SNAPSHOT (so txn1 will not be processed by DecodeCommit), and txn2 begins after SNAPBUILD_FULL_SNAPSHOT and commits after SNAPBUILD_CONSISTENT (so txn2 will be replayed), how is it ensured that txn2 can see the changes made by txn1? Thanks
Re: initdb's -c option behaves wrong way?
On 2024-Jan-16, Daniel Gustafsson wrote: > > On 28 Sep 2023, at 09:49, Kyotaro Horiguchi wrote: > > > I noticed that -c option of initdb behaves in an unexpected > > manner. Identical variable names with variations in letter casing are > > treated as distinct variables. > > > > $ initdb -cwork_mem=100 -cWORK_MEM=1000 -cWork_mem=2000 > > > The original intention was apparently to overwrite the existing > > line. Furthermore, I surmise that preserving the original letter > > casing is preferable. > > Circling back to an old thread, I agree that this seems odd and the original > thread [0] makes no mention of it being intentional. Hmm, how about raising an error if multiple options are given targeting the same GUC? -- Álvaro Herrera — Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Add pgindent test to check if codebase is correctly formatted
On Wed Jan 17, 2024 at 3:50 AM CST, Alvaro Herrera wrote: Hmm, should this also install typedefs.list and pgindent.man? What about the tooling to reformat Perl code? Good point about pgindent.man. It would definitely be good to install alongside pgindent and pg_bsd_indent. I don't know if we need to install the typedefs.list file. I think it would just be good enough to also install the find_typedefs script. But it needs some fixing up first[0]. Extension authors can then just generate their own typedefs.list that will include the typedefs of the extension and the typedefs of the postgres types they use. At least, that is what we have found works at Neon. I cannot vouch for extension authors writing Perl but I think it could make sense to install the src/test/perl tree, so extension authors could more easily write tests for their extensions in Perl. But we could install the perltidy file and whatever else too. I keep my Perl writing to a minimum, so I am not the best person to vouch for these use cases. [0]: https://www.postgresql.org/message-id/aaa59ef5-dce8-7369-5cae-487727664127%40dunslane.net -- Tristan Partin Neon (https://neon.tech)
Re: Fix a possible socket leak at Windows (src/backend/port/win32/socket.c)
On Wed, Jan 17, 2024 at 09:54, Daniel Gustafsson wrote: > > On 17 Jan 2024, at 07:26, Michael Paquier wrote: > > On Tue, Jan 16, 2024 at 05:25:39PM -0300, Ranier Vilela wrote: > > >> Do you have plans or should I register for a commitfest? > > > > Daniel has stated that he would take care of it, so why not letting > > him a few days? I don't think that a CF entry is necessary. > > It isn't, I've now committed it backpatched down to 12. > Thanks for the commit, Daniel. Best regards, Ranier Vilela
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Wed, 17 Jan 2024 at 14:39, Robert Haas wrote: > I think it's hard to say for sure what API is going to work well here, > because we just don't have much experience with this. Agreed, but I strongly believe PQunsupportedProtocolExtensions() is useful regardless of the API choice. > I also think that the reason why > the API you're proposing here looks good in this case is because libpq > itself doesn't really need to do anything differently for these > parameters. It doesn't actually really change anything about the > protocol; it only nails down the server behavior in a way that can't > be changed. Another current request is to have a way to have certain > data types always be sent in binary format, specified by OID. Do we > want that to be written as PQsetParameter("always_binary_format", > "123,456,789,101112") or do we want it to maybe look more like > PQalwaysBinaryFormat(int count, Oid *stuff)? Or, another example, say > we want to set client_to_server_compression_method=lz4. I think from libpq's perspective there are two categories of protocol extension parameters: 1. parameters that change behaviour in a way that does not matter to libpq 2. parameters that change in such a way that libpq needs to change its behaviour too (by parsing or sending messages differently than it normally would). _pq_.protocol_roles, _pq_.report_parameters, and (afaict) even _pq_.always_binary_format would fall into category 1. But _pq_.client_to_server_compression_method would fall into category 2, because libpq should start to compress the messages that it is sending. I think if you look at it like that, then using PQsetParameter for parameters in category 1 makes sense. And indeed you'd likely want a dedicated API for each parameter in category 2, and probably have PQsetParameter error for these parameters. In any case it seems like something that can be decided on a case by case basis. 
However, to make this future proof, I think it might be best if PQsetParameter would error for protocol extension parameters that it does not know about. > Also, I never intended for _pq_ to become a user-visible namespace > that people would have to care about I agree that forcing Postgres users to learn about this prefix is probably unwanted. But requiring client authors to learn about the prefix seems acceptable to me.
Re: Add system identifier to backup manifest
On Wed, Jan 17, 2024 at 6:31 AM Amul Sul wrote: > With the attached patch, the backup manifest will have a new key item as > "System-Identifier" 64-bit integer whose value is derived from pg_control > while > generating it, and the manifest version bumps to 2. > > This helps to identify the correct database server and/or backup for the > subsequent backup operations. pg_verifybackup validates the manifest system > identifier against the backup control file and fails if they don’t match. > Similarly, pg_basebackup increment backup will fail if the manifest system > identifier does not match with the server system identifier. The > pg_combinebackup is already a bit smarter -- checks the system identifier from > the pg_control of all the backups, with this patch the manifest system > identifier also validated. Thanks for working on this. Without this, I think what happens is that you can potentially take an incremental backup from the "wrong" server, if the states of the systems are such that all of the other sanity checks pass. When you run pg_combinebackup, it'll discover the problem and tell you, but you ideally want to discover such errors at backup time rather than at restore time. This addresses that. And, overall, I think it's a pretty good patch. But I nonetheless have a bunch of comments. - The associated value is always the integer 1. + The associated value is the integer, either 1 or 2. is an integer. Beginning in PostgreSQL 17, it is 2; in older versions, it is 1. + context.identity_cb = manifest_process_identity; I'm not really on board with calling the system identifier "identity" throughout the patch. I think it should just say system_identifier. If we were going to abbreviate, I'd prefer something like "sysident" that looks like it's derived from "system identifier" rather than "identity" which is a different word altogether. But I don't think we should abbreviate unless doing so creates *ridiculously* long identifier names. 
+static void +manifest_process_identity(JsonManifestParseContext *context, + int manifest_version, + uint64 manifest_system_identifier) +{ + uint64 system_identifier; + + /* Manifest system identifier available in version 2 or later */ + if (manifest_version == 1) + return; I think you've got the wrong idea here. I think this function would only get called if System-Identifier is present in the manifest, so if it's a v1 manifest, this would never get called, so this if-statement would not ever do anything useful. I think what you should do is (1) if the client supplies a v1 manifest, reject it, because surely that's from an older server version that doesn't support incremental backup; but do that when the version is parsed rather than here; and (2) also detect and reject the case when it's supposedly a v2 manifest but this is absent. (1) should really be done when the version number is parsed, so I suspect you may need to add manifest_version_cb. +static void +combinebackup_identity_cb(JsonManifestParseContext *context, + int manifest_version, + uint64 manifest_system_identifier) +{ + parser_context *private_context = context->private_data; + uint64 system_identifier = private_context->system_identifier; + + /* Manifest system identifier available in version 2 or later */ + if (manifest_version == 1) + return; Very similar to the above case. Just reject a version 1 manifest as soon as we see the version number. In this function, set a flag indicating we saw the system identifier; if at the end of parsing that flag is not set, kaboom. - parse_manifest_file(manifest_path, , _wal_range); + parse_manifest_file(manifest_path, , _wal_range, + context.backup_directory); Don't do this! parse_manifest_file() should just record everything found in the manifest in the context object. Syntax validation should happen while parsing the manifest (e.g. "CAT/DOG" is not a valid LSN and we should reject that at this stage) but semantic validation should happen later (e.g. 
"0/0" can't be the correct backup end LSN but we don't figure that out while parsing, but rather later). I think you should actually move validation of the system identifier to the point where the directory walk encounters the control file (and update the docs and tests to match that decision). Imagine if you wanted to validate a tar-format backup; then you wouldn't have random access to the directory. You'd see the manifest file first, and then all the files in a random order, with one chance to look at each one. (This is, in fact, a feature I think we should implement.) - if (strcmp(token, "1") != 0) + parse->manifest_version = atoi(token); + if (parse->manifest_version != 1 && parse->manifest_version != 2) json_manifest_parse_failure(parse->context, "unexpected manifest version"); Please either (a) don't do a string-to-integer conversion and just strcmp() twice or (b) use strtol so that you can check that it succeeded. I don't want to accept manifest version 1a as 1. +/* + *
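[Editorial illustration; a hypothetical helper, not the patch's actual code.] The strict parsing Robert asks for can be sketched with strtol(), which — unlike atoi() — lets the caller detect trailing garbage such as "1a", empty input, and overflow:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a manifest version token strictly.  Returns 1 on success and
 * stores the value; returns 0 if the token is empty, has trailing
 * characters (e.g. "1a"), or is out of range. */
static int
parse_manifest_version_strict(const char *token, int *version)
{
	char	   *end;
	long		v;

	errno = 0;
	v = strtol(token, &end, 10);
	if (errno != 0 || end == token || *end != '\0' || v < 1 || v > INT_MAX)
		return 0;
	*version = (int) v;
	return 1;
}
```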
Re: Change GUC hashtable to use simplehash?
On 17/01/2024 09:15, John Naylor wrote: /* * hashfn_unstable.h * * Building blocks for creating fast inlineable hash functions. The * unstable designation is in contrast to hashfn.h, which cannot break * compatibility because hashes can be written to disk and so must produce * the same hashes between versions. * * The functions in this file are not guaranteed to be stable between * versions, and may differ by hardware platform. These paragraphs sound a bit awkward. It kind of buries the lede, the "these functions are not guaranteed to be stable" part, to the bottom. Maybe something like: " Building blocks for creating fast inlineable hash functions. The functions in this file are not guaranteed to be stable between versions, and may differ by hardware platform. Hence they must not be used in indexes or other on-disk structures. See hashfn.h if you need stability. " typo: licencse Other than that, LGTM. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Build the docs if there are changes in docs and don't run other tasks if the changes are only in docs
> On 14 Dec 2023, at 14:40, Nazir Bilal Yavuz wrote: > On Fri, 6 Oct 2023 at 17:07, Tom Lane wrote: >> Not sure what that tells us about the value of changing the CI >> logic, but it does seem like it could be worth the one-liner >> change needed to teach buildfarm animals to ignore READMEs. > > I agree that it could be worth implementing this logic on buildfarm animals. > > In case we want to implement the same logic on the CI, I added a new > version of the patch; it skips CI completely if the changes are only > in the README files. I think it makes sense to avoid wasting CI cycles on commits only changing README files since we know it will be a no-op. A README documentation patch going through N revisions will incur at least N full CI runs which are resources we can spend on other things. So +1 for doing this both in CI and in the buildfarm. -- Daniel Gustafsson
Re: [PATCH] Compression dictionaries for JSONB
Hi Aleksander, there was a quite straightforward answer regarding Pluggable TOAST in another thread - the Pluggable TOAST feature is not desired by the community, and advanced TOAST mechanics would be accepted as part of the extended functionality of problematic datatypes, on a par with in and out functions. So what I am actually doing now is re-writing the JSONb TOAST improvements to be callable as separate functions without the Pluggable TOAST API. This takes some time because there is a large and complex code base left by Nikita Glukhov, who has lost interest in this work for various reasons. I, personally, think that these two features could benefit from each other, but they could be adapted to each other after I introduce the JSONb Toaster in the v17 master branch. If you don't mind, please check the thread on extending the TOAST pointer - it is important for improving TOAST mechanics. On Wed, Jan 17, 2024 at 5:27 PM Aleksander Alekseev < aleksan...@timescale.com> wrote: > Hi again, > > > Yes it does for a while now. Until we reach any agreement regarding > > questions (1) and (2) personally I don't see the point in submitting > > rebased patches. We can continue the discussion but mark the CF entry > > as RwF for now it will be helpful. > > Sorry, what I actually meant were the following questions: > > """ > Several things occured to me: > > - Does anyone believe that va_tag should be part of the utf8-like > bitmask in order to save a byte or two? > > - The described approach means that compression dictionaries are not > going to be used when data is compressed in-place (i.e. within a > tuple), since no TOAST pointer is involved in this case. Also we will > be unable to add additional compression algorithms here. Does anyone > have problems with this? Should we use the reserved compression > algorithm id instead as a marker of an extended TOAST? > > - It would be nice to decompose the feature in several independent > patches, e.g.
modify TOAST first, then add compression dictionaries > without automatic update of the dictionaries, then add the automatic > update. I find it difficult to imagine however how to modify TOAST > pointers and test the code properly without a dependency on a larger > feature. Could anyone think of a trivial test case for extendable > TOAST? Maybe something we could add to src/test/modules similarly to > how we test SLRU, background workers, etc. > """ > > Since there was not much activity since then (for 3 months) I don't > really see how to process further. > > -- > Best regards, > Aleksander Alekseev > -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: MERGE ... RETURNING
On Sat, Nov 18, 2023 at 8:55 PM Dean Rasheed wrote:
> >
> > The v13 patch still applies on top of this, so I won't re-post it.
>

Hi. Minor issues based on v13.

+
+MERGING ( property )
+
+ The following are valid property values specifying what to return:
+
+
+
+ ACTION
+
+
+ The merge action command executed for the current row
+ ('INSERT', 'UPDATE', or
+ 'DELETE').
+
+
+

do we change to property? Maybe the main para should be two sentences, like: "The merge action command executed for the current row. Possible values are: 'INSERT', 'UPDATE', 'DELETE'."

 static Node *
+transformMergingFunc(ParseState *pstate, MergingFunc *f)
+{
+	/*
+	 * Check that we're in the RETURNING list of a MERGE command.
+	 */
+	if (pstate->p_expr_kind != EXPR_KIND_MERGE_RETURNING)
+	{
+		ParseState *parent_pstate = pstate->parentParseState;
+
+		while (parent_pstate &&
+			   parent_pstate->p_expr_kind != EXPR_KIND_MERGE_RETURNING)
+			parent_pstate = parent_pstate->parentParseState;
+
+		if (!parent_pstate ||
+			parent_pstate->p_expr_kind != EXPR_KIND_MERGE_RETURNING)
+			ereport(ERROR,
+					errcode(ERRCODE_WRONG_OBJECT_TYPE),
+					errmsg("MERGING() can only be used in the RETURNING list of a MERGE command"),
+					parser_errposition(pstate, f->location));
+	}
+
+	return (Node *) f;
+}
+

The object is correct, but not in the right place. Maybe we should change errcode(ERRCODE_WRONG_OBJECT_TYPE) to errcode(ERRCODE_INVALID_OBJECT_DEFINITION).

Also, do we need to add some comments explaining why we return it as is when it's EXPR_KIND_MERGE_RETURNING? (My understanding is that if the keywords don't match, it will fail at gram.y with a syntax error; otherwise MERGING will, based on the keywords, make a MergingFunc node and assign mfop, mftype and location to it.)

in src/backend/executor/functions.c:

	/*
	 * Break from loop if we didn't shut down (implying we got a
	 * lazily-evaluated row). Otherwise we'll press on till the whole
	 * function is done, relying on the tuplestore to keep hold of the
	 * data to eventually be returned. This is necessary since an
	 * INSERT/UPDATE/DELETE RETURNING that sets the result might be
	 * followed by additional rule-inserted commands, and we want to
	 * finish doing all those commands before we return anything.
	 */

Does the above comment need to change to INSERT/UPDATE/DELETE/MERGE?

in src/backend/nodes/nodeFuncs.c:

		case T_UpdateStmt:
			{
				UpdateStmt *stmt = (UpdateStmt *) node;

				if (WALK(stmt->relation))
					return true;
				if (WALK(stmt->targetList))
					return true;
				if (WALK(stmt->whereClause))
					return true;
				if (WALK(stmt->fromClause))
					return true;
				if (WALK(stmt->returningList))
					return true;
				if (WALK(stmt->withClause))
					return true;
			}
			break;
		case T_MergeStmt:
			{
				MergeStmt  *stmt = (MergeStmt *) node;

				if (WALK(stmt->relation))
					return true;
				if (WALK(stmt->sourceRelation))
					return true;
				if (WALK(stmt->joinCondition))
					return true;
				if (WALK(stmt->mergeWhenClauses))
					return true;
				if (WALK(stmt->withClause))
					return true;
			}
			break;

You add "returningList" to MergeStmt. Do you need to do the following, similar to UpdateStmt? Even though it's so abstract, I have no idea what's going on.

`
if (WALK(stmt->returningList))
    return true;
`
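For context, a hypothetical invocation of the proposed syntax could look as follows. This is a sketch based on the v13 patch's documentation fragment above, not a committed PostgreSQL feature; the table and column names are invented:

```sql
-- Hypothetical example of the proposed MERGE ... RETURNING syntax (v13 patch).
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET val = s.val
WHEN NOT MATCHED THEN
    INSERT (id, val) VALUES (s.id, s.val)
RETURNING MERGING(ACTION), t.*;
-- MERGING(ACTION) would return 'UPDATE' for matched rows and
-- 'INSERT' for newly inserted ones.
```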
Re: [PATCH] Compression dictionaries for JSONB
Hi again,

> Yes it does for a while now. Until we reach any agreement regarding
> questions (1) and (2) personally I don't see the point in submitting
> rebased patches. We can continue the discussion, but marking the CF
> entry as RwF for now would be helpful.

Sorry, what I actually meant were the following questions:

"""
Several things occurred to me:

- Does anyone believe that va_tag should be part of the utf8-like
bitmask in order to save a byte or two?

- The described approach means that compression dictionaries are not
going to be used when data is compressed in-place (i.e. within a
tuple), since no TOAST pointer is involved in this case. Also we will
be unable to add additional compression algorithms here. Does anyone
have problems with this? Should we use the reserved compression
algorithm id instead as a marker of an extended TOAST?

- It would be nice to decompose the feature into several independent
patches, e.g. modify TOAST first, then add compression dictionaries
without automatic update of the dictionaries, then add the automatic
update. I find it difficult to imagine, however, how to modify TOAST
pointers and test the code properly without a dependency on a larger
feature. Could anyone think of a trivial test case for extendable
TOAST? Maybe something we could add to src/test/modules similarly to
how we test SLRU, background workers, etc.
"""

Since there has not been much activity since then (for 3 months), I don't
really see how to proceed further.

--
Best regards,
Aleksander Alekseev
Re: [PATCH] Compression dictionaries for JSONB
Hi Shubham,

> > > 8272749e added a few more arguments to CastCreate(). Here is the rebased
> > > patch.
> >
> > After merging afbfc029 [1] the patch needed a rebase. PFA v10.
> >
> > The patch is still in a PoC state and this is exactly why comments and
> > suggestions from the community are most welcome! Particularly I would
> > like to know:
> >
> > 1. Would you call it a wanted feature considering the existence of
> > the Pluggable TOASTer patchset which (besides other things) tries to
> > introduce type-aware TOASTers for EXTERNAL attributes? I know what
> > Simon's [2] and Nikita's latest answers were, and I know my personal
> > opinion on this [3][4], but I would like to hear from the rest of the
> > community.
> >
> > 2. How should we make sure a dictionary will not consume all the
> > available memory? Limiting the amount of dictionary entries to pow(2,
> > 16) and having dictionary versions seems to work OK for ZSON. However
> > it was pointed out that this may be an unwanted limitation for the
> > in-core implementation.
> >
> > [1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c727f511;hp=afbfc02983f86c4d71825efa6befd547fe81a926
> > [2]: https://www.postgresql.org/message-id/CANbhV-HpCF852WcZuU0wyh1jMU4p6XLbV6rCRkZpnpeKQ9OenQ%40mail.gmail.com
> > [3]: https://www.postgresql.org/message-id/CAJ7c6TN-N3%3DPSykmOjmW1EAf9YyyHFDHEznX-5VORsWUvVN-5w%40mail.gmail.com
> > [4]: https://www.postgresql.org/message-id/CAJ7c6TO2XTTk3cu5w6ePHfhYQkoNpw7u1jeqHf%3DGwn%2BoWci8eA%40mail.gmail.com
>
> I tried to apply the patch but it is failing at the Head. It is giving
> the following error:

Yes, it has been failing for a while now. Until we reach any agreement
regarding questions (1) and (2), I personally don't see the point in
submitting rebased patches. We can continue the discussion, but marking
the CF entry as RwF for now would be helpful.

--
Best regards,
Aleksander Alekseev
Re: Add system identifier to backup manifest
On Wed, Jan 17, 2024 at 6:45 AM Alvaro Herrera wrote: > Hmm, okay, but what if I take a full backup from a primary server and > later I want an incremental from a standby, or the other way around? > Will this prevent me from using such a combination? The system identifier had BETTER match in such cases. If it doesn't, somebody's run pg_resetwal on your standby since it was created... and in that case, no incremental backup for you! -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Tue, Jan 16, 2024 at 9:21 PM Jelte Fennema-Nio wrote:
> It's understandable you're worried about breaking clients, but afaict
> **you are actually misunderstanding the situation**. I totally agree
> that we cannot bump the protocol version without also merging 0002
> (that's why the version bump is in patch 0005 not patch 0001). But
> 0002 does **not** need to be backported to all supported branches. The
> only libpq versions that can ever receive a NegotiateVersionProtocol
> are ones that request 3.1, and since none of the pre-17 libpq versions
> ever request 3.1 there's no need for them to be able to handle
> NegotiateVersionProtocol.

OK, yeah, fuzzy thinking on my part.

> I think we have a very different idea of what is a desirable API for
> client authors that use libpq to build their clients. libpq's API is
> pretty low level, so I think it makes total sense for client authors
> to know what protocol extension parameters exist. It seems totally
> acceptable to me to have them call PQsetParameter directly:
>
> PQsetParameter("_pq_.protocol_roles", "true")
> PQsetParameter("_pq_.report_parameters", "role,search_path")
>
> Otherwise we need to introduce **two** new functions for every
> protocol extension that is going to be introduced, a blocking and a
> non-blocking one (e.g. PQsetWireProtocolRole() and
> PQsendSetWireProtocolRole()). And that seems like unnecessary API
> bloat to me.

I think it's hard to say for sure what API is going to work well here,
because we just don't have much experience with this. I do agree that we
want to avoid API bloat. However, I also think that the reason why the
API you're proposing here looks good in this case is that libpq itself
doesn't really need to do anything differently for these parameters. It
doesn't actually change anything about the protocol; it only nails down
the server behavior in a way that can't be changed.

Another current request is to have a way to have certain data types
always be sent in binary format, specified by OID. Do we want that to be
written as PQsetParameter("always_binary_format", "123,456,789,101112"),
or do we want it to maybe look more like PQalwaysBinaryFormat(int count,
Oid *stuff)?

Or, another example, say we want to set
client_to_server_compression_method=lz4. It's very difficult for me to
believe that libpq should be doing strcmp() against the proposed
protocol parameter settings and thus realizing that it needs to start
compressing ... especially since there might be compression parameters
(like level or degree of parallelism) that the client needs and the
server doesn't.

Also, I never intended for _pq_ to become a user-visible namespace that
people would have to care about; that was just a convention that I
adopted to differentiate setting a protocol parameter from setting a
GUC. I think it's a mistake to make that string something users have to
worry about directly. It wouldn't begin and end with an underscore if it
were intended to be user-visible.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Fix a possible socket leak at Windows (src/backend/port/win32/socket.c)
> On 17 Jan 2024, at 07:26, Michael Paquier wrote: > On Tue, Jan 16, 2024 at 05:25:39PM -0300, Ranier Vilela wrote: >> Do you have plans or should I register for a commitfest? > > Daniel has stated that he would take care of it, so why not letting > him a few days? I don't think that a CF entry is necessary. It isn't, I've now committed it backpatched down to 12. -- Daniel Gustafsson
Re: generate syscache info automatically
On 10.01.24 09:00, John Naylor wrote:
>> I suppose a sensible alternative could be to leave the
>> DECLARE_..._INDEX... alone and make a separate statement, like
>>
>> MAKE_SYSCACHE(pg_type_oid_index, TYPEOID, 64);
>>
>> That's at least visually easier, because some of those DECLARE_...
>> lines are getting pretty long.
>
> Probably a good idea, and below I mention a third possible macro.

I updated the patch to use this style (but I swapped the first two
arguments from my example, so that the thing being created is named
first). I also changed the names of the output files a bit to make them
less confusing. (I initially had some files named .c.h, which was weird,
but apparently necessary to avoid confusing the build system. But it's
all clearer now.)

Other than bugs and perhaps style opinions, I think the first patch is
pretty good now.

I haven't changed much in the second patch, other than updating it for
the code changes made in the first patch. It's still very much
WIP/preview at the moment.

From f77c381b9df30824e469f72f4e7754d03cb97d78 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Wed, 17 Jan 2024 12:33:42 +0100
Subject: [PATCH v5 1/2] Generate syscache info from catalog files

Add a new genbki macro MAKE_SYSCACHE that specifies the syscache ID
macro, the underlying index, and the number of buckets. From that, we
can generate the existing tables in syscache.h and syscache.c via
genbki.pl.
Discussion: https://www.postgresql.org/message-id/flat/75ae5875-3abc-dafc-8aec-73247ed41...@eisentraut.org
---
 src/backend/catalog/.gitignore                |   2 +
 src/backend/catalog/Catalog.pm                |  10 +
 src/backend/catalog/Makefile                  |   2 +-
 src/backend/catalog/genbki.pl                 |  72 ++
 src/backend/utils/cache/syscache.c            | 632 +-
 src/include/catalog/.gitignore                |   2 +
 src/include/catalog/genbki.h                  |  10 +-
 src/include/catalog/meson.build               |  18 +-
 src/include/catalog/pg_aggregate.h            |   2 +
 src/include/catalog/pg_am.h                   |   3 +
 src/include/catalog/pg_amop.h                 |   3 +
 src/include/catalog/pg_amproc.h               |   2 +
 src/include/catalog/pg_attribute.h            |   3 +
 src/include/catalog/pg_auth_members.h         |   3 +
 src/include/catalog/pg_authid.h               |   3 +
 src/include/catalog/pg_cast.h                 |   2 +
 src/include/catalog/pg_class.h                |   3 +
 src/include/catalog/pg_collation.h            |   3 +
 src/include/catalog/pg_constraint.h           |   2 +
 src/include/catalog/pg_conversion.h           |   4 +
 src/include/catalog/pg_database.h             |   2 +
 src/include/catalog/pg_default_acl.h          |   2 +
 src/include/catalog/pg_enum.h                 |   3 +
 src/include/catalog/pg_event_trigger.h        |   3 +
 src/include/catalog/pg_foreign_data_wrapper.h |   3 +
 src/include/catalog/pg_foreign_server.h       |   3 +
 src/include/catalog/pg_foreign_table.h        |   2 +
 src/include/catalog/pg_index.h                |   2 +
 src/include/catalog/pg_language.h             |   3 +
 src/include/catalog/pg_namespace.h            |   3 +
 src/include/catalog/pg_opclass.h              |   3 +
 src/include/catalog/pg_operator.h             |   2 +
 src/include/catalog/pg_opfamily.h             |   3 +
 src/include/catalog/pg_parameter_acl.h        |   2 +
 src/include/catalog/pg_partitioned_table.h    |   2 +
 src/include/catalog/pg_proc.h                 |   3 +
 src/include/catalog/pg_publication.h          |   3 +
 .../catalog/pg_publication_namespace.h        |   3 +
 src/include/catalog/pg_publication_rel.h      |   3 +
 src/include/catalog/pg_range.h                |   3 +
 src/include/catalog/pg_replication_origin.h   |   3 +
 src/include/catalog/pg_rewrite.h              |   2 +
 src/include/catalog/pg_sequence.h             |   2 +
 src/include/catalog/pg_statistic.h            |   2 +
 src/include/catalog/pg_statistic_ext.h        |   3 +
 src/include/catalog/pg_statistic_ext_data.h   |   1 +
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/catalog/pg_subscription_rel.h     |   2 +
 src/include/catalog/pg_tablespace.h           |   2 +
 src/include/catalog/pg_transform.h            |   3 +
 src/include/catalog/pg_ts_config.h            |   3 +
 src/include/catalog/pg_ts_config_map.h        |   2 +
 src/include/catalog/pg_ts_dict.h              |   3 +
 src/include/catalog/pg_ts_parser.h            |   3 +
 src/include/catalog/pg_ts_template.h          |   3 +
 src/include/catalog/pg_type.h                 |   3 +
 src/include/catalog/pg_user_mapping.h         |   3 +
 src/include/utils/syscache.h                  |  98 +--
 src/tools/pginclude/cpluspluscheck            |   5 +
 src/tools/pginclude/headerscheck              |   5 +
 60 files changed, 266 insertions(+), 719 deletions(-)

diff --git a/src/backend/catalog/.gitignore b/src/backend/catalog/.gitignore
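For illustration, the swapped-argument form described in the message above would read roughly as follows. This is a sketch of the idea as discussed, not necessarily the committed spelling:

```c
/*
 * Sketch only: the syscache ID (the thing being created) comes first,
 * then the underlying index, then the number of hash buckets.
 */
MAKE_SYSCACHE(TYPEOID, pg_type_oid_index, 64);
```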
Re: Show WAL write and fsync stats in pg_stat_io
Hi,

On Mon, 15 Jan 2024 at 09:27, Michael Paquier wrote:
>
> On Fri, Jan 12, 2024 at 04:23:26PM +0300, Nazir Bilal Yavuz wrote:
> > On Thu, 11 Jan 2024 at 17:28, Melanie Plageman
> > wrote:
> >> Even if we made a separate view for WAL I/O stats, we would still have
> >> this issue of variable sized I/O vs block sized I/O and would probably
> >> end up solving it with separate columns for the number of bytes and
> >> number of operations.
> >
> > Yes, I think it is more about flexibility and not changing the already
> > published pg_stat_io view.
>
> I don't know. Adding more columns or changing op_bytes with an extra
> mode that reflects on what the other columns mean is kind of the same
> thing to me: we want pg_stat_io to report more modes so as all I/O can
> be evaluated from a single view, but the complication is now that
> everything is tied to BLCKSZ.
>
> IMHO, perhaps we'd better put this patch aside until we are absolutely
> *sure* of what we want to achieve when it comes to WAL, and I am
> afraid that this cannot happen until we're happy with the way we
> handle WAL reads *and* writes, including WAL receiver or anything that
> has the idea of pulling its own page callback with
> XLogReaderAllocate() in the backend. Well, writes should be
> relatively "easy" as things happen with XLOG_BLCKSZ, mainly, but
> reads are the unknown part.
>
> That also seems furiously related to the work happening with async I/O
> or the fact that we may want to have in the view a separate meaning
> for cached pages or pages read directly from disk. The worst thing
> that we would do is rush something into the tree and then have to deal
> with the aftermath of what we'd need to deal with in terms of
> compatibility depending on the state of the other I/O related work
> when the new view is released. That would not be fun for the users
> and any hackers who would have to deal with that (aka mainly me if I
> were to commit something), because pg_stat_io could mean something in
> version N, still mean something entirely different in version N+1.

I agree with your points. While the other I/O-related work is happening,
we can discuss what we should do in the variable op_bytes cases. Also,
this is happening only for WAL right now, but if we try to extend
pg_stat_io in the future, that problem will possibly arise again. So it
could be good to come up with a general solution, not only for WAL.

--
Regards,
Nazir Bilal Yavuz
Microsoft
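As a toy illustration of the accounting problem under discussion (this is not pg_stat_io's actual implementation; all names here are invented): with a fixed op_bytes, the byte total is derivable as op_bytes × op_count, which breaks down once operations have variable sizes, at which point a separate byte counter has to be accumulated per operation.

```python
# Toy sketch of the two accounting schemes; not PostgreSQL code.

BLCKSZ = 8192  # fixed block size: bytes are derivable from the op count


class FixedBlockStats:
    """Works for block-oriented I/O: bytes == ops * op_bytes."""

    def __init__(self, op_bytes=BLCKSZ):
        self.op_bytes = op_bytes
        self.ops = 0

    def record(self, nblocks=1):
        self.ops += nblocks

    @property
    def bytes(self):
        # Bytes are implied by the op count; no separate counter needed.
        return self.ops * self.op_bytes


class VariableSizeStats:
    """Needed for WAL-style variable-sized writes: track bytes explicitly."""

    def __init__(self):
        self.ops = 0
        self.bytes = 0

    def record(self, nbytes):
        self.ops += 1
        self.bytes += nbytes  # cannot be derived from ops alone


fixed = FixedBlockStats()
fixed.record(3)                # three 8 kB block reads
assert fixed.bytes == 3 * BLCKSZ

wal = VariableSizeStats()
for n in (200, 8192, 50):      # WAL writes are not all block-sized
    wal.record(n)
assert wal.ops == 3 and wal.bytes == 8442
```

If pg_stat_io were to report WAL this way, the variable-size case would need extra columns (or a changed meaning for op_bytes), which is exactly the compatibility concern raised above.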
Re: Catalog domain not-null constraints
On Wed, 29 Nov 2023 at 01:14, Peter Eisentraut wrote:
>
> On 23.11.23 14:13, Aleksander Alekseev wrote:
> > =# create domain connotnull1 integer;
> > =# create domain connotnull2 integer;
> > =# alter domain connotnull1 add not null value;
> > =# alter domain connotnull2 set not null;
> > =# \dD
> > ERROR: unexpected null value in cached tuple for catalog
> > pg_constraint column conkey
>
> Yeah, for domain not-null constraints pg_constraint.conkey is indeed
> null. Should we put something in there?
>
> Attached is an updated patch that avoids the error by taking a separate
> code path for domain constraints in ruleutils.c. But maybe there is
> another way to arrange this.

One of the tests has failed in CFBot at [1] with:

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/domain.out /tmp/cirrus-ci-build/src/test/regress/results/domain.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/domain.out 2024-01-14 15:40:01.793434601 +
+++ /tmp/cirrus-ci-build/src/test/regress/results/domain.out 2024-01-14 15:42:23.013332625 +
@@ -1271,11 +1271,4 @@
   FROM information_schema.domain_constraints
   WHERE domain_name IN ('con', 'dom', 'pos_int', 'things'))
 ORDER BY constraint_name;
- constraint_catalog | constraint_schema | constraint_name | check_clause
-+---+--+---
- regression | public | con_check | (VALUE > 0)
- regression | public | meow | (VALUE < 11)
- regression | public | pos_int_check | (VALUE > 0)
- regression | public | pos_int_not_null | VALUE IS NOT NULL
-(4 rows)
-
+ERROR: could not open relation with OID 36379

[1] - https://cirrus-ci.com/task/4536440638406656
[2] - https://api.cirrus-ci.com/v1/artifact/task/4536440638406656/log/src/test/regress/regression.diffs

Regards,
Vignesh
Re: Add system identifier to backup manifest
On Wed, Jan 17, 2024 at 5:15 PM Alvaro Herrera wrote:
> On 2024-Jan-17, Amul Sul wrote:
>
> > This helps to identify the correct database server and/or backup for the
> > subsequent backup operations. pg_verifybackup validates the manifest
> > system identifier against the backup control file and fails if they
> > don’t match. Similarly, pg_basebackup incremental backup will fail if
> > the manifest system identifier does not match with the server system
> > identifier. The pg_combinebackup is already a bit smarter -- it checks
> > the system identifier from the pg_control of all the backups; with this
> > patch the manifest system identifier is also validated.
>
> Hmm, okay, but what if I take a full backup from a primary server and
> later I want an incremental from a standby, or the other way around?
> Will this prevent me from using such a combination?

Yes, that worked for me where the system identifier was the same on the
master as well as the standby.

Regards,
Amul
Re: CI and test improvements
On Wed, 12 Jul 2023 at 11:38, Michael Paquier wrote:
>
> On Wed, Jul 12, 2023 at 12:56:17AM -0500, Justin Pryzby wrote:
> > And, since 681d9e462:
> >
> > security is missing from contrib/seg/meson.build
>
> Ugh. Good catch!

Are we planning to do anything more in this thread? It has been idle for
more than 7 months. If nothing more is planned for it, I'm planning to
close the entry in this commitfest.

Regards,
Vignesh
Re: Add system identifier to backup manifest
On 2024-Jan-17, Amul Sul wrote:

> This helps to identify the correct database server and/or backup for the
> subsequent backup operations. pg_verifybackup validates the manifest
> system identifier against the backup control file and fails if they
> don’t match. Similarly, pg_basebackup incremental backup will fail if
> the manifest system identifier does not match with the server system
> identifier. The pg_combinebackup is already a bit smarter -- it checks
> the system identifier from the pg_control of all the backups; with this
> patch the manifest system identifier is also validated.

Hmm, okay, but what if I take a full backup from a primary server and
later I want an incremental from a standby, or the other way around?
Will this prevent me from using such a combination?

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"You're _really_ hosed if the person doing the hiring doesn't understand
relational systems: you end up with a whole raft of programmers, none of
whom has had a Date with the clue stick." (Andrew Sullivan)
https://postgr.es/m/20050809113420.gd2...@phlogiston.dyndns.org
Add system identifier to backup manifest
Hi All,

With the attached patch, the backup manifest will have a new key item,
"System-Identifier", a 64-bit integer whose value is derived from
pg_control while generating it, and the manifest version bumps to 2.

This helps to identify the correct database server and/or backup for the
subsequent backup operations. pg_verifybackup validates the manifest
system identifier against the backup control file and fails if they
don’t match. Similarly, pg_basebackup incremental backup will fail if
the manifest system identifier does not match with the server system
identifier. The pg_combinebackup is already a bit smarter -- it checks
the system identifier from the pg_control of all the backups; with this
patch the manifest system identifier is also validated.

For backward compatibility, the manifest system identifier validation
will be skipped for version 1.

--
Regards,
Amul Sul
EDB: http://www.enterprisedb.com

From 03d21efd7b03d17a55fc1dd1159e0838777f548a Mon Sep 17 00:00:00 2001
From: Amul Sul
Date: Wed, 17 Jan 2024 16:12:19 +0530
Subject: [PATCH v1] Add system identifier to backup manifest

The backup manifest will have a new item, "System-Identifier", whose
value comes from "ControlFileData.system_identifier" when the manifest
is generated. This helps identify the correct database server and/or
backup while taking subsequent backups.
---
 doc/src/sgml/backup-manifest.sgml             | 13 +++-
 doc/src/sgml/ref/pg_verifybackup.sgml         |  3 +-
 src/backend/backup/backup_manifest.c          |  9 ++-
 src/backend/backup/basebackup.c               |  3 +-
 src/backend/backup/basebackup_incremental.c   | 29
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  | 20 +
 src/bin/pg_combinebackup/load_manifest.c      | 73 ---
 src/bin/pg_combinebackup/load_manifest.h      |  6 +-
 src/bin/pg_combinebackup/pg_combinebackup.c   | 16 ++--
 src/bin/pg_combinebackup/t/005_integrity.pl   | 14
 src/bin/pg_combinebackup/write_manifest.c     |  8 +-
 src/bin/pg_combinebackup/write_manifest.h     |  3 +-
 src/bin/pg_verifybackup/pg_verifybackup.c     | 59 ++-
 src/bin/pg_verifybackup/t/003_corruption.pl   | 16 +++-
 src/bin/pg_verifybackup/t/004_options.pl      |  2 +-
 src/bin/pg_verifybackup/t/005_bad_manifest.pl |  7 +-
 src/common/parse_manifest.c                   | 36 -
 src/include/backup/backup_manifest.h          |  3 +-
 src/include/common/parse_manifest.h           |  4 +
 19 files changed, 286 insertions(+), 38 deletions(-)

diff --git a/doc/src/sgml/backup-manifest.sgml b/doc/src/sgml/backup-manifest.sgml
index 771be1310a..d0fc20ed0f 100644
--- a/doc/src/sgml/backup-manifest.sgml
+++ b/doc/src/sgml/backup-manifest.sgml
@@ -37,7 +37,18 @@
 PostgreSQL-Backup-Manifest-Version

-      The associated value is always the integer 1.
+      The associated value is an integer, either 1 or 2.
+
+
+
+
+System-Identifier
+
+
+      The associated integer value is a unique system identifier to ensure
+      files match up with the installation that produced them. Available only
+      when PostgreSQL-Backup-Manifest-Version is 2 and later.

diff --git a/doc/src/sgml/ref/pg_verifybackup.sgml b/doc/src/sgml/ref/pg_verifybackup.sgml
index 36335e0a18..3051517d92 100644
--- a/doc/src/sgml/ref/pg_verifybackup.sgml
+++ b/doc/src/sgml/ref/pg_verifybackup.sgml
@@ -53,7 +53,8 @@
   Backup verification proceeds in four stages. First,
   pg_verifybackup reads the
   backup_manifest file. If that file
-  does not exist, cannot be read, is malformed, or fails verification
+  does not exist, cannot be read, is malformed, fails to match the system
+  identifier with pg_control of the backup directory or fails verification
   against its own internal checksum,
   pg_verifybackup will terminate with a fatal error.

diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index 2c34e59752..612ff3add2 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -56,7 +56,8 @@ IsManifestEnabled(backup_manifest_info *manifest)
 void
 InitializeBackupManifest(backup_manifest_info *manifest,
 						 backup_manifest_option want_manifest,
-						 pg_checksum_type manifest_checksum_type)
+						 pg_checksum_type manifest_checksum_type,
+						 uint64 system_identifier)
 {
 	memset(manifest, 0, sizeof(backup_manifest_info));
 	manifest->checksum_type = manifest_checksum_type;
@@ -79,8 +80,10 @@ InitializeBackupManifest(backup_manifest_info *manifest,
 	if (want_manifest != MANIFEST_OPTION_NO)
 		AppendToManifest(manifest,
-						 "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
-						 "\"Files\": [");
+						 "{ \"PostgreSQL-Backup-Manifest-Version\": 2,\n"
+						 "\"System-Identifier\": " UINT64_FORMAT ",\n"
+						 "\"Files\": [",
+						 system_identifier);
 }

 /*
diff --git a/src/backend/backup/basebackup.c
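To make the new manifest field concrete, here is a small Python sketch of the validation the patch describes (this is not PostgreSQL code, and the function name is invented): read the manifest, and if it is version 2, compare its System-Identifier against the cluster's identifier from pg_control, skipping the check for version-1 manifests for backward compatibility.

```python
import json


def check_manifest_system_id(manifest_text: str, control_system_id: int) -> None:
    """Raise ValueError on a mismatch; version-1 manifests skip the check."""
    manifest = json.loads(manifest_text)
    version = manifest["PostgreSQL-Backup-Manifest-Version"]
    if version == 1:
        return  # no System-Identifier field before version 2
    manifest_id = manifest["System-Identifier"]
    if manifest_id != control_system_id:
        raise ValueError(
            f"manifest system identifier is {manifest_id}, "
            f"but control file has {control_system_id}")


v2 = ('{ "PostgreSQL-Backup-Manifest-Version": 2,'
      ' "System-Identifier": 7324728800061958284, "Files": [] }')

check_manifest_system_id(v2, 7324728800061958284)  # matching identifier: ok
check_manifest_system_id(
    '{ "PostgreSQL-Backup-Manifest-Version": 1, "Files": [] }', 0)  # v1: skipped

try:
    check_manifest_system_id(v2, 42)  # wrong cluster: rejected
except ValueError:
    pass
else:
    raise AssertionError("mismatch should have been detected")
```

The 64-bit identifier value above is made up for illustration; in a real cluster it would come from pg_control.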
Re: [PATCH] Compression dictionaries for JSONB
On Wed, Jan 17, 2024 at 4:16 PM Aleksander Alekseev wrote: > > Hi hackers, > > > 8272749e added a few more arguments to CastCreate(). Here is the rebased > > patch. > > After merging afbfc029 [1] the patch needed a rebase. PFA v10. > > The patch is still in a PoC state and this is exactly why comments and > suggestions from the community are most welcome! Particularly I would > like to know: > > 1. Would you call it a wanted feature considering the existence of > Pluggable TOASTer patchset which (besides other things) tries to > introduce type-aware TOASTers for EXTERNAL attributes? I know what > Simon's [2] and Nikita's latest answers were, and I know my personal > opinion on this [3][4], but I would like to hear from the rest of the > community. > > 2. How should we make sure a dictionary will not consume all the > available memory? Limiting the amount of dictionary entries to pow(2, > 16) and having dictionary versions seems to work OK for ZSON. However > it was pointed out that this may be an unwanted limitation for the > in-core implementation. > > [1]: > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c727f511;hp=afbfc02983f86c4d71825efa6befd547fe81a926 > [2]: > https://www.postgresql.org/message-id/CANbhV-HpCF852WcZuU0wyh1jMU4p6XLbV6rCRkZpnpeKQ9OenQ%40mail.gmail.com > [3]: > https://www.postgresql.org/message-id/CAJ7c6TN-N3%3DPSykmOjmW1EAf9YyyHFDHEznX-5VORsWUvVN-5w%40mail.gmail.com > [4]: > https://www.postgresql.org/message-id/CAJ7c6TO2XTTk3cu5w6ePHfhYQkoNpw7u1jeqHf%3DGwn%2BoWci8eA%40mail.gmail.com I tried to apply the patch but it is failing at the Head. It is giving the following error: patching file src/test/regress/expected/dict.out patching file src/test/regress/expected/oidjoins.out patching file src/test/regress/expected/opr_sanity.out patching file src/test/regress/parallel_schedule Hunk #1 FAILED at 111. 
1 out of 1 hunk FAILED -- saving rejects to file src/test/regress/parallel_schedule.rej patching file src/test/regress/sql/dict.sql Please send the Re-base version of the patch. Thanks and Regards, Shubham Khanna.
Re: Implement missing join selectivity estimation for range types
On Fri, 5 Jan 2024 at 23:09, Schoemans Maxime wrote:
>
> On 05/01/2024 11:37, vignesh C wrote:
> > One of the tests was aborted at [1], kindly post an updated patch for
> > the same:
>
> Thank you for notifying us.
> I believe I fixed the issue but it is hard to be certain as the issue
> did not arise when running the regression tests locally.

I'm noticing this issue is not yet resolved; the CFBot is still failing
at [1] with:

#7 0x55cddc25cd93 in range_cmp_bound_values (typcache=typcache@entry=0x62930b60, b1=b1@entry=0x61c16f08, b2=b2@entry=0x61c180b8) at rangetypes.c:2090
[19:55:02.591] No locals.
[19:55:02.591] #8 0x55cddc2685c1 in calc_hist_join_selectivity (typcache=typcache@entry=0x62930b60, hist1=hist1@entry=0x61c180b8, nhist1=nhist1@entry=101, hist2=hist2@entry=0x61c168b8, nhist2=nhist2@entry=101) at rangetypes_selfuncs.c:1295
[19:55:02.591] i = 0
[19:55:02.591] j = 101
[19:55:02.591] selectivity = 0
[19:55:02.591] prev_sel1 = -1
[19:55:02.591] prev_sel2 = 0
[19:55:02.591] #9 0x55cddc269aaa in rangejoinsel (fcinfo=) at rangetypes_selfuncs.c:1479
[19:55:02.591] root =
[19:55:02.591] operator =
[19:55:02.591] args =
[19:55:02.591] sjinfo =
[19:55:02.591] vardata1 = {var = , rel = , statsTuple = , freefunc = , vartype = , atttype = , atttypmod = , isunique = , acl_ok = }
[19:55:02.591] vardata2 = {var = , rel = , statsTuple = , freefunc = , vartype = , atttype = , atttypmod = , isunique = , acl_ok = }
[19:55:02.591] hist1 = {staop = , stacoll = , valuetype = , values = , nvalues = , numbers = , nnumbers = , values_arr = , numbers_arr = }
[19:55:02.591] hist2 = {staop = , stacoll = , valuetype = , values = , nvalues = , numbers = , nnumbers = , values_arr = , numbers_arr = }
[19:55:02.591] sslot = {staop = , stacoll = , valuetype = , values = , nvalues = , numbers = , nnumbers = , values_arr = , numbers_arr = }
[19:55:02.591] reversed =
[19:55:02.591] selec = 0.00170937500013
[19:55:02.591] typcache = 0x62930b60
[19:55:02.591] stats1 =
[19:55:02.591] stats2 =
[19:55:02.591] empty_frac1 = 0
[19:55:02.591] empty_frac2 = 0
[19:55:02.591] null_frac1 = 0
[19:55:02.591] null_frac2 = 0
[19:55:02.591] nhist1 = 101
[19:55:02.591] nhist2 = 101
[19:55:02.591] hist1_lower = 0x61c168b8
[19:55:02.591] hist1_upper = 0x61c170b8
[19:55:02.591] hist2_lower = 0x61c178b8
[19:55:02.591] hist2_upper = 0x61c180b8
[19:55:02.591] empty =
[19:55:02.591] i =
[19:55:02.591] __func__ = "rangejoinsel"
[19:55:02.591] #10 0x55cddc3b761f in FunctionCall5Coll (flinfo=flinfo@entry=0x7ffc1628d710, collation=collation@entry=0, arg1=arg1@entry=107545982648856, arg2=arg2@entry=3888, arg3=arg3@entry=107820862916056, arg4=arg4@entry=0, arg5=) at fmgr.c:1242
[19:55:02.591] fcinfodata = {fcinfo = {flinfo = , context = , resultinfo = , fncollation = , isnull = , nargs = , args = 0x0}, fcinfo_data = { }}
[19:55:02.591] fcinfo = 0x7ffc1628d5e0
[19:55:02.591] result =
[19:55:02.591] __func__ = "FunctionCall5Coll"
[19:55:02.591] #11 0x55cddc3b92ee in OidFunctionCall5Coll (functionId=8355, collation=collation@entry=0, arg1=arg1@entry=107545982648856, arg2=arg2@entry=3888, arg3=arg3@entry=107820862916056, arg4=arg4@entry=0, arg5=) at fmgr.c:1460
[19:55:02.591] flinfo = {fn_addr = , fn_oid = , fn_nargs = , fn_strict = , fn_retset = , fn_stats = , fn_extra = , fn_mcxt = , fn_expr = }
[19:55:02.591] #12 0x55cddbe834ae in join_selectivity (root=root@entry=0x61d00017c218, operatorid=operatorid@entry=3888, args=0x6210003bc5d8, inputcollid=0, jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x7ffc1628db30) at ../../../../src/include/postgres.h:324
[19:55:02.591] oprjoin =
[19:55:02.591] result =
[19:55:02.591] __func__ = "join_selectivity"
[19:55:02.591] #13 0x55cddbd8c18c in clause_selectivity_ext (root=root@entry=0x61d00017c218, clause=0x6210003bc678, varRelid=varRelid@entry=0, jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x7ffc1628db30, use_extended_stats=use_extended_stats@entry=true) at clausesel.c:841

I have changed the status to "Waiting on Author",
feel free to post an updated version, check CFBot and update the Commitfest entry accordingly. [1] - https://cirrus-ci.com/task/5698117824151552 Regards, Vignesh
Re: Synchronizing slots from primary to standby
On Wed, Jan 17, 2024 at 3:08 PM Bertrand Drouvot wrote:
>
> Hi,
>
> On Tue, Jan 16, 2024 at 05:27:05PM +0530, shveta malik wrote:
> > PFA v62. Details:
>
> Thanks!
>
> > v62-003:
> > It is a new patch which attempts to implement slot-sync worker as a
> > special process which is neither a bgworker nor an Auxiliary process.
> > Here we get the benefit of converting enable_syncslot to a PGC_SIGHUP
> > Guc rather than PGC_POSTMASTER. We launch the slot-sync worker only if
> > it is hot-standby and 'enable_syncslot' is ON.
>
> The implementation looks reasonable to me (from what I can see, some parts
> are copy/paste from an already existing "special" process and some parts
> are "sync slot" specific), which makes full sense.

Thanks for the feedback. I have addressed the comments in v63, except the 5th one.

> A few remarks:
>
> 1 ===
>
> + * Was it the slot sycn worker?
>
> Typo: sycn
>
> 2 ===
>
> + * ones), and no walwriter, autovac launcher or bgwriter or slot sync
>
> Instead? "* ones), and no walwriter, autovac launcher, bgwriter or slot sync"
>
> 3 ===
>
> + * restarting slot slyc worker. If stopSignaled is set, the worker will
>
> Typo: slyc
>
> 4 ===
>
> +/* Flag to tell if we are in an slot sync worker process */
>
> s/an/a/ ?
>
> 5 === (coming from v62-0002)
>
> + Assert(tuplestore_tuple_count(res->tuplestore) == 1);
>
> Is it even possible for the related query to not return only one row? (I
> think the "count" ensures it).

I think you are right. This assertion was added some time back on the basis of
feedback on hackers. Let me review that again. I can consider this comment in
the next version.

> 6 ===
>
>         if (conninfo_changed ||
>             primary_slotname_changed ||
> +           old_enable_syncslot != enable_syncslot ||
>             (old_hot_standby_feedback != hot_standby_feedback))
>         {
>             ereport(LOG,
>                     errmsg("slot sync worker will restart because of"
>                            " a parameter change"));
>
> I don't think "slot sync worker will restart" is true if one changes
> enable_syncslot from on to off.

Yes, right. I have changed the log message in this specific case.

> IMHO, v62-003 is in good shape and could be merged into v62-002 (that would
> ease the review). But let's wait to see if others think differently.
>
> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
Re: Add pgindent test to check if codebase is correctly formatted
Hmm, should this also install typedefs.list and pgindent.man? What about the tooling to reformat Perl code? -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Linux transformó mi computadora, de una `máquina para hacer cosas', en un aparato realmente entretenido, sobre el cual cada día aprendo algo nuevo" (Jaime Salinas)
Re: Oversight in reparameterize_path_by_child leading to executor crash
On Wed, Jan 17, 2024 at 5:01 PM Richard Guo wrote:
>
> On Tue, Jan 16, 2024 at 2:30 AM Robert Haas wrote:
>
>> On Mon, Jan 8, 2024 at 3:32 AM Richard Guo wrote:
>> > Fair point. I think we can back-patch a more minimal fix, as Tom
>> > proposed in [1], which disallows the reparameterization if the path
>> > contains sampling info that references the outer rel. But I think we
>> > need also to disallow the reparameterization of a path if it contains
>> > restriction clauses that reference the outer rel, as such paths have
>> > been found to cause incorrect results, or planning errors as in [2].
>>
>> Do you want to produce a patch for that, to go along with this patch for
>> master?
>
> Sure, here it is:
>
> v10-0001-Avoid-reparameterizing-Paths-when-it-s-not-suitable.patch

I forgot to mention that this patch applies on v16, not master.

Thanks
Richard
Re: Synchronizing slots from primary to standby
Hi,

On Tue, Jan 16, 2024 at 05:27:05PM +0530, shveta malik wrote:
> PFA v62. Details:

Thanks!

> v62-003:
> It is a new patch which attempts to implement slot-sync worker as a
> special process which is neither a bgworker nor an Auxiliary process.
> Here we get the benefit of converting enable_syncslot to a PGC_SIGHUP
> Guc rather than PGC_POSTMASTER. We launch the slot-sync worker only if
> it is hot-standby and 'enable_syncslot' is ON.

The implementation looks reasonable to me (from what I can see, some parts
are copy/paste from an already existing "special" process and some parts
are "sync slot" specific), which makes full sense.

A few remarks:

1 ===

+ * Was it the slot sycn worker?

Typo: sycn

2 ===

+ * ones), and no walwriter, autovac launcher or bgwriter or slot sync

Instead? "* ones), and no walwriter, autovac launcher, bgwriter or slot sync"

3 ===

+ * restarting slot slyc worker. If stopSignaled is set, the worker will

Typo: slyc

4 ===

+/* Flag to tell if we are in an slot sync worker process */

s/an/a/ ?

5 === (coming from v62-0002)

+ Assert(tuplestore_tuple_count(res->tuplestore) == 1);

Is it even possible for the related query to not return only one row? (I
think the "count" ensures it).

6 ===

        if (conninfo_changed ||
            primary_slotname_changed ||
+           old_enable_syncslot != enable_syncslot ||
            (old_hot_standby_feedback != hot_standby_feedback))
        {
            ereport(LOG,
                    errmsg("slot sync worker will restart because of"
                           " a parameter change"));

I don't think "slot sync worker will restart" is true if one changes
enable_syncslot from on to off.

IMHO, v62-003 is in good shape and could be merged into v62-002 (that would
ease the review). But let's wait to see if others think differently.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: make pg_ctl more friendly
Hi Alvaro,

On Wed, Jan 17, 2024 at 4:54 PM Alvaro Herrera wrote:
>
> I think this needs more comments. First, in the WaitPMResult enum, we
> currently have three values -- READY, STILL_STARTING, FAILED. These are
> all pretty self-explanatory. But this patch adds SHUTDOWN_IN_RECOVERY,
> and that's not at all self-explanatory. Did postmaster start or not?
> The enum value's name doesn't make that clear. So I'd do something like
>
> typedef enum
> {
>     POSTMASTER_READY,
>     POSTMASTER_STILL_STARTING,
>     /*
>      * postmaster no longer running, because it stopped after recovery
>      * completed.
>      */
>     POSTMASTER_SHUTDOWN_IN_RECOVERY,
>     POSTMASTER_FAILED,
> } WaitPMResult;
>
> Maybe put the comments in wait_for_postmaster_start instead.

I put the comments in WaitPMResult, since we would need to add two of them if
they went in wait_for_postmaster_start.

> Secondly, the patch proposes to add new text to be returned by
> do_start() when this happens, which would look like this:
>
> waiting for server to start... shut down while in recovery
> update recovery target settings for startup again if needed
>
> Is this crystal clear? I'm not sure. How about something like this?
>
> waiting for server to start... done, but not running
> server shut down because of recovery target settings

Agreed.

> variations on the first phrase:
>
> "done, no longer running"
> "done, but no longer running"
> "done, automatically shut down"
> "done, automatically shut down after recovery"

I chose the last one because it gives more information to users.

See v5, thanks for the wording ;)

> --
> Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
> "Now I have my system running, not a byte was off the shelf;
> It rarely breaks and when it does I fix the code myself.
> It's stable, clean and elegant, and lightning fast as well,
> And it doesn't cost a nickel, so Bill Gates can go to hell."

--
Regards
Junwang Zhao

v5-0001-PITR-shutdown-should-not-report-error-by-pg_ctl.patch
Description: Binary data
Re: psql JSON output format
On Tue, 2024-01-16 at 14:12 -0500, Robert Haas wrote:
> On Tue, Jan 16, 2024 at 11:07 AM Laurenz Albe wrote:
> > "Round-trip safety" is not so important. If you want to move data from
> > PostgreSQL to PostgreSQL, you use the plain or the binary format.
> > The CSV format by default renders NULL and empty strings identical, and
> > I don't think anybody objects to that.
>
> As Andrew says, the part about the CSV format is not correct, but I
> also don't think I agree with the larger point, either. I believe that
> round-trip safety is a really desirable property. Is it absolutely
> necessary in every case? Maybe not. But, it shouldn't be lacking
> without a good reason, either, at least IMHO. If you postulate that
> people are moving data from A to B, it is reasonable to think that
> eventually someone is going to want to move some data from B back to
> A. If that turns out to be hard, they'll be sad. We shouldn't make
> people sad without a good reason.

As mentioned in my other mail, I was talking about the psql output format
"csv" rather than about COPY.

I agree that it is desirable to lose as little information as possible.
But if we want to format query output as JSON, we have a couple of
requirements that cannot all be satisfied:

1. lose no information ("round-trip safe")

2. don't double quote numbers, booleans and other JSON values

3. don't skip any table column in the output

Christoph's original patch didn't satisfy #2, and his current version
doesn't satisfy #1.

Do you think that skipping NULL columns would be the best solution?
We don't do that in the to_json() function, which also renders SQL NULL
as JSON null.

I think the argument for round-trip safety of psql output is tenuous.
There is no way for psql to ingest JSON as input format, and the patch to
add JSON as COPY format only supports COPY TO.

And unless you can name the exact way that the data written by psql will be
loaded into PostgreSQL again, all that remains is an (understandable) unease
about losing the distinction between SQL NULL and JSON null.

We have jsonb_populate_record() to convert JSON back to a table row, but
that function will convert both missing columns and a JSON null to SQL NULL:

  CREATE TABLE xy (id integer, j jsonb);

  \pset null '∅'

  SELECT * FROM jsonb_populate_record(NULL::xy, '{"id":1,"j":null}');

   id │ j
  ════╪═══
    1 │ ∅
  (1 row)

  SELECT * FROM jsonb_populate_record(NULL::xy, '{"id":1}');

   id │ j
  ════╪═══
    1 │ ∅
  (1 row)

Indeed, there doesn't seem to be a way to generate JSON null with that
function. So I wouldn't worry about round-trip safety too much, and my
preference is how the current patch does it. I am not dead set against a
solution that omits NULL columns in the output, though.

Yours,
Laurenz Albe
Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx
> I do realize the same is true for plain \bind, but it seems
> like a bug there too.

The unscanned bind's parameters are discarded later in the HandleSlashCmds
function, so adding ignore_slash_options() for inactive branches scans and
discards them earlier. I will add it to match what's done in the other
commands, but I don't think it is testable, as the behaviour is the same
unless I'm missing something. I did add \bind, \bindx, \close and \parse to
the inactive-branch tests to complete the list.

> One more usability thing. I think \parse and \close should not require
> a \g to send the message. You can do that by returning PSQL_CMD_SEND
> instead of PSQL_CMD_SKIP_LINE.

Changed.

> I think the examples for \bindx and \close
> should use \parse instead of PREPARE.

Done. I had to rely on manual PREPARE for my first tests and it leaked into
the docs.

v3-0001-psql-Add-support-for-prepared-stmt-with-extended-.patch
Description: Binary data
Re: More new SQL/JSON item methods
On Mon, Jan 15, 2024 at 7:41 PM Peter Eisentraut wrote:
> Attached are two small fixup patches for your patch set.

Thanks, Peter.

> In the first one, I simplified the grammar for the .decimal() method.
> It seemed a bit overkill to build a whole list structure when all we
> need are 0, 1, or 2 arguments.

Agree. I added unary '+' and '-' support as well, and thus thought of having
separate rules altogether rather than folding those in.

> Per SQL standard, the precision and scale arguments are unsigned
> integers, so unary plus and minus signs are not supported. So my patch
> removes that support, but I didn't adjust the regression tests for that.

However, PostgreSQL numeric casting does support a negative scale. Here is
an example:

  # select '12345'::numeric(4,-2);
   numeric
  ─────────
     12300
  (1 row)

And thus I thought of supporting those. Do we want this JSON item method to
behave differently here?

> Also note that in your 0002 patch, the datetime precision is similarly
> unsigned, so that's consistent.
>
> By the way, in your 0002 patch, I don't see the need for the separate
> datetime_method grammar rule. You can fold that into accessor_op.

Sure.

> Overall, I think it would be better if you combined all three of these
> patches into one. Right now, you have arranged these as incremental
> features, and as a result of that, the additions to the JsonPathItemType
> enum and the grammar keywords etc. are ordered in the way you worked on
> these features, I guess. It would be good to maintain a bit of sanity
> to put all of this together and order all the enums and everything else,
> for example, in the order they are in the sql_features.txt file (which is
> alphabetical, I suppose). At this point I suspect we'll end up
> committing this whole feature set together anyway, so we might as well
> organize it that way.

OK. I will merge them all into one and will try to keep them in the order
specified in sql_features.txt. However, for the documentation, it makes more
sense to keep them in logical order rather than alphabetical order. What are
your views on this?

Thanks

--
Jeevan Chalke
Principal, Manager, Product Development
edbpostgres.com