Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
Hi, On 3/29/23 2:09 AM, Michael Paquier wrote: On Tue, Mar 28, 2023 at 05:43:26PM +0900, Kyotaro Horiguchi wrote: No. Fine by me, except that "block read requests" seems to suggest kernel read() calls, maybe because it's not clear to me whether "block" refers to our buffer blocks or file blocks. If it is generally clear, I'm fine with the proposal. Okay. Would somebody like to draft a patch? Please find a draft attached. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c809ff1ba4..ae5d9a0226 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5724,8 +5724,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 bigint
-Returns the number of buffers fetched for table or index, in the current
-transaction.
+Returns the number of block read requests for table or index, in the
+current transaction. This number minus
+pg_stat_get_xact_blocks_hit gives the number of kernel
+read() calls.
@@ -5738,8 +5740,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 bigint
-Returns the number of buffer hits for table or index, in the current
-transaction.
+Returns the number of block read requests for table or index, in the
+current transaction, found in cache (not triggering kernel
+read() calls).
Re: logical decoding and replication of sequences, take 2
On Wed, Mar 29, 2023 at 3:34 AM Tomas Vondra wrote: > > On 3/28/23 18:34, Masahiko Sawada wrote: > > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra > > wrote: > >> > >> > >> > >> On 3/27/23 03:32, Masahiko Sawada wrote: > >>> Hi, > >>> > >>> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra > >>> wrote: > > I merged the earlier "fixup" patches into the relevant parts, and left > two patches with new tweaks (deducing the correct "WAL" state from the > current state read by copy_sequence), and the interlock discussed here. > > >>> > >>> Apart from that, how does the publication having sequences work with > >>> subscribers who are not able to handle sequence changes, e.g. in a > >>> case where the PostgreSQL version of the publication is newer than the > >>> subscriber? As far as I tested the latest patches, the subscriber > >>> (v15) errors out with the error 'invalid logical replication message > >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible > >>> behavior. I think we should instead either (1) deny starting the > >>> replication if the subscriber isn't able to handle sequence changes > >>> and the publication includes that, or (2) not send sequence changes to > >>> such subscribers. > >>> > >> > >> I agree the "invalid message" error is not great, but it's not clear to > >> me how to do either (1). The trouble is we don't really know if the > >> publication contains (or will contain) sequences. I mean, what would > >> happen if the replication starts and then someone adds a sequence? > >> > >> For (2), I think that's not something we should do - silently discarding > >> some messages seems error-prone. If the publication includes sequences, > >> presumably the user wanted to replicate those. If they want to replicate > >> to an older subscriber, create a publication without sequences.
> >> > >> Perhaps the right solution would be to check if the subscriber supports > >> replication of sequences in the output plugin, while attempting to write > >> the "Q" message. And error-out if the subscriber does not support it. > > > > It might be related to this topic; do we need to bump the protocol > > version? The commit 64824323e57d introduced new streaming callbacks > > and bumped the protocol version. I think the same seems to be true for > > this change as it adds the sequence_cb callback. > > > > It's not clear to me what should be the exact behavior? > > I mean, imagine we're opening a connection for logical replication, and > the subscriber does not handle sequences. What should the publisher do? > > (Note: The correct commit hash is 464824323e57d.) Thanks. > > I don't think the streaming is a good match for sequences, because of a > couple of important differences ... > > Firstly, streaming determines *how* the changes are replicated, not what > gets replicated. It doesn't (silently) filter out "bad" events that the > subscriber doesn't know how to apply. If the subscriber does not know > how to deal with streamed xacts, it'll still get the same changes > exactly per the publication definition. > > Secondly, the default value is "streaming=off", i.e. the subscriber has > to explicitly request streaming when opening the connection. And we > simply check it against the negotiated protocol version, i.e. the check > in pgoutput_startup() protects against a subscriber requesting protocol > v1 but also streaming=on. > > I don't think we can/should do more checks at this point - we don't know > what's included in the requested publications at that point, and I doubt > it's worth adding because we certainly can't predict if the publication > will be altered to include/decode sequences in the future. True. That's a valid argument.
> > Speaking of precedents, TRUNCATE is probably a better one, because it's > a new action and it determines *what* the subscriber can handle. But > that does exactly the thing we do for sequences - if you open a > connection from a PG10 subscriber (truncate was added in PG11), and the > publisher decodes a truncate, the subscriber will do: > > 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical > replication message type "T" > 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical > replication worker for subscription 16390 (PID 2357609) exited with > exit code 1 > > I don't see why sequences should do anything else. If you need to > replicate to such a subscriber, create a publication that does not have > 'sequence' in the publish option ... > I hadn't checked the TRUNCATE case; yes, it is a good precedent for sequences. So it seems we don't need to do anything. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Using Ephemeral Named Relation like a temporary table
On Wed, Mar 29, 2023 at 12:54 AM Yugo NAGATA wrote: > Hello, > > > Temporary tables are often used to store transient data in > batch processing and the contents can be accessed multiple > times. However, frequent use of temporary tables has a problem > that the system catalog tends to bloat. I know there has been > several proposals to attack this problem, but I would like to > propose a new one. > > The idea is to use Ephemeral Named Relation (ENR) like a > temporary table. ENR information is not stored into the system > catalog, but in QueryEnvironment, so it never bloat the system > catalog. > > Although we cannot perform insert, update or delete on ENR, > I wonder it could be beneficial if we need to reference to a > result of a query multiple times in a batch processing. > > The attached is a concept patch. This adds a new syntax > "OPEN cursor INTO TABLE tablename" to pl/pgSQL, that stores > a result of the cursor query into a ENR with specified name. > However, this is a tentative interface to demonstrate the > concept of feature. > > Here is an example; > > postgres=# \sf fnc > CREATE OR REPLACE FUNCTION public.fnc() > RETURNS TABLE(sum1 integer, avg1 integer, sum2 integer, avg2 integer) > LANGUAGE plpgsql > AS $function$ > DECLARE > sum1 integer; > sum2 integer; > avg1 integer; > avg2 integer; > curs CURSOR FOR SELECT aid, bid, abalance FROM pgbench_accounts >WHERE abalance BETWEEN 100 AND 200; > BEGIN > OPEN curs INTO TABLE tmp_accounts; > SELECT count(abalance) , avg(abalance) INTO sum1, avg1 >FROM tmp_accounts; > SELECT count(bbalance), avg(bbalance) INTO sum2, avg2 >FROM tmp_accounts a, pgbench_branches b WHERE a.bid = b.bid; > RETURN QUERY SELECT sum1,avg1,sum2,avg2; > END; > $function$ > > postgres=# select fnc(); > fnc > > (541,151,541,3937) > (1 row) > > As above, we can use the same query result for multiple > aggregations, and also join it with other tables. > > What do you think of using ENR for this way? 
> > Regards, > Yugo Nagata > > -- > Yugo NAGATA > This looks like a slightly more flexible version of the Oracle PL/SQL table type. For those not familiar, PL/SQL can have record types, and in-memory collections of record types, and you can either build up multiple records in a collection manually, or you can bulk-collect them from a query. Then, you can later reference that collection in a regular SQL query with FROM TABLE(collection_name). It's a neat system for certain types of workloads. Example link (I'm sure there are better ones out there): https://oracle-base.com/articles/12c/using-the-table-operator-with-locally-defined-types-in-plsql-12cr1 My first take is that there are likely customers out there that will want this. However, those customers will want to manually add/delete rows from the ENR, so we'll want a way to do that. I haven't looked at ENRs in a while; when would the memory from that ENR get freed?
Re: Using Ephemeral Named Relation like a temporary table
Hi On Wed, Mar 29, 2023 at 6:54 AM, Yugo NAGATA wrote: > Hello, > > > Temporary tables are often used to store transient data in > batch processing and the contents can be accessed multiple > times. However, frequent use of temporary tables has a problem > that the system catalog tends to bloat. I know there has been > several proposals to attack this problem, but I would like to > propose a new one. > > The idea is to use Ephemeral Named Relation (ENR) like a > temporary table. ENR information is not stored into the system > catalog, but in QueryEnvironment, so it never bloat the system > catalog. > > Although we cannot perform insert, update or delete on ENR, > I wonder it could be beneficial if we need to reference to a > result of a query multiple times in a batch processing. > > The attached is a concept patch. This adds a new syntax > "OPEN cursor INTO TABLE tablename" to pl/pgSQL, that stores > a result of the cursor query into a ENR with specified name. > However, this is a tentative interface to demonstrate the > concept of feature. > > Here is an example; > > postgres=# \sf fnc > CREATE OR REPLACE FUNCTION public.fnc() > RETURNS TABLE(sum1 integer, avg1 integer, sum2 integer, avg2 integer) > LANGUAGE plpgsql > AS $function$ > DECLARE > sum1 integer; > sum2 integer; > avg1 integer; > avg2 integer; > curs CURSOR FOR SELECT aid, bid, abalance FROM pgbench_accounts >WHERE abalance BETWEEN 100 AND 200; > BEGIN > OPEN curs INTO TABLE tmp_accounts; > SELECT count(abalance) , avg(abalance) INTO sum1, avg1 >FROM tmp_accounts; > SELECT count(bbalance), avg(bbalance) INTO sum2, avg2 >FROM tmp_accounts a, pgbench_branches b WHERE a.bid = b.bid; > RETURN QUERY SELECT sum1,avg1,sum2,avg2; > END; > $function$ > > postgres=# select fnc(); > fnc > > (541,151,541,3937) > (1 row) > > As above, we can use the same query result for multiple > aggregations, and also join it with other tables. > > What do you think of using ENR for this way? > The idea looks pretty good.
I think it can be very useful. I am not sure this design is intuitive, though. If I remember correctly, Oracle has a similar feature, and it would be nice if we used the same or similar syntax (although I am not sure how implementable that is). I think the PL/SQL design has an advantage, because you don't need to solve the scope of the table assigned to the cursor. OPEN curs INTO TABLE tmp_accounts; -- this looks a little strange. It is missing the information that tmp_accounts is not a normal table. What about OPEN curs INTO CURSOR TABLE xxx; or OPEN curs FOR CURSOR TABLE xxx? Regards Pavel > > Regards, > Yugo Nagata > > -- > Yugo NAGATA >
[PATCH] Allow Postgres to pick an unused port to listen
Hi, I would like to suggest a patch against master (although it may be worth backporting it) that makes it possible to listen on any unused port. The main motivation is running colocated instances of Postgres (such as test benches) without having to coordinate port allocation in an unnecessarily complicated way. Instead, with this patch, one can specify `port` as `0` (the "wildcard" port) and retrieve the assigned port from postmaster.pid. I believe there is no significant performance or other impact, as it is a tiny bit of conditional functionality executed during startup. The patch builds and `make check` succeeds. The patch does not add a test; however, I am trying to figure out whether this behaviour can be tested automatically. -- http://omnigres.org Y. V1-0001-Allow-listening-port-to-be-0.patch Description: Binary data
Using Ephemeral Named Relation like a temporary table
Hello, Temporary tables are often used to store transient data in batch processing and the contents can be accessed multiple times. However, frequent use of temporary tables has the problem that the system catalog tends to bloat. I know there have been several proposals to attack this problem, but I would like to propose a new one. The idea is to use an Ephemeral Named Relation (ENR) like a temporary table. ENR information is not stored in the system catalog, but in QueryEnvironment, so it never bloats the system catalog. Although we cannot perform insert, update or delete on an ENR, I wonder if it could be beneficial when we need to reference the result of a query multiple times in batch processing. Attached is a concept patch. This adds a new syntax "OPEN cursor INTO TABLE tablename" to pl/pgSQL that stores the result of the cursor query into an ENR with the specified name. However, this is a tentative interface to demonstrate the concept of the feature. Here is an example; postgres=# \sf fnc CREATE OR REPLACE FUNCTION public.fnc() RETURNS TABLE(sum1 integer, avg1 integer, sum2 integer, avg2 integer) LANGUAGE plpgsql AS $function$ DECLARE sum1 integer; sum2 integer; avg1 integer; avg2 integer; curs CURSOR FOR SELECT aid, bid, abalance FROM pgbench_accounts WHERE abalance BETWEEN 100 AND 200; BEGIN OPEN curs INTO TABLE tmp_accounts; SELECT count(abalance) , avg(abalance) INTO sum1, avg1 FROM tmp_accounts; SELECT count(bbalance), avg(bbalance) INTO sum2, avg2 FROM tmp_accounts a, pgbench_branches b WHERE a.bid = b.bid; RETURN QUERY SELECT sum1,avg1,sum2,avg2; END; $function$ postgres=# select fnc(); fnc (541,151,541,3937) (1 row) As above, we can use the same query result for multiple aggregations, and also join it with other tables. What do you think of using ENR in this way?
Regards, Yugo Nagata -- Yugo NAGATA diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c index e3a170c38b..27e94ecc87 100644 --- a/src/backend/executor/spi.c +++ b/src/backend/executor/spi.c @@ -20,6 +20,7 @@ #include "access/xact.h" #include "catalog/heap.h" #include "catalog/pg_type.h" +#include "commands/portalcmds.h" #include "commands/trigger.h" #include "executor/executor.h" #include "executor/spi_priv.h" @@ -3378,3 +3379,30 @@ SPI_register_trigger_data(TriggerData *tdata) return SPI_OK_TD_REGISTER; } + +int +SPI_register_portal(Portal portal, char *name) +{ + int rc; + EphemeralNamedRelation enr; + + if (portal == NULL || name == NULL) + return SPI_ERROR_ARGUMENT; + + PortalCreateHoldStore(portal); + PersistHoldablePortal(portal); + + enr = palloc(sizeof(EphemeralNamedRelationData)); + + enr->md.name = name; + enr->md.reliddesc = InvalidOid; + enr->md.tupdesc = portal->tupDesc; + enr->md.enrtype = ENR_NAMED_TUPLESTORE; + enr->md.enrtuples = tuplestore_tuple_count(portal->holdStore); + enr->reldata = portal->holdStore; + rc = SPI_register_relation(enr); + if (rc != SPI_OK_REL_REGISTER) + return rc; + + return SPI_OK_TD_REGISTER; +} diff --git a/src/include/executor/spi.h b/src/include/executor/spi.h index d1de139a3b..40753dc78a 100644 --- a/src/include/executor/spi.h +++ b/src/include/executor/spi.h @@ -199,6 +199,7 @@ extern void SPI_cursor_close(Portal portal); extern int SPI_register_relation(EphemeralNamedRelation enr); extern int SPI_unregister_relation(const char *name); extern int SPI_register_trigger_data(TriggerData *tdata); +extern int SPI_register_portal(Portal portal, char *name); extern void SPI_start_transaction(void); extern void SPI_commit(void); diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c index b0a2cac227..5d561b39bd 100644 --- a/src/pl/plpgsql/src/pl_exec.c +++ b/src/pl/plpgsql/src/pl_exec.c @@ -4776,6 +4776,9 @@ exec_stmt_open(PLpgSQL_execstate *estate, PLpgSQL_stmt_open *stmt) elog(ERROR, "could 
not open cursor: %s", SPI_result_code_string(SPI_result)); + if (stmt->tablename) + SPI_register_portal(portal, stmt->tablename); + /* * If cursor variable was NULL, store the generated portal name in it, * after verifying it's okay to assign to. diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y index edeb72c380..576ea86943 100644 --- a/src/pl/plpgsql/src/pl_gram.y +++ b/src/pl/plpgsql/src/pl_gram.y @@ -185,7 +185,7 @@ static void check_raise_parameters(PLpgSQL_stmt_raise *stmt); %type foreach_slice %type for_control -%type any_identifier opt_block_label opt_loop_label opt_label +%type any_identifier opt_block_label opt_into_table opt_loop_label opt_label %type option_value %type proc_sect stmt_elsifs stmt_else @@ -2064,7 +2064,17 @@ stmt_dynexecute : K_EXECUTE ; -stmt_open : K_OPEN cursor_variable +opt_into_table : + { + $$ = NULL; + } +| K_INTO K_TABLE any_identifier + { + $$ = $3; + } +; + +stmt_open : K_OPEN cursor_variable opt_into_table {
RE: doc: add missing "id" attributes to extension packaging page
Dear my comrade Brar, > Thanks, I actually was not aware of this. > > Also, kudos for capturing the missing id and advocating for consistency > regarding ids even before this is actively enforced. This nurtures my > optimism that consistency might actually be achievable without > everybody getting angry at me because my patch will enforce it. > > Regarding the overlap, I currently try to make it as easy as possible > for a potential committer and I'm happy to rebase my patch upon request > or if Kuroda-san's patch gets applied first. FYI - my patch is pushed [1]. Could you please rebase your patch? I think it's ok to just remove changes from logical-replication.sgml, ref/alter_subscription.sgml, and ref/create_subscription.sgml. [1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=de5a47af2d8003dee123815bb7e58913be9a03f3 Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: Should vacuum process config file reload more often
At Wed, 29 Mar 2023 12:09:08 +0900 (JST), Kyotaro Horiguchi wrote in > timing perfectly. I think we might accidentally add a reload > timing. In that case, the assumption could break. In most cases, I > think we use snapshotting in various ways to avoid unintended variable > changes. (And I believe the analyze code also does that.) Okay, I was missing the following code. autovacuum.c:2893 /* * If any of the cost delay parameters has been set individually for * this table, disable the balancing algorithm. */ tab->at_dobalance = !(avopts && (avopts->vacuum_cost_limit > 0 || avopts->vacuum_cost_delay > 0)); So, sorry for the noise. I'll review it taking this into consideration. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Data is copied twice when specifying both child and parent table in publication
On Wed, Mar 29, 2023 at 7:44 AM Peter Smith wrote: > > A minor review comment for v25-0001. > > == > src/backend/commands/subscriptioncmds.c > > 1. > @@ -1936,21 +1936,56 @@ fetch_table_list(WalReceiverConn *wrconn, List > *publications) > WalRcvExecResult *res; > StringInfoData cmd; > TupleTableSlot *slot; > - Oid tableRow[3] = {TEXTOID, TEXTOID, NAMEARRAYOID}; > + Oid tableRow[3] = {TEXTOID, TEXTOID, InvalidOid}; > > The patch could be slightly less invasive if you did not make this > change, but instead, only overwrite tableRow[2] for the >= PG16 case. > > Or vice versa, if you prefer. > The current coding pattern looks neat to me. -- With Regards, Amit Kapila.
Re: refactoring relation extension and BufferAlloc(), faster COPY
Hi, On 2023-03-27 15:32:47 +0900, Kyotaro Horiguchi wrote: > At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund wrote > in > > Hi, > > > > Attached is v5. Lots of comment polishing, a bit of renaming. I extracted > > the > > relation extension related code in hio.c back into its own function. > > > > While reviewing the hio.c code, I did realize that too much stuff is done > > while holding the buffer lock. See also the pre-existing issue > > https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de > > 0001, 0002 look fine to me. > > 0003 adds the new function FileFallocate, but we already have > AllocateFile. Although fd.c contains functions with varying word > orders, it could be confusing that closely named functions have > different naming conventions. The syscall is named fallocate, I don't think we'd gain anything by inventing a different name for it? Given that there's a number of File$syscall operations, I think it's clear enough that it just fits into that. Unless you have a better proposal? > + /* > + * Return in cases of a "real" failure, if fallocate is not supported, > + * fall through to the FileZero() backed implementation. > + */ > + if (returnCode != EINVAL && returnCode != EOPNOTSUPP) > + return returnCode; > > I'm not entirely sure, but man 2 fallocate tells that ENOSYS also can > be returned. Some googling indicates that ENOSYS might need the same > amendment as EOPNOTSUPP. However, I'm not clear on why man > posix_fallocate doesn't mention the former. posix_fallocate() and its errors are specified by posix, I guess. I think glibc etc will map ENOSYS to EOPNOTSUPP. I really dislike this bit from the posix_fallocate manpage: EINVAL offset was less than 0, or len was less than or equal to 0, or the underlying filesystem does not support the operation. Why oh why would you add the "or .." portion into EINVAL, when there's also EOPNOTSUPP?
> > + (returnCode != EINVAL && returnCode != EINVAL)) > :) > > > FileGetRawDesc(File file) > { > Assert(FileIsValid(file)); > + > + if (FileAccess(file) < 0) > + return -1; > + > > The function's comment is provided below. > > > * The returned file descriptor will be valid until the file is closed, but > > * there are a lot of things that can make that happen. So the caller should > > * be careful not to do much of anything else before it finishes using the > > * returned file descriptor. > > So, the responsibility to make sure the file is valid seems to lie > with the callers, although I'm not sure since there aren't any > users of the function in the tree. Except, as I think you realized as well, external callers *can't* call FileAccess(); it's static. > I'm unclear as to why FileSize omits the case lruLessRecently != file. Not quite following - why would FileSize() deal with lruLessRecently itself? Or do you mean why FileSize() uses FileIsNotOpen() itself, rather than relying on FileAccess() doing that internally? > When examining similar functions, such as FileGetRawFlags and > FileGetRawMode, I'm puzzled to find that neither FileAccess() nor > BasicOpenFilePerm() sets the struct members referred to by those > functions. Those aren't involved with the LRU mechanism, IIRC. Note that BasicOpenFilePerm() returns an actual fd, not a File. So you can't call FileGetRawMode() on it. As BasicOpenFilePerm() says: * This is exported for use by places that really want a plain kernel FD, * but need to be proof against running out of FDs. ... I don't think FileAccess() needs to set those struct members, that's already been done in PathNameOpenFilePerm(). > This makes me question the usefulness of these functions, including > FileGetRawDesc(). It's quite weird that we have FileGetRawDesc(), but don't allow using it in a safe way... > Regardless, since the > patchset doesn't use FileGetRawDesc(), I don't believe the fix is > necessary in this patch set. Yea.
It was used in an earlier version, but not anymore. > > + if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber) > > I'm not sure it is appropriate to assume InvalidBlockNumber equals > MaxBlockNumber + 1 in this context. Hm. This is just the multi-block equivalent of what mdextend() already does. It's not pretty, indeed. I'm not sure there's really a better thing to do for mdzeroextend(), given the mdextend() precedent? mdzeroextend() (just as mdextend()) will be called with blockNum == InvalidBlockNumber, if you try to extend past the size limit. > > + * However, we don't use FileAllocate() for small extensions, > as it > + * defeats delayed allocation on some filesystems. Not clear > where > + * that decision should be made though? For now just use a > cutoff of > + * 8, anything between 4 and 8 worked OK in some local testing. > > The choice is quite similar to the one FileFallocate() makes. However, I'm > not sure FileFallocate() itself
Re: refactoring relation extension and BufferAlloc(), faster COPY
Hi, On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote: > Below is my review of a slightly older version than you just posted -- > much of it you may have already addressed. Far from all of it is addressed already - thanks for the review! > From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001 > From: Andres Freund > Date: Mon, 20 Mar 2023 21:57:40 -0700 > Subject: [PATCH v5 01/15] createdb-using-wal-fixes > > This could use a more detailed commit message -- I don't really get what > it is doing It's a fix for a bug that I encountered while hacking on this; it has since been committed. commit 5df319f3d55d09fadb4f7e4b58c5b476a3aeceb4 Author: Andres Freund Date: 2023-03-20 21:57:40 -0700 Fix memory leak and inefficiency in CREATE DATABASE ... STRATEGY WAL_LOG RelationCopyStorageUsingBuffer() did not free the strategies used to access the source / target relation. The memory was released at the end of the transaction, but when using a template database with a lot of relations, the temporary leak can become prohibitively big. RelationCopyStorageUsingBuffer() acquired the buffer for the target relation with RBM_NORMAL, therefore requiring a read of a block guaranteed to be zero. Use RBM_ZERO_AND_LOCK instead.
Reviewed-by: Robert Haas Discussion: https://postgr.es/m/20230321070113.o2vqqxogjykwg...@awork3.anarazel.de Backpatch: 15-, where STRATEGY WAL_LOG was introduced > From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001 > From: Andres Freund > Date: Wed, 1 Jul 2020 19:06:45 -0700 > Subject: [PATCH v5 02/15] Add some error checking around pinning > > --- > src/backend/storage/buffer/bufmgr.c | 40 - > src/include/storage/bufmgr.h| 1 + > 2 files changed, 29 insertions(+), 12 deletions(-) > > diff --git a/src/backend/storage/buffer/bufmgr.c > b/src/backend/storage/buffer/bufmgr.c > index 95212a3941..fa20fab5a2 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer) > LW_EXCLUSIVE); > } > > +void > +BufferCheckOneLocalPin(Buffer buffer) > +{ > +if (BufferIsLocal(buffer)) > +{ > +/* There should be exactly one pin */ > +if (LocalRefCount[-buffer - 1] != 1) > +elog(ERROR, "incorrect local pin count: %d", > +LocalRefCount[-buffer - 1]); > +} > +else > +{ > +/* There should be exactly one local pin */ > +if (GetPrivateRefCount(buffer) != 1) > > I'd rather this be an else if (was already like this, but, no reason not > to change it now) I don't like that much - it'd break the symmetry between local / non-local. > +/* > + * Zero a region of the file. > + * > + * Returns 0 on success, -1 otherwise. In the latter case errno is set to the > + * appropriate error. 
> + */ > +int > +FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info) > +{ > +intreturnCode; > +ssize_twritten; > + > +Assert(FileIsValid(file)); > +returnCode = FileAccess(file); > +if (returnCode < 0) > +return returnCode; > + > +pgstat_report_wait_start(wait_event_info); > +written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset); > +pgstat_report_wait_end(); > + > +if (written < 0) > +return -1; > +else if (written != amount) > > this doesn't need to be an else if You mean it could be a "bare" if instead? I don't really think that's clearer. > +{ > +/* if errno is unset, assume problem is no disk space */ > +if (errno == 0) > +errno = ENOSPC; > +return -1; > +} > > +int > +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info) > +{ > +#ifdef HAVE_POSIX_FALLOCATE > +intreturnCode; > + > +Assert(FileIsValid(file)); > +returnCode = FileAccess(file); > +if (returnCode < 0) > +return returnCode; > + > +pgstat_report_wait_start(wait_event_info); > +returnCode = posix_fallocate(VfdCache[file].fd, offset, amount); > +pgstat_report_wait_end(); > + > +if (returnCode == 0) > +return 0; > + > +/* for compatibility with %m printing etc */ > +errno = returnCode; > + > +/* > +* Return in cases of a "real" failure, if fallocate is not supported, > +* fall through to the FileZero() backed implementation. > +*/ > +if (returnCode != EINVAL && returnCode != EOPNOTSUPP) > +return returnCode; > > I'm pretty sure you can just delete the below if statement > > +if (returnCode == 0 || > +(returnCode != EINVAL && returnCode != EINVAL)) > +return returnCode; Hm. I don't see how - wouldn't that lead us to call FileZero(), even if FileFallocate() succeeded or failed (rather than not being supported)? > > +/* > + *mdzeroextend() -- Add ew zeroed out blocks to
Re: Should vacuum process config file reload more often
At Tue, 28 Mar 2023 20:35:28 -0400, Melanie Plageman wrote in > On Tue, Mar 28, 2023 at 4:21 AM Kyotaro Horiguchi > wrote: > > > > At Mon, 27 Mar 2023 14:12:03 -0400, Melanie Plageman > > wrote in > > > So, I've attached an alternate version of the patchset which takes the > > > approach of having one commit which only enables cost-based delay GUC > > > refresh for VACUUM and another commit which enables it for autovacuum > > > and makes the changes to balancing variables. > > > > > > I still think the commit which has workers updating their own > > > wi_cost_delay in vacuum_delay_point() is a bit weird. It relies on no one > > > else emulating our bad behavior and reading from wi_cost_delay without a > > > lock and on no one else deciding to ever write to wi_cost_delay (even > > > though it is in shared memory [this is the same as master]). It is only > > > safe because our process is the only one (right now) writing to > > > wi_cost_delay, so when we read from it without a lock, we know it isn't > > > being written to. And everyone else takes a lock when reading from > > > wi_cost_delay right now. So, it seems...not great. > > > > > > This approach also introduces a function that is only around for > > > one commit until the next commit obsoletes it, which seems a bit silly. > > > > (I'm not sure what this refers to, though..) I don't think it's silly > > if a later patch removes something that the preceding patches > > introduced, as long as that contributes to readability. Ultimately, > > they will be merged together on committing. > > I was under the impression that reviewers thought config reload and > worker balance changes should be committed in separate commits. > > Either way, the ephemeral function is not my primary concern. I felt > more uncomfortable with increasing how often we update a double in > shared memory which is read without acquiring a lock.
> > > Basically, I think it is probably better to just have one commit > > > enabling guc refresh for VACUUM and then another which correctly > > > implements what is needed for autovacuum to do the same. > > > Attached v9 does this. > > > > > > I've provided both complete versions of both approaches (v9 and v8). > > > > I took a look at v9 and have a few comments. > > > > 0001: > > > > I don't believe it is necessary, as mentioned in the commit > > message. It appears that we are resetting it at the appropriate times. > > VacuumCostBalance must be zeroed out when we disable vacuum cost. > Previously, we did not reenable VacuumCostActive once it was disabled, > but now that we do, I think it is good practice to always zero out > VacuumCostBalance when we disable vacuum cost. I will squash this commit > into the one introducing VacuumCostInactive, though. > > > > 0002: > > > > I felt a bit uneasy about this. It seems somewhat complex (and makes the > > succeeding patches complex), > > Even if we introduced a second global variable to indicate that failsafe > mode has been engaged, we would still require the additional checks > of VacuumCostInactive. > > > has confusing names, > > I would be happy to rename the values of the enum to make them less > confusing. Are you thinking "force" instead of "locked"? > maybe: > VACUUM_COST_FORCE_INACTIVE and > VACUUM_COST_INACTIVE > ? > > > and doesn't seem self-contained. > > By changing the variable from VacuumCostActive to VacuumCostInactive, I > have kept all non-vacuum code from having to distinguish between it > being inactive due to failsafe mode or due to user settings. My concern is that VacuumCostActive is logic-inverted and turned into a ternary variable in a subtle way. The expression "!VacuumCostInactive" is quite confusing. (I sometimes feel the same way about "!XLogRecPtrIsInvalid(lsn)", and I believe most people write it with another macro like "lsn != InvalidXLogRecPtr").
Additionally, the constraint in this patch will be implemented as open code. So I wanted to suggest something like the attached. The main idea is to use a wrapper function to enforce the restriction, and by doing so, we eliminated the need to make the variable into a ternary without a good reason. > > I think it'd be simpler to add a global boolean (maybe > > VacuumCostActiveForceDisable or such) that forces VacuumCostActive to > > be false and set VacuumCostActive using a setter function that follows > > the boolean. > > I think maintaining an additional global variable is more brittle than > including the information in a single variable. > > > 0003: > > > > +* Reload the configuration file if requested. This allows changes > > to > > +* vacuum_cost_limit and vacuum_cost_delay to take effect while a > > table is > > +* being vacuumed or analyzed. Analyze should not reload > > configuration > > +* file if it is in an outer transaction, as GUC values shouldn't be > > +* allowed to refer to some uncommitted state (e.g. database objects > > +* created
Re: hio.c does visibilitymap_pin()/IO while holding buffer lock
Hi, On 2023-03-28 18:21:02 -0700, Andres Freund wrote: > Here's a draft patch. Attached is v2, with a stupid bug fixed and a bit of comment / pgindent polish. Greetings, Andres Freund >From 6f27018765f167a8d2a52bc7c9068cb09ca8d82c Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 24 Oct 2022 12:28:06 -0700 Subject: [PATCH v2 1/2] hio: Release extension lock before initializing page / pinning VM PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e3 started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we release the page lock. PageInit() zeroes the page, which isn't that cheap, so deferring it until after the extension lock is released seems like a good idea. Doing visibilitymap_pin() while holding the extension lock, introduced in 7db0cd2145f2, looks like an accident. Due to the restrictions on HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems better to move it out. --- src/backend/access/heap/hio.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index e152807d2dc..7479212d4e0 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -623,6 +623,13 @@ loop: */ buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate); + /* + * Release the file-extension lock; it's now OK for someone else to extend + * the relation some more. + */ + if (needLock) + UnlockRelationForExtension(relation, ExclusiveLock); + /* * We need to initialize the empty new page. Double-check that it really * is empty (this should never happen, but if it does we don't want to @@ -647,13 +654,6 @@ loop: visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer); } - /* - * Release the file-extension lock; it's now OK for someone else to extend - * the relation some more. - */ - if (needLock) - UnlockRelationForExtension(relation, ExclusiveLock); - /* * Lock the other buffer. 
It's guaranteed to be of a lower page number * than the new page. To conform with the deadlock prevent rules, we ought -- 2.38.0 >From 5a33df025ce466343d88bbc57ac1bc80d883a230 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 28 Mar 2023 18:17:11 -0700 Subject: [PATCH v2 2/2] WIP: relation extension: Don't pin the VM while holding buffer lock --- src/backend/access/heap/hio.c | 120 +++--- 1 file changed, 81 insertions(+), 39 deletions(-) diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 7479212d4e0..8b3dfa0ccae 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -134,14 +134,17 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock, * buffer2 may be InvalidBuffer, if only one buffer is involved. buffer1 * must not be InvalidBuffer. If both buffers are specified, block1 must * be less than block2. + * + * Returns whether buffer locks were temporarily released. */ -static void +static bool GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, BlockNumber block1, BlockNumber block2, Buffer *vmbuffer1, Buffer *vmbuffer2) { bool need_to_pin_buffer1; bool need_to_pin_buffer2; + bool released_locks = false; Assert(BufferIsValid(buffer1)); Assert(buffer2 == InvalidBuffer || block1 <= block2); @@ -155,9 +158,10 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, && PageIsAllVisible(BufferGetPage(buffer2)) && !visibilitymap_pin_ok(block2, *vmbuffer2); if (!need_to_pin_buffer1 && !need_to_pin_buffer2) - return; + break; /* We must unlock both buffers before doing any I/O. 
*/ + released_locks = true; LockBuffer(buffer1, BUFFER_LOCK_UNLOCK); if (buffer2 != InvalidBuffer && buffer2 != buffer1) LockBuffer(buffer2, BUFFER_LOCK_UNLOCK); @@ -183,6 +187,8 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, || (need_to_pin_buffer1 && need_to_pin_buffer2)) break; } + + return released_locks; } /* @@ -345,6 +351,7 @@ RelationGetBufferForTuple(Relation relation, Size len, BlockNumber targetBlock, otherBlock; bool needLock; + bool unlockedTargetBuffer; len = MAXALIGN(len); /* be conservative */ @@ -630,6 +637,9 @@ loop: if (needLock) UnlockRelationForExtension(relation, ExclusiveLock); + unlockedTargetBuffer = false; + targetBlock = BufferGetBlockNumber(buffer); + /* * We need to initialize the empty new page. Double-check that it really * is empty (this should never happen, but if it does we don't want to @@ -639,75 +649,107 @@ loop: if (!PageIsNew(page)) elog(ERROR, "page %u of relation \"%s\" should be empty but is not", - BufferGetBlockNumber(buffer), + targetBlock, RelationGetRelationName(relation)); PageInit(page, BufferGetPageSize(buffer), 0); MarkBufferDirty(buffer); /* - * The page is empty, pin vmbuffer to set
Re: Data is copied twice when specifying both child and parent table in publication
A minor review comment for v25-0001. == src/backend/commands/subscriptioncmds.c 1. @@ -1936,21 +1936,56 @@ fetch_table_list(WalReceiverConn *wrconn, List *publications) WalRcvExecResult *res; StringInfoData cmd; TupleTableSlot *slot; - Oid tableRow[3] = {TEXTOID, TEXTOID, NAMEARRAYOID}; + Oid tableRow[3] = {TEXTOID, TEXTOID, InvalidOid}; The patch could be slightly less invasive if you did not make this change, but instead, only overwrite tableRow[2] for the >= PG16 case. Or vice versa, if you prefer. The point is, there are only 2 cases, so you might as well initialize a default tableRow[2] that is valid for one case and overwrite it only for the other case, instead of overwriting it in 2 places. YMMV. -- Kind Regards, Peter Smith. Fujitsu Australia
pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value
Hello,

Attached patch introduces a function pg_column_toast_chunk_id that returns the chunk ID of a TOASTed value. Recently, one of our clients needed a way to show which columns are actually TOASTed, because they would like to know how much updates on the original table affect its TOAST table, specifically with regard to autovacuum. We could not find a function for this purpose in current PostgreSQL, so I would like to propose pg_column_toast_chunk_id. This function returns the chunk ID of a TOASTed value, or NULL if the value is not TOASTed. Here is an example:

postgres=# \d val
               Table "public.val"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
 t      | text |           |          |

postgres=# select length(t), pg_column_size(t), pg_column_compression(t), pg_column_toast_chunk_id(t), tableoid from val;
 length | pg_column_size | pg_column_compression | pg_column_toast_chunk_id | tableoid
--------+----------------+-----------------------+--------------------------+----------
      3 |              4 |                       |                          |    16388
   3000 |             46 | pglz                  |                          |    16388
  32000 |            413 | pglz                  |                          |    16388
    305 |            309 |                       |                          |    16388
  64000 |          64000 |                       |                    16393 |    16388
(5 rows)

postgres=# select chunk_id, chunk_seq from pg_toast.pg_toast_16388;
 chunk_id | chunk_seq
----------+-----------
    16393 |         0
    16393 |         1
    16393 |         2
(snip)
    16393 |        30
    16393 |        31
    16393 |        32
(33 rows)

This function is also useful to identify a problematic row when an error like "ERROR: unexpected chunk number ... (expected ...) for toast value" occurs.

The patch is just a concept patch and does not include documentation and tests. What do you think about this feature?

Regards, Yugo Nagata -- Yugo NAGATA

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 5778e3f0ef..4ddcf885cd 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -5070,6 +5070,47 @@ pg_column_compression(PG_FUNCTION_ARGS)
 	PG_RETURN_TEXT_P(cstring_to_text(result));
 }
 
+/*
+ * Return the chunk id of the TOASTed value.
+ * Return NULL for unTOASTed value.
+ */
+Datum
+pg_column_toast_chunk_id(PG_FUNCTION_ARGS)
+{
+	int			typlen;
+	struct varlena *attr;
+	struct varatt_external toast_pointer;
+
+	/* On first call, get the input type's typlen, and save at *fn_extra */
+	if (fcinfo->flinfo->fn_extra == NULL)
+	{
+		/* Lookup the datatype of the supplied argument */
+		Oid			argtypeid = get_fn_expr_argtype(fcinfo->flinfo, 0);
+
+		typlen = get_typlen(argtypeid);
+		if (typlen == 0)		/* should not happen */
+			elog(ERROR, "cache lookup failed for type %u", argtypeid);
+
+		fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
+													  sizeof(int));
+		*((int *) fcinfo->flinfo->fn_extra) = typlen;
+	}
+	else
+		typlen = *((int *) fcinfo->flinfo->fn_extra);
+
+	if (typlen != -1)
+		PG_RETURN_NULL();
+
+	attr = (struct varlena *) DatumGetPointer(PG_GETARG_DATUM(0));
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		PG_RETURN_NULL();
+
+	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
+	PG_RETURN_OID(toast_pointer.va_valueid);
+}
+
 /*
  * string_agg - Concatenates values and returns string.
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7c358cff16..4c214f4c63 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -7414,6 +7414,9 @@
 { oid => '2121', descr => 'compression method for the compressed datum',
   proname => 'pg_column_compression', provolatile => 's', prorettype => 'text',
   proargtypes => 'any', prosrc => 'pg_column_compression' },
+{ oid => '8393', descr => 'chunk ID of TOASTed value',
+  proname => 'pg_column_toast_chunk_id', provolatile => 's', prorettype => 'oid',
+  proargtypes => 'any', prosrc => 'pg_column_toast_chunk_id' },
 { oid => '2322', descr => 'total disk space usage for the specified tablespace',
   proname => 'pg_tablespace_size', provolatile => 'v', prorettype => 'int8',
Re: allow_in_place_tablespaces vs. pg_basebackup
On Tue, Mar 28, 2023 at 05:15:25PM +1300, Thomas Munro wrote:
> The commit message explains pretty well, and it seems to work in > simple testing, and yeah commit c6f2f016 was not a work of art. > pg_basebackup --format=plain "worked", but your way is better. I > guess we should add a test of -Fp too, to keep it working? Here's one > of those.

The patch is larger than it is complicated. Particularly, in xlog.c, this is just adding one block for the PGFILETYPE_DIR case to store a relative path.

> I know it's not your patch's fault, but I wonder if this might break > something. We have this strange beast ti->rpath (commit b168c5ef in > 2014): > > +/* > + * Relpath holds the relative path of the tablespace directory > + * when it's located within PGDATA, or NULL if it's located > + * elsewhere. > + */ > > That's pretty confusing, because relative paths have been banned since > the birth of tablespaces (commit 2467394ee15, 2004):

I think that this comes down to people manipulating pg_tblspc/ by themselves post-creation with CREATE, because we don't track the location anywhere post-creation and rely on the structure of pg_tblspc/ for any future decision taken by the backend? I recall this specific case, where users were creating tablespaces in PGDATA itself to make the structure of the data folder "cleaner" for them, even if it makes no sense as the mount point is the same.. 33cb8ff added a warning about that in tablespace.c, but we could be more aggressive and outright drop this case entirely when in-place tablespaces are not involved. Tablespace maps defined in pg_basebackup -T require both the old and new locations to be absolute, so it seems shaky to me to assume that this should always be fine in the backend..

> I guess what I'm wondering here is if there is a hazard where we > confuse these outlawed tablespaces that supposedly roam the plains > somewhere in your code that is assuming that "relative implies > in-place". Or if not now, maybe in future changes.
> I wonder if these "semi-supported-but-don't-tell-anyone" relative symlinks > are worthy of a defensive test (or is it in there somewhere already?).

Yeah, it makes for surprising and sometimes undocumented behaviors, which is confusing for users and developers at the end. I'd like to think that we'd live happier long-term if the borders are clearer, aka switch to more aggressive ERRORs instead of WARNINGs in some places where the boundaries are blurry. A good thing about in-place tablespaces taken in isolation is that the borders of what you can do are well-defined, which comes down to the absolute vs relative path handling.

Looking at the patch, nothing really stands out..
-- Michael
Re: hio.c does visibilitymap_pin()/IO while holding buffer lock
Hi,

On 2023-03-25 11:17:07 -0700, Andres Freund wrote:
> I don't see how that's easily possible with the current lock ordering > rules. At least without giving up using RBM_ZERO_AND_LOCK for extending or > stuffing even more things to happen with the extension lock held, which I > don't think we want to. I don't think INSERT_FROZEN is worth that price.

I think I might have been thinking of this too narrowly. It's extremely unlikely that another backend would discover the page. And we can use visibilitymap_pin_ok() to amortize the cost to almost nothing - there's a lot of bits in an 8k block...

Here's a draft patch. The bulk relation patch I am polishing has a similar issue, except that there the problem is inserting into the FSM, instead of pinning a VM page. Hence the patch above makes the infrastructure a bit more general than required for the HEAP_INSERT_FROZEN case alone (where we currently shouldn't ever have a valid otherBuffer).

The way the parameter ordering for GetVisibilityMapPins() works makes it somewhat unwieldy - see e.g. the existing

if (otherBuffer == InvalidBuffer || targetBlock <= otherBlock)
	GetVisibilityMapPins(relation, buffer, otherBuffer,
						 targetBlock, otherBlock,
						 vmbuffer, vmbuffer_other);
else
	GetVisibilityMapPins(relation, otherBuffer, buffer,
						 otherBlock, targetBlock,
						 vmbuffer_other, vmbuffer);

which I now duplicated in yet another place. Perhaps we just ought to switch buffer1/block1 with buffer2/block2 inside GetVisibilityMapPins(), to avoid duplicating that code elsewhere?

Because we now track whether the *targetBuffer* was ever unlocked, we can be a bit more narrow about the possibility of there not being sufficient space.

The patch could be narrowed for backpatching. But as there's likely no practical problem at this point, I wouldn't want to backpatch anyway?
Greetings, Andres Freund >From 6f27018765f167a8d2a52bc7c9068cb09ca8d82c Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 24 Oct 2022 12:28:06 -0700 Subject: [PATCH v1 1/2] hio: Release extension lock before initializing page / pinning VM PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e3 started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we release the page lock. PageInit() zeroes the page, which isn't that cheap, so deferring it until after the extension lock is released seems like a good idea. Doing visibilitymap_pin() while holding the extension lock, introduced in 7db0cd2145f2, looks like an accident. Due to the restrictions on HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems better to move it out. --- src/backend/access/heap/hio.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index e152807d2dc..7479212d4e0 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -623,6 +623,13 @@ loop: */ buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate); + /* + * Release the file-extension lock; it's now OK for someone else to extend + * the relation some more. + */ + if (needLock) + UnlockRelationForExtension(relation, ExclusiveLock); + /* * We need to initialize the empty new page. Double-check that it really * is empty (this should never happen, but if it does we don't want to @@ -647,13 +654,6 @@ loop: visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer); } - /* - * Release the file-extension lock; it's now OK for someone else to extend - * the relation some more. - */ - if (needLock) - UnlockRelationForExtension(relation, ExclusiveLock); - /* * Lock the other buffer. It's guaranteed to be of a lower page number * than the new page. 
To conform with the deadlock prevent rules, we ought -- 2.38.0 >From 5aeb7f5d042eda6ba4c70d8f475400c51607fac5 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 28 Mar 2023 18:17:11 -0700 Subject: [PATCH v1 2/2] WIP: relation extension: Don't pin the VM while holding buffer lock --- src/backend/access/heap/hio.c | 115 +++--- 1 file changed, 78 insertions(+), 37 deletions(-) diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 7479212d4e0..11ef053705e 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -134,14 +134,17 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock, * buffer2 may be InvalidBuffer, if only one buffer is involved. buffer1 * must not be InvalidBuffer. If both buffers are specified, block1 must * be less than block2. + * +
RE: PGdoc: add missing ID attribute to create_subscription.sgml
Dear Amit-san,

> There is no need to break the link line. See attached.

Understood; I think that is better.

Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: [RFC] Add jit deform_counter
On Sun, 12 Jun 2022 at 21:14, Dmitry Dolgov <9erthali...@gmail.com> wrote: > I've noticed that JIT performance counter generation_counter seems to include > actions, relevant for both jit_expressions and jit_tuple_deforming options. It > means one can't directly see what is the influence of jit_tuple_deforming > alone, which would be helpful when adjusting JIT options. To make it better a > new counter can be introduced, does it make sense? I'm not so sure about this idea. As of now, if I look at EXPLAIN ANALYZE's JIT summary, the individual times add up to the total time. If we add this deform time, then that's no longer going to be true as the "Generation" time includes the newly added deform time. master: JIT: Functions: 600 Options: Inlining false, Optimization false, Expressions true, Deforming true Timing: Generation 37.758 ms, Inlining 0.000 ms, Optimization 6.736 ms, Emission 172.244 ms, Total 216.738 ms 37.758 + 6.736 + 172.244 = 216.738 I think if I was a DBA wondering why JIT was taking so long, I'd probably either be very astonished or I'd report a bug if I noticed that all the individual component JIT times didn't add up to the total time. I don't think the solution is to subtract the deform time from the generation time either. Can't users just get this by looking at EXPLAIN ANALYZE with and without jit_tuple_deforming? David
Re: Add LZ4 compression in pg_dump
On 3/28/23 00:34, gkokola...@pm.me wrote:
> ... >> Got it. In that case I agree it's fine to do that in a single commit. > For what it's worth, I think that this patch should get a +1 and get in. It > solves the empty-writes problem and includes a test for a previously untested > case.

Pushed, after updating / rewording the commit message a little bit. Thanks!

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Should vacuum process config file reload more often
On Tue, Mar 28, 2023 at 4:21 AM Kyotaro Horiguchi wrote:
> > At Mon, 27 Mar 2023 14:12:03 -0400, Melanie Plageman > wrote in > > So, I've attached an alternate version of the patchset which takes the > > approach of having one commit which only enables cost-based delay GUC > > refresh for VACUUM and another commit which enables it for autovacuum > > and makes the changes to balancing variables. > > > > I still think the commit which has workers updating their own > > wi_cost_delay in vacuum_delay_point() is a bit weird. It relies on no one > > else emulating our bad behavior and reading from wi_cost_delay without a > > lock and on no one else deciding to ever write to wi_cost_delay (even > > though it is in shared memory [this is the same as master]). It is only > > safe because our process is the only one (right now) writing to > > wi_cost_delay, so when we read from it without a lock, we know it isn't > > being written to. And everyone else takes a lock when reading from > > wi_cost_delay right now. So, it seems...not great. > > > > This approach also introduces a function that is only around for > > one commit until the next commit obsoletes it, which seems a bit silly. > > (I'm not sure what this refers to, though..) I don't think it's silly > if a later patch removes something that the preceding patches > introduced, as long as that contributes to readability. Ultimately, > they will be merged together on committing.

I was under the impression that reviewers thought config reload and worker balance changes should be committed in separate commits.

Either way, the ephemeral function is not my primary concern. I felt more uncomfortable with increasing how often we update a double in shared memory which is read without acquiring a lock.

> > Basically, I think it is probably better to just have one commit > > enabling guc refresh for VACUUM and then another which correctly > > implements what is needed for autovacuum to do the same. > > Attached v9 does this.
> > > > I've provided both complete versions of both approaches (v9 and v8). > > I took a look at v9 and have a few comments. > > 0001: > > I don't believe it is necessary, as mentioned in the commit > message. It appears that we are resetting it at the appropriate times.

VacuumCostBalance must be zeroed out when we disable vacuum cost. Previously, we did not reenable VacuumCostActive once it was disabled, but now that we do, I think it is good practice to always zero out VacuumCostBalance when we disable vacuum cost. I will squash this commit into the one introducing VacuumCostInactive, though.

> 0002: > > I felt a bit uneasy on this. It seems somewhat complex (and makes the > succeeding patches complex),

Even if we introduced a second global variable to indicate that failsafe mode has been engaged, we would still require the additional checks of VacuumCostInactive.

> has confusing names,

I would be happy to rename the values of the enum to make them less confusing. Are you thinking "force" instead of "locked"? maybe: VACUUM_COST_FORCE_INACTIVE and VACUUM_COST_INACTIVE ?

> and doesn't seem self-contained.

By changing the variable from VacuumCostActive to VacuumCostInactive, I have kept all non-vacuum code from having to distinguish between it being inactive due to failsafe mode or due to user settings.

> I think it'd be simpler to add a global boolean (maybe > VacuumCostActiveForceDisable or such) that forces VacuumCostActive to > be false and set VacuumCostActive using a setter function that follows > the boolean.

I think maintaining an additional global variable is more brittle than including the information in a single variable.

> 0003: > > +* Reload the configuration file if requested. This allows changes to > +* vacuum_cost_limit and vacuum_cost_delay to take effect while a > table is > +* being vacuumed or analyzed.
Analyze should not reload configuration > +* file if it is in an outer transaction, as GUC values shouldn't be > +* allowed to refer to some uncommitted state (e.g. database objects > +* created in this transaction). > > I'm not sure GUC reload is or should be related to transactions. For > instance, work_mem can be changed by a reload during a transaction > unless it has been set in the current transaction. I don't think we > need to deliberately suppress changes in variables caused by reloads > during transactions only for analyze. If analyze doesn't like changes > to certain GUC variables, their values should be snapshotted before > starting the process.

Currently, we only reload the config file in top-level statements. We don't reload the configuration file from within a nested transaction command. BEGIN starts a transaction but not a transaction command. So BEGIN; ANALYZE; probably wouldn't violate this rule. But it is simpler to just forbid reloading when it is not a top-level transaction command. I have updated the comment to reflect this.

> 0004: -
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On Tue, Mar 28, 2023 at 02:56:28PM +0900, Michael Paquier wrote:
> Hmm. This is a good point. It is true that the patch feels > incomplete on this side. I don't see why we could not be flexible, > and allow a value of 0 in a partitioned table's relam to mean that we > would pick up the database default in this case when a partition is > created on it. This would have the advantage to be consistent with > older versions where we fallback on the default. We cannot be > completely consistent with the reltablespace of the leaf partitions > unfortunately, as relam should always be set if a relation has > storage. And allowing a value of 0 means that there are likely other > tricky cases with dumps? > > Another thing: would it make sense to allow an empty string in > default_table_access_method so as we'd always fallback to a database > default?

FYI, I am not sure that I will be able to look more at this patch by the end of the commit fest, and there are quite a few more points to consider. Perhaps at this stage we'd better mark it as returned with feedback? I understand that I've arrived late at this party :/
-- Michael
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
On Tue, Mar 28, 2023 at 05:43:26PM +0900, Kyotaro Horiguchi wrote:
> No. Fine by me, except that "block read requests" seems to suggest > kernel read() calls, maybe because it's not clear whether "block" > refers to our buffer blocks or file blocks to me.. If it is generally > clear, I'm fine with the proposal.

Okay. Would somebody like to draft a patch?
-- Michael
Re: Why mark empty pages all visible?
On Tue, Mar 28, 2023 at 3:29 PM Andres Freund wrote: > Why is that? It's as clear a signal of concurrent activity on the buffer > you're going to get. Not immediately getting a cleanup lock in VACUUM is informative to the extent that you only care about what is happening that very nanosecond. If you look at which pages it happens to in detail, what you seem to end up with is a whole bunch of noise, which (on its own) tells you exactly nothing about what VACUUM really ought to be doing with those pages. In almost all cases we could get a cleanup lock by waiting for one millisecond and retrying. I suspect that the cleanup lock thing might be a noisy, unreliable proxy for the condition that you actually care about, in the context of your work on relation extension. I bet there is a better signal to go on, if you look for one. > It's interesting to understand *why* we are doing what we are. I think it'd > make sense to propose changing how things work around this, but I just don't > feel like I have a good enough grasp for why we do some of the things we > do. And given there's not a lot of comments around it and some of the comments > that do exist are inconsistent with themselves, looking at the history seems > like the next best thing? I think that I know why Heikki had no comments about PageIsEmpty() pages when the VM first went in, back in 2009: because it was just so obvious that you'd treat them the same as any other initialized page, it didn't seem to warrant a comment at all. The difference between a page with 0 tuples and 1 tuple is the same difference between a page with 1 tuple and a page with 2 tuples. A tiny difference (one extra tuple), of no particular consequence. I think that you don't see it that way now because you're focussed on the hio.c view of things. That code looked very different back in 2009, and in any case is very far removed from vacuumlazy.c. I can tell you what I was thinking of with lazy_scan_new_or_empty: I hate special cases. 
I will admit to being a zealot about it.

> > I gather that you *don't* want to do anything about the > > lazy_vacuum_heap_page behavior with setting empty pages all-visible (just > > the lazy_scan_new_or_empty behavior). > > Not in the short-medium term, at least. In the long term I do suspect we might > want to do something about it. We have a *crapton* of contention in the FSM > and caused by the FSM in bulk workloads. With my relation extension patch > disabling the FSM nearly doubles concurrent load speed.

I've seen the same effect myself. There is no question that that's a big problem. I think that the problem is that the system doesn't have any firm understanding of pages as things that are owned by particular transactions and/or backends, at least to a limited, scoped extent. It's all really low level, when it actually needs to be high level and take lots of context that comes from the application into account.

> At the same time, the fact that we might lose knowledge about all the > existing free space in case of promotion or crash and never rediscover that > space (because the pages are frozen), seems decidedly not great.

Unquestionably.

> I don't know what the path forward is, but it seems somewhat clear that we > ought to do something. I suspect having a not crash-safe FSM isn't really > acceptable anymore - it probably is fine to not persist a *reduction* in free > space, but we can't permanently lose track of free space, no matter how many > crashes.

Strongly agreed. It's a terrible false economy. If we bit the bullet and made relation extension and the FSM crash safe, we'd gain so much more than we'd lose.

> One thing I am fairly certain about is that using the FSM to tell other > backends about newly bulk extended pages is not a good solution, even though > we're stuck with it for the moment.

Strongly agreed.
> > I think that you must be arguing for making the early lifetime of a > > heap page special to VACUUM, since AFAICT you want to change VACUUM's > > behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not* > > with pages that have one remaining LP_UNUSED item, but are otherwise > > empty (which could be set all-visible/all-frozen in either the first > > or second heap pass, even if we disabled the lazy_scan_new_or_empty() > > behavior you're complaining about). You seem to want to distinguish > > between very new pages (that also happen to be empty), and old pages > > that happen to be empty. Right? > > I think that might be worthwhile, yes. The retry code in > RelationGetBufferForTuple() is quite hairy and almost impossible to test. If > we can avoid the complexity, at a fairly bound cost (vacuum needing to > re-visit new/empty pages if they're currently pinned), it'd imo be more than > worth the price.

Short term, you could explicitly say that PageIsEmpty() means that the page is qualitatively different to other empty pages that were left behind by VACUUM's second phase.

> But perhaps the better
Re: Request for comment on setting binary format output per session
On Tue, 2023-03-28 at 10:22 -0400, Dave Cramer wrote: > OK, IIUC what you are proposing here is that there would be a > separate pool for > database, user, and OIDs. This doesn't seem too flexible. For > instance if I create a UDT and then want it to be returned > as binary then I have to reconfigure the pool to be able to accept a > new list of OID's. There are two ways that I could imagine the connection pool working: 1. Accept whatever clients connect, and pass along the binary_formats setting to the outbound (server) connection. The downside here is that if you have many different clients (or different versions) that have different binary_formats settings, then it creates too many pools and doesn't share well enough. 2. Some kind of configuration setting (or maybe it can be done automatically) that organizes based on a common subset of binary formats that many clients can understand. These can evolve once the protocol extension is in place. Regards, Jeff Davis
Re: running logical replication as the subscription owner
On Tue, 2023-03-28 at 13:55 -0400, Robert Haas wrote: > Oh, interesting. I hadn't realized we were doing that, and I do think > that narrows the vulnerability quite a bit. It's good to know I wasn't missing something obvious. > Absent logical replication, you don't really need to worry > about whether that function is written securely -- it will run with > privileges of the person performing the DML, and not impact your > account at all. That's not strictly true. See the example at the bottom of this email. The second trigger is SECURITY INVOKER, and it captures the leaked secret number that was stored in the tuple by an earlier SECURITY DEFINER trigger. Perhaps I'm being pedantic, but my point is that SECURITY INVOKER is not a magical shield that protects users from themselves without bound. It protects users against some kinds of attacks, sure, but there's a limit to what it has to offer. > They have reason to be scared of you, but not the > reverse. However, if the people doing DML on the table can arrange to > perform that DML as you, then any security vulnerabilities in your > function potentially allow them to compromise your account. OK, but I'd like to be clear that you've moved from your prior statement: "in theory they might be able to protect themselves, but in practice they are unlikely to be careful enough" To something very different, where it seems that we can't think of a concrete problem even for not-so-careful users. The dangerous cases seem to be something along the lines of a security-invoker trigger function that builds and executes arbitrary SQL based on the row contents. And then the table owner would then still need to set ENABLE ALWAYS TRIGGER. Do we really want to take that case on as our security responsibility?
> It would also affect someone who uses a default > expression or other means of associating executable code with the > table, and something like a default expression doesn't require any > explicit setting to make it apply in case of replication, so maybe > the > risk of someone messing up is a bit higher. Perhaps, but I still don't understand that case. Unless I'm missing something, the table owner would have to write a pretty weird default expression or check constraint or whatever to end up with something dangerous. > But this definitely makes it more of a judgment call than I thought > initially. And I'm fine if the judgement is that we just require SET ROLE to be conservative and make sure we don't over-promise in 16. That's a totally reasonable thing: it's easier to loosen restrictions later than to tighten them. Furthermore, just because you and I can't think of exploitable problems doesn't mean they don't exist. I have found this discussion enlightening, so thank you for going through these details with me. > I'm still inclined to leave the patch checking for SET ROLE +0.5. I want the patch to go in either way, and can carry on the discussion later and improve things more for 17. But I want to say generally that I believe we spend too much effort trying to protect unreasonable users from themselves, which interferes with our ability to protect reasonable users and solve real use cases. > However, there might be an argument > that we ought to do something else, like have a REPLICATE privilege Sounds cool. Not sure I have an opinion yet, but probably 17 material. > What I would like to change in the > patch in this release is to add a new subscription property along the > lines of what I proposed to Jelte in my earlier email: let's provide > an escape hatch that turns off the user-switching behavior. I believe switching to the table owner (as done in your patch) is the right best practice, and should be the default. 
I'm fine with an escape hatch here to ease migrations or whatever. But please do it in an unobtrusive way such that we (as hackers) can mostly forget that the non-switching behavior exists. At least for me, our system is already pretty complex without two kinds of subscription security models. Regards, Jeff Davis

-- example follows --

-- user foo
CREATE TABLE secret(I INT);
INSERT INTO secret VALUES(42);
CREATE TABLE r(i INT, secret INT);

CREATE OR REPLACE FUNCTION a_func() RETURNS TRIGGER
SECURITY DEFINER LANGUAGE plpgsql AS $$
BEGIN
  SELECT i INTO NEW.secret FROM secret;
  RETURN NEW;
END;
$$;

CREATE OR REPLACE FUNCTION z_func() RETURNS TRIGGER
SECURITY INVOKER LANGUAGE plpgsql AS $$
BEGIN
  IF NEW.secret <> abs(NEW.secret) THEN
    RAISE EXCEPTION 'no negative secret numbers allowed';
  END IF;
  RETURN NEW;
END;
$$;

CREATE TRIGGER a_trigger BEFORE INSERT ON r
  FOR EACH ROW EXECUTE PROCEDURE a_func();
CREATE TRIGGER z_trigger BEFORE INSERT ON r
  FOR EACH ROW EXECUTE PROCEDURE z_func();

GRANT INSERT ON r TO bar;
GRANT USAGE ON SCHEMA foo TO bar;

-- user bar
CREATE OR REPLACE FUNCTION bar.abs(i INT4) RETURNS INT4
Re: Add pg_walinspect function with block info columns
On Tue, Mar 28, 2023 at 11:15:17AM -0700, Peter Geoghegan wrote: > On Mon, Mar 27, 2023 at 7:47 PM Bharath Rupireddy > wrote: >> Hm, agreed. Changed in the attached v7-0002 patch. We can as well >> write a case statement in the create function SQL to output forkname >> instead of forknumber, but I'd stop doing that to keep in sync with >> pg_buffercache. > > I just don't see much value in any textual representation of fork > name, however generated. In practice it's just not adding very much > useful information. It is mostly useful as a way of filtering block > references, which makes simple integers more natural. I disagree with this argument. Personally, I have a *much* better experience with textual representation because there is no need to cross-check the internals of the code in case you don't remember what a given number means in an enum or in a set of bits, especially if you're in a hurry while looking at a production or customer deployment. In short, it makes for fewer mistakes because you don't have to think about some extra mapping between some integers and what they actually mean through text. The clauses you'd apply for a group by on the forks, or for a filter with IN clauses don't change, they're just made easier to understand for the common user, and that includes experienced people. We'd better think about that like the text[] arrays we use for the flag values, like the FPI flags, or why we've introduced text[] for the HEAP_* flags in the heap functions of pageinspect. There's even more consistency with pageinspect in using a fork name, where we can pass down a fork name to get a raw page. -- Michael
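[Editor's note: to make the CASE mapping Bharath mentioned concrete, here is a hypothetical sketch. The number-to-name mapping follows PostgreSQL's ForkNumber enum and forkNames[] array (0 = main, 1 = fsm, 2 = vm, 3 = init); the function name and column names are assumed from this patch series, and the LSN arguments are placeholders, not a committed definition.]

```sql
-- Hypothetical sketch: translate internal fork numbers into the text
-- fork names used elsewhere in the system (per the ForkNumber enum:
-- 0 = main, 1 = fsm, 2 = vm, 3 = init). Function and column names
-- are assumptions for illustration.
SELECT CASE b.forknum
           WHEN 0 THEN 'main'
           WHEN 1 THEN 'fsm'
           WHEN 2 THEN 'vm'
           WHEN 3 THEN 'init'
       END AS forkname,
       b.*
FROM pg_get_wal_block_info('0/1000000', '0/2000000') AS b;  -- LSN bounds are placeholders
```

Emitting the name directly, as Michael argues, saves every reader from re-deriving this mapping by hand.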
Re: zstd compression for pg_dump
On Wed, Mar 15, 2023 at 9:50 PM Justin Pryzby wrote: > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > > It looks like pg_dump's meson.build is missing dependencies on zstd > > (meson couldn't find the headers in the subproject without them). > > I saw that this was added for LZ4, but I hadn't added it for zstd since > I didn't run into an issue without it. Could you check that what I've > added works for your case? I thought I replied to this, sorry -- your newest patchset builds correctly with subprojects, so the new dependency looks good to me. Thanks! > > Hm. Best I can tell, the CloneArchive() machinery is supposed to be > > handling safety for this, by isolating each thread's state. I don't feel > > comfortable pronouncing this new addition safe or not, because I'm not > > sure I understand what the comments in the format-specific _Clone() > > callbacks are saying yet. > > My line of reasoning for unix is that pg_dump forks before any calls to > zstd. Nothing zstd does ought to affect the pg_dump layer. But that > doesn't apply to pg_dump under windows. This is an open question. If > there's no solid answer, I could disable/ignore the option (maybe only > under windows). To (maybe?) move this forward a bit, note that pg_backup_custom's _Clone() function makes sure that there is no active compressor state at the beginning of the new thread. pg_backup_directory's implementation has no such provision. And I don't think it can, because the parent thread might have concurrently set one up -- see the directory-specific implementation of _CloseArchive(). Perhaps we should just NULL out those fields after the copy, instead? To illustrate why I think this is tough to characterize: if I've read the code correctly, the _Clone() and CloneArchive() implementations are running concurrently with code that is actively modifying the ArchiveHandle and the lclContext.
So safety is only ensured to the extent that we keep track of which fields threads are allowed to touch, and I don't have that mental model. --Jacob
Re: Why mark empty pages all visible?
Hi, On 2023-03-28 13:05:19 -0700, Peter Geoghegan wrote: > On Tue, Mar 28, 2023 at 12:01 PM Andres Freund wrote: > > I will probably make that argument - so far I was just trying to understand > > the intent of the current code. There aren't really comments explaining why > > we > > want to mark currently-pinned empty/new pages all-visible and enter them > > into > > the FSM. > > I don't think that not being able to immediately get a cleanup lock on > a page signifies much of anything. I never have. Why is that? It's as clear a signal of concurrent activity on the buffer you're going to get. > > Historically we did *not* enter currently pinned empty/new pages into the > > FSM > > / VM. Afaics that's new as of 44fa84881fff. > > Of course that's true, but I don't know why that history is > particularly important. It's interesting to understand *why* we are doing what we are. I think it'd make sense to propose changing how things work around this, but I just don't feel like I have a good enough grasp for why we do some of the things we do. And given there's not a lot of comments around it and some of the comments that do exist are inconsistent with themselves, looking at the history seems like the next best thing? > I actually think that I might agree with the substance of much of what > you're saying, but at the same time I don't think that you're defining > the problem in a way that's particularly helpful. Likely because the goals of the existing code aren't clear to me. So I don't feel like I have a firm grasp... > I gather that you *don't* want to do anything about the > lazy_vacuum_heap_page behavior with setting empty pages all-visible (just > the lazy_scan_new_or_empty behavior). Not in the short-medium term, at least. In the long term I do suspect we might want to do something about it. We have a *crapton* of contention in the FSM and caused by the FSM in bulk workloads. With my relation extension patch disabling the FSM nearly doubles concurrent load speed. 
At the same time, the fact that we might lose knowledge about all the existing free space in case of promotion or crash and never rediscover that space (because the pages are frozen), seems decidedly not great. I don't know what the path forward is, but it seems somewhat clear that we ought to do something. I suspect having a not crash-safe FSM isn't really acceptable anymore - it probably is fine to not persist a *reduction* in free space, but we can't permanently lose track of free space, no matter how many crashes. I know that you strongly dislike the way the FSM works, although I forgot some of the details. One thing I am fairly certain about is that using the FSM to tell other backends about newly bulk extended pages is not a good solution, even though we're stuck with it for the moment. > I actually agree that VACUUM is way too unconcerned about interfering > with concurrent activity in terms of how it manages free space in the > FSM. But this seems like just about the least important example of > that (outside the context of your RelationGetBufferForTuple work). The > really important case (that VACUUM gets wrong) all involve recently > dead tuples. But I don't think that you want to talk about that right > now. You want to talk about RelationGetBufferForTuple. That's indeed the background. By now I'd also like to add a few comments explaining why we do what we currently do, because I don't find all of it obvious. > > The reason I'm looking at this is that there's a lot of complexity at the > > bottom of RelationGetBufferForTuple(), related to needing to release the > > lock > > on the newly extended page and then needing to recheck whether there still > > is > > free space on the page. And that it's not complicated enough > > (c.f. INSERT_FROZEN doing visibilitymap_pin() with the page locked).
> > > > As far as I can tell, if we went back to not entering new/empty pages into > > either VM or FSM, we could rely on the page not getting filled while just > > pinning, not locking it. > > What you're essentially arguing for is inventing a new rule that makes > the early lifetime of a page (what we currently call a PageIsEmpty() > page, and new pages) special, to avoid interference from VACUUM. I > have made similar arguments myself on quite a few occasions, so I'm > actually sympathetic. I just think that you should own it. And no, I'm > not just reflexively defending my work in 44fa84881fff; I actually > think that framing the problem as a case of restoring a previous > behavior is confusing and ahistorical. If there was a useful behavior > that was lost, then it was quite an accidental behavior all along. The > difference matters because now you have to reconcile what you're > saying with the lazy_vacuum_heap_page no-cleanup-lock behavior added > in 14. I really don't have a position to own yet, not on firm enough ground. > I think that you must be arguing for making the early lifetime of a > heap page special
Re: Add LZ4 compression in pg_dump
On 3/28/23 18:07, gkokola...@pm.me wrote: > > --- Original Message --- > On Friday, March 24th, 2023 at 10:30 AM, gkokola...@pm.me > wrote: > >> >> --- Original Message --- >> On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra >> tomas.von...@enterprisedb.com wrote: >> >>> This leaves the empty-data issue (which we have a fix for) and the >>> switch to LZ4F. And then the zstd part. >> >> Please expect promptly a patch for the switch to frames. > > Please find the expected patch attached. Note that the bulk of the > patch is code unification, variable renaming to something more > appropriate, and comment addition. These are changes that are not > strictly necessary to switch to LZ4F. I do believe they are > essential for code hygiene after the switch and they do belong > on the same commit. > I think the patch is fine, but I'm wondering if the renames shouldn't go a bit further. It removes references to the LZ4File struct, but there's a bunch of functions with an LZ4File_ prefix. Why not simply use an LZ4_ prefix? We don't have GzipFile either. Sure, it might be a bit confusing because lz4.h uses an LZ4_ prefix, but then we probably should not define LZ4_compressor_init ... Also, maybe the comments shouldn't use "File API" when compress_io.c calls that the "Compressed stream API". regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Support logical replication of DDLs
On Sun, Mar 26, 2023 at 5:22 PM Tom Lane wrote: > I spent some time looking through this thread to try to get a sense > of the state of things, and I came away quite depressed. The patchset > has ballooned to over 2MB, which is a couple orders of magnitude > larger than anyone could hope to meaningfully review from scratch. > Despite that, it seems that there are fundamental semantics issues > remaining, not to mention clear-and-present security dangers, not > to mention TODO comments all over the code. Thanks for looking into this! > I'm also less than sold on the technical details, specifically > the notion of "let's translate utility parse trees into JSON and > send that down the wire". You can probably make that work for now, > but I wonder if it will be any more robust against cross-version > changes than just shipping the outfuncs.c representation. (Perhaps > it can be made more robust than the raw parse trees, but I see no > evidence that anyone's thought much about how.) I explored the idea of using the outfuncs.c representation in [1] and found this existing format is not designed to be portable between different major versions. So it can't be directly used for replication without serious modification. I think the DDL deparser is a necessary tool if we want to be able to handle cross-version DDL syntax differences by providing the capability to machine-edit the JSON representation. > And TBH, I don't think that I quite believe the premise in the > first place. The whole point of using logical rather than physical > replication is that the subscriber installation(s) aren't exactly like > the publisher. Given that, how can we expect that automated DDL > replication is going to do the right thing often enough to be a useful > tool rather than a disastrous foot-gun? 
The more you expand the scope > of what gets replicated, the worse that problem becomes --- for > example, I don't buy for one second that "let's replicate roles" > is a credible solution for the problems that come from the roles > not being the same on publisher and subscriber. I agree that a full-fledged DDL deparser and DDL replication is too big of a task for one patch. I think we may consider approaching this feature in the following ways: 1. Phased development and testing as discussed in other emails. Probably support table commands first (as they are the most common DDLs), then the other commands in multiple phases. 2. Provide a subscription option to receive the DDL change, raise a notice, and skip applying the change. The users can listen to the DDL notice and implement application logic to apply the change if needed. The idea is we can start gathering user feedback by providing a somewhat useful feature (compared to doing nothing about DDLs), but also avoid heading straight into the potential footgun situation caused by automatically applying any malformed DDL. 3. As cross-version DDL syntax differences are expected to be uncommon (in real workloads), maybe we can think about other options to handle such edge cases instead of fully automating it? For example, what about letting the user specify how a DDL should be replicated on the subscriber by explicitly providing two versions of DDL commands in some way? > I'm not sure how we get from here to a committable and useful feature, > but I don't think we're close to that yet, and I'm not sure that minor > iterations on a 2MB patchset will accomplish much. About 1 MB of the patch are testing output files for the DDL deparser (postgres/src/test/modules/test_ddl_deparse_regress/expected/). Regards, Zane [1] https://www.postgresql.org/message-id/CAAD30U%2Boi6e6Vh_zAzhuXzkqUhagmLGD%2B_iyn2N9w_sNRKsoag%40mail.gmail.com
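[Editor's note: to make option 2 above concrete, the user-facing shape could look roughly like this. The syntax and option names here are entirely hypothetical, sketched only to illustrate the idea; they are not part of the patch set.]

```sql
-- Hypothetical syntax sketch (option names invented for illustration):
-- publish DDL for tables, but let the subscriber receive each DDL change
-- as a NOTICE and skip applying it automatically.
CREATE PUBLICATION pub_ddl FOR ALL TABLES WITH (ddl = 'table');
CREATE SUBSCRIPTION sub_ddl
    CONNECTION 'dbname=src host=publisher'
    PUBLICATION pub_ddl
    WITH (apply_ddl = off);
```

An application could then listen for those notices and decide per statement whether to apply, edit, or ignore the change.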
Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.
On Mon, Mar 27, 2023 at 4:52 PM Peter Geoghegan wrote: > This is fine, as far as it goes. Obviously it fixes the immediate problem. OK, I've committed and back-patched this fix to v14, just like the erroneous commit that created the issue. -- Robert Haas EDB: http://www.enterprisedb.com
Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions
On Tue, Mar 28, 2023 at 3:25 PM Isaac Morland wrote: > On Mon, 19 Dec 2022 at 17:57, Corey Huinker > wrote: > >> >> Attached is my work in progress to implement the changes to the CAST() >> function as proposed by Vik Fearing. >> >> CAST(expr AS typename NULL ON ERROR) >> will use error-safe functions to do the cast of expr, and will return >> NULL if the cast fails. >> >> CAST(expr AS typename DEFAULT expr2 ON ERROR) >> will use error-safe functions to do the cast of expr, and will return >> expr2 if the cast fails. >> > > Is there any difference between NULL and DEFAULT NULL? > What I think you're asking is: is there a difference between these two statements: SELECT CAST(my_string AS integer NULL ON ERROR) FROM my_table; SELECT CAST(my_string AS integer DEFAULT NULL ON ERROR) FROM my_table; And as I understand it, the answer would be no, there is no practical difference. The first case is just a convenient shorthand, whereas the second case tees you up for a potentially complex expression. Before you ask, I believe the ON ERROR syntax could be made optional. As I implemented it, both cases create a default expression which is then typecast to integer, and in both cases that expression would be a const-null, so the optimizer steps would very quickly collapse those steps into a plain old constant.
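[Editor's note: the intended semantics can be approximated today with the error-safe input check pg_input_is_valid() added in the PostgreSQL 16 cycle. This is a rough sketch of the behavior only, not the patch's implementation, which performs the error-safe cast directly rather than validating and casting twice.]

```sql
-- Approximates CAST(my_string AS integer DEFAULT -1 ON ERROR):
-- if my_string is not valid input for integer, fall back to the
-- DEFAULT expression instead of raising an error.
SELECT CASE WHEN pg_input_is_valid(my_string, 'integer')
            THEN my_string::integer
            ELSE -1   -- the DEFAULT expression
       END
FROM my_table;
```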
Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions
On Tue, Mar 28, 2023 at 2:53 PM Gregory Stark (as CFM) wrote: > On Tue, 3 Jan 2023 at 14:16, Tom Lane wrote: > > > > Vik Fearing writes: > > > > > I don't think we should add that syntax until I do get it through the > > > committee, just in case they change something. > > > > Agreed. So this is something we won't be able to put into v16; > > it'll have to wait till there's something solid from the committee. > > I guess I'll mark this Rejected in the CF then. Who knows when the SQL > committee will look at this... > > -- > Gregory Stark > As Commitfest Manager > Yes, for now. I'm in touch with the pg-people on the committee and will resume work when there's something to act upon.
Re: Why mark empty pages all visible?
On Tue, Mar 28, 2023 at 12:01 PM Andres Freund wrote: > I will probably make that argument - so far I was just trying to understand > the intent of the current code. There aren't really comments explaining why we > want to mark currently-pinned empty/new pages all-visible and enter them into > the FSM. I don't think that not being able to immediately get a cleanup lock on a page signifies much of anything. I never have. > Historically we did *not* enter currently pinned empty/new pages into the FSM > / VM. Afaics that's new as of 44fa84881fff. Of course that's true, but I don't know why that history is particularly important. Either way, holding a pin was never supposed to work as an interlock against a page being concurrently set all-visible, or having its space recorded in the FSM. It's true that my work on VACUUM has shaken out a couple of bugs where we accidentally relied on that being true. But those were all due to the change in lazy_vacuum_heap_page() made in Postgres 14 -- not the addition of lazy_scan_new_or_empty() in Postgres 15. I actually think that I might agree with the substance of much of what you're saying, but at the same time I don't think that you're defining the problem in a way that's particularly helpful. I gather that you *don't* want to do anything about the lazy_vacuum_heap_page behavior with setting empty pages all-visible (just the lazy_scan_new_or_empty behavior). So clearly this isn't really about marking empty pages all-visible, with or without a cleanup lock. It's actually about something rather more specific: the interactions with RelationGetBufferForTuple. I actually agree that VACUUM is way too unconcerned about interfering with concurrent activity in terms of how it manages free space in the FSM. But this seems like just about the least important example of that (outside the context of your RelationGetBufferForTuple work). The really important cases (that VACUUM gets wrong) all involve recently dead tuples.
But I don't think that you want to talk about that right now. You want to talk about RelationGetBufferForTuple. > The reason I'm looking at this is that there's a lot of complexity at the > bottom of RelationGetBufferForTuple(), related to needing to release the lock > on the newly extended page and then needing to recheck whether there still is > free space on the page. And that it's not complicated enough > (c.f. INSERT_FROZEN doing visibilitymap_pin() with the page locked). > > As far as I can tell, if we went back to not entering new/empty pages into > either VM or FSM, we could rely on the page not getting filled while just > pinning, not locking it. What you're essentially arguing for is inventing a new rule that makes the early lifetime of a page (what we currently call a PageIsEmpty() page, and new pages) special, to avoid interference from VACUUM. I have made similar arguments myself on quite a few occasions, so I'm actually sympathetic. I just think that you should own it. And no, I'm not just reflexively defending my work in 44fa84881fff; I actually think that framing the problem as a case of restoring a previous behavior is confusing and ahistorical. If there was a useful behavior that was lost, then it was quite an accidental behavior all along. The difference matters because now you have to reconcile what you're saying with the lazy_vacuum_heap_page no-cleanup-lock behavior added in 14. I think that you must be arguing for making the early lifetime of a heap page special to VACUUM, since AFAICT you want to change VACUUM's behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not* with pages that have one remaining LP_UNUSED item, but are otherwise empty (which could be set all-visible/all-frozen in either the first or second heap pass, even if we disabled the lazy_scan_new_or_empty() behavior you're complaining about). 
You seem to want to distinguish between very new pages (that also happen to be empty), and old pages that happen to be empty. Right? > I also do wonder whether the different behaviour of PageIsEmpty() and > PageIsNew() pages makes sense. The justification for not marking PageIsNew() > pages as all-visible holds just as true for empty pages. There's just as much > free space there. What you say here makes a lot of sense to me. I'm just not sure what I'd prefer to do about it. -- Peter Geoghegan
Re: Some revises in adding sorting path
On Tue, 21 Feb 2023 at 22:02, Richard Guo wrote: > Looking at the codes now I have some concern that what we do in > create_ordered_paths for partial paths may have already been done in > generate_useful_gather_paths, especially when query_pathkeys is equal to > sort_pathkeys. Not sure if this is a problem. And the comment there > mentions generate_gather_paths(), but ISTM we should mention what > generate_useful_gather_paths has done. I think you need to write some tests for this. I've managed to come up with something to get that code to perform a Sort, but I've not managed to get it to perform an incremental sort.

create or replace function parallel_safe_volatile(a int) returns int
as $$ begin return a; end; $$
parallel safe volatile language plpgsql;

create table abc(a int, b int, c int);
insert into abc select x,y,z
from generate_Series(1,100)x, generate_Series(1,100)y, generate_Series(1,100)z;

set parallel_tuple_cost=0;

Without making those parallel paths, we get:

postgres=# explain select * from abc where a=1 order by a,b,parallel_safe_volatile(c);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=13391.49..13417.49 rows=10400 width=16)
   Sort Key: b, (parallel_safe_volatile(c))
   ->  Gather  (cost=1000.00..12697.58 rows=10400 width=16)
         Workers Planned: 2
         ->  Parallel Seq Scan on abc  (cost=0.00..11697.58 rows=4333 width=16)
               Filter: (a = 1)
(6 rows)

but with, the plan is:

postgres=# explain select * from abc where a=1 order by a,b,parallel_safe_volatile(c);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather Merge  (cost=12959.35..13060.52 rows=8666 width=16)
   Workers Planned: 2
   ->  Sort  (cost=11959.32..11970.15 rows=4333 width=16)
         Sort Key: b, (parallel_safe_volatile(c))
         ->  Parallel Seq Scan on abc  (cost=0.00..11697.58 rows=4333 width=16)
               Filter: (a = 1)
(6 rows)

I added the parallel safe and volatile function so that get_useful_pathkeys_for_relation() wouldn't include all of the query_pathkeys.
If you write some tests for this code, it will be useful to prove that it actually does something, and also that it does not break again in the future. I don't really want to just blindly copy the pattern used in 3c6fc5820 for creating incremental sort paths if it's not useful here. It would be good to see tests that make an Incremental Sort path using the code you're changing. Same for the 0003 patch. I'll mark this as waiting on author while you work on that. David
Re: POC: Better infrastructure for automated testing of concurrency issues
On Tue, Mar 28, 2023 at 9:44 PM Gregory Stark (as CFM) wrote: > On Wed, 31 Aug 2022 at 20:52, Alexander Korotkov wrote: > > > > I'll continue work on this patch. The rebased patch is attached. It > > implements stop events as configure option (not runtime GUC option). > > It looks like this patch isn't going to be ready this commitfest. And > it hasn't received much discussion since August 2022. If I'm wrong say > something but otherwise I'll mark it Returned With Feedback. It can be > resurrected (and moved to the next commitfest) when you're free to > work on it again. I'm good with that. -- Regards, Alexander Korotkov
Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
Hi,

Patch v15-0001-ALTER-TABLE-MERGE-PARTITIONS-command.patch

Applies nicely. One warning on meson compile (configure -Dssl=openssl -Dldap=enabled -Dauto_features=enabled -DPG_TEST_EXTRA='ssl,ldap,kerberos' -Dbsd_auth=disabled -Dbonjour=disabled -Dpam=disabled -Dpltcl=disabled -Dsystemd=disabled -Dzstd=disabled -Db_coverage=true):

../../src/pgmergesplit/src/test/modules/test_ddl_deparse/test_ddl_deparse.c: In function ‘get_altertable_subcmdinfo’:
../../src/pgmergesplit/src/test/modules/test_ddl_deparse/test_ddl_deparse.c:112:17: warning: enumeration value ‘AT_MergePartitions’ not handled in switch [-Wswitch]
  112 |         switch (subcmd->subtype)
      |         ^~

It should be the same with 0002... meson test is perfect, and patch coverage is very good.

Patch v15-0002-ALTER-TABLE-SPLIT-PARTITION-command.patch

Doesn't apply on 326a33a289c7ba2dbf45f17e610b7be98dc11f67.

Patch v15-0003-Documentation-for-ALTER-TABLE-SPLIT-PARTITION-ME.patch

Applies with one warning: 1 line adds whitespace errors (translated from the French "warning: 1 ligne a ajouté des erreurs d'espace").

v15-0003-Documentation-for-ALTER-TABLE-SPLIT-PARTITION-ME.patch:54: trailing whitespace.
One of the new partitions partition_name1,

The comments are OK for me, as a non-native English speaker. Perhaps you could add some remarks in ddl.html and alter-ddl.html.

Stéphane

The new status of this patch is: Waiting on Author
Re: Schema variables - new implementation for Postgres 15
Hi

On Sun, Mar 26, 2023 at 19:53, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Sun, Mar 26, 2023 at 07:32:05PM +0800, Julien Rouhaud wrote: > > Hi, > > > > I just have a few minor wording improvements for the various comments / > > documentation you quoted. > > Talking about documentation I've noticed that the implementation > contains a few limitations that are not mentioned in the docs. Examples > are WITH queries: > > WITH x AS (LET public.svar = 100) SELECT * FROM x; > ERROR: LET not supported in WITH query >

The LET statement doesn't support the RETURNING clause, so using it inside a CTE does not make any sense. Do you have any tips on where this behaviour should be mentioned?

> and using with set-returning functions (haven't found any related tests). >

There it is:

+CREATE VARIABLE public.svar AS int;
+-- should be ok
+LET public.svar = generate_series(1, 1);
+-- should fail
+LET public.svar = generate_series(1, 2);
+ERROR: expression returned more than one row
+LET public.svar = generate_series(1, 0);
+ERROR: expression returned no rows
+DROP VARIABLE public.svar;

> > Another small note is about this change in the rowsecurity: > > /* > -* For SELECT, UPDATE and DELETE, add security quals to enforce > the USING > -* policies. These security quals control access to existing > table rows. > -* Restrictive policies are combined together using AND, and > permissive > -* policies are combined together using OR. > +* For SELECT, LET, UPDATE and DELETE, add security quals to > enforce the > +* USING policies. These security quals control access to > existing table > +* rows. Restrictive policies are combined together using AND, and > +* permissive policies are combined together using OR. > */ > > From this commentary one may think that LET command supports row level > security, but I don't see it being implemented. A wrong commentary? >

I don't think so. Row level security should be supported.
I tested it on the example from the doc:

CREATE TABLE public.accounts (
    manager text,
    company text,
    contact_email text
);

CREATE VARIABLE public.v AS text;

COPY public.accounts (manager, company, contact_email) FROM stdin;
t1role	xxx	t1r...@xxx.org
t2role	yyy	t2r...@yyy.org
\.

CREATE POLICY account_managers ON public.accounts USING ((manager = CURRENT_USER));

ALTER TABLE public.accounts ENABLE ROW LEVEL SECURITY;

GRANT SELECT,INSERT ON TABLE public.accounts TO t1role;
GRANT SELECT,INSERT ON TABLE public.accounts TO t2role;
GRANT ALL ON VARIABLE public.v TO t1role;
GRANT ALL ON VARIABLE public.v TO t2role;

[pavel@localhost postgresql.master]$ psql
Assertions: on
psql (16devel)
Type "help" for help.

(2023-03-28 21:32:33) postgres=# set role to t1role;
SET
(2023-03-28 21:32:40) postgres=# select * from accounts ;
┌─────────┬─────────┬────────────────┐
│ manager │ company │ contact_email  │
╞═════════╪═════════╪════════════════╡
│ t1role  │ xxx     │ t1r...@xxx.org │
└─────────┴─────────┴────────────────┘
(1 row)

(2023-03-28 21:32:45) postgres=# let v = (select company from accounts);
LET
(2023-03-28 21:32:58) postgres=# select v;
┌─────┐
│  v  │
╞═════╡
│ xxx │
└─────┘
(1 row)

(2023-03-28 21:33:03) postgres=# set role to default;
SET
(2023-03-28 21:33:12) postgres=# set role to t2role;
SET
(2023-03-28 21:33:19) postgres=# select * from accounts ;
┌─────────┬─────────┬────────────────┐
│ manager │ company │ contact_email  │
╞═════════╪═════════╪════════════════╡
│ t2role  │ yyy     │ t2r...@yyy.org │
└─────────┴─────────┴────────────────┘
(1 row)

(2023-03-28 21:33:22) postgres=# let v = (select company from accounts);
LET
(2023-03-28 21:33:26) postgres=# select v;
┌─────┐
│  v  │
╞═════╡
│ yyy │
└─────┘
(1 row)

Regards

Pavel
Re: Documentation for building with meson
Hi, On Wed, Mar 15, 2023 at 4:28 AM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > [PATCH v8 1/5] Make minor additions and corrections to meson docs > > The last hunk revealed that there is some mixing up between meson setup > and meson configure. This goes a bit further. For example, earlier it > says that to get a list of meson setup options, call meson configure > --help and look at https://mesonbuild.com/Commands.html#configure, which > are both wrong. Also later throughout the text it uses one or the > other. I think this has the potential to be very confusing, and we > should clean this up carefully. > > The text about additional meson test options maybe should go into the > regress.sgml chapter? > I tried to make the meson setup and meson configure usage consistent. I've removed the text for the test options. > > > > [PATCH v8 2/5] Add data layout options sub-section in installation > docs > > This makes sense. Please double check your patch for correct title > casing, vertical spacing of XML, to keep everything looking consistent. > Thanks for noticing. Made it consistent on both sides. > > This text isn't yours, but since your patch emphasizes it, I wonder if > it could use some clarification: > > + These options affect how PostgreSQL lays out data on disk. > + Note that changing these breaks on-disk database compatibility, > + meaning you cannot use pg_upgrade to upgrade to > + a build with different values of these options. > > This isn't really correct. What breaking on-disk compatibility means is > that you can't use a server compiled one way with a data directory > initialized by binaries compiled another way. pg_upgrade may well have > the ability to upgrade between one or the other; that's up to pg_upgrade > to figure out but not an intrinsic property. (I wonder why pg_upgrade > cares about the WAL block size.) > Fixed. > > > > [PATCH v8 3/5] Remove Anti-Features section from Installation from > source docs > > Makes sense. 
But is "--disable-thread-safety" really a developer > feature? I think not. > > Moved to PostgreSQL features. Do you think there's a better place for it? > > > [PATCH v8 4/5] Re-organize Miscellaneous section > > This moves the Miscellaneous section after Developer Features. I think > Developer Features should be last. > > Maybe should remove this section and add the options to the regular > PostgreSQL Features section. > Yes, that makes sense. Made this change. > > Also consider the grouping in meson_options.txt, which is slightly > different yet. Removed Misc options section from meson_options.txt too. > > > > [PATCH v8 5/5] Change Short Version for meson installation guide > > +# create working directory > +mkdir postgres > +cd postgres > + > +# fetch source code > +git clone https://git.postgresql.org/git/postgresql.git src > > This comes after the "Getting the Source" section, so at this point they > already have the source and don't need to do "git clone" etc. again. > > +# setup and enter build directory (done only first time) > +## Unix based platforms > +meson setup build src --prefix=$PWD/install > + > +## Windows > +meson setup build src --prefix=%cd%/install > > Maybe some people work this way, but to me the directory structures you > create here are completely weird. > I'd like to discuss what you think is a good directory structure to work with. I've mentioned some of the drawbacks I see with the current structure for the short version. I know this structure can feel different but it feeling weird is not ideal. Do you have a directory structure in mind which is different but doesn't feel odd to you? > > +# Initialize a new database > +../install/bin/initdb -D ../data > + > +# Start database > +../install/bin/pg_ctl -D ../data/ -l logfile start > + > +# Connect to the database > +../install/bin/psql -d postgres > > The terminology here needs to be tightened up. You are using "database" > here to mean three different things. 
> I'll address this together once we are aligned on the overall directory structure etc. There are a few reasons why I had done this. Some reasons Andres has > described in his previous email and I'll add a few specific examples on why > having the same section for both might not be a good idea. > > * Having readline and zlib as required for building PostgreSQL is now > wrong because they are not required for meson builds. Also, the name of the > configs are different for make and meson and the current section only lists > the make ones. > * There are many references to configure in that section which don't > apply to meson. > * Last I checked Flex and Bison were always required to build via meson > but not for make and the current section doesn't explain those differences. > > I spent a good amount of time thinking if we could have a single section, > clarify these differences to make it correct and not confuse the users. I > couldn't find a way to do all
Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions
On Mon, 19 Dec 2022 at 17:57, Corey Huinker wrote: > > Attached is my work in progress to implement the changes to the CAST() > function as proposed by Vik Fearing. > > CAST(expr AS typename NULL ON ERROR) > will use error-safe functions to do the cast of expr, and will return > NULL if the cast fails. > > CAST(expr AS typename DEFAULT expr2 ON ERROR) > will use error-safe functions to do the cast of expr, and will return > expr2 if the cast fails. > Is there any difference between NULL and DEFAULT NULL?
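Until the committee-blessed CAST(... ON ERROR) syntax lands, the proposed NULL/DEFAULT-on-error behavior can be approximated with the error-safe input checks that did make it into core. A hedged sketch — pg_input_is_valid() is available as of PostgreSQL 16, and the -1 default here is purely illustrative:

```sql
-- Approximation of CAST(v AS integer DEFAULT -1 ON ERROR) using the
-- error-safe validity check available since PostgreSQL 16:
SELECT v,
       CASE WHEN pg_input_is_valid(v, 'integer')
            THEN v::integer
            ELSE -1              -- the "DEFAULT -1 ON ERROR" part
       END AS safe_cast
FROM (VALUES ('123'), ('not a number')) AS t(v);
```

Replacing -1 with NULL in the ELSE branch gives the "NULL ON ERROR" variant, which also shows why NULL and DEFAULT NULL would behave identically here.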
Re: Schema variables - new implementation for Postgres 15
ne 26. 3. 2023 v 13:32 odesílatel Julien Rouhaud napsal: > Hi, > > I just have a few minor wording improvements for the various comments / > documentation you quoted. > > On Sun, Mar 26, 2023 at 08:53:49AM +0200, Pavel Stehule wrote: > > út 21. 3. 2023 v 17:18 odesílatel Peter Eisentraut < > > peter.eisentr...@enterprisedb.com> napsal: > > > > > - What is the purpose of struct Variable? It seems very similar to > > >FormData_pg_variable. At least a comment would be useful. > > > > > > > I wrote comment there: > > > > > > /* > > * The Variable struct is based on FormData_pg_variable struct. Against > > * FormData_pg_variable it can hold node of deserialized expression used > > * for calculation of default value. > > */ > > Did you mean "Unlike" rather than "Against"? > fixed > > > > 0002 > > > > > > expr_kind_allows_session_variables() should have some explanation > > > about criteria for determining which expression kinds should allow > > > variables. > > > > > > > I wrote comment there: > > > > /* > > * Returns true, when expression of kind allows using of > > * session variables. > > + * The session's variables can be used everywhere where > > + * can be used external parameters. Session variables > > + * are not allowed in DDL. Session's variables cannot be > > + * used in constraints. > > + * > > + * The identifier can be parsed as an session variable > > + * only in expression's kinds where session's variables > > + * are allowed. This is the primary usage of this function. > > + * > > + * Second usage of this function is for decision if > > + * an error message "column does not exist" or "column > > + * or variable does not exist" should be printed. When > > + * we are in expression, where session variables cannot > > + * be used, we raise the first form or error message. > > */ > > Maybe > > /* > * Returns true if the given expression kind is valid for session variables > * Session variables can be used everywhere where external parameters can > be > * used. 
Session variables are not allowed in DDL commands or in > constraints. > * > * An identifier can be parsed as a session variable only for expression > kinds > * where session variables are allowed. This is the primary usage of this > * function. > * > * Second usage of this function is to decide whether "column does not > exist" or > * "column or variable does not exist" error message should be printed. > * When we are in an expression where session variables cannot be used, we > raise > * the first form or error message. > */ > changed > > > > session_variables_ambiguity_warning: There needs to be more > > > information about this. The current explanation is basically just, > > > "warn if your query is confusing". Why do I want that? Why would I > > > not want that? What is the alternative? What are some examples? > > > Shouldn't there be a standard behavior without a need to configure > > > anything? > > > > > > > I enhanced this entry: > > > > + > > +The session variables can be shadowed by column references in a > > query. This > > +is an expected feature. The existing queries should not be > broken > > by creating > > +any session variable, because session variables are shadowed > > always if the > > +identifier is ambiguous. The variables should be named without > > possibility > > +to collision with identifiers of other database objects (column > > names or > > +record field names). The warnings enabled by setting > > session_variables_ambiguity_warning > > +should help with finding identifier's collisions. > > Maybe > > Session variables can be shadowed by column references in a query, this is > an > expected behavior. Previously working queries shouldn't error out by > creating > any session variable, so session variables are always shadowed if an > identifier > is ambiguous. 
Variables should be referenced using an unambiguous > identifier > without any possibility for a collision with identifier of other database > objects (column names or record fields names). The warning messages > emitted > when enabling session_variables_ambiguity_warning can > help > finding such identifier collision. > > > + > > + > > +This feature can significantly increase size of logs, and then > it > > is > > +disabled by default, but for testing or development > environments it > > +should be enabled. > > Maybe > > This feature can significantly increase log size, so it's disabled by > default. > For testing or development environments it's recommended to enable it if > you > use session variables. > replaced Thank you very much for these language correctures Regards Pavel p.s. I'll send updated patch after today or tomorrow - I have to fix broken dependency check after rebase
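For readers following along, a minimal sketch of the shadowing behavior that session_variables_ambiguity_warning is meant to surface. Note this uses the patch's proposed CREATE VARIABLE / LET syntax and GUC, none of which is in core:

```sql
CREATE TABLE t (v int);
INSERT INTO t VALUES (1);
CREATE VARIABLE v AS int;
LET v = 100;

SET session_variables_ambiguity_warning = on;

-- The identifier "v" is ambiguous here; per the patch, the column always
-- wins (variables are always shadowed), and a warning is emitted:
SELECT v FROM t;
```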
Re: Why mark empty pages all visible?
Hi,

On 2023-03-27 21:51:09 -0700, Peter Geoghegan wrote:
> On Mon, Mar 27, 2023 at 9:32 PM Andres Freund wrote:
> > > You haven't said anything about the leading cause of marking empty
> > > pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
> > > marking empty pages all-frozen?
> >
> > It's not obvious that it should - but it's not as clear a case as the
> > ConditionalLockBufferForCleanup() -> lazy_scan_new_or_empty() one. In the
> > latter, we know
> > a) that we don't have to do any work to be able to advance the horizon
> > b) we know that somebody else has the page pinned
> >
> > What's the point in marking it all-visible at that point? The pin quite
> > likely comes from RelationGetBufferForTuple() having extended the relation
> > and then briefly needing to release the lock (to acquire the lock on
> > otherBuffer or in GetVisibilityMapPins()).
>
> I think that there is significant value in avoiding special cases, on
> general principle. If we stopped doing this in
> lazy_scan_new_or_empty() we'd be inventing a new thing that cleanup
> locks are supposed to protect against. Maybe something like that would
> make sense, but if so then make that argument, and make it explicitly
> represented in the code.

I will probably make that argument - so far I was just trying to understand the intent of the current code. There aren't really comments explaining why we want to mark currently-pinned empty/new pages all-visible and enter them into the FSM.

Historically we did *not* enter currently pinned empty/new pages into the FSM / VM. Afaics that's new as of 44fa84881fff.

The reason I'm looking at this is that there's a lot of complexity at the bottom of RelationGetBufferForTuple(), related to needing to release the lock on the newly extended page and then needing to recheck whether there still is free space on the page. And as if that weren't complicated enough (c.f. INSERT_FROZEN doing visibilitymap_pin() with the page locked).
As far as I can tell, if we went back to not entering new/empty pages into either VM or FSM, we could rely on the page not getting filled while just pinning, not locking it. > > I don't follow. In the ConditionalLockBufferForCleanup() -> > > lazy_scan_new_or_empty() case we are dealing with an new or empty > > page. Whereas lazy_vacuum_heap_page() deals with a page that definitely > > has dead tuples on it. How are those two cases comparable? > > It doesn't have dead tuples anymore, though. > > ISTM that there is an issue here with the definition of an empty page. > You're concerned about PageIsEmpty() pages. And not just any PageIsEmpty() page, ones that are currently pinned. I also do wonder whether the different behaviour of PageIsEmpty() and PageIsNew() pages makes sense. The justification for not marking PageIsNew() pages as all-visible holds just as true for empty pages. There's just as much free space there. Greetings, Andres Freund
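The VM bits being discussed here can be observed from SQL with the contrib pg_visibility module, which is handy when experimenting with how VACUUM treats empty/new pages. A sketch (the table name is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- per-page all-visible / all-frozen bits straight from the visibility map:
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('some_table');

-- pages whose VM bit claims all-visible but whose heap contents disagree:
SELECT * FROM pg_check_visible('some_table');
```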
Re: [PATCH]Feature improvement for MERGE tab completion
Ah, another thread with a bouncing email address... Please respond to to thread from this point to avoid bounces. -- Gregory Stark As Commitfest Manager
Re: [PATCH]Feature improvement for MERGE tab completion
It looks like this remaining work isn't going to happen this CF and therefore this release. There hasn't been an update since January when Dean Rasheed posted a review. Unless there's any updates soon I'll move this on to the next commitfest or mark it returned with feedback. -- Gregory Stark As Commitfest Manager
Re: running logical replication as the subscription owner
On Tue, 28 Mar 2023 at 18:13, Robert Haas wrote: > > On Mon, Mar 27, 2023 at 6:08 PM Jelte Fennema wrote: > > For high privileged subscription owners, > > we downgrade to the permissions of the table owner, but for low > > privileged ones we care about permissions of the subscription owner > > itself. > > Hmm. This is an interesting idea. A variant on this theme could be: > what if we made this an explicit configuration option? Sounds perfect to me. Let's do that. As long as the old no-superuser tests continue to pass when disabling the new switch-to-table-owner behaviour, that sounds totally fine to me. I think it's probably easiest if you use the tests from my v2 patch when you add that option, since that was by far the thing I spent most time on to get right in the v2 patch. On Tue, 28 Mar 2023 at 18:13, Robert Haas wrote: > > On Mon, Mar 27, 2023 at 6:08 PM Jelte Fennema wrote: > > Attached is an updated version of your patch with what I had in mind > > (admittedly it needed one more line than "just" the return to make it > > work). But as you can see all previous tests for a lowly privileged > > subscription owner that **cannot** SET ROLE to the table owner > > continue to work as they did before. While still downgrading to the > > table owners role when the subscription owner **can** SET ROLE to the > > table owner. > > > > Obviously this needs some comments explaining what's going on and > > probably some code refactoring and/or variable renaming, but I hope > > it's clear what I meant now: For high privileged subscription owners, > > we downgrade to the permissions of the table owner, but for low > > privileged ones we care about permissions of the subscription owner > > itself. > > Hmm. This is an interesting idea. A variant on this theme could be: > what if we made this an explicit configuration option? 
> > I'm worried that if we just try to switch users and silently fall back > to not doing so when we don't have enough permissions, the resulting > behavior is going to be difficult to understand and troubleshoot. I'm > thinking that maybe if you let people pick the behavior they want that > becomes more comprehensible. It's also a nice insurance policy: say > for the sake of argument we make switch-to-table-owner the new > default. If that new behavior causes something to happen to somebody > that they don't like, they can always turn it off, even if they are a > highly privileged user who doesn't "need" to turn it off. > > -- > Robert Haas > EDB: http://www.enterprisedb.com
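The "can the subscription owner SET ROLE to the table owner" condition underlying this proposal can be previewed from plain SQL with pg_has_role(). A sketch — the role name is hypothetical, and at the time of this thread role membership is what gated SET ROLE:

```sql
-- For each table in the target schema: could role sub_owner
-- switch to the table's owner?
SELECT schemaname, tablename, tableowner,
       pg_has_role('sub_owner', tableowner, 'MEMBER') AS can_set_role
FROM pg_tables
WHERE schemaname = 'public';
```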
Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions
On Tue, 3 Jan 2023 at 14:16, Tom Lane wrote: > > Vik Fearing writes: > > > I don't think we should add that syntax until I do get it through the > > committee, just in case they change something. > > Agreed. So this is something we won't be able to put into v16; > it'll have to wait till there's something solid from the committee. I guess I'll mark this Rejected in the CF then. Who knows when the SQL committee will look at this... -- Gregory Stark As Commitfest Manager
Re: Request for comment on setting binary format output per session
FYI I attached this thread to https://commitfest.postgresql.org/42/3777 which I believe is the same issue. I mistakenly had this listed as a CF entry with no discussion for a long time due to that missing link. -- Gregory Stark As Commitfest Manager
Re: Add LZ4 compression in pg_dump
On Tue, Mar 28, 2023 at 06:40:03PM +0200, Tomas Vondra wrote: > On 3/28/23 18:07, gkokola...@pm.me wrote: > > --- Original Message --- > > On Friday, March 24th, 2023 at 10:30 AM, gkokola...@pm.me > > wrote: > > > >> --- Original Message --- > >> On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra > >> tomas.von...@enterprisedb.com wrote: > >> > >>> This leaves the empty-data issue (which we have a fix for) and the > >>> switch to LZ4F. And then the zstd part. > >> > >> Please expect promptly a patch for the switch to frames. > > > > Please find the expected patch attached. Note that the bulk of the > > patch is code unification, variable renaming to something more > > appropriate, and comment addition. These are changes that are not > > strictly necessary to switch to LZ4F. I do believe that they are > > essential for code hygiene after the switch and they do belong > > on the same commit. > > Thanks! > > I agree the renames & cleanup are appropriate - it'd be silly to stick > to misleading naming etc. Would it make sense to split the patch into > two, to separate the renames and the switch to lz4f? > That'd make the changes necessary for the lz4f switch clearer. I don't think so. Did you mean separate commits only for review? The patch is pretty readable - the File API has just some renames, and the compressor API is what's being replaced, which isn't going to be any more clear. @Georgeos: did you consider using a C union in LZ4State, to separate the parts used by the different APIs? -- Justin
Re: zstd compression for pg_dump
On 3/28/23 20:03, Kirk Wolak wrote: > On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra > mailto:tomas.von...@enterprisedb.com>> > wrote: > > On 3/27/23 19:28, Justin Pryzby wrote: > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: > >> On 3/16/23 05:50, Justin Pryzby wrote: > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion > mailto:jchamp...@timescale.com>> wrote: > > I did some smoke testing against zstd's GitHub release on > Windows. To > ... > OK. I don't have access to a Windows machine so I can't test that. Is it > possible to disable the zstd threading, until we figure this out? > > Thomas since I appear to be one of the few windows users (I use both), > can I help? > I can test pg_dump... for you, easy to do. I do about 5-10 pg_dumps a > day on windows while developing. > Perhaps. But I'll leave the details up to Justin - it's his patch, and I'm not sure how to verify the threading is OK. I'd try applying this patch, build with --with-zstd and then run the pg_dump TAP tests, and perhaps do some manual tests. And perhaps do the same for --with-lz4 - there's a thread [1] suggesting we don't detect lz4 stuff on Windows, so the TAP tests do nothing. https://www.postgresql.org/message-id/zajl96n9zw84u...@msg.df7cb.de > Also, I have an AWS instance I created to build PG/Win with readline > back in November. > I could give you access to that... (you are not the only person who has > made this statement here). > I've made such instances available for other Open Source developers, to > support them. > > Obvi I would share connection credentials privately. > I'd rather leave the Windows stuff up to someone with more experience with that platform. I have plenty of other stuff on my plate atm. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: POC: Better infrastructure for automated testing of concurrency issues
On Wed, 31 Aug 2022 at 20:52, Alexander Korotkov wrote: > > I'll continue work on this patch. The rebased patch is attached. It > implements stop events as configure option (not runtime GUC option). It looks like this patch isn't going to be ready this commitfest. And it hasn't received much discussion since August 2022. If I'm wrong say something but otherwise I'll mark it Returned With Feedback. It can be resurrected (and moved to the next commitfest) when you're free to work on it again. -- Gregory Stark As Commitfest Manager
Re: "variable not found in subplan target list"
Alvaro Herrera writes: > So I'm back home and found a couple more weird errors in the log: > ERROR: mismatching PartitionPruneInfo found at part_prune_index 0 > DETALLE: plan node relids (b 1), pruneinfo relids (b 36) This one reproduces for me. > select > pg_catalog.pg_stat_get_buf_fsync_backend() as c9 > from > public.tenk2 as ref_0 > where (ref_0.stringu2 is NULL) > and (EXISTS ( > select 1 from fkpart5.fk1 as ref_1 > where pg_catalog.current_date() < (select pg_catalog.max(filler3) > from public.mcv_lists))) ; > ERROR: subplan "InitPlan 1 (returns $1)" was not initialized > CONTEXTO: parallel worker Hmph, I couldn't reproduce that, not even with other settings of debug_parallel_query. Are you running it with non-default planner parameters? > select 1 as c0 > ... > ERROR: could not find commutator for operator 53286 I got a slightly different error: ERROR: missing support function 1(195306,195306) in opfamily 1976 where regression=# select 195306::regtype; regtype int8alias1 (1 row) So that one is related to the intentionally-somewhat-broken int8 opclass configuration that equivclass.sql leaves behind. I've always had mixed emotions about whether leaving that set up that way was a good idea or not. In principle nothing really bad should happen, but it can lead to confusing errors like this one. Maybe it'd be better to roll that back? regards, tom lane
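For anyone reproducing these, the objects the planner complained about can be inspected directly in the catalogs. A sketch — the OIDs are the ones from the reported errors and will differ in any other installation:

```sql
-- "could not find commutator for operator 53286": what does pg_operator say?
SELECT oprname, oprleft::regtype, oprright::regtype,
       oprcom::regoperator AS commutator
FROM pg_operator
WHERE oid = 53286;

-- "missing support function 1(195306,195306) in opfamily 1976":
SELECT amprocnum, amproc
FROM pg_amproc
WHERE amprocfamily = 1976
  AND amproclefttype = 195306
  AND amprocrighttype = 195306;
```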
Re: logical decoding and replication of sequences, take 2
On 3/28/23 18:34, Masahiko Sawada wrote: > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra > wrote: >> >> >> >> On 3/27/23 03:32, Masahiko Sawada wrote: >>> Hi, >>> >>> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra >>> wrote: I merged the earlier "fixup" patches into the relevant parts, and left two patches with new tweaks (deducing the corrent "WAL" state from the current state read by copy_sequence), and the interlock discussed here. >>> >>> Apart from that, how does the publication having sequences work with >>> subscribers who are not able to handle sequence changes, e.g. in a >>> case where PostgreSQL version of publication is newer than the >>> subscriber? As far as I tested the latest patches, the subscriber >>> (v15) errors out with the error 'invalid logical replication message >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible >>> behavior. I think we should instead either (1) deny starting the >>> replication if the subscriber isn't able to handle sequence changes >>> and the publication includes that, or (2) not send sequence changes to >>> such subscribers. >>> >> >> I agree the "invalid message" error is not great, but it's not clear to >> me how to do either (1). The trouble is we don't really know if the >> publication contains (or will contain) sequences. I mean, what would >> happen if the replication starts and then someone adds a sequence? >> >> For (2), I think that's not something we should do - silently discarding >> some messages seems error-prone. If the publication includes sequences, >> presumably the user wanted to replicate those. If they want to replicate >> to an older subscriber, create a publication without sequences. >> >> Perhaps the right solution would be to check if the subscriber supports >> replication of sequences in the output plugin, while attempting to write >> the "Q" message. And error-out if the subscriber does not support it. 
> > It might be related to this topic; do we need to bump the protocol > version? The commit 64824323e57d introduced new streaming callbacks > and bumped the protocol version. I think the same seems to be true for > this change as it adds sequence_cb callback. > It's not clear to me what the exact behavior should be. I mean, imagine we're opening a connection for logical replication, and the subscriber does not handle sequences. What should the publisher do? (Note: The correct commit hash is 464824323e57d.) I don't think streaming is a good match for sequences, because of a couple of important differences ... Firstly, streaming determines *how* the changes are replicated, not what gets replicated. It doesn't (silently) filter out "bad" events that the subscriber doesn't know how to apply. If the subscriber does not know how to deal with streamed xacts, it'll still get the same changes exactly per the publication definition. Secondly, the default value is "streaming=off", i.e. the subscriber has to explicitly request streaming when opening the connection. And we simply check it against the negotiated protocol version, i.e. the check in pgoutput_startup() protects against a subscriber requesting protocol v1 but also streaming=on. I don't think we can/should do more checks at this point - we don't know what's included in the requested publications at that point, and I doubt it's worth adding because we certainly can't predict if the publication will be altered to include/decode sequences in the future. Speaking of precedents, TRUNCATE is probably a better one, because it's a new action and it determines *what* the subscriber can handle. 
But that does exactly the thing we do for sequences - if you open a connection from PG10 subscriber (truncate was added in PG11), and the publisher decodes a truncate, subscriber will do: 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical replication message type "T" 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical replication worker for subscription 16390 (PID 2357609) exited with exit code 1 I don't see why sequences should do anything else. If you need to replicate to such subscriber, create a publication that does not have 'sequence' in the publish option ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
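A sketch of the suggested workaround for older subscribers. Note that the 'sequence' publish action comes from the patch under discussion and is not in core; the other actions shown are the existing ones:

```sql
-- Publication for a subscriber that cannot decode sequence changes:
-- simply leave 'sequence' out of the publish list, mirroring how a
-- pre-PG11 subscriber would be spared TRUNCATE messages:
CREATE PUBLICATION pub_no_seq FOR ALL TABLES
    WITH (publish = 'insert, update, delete, truncate');
```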
Re: "variable not found in subplan target list"
So I'm back home and found a couple more weird errors in the log: MERGE INTO public.idxpart2 as target_0 USING (select 1 from public.xmltest2 as ref_0 inner join public.prt1_l_p1 as sample_0 inner join fkpart4.droppk as ref_1 on (sample_0.a = ref_1.a ) on (true) limit 50) as subq_0 left join information_schema.transforms as ref_2 left join public.transition_table_status as sample_1 on (ref_2.transform_type is not NULL) on (true) ON target_0.a = sample_1.level WHEN MATCHED THEN UPDATE set a = target_0.a; ERROR: mismatching PartitionPruneInfo found at part_prune_index 0 DETALLE: plan node relids (b 1), pruneinfo relids (b 36) This one is probably my fault, will look later. select pg_catalog.pg_stat_get_buf_fsync_backend() as c9 from public.tenk2 as ref_0 where (ref_0.stringu2 is NULL) and (EXISTS ( select 1 from fkpart5.fk1 as ref_1 where pg_catalog.current_date() < (select pg_catalog.max(filler3) from public.mcv_lists))) ; ERROR: subplan "InitPlan 1 (returns $1)" was not initialized CONTEXTO: parallel worker select 1 as c0 from (select subq_0.c9 as c5, subq_0.c8 as c9 from public.iso8859_5_inputs as ref_0, lateral (select ref_1.ident as c2, ref_0.description as c8, ref_1.used_bytes as c9 from pg_catalog.pg_backend_memory_contexts as ref_1 where true ) as subq_0 where subq_0.c2 is not NULL) as subq_1 inner join pg_catalog.pg_class as sample_0 on (subq_1.c5 = public.int8alias1in( cast(case when subq_1.c9 is not NULL then null end as cstring))) where true; ERROR: could not find commutator for operator 53286 There were quite a few of those "variable not found" ones, both mentioning singular "targetlist" and others that said "targetlists". I reran them with your patch and they no longer error out, so I guess it's all the same bug. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "I must say, I am absolutely impressed with what pgsql's implementation of VALUES allows me to do. It's kind of ridiculous how much "work" goes away in my code. 
Too bad I can't do this at work (Oracle 8/9)." (Tom Allison) http://archives.postgresql.org/pgsql-general/2007-06/msg00016.php
Re: allow_in_place_tablespaces vs. pg_basebackup
On Tue, Mar 28, 2023 at 5:15 PM Thomas Munro wrote: > I guess we should add a test of -Fp too, to keep it working? Oops, that was already tested in existing tests, so I take that back.
Re: Add pg_walinspect function with block info columns
On Mon, Mar 27, 2023 at 7:47 PM Bharath Rupireddy wrote: > Hm, agreed. Changed in the attached v7-0002 patch. We can as well > write a case statement in the create function SQL to output forkname > instead forknumber, but I'd stop doing that to keep in sync with > pg_buffercache. I just don't see much value in any textual representation of fork name, however generated. In practice it's just not adding very much useful information. It is mostly useful as a way of filtering block references, which makes simple integers more natural. > Oh, thanks for finding it out. Fixed in the attached v7-0001 patch. I > also removed the "invalid fork number" error as users can figure that > out if at all the fork number is wrong. Pushed just now. > On the ordering of the columns, I kept start_lsn, end_lsn and prev_lsn > first and then the rel** columns (this rel** columns order follows > pg_buffercache) and then block data related columns. Michael and > Kyotaro are of the opinion that it's better to keep LSNs first to be > consistent and also given that this function is WAL related, it makes > sense to have LSNs first. Right, but I didn't change that part in the revision of the patch I posted. Those columns still came first, and were totally consistent with the pg_get_wal_record_info function. I think that there was a "mid air collision" here, where we both posted patches that we each called v7 within minutes of each other. Just to be clear, I ended up with a column order as described here in my revision: https://postgr.es/m/CAH2-WzmzO-AU4QSbnzzANBkrpg=4cuod3scvtv+7x65e+qk...@mail.gmail.com It now occurs to me that "fpi_data" should perhaps be called "block_fpi_data". What do you think? -- Peter Geoghegan
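If a textual fork name is ever wanted, the CASE mapping mentioned above is trivial to write on the client side instead of baking it into the function. A sketch against the block-info function as discussed in this thread — the function and column names follow the in-flight patches and may differ in the committed version; the fork numbers follow relpath.h:

```sql
SELECT start_lsn, relblocknumber,
       CASE relforknumber
            WHEN 0 THEN 'main'   -- main fork
            WHEN 1 THEN 'fsm'    -- free space map
            WHEN 2 THEN 'vm'     -- visibility map
            WHEN 3 THEN 'init'   -- init fork
       END AS forkname
FROM pg_get_wal_block_info('0/1000000', '0/1100000');
```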
Re: zstd compression for pg_dump
On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra wrote: > On 3/27/23 19:28, Justin Pryzby wrote: > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: > >> On 3/16/23 05:50, Justin Pryzby wrote: > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion < > jchamp...@timescale.com> wrote: > > I did some smoke testing against zstd's GitHub release on Windows. To > ... > OK. I don't have access to a Windows machine so I can't test that. Is it > possible to disable the zstd threading, until we figure this out? > > Thomas since I appear to be one of the few windows users (I use both), can I help? I can test pg_dump... for you, easy to do. I do about 5-10 pg_dumps a day on windows while developing. Also, I have an AWS instance I created to build PG/Win with readline back in November. I could give you access to that... (you are not the only person who has made this statement here). I've made such instances available for other Open Source developers, to support them. Obvi I would share connection credentials privately. Regards, Kirk
Re: running logical replication as the subscription owner
On Tue, Mar 28, 2023 at 1:38 AM Jeff Davis wrote: > > If they do have those things then in > > theory they might be able to protect themselves, but in practice they > > are unlikely to be careful enough. I judge that practically every > > installation where table owners use triggers would be easily > > compromised. Only the most security-conscious of table owners are > > going to get this right. > > This is the interesting part. > > Commit 11da97024ab set the subscription process's search path as empty. > It seems it was done to protect the subscription owner against users > writing into the public schema. But after your apply-as-table-owner > patch, it seems to also offer some protection for table owners against > a malicious subscription owner, too. > > But I must be missing something obvious here -- if the subscription > process has an empty search_path, what can an attacker do to trick it? Oh, interesting. I hadn't realized we were doing that, and I do think that narrows the vulnerability quite a bit. But I still feel pretty uncomfortable with the idea of requiring only INSERT/UPDATE/DELETE permissions on the target table. Let's suppose that you create a table and attach a trigger to it which is SECURITY INVOKER. Absent logical replication, you don't really need to worry about whether that function is written securely -- it will run with privileges of the person performing the DML, and not impact your account at all. They have reason to be scared of you, but not the reverse. However, if the people doing DML on the table can arrange to perform that DML as you, then any security vulnerabilities in your function potentially allow them to compromise your account. Now, there might not be any, but there also might be some, depending on exactly what your function does. 
And if logical replication into a table requires only I/U/D permission then it's basically a back-door way to run, as the table owner, functions that someone could otherwise execute only as themselves, and that's scary. Now, I don't know how much to worry about that. As you just pointed out to me in some out-of-band discussion, this is only going to affect a table owner who writes a trigger, makes it insecure in some way other than failure to sanitize the search_path, and sets it ENABLE ALWAYS TRIGGER or ENABLE REPLICA TRIGGER. And maybe we could say that if you do that last part, you kind of need to think about the consequences for logical replication. If so, we could document that problem away. It would also affect someone who uses a default expression or other means of associating executable code with the table, and something like a default expression doesn't require any explicit setting to make it apply in case of replication, so maybe the risk of someone messing up is a bit higher. But this definitely makes it more of a judgment call than I thought initially. Functions that are vulnerable to search_path exploits are a dime a dozen; other kinds of exploits are going to be less common, and more dependent on exactly what the function is doing. I'm still inclined to leave the patch checking for SET ROLE, because after all, we're thinking of switching user identities to the table owner, and checking whether we can SET ROLE is the most natural way of doing that, and definitely doesn't leave room to escalate to any privilege you don't already have. However, there might be an argument that we ought to do something else, like have a REPLICATE privilege on the table that the owner can grant to users that they trust to replicate into that table. Because time is short, I'd like to leave that idea for a future release.
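[Editorial illustration of the search_path class of exploit mentioned above — all object names here are invented:]

```sql
-- A SECURITY INVOKER trigger function with an unqualified call is
-- resolved through the caller's search_path, so anyone who can create
-- an earlier-resolving log_change() can hijack it.
CREATE FUNCTION audit_row() RETURNS trigger AS $$
BEGIN
  PERFORM log_change(NEW.id);  -- unqualified: resolved via search_path
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- The usual hardening pins the search path so resolution is fixed:
ALTER FUNCTION audit_row() SET search_path = myschema, pg_temp;
```

As noted above, this particular class is mitigated by the empty search_path the apply worker already uses; the residual risk is in whatever else the function body does.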
What I would like to change in the patch in this release is to add a new subscription property along the lines of what I proposed to Jelte in my earlier email: let's provide an escape hatch that turns off the user-switching behavior. If we do that, then anyone who feels themselves worse off after this patch can switch back to the old behavior. Most people will be better off, I believe, and the opportunity to make things still better in the future is not foreclosed. -- Robert Haas EDB: http://www.enterprisedb.com
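[Editorial sketch of what such an escape hatch could look like at the SQL level. The option name is invented here for illustration; the actual spelling would be settled by the patch:]

```sql
-- Hypothetical: opt out of the new user-switching behavior and keep
-- applying changes as the subscription owner, as before.
CREATE SUBSCRIPTION sub
  CONNECTION 'host=primary dbname=app'
  PUBLICATION pub
  WITH (run_as_owner = true);
```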
Re: Non-superuser subscription owners
On Fri, 2023-03-24 at 00:17 -0700, Jeff Davis wrote: > The other patch you posted seems like it makes a lot of progress in > that direction, and I think that should go in first. That was one of > the items I suggested previously[2], so thank you for working on > that. The above is not a hard objection. I still hold the opinion that the non-superuser subscriptions work feels premature without the apply-as-table-owner work. It would be great if the other patch ends up ready quickly, which would moot the commit-ordering question. Regards, Jeff Davis
Re: Data is copied twice when specifying both child and parent table in publication
On Tue, Mar 28, 2023 at 2:59 AM wangw.f...@fujitsu.com wrote: > The scenario of this bug is to subscribe to two publications at the same time, > and these two publications publish parent table and child table respectively. > And option via_root is specified in both publications or only in the > publication > of the parent table. Ah, reading the initial mail again, that makes sense. I came to this thread with the alternative reproduction in mind (subscribing to one publication with viaroot=true, and another publication with viaroot=false) and misread the report accordingly... In the end, I'm comfortable with the current level of coverage. > > Would it > > be enough to just replace that whole thing with gpt.attrs? > > Make sense. > Changed as suggested. LGTM, by inspection. Thanks! --Jacob
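[Editorial sketch of the scenario being described, for readers joining the thread — object names invented:]

```sql
-- Publisher: parent and child published separately, with via_root set
-- on the parent's publication.
CREATE TABLE parent (a int) PARTITION BY RANGE (a);
CREATE TABLE child PARTITION OF parent FOR VALUES FROM (0) TO (10);
CREATE PUBLICATION pub_parent FOR TABLE parent
  WITH (publish_via_partition_root = true);
CREATE PUBLICATION pub_child FOR TABLE child;

-- Subscriber: subscribing to both publications at once is what
-- triggered the double copy of child's rows during initial sync:
-- CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub_parent, pub_child;
```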
Re: Add LZ4 compression in pg_dump
On 3/28/23 18:07, gkokola...@pm.me wrote: > > > > > > --- Original Message --- > On Friday, March 24th, 2023 at 10:30 AM, gkokola...@pm.me > wrote: > >> >> --- Original Message --- >> On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra >> tomas.von...@enterprisedb.com wrote: >> >>> This leaves the empty-data issue (which we have a fix for) and the >>> switch to LZ4F. And then the zstd part. >> >> Please expect promptly a patch for the switch to frames. > > Please find the expected patch attached. Note that the bulk of the > patch is code unification, variable renaming to something more > appropriate, and comment addition. These are changes that are not > strictly necessary to switch to LZ4F. I do believe that they are > essential for code hygiene after the switch and they do belong > on the same commit. > Thanks! I agree the renames & cleanup are appropriate - it'd be silly to stick to misleading naming etc. Would it make sense to split the patch into two, to separate the renames and the switch to LZ4F? That'd make the changes necessary for the LZ4F switch clearer. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra wrote: > > > > On 3/27/23 03:32, Masahiko Sawada wrote: > > Hi, > > > > On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra > > wrote: > >> > >> I merged the earlier "fixup" patches into the relevant parts, and left > >> two patches with new tweaks (deducing the corrent "WAL" state from the > >> current state read by copy_sequence), and the interlock discussed here. > >> > > > > Apart from that, how does the publication having sequences work with > > subscribers who are not able to handle sequence changes, e.g. in a > > case where PostgreSQL version of publication is newer than the > > subscriber? As far as I tested the latest patches, the subscriber > > (v15) errors out with the error 'invalid logical replication message > > type "Q"' when receiving a sequence change. I'm not sure it's sensible > > behavior. I think we should instead either (1) deny starting the > > replication if the subscriber isn't able to handle sequence changes > > and the publication includes that, or (2) not send sequence changes to > > such subscribers. > > > > I agree the "invalid message" error is not great, but it's not clear to > me how to do either (1). The trouble is we don't really know if the > publication contains (or will contain) sequences. I mean, what would > happen if the replication starts and then someone adds a sequence? > > For (2), I think that's not something we should do - silently discarding > some messages seems error-prone. If the publication includes sequences, > presumably the user wanted to replicate those. If they want to replicate > to an older subscriber, create a publication without sequences. > > Perhaps the right solution would be to check if the subscriber supports > replication of sequences in the output plugin, while attempting to write > the "Q" message. And error-out if the subscriber does not support it. It might be related to this topic; do we need to bump the protocol version? 
The commit 64824323e57d introduced new streaming callbacks and bumped the protocol version. I think the same seems to be true for this change as it adds sequence_cb callback. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
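[Editorial sketch of the workaround suggested above — splitting sequences into their own publication. The sequence-publication syntax comes from the patchset under discussion, so it is not final:]

```sql
-- Tables only: safe to consume from an older subscriber that does not
-- understand the new "Q" sequence message.
CREATE PUBLICATION pub_tables FOR ALL TABLES;

-- Sequences split out, for subscribers new enough to handle them
-- (hypothetical syntax from the patchset):
CREATE PUBLICATION pub_seqs FOR ALL SEQUENCES;
```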
Re: zstd compression for pg_dump
On 3/27/23 19:28, Justin Pryzby wrote: > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: >> On 3/16/23 05:50, Justin Pryzby wrote: >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion wrote: > I did some smoke testing against zstd's GitHub release on Windows. To > build against it, I had to construct an import library, and put that > and the DLL into the `lib` folder expected by the MSVC scripts... > which makes me wonder if I've chosen a harder way than necessary? It looks like pg_dump's meson.build is missing dependencies on zstd (meson couldn't find the headers in the subproject without them). >>> >>> I saw that this was added for LZ4, but I hadn't added it for zstd since >>> I didn't run into an issue without it. Could you check that what I've >>> added works for your case ? >>> > Parallel zstd dumps seem to work as expected, in that the resulting > pg_restore output is identical to uncompressed dumps and nothing > explodes. I haven't inspected the threading implementation for safety > yet, as you mentioned. Hm. Best I can tell, the CloneArchive() machinery is supposed to be handling safety for this, by isolating each thread's state. I don't feel comfortable pronouncing this new addition safe or not, because I'm not sure I understand what the comments in the format-specific _Clone() callbacks are saying yet. >>> >>> My line of reasoning for unix is that pg_dump forks before any calls to >>> zstd. Nothing zstd does ought to affect the pg_dump layer. But that >>> doesn't apply to pg_dump under windows. This is an open question. If >>> there's no solid answer, I could disable/ignore the option (maybe only >>> under windows). >> >> I may be missing something, but why would the patch affect this? Why >> would it even affect safety of the parallel dump? And I don't see any >> changes to the clone stuff ... > > zstd supports using threads during compression, with -Z zstd:workers=N. 
> When unix forks, the child processes can't do anything to mess up the > state of the parent processes. > > But windows pg_dump uses threads instead of forking, so it seems > possible that the pg_dump -j threads that then spawn zstd threads could > "leak threads" and break the main thread. I suspect there's no issue, > but we still ought to verify that before declaring it safe. > OK. I don't have access to a Windows machine so I can't test that. Is it possible to disable the zstd threading, until we figure this out? >>> The function is first checking if it was passed a filename which already >>> has a suffix. And if not, it searches through a list of suffixes, >>> testing for an existing file with each suffix. The search with stat() >>> doesn't happen if it has a suffix. I'm having trouble seeing how the >>> hasSuffix() branch isn't dead code. Another open question. >> >> AFAICS it's done this way because of this comment in pg_backup_directory >> >> * ... >> * ".gz" suffix is added to the filenames. The TOC files are never >> * compressed by pg_dump, however they are accepted with the .gz suffix >> * too, in case the user has manually compressed them with 'gzip'. >> >> I haven't tried, but I believe that if you manually compress the >> directory, it may hit this branch. > > That would make sense, but when I tried, it didn't work like that. > The filenames were all uncompressed names. Maybe it worked differently > in an older release. Or maybe it changed during development of the > parallel-directory-dump patch and it's actually dead code. > Interesting. Would be good to find out. I wonder if a little bit of git-log digging could tell us more. Anyway, until we confirm it's dead code, we should probably do what .gz does and have the same check for .lz4 and .zst files. > This is rebased over the updated compression API. > > It seems like I misunderstood something you said before, so now I put > back "supports_compression()". > Thanks! 
I need to do a bit more testing and review, but it seems pretty much RFC to me, assuming we can figure out what to do about threading. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: running logical replication as the subscription owner
On Mon, Mar 27, 2023 at 6:08 PM Jelte Fennema wrote: > Attached is an updated version of your patch with what I had in mind > (admittedly it needed one more line than "just" the return to make it > work). But as you can see all previous tests for a lowly privileged > subscription owner that **cannot** SET ROLE to the table owner > continue to work as they did before. While still downgrading to the > table owners role when the subscription owner **can** SET ROLE to the > table owner. > > Obviously this needs some comments explaining what's going on and > probably some code refactoring and/or variable renaming, but I hope > it's clear what I meant now: For high privileged subscription owners, > we downgrade to the permissions of the table owner, but for low > privileged ones we care about permissions of the subscription owner > itself. Hmm. This is an interesting idea. A variant on this theme could be: what if we made this an explicit configuration option? I'm worried that if we just try to switch users and silently fall back to not doing so when we don't have enough permissions, the resulting behavior is going to be difficult to understand and troubleshoot. I'm thinking that maybe if you let people pick the behavior they want that becomes more comprehensible. It's also a nice insurance policy: say for the sake of argument we make switch-to-table-owner the new default. If that new behavior causes something to happen to somebody that they don't like, they can always turn it off, even if they are a highly privileged user who doesn't "need" to turn it off. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Commitfest 2023-03 starting tomorrow!
Status summary: Needs review: 116. Waiting on Author: 30. Ready for Committer: 32. Committed: 94. Moved to next CF: 17. Returned with Feedback: 6. Rejected: 6. Withdrawn: 18. Total: 319.

Ok, here are the patches that have been stuck in "Waiting on Author" for a while. I divided them into three groups.

* The first group have been stuck for over a week and mostly look like they should be RwF. Some I guess just moved to next CF. But some of them I'm uncertain if I should leave or if they really should be RfC or NR.
* The other two groups have had some updates in the last week (actually I used 10 days). But some of them still look like they're pretty much dead for this CF and should either be moved forward or RwF or Rejected.

So here's the triage list. I'm going to send emails and start clearing out the patches pretty much right away. Some of these are pretty clearcut.

Nothing in over a week:
--

* Better infrastructure for automated testing of concurrency issues - Consensus that this is desirable. But it's not clear what it's actually waiting on Author for. RwF?
* Provide the facility to set binary format output for specific OID's per session - I think Dave was looking for feedback and got it from Tom and Peter. I don't actually see a specific patch here but there are two patches linked in the original message. There seems to be enough feedback to proceed but nobody's working on it. RwF?
* pg_visibility's pg_check_visible() yields false positive when working in parallel with autovacuum - Bug, but tentatively a false positive...
* CAST( ... ON DEFAULT) - "it'll have to wait till there's something solid from the committee" -- Rejected?
* Fix tab completion MERGE - Partly committed but v9-0002-psql-Add-PartialMatches-macro-for-better-tab-complet.patch remains. There was a review from Dean Rasheed. Move to next CF?
* Fix recovery conflict SIGUSR1 handling - This looks like a suite of bug fixes and looks like it should be Needs Review or Ready for Commit
* Prefetch the next tuple's memory during seqscans - David Rowley said it was dependent on "heapgettup() refactoring" which has been refactored. So is it now Needs Review or Ready for Commit? Is it likely to happen this CF?
* Pluggable toaster - This seems to have digressed from the original patch. There were patches early on and a lot of feedback. Is the result that the original patches are Rejected or are they still live?
* psql - refactor echo code - "I think this patch requires an up-to-date summary and explanation" from Peter. But it seems like Tom was ok with it and just had some additional improvements he wanted that were added. It sounds like this might be "Ready for Commit" if someone familiar with the patch looked at it.
* Push aggregation down to base relations and joins - Needs a significant rebase (since March 1).
* Remove self join on a unique column - An offer of a Bounty! There was one failing test which was apparently fixed? But it looks like this should be in Needs Review or Ready for Commit.
* Split index and table statistics into different types of stats - Was stuck on "Generate pg_stat_get_xact*() functions with Macros" which was committed. So "Ready for Commit" now?
* Default to ICU during initdb - Partly committed, 0001 waiting until after CF
* suppressing useless wakeups in logical/worker.c - Got feedback March 17. Doesn't look like it's going to be ready this CF.
* explain analyze rows=%.0f - Patch updated January, but I think Tom still had some simple if tedious changes he asked for
* Fix order of checking ICU options in initdb and create database - Feedback last November but no further questions or progress
* Introduce array_shuffle() and array_sample() functions - Feedback from Tom last September. No further questions or progress

Status Updates in last week:

* Some revises in adding sorting path - Got feedback Feb 21 and author responded but doesn't look like it's going to be ready this CF
* Add TAP tests for psql \g piped into program - Peter Eisentraut asked for a one-line change, otherwise it looks like it's Ready for Commit?
* Improvements to Meson docs - Some feedback March 15 but no response. I assume this is still in play

Emails in last week:
---

* RADIUS tests and improvements - last feedback March 20, last patch March 4. Should probably be moved to the next CF unless there's progress soon.
* Direct SSL Connections - (This is mine) Code for SSL is pretty finished. The last patch for ALPN support needs a bit of polish. I'll be doing that momentarily.
* Fix alter subscription concurrency errors - "patch as-submitted is pretty uninteresting" and "patch that I don't care much about" ... I guess this is Rejected or Withdrawn
* Fix improper qual pushdown after applying outer join identity 3 - Tom Lane's patch. Active discussion as of
Re: Add LZ4 compression in pg_dump
--- Original Message --- On Friday, March 24th, 2023 at 10:30 AM, gkokola...@pm.me wrote: > > --- Original Message --- > On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra > tomas.von...@enterprisedb.com wrote: > > > This leaves the empty-data issue (which we have a fix for) and the > > switch to LZ4F. And then the zstd part. > > Please expect promptly a patch for the switch to frames. Please find the expected patch attached. Note that the bulk of the patch is code unification, variable renaming to something more appropriate, and comment addition. These are changes that are not strictly necessary to switch to LZ4F. I do believe that they are essential for code hygiene after the switch and they do belong on the same commit. Cheers, //Georgios > > Cheers, > //Georgios

From c289fb8d49b680ad180a44b20fff1dc9553b7494 Mon Sep 17 00:00:00 2001 From: Georgios Kokolatos Date: Tue, 28 Mar 2023 15:48:06 + Subject: [PATCH v1] Use LZ4 frames in pg_dump's compressor API. This change allows for greater compaction of data, especially so in very narrow relations, by avoiding at least a compaction header and footer per row. Since LZ4 frames are now used by both compression APIs, some code deduplication opportunities have become obvious and are also implemented. Reported by: Justin Pryzby --- src/bin/pg_dump/compress_lz4.c | 358 ++--- 1 file changed, 244 insertions(+), 114 deletions(-) diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c index fc2f4e116d..078dc35cd6 100644 --- a/src/bin/pg_dump/compress_lz4.c +++ b/src/bin/pg_dump/compress_lz4.c @@ -17,7 +17,6 @@ #include "compress_lz4.h" #ifdef USE_LZ4 -#include <lz4.h> #include <lz4frame.h> /* @@ -29,102 +28,279 @@ #endif /*-- - * Compressor API - *-- + * Common to both APIs */ -typedef struct LZ4CompressorState +/* + * State used for LZ4 (de)compression by both APIs. + */ +typedef struct LZ4State { - char *outbuf; - size_t outsize; -} LZ4CompressorState; + /* + * Used by the File API to keep track of the file stream. 
+ */ + FILE *fp; + + LZ4F_preferences_t prefs; + + LZ4F_compressionContext_t ctx; + LZ4F_decompressionContext_t dtx; + + /* + * Used by the File API's lazy initialization. + */ + bool inited; + + /* + * Used by the File API to distinguish between compression + * and decompression operations. + */ + bool compressing; + + /* + * Used by the Compressor API to mark if the compression + * headers have been written after initialization. + */ + bool needs_header_flush; + + size_t buflen; + char *buffer; + + /* + * Used by the File API to store already uncompressed + * data that the caller has not consumed. + */ + size_t overflowalloclen; + size_t overflowlen; + char *overflowbuf; + + /* + * Used by both APIs to keep track of the compressed + * data length stored in the buffer. + */ + size_t compressedlen; + + /* + * Used by both APIs to keep track of error codes. + */ + size_t errcode; +} LZ4State; + +/* + * Initialize the required LZ4State members for compression. Write the LZ4 frame + * header in a buffer keeping track of its length. Users of this function can + * choose when and how to write the header to a file stream. + * + * Returns true on success. In case of a failure returns false, and stores the + * error code in state->errcode. + */ +static bool +LZ4_compression_state_init(LZ4State *state) +{ + size_t status; + + state->buflen = LZ4F_compressBound(DEFAULT_IO_BUFFER_SIZE, &state->prefs); + + /* + * LZ4F_compressBegin requires a buffer that is greater or equal to + * LZ4F_HEADER_SIZE_MAX. Verify that the requirement is met. 
+ */ + if (state->buflen < LZ4F_HEADER_SIZE_MAX) + state->buflen = LZ4F_HEADER_SIZE_MAX; + + status = LZ4F_createCompressionContext(&state->ctx, LZ4F_VERSION); + if (LZ4F_isError(status)) + { + state->errcode = status; + return false; + } + + state->buffer = pg_malloc(state->buflen); + status = LZ4F_compressBegin(state->ctx, +state->buffer, state->buflen, + &state->prefs); + if (LZ4F_isError(status)) + { + state->errcode = status; + return false; + } + + state->compressedlen = status; + + return true; +} + +/*-- + * Compressor API + *-- + */ /* Private routines that support LZ4 compressed data I/O */ -static void ReadDataFromArchiveLZ4(ArchiveHandle *AH, CompressorState *cs); -static void WriteDataToArchiveLZ4(ArchiveHandle *AH, CompressorState *cs, - const void *data, size_t dLen); -static void EndCompressorLZ4(ArchiveHandle *AH, CompressorState *cs); static void ReadDataFromArchiveLZ4(ArchiveHandle *AH, CompressorState *cs) { - LZ4_streamDecode_t lz4StreamDecode; - char *buf; - char *decbuf; - size_t buflen; - size_t cnt; - - buflen = DEFAULT_IO_BUFFER_SIZE; - buf = pg_malloc(buflen); - decbuf = pg_malloc(buflen); + size_t r; + size_t readbuflen; + char *outbuf; + char *readbuf; +
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
On Tue, 28 Mar 2023 at 16:54, Denis Laxalde wrote: > I had a look into your patchset (v16), did a quick review and played a > bit with the feature. > > Patch 2 is missing the documentation about PQcancelSocket() and contains > a few typos; please find attached a (fixup) patch to correct these. Thanks, applied that patch and attached a new patchset > Namely, I wonder why it returns a PGcancelConn and what's the > point of requiring the user to call PQcancelStatus() to see if something > went wrong. Maybe it could be defined as: > > int PQcancelSend(PGcancelConn *cancelConn); > > where the return value would be status? And the user would only need to > call PQcancelErrorMessage() in case of error. This would leave only one > single way to create a PGcancelConn value (i.e. PQcancelConn()), which > seems less confusing to me. To clarify what you mean, the API would then be like this: PGcancelConn *cancelConn = PQcancelConn(conn); if (PQcancelSend(cancelConn) == CONNECTION_BAD) { printf("ERROR %s\n", PQcancelErrorMessage(cancelConn)); exit(1); } Instead of: PGcancelConn *cancelConn = PQcancelSend(conn); if (PQcancelStatus(cancelConn) == CONNECTION_BAD) { printf("ERROR %s\n", PQcancelErrorMessage(cancelConn)); exit(1); } Those are so similar, that I have no preference either way. If more people prefer one over the other I'm happy to change it, but for now I'll keep it as is. > As part of my testing, I've implemented non-blocking cancellation in > Psycopg, based on v16 on this patchset. Overall this worked fine and > seems useful; if you want to try it: > > https://github.com/dlax/psycopg3/tree/pg16/non-blocking-pqcancel That's great to hear! I'll try to take a closer look at that change tomorrow. > (The only thing I found slightly inconvenient is the need to convey the > connection encoding (from PGconn) when handling error message from the > PGcancelConn.) Could you expand a bit more on this? And if you have any idea on how to improve the API with regards to this? 
Attachments (binary): v17-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch, v17-0002-Copy-and-store-addrinfo-in-libpq-owned-private-m.patch, v17-0003-Return-2-from-pqReadData-on-EOF.patch, v17-0004-Add-non-blocking-version-of-PQcancel.patch, v17-0005-Start-using-new-libpq-cancel-APIs.patch
Re: Memory leak from ExecutorState context?
On 3/28/23 15:17, Jehan-Guillaume de Rorthais wrote: > On Tue, 28 Mar 2023 00:43:34 +0200 > Tomas Vondra wrote: > >> On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote: >>> Please, find in attachment a patch to allocate bufFiles in a dedicated >>> context. I picked up your patch, backpatch'd it, went through it and did >>> some minor changes to it. I have some comments/questions though. >>> >>> 1. I'm not sure why we must allocate the "HashBatchFiles" new context >>> under ExecutorState and not under hashtable->hashCxt? >>> >>> The only references I could find was in hashjoin.h:30: >>> >>>/* [...] >>> * [...] (Exception: data associated with the temp files lives in the >>> * per-query context too, since we always call buffile.c in that >>> context.) >>> >>> And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this >>> original comment in the patch): >>> >>>/* [...] >>> * Note: it is important always to call this in the regular executor >>> * context, not in a shorter-lived context; else the temp file buffers >>> * will get messed up. >>> >>> >>> But these are not an explanation of why BufFile related allocations must be >>> under a per-query context. >>> >> >> Doesn't that simply describe the current (unpatched) behavior where >> BufFile is allocated in the per-query context? > > I wasn't sure. The first quote from hashjoin.h seems to describe a stronger > rule about «**always** call buffile.c in per-query context». But maybe it > ought > to be «always call buffile.c from one of the sub-query context»? I assume the > aim is to enforce the tmp files removal on query end/error? > I don't think we need this info for tempfile cleanup - CleanupTempFiles relies on the VfdCache, which does malloc/realloc (so it's not using memory contexts at all). I'm not very familiar with this part of the code, so I might be missing something. But you can try that - just stick an elog(ERROR) somewhere into the hashjoin code, and see if that breaks the cleanup. 
Not an explicit proof, but if there was some hard-wired requirement in which memory context to allocate BufFile stuff, I'd expect it to be documented in buffile.c. But that actually says this: * Note that BufFile structs are allocated with palloc(), and therefore * will go away automatically at query/transaction end. Since the underlying * virtual Files are made with OpenTemporaryFile, all resources for * the file are certain to be cleaned up even if processing is aborted * by ereport(ERROR). The data structures required are made in the * palloc context that was current when the BufFile was created, and * any external resources such as temp files are owned by the ResourceOwner * that was current at that time. which I take as confirmation that it's legal to allocate BufFile in any memory context, and that cleanup is handled by the cache in fd.c. >> I mean, the current code calls BufFileCreateTemp() without switching the >> context, so it's in the ExecutorState. But with the patch it very clearly is >> not. >> >> And I'm pretty sure the patch should do >> >> hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt, >> "HashBatchFiles", >> ALLOCSET_DEFAULT_SIZES); >> >> and it'd still work. Or why do you think we *must* allocate it under >> ExecutorState? > > That was actually my very first patch and it indeed worked. But I was confused > about the previous quoted code comments. That's why I kept your original code > and decided to raise the discussion here. > IIRC I was just lazy when writing the experimental patch, there was not much thought about stuff like this. > Fixed in new patch in attachment. > >> FWIW The comment in hashjoin.h needs updating to reflect the change. > > Done in the last patch. Is my rewording accurate? > >>> 2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context >>> switch seems fragile as it could be forgotten in future code paths/changes. 
>>> So I added an Assert() in the function to make sure the current memory >>> context is "HashBatchFiles" as expected. >>> Another way to tie this up might be to pass the memory context as >>> argument to the function. >>> ... Or maybe I'm over precautionary. >>> >> >> I'm not sure I'd call that fragile, we have plenty other code that >> expects the memory context to be set correctly. Not sure about the >> assert, but we don't have similar asserts anywhere else. > > I mostly sticked it there to stimulate the discussion around this as I needed > to scratch that itch. > >> But I think it's just ugly and overly verbose > > +1 > > Your patch was just a demo/debug patch by the time. It needed some cleanup now > :) > >> it'd be much nicer to e.g. pass the memory context as a parameter, and do >> the switch inside. > > That was a proposition in my previous mail, so I did it in the new patch. > Let's > see what other
Re: Initial Schema Sync for Logical Replication
On Tue, Mar 28, 2023 at 6:47 PM Amit Kapila wrote: > > On Mon, Mar 27, 2023 at 8:17 AM Masahiko Sawada wrote: > > > > On Fri, Mar 24, 2023 at 11:51 PM Kumar, Sachin wrote: > > > > > > > From: Amit Kapila > > > > > I think we won't be able to use same snapshot because the transaction > > > > > will > > > > > be committed. > > > > > In CreateSubscription() we can use the transaction snapshot from > > > > > walrcv_create_slot() till walrcv_disconnect() is called.(I am not sure > > > > > about this part maybe walrcv_disconnect() calls the commits > > > > > internally ?). > > > > > So somehow we need to keep this snapshot alive, even after transaction > > > > > is committed(or delay committing the transaction , but we can have > > > > > CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart > > > > > before tableSync is able to use the same snapshot.) > > > > > > > > > > > > > Can we think of getting the table data as well along with schema via > > > > pg_dump? Won't then both schema and initial data will correspond to the > > > > same snapshot? > > > > > > Right , that will work, Thanks! > > > > While it works, we cannot get the initial data in parallel, no? > > > > Another possibility is that we dump/restore the schema of each table > along with its data. One thing we can explore is whether the parallel > option of dump can be useful here. Do you have any other ideas? A downside of the idea of dumping both table schema and table data would be that we need to temporarily store data twice the size of the table (the dump file and the table itself) during the load. One might think that we can redirect the pg_dump output into the backend so that it can load it via SPI, but it doesn't work since "COPY tbl FROM stdin;" doesn't work via SPI. The --inserts option of pg_dump could help it out but it makes restoration very slow. 
>
> One related idea is that currently, we fetch the table list
> corresponding to publications in subscription and create the entries
> for those in pg_subscription_rel during Create Subscription, can we
> think of postponing that work till after the initial schema sync? We
> seem to be already storing publications list in pg_subscription, so it
> appears possible if we somehow remember the value of copy_data. If
> this is feasible then I think that may give us the flexibility to
> perform the initial sync at a later point by the background worker.

It sounds possible. With this idea, we will be able to have the apply
worker restore the table schemas (and create pg_subscription_rel
entries) as the first thing.

Another point we might need to consider is that the initial schema sync
(i.e. creating tables) and creating pg_subscription_rel entries need to
be done in the same transaction. Otherwise, we could end up committing
only one of the two changes. I think it depends on how we restore the
schema data.

> > > > > I think we can have same issues as you mentioned New table t1 is added
> > > > > to the publication , User does a refresh publication.
> > > > > pg_dump / pg_restore restores the table definition. But before
> > > > > tableSync can start, steps from 2 to 5 happen on the publisher.
> > > > > > 1. Create Table t1(c1, c2); --LSN: 90 2. Insert t1 (1, 1); --LSN 100
> > > > > > 3. Insert t1 (2, 2); --LSN 110 4. Alter t1 Add Column c3; --LSN 120
> > > > > > 5. Insert t1 (3, 3, 3); --LSN 130
> > > > > And table sync errors out
> > > > > There can be one more issue , since we took the pg_dump without
> > > > snapshot (wrt to replication slot). 
> > > > > > > > > > > > > To avoid both the problems mentioned for Refresh Publication, we can do > > > > one of the following: (a) create a new slot along with a snapshot for > > > > this > > > > operation and drop it afterward; or (b) using the existing slot, > > > > establish a > > > > new snapshot using a technique proposed in email [1]. > > > > > > > > > > Thanks, I think option (b) will be perfect, since we don’t have to create > > > a new slot. > > > > Regarding (b), does it mean that apply worker stops streaming, > > requests to create a snapshot, and then resumes the streaming? > > > > Shouldn't this be done by the backend performing a REFRESH publication? Hmm, I might be missing something but the idea (b) uses the existing slot to establish a new snapshot, right? What existing replication slot do we use for that? I thought it was the one used by the apply worker. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Memory leak from ExecutorState context?
Hi, Sorry for the late answer, I was reviewing the first patch and it took me some time to study and dig around. On Thu, 23 Mar 2023 08:07:04 -0400 Melanie Plageman wrote: > On Fri, Mar 10, 2023 at 1:51 PM Jehan-Guillaume de Rorthais > wrote: > > > So I guess the best thing would be to go through these threads, see what > > > the status is, restart the discussion and propose what to do. If you do > > > that, I'm happy to rebase the patches, and maybe see if I could improve > > > them in some way. > > > > OK! It took me some time, but I did it. I'll try to sum up the situation as > > simply as possible. > > Wow, so many memories! > > I'm excited that someone looked at this old work (though it is sad that > a customer faced this issue). And, Jehan, I really appreciate your great > summarization of all these threads. This will be a useful reference. Thank you! > > 1. "move BufFile stuff into separate context" > > [...] > >I suppose this simple one has been forgotten in the fog of all other > >discussions. Also, this probably worth to be backpatched. > > I agree with Jehan-Guillaume and Tomas that this seems fine to commit > alone. This is a WIP. > > 2. "account for size of BatchFile structure in hashJoin" > > [...] > > I think I would have to see a modern version of a patch which does this > to assess if it makes sense. But, I probably still agree with 2019 > Melanie :) I volunteer to work on this after the memory context patch, unless someone grab it in the meantime. > [...] > On Mon, Mar 20, 2023 at 10:12 AM Jehan-Guillaume de Rorthais > wrote: > > BNJL and/or other considerations are for 17 or even after. In the meantime, > > Melanie, who authored BNLJ, +1 the balancing patch as it can coexists with > > other discussed solutions. No one down vote since then. Melanie, what is > > your opinion today on this patch? Did you change your mind as you worked > > for many months on BNLJ since then? 
> > So, in order to avoid deadlock, my design of adaptive hash join/block > nested loop hash join required a new parallelism concept not yet in > Postgres at the time -- the idea of a lone worker remaining around to do > work when others have left. > > See: BarrierArriveAndDetachExceptLast() > introduced in 7888b09994 > > Thomas Munro had suggested we needed to battle test this concept in a > more straightforward feature first, so I implemented parallel full outer > hash join and parallel right outer hash join with it. > > https://commitfest.postgresql.org/42/2903/ > > This has been stalled ready-for-committer for two years. It happened to > change timing such that it made an existing rarely hit parallel hash > join bug more likely to be hit. Thomas recently committed our fix for > this in 8d578b9b2e37a4d (last week). It is my great hope that parallel > full outer hash join goes in before the 16 feature freeze. This is really interesting to follow. I kinda feel/remember how this could be useful for your BNLJ patch. It's good to see things are moving, step by step. Thanks for the pointers. > If it does, I think it could make sense to try and find committable > smaller pieces of the adaptive hash join work. As it is today, parallel > hash join does not respect work_mem, and, in some sense, is a bit broken. > > I would be happy to work on this feature again, or, if you were > interested in picking it up, to provide review and any help I can if for > you to work on it. I don't think I would be able to pick up such a large and complex patch. But I'm interested to help, test and review, as far as I can! Regards,
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
Hi Jelte, I had a look into your patchset (v16), did a quick review and played a bit with the feature. Patch 2 is missing the documentation about PQcancelSocket() and contains a few typos; please find attached a (fixup) patch to correct these. --- a/src/interfaces/libpq/libpq-fe.h +++ b/src/interfaces/libpq/libpq-fe.h @@ -321,16 +328,28 @@ extern PostgresPollingStatusType PQresetPoll(PGconn *conn); /* Synchronous (blocking) */ extern void PQreset(PGconn *conn); +/* issue a cancel request */ +extern PGcancelConn * PQcancelSend(PGconn *conn); [...] Maybe I'm missing something, but this function above seems a bit strange. Namely, I wonder why it returns a PGcancelConn and what's the point of requiring the user to call PQcancelStatus() to see if something got wrong. Maybe it could be defined as: int PQcancelSend(PGcancelConn *cancelConn); where the return value would be status? And the user would only need to call PQcancelErrorMessage() in case of error. This would leave only one single way to create a PGcancelConn value (i.e. PQcancelConn()), which seems less confusing to me. Jelte Fennema wrote: > Especially since I ran into another use case that I would want to use > this patch for recently: Adding an async cancel function to Python > it's psycopg3 library. This library exposes both a Connection class > and an AsyncConnection class (using python its asyncio feature). But > one downside of the AsyncConnection type is that it doesn't have a > cancel method. As part of my testing, I've implemented non-blocking cancellation in Psycopg, based on v16 on this patchset. Overall this worked fine and seems useful; if you want to try it: https://github.com/dlax/psycopg3/tree/pg16/non-blocking-pqcancel (The only thing I found slightly inconvenient is the need to convey the connection encoding (from PGconn) when handling error message from the PGcancelConn.) 
Cheers, Denis >From a5f9cc680ffa520b05fe34b7cac5df2e60a6d4ad Mon Sep 17 00:00:00 2001 From: Denis Laxalde Date: Tue, 28 Mar 2023 16:06:42 +0200 Subject: [PATCH] fixup! Add non-blocking version of PQcancel --- doc/src/sgml/libpq.sgml | 16 +++- src/interfaces/libpq/fe-connect.c| 2 +- src/test/modules/libpq_pipeline/libpq_pipeline.c | 2 +- 3 files changed, 17 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 29f08a4317..aa404c4d15 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -5795,7 +5795,7 @@ PGcancelConn *PQcancelSend(PGconn *conn); options as the the original PGconn. So when the original connection is encrypted (using TLS or GSS), the connection for the cancel request is encrypted in the same way. Any connection options - that only are only used during authentication or after authentication of + that are only used during authentication or after authentication of the client are ignored though, because cancellation requests do not require authentication and the connection is closed right after the cancellation request is submitted. @@ -5912,6 +5912,20 @@ ConnStatusType PQcancelStatus(const PGcancelConn *conn); + + PQcancelSocketPQcancelSocket + + + + A version of that can be used for + cancellation connections. + +int PQcancelSocket(PGcancelConn *conn); + + + + + PQcancelPollPQcancelPoll diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 74e337fddf..16af7303d4 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -808,7 +808,7 @@ PQcancelConn(PGconn *conn) /* * Cancel requests should not iterate over all possible hosts. The request * needs to be sent to the exact host and address that the original - * connection used. So we we manually create the host and address arrays + * connection used. 
So we manually create the host and address arrays * with a single element after freeing the host array that we generated * from the connection options. */ diff --git a/src/test/modules/libpq_pipeline/libpq_pipeline.c b/src/test/modules/libpq_pipeline/libpq_pipeline.c index e8e904892c..6764ab513b 100644 --- a/src/test/modules/libpq_pipeline/libpq_pipeline.c +++ b/src/test/modules/libpq_pipeline/libpq_pipeline.c @@ -89,7 +89,7 @@ pg_fatal_impl(int line, const char *fmt,...) /* * Check that the query on the given connection got cancelled. * - * This is a function wrapped in a macrco to make the reported line number + * This is a function wrapped in a macro to make the reported line number * in an error match the line number of the invocation. */ #define confirm_query_cancelled(conn) confirm_query_cancelled_impl(__LINE__, conn) -- 2.30.2
Re: Support logical replication of DDLs
On 3/27/23 2:37 AM, Amit Kapila wrote:

On Mon, Mar 27, 2023 at 2:52 AM Tom Lane wrote:

And TBH, I don't think that I quite believe the premise in the first place. The whole point of using logical rather than physical replication is that the subscriber installation(s) aren't exactly like the publisher. Given that, how can we expect that automated DDL replication is going to do the right thing often enough to be a useful tool rather than a disastrous foot-gun?

One of the major use cases as mentioned in the initial email was for online version upgrades. And also, people would be happy to automatically sync the schema for cases where the logical replication is set up to get a subset of the data via features like row filters. Having said that, I agree with you that it is very important to define the scope of this feature if we want to see it becoming reality.

To echo Amit, this is actually one area where PostgreSQL replication lags behind (no pun intended) other mature RDBMSes. As Amit says, the principal use case is around major version upgrades, but also migration between systems or moving data/schemas between systems that speak the PostgreSQL protocol. All of these are becoming increasingly common as PostgreSQL is taking on more workloads that are sensitive to downtime or are distributed in nature.

There are definitely footguns with logical replication of DDL -- I've seen this from reading the manuals of other systems that support this feature and in my own experiments. However, like many features, users have strategies they use to avoid footgun scenarios. For example, in systems that use logical replication as part of their HA, users will either:

* Not replicate DDL, but use some sort of rolling orchestration process to place it on each instance
* Replicate DDL, but coordinate it with some kind of global lock
* Replicate only a subset of DDL, possibly with lock coordination

I'll comment on the patch scope further downthread. 
I agree it's very big -- I had given some of that feedback privately a few months back -- and it could benefit from the "step back, holistic review." For example, I was surprised that a fairly common pattern[1] did not work due to changes we made when addressing a CVE (some follow-up work was proposed but we haven't done it yet).

I do agree this patch would benefit from stepping back, and I do think we can work through many of the issues. From listening to users and prospective users, it's pretty clear we need to support DDL replication in some capacity.

Thanks,

Jonathan

[1] https://www.postgresql.org/message-id/263bea1c-a897-417d-3765-ba6e1e24711e%40postgresql.org

OpenPGP_signature Description: OpenPGP digital signature
Re: "variable not found in subplan target list"
I wrote: > The planner is reducing the scan of target_parted to > a dummy scan, as is reasonable, but it forgets to > provide ctid as an output from that scan; then the > parent join node is unhappy because it does have > a ctid output. So it looks like the problem is some > shortcut we take while creating the dummy scan. Oh, actually the problem is in distribute_row_identity_vars, which is supposed to handle this case, but it thinks it doesn't have to back-fill the rel's reltarget. Wrong. Now that I see the problem, I wonder if we can't reproduce a similar symptom without MERGE, which would mean that v14 has the issue too. The attached seems to fix it, but I'm going to look for a non-MERGE test case before pushing. regards, tom lane diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c index 9d377385f1..c1b1557570 100644 --- a/src/backend/optimizer/util/appendinfo.c +++ b/src/backend/optimizer/util/appendinfo.c @@ -21,6 +21,7 @@ #include "nodes/nodeFuncs.h" #include "optimizer/appendinfo.h" #include "optimizer/pathnode.h" +#include "optimizer/planmain.h" #include "parser/parsetree.h" #include "utils/lsyscache.h" #include "utils/rel.h" @@ -994,9 +995,10 @@ distribute_row_identity_vars(PlannerInfo *root) * certainly process no rows. Handle this edge case by re-opening the top * result relation and adding the row identity columns it would have used, * as preprocess_targetlist() would have done if it weren't marked "inh". - * (This is a bit ugly, but it seems better to confine the ugliness and - * extra cycles to this unusual corner case.) We needn't worry about - * fixing the rel's reltarget, as that won't affect the finished plan. + * Then re-run build_base_rel_tlists() to ensure that the added columns + * get propagated to the relation's reltarget. (This is a bit ugly, but + * it seems better to confine the ugliness and extra cycles to this + * unusual corner case.) 
*/ if (root->row_identity_vars == NIL) { @@ -1006,6 +1008,8 @@ distribute_row_identity_vars(PlannerInfo *root) add_row_identity_columns(root, result_relation, target_rte, target_relation); table_close(target_relation, NoLock); + build_base_rel_tlists(root, root->processed_tlist); + /* There are no ROWID_VAR Vars in this case, so we're done. */ return; }
Re: Missing update of all_hasnulls in BRIN opclasses
Hi,

It took me a while but I finally got back to reworking this to use the bt_info bit, as proposed by Alvaro. And it turned out to work great, because (a) it's a tuple-level flag, i.e. the right place, and (b) it does not overload existing flags. This greatly simplified the code in add_values_to_range and (especially) union_tuples, making it much easier to understand, I think.

One disadvantage is we are unable to see which ranges are empty in current pageinspect, but 0002 addresses that by adding an "empty" column to the brin_page_items() output. That's a matter for master only, though. It's a trivial patch and it makes it easier/possible to test this, so we should consider squeezing it into PG16.

I did quite a bit of testing - the attached 0003 adds extra tests, but I don't propose to get this committed as is - it's rather overkill. Maybe some reduced version of it ...

The hardest thing to test is the union_tuples() part, as it requires concurrent operations with "correct" timing. Easy to simulate by breakpoints in GDB, not so much in plain regression/TAP tests. There's also a stress test, doing a lot of randomized summarizations, etc. Without the fix this failed in maybe 30% of runs; now I did ~100 runs without a single failure.

I haven't done any backporting, but I think it should be simpler than with the earlier approach. I wonder if we need to care about starting to use the previously unused bit - I don't think so, in the worst case we'll just ignore it, but maybe I'm missing something (e.g. when using physical replication).

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

From 10efaf9964806e5a30818994e3cfda879bb90171 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Sun, 8 Jan 2023 16:43:06 +0100 Subject: [PATCH 1/3] Fix handling of NULLs in BRIN indexes

BRIN indexes did not properly distinguish between summaries for empty (no rows) and all-NULL ranges. 
All summaries were initialized with allnulls=true, and the opclasses simply reset allnulls to false when processing the first non-NULL value. This, however, fails if the range starts with a NULL value (or a sequence of NULL values), in which case we forget the range contains NULL values.

This happens because the allnulls flag is used for two separate purposes - to mark empty ranges (not representing any rows yet) and ranges containing only NULL values. Opclasses don't know which of these cases it is, and so don't know whether to set hasnulls=true. Setting hasnulls=true in both cases would make it correct, but it would also make BRIN indexes useless for queries with IS NULL clauses - all ranges start empty (and thus allnulls=true), so all ranges would end up with either allnulls=true or hasnulls=true.

The severity of the issue is somewhat reduced by the fact that it only happens when adding values to an existing summary with allnulls=true, not when the summarization is processing values in bulk (e.g. during CREATE INDEX or automatic summarization). In that case the flags were updated in a slightly different way, not forgetting the NULL values.

This introduces a new flag marking index tuples representing ranges with no rows. Luckily we have an unused bit in the BRIN tuple header that we can use for this. We still store index tuples for empty ranges, because otherwise we'd not be able to say whether a range is empty or not yet summarized, and we'd have to process them for any query.

Backpatch to 11. The issue exists since BRIN indexes were introduced in 9.5, but older releases are already EOL. 
Backpatch-through: 11 Reviewed-by: Alvaro Herrera, Justin Pryzby, Matthias van de Meent Discussion: https://postgr.es/m/402430e4-7d9d-6cf1-09ef-464d80aff...@enterprisedb.com --- src/backend/access/brin/brin.c| 115 +- src/backend/access/brin/brin_tuple.c | 15 ++- src/include/access/brin_tuple.h | 6 +- ...summarization-and-inprogress-insertion.out | 8 +- ...ummarization-and-inprogress-insertion.spec | 1 + 5 files changed, 137 insertions(+), 8 deletions(-) diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index 53e4721a54e..162a0c052aa 100644 --- a/src/backend/access/brin/brin.c +++ b/src/backend/access/brin/brin.c @@ -592,6 +592,17 @@ bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm) bval = >bt_columns[attno - 1]; + /* + * If the BRIN tuple indicates that this range is empty, + * we can skip it: there's nothing to match. We don't + * need to examine the next columns. + */ + if (dtup->bt_empty_range) + { + addrange = false; + break; + } + /* * First check if there are any IS [NOT] NULL scan keys, * and if we're violating them. In that case we can @@ -1590,6 +1601,8 @@ form_and_insert_tuple(BrinBuildState *state) /* * Given two deformed tuples, adjust the first one so that it's consistent * with the summary values in
Re: Kerberos delegation support in libpq and postgres_fdw
Greetings, * Stephen Frost (sfr...@snowman.net) wrote: > * Greg Stark (st...@mit.edu) wrote: > > The CFBot says there's a function be_gssapi_get_proxy() which is > > undefined. Presumably this is a missing #ifdef or a definition that > > should be outside an #ifdef. > > Yup, just a couple of missing #ifdef's. > > Updated and rebased patch attached. ... and a few more. Apparently hacking on a plane without enough sleep leads to changing ... and unchanging configure flags before testing. Thanks, Stephen From 9808fd4eb4920e40468709af7325a27b18066cec Mon Sep 17 00:00:00 2001 From: Stephen Frost Date: Mon, 28 Feb 2022 20:17:55 -0500 Subject: [PATCH] Add support for Kerberos credential delegation Support GSSAPI/Kerberos credentials being delegated to the server by a client. With this, a user authenticating to PostgreSQL using Kerberos (GSSAPI) credentials can choose to delegate their credentials to the PostgreSQL server (which can choose to accept them, or not), allowing the server to then use those delegated credentials to connect to another service, such as with postgres_fdw or dblink or theoretically any other service which is able to be authenticated using Kerberos. Both postgres_fdw and dblink are changed to allow non-superuser password-less connections but only when GSSAPI credentials have been proxied to the server by the client and GSSAPI is used to authenticate to the remote system. 
Authors: Stephen Frost, Peifeng Qiu Discussion: https://postgr.es/m/co1pr05mb8023cc2cb575e0faad7df4f8a8...@co1pr05mb8023.namprd05.prod.outlook.com --- contrib/dblink/dblink.c | 127 --- contrib/dblink/expected/dblink.out| 4 +- contrib/postgres_fdw/connection.c | 72 +++- .../postgres_fdw/expected/postgres_fdw.out| 19 +- contrib/postgres_fdw/option.c | 3 + contrib/postgres_fdw/sql/postgres_fdw.sql | 3 +- doc/src/sgml/config.sgml | 17 + doc/src/sgml/dblink.sgml | 5 +- doc/src/sgml/libpq.sgml | 41 +++ doc/src/sgml/monitoring.sgml | 9 + doc/src/sgml/postgres-fdw.sgml| 5 +- src/backend/catalog/system_views.sql | 3 +- src/backend/foreign/foreign.c | 1 + src/backend/libpq/auth.c | 13 +- src/backend/libpq/be-gssapi-common.c | 51 +++ src/backend/libpq/be-secure-gssapi.c | 26 +- src/backend/utils/activity/backend_status.c | 1 + src/backend/utils/adt/pgstatfuncs.c | 20 +- src/backend/utils/init/postinit.c | 8 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 6 +- src/include/libpq/auth.h | 1 + src/include/libpq/be-gssapi-common.h | 3 + src/include/libpq/libpq-be.h | 2 + src/include/utils/backend_status.h| 1 + src/interfaces/libpq/exports.txt | 1 + src/interfaces/libpq/fe-auth.c| 15 +- src/interfaces/libpq/fe-connect.c | 17 + src/interfaces/libpq/fe-secure-gssapi.c | 23 +- src/interfaces/libpq/libpq-fe.h | 1 + src/interfaces/libpq/libpq-int.h | 2 + src/test/kerberos/Makefile| 3 + src/test/kerberos/t/001_auth.pl | 335 -- src/test/perl/PostgreSQL/Test/Utils.pm| 27 ++ src/test/regress/expected/rules.out | 11 +- 36 files changed, 752 insertions(+), 135 deletions(-) diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 78a8bcee6e..d4dd338055 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -48,6 +48,7 @@ #include "funcapi.h" #include "lib/stringinfo.h" #include "libpq-fe.h" +#include "libpq/libpq-be.h" #include "libpq/libpq-be-fe-helpers.h" #include "mb/pg_wchar.h" 
#include "miscadmin.h" @@ -111,7 +112,8 @@ static HeapTuple get_tuple_of_interest(Relation rel, int *pkattnums, int pknumat static Relation get_rel_from_relname(text *relname_text, LOCKMODE lockmode, AclMode aclmode); static char *generate_relation_name(Relation rel); static void dblink_connstr_check(const char *connstr); -static void dblink_security_check(PGconn *conn, remoteConn *rconn); +static bool dblink_connstr_has_pw(const char *connstr); +static void dblink_security_check(PGconn *conn, remoteConn *rconn, const char *connstr); static void dblink_res_error(PGconn *conn, const char *conname, PGresult *res, bool fail, const char *fmt,...) pg_attribute_printf(5, 6); static char *get_connect_string(const char *servername); @@ -213,7 +215,7 @@ dblink_get_conn(char *conname_or_str, errmsg("could not establish connection"), errdetail_internal("%s", msg))); } - dblink_security_check(conn, rconn); + dblink_security_check(conn, rconn, connstr); if
Re: Support logical replication of global object commands
> I think that there are some (possibly) tricky challenges that haven't > been discussed yet to support replicating global objects. > > First, as for publications having global objects (roles, databases, > and tablespaces), but storing them in database specific tables like > pg_publication doesn't make sense, because it should be at some shared > place where all databases can have access to it. Maybe we need to have > a shared catalog like pg_shpublication or pg_publication_role to store > publications related to global objects or the relationship between > such publications and global objects. Second, we might need to change > the logical decoding infrastructure so that it's aware of shared > catalog changes. Thanks for the feedback. This is insightful. > Currently we need to scan only db-specific catalogs. > Finally, since we process CREATE DATABASE in a different way than > other DDLs (by cloning another database such as template1), simply > replicating the CREATE DATABASE statement would not produce the same > results as the publisher. Also, since event triggers are not fired on > DDLs for global objects, always WAL-logging such DDL statements like > the proposed patch does is not a good idea. > Given that there seems to be some tricky problems and there is a > discussion for cutting the scope to make the initial patch small[1], I > think it's better to do this work after the first version. Agreed. Regards, Zane
Re: Request for comment on setting binary format output per session
On Sun, 26 Mar 2023 at 21:30, Tom Lane wrote:
> Dave Cramer writes:
> > On Sun, 26 Mar 2023 at 18:12, Tom Lane wrote:
> >> I would not expect DISCARD ALL to reset a session-level property.
>
> > Well if we can't reset it with DISCARD ALL how would that work with
> > pgbouncer, or any pool for that matter since it doesn't know which client
> > asked for which (if any) OID's to be binary.
>
> Well, it'd need to know that, just like it already needs to know
> which clients asked for which database or which login role.

OK, IIUC what you are proposing here is that there would be a separate pool for each combination of database, user, and OIDs. This doesn't seem very flexible. For instance, if I create a UDT and then want it to be returned as binary, I have to reconfigure the pool to be able to accept a new list of OIDs. Am I misunderstanding how this would potentially work?

Dave
Re: Move definition of standard collations from initdb to pg_collation.dat
On 28.03.23 13:33, Tom Lane wrote: While we're here, do we want to adopt some other spelling of "the root locale" than "und", in view of recent discoveries about the instability of that on old ICU versions? That issue was fixed by 3b50275b12, so we can keep using the "und" spelling.
Re: TAP output format in pg_regress
> On 28 Mar 2023, at 15:26, Daniel Gustafsson wrote: > I think the attached is a good candidate for going in, but I would like to > see it > for another spin in the CF bot first. Another candidate due to a thinko which raised a compiler warning. -- Daniel Gustafsson v19-0001-pg_regress-Emit-TAP-compliant-output.patch Description: Binary data
Re: Infinite Interval
On Mon, Mar 20, 2023 at 3:16 AM Joseph Koshakow wrote:
>
> On Sun, Mar 19, 2023 at 5:13 PM Tom Lane wrote:
> >
> > Did you actually write "if TIMESTAMP_IS_NOBEGIN(dt2)" and not
> > "if (TIMESTAMP_IS_NOBEGIN(dt2))"? If the former, I'm not surprised
> > that pgindent gets confused. The parentheses are required by the
> > C standard. Your code might accidentally work because the macro
> > has parentheses internally, but call sites have no business
> > knowing that. For example, it would be completely legit to change
> > TIMESTAMP_IS_NOBEGIN to be a plain function, and then this would be
> > syntactically incorrect.
>
> Oh duh. I've been doing too much Rust development and did this without
> thinking. I've attached a patch with a fix.

Thanks for fixing this. On this latest patch, I have one comment on the code:

@@ -3047,7 +3180,30 @@ timestamptz_pl_interval_internal(TimestampTz timestamp,
 	TimestampTz result;
 	int			tz;
 
-	if (TIMESTAMP_NOT_FINITE(timestamp))
+	/*
+	 * Adding two infinites with the same sign results in an infinite
+	 * timestamp with the same sign. Adding two infintes with different signs
+	 * results in an error.
+	 */
+	if (INTERVAL_IS_NOBEGIN(span))
+	{
+		if TIMESTAMP_IS_NOEND(timestamp)
+			ereport(ERROR,
+					(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
+					 errmsg("interval out of range")));
+		else
+			TIMESTAMP_NOBEGIN(result);
+	}
+	else if (INTERVAL_IS_NOEND(span))
+	{
+		if TIMESTAMP_IS_NOBEGIN(timestamp)
+			ereport(ERROR,
+					(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
+					 errmsg("interval out of range")));
+		else
+			TIMESTAMP_NOEND(result);
+	}
+	else if (TIMESTAMP_NOT_FINITE(timestamp))

This code is duplicated in timestamp_pl_interval(). We could create a function to encode the infinity handling rules and then call it in these two places. The argument types are different, viz. Timestamp and TimestampTz, but both map to int64, so that shouldn't be a problem. But it will be slightly unreadable. Or use macros, but then it will be difficult to debug. What do you think? 
Next I will review the test changes and also make sure that every operator
that has interval as one of its operands or as its result has been covered
in the code. This is the following list:

#select oprname, oprcode from pg_operator
   where oprleft = 'interval'::regtype
      or oprright = 'interval'::regtype
      or oprresult = 'interval'::regtype;

 oprname |         oprcode
---------+-------------------------
 +       | date_pl_interval
 -       | date_mi_interval
 +       | timestamptz_pl_interval
 -       | timestamptz_mi
 -       | timestamptz_mi_interval
 =       | interval_eq
 <>      | interval_ne
 <       | interval_lt
 <=      | interval_le
 >       | interval_gt
 >=      | interval_ge
 -       | interval_um
 +       | interval_pl
 -       | interval_mi
 -       | time_mi_time
 *       | interval_mul
 *       | mul_d_interval
 /       | interval_div
 +       | time_pl_interval
 -       | time_mi_interval
 +       | timetz_pl_interval
 -       | timetz_mi_interval
 +       | interval_pl_time
 +       | timestamp_pl_interval
 -       | timestamp_mi
 -       | timestamp_mi_interval
 +       | interval_pl_date
 +       | interval_pl_timetz
 +       | interval_pl_timestamp
 +       | interval_pl_timestamptz
(30 rows)

--
Best Wishes,
Ashutosh Bapat
Re: Infinite Interval
On Sun, Mar 26, 2023 at 1:28 AM Tom Lane wrote:
>
> I think you can take it as read that simple C test programs on modern
> platforms will exhibit IEEE-compliant handling of float infinities.

For the record, I tried the attached. It gives a warning at compilation time.

$ gcc float_inf.c
float_inf.c: In function ‘main’:
float_inf.c:10:17: warning: division by zero [-Wdiv-by-zero]
   10 |     float inf = 1.0/0;
      |                 ^
float_inf.c:11:20: warning: division by zero [-Wdiv-by-zero]
   11 |     float n_inf = -1.0/0;
      |                    ^

$ ./a.out
inf = inf
-inf = -inf
inf + inf = inf
inf + -inf = -nan
-inf + inf = -nan
-inf + -inf = -inf
inf - inf = -nan
inf - -inf = inf
-inf - inf = -inf
-inf - -inf = -nan
float 0.0 equals 0.0
float 1.0 equals 1.0
 5.0 * inf = inf
 5.0 * - inf = -inf
 5.0 / inf = 0.00
 5.0 / - inf = -0.00
 inf / 5.0 = inf
 - inf / 5.0 = -inf

The changes in the patch are compliant with the observations above.

--
Best Wishes,
Ashutosh Bapat

#include <stdio.h>

void test_flag(int flag, char *flag_name);

int main(void)
{
	float inf = 1.0/0;
	float n_inf = -1.0/0;

	printf("inf = %f\n", inf);
	printf("-inf = %f\n", n_inf);

	/* Additions */
	printf("inf + inf = %f\n", inf + inf);
	printf("inf + -inf = %f\n", inf + n_inf);
	printf("-inf + inf = %f\n", n_inf + inf);
	printf("-inf + -inf = %f\n", n_inf + n_inf);

	/* Subtractions */
	printf("inf - inf = %f\n", inf - inf);
	printf("inf - -inf = %f\n", inf - n_inf);
	printf("-inf - inf = %f\n", n_inf - inf);
	printf("-inf - -inf = %f\n", n_inf - n_inf);

	if (0.0 == 0.0)
		printf("float 0.0 equals 0.0\n");
	if (0.5 + 0.5 == .64 + .36)
		printf("float 1.0 equals 1.0\n");

	/* Multiplication */
	printf(" 5.0 * inf = %f\n", 5.0 * inf);
	printf(" 5.0 * - inf = %f\n", 5.0 * n_inf);

	/* Division */
	printf(" 5.0 / inf = %f\n", 5.0 / inf);
	printf(" 5.0 / - inf = %f\n", 5.0 / n_inf);
	printf(" inf / 5.0 = %f\n", inf / 5.0);
	printf(" - inf / 5.0 = %f\n", n_inf / 5.0);
}
Re: SQL/JSON revisited
On 3/27/23 20:54, Alvaro Herrera wrote:

> Docs amended as I threatened. Other than that, this has required more
> [v12-0001-SQL-JSON-constructors.patch]
> [v12-0001-delta-uniqueifyJsonbObject-bugfix.patch]

In doc/src/sgml/func.sgml, some minor stuff:

'which specify the data type returned' should be
'which specifies the data type returned'

In the json_arrayagg() description, it says:
'If ABSENT ON NULL is specified, any NULL values are omitted.'
That's true, but as omitting NULL values is the default (i.e., also
without that clause), maybe it's better to say:
'Any NULL values are omitted unless NULL ON NULL is specified'

I've found no bugs in functionality.

Thanks,
Erik Rijkers
Re: Infinite Interval
On Sun, Mar 19, 2023 at 12:18 AM Joseph Koshakow wrote: > > The problem is that `secs = rint(secs)` rounds the seconds too early > and loses any fractional seconds. Do we have an overflow detecting > multiplication function for floats? We have float8_mul() which checks for overflow. typedef double float8; > > >+if (INTERVAL_NOT_FINITE(result)) > >+ereport(ERROR, > >+(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), > >+ errmsg("interval out of range"))); > > > >Probably, I added these kind of checks. But I don't remember if those are > >defensive checks or whether it's really possible that the arithmetic > > above > >these lines can yield an non-finite interval. > > These checks appear in `make_interval`, `justify_X`, > `interval_um_internal`, `interval_pl`, `interval_mi`, `interval_mul`, > `interval_div`. For all of these it's possible that the interval > overflows/underflows the non-finite ranges, but does not > overflow/underflow the data type. For example > `SELECT INTERVAL '2147483646 months' + INTERVAL '1 month'` would error > on this check. Without this patch postgres@64807=#SELECT INTERVAL '2147483646 months' + INTERVAL '1 month'; ?column? 178956970 years 7 mons (1 row) That result looks correct postgres@64807=#select 178956970 * 12 + 7; ?column? 2147483647 (1 row) So some backward compatibility break. I don't think we can avoid the backward compatibility break without expanding interval structure and thus causing on-disk breakage. But we can reduce the chances of breaking, if we change INTERVAL_NOT_FINITE to check all the three fields, instead of just month. > > > >+else > >+{ > >+result->time = -interval->time; > >+result->day = -interval->day; > >+result->month = -interval->month; > >+ > >+if (INTERVAL_NOT_FINITE(result)) > >+ereport(ERROR, > >+(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), > >+ errmsg("interval out of range"))); > > > >If this error ever gets to the user, it could be confusing. Can we > > elaborate by > >adding context e.g. 
errcontext("while negating an interval") or some > > such? > > Done. Thanks. Can we add relevant contexts at similar other places? Also if we use all the three fields, we will need to add such checks in interval_justify_hours() > > I replaced these checks with the following: > > + else if (interval->time == PG_INT64_MIN || interval->day == PG_INT32_MIN || > interval->month == PG_INT32_MIN) > + ereport(ERROR, > + (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), > + errmsg("interval out of range"))); > > I think this covers the same overflow check but is maybe a bit more > obvious. Unless, there's something I'm missing? Thanks. Your current version is closer to int4um(). Some more review comments in the following email. -- Best Wishes, Ashutosh Bapat
Re: Add standard collation UNICODE
On Tue, 2023-03-28 at 08:46 -0400, Joe Conway wrote:
> > As long as we still have to initialize the libc locale fields to some
> > language, I think it would be less confusing to keep the ICU locale on
> > the same language.
>
> I definitely agree with that.

Sounds good -- no changes then.

Regards,
Jeff Davis
Re: TAP output format in pg_regress
> On 15 Mar 2023, at 11:36, Peter Eisentraut wrote:
>
> On 24.02.23 10:49, Daniel Gustafsson wrote:
>> Another rebase on top of 337903a16f. Unless there are conflicting
>> reviews, I consider this patch to be done and ready for going in during
>> the next CF.
>
> I think this is just about as good as it's going to get, so I think we
> can consider committing this soon.
>
> A few comments along the way:
>
> 1) We can remove the gettext markers _() inside calls like note(),
> bail(), etc. If we really wanted to do translation, we would do that
> inside those functions (similar to errmsg() etc.).

Fixed. The attached also removes all explicit \n from output and leaves the
decision on when to add a linebreak to the TAP emitting function. I think
this better matches how we typically handle printing of output like this.
It also ensures that all bail messages follow the same syntax.

> 2) There are a few fprintf(stderr, ...) calls left. Should those be
> changed to something TAP-enabled?

Initially the patch kept errors happening before testing started as non-TAP
output; there were leftovers which are now converted.

> 3) Maybe these lines
>
> +++ isolation check in src/test/isolation +++
>
> should be changed to TAP format? Arguably, if I run "make -s check",
> then everything printed should be valid TAP, right?

Fixed.

> 4) From the introduction lines
>
> == creating temporary instance ==
> == initializing database system ==
> == starting postmaster ==
> running on port 61696 with PID 85346
> == creating database "regression" ==
> == running regression test queries ==
>
> you have kept
>
> # running on port 61696 with PID 85346
>
> which, well, is that the most interesting of those?
>
> The first three lines (creating, initializing, starting) take some
> noticeable amount of time, so they could be kept as a progress
> indicator. Or just delete all of them? I suppose some explicit
> indication about temp-instance versus existing instance would be useful.
This was discussed in 20220221164736.rq3ornzjdkmwk...@alap3.anarazel.de
which concluded that this was about the only thing of interest, and even at
that it was more of a maybe than a definite yes. As this patch stands, it
prints the above for a temp install and the host/port for an existing
install, but I don't have any problems removing that as well.

> 5) About the output format. Obviously, this will require some retraining
> of the eye. But my first impression is that there is a lot of whitespace
> without any guiding lines, so to speak, horizontally or vertically. It
> makes me a bit dizzy. ;-)
>
> I think instead
>
> # parallel group (2 tests): event_trigger oidjoins
> ok 210 event_trigger 131 ms
> ok 211 oidjoins 190 ms
> ok 212 fast_default 158 ms
> ok 213 tablespace 319 ms
>
> Something like this would be less dizzy-making:
>
> # parallel group (2 tests): event_trigger oidjoins
> ok 210 - + event_trigger 131 ms
> ok 211 - + oidjoins 190 ms
> ok 212 - fast_default 158 ms
> ok 213 - tablespace 319 ms

The current format is chosen to be close to the old format, while also
adding sufficient padding that it won't yield ragged columns. The wide
padding is needed to cope with long names in the isolation and ecpg test
suites and not so much the regress suite. In the attached I've dialled back
the padding a little bit to make it a bit more compact, but I doubt it's
enough.

> Just an idea, we don't have to get this exactly perfect right now, but
> it's something to think about.

I'm quite convinced that this will be revisited once this lands and is in
front of developers. I think the attached is a good candidate for going in,
but I would like to see it for another spin in the CF bot first.

--
Daniel Gustafsson

v18-0001-pg_regress-Emit-TAP-compliant-output.patch
Description: Binary data
Re: Memory leak from ExecutorState context?
On Tue, 28 Mar 2023 00:43:34 +0200 Tomas Vondra wrote: > On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote: > > Please, find in attachment a patch to allocate bufFiles in a dedicated > > context. I picked up your patch, backpatch'd it, went through it and did > > some minor changes to it. I have some comment/questions thought. > > > > 1. I'm not sure why we must allocate the "HashBatchFiles" new context > > under ExecutorState and not under hashtable->hashCxt? > > > > The only references I could find was in hashjoin.h:30: > > > >/* [...] > > * [...] (Exception: data associated with the temp files lives in the > > * per-query context too, since we always call buffile.c in that > > context.) > > > > And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this > > original comment in the patch): > > > >/* [...] > > * Note: it is important always to call this in the regular executor > > * context, not in a shorter-lived context; else the temp file buffers > > * will get messed up. > > > > > > But these are not explanation of why BufFile related allocations must be > > under a per-query context. > > > > Doesn't that simply describe the current (unpatched) behavior where > BufFile is allocated in the per-query context? I wasn't sure. The first quote from hashjoin.h seems to describe a stronger rule about «**always** call buffile.c in per-query context». But maybe it ought to be «always call buffile.c from one of the sub-query context»? I assume the aim is to enforce the tmp files removal on query end/error? > I mean, the current code calls BufFileCreateTemp() without switching the > context, so it's in the ExecutorState. But with the patch it very clearly is > not. > > And I'm pretty sure the patch should do > > hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt, > "HashBatchFiles", > ALLOCSET_DEFAULT_SIZES); > > and it'd still work. Or why do you think we *must* allocate it under > ExecutorState? 
That was actually my very first patch and it indeed worked. But I was confused about the previous quoted code comments. That's why I kept your original code and decided to rise the discussion here. Fixed in new patch in attachment. > FWIW The comment in hashjoin.h needs updating to reflect the change. Done in the last patch. Is my rewording accurate? > > 2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context > > switch seems fragile as it could be forgotten in futur code path/changes. > > So I added an Assert() in the function to make sure the current memory > > context is "HashBatchFiles" as expected. > > Another way to tie this up might be to pass the memory context as > > argument to the function. > > ... Or maybe I'm over precautionary. > > > > I'm not sure I'd call that fragile, we have plenty other code that > expects the memory context to be set correctly. Not sure about the > assert, but we don't have similar asserts anywhere else. I mostly sticked it there to stimulate the discussion around this as I needed to scratch that itch. > But I think it's just ugly and overly verbose +1 Your patch was just a demo/debug patch by the time. It needed some cleanup now :) > it'd be much nicer to e.g. pass the memory context as a parameter, and do > the switch inside. That was a proposition in my previous mail, so I did it in the new patch. Let's see what other reviewers think. > > 3. You wrote: > > > >>> A separate BufFile memory context helps, although people won't see it > >>> unless they attach a debugger, I think. Better than nothing, but I was > >>> wondering if we could maybe print some warnings when the number of batch > >>> files gets too high ... > > > > So I added a WARNING when batches memory are exhausting the memory size > > allowed. 
> >
> > +   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
> > +       elog(WARNING, "Growing number of hash batch is exhausting
> > memory");
> >
> > This is repeated on each call of ExecHashIncreaseNumBatches when BufFile
> > overflows the memory budget. I realize now I should probably add the
> > memory limit, the number of current batches and their memory
> > consumption. The message is probably too cryptic for a user. It could
> > probably be reworded, but some doc or additional hint around this
> > message might help.
>
> Hmmm, not sure if WARNING is a good approach, but I don't have a better
> idea at the moment.

I stepped it down to NOTICE and added some more infos. Here is the output of
the last patch with a 1MB work_mem:

=# explain analyze select * from small join large using (id);
WARNING:  increasing number of batches from 1 to 2
WARNING:  increasing number of batches from 2 to 4
WARNING:  increasing number of batches from 4 to 8
WARNING:  increasing number of batches from 8 to 16
WARNING:  increasing
Re: "variable not found in subplan target list"
I wrote: > I reduced this down to > MERGE INTO public.target_parted as target_0 > USING public.itest1 as ref_0 > ON target_0.b = ref_0.a > WHEN NOT MATCHED >THEN INSERT VALUES (42, 13); > The critical moving part seems to just be that the MERGE target > is a partitioned table ... but surely somebody tested that before? Oh, it's not just any partitioned table: regression=# \d+ target_parted Partitioned table "public.target_parted" Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description +-+---+--+-+-+-+--+- a | integer | | | | plain | | | b | integer | | | | plain | | | Partition key: LIST (a) Number of partitions: 0 The planner is reducing the scan of target_parted to a dummy scan, as is reasonable, but it forgets to provide ctid as an output from that scan; then the parent join node is unhappy because it does have a ctid output. So it looks like the problem is some shortcut we take while creating the dummy scan. I suppose that without the planner bug, this'd fail at runtime for lack of a partition to put (42,13) into. Because of that, the case isn't really interesting for production, which may explain the lack of reports. regards, tom lane
Re: Hash table scans outside transactions
Bumping it to attract some attention.

On Tue, Mar 21, 2023 at 12:51 PM Ashutosh Bapat wrote:
>
> Hi,
> Hash table scans (seq_scan_table/level) are cleaned up at the end of a
> transaction in AtEOXact_HashTables(). If a hash seq scan continues
> beyond transaction end it will meet "ERROR: no hash_seq_search scan
> for hash table" in deregister_seq_scan(). That seems to be limiting
> the hash table usage.
>
> Our use case is
> 1. Add/update/remove entries in hash table
> 2. Scan the existing entries and perform one transaction per entry
> 3. Close scan
>
> repeat the above steps in an infinite loop. Note that we do not
> add/modify/delete entries in step 2. We can't use linked lists since
> the entries need to be updated or deleted using hash keys. Because the
> hash seq scan is cleaned up at the end of the transaction, we
> encounter an error in the 3rd step. I don't see that the actual hash
> table scan depends upon the seq_scan_table/level[] which is cleaned up
> at the end of the transaction.
>
> I have the following questions
> 1. Is there a way to avoid cleaning up seq_scan_table/level() when the
> transaction ends?
> 2. Is there a way that we can use the hash table implementation in
> PostgreSQL code for our purpose?
>
> --
> Best Wishes,
> Ashutosh Bapat

--
Best Wishes,
Ashutosh Bapat
Re: "variable not found in subplan target list"
Alvaro Herrera writes: > I have to run now so can't dissect it, but while running sqlsmith on the > SQL/JSON patch after Justin's report, I got $SUBJECT in this query: I reduced this down to MERGE INTO public.target_parted as target_0 USING public.itest1 as ref_0 ON target_0.b = ref_0.a WHEN NOT MATCHED THEN INSERT VALUES (42, 13); The critical moving part seems to just be that the MERGE target is a partitioned table ... but surely somebody tested that before? regards, tom lane
Re: Add standard collation UNICODE
On 3/28/23 06:07, Peter Eisentraut wrote: On 23.03.23 21:16, Jeff Davis wrote: Another thought: for ICU, do we want the default collation to be UNICODE (root collation)? What we have now gets the default from the environment, which is consistent with the libc provider. But now that we have the UNICODE collation, it makes me wonder if we should just default to that. The server's environment doesn't necessarily say much about the locale of the data stored in it or the locale of the applications accessing it. As long as we still have to initialize the libc locale fields to some language, I think it would be less confusing to keep the ICU locale on the same language. I definitely agree with that. If we ever manage to get rid of that, then I would also support making the ICU locale the root collation by default. +1 -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PGdoc: add missing ID attribute to create_subscription.sgml
On Tue, Mar 28, 2023 at 1:00 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Amit, > > Thank you for reviewing! PSA new version. > > > Isn't it better to move the link-related part to the next line > > wherever possible? Currently, it looks bit odd. > > Previously I preferred not to add a new line inside the tag, but it > caused > long-line. So I adjusted them not to be too short/long length. > There is no need to break the link line. See attached. -- With Regards, Amit Kapila. v7-0001-Add-XML-ID-attributes-to-create_subscription.sgml.patch Description: Binary data
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
> > I might misunderstand something, but I believe the v5 patch uses > copy_generic_opt_list and it does not add IGNORE_DATATYPE_ERRORS as a > keyword. > It modifies neither kwlist.h nor gram.y. > Sorry, didn't notice that. I think everything is alright now. Regards, Damir Belyalov Postgres Professional
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On 2023-03-27 23:28, Damir Belyalov wrote: Hi! I made the specified changes and my patch turned out the same as yours. The performance measurements were the same too. Thanks for your review and measurements. The only thing left to do is how not to add IGNORE_DATATYPE_ERRORS as a keyword. See how this is done for parameters such as FORCE_NOT_NULL, FORCE_NULL, FORCE_QUOTE. They are not in kwlist.h and are not as keywords in gram.y. I might misunderstand something, but I believe the v5 patch uses copy_generic_opt_list and it does not add IGNORE_DATATYPE_ERRORS as a keyword. It modifies neither kwlist.h nor gram.y. -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
On Mon, Dec 19, 2022 at 12:37:19PM +0100, Alvaro Herrera wrote: > I have created one in the January commitfest, > https://commitfest.postgresql.org/41/ > and rebased the patch on current master. (I have not reviewed this.) I have spent some time on that, and here are some comments with an updated version of the patch attached. The checks in XLogRegisterData() seemed overcomplicated to me. In this context, I think that we should just care about making sure that mainrdata_len does not overflow depending on the length given by the caller, which is where pg_add_u32_overflow() becomes handy. XLogRegisterBufData() added a check on UINT16_MAX in an assert, though we already check for overflow a couple of lines down. This is not necessary, it seems. @@ -535,6 +567,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, XLogRecord *rechdr; char *scratch = hdr_scratch; + /* ensure that any assembled record can be decoded */ + Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize))); A hardcoded check like that has no need to be in a code path triggered each time a WAL record is assembled. One place where this could be is InitXLogInsert(). It still means that it is called one time for each backend, but seeing it where the initialization of xloginsert.c feels natural, at least. A postmaster location would be enough, as well. XLogRecordMaxSize just needs to be checked once IMO, around the end of XLogRecordAssemble() once we know the total size of the record that will be fed to a XLogReader. One thing that we should be more careful of is to make sure that total_len does not overflow its uint32 value while assembling the record, as well. I have removed XLogErrorDataLimitExceeded(), replacing it with more context about the errors happening. Perhaps this has no need to be that much verbose, but it can be really useful for developers. Some comments had no need to be updated, and there were some typos. 
I am on board with the idea of a XLogRecordMaxSize that's bounded at 1020MB, leaving 4MB as room for the extra data needed by a XLogReader. At the end, I think that this is quite interesting long-term. For example, if we lift up XLogRecordMaxSize, we can evaluate the APIs adding buffer data or main data separately. Thoughts about this version? -- Michael From 16b40324a5bc5bbe6e06cbfb86251c907f400151 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 28 Mar 2023 20:34:31 +0900 Subject: [PATCH v10] Add protections in xlog record APIs against overflows Before this, it was possible for an extension to create malicious WAL records that were too large to replay; or that would overflow the xl_tot_len field, causing potential corruption in WAL record IO ops. Emitting invalid records was also possible through pg_logical_emit_message(), which allowed you to emit arbitrary amounts of data up to 2GB, much higher than the replay limit of approximately 1GB. This patch adds a limit to the size of an XLogRecord (1020MiB), checks against overflows, and ensures that the reader infrastructure can read any max-sized XLogRecords, such that the wal-generating backend will fail when it attempts to insert unreadable records, instead of that insertion succeeding but breaking any replication streams. Author: Matthias van de Meent --- src/include/access/xlogrecord.h | 11 src/backend/access/transam/xloginsert.c | 73 + 2 files changed, 74 insertions(+), 10 deletions(-) diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h index 0d576f7883..0a7b0bb6fc 100644 --- a/src/include/access/xlogrecord.h +++ b/src/include/access/xlogrecord.h @@ -54,6 +54,17 @@ typedef struct XLogRecord #define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c)) +/* + * XLogReader needs to allocate all the data of an xlog record in a single + * chunk. 
This means that a single XLogRecord cannot exceed MaxAllocSize + * in length if we ignore any allocation overhead of the XLogReader. + * + * To accommodate some overhead, this value allows for 4M of allocation + * overhead, that should be plenty enough for what + * DecodeXLogRecordRequiredSpace() expects as extra. + */ +#define XLogRecordMaxSize (1020 * 1024 * 1024) + /* * The high 4 bits in xl_info may be used freely by rmgr. The * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index 008612e032..86fa6a5276 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -33,6 +33,7 @@ #include "access/xloginsert.h" #include "catalog/pg_control.h" #include "common/pg_lzcompress.h" +#include "common/int.h" #include "executor/instrument.h" #include "miscadmin.h" #include "pg_trace.h" @@ -351,11 +352,13 @@ void XLogRegisterData(char *data, uint32 len) { XLogRecData *rdata; + uint32 result = 0;
Re: "variable not found in subplan target list"
Alvaro Herrera writes: > I have to run now so can't dissect it, but while running sqlsmith on the > SQL/JSON patch after Justin's report, I got $SUBJECT in this query: Reproduces in HEAD and v15 too (once you replace pg_catalog.system_user with some function that exists in v15). So it's not the fault of the JSON patch, nor of my outer-join hacking which had been my first thought. regards, tom lane
Re: Move definition of standard collations from initdb to pg_collation.dat
Peter Eisentraut writes: > While working on [0], I was wondering why the collations ucs_basic and > unicode are not in pg_collation.dat. I traced this back through > history, and I think this was just lost in a game of telephone. > The initial commit for pg_collation.h (414c5a2ea6) has only the default > collation in pg_collation.h (pre .dat), with initdb handling everything > else. Over time, additional collations "C" and "POSIX" were moved to > pg_collation.h, and other logic was moved from initdb to > pg_import_system_collations(). But ucs_basic was untouched. Commit > 0b13b2a771 rearranged the relative order of operations in initdb and > added the current comment "We don't want to pin these", but looking at > the email[1], I think this was more a guess about the previous intent. Yeah, I was just loath to change the previous behavior in that patch. I can't see any strong reason not to pin these entries. > I suggest we fix this now; see attached patch. While we're here, do we want to adopt some other spelling of "the root locale" than "und", in view of recent discoveries about the instability of that on old ICU versions? regards, tom lane
"variable not found in subplan target list"
I have to run now so can't dissect it, but while running sqlsmith on the SQL/JSON patch after Justin's report, I got $SUBJECT in this query: MERGE INTO public.target_parted as target_0 USING (select subq_0.c5 as c0, subq_0.c0 as c1, ref_0.a as c2, subq_0.c1 as c3, subq_0.c9 as c4, (select c from public.prt2_m_p3 limit 1 offset 1) as c5, subq_0.c8 as c6, ref_0.a as c7, subq_0.c7 as c8, subq_0.c1 as c9, pg_catalog.system_user() as c10 from public.itest1 as ref_0 left join (select ref_1.matches as c0, ref_1.typ as c1, ref_1.colname as c2, (select slotname from public.iface limit 1 offset 44) as c3, ref_1.matches as c4, ref_1.op as c5, ref_1.matches as c6, ref_1.value as c7, ref_1.op as c8, ref_1.op as c9, ref_1.typ as c10 from public.brinopers_multi as ref_1 where cast(null as polygon) <@ (select polygon from public.tab_core_types limit 1 offset 22) ) as subq_0 on (cast(null as macaddr8) >= cast(null as macaddr8)) where subq_0.c10 > subq_0.c2 limit 49) as subq_1 ON target_0.b = subq_1.c2 WHEN MATCHED AND (cast(null as box) |>> cast(null as box)) or (cast(null as lseg) ?-| (select s from public.lseg_tbl limit 1 offset 6) ) THEN DELETE WHEN NOT MATCHED AND (EXISTS ( select 21 as c0, subq_2.c0 as c1 from public.itest14 as sample_0 tablesample system (3.6) inner join public.num_exp_sqrt as sample_1 tablesample bernoulli (0.3) on (cast(null as "char") <= cast(null as "char")), lateral (select sample_1.id as c0 from public.a as ref_2 where (cast(null as lseg) <@ cast(null as line)) or ((select b3 from public.bit_defaults limit 1 offset 80) <> (select b3 from public.bit_defaults limit 1 offset 4) ) limit 158) as subq_2 where (cast(null as name) !~ (select t from public.test_tsvector limit 1 offset 5) ) and ((select bool from public.tab_core_types limit 1 offset 61) < (select pg_catalog.bool_or(v) from public.rtest_view1) ))) or (18 is NULL) THEN INSERT VALUES ( pg_catalog.int4um( cast(public.func_with_bad_set() as int4)), 13) WHEN MATCHED AND ((24 is not NULL) or (true)) or 
(cast(null as "timestamp") <= cast(null as timestamptz)) THEN UPDATE set b = target_0.b Ugh. I got no more SQL/JSON related crashes so far. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)
Move definition of standard collations from initdb to pg_collation.dat
While working on [0], I was wondering why the collations ucs_basic and unicode are not in pg_collation.dat. I traced this back through history, and I think this was just lost in a game of telephone. The initial commit for pg_collation.h (414c5a2ea6) has only the default collation in pg_collation.h (pre .dat), with initdb handling everything else. Over time, additional collations "C" and "POSIX" were moved to pg_collation.h, and other logic was moved from initdb to pg_import_system_collations(). But ucs_basic was untouched. Commit 0b13b2a771 rearranged the relative order of operations in initdb and added the current comment "We don't want to pin these", but looking at the email[1], I think this was more a guess about the previous intent. I suggest we fix this now; see attached patch. [0]: https://www.postgresql.org/message-id/flat/1293e382-2093-a2bf-a397-c04e8f83d3c2%40enterprisedb.com [1]: https://www.postgresql.org/message-id/28195.1498172402%40sss.pgh.pa.usFrom 0d2c6b92a3340833f13bab395e0556ce1f045226 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 28 Mar 2023 12:04:34 +0200 Subject: [PATCH] Move definition of standard collations from initdb to pg_collation.dat --- src/bin/initdb/initdb.c | 15 +-- src/include/catalog/pg_collation.dat | 7 +++ 2 files changed, 8 insertions(+), 14 deletions(-) diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index bae97539fc..9ccbf998ec 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -1695,20 +1695,7 @@ setup_description(FILE *cmdfd) static void setup_collation(FILE *cmdfd) { - /* -* Add SQL-standard names. We don't want to pin these, so they don't go -* in pg_collation.dat. But add them before reading system collations, so -* that they win if libc defines a locale with the same name. 
-*/ - PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, colliculocale)" - "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, 'und');\n\n", - BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU); - - PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, collcollate, collctype)" - "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'ucs_basic', 'pg_catalog'::regnamespace, %u, '%c', true, %d, 'C', 'C');\n\n", - BOOTSTRAP_SUPERUSERID, COLLPROVIDER_LIBC, PG_UTF8); - - /* Now import all collations we can find in the operating system */ + /* Import all collations we can find in the operating system */ PG_CMD_PUTS("SELECT pg_import_system_collations('pg_catalog');\n\n"); } diff --git a/src/include/catalog/pg_collation.dat b/src/include/catalog/pg_collation.dat index f4bda1c769..14df398ad2 100644 --- a/src/include/catalog/pg_collation.dat +++ b/src/include/catalog/pg_collation.dat @@ -23,5 +23,12 @@ descr => 'standard POSIX collation', collname => 'POSIX', collprovider => 'c', collencoding => '-1', collcollate => 'POSIX', collctype => 'POSIX' }, +{ oid => '962', + descr => 'sorts using the Unicode Collation Algorithm with default settings', + collname => 'unicode', collprovider => 'i', collencoding => '-1', + colliculocale => 'und' }, +{ oid => '963', descr => 'sorts by Unicode code point', + collname => 'ucs_basic', collprovider => 'c', collencoding => '6', + collcollate => 'C', collctype => 'C' }, ] -- 2.40.0
Re: doc: add missing "id" attributes to extension packaging page
On 28.03.2023 at 00:11, Peter Smith wrote:
> FYI, there is a lot of overlap between this last attachment and the
> patches of Kuroda-san's current thread here [1] which is also adding ids
> to create_subscription.sgml. (Anyway, I guess you might already be aware
> of that other thread because your new ids look like they have the same
> names as those chosen by Kuroda-san)
>
> [1] https://www.postgresql.org/message-id/flat/CAHut%2BPvzo6%3DKKLqMR6-mAQdM%2Bj_dse0eUreGmrFouL7gbLbv2w%40mail.gmail.com#7da8d0e3b73096375847c16c856b4aed

Thanks, I actually was not aware of this. Also, kudos for capturing the
missing id and advocating for consistency regarding ids even before this is
actively enforced. This nurtures my optimism that consistency might
actually be achievable without everybody getting angry at me because my
patch will enforce it.

Regarding the overlap, I currently try to make it as easy as possible for a
potential committer, and I'm happy to rebase my patch upon request or if
Kuroda-san's patch gets applied first.

Regards,
Brar
RE: Data is copied twice when specifying both child and parent table in publication
On Tues, Mar 28, 2023 at 18:00 PM Wang, Wei/王 威 wrote: > Attach the new patch. Sorry, I attached the wrong patch. Here is the correct new version patch which addressed all comments so far. Regards, Wang Wei v25-0001-Avoid-syncing-data-twice-for-the-publish_via_par.patch Description: v25-0001-Avoid-syncing-data-twice-for-the-publish_via_par.patch