Re: Time delayed LR (WAS Re: logical replication restrictions)
On Tue, Feb 21, 2023 at 1:28 PM Hayato Kuroda (Fujitsu) wrote:
>
> > doc/src/sgml/catalogs.sgml
> >
> > 4.
> > + subminsenddelay int4
> > +
> > + The minimum delay, in milliseconds, by the publisher to send changes
> >
> > "by the publisher to send changes" --> "by the publisher before sending
> > changes"
>
> As Amit said[1], there is a possibility of delay after sending, so I changed
> it to "before sending COMMIT record". What do you think?
>

I think it would be better to say: "The minimum delay, in milliseconds, by the
publisher before sending all the changes". If you agree then a similar change
is required in the below comment as well:

+ /*
+  * The minimum delay, in milliseconds, by the publisher before sending
+  * COMMIT/PREPARE record.
+  */
+ int32 min_send_delay;
+

> > src/backend/replication/pgoutput/pgoutput.c
> >
> > 11.
> > + errno = 0;
> > + parsed = strtoul(strVal(defel->arg), &endptr, 10);
> > + if (errno != 0 || *endptr != '\0')
> > +     ereport(ERROR,
> > +             (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +              errmsg("invalid min_send_delay")));
> > +
> > + if (parsed > PG_INT32_MAX)
> > +     ereport(ERROR,
> > +             (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +              errmsg("min_send_delay \"%s\" out of range",
> > +                     strVal(defel->arg))));
> >
> > Should the validation also be checking/asserting no negative numbers, or
> > should min_send_delay actually be defined as a uint32 in the first place?
>
> I think you are right, because min_apply_delay does not have related code.
> We must also consider the possibility that a user sends START_REPLICATION
> by hand with a negative value.
> Fixed.
>

Your reasoning for adding the additional check seems good to me, but I don't
see it in the patch. The check as I see it is as below:

+ if (delay_val > PG_INT32_MAX)
+     ereport(ERROR,
+             (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+              errmsg("min_send_delay \"%s\" out of range",
+                     strVal(defel->arg))));

Am I missing something, or is the new check at some other place?

+ has been finished. However, there is a possibility that the table
+ status written in pg_subscription_rel
+ will be delayed in getting to "ready" state, and also two-phase
+ (if specified) will be delayed in getting to "enabled".
+

There appears to be a special value <0x00> after "ready". I think that was
added by mistake, or probably the editor you used has added this value.

Can we slightly reword this to: "However, there is a possibility that the
table status updated in pg_subscription_rel could be delayed in getting to the
"ready" state, and also two-phase (if specified) could be delayed in getting
to "enabled"."?

--
With Regards,
Amit Kapila.
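For illustration, here is a self-contained sketch of the kind of parse-and-range-check being discussed, including rejection of negative values. This is not the patch's code; the function name, error handling, and use of strtol() instead of strtoul() are assumptions made for the example.

#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-alone sketch: parse a delay option given as a string
 * and reject trailing garbage, negative values, and values above INT32_MAX.
 * The real code would use ereport() and strVal()/defGetString() instead.
 */
static int
parse_min_send_delay(const char *str, int32_t *result)
{
	char	   *endptr;
	long		parsed;

	errno = 0;
	parsed = strtol(str, &endptr, 10);	/* strtol, so a leading '-' is visible */

	if (errno != 0 || endptr == str || *endptr != '\0')
	{
		fprintf(stderr, "invalid min_send_delay\n");
		return -1;
	}
	if (parsed < 0 || parsed > INT32_MAX)
	{
		fprintf(stderr, "min_send_delay \"%s\" out of range\n", str);
		return -1;
	}
	*result = (int32_t) parsed;
	return 0;
}

int
main(void)
{
	int32_t		delay;

	return parse_min_send_delay("-5", &delay) == 0 ? 0 : 1;	/* rejects -5 */
}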
unsafe_tests module
Given its nature and purpose as a module we don't want to run against an
installed instance, shouldn't src/test/modules/unsafe_tests have
NO_INSTALLCHECK=1 in its Makefile and runningcheck: false in its meson.build?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: LWLock deadlock in brinRevmapDesummarizeRange
On 2023-Feb-22, Tomas Vondra wrote:

> But instead, I almost immediately ran into an LWLock deadlock :-(

Ouch.

> I've managed to reproduce this on PG13+, but I believe it's there since
> brinRevmapDesummarizeRange was introduced in PG10. I just haven't tried
> on pre-13 releases.

Hmm, I think that might just be an "easy" way to hit it, but the problem is
actually older than that, since AFAICS brin_doupdate is careless regarding
the locking order of the revmap page vs. the regular page.

Sadly, the README doesn't cover locking considerations. I had that in a file
called 'minmax-proposal' in version 16 of the patch here

https://postgr.es/m/20140820225133.gb6...@eldon.alvh.no-ip.org

but by version 18 (when 'minmax' became BRIN) I seem to have removed that
file and replaced it with the README, and apparently I didn't copy this
material over.

... and in there, I wrote that we would first write the brin tuple in the
regular page, unlock that, and then lock the revmap for the update, without
holding a lock on the data page. I don't remember why we do it differently
now, but maybe the fix is just to release the regular page lock before
locking the revmap page? One very important change is that in previous
versions the revmap used a separate fork, and we had to introduce an
"evacuation protocol" when we integrated the revmap into the main fork,
which may have changed the locking considerations.

Another point: to desummarize a range, just unlinking the entry from the
revmap should suffice, from the POV of other index scanners. Maybe we can
simplify the whole procedure to: lock revmap, remove entry, remember page
number, unlock revmap; lock regular page, delete entry, unlock. Then there
are no two locks held at the same time during desummarize.

This comes from v16:

+ Locking considerations
+ ----------------------
+
+ To read the TID during an index scan, we follow this protocol:
+
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation).  A shared lock is
+   sufficient.  We need the LockTuple to prevent VACUUM from recycling
+   the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
+
+
+ To update the summary tuple for a page range, we use this protocol:
+
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here.  Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+ [vacuum stuff elided]

--
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)
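For illustration, a sketch of the ordering Álvaro proposes, in which the revmap lock and the regular-page lock are never held at the same time. The helpers revmap_clear_entry() and regular_page_delete_tuple() are hypothetical, and WAL logging and critical sections are omitted; this is not PostgreSQL code, only a picture of the lock discipline.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/off.h"
#include "utils/rel.h"

/* hypothetical helpers, not part of PostgreSQL */
extern BlockNumber revmap_clear_entry(Buffer revmapbuf, OffsetNumber *off);
extern void regular_page_delete_tuple(Buffer buf, OffsetNumber off);

/*
 * Sketch: never hold the revmap buffer lock and a regular-page buffer lock
 * simultaneously, so desummarization cannot deadlock against
 * brin_doupdate's regular-page-then-revmap ordering.
 */
static void
desummarize_sketch(Relation idxrel, Buffer revmapbuf)
{
	BlockNumber	regblkno;
	OffsetNumber regoff;
	Buffer		regbuf;

	/* Step 1: touch only the revmap */
	LockBuffer(revmapbuf, BUFFER_LOCK_EXCLUSIVE);
	regblkno = revmap_clear_entry(revmapbuf, &regoff);	/* hypothetical */
	LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);

	/* Step 2: only now touch the regular page, with no revmap lock held */
	regbuf = ReadBuffer(idxrel, regblkno);
	LockBuffer(regbuf, BUFFER_LOCK_EXCLUSIVE);
	regular_page_delete_tuple(regbuf, regoff);			/* hypothetical */
	UnlockReleaseBuffer(regbuf);
}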
Re: Assert failure with MERGE into partitioned table with RLS
On Mon, 20 Feb 2023 at 16:18, Dean Rasheed wrote:
>
> As part of the MERGE RETURNING patch I noticed a suspicious Assert() in
> ExecInitPartitionInfo() that looked like it needed updating for MERGE.
>
> After more testing, I can confirm that this is indeed a pre-existing bug,
> that can be triggered using MERGE into a partitioned table that has RLS
> enabled (and hence non-empty withCheckOptionLists to initialise).
>
> So I think we need something like the attached.

Pushed and back-patched.

Regards,
Dean
Re: Disable rdns for Kerberos tests
On 21/02/2023 01:35, Stephen Frost wrote:
> Greetings,
>
> The name canonicalization support for Kerberos is doing us more harm than
> good in the regression tests, so I propose we disable it. Patch attached.
> Thoughts?

Makes sense. A brief comment in 001_auth.pl itself to mention why we disable
rdns would be nice.

- Heikki
LWLock deadlock in brinRevmapDesummarizeRange
Hi,

While finalizing some fixes in BRIN, I decided to stress-test the relevant
part of the code to check if I missed something. Imagine a simple script that
builds BRIN indexes on random data, does random changes and cross-checks the
results with/without the index.

But instead, I almost immediately ran into an LWLock deadlock :-(

I've managed to reproduce this on PG13+, but I believe it's there since
brinRevmapDesummarizeRange was introduced in PG10. I just haven't tried on
pre-13 releases.

The stress-test-2.sh script (attached .tgz) builds a table, fills it with
random data and then runs a mix of updates and (de)summarization DDL on a
random fraction of the index. The lockup is usually triggered within a couple
of minutes, but it might take longer (I guess it depends on the parameters
used to generate the random data, so it may take a couple of runs to hit the
right combination).

The root cause is that brin_doupdate and brinRevmapDesummarizeRange end up
locking buffers in different orders. Attached is also a .patch that adds a
bunch of LOG messages for buffer locking in BRIN code (it's for PG13, but
should work on newer releases too).

Here's a fairly typical example of the interaction between brin_doupdate and
brinRevmapDesummarizeRange:

brin_doupdate (from UPDATE query):

LOG:  brin_doupdate: samepage 0
LOG:  number of LWLocks held: 0
LOG:  brin_getinsertbuffer: locking 898 lock 0x7f9a99a5af64
LOG:  brin_getinsertbuffer: buffer locked
LOG:  brin_getinsertbuffer B: locking 899 lock 0x7f9a99a5afa4
LOG:  brin_getinsertbuffer B: buffer locked
LOG:  number of LWLocks held: 2
LOG:  lock 0 => 0x7f9a99a5af64
LOG:  lock 1 => 0x7f9a99a5afa4
LOG:  brin_doupdate: locking buffer for update
LOG:  brinLockRevmapPageForUpdate: locking 158 lock 0x7f9a99a4f664

brinRevmapDesummarizeRange (from brin_desummarize_range):

LOG:  starting brinRevmapDesummarizeRange
LOG:  number of LWLocks held: 0
LOG:  brinLockRevmapPageForUpdate: locking 158 lock 0x7f9a99a4f664
LOG:  brinLockRevmapPageForUpdate: buffer locked
LOG:  number of LWLocks held: 1
LOG:  lock 0 => 0x7f9a99a4f664
LOG:  brinRevmapDesummarizeRange: locking 898 lock 0x7f9a99a5af64

So, brin_doupdate starts with no LWLocks, and locks buffers 898 and 899
(through brin_getinsertbuffer). It then tries to lock 158. Meanwhile,
brinRevmapDesummarizeRange locks 158 first, and then tries to lock 898.

So, an LWLock deadlock :-(

I've now seen a bunch of these traces, with only minor differences. For
example, brinRevmapDesummarizeRange might get stuck on the second buffer
locked by brin_getinsertbuffer (not the first one like here).

I don't have a great idea how to fix this - I guess we need to ensure the
buffers are locked in the same order, but that seems tricky. Obviously,
people don't call brin_desummarize_range() very often, which likely explains
the lack of reports.
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 0becfde113..65e4ed76b4 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -689,7 +689,9 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	meta = ReadBuffer(index, P_NEW);
 	Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
+	elog(LOG, "brinbuild: locking buffer %d lock %p", meta, BufferDescriptorGetContentLock2(meta));
 	LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
+	elog(LOG, "brinbuild: buffer locked");

 	brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
 					   BRIN_CURRENT_VERSION);
@@ -756,7 +758,9 @@ brinbuildempty(Relation index)
 	/* An empty BRIN index has a metapage only. */
 	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+	elog(LOG, "brinbuildempty: locking buffer %d lock %p", metabuf, BufferDescriptorGetContentLock2(metabuf));
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+	elog(LOG, "brinbuildempty: buffer locked");

 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index d5b19293a0..d4a118eb54 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -78,6 +78,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		return false;			/* keep compiler quiet */
 	}

+	elog(LOG, "brin_doupdate: samepage %d", samepage);
+
+	DumpHeldLWLocks();
+
 	/* make sure the revmap is long enough to contain the entry we need */
 	brinRevmapExtend(revmap, heapBlk);

@@ -106,13 +110,18 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	}
 	else
 	{
+		elog(LOG, "brin_doupdate: locking buffer %d lock %p", oldbuf, BufferDescriptorGetContentLock2(oldbuf));
 		LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+		elog(LOG, "brin_doupdate: buffer locked");
 		newbuf = InvalidBuffer;
 		extended = false;
 	}
 	oldpage = BufferGetPage(oldbuf);
Re: Allow logical replication to copy tables in binary format
> If in a future version a general data type is added, the copy command in
> binary format will not work well, right? That is because the older version
> does not have recv/send functions for the added type. It means that there
> is a possibility that replication between different versions may fail if
> binary format is specified. Therefore, I think we should accept
> copy_format = binary only when the major versions of the servers are the
> same.

I don't think it's necessary to check versions. Yes, there are situations
where binary will fail across major versions. But in many cases it does not.
To me it seems the responsibility of the operator to evaluate this risk. And
if the operator chooses wrong and uses binary copy across incompatible
versions, then it will still fail hard in that case during the copy phase (so
still a very early error). So I don't see a reason to check pre-emptively;
afaict it would only disallow some valid use cases and introduce more code.
Furthermore, no major version check is done for "binary = true" either
(afaik).

The only additional failure scenario that copy_format = binary introduces is
when one of the types does not implement a send function on the source. With
binary = true this would continue to work, but with copy_format = binary it
stops working. All other failure scenarios that binary encoding of types
introduces apply to both binary = true and copy_format = binary (the only
difference being in which phase of the replication these failures happen, the
apply or the copy phase).

> I'm not sure whether the combination of "copy_format = binary" and
> "copy_data = false" should be accepted or not. What do you think?

It seems quite useless indeed to specify the format of a copy that won't
happen.
Re: Transparent column encryption
On 12.02.23 15:48, Mark Dilger wrote:
> I should mention, src/sgml/html/libpq-exec.html needs clarification:
>
>     paramFormats[]
>     Specifies whether parameters are text (put a zero in the array entry
>     for the corresponding parameter) or binary (put a one in the array
>     entry for the corresponding parameter). If the array pointer is null
>     then all parameters are presumed to be text strings.
>
> Perhaps you should edit this last sentence to say that all parameters are
> presumed to be text strings without forced encryption.

This is actually already mentioned later in that section.

>     When column encryption is enabled, the second-least-significant byte
>     of this parameter specifies whether encryption should be forced for a
>     parameter.
>
> The value 0x10 has a one as its second-least-significant *nibble*, but
> that's a really strange way to describe the high-order nibble, and perhaps
> not what you mean. Could you clarify?
>
>     Set this byte to one to force encryption.
>
> I think setting the byte to one (0x01) means "binary unencrypted", not
> "force encryption". Don't you mean to set this byte to 16?
>
>     For example, use the C code literal 0x10 to specify text format with
>     forced encryption. If the array pointer is null then encryption is not
>     forced for any parameter. If encryption is forced for a parameter but
>     the parameter does not correspond to an encrypted column on the
>     server, then the call will fail and the parameter will not be sent.
>     This can be used for additional security against a compromised server.
>     (The drawback is that application code then needs to be kept up to
>     date with knowledge about which columns are encrypted rather than
>     letting the server specify this.)

This was me being confused. I adjusted the text to use "half-byte".

> I think you should say something about how specifying 0x11 will behave --
> in other words, asking for encrypted binary data. I believe that this will
> draw a "format must be text for encrypted parameter" error, and that the
> docs should clearly say so.

done
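For illustration, here is how a libpq caller might request forced encryption for a text parameter under the convention described above. Note that the 0x10 half-byte convention is part of the proposed column-encryption patch, not of any released libpq, and the table and column names are made up for the example.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");		/* connection parameters from environment */
	const char *values[1] = {"555-0199"};
	int			formats[1] = {0x10};		/* text format + forced encryption (proposed convention) */
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	res = PQexecParams(conn,
					   "INSERT INTO customers (phone) VALUES ($1)",
					   1,		/* one parameter */
					   NULL,	/* let the server infer the parameter type */
					   values,
					   NULL,	/* lengths not needed for text parameters */
					   formats,
					   0);		/* text result format */

	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));

	PQclear(res);
	PQfinish(conn);
	return 0;
}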
Re: Transparent column encryption
On 11.02.23 22:54, Mark Dilger wrote:
> Thanks Peter. Here are some observations about the documentation in patch
> version 15.
>
> In acronyms.sgml, the CEK and CMK entries should link to documentation,
> perhaps linkend="glossary-column-encryption-key" and
> linkend="glossary-column-master-key". These glossary entries should in turn
> link to linkend="ddl-column-encryption".
>
> In ddl.sgml, the sentence "forcing encryption of certain parameters in the
> client library (see its documentation)" should link to
> linkend="libpq-connect-column-encryption".
>
> Did you intend the use of "transparent data encryption" (rather than
> "transparent column encryption") in datatype.sgml? If so, what's the
> difference?

These are all addressed in the v16 patch I just posted.

> Is this feature intended to be available from ecpg? If so, can we maybe
> include an example in 36.3.4. Prepared Statements showing how to pass the
> encrypted values securely. If not, can we include verbiage about that
> limitation, so folks don't waste time trying to figure out how to do it?

It should "just work". I will give this a try sometime, but I don't see why
it wouldn't work.

> The documentation for pg_dump (and pg_dumpall) now includes a
> --decrypt-encrypted-columns option, which I suppose requires cmklookup to
> first be configured, and for PGCMKLOOKUP to be exported. There isn't
> anything in the pg_dump docs about this, though, so maybe a link to section
> 5.5.3 with a warning about not running pg_dump this way on the database
> server itself?

Also addressed in v16.

> How does a psql user mark a parameter as having forced encryption? A libpq
> user can specify this in the paramFormats array, but I don't see any syntax
> for doing this from psql. None of the perl tap tests you've included appear
> to do this (except indirectly when calling test_client); grep'ing for the
> libpq error message "parameter with forced encryption is not to be
> encrypted" in the tests has no matches. Is it just not possible? I thought
> you'd mentioned some syntax for this when we spoke in person, but I don't
> see it now.

This has been asked about before. We just need to come up with a syntax for
it. The issue is contained inside psql.
RE: "out of relcache_callback_list slots" after multiple calls to pg_logical_slot_get_binary_changes
On Wed, Feb 22, 2023 2:20 PM Michael Paquier wrote:
>
> On Wed, Feb 22, 2023 at 12:07:06PM +0900, Kyotaro Horiguchi wrote:
> > At Wed, 22 Feb 2023 12:29:59 +1100, Peter Smith wrote in
> >> If you are going to do that, then won't just copying the
> >> CacheRegisterSyscacheCallback(PUBLICATIONOID... into function
> >> init_rel_sync_cache() be effectively the same as doing that?
> >
> > I'm not sure if it has anything to do with the relation sync cache.
> > On the other hand, moving all the content of init_rel_sync_cache() up
> > to pgoutput_startup() doesn't seem like a good idea.. Another option,
> > as you see, was to separate callback registration code.
>
> Both are kept separate in the code, so keeping this separation makes
> sense to me.
>
> +	/* Register callbacks if we didn't do that. */
> +	if (!callback_registered)
> +		CacheRegisterSyscacheCallback(PUBLICATIONOID,
> +									  publication_invalidation_cb,
> +									  (Datum) 0);
>
>  	/* Initialize relation schema cache. */
>  	init_rel_sync_cache(CacheMemoryContext);
> +	callback_registered = true;
> [...]
> +	/* Register callbacks if we didn't do that. */
> +	if (!callback_registered)
>
> I am a bit confused by the use of one single flag called
> callback_registered to track both the publication callback and the
> relation callbacks. Wouldn't it be cleaner to use two flags? I don't
> think that we'll have soon a second code path calling
> init_rel_sync_cache(), but if we do then the callback load could again
> be messed up.

Thanks for your reply. Using two flags makes sense to me. Attached is the
updated patch.

Regards,
Shi Yu

v3-0001-Avoid-duplicate-registration-of-callbacks-in-pgou.patch
Description: v3-0001-Avoid-duplicate-registration-of-callbacks-in-pgou.patch
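For illustration, a minimal sketch of the two-flag registration being discussed. publication_invalidation_cb and init_rel_sync_cache are the existing pgoutput.c routines; the flag names and the wrapper function are assumptions for this example, not the committed patch.

#include "postgres.h"
#include "utils/inval.h"
#include "utils/memutils.h"
#include "utils/syscache.h"

/*
 * Sketch: each registration gets its own static flag, so restarting the
 * pgoutput plugin in the same backend does not burn another callback slot.
 */
static bool publication_callback_registered = false;
static bool relation_callbacks_registered = false;

static void
pgoutput_register_callbacks_sketch(void)
{
	if (!publication_callback_registered)
	{
		CacheRegisterSyscacheCallback(PUBLICATIONOID,
									  publication_invalidation_cb,
									  (Datum) 0);
		publication_callback_registered = true;
	}

	if (!relation_callbacks_registered)
	{
		/* init_rel_sync_cache() also registers the relcache callbacks */
		init_rel_sync_cache(CacheMemoryContext);
		relation_callbacks_registered = true;
	}
}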
Re: Refactor calculations to use instr_time
Hi,

Thanks for the review.

On Wed, 22 Feb 2023 at 05:50, Kyotaro Horiguchi wrote:
>
> PgStat_PendingStats should be included in typedefs.list.

Done.

> + * Created for accumulating wal_write_time and wal_sync_time as a instr_time
> + * but instr_time can't be used as a type where it ends up on-disk
> + * because its units may change. PgStat_WalStats type is used for
> + * in-memory/on-disk data. So, PgStat_PendingWalStats is created for
> + * accumulating intervals as a instr_time.
> + */
> +typedef struct PgStat_PendingWalStats
>
> IMHO, this comment looks somewhat off. Maybe we could try something
> like the following instead?
>
> > This struct stores wal-related durations as instr_time, which makes it
> > easier to accumulate them without requiring type conversions. Then,
> > during stats flush, they will be moved into shared stats with type
> > conversions.

Done. And I think we should explain why we didn't change PgStat_WalStats's
variable types and instead created a new struct. Maybe we can explain it in
the commit description?

> The aim of this patch is to keep using instr_time for accumulation.
> So it seems like we could do the same refactoring for
> pgStatBlockReadTime, pgStatBlockWriteTime, pgStatActiveTime and
> pgStatTransactionIdleTime. What do you think - should we include this
> additional refactoring in the same patch or make a separate one for it?

I tried a bit, but it seems the required changes for the additional
refactoring aren't small. So, I think we can create a separate patch for
those changes.

Regards,
Nazir Bilal Yavuz
Microsoft

v4-0001-Refactor-instr_time-calculations.patch
Description: Binary data
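For illustration, a minimal sketch of the accumulate-then-convert pattern being discussed; the struct and function names here are assumptions, not the patch itself.

#include "postgres.h"
#include "pgstat.h"
#include "portability/instr_time.h"

/*
 * Sketch: accumulate durations as instr_time in the hot path, and convert
 * to microseconds only when flushing into the shared PgStat_WalStats
 * counters.
 */
typedef struct PendingWalTimes
{
	instr_time	wal_write_time;
	instr_time	wal_sync_time;
} PendingWalTimes;

static PendingWalTimes pending;

static void
accumulate_wal_write_time(instr_time start)
{
	instr_time	end;

	INSTR_TIME_SET_CURRENT(end);
	/* adds (end - start) to the accumulator; no unit conversion here */
	INSTR_TIME_ACCUM_DIFF(pending.wal_write_time, end, start);
}

static void
flush_pending_wal_times(PgStat_WalStats *shared)
{
	/* unit conversion happens only at flush time */
	shared->wal_write_time += INSTR_TIME_GET_MICROSEC(pending.wal_write_time);
	shared->wal_sync_time += INSTR_TIME_GET_MICROSEC(pending.wal_sync_time);

	INSTR_TIME_SET_ZERO(pending.wal_write_time);
	INSTR_TIME_SET_ZERO(pending.wal_sync_time);
}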
Re: logical decoding and replication of sequences, take 2
On 2/22/23 03:28, Jonathan S. Katz wrote:
> Hi,
>
> On 2/16/23 10:50 AM, Tomas Vondra wrote:
>> Hi,
>>
>> Here's a rebased patch, without the last bit which is now unnecessary
>> thanks to c981d9145dea.
>
> Thanks for continuing to work on this patch! I tested the latest version
> and have some feedback/clarifications.

Thanks!

> I did some testing using a demo-app-based-on-a-real-world app I had
> conjured up[1]. This uses integer sequences as surrogate keys.
>
> In general things seemed to work, but I had a couple of
> observations/questions.
>
> 1. Sequence IDs after a "failover". I believe this is a design decision,
> but I noticed that after simulating a failover, the IDs were replicating
> from a higher value, e.g.
>
> INSERT INTO room (name) VALUES ('room 1');
> INSERT INTO room (name) VALUES ('room 2');
> INSERT INTO room (name) VALUES ('room 3');
> INSERT INTO room (name) VALUES ('room 4');
>
> The values of room_id_seq on each instance:
>
> instance 1:
>
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>           4 |      29 | t
>
> instance 2:
>
>  last_value | log_cnt | is_called
> ------------+---------+-----------
>          33 |       0 | t
>
> After the switchover on instance 2:
>
> INSERT INTO room (name) VALUES ('room 5') RETURNING id;
>
>  id
> ----
>  34
>
> I don't see this as an issue for most applications, but we should at
> least document the behavior somewhere.

Yes, this is due to how we WAL-log sequences. We don't log individual
increments, but every 32nd increment, and we log the "future" sequence state
so that after a crash/recovery we don't generate duplicates. So you do
nextval() and it returns 1, but into WAL we record 32. And there will be no
WAL records until nextval reaches 32 and needs to generate another batch.

And because logical replication relies on these WAL records, it inherits this
batching behavior with a "jump" on recovery/failover. IMHO it's OK, it works
for the "logical failover" use case, and if you need gapless sequences then
regular sequences are not the answer anyway.

It's possible to reduce the jump a bit by reducing the batch size (from 32 to
0) so that every increment is logged. But it doesn't eliminate it because of
rollbacks.

> 2. Using origin=none with nonconflicting sequences.
>
> I modified the example in [1] to set up two schemas with non-conflicting
> sequences[2], e.g. on instance 1:
>
> CREATE TABLE public.room (
>     id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 1)
>         PRIMARY KEY,
>     name text NOT NULL
> );
>
> and instance 2:
>
> CREATE TABLE public.room (
>     id int GENERATED BY DEFAULT AS IDENTITY (INCREMENT 2 START WITH 2)
>         PRIMARY KEY,
>     name text NOT NULL
> );

Well, yeah. We don't support active-active logical replication (at least not
with the built-in). You can easily get into similar issues without sequences.
Replicating a sequence overwrites the state of the sequence on the other
side, which may result in it generating duplicate values with the other node,
etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
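A simplified sketch of the batching behaviour described above. This is illustrative only, not the real sequence.c code, although SEQ_LOG_VALS really is 32 there; the struct and the callback are assumptions for the example.

#define SEQ_LOG_VALS 32

typedef struct SequenceState
{
	long		last_value;		/* last value handed out */
	int			log_cnt;		/* values still covered by the last WAL record */
} SequenceState;

/*
 * nextval() only emits a WAL record every SEQ_LOG_VALS increments, and the
 * record contains a value SEQ_LOG_VALS ahead of what was actually handed
 * out.  Crash recovery, and anything that replays WAL (including logical
 * decoding of sequences), therefore sees the "future" state, which is the
 * jump observed after a failover.
 */
static long
nextval_sketch(SequenceState *seq, void (*wal_log) (long future_value))
{
	long		result = seq->last_value + 1;

	if (seq->log_cnt == 0)
	{
		/* WAL-log a value ahead of the one we return */
		wal_log(result + SEQ_LOG_VALS);
		seq->log_cnt = SEQ_LOG_VALS;
	}

	seq->last_value = result;
	seq->log_cnt--;
	return result;
}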
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Feb 22, 2023 at 3:29 PM Masahiko Sawada wrote:
>
> On Wed, Feb 22, 2023 at 4:35 PM John Naylor wrote:
> >
> > I don't think any vacuum calls in regression tests would stress any of
> > this code very much, so it's not worth carrying the old way forward. I
> > was thinking of only doing this as a short-time sanity check for testing
> > a real-world workload.
>
> I guess that it would also be helpful at least until the GA release.
> People will be able to test them easily on their workloads or their
> custom test scenarios.

That doesn't seem useful to me. If we've done enough testing to reassure us
the new way always gives the same answer, the old way is not needed at commit
time. If there is any doubt it will always give the same answer, then the
whole patchset won't be committed.

TPC-C was just an example. It should have testing comparing the old and new
methods. If you have already done that to some degree, that might be enough.

After performance tests, I'll also try some vacuums that use the comparison
patch.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: Incorrect command tag row count for MERGE with a cross-partition update
On Tue, 21 Feb 2023 at 09:34, Alvaro Herrera wrote:
>
> > I think the best fix is to have ExecMergeMatched() pass canSetTag =
> > false to ExecUpdateAct(), so that ExecMergeMatched() takes
> > responsibility for updating estate->es_processed in all cases.
>
> Sounds sensible.

I decided it was also probably worth having a regression test covering this,
since it would be quite easy to break if the code is ever refactored.

Pushed and back-patched.

Regards,
Dean
Re: Reducing connection overhead in pg_upgrade compat check phase
> On 18 Feb 2023, at 06:46, Nathan Bossart wrote:
>
> On Fri, Feb 17, 2023 at 10:44:49PM +0100, Daniel Gustafsson wrote:
>> When adding a check to pg_upgrade a while back I noticed in a profile that
>> the cluster compatibility check phase spends a lot of time in
>> connectToServer. Some of this can be attributed to data type checks which
>> each run serially in turn, connecting to each database to run the check,
>> and this seemed like a place where we can do better.
>>
>> The attached patch moves the checks from individual functions, which each
>> loop over all databases, into a struct which is consumed by a single
>> umbrella check where all data type queries are executed against a database
>> using the same connection. This way we can amortize the connectToServer
>> overhead across more accesses to the database.
>
> This change consolidates all the data type checks, so instead of 7 separate
> loops through all the databases, there is just one. However, I wonder if
> we are leaving too much on the table, as there are a number of other
> functions that also loop over all the databases:
>
>  * get_loadable_libraries
>  * get_db_and_rel_infos
>  * report_extension_updates
>  * old_9_6_invalidate_hash_indexes
>  * check_for_isn_and_int8_passing_mismatch
>  * check_for_user_defined_postfix_ops
>  * check_for_incompatible_polymorphics
>  * check_for_tables_with_oids
>  * check_for_user_defined_encoding_conversions
>
> I suspect consolidating get_loadable_libraries, get_db_and_rel_infos, and
> report_extension_updates would be prohibitively complicated and not worth
> the effort.

Agreed, the added complexity of the code seems hard to justify unless there
are actual reports of problems.

I did experiment with reducing the allocations of namespaces and tablespaces
with a hashtable, see the attached WIP diff. There is no measurable
difference in speed, but a synthetic benchmark where allocations cannot be
reused shows reduced memory pressure. This might help on very large schemas,
but it's not worth pursuing IMO.

> old_9_6_invalidate_hash_indexes is only needed for unsupported versions, so
> that might not be worth consolidating.
> check_for_isn_and_int8_passing_mismatch only loops through all databases
> when float8_pass_by_value in the control data differs, so that might not be
> worth it, either.

Yeah, these two aren't all that interesting to spend cycles on IMO.

> The last 4 are for supported versions and, from a very quick glance, seem
> possible to consolidate. That would bring us to a total of 11 separate
> loops that we could consolidate into one. However, the data type checks
> seem to follow a nice pattern, so perhaps this is easier said than done.

There is that; refactoring the data type checks leads to removal of
duplicated code and a slight performance improvement. Refactoring the other
checks to reduce overhead would be an interesting thing to look at, but this
point in the v16 cycle might not be ideal for that.

> IIUC with the patch, pg_upgrade will immediately fail as soon as a single
> check in a database fails. I believe this differs from the current behavior
> where all matches for a given check in the cluster are logged before
> failing.

Yeah, that's wrong. Fixed.

> I wonder if it'd be better to perform all of the data type checks in all
> databases before failing so that all of the violations are reported. Else,
> users would have to run pg_upgrade, fix a violation, run pg_upgrade again,
> fix another one, etc.

I think that's better, and have changed the patch to do it that way.

One change this brings is that check.c contains version-specific checks in
the struct. Previously these were mostly contained in version.c (some, like
the 9.4 jsonb check, were in check.c), which maintained some level of
separation. Splitting the array init is of course one option, but it also
seems a tad messy. Not sure what's best, but for now I've documented it in
the array comment at least.

This version also moves the main data type check to check.c, renames some
members in the struct, moves to named initializers (as commented on by Justin
downthread), and adds some more polish here and there.

--
Daniel Gustafsson

nsphash.diff
Description: Binary data

v2-0001-pg_upgrade-run-all-data-type-checks-per-connectio.patch
Description: Binary data
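As an illustration of the "collect everything before failing" approach, here is a minimal sketch; the struct, the queries, and the control flow are assumptions made for the example, not the actual pg_upgrade patch.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct DataTypeCheck
{
	const char *description;	/* human-readable name of the check */
	const char *query;			/* SQL returning the offending columns */
} DataTypeCheck;

/* The real patch keeps such an array in check.c; these entries are made up. */
static const DataTypeCheck checks[] = {
	{"obsolete \"unknown\" columns", "SELECT ..."},
	{"columns of removed reg* types", "SELECT ..."},
};

/*
 * Run every check against every database, reusing one connection per
 * database (represented here by the dbnum handed to the callback), and only
 * report failure once all databases have been examined, so the user sees
 * every violation in a single pg_upgrade run.
 */
static bool
run_all_checks(int ndatabases,
			   int (*count_violations) (int dbnum, const DataTypeCheck *check))
{
	bool		ok = true;

	for (int dbnum = 0; dbnum < ndatabases; dbnum++)
	{
		for (size_t i = 0; i < sizeof(checks) / sizeof(checks[0]); i++)
		{
			int			n = count_violations(dbnum, &checks[i]);

			if (n > 0)
			{
				printf("database %d: %d object(s) with %s\n",
					   dbnum, n, checks[i].description);
				ok = false;		/* keep going; report everything first */
			}
		}
	}
	return ok;
}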
Re: [PATCH] Add pretty-printed XML output option
On 22.02.23 08:20, Peter Smith wrote:
> Here are some review comments for patch v15-0001
>
> FYI, the patch applies clean and tests OK for me.
>
> ==
> doc/src/sgml/datatype.sgml
>
> 1.
> XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type [ { NO INDENT | INDENT } ] )
>
> Another/shorter way to write that syntax is like below. For me, it is
> easier to read. YMMV.
>
> XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type [ [NO] INDENT ] )

Indeed simpler to read.

> ==
> src/backend/executor/execExprInterp.c
>
> 2. ExecEvalXmlExpr
>
> @@ -3829,7 +3829,8 @@ ExecEvalXmlExpr(ExprState *state, ExprEvalStep *op)
>  {
>  	Datum	   *argvalue = op->d.xmlexpr.argvalue;
>  	bool	   *argnull = op->d.xmlexpr.argnull;
> -
> +	bool		indent = op->d.xmlexpr.xexpr->indent;
> +	text	   *data;
>  	/* argument type is known to be xml */
>  	Assert(list_length(xexpr->args) == 1);
>
> Missing whitespace after the variable declarations

Whitespace added.

> 3.
> +	data = xmltotext_with_xmloption(DatumGetXmlP(value),
> +									xexpr->xmloption);
> +	if (indent)
> +		*op->resvalue = PointerGetDatum(xmlformat(data));
> +	else
> +		*op->resvalue = PointerGetDatum(data);
> +
>  }
>
> Unnecessary blank line at the end.

Blank line removed.

> ==
> src/backend/utils/adt/xml.c
>
> 4. xmlformat
>
> +#else
> +	NO_XML_SUPPORT();
> +return 0;
> +#endif
>
> Wrong indentation (return 0) in the indentation function? ;-)

Indentation corrected.

v16 attached.

Thanks a lot!
Jim

From a4fef3cdadd3d2f7abe530f5b07bf8c536689d83 Mon Sep 17 00:00:00 2001
From: Jim Jones
Date: Mon, 20 Feb 2023 23:35:22 +0100
Subject: [PATCH v16] Add pretty-printed XML output option

This patch implements the XML/SQL:2011 feature 'X069, XMLSERIALIZE: INDENT.'
It adds the options INDENT and NO INDENT (default) to the existing
xmlserialize function. It uses the indentation feature of
xmlDocDumpFormatMemory from libxml2 to format XML strings. Although the
INDENT feature is designed to work with xml strings of type DOCUMENT, this
implementation also allows the usage of CONTENT type strings as long as it
contains a well-formed xml - note the XMLOPTION_DOCUMENT in the xml_parse
call.

This patch also includes documentation, regression tests and their three
possible output files xml.out, xml_1.out and xml_2.out.
---
 doc/src/sgml/datatype.sgml            |  8 ++-
 src/backend/catalog/sql_features.txt  |  2 +-
 src/backend/executor/execExprInterp.c | 12 +++-
 src/backend/parser/gram.y             | 12 +++-
 src/backend/parser/parse_expr.c       |  1 +
 src/backend/utils/adt/xml.c           | 41 +++++++++
 src/include/nodes/parsenodes.h        |  1 +
 src/include/nodes/primnodes.h         |  1 +
 src/include/parser/kwlist.h           |  1 +
 src/include/utils/xml.h               |  1 +
 src/test/regress/expected/xml.out     | 93 +++
 src/test/regress/expected/xml_1.out   | 63 ++
 src/test/regress/expected/xml_2.out   | 93 +++
 src/test/regress/sql/xml.sql          | 23 +++
 14 files changed, 344 insertions(+), 8 deletions(-)
Re: [PATCH] Support SK_SEARCHNULL / SK_SEARCHNOTNULL for heap-only scans
On 14/02/2023 11:10, Aleksander Alekseev wrote:
> Hi Andres,
>
>> Shouldn't need to extract the column if we just want to know if it's NULL
>> (see heap_attisnull()). Afaics the value isn't accessed after this.
>
> Many thanks. Fixed.

I'm confused, what exactly is the benefit of this? What extension performs a
direct table scan bypassing the executor, searching for NULLs or not-NULLs?

If heapam can check for NULL/not-NULL more efficiently than the code that
calls it, sure, let's do this, and let's also see the performance test
results to show the benefit. But then let's also modify the caller in
nodeSeqScan.c to actually make use of it.

For tableam extensions, which may or may not support checking for NULLs, we
need to add an 'amsearchnulls' field to the table AM API.

- Heikki
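To make the SK_SEARCHNULL / SK_SEARCHNOTNULL handling concrete, here is a minimal sketch (not the patch itself; the function name and the surrounding loop are assumptions) of how a scan-key match loop can use heap_attisnull() and avoid a full attribute extraction when only the null-ness matters.

#include "postgres.h"
#include "access/htup_details.h"
#include "access/skey.h"

static bool
key_matches_tuple(HeapTuple tuple, TupleDesc tupdesc,
				  ScanKey keys, int nkeys)
{
	for (int i = 0; i < nkeys; i++)
	{
		ScanKey		key = &keys[i];
		bool		isnull = heap_attisnull(tuple, key->sk_attno, tupdesc);

		if (key->sk_flags & SK_SEARCHNULL)
		{
			if (!isnull)
				return false;
			continue;
		}
		if (key->sk_flags & SK_SEARCHNOTNULL)
		{
			if (isnull)
				return false;
			continue;
		}

		/* ordinary comparison keys would be evaluated here */
		if (isnull)
			return false;
	}
	return true;
}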
Re: refactoring relation extension and BufferAlloc(), faster COPY
On 21/02/2023 21:22, Andres Freund wrote:
> On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
>> Is it ever possible to call this without a relcache entry? WAL redo
>> functions do that with ReadBuffer, but they only extend a relation
>> implicitly, by replaying a record for a particular block.
>
> I think we should use it for crash recovery as well, but the patch doesn't
> yet. We have some gnarly code there, see the loop using P_NEW in
> XLogReadBufferExtended(). Extending the file one-by-one is a lot more
> expensive than doing it in bulk.

Hmm, XLogReadBufferExtended() could use smgrzeroextend() to fill the gap, and
then call ExtendRelationBuffered for the target page. Or the new
ExtendRelationBufferedTo() function that you mentioned.

In the common case that you load a lot of data into a relation, extending it,
and then crash, the WAL replay would still extend the relation one page at a
time, which is inefficient. Changing that would need bigger changes, to
WAL-log the relation extension as a separate WAL record, for example. I don't
think we need to solve that right now; it can be addressed separately later.

- Heikki
Re: [PATCH] Add pretty-printed XML output option
On 22.02.23 08:05, Nikolay Samokhvalov wrote:
> But is this as expected? Shouldn't it be like this:
>
> text
> 13
>
> ?

Oracle and other parsers I know also do not work well with mixed
contents.[1,2] I believe libxml2's parser does not know where to put the
newline, as mixed values can contain more than one text node:

text13 text2 text3 [3]

And applying this logic the output could look like this

.. text 13text2 text3

or even this

text 13 text2 text3

.. which doesn't seem right either. Perhaps a note about mixed contents in
the docs would make things clearer?

Thanks for the review!
Jim

1 - https://xmlpretty.com/
2 - https://www.samltool.com/prettyprint.php
3 - https://dbfiddle.uk/_CcC8h3I

smime.p7s
Description: S/MIME Cryptographic Signature
Re: meson: Non-feature feature options
On 21.02.23 17:32, Nazir Bilal Yavuz wrote:
>> I like the second approach, with a 'uuid' feature option. As you wrote
>> earlier, adding an 'auto' choice to a combo option doesn't work fully
>> like a real feature option. But we can make it behave exactly like one,
>> by checking the auto_features option.
>
> Yes, we can set it like `uuidopt = get_option('auto_features')`. However,
> if someone wants to set 'auto_features' to 'disabled' but 'uuid' to
> 'enabled' (to find at least one working uuid library); this won't be
> possible. We can add 'enabled', 'disabled' and 'auto' choices to the
> 'uuid' combo option to make all behaviours possible, but adding a 'uuid'
> feature option and a 'uuid_library' combo option seems better to me.

I think the uuid side of this is making this way too complicated. I'm content
leaving this as a manual option for now.

There is much more value in making the ssl option work automatically. So I
would welcome a patch that just makes -Dssl=auto work smoothly, perhaps using
the "trick" that Andres described.
Re: meson: Optionally disable installation of test modules
On 20.02.23 20:48, Andres Freund wrote:
> On 2023-02-20 19:43:56 +0100, Peter Eisentraut wrote:
>> I don't think any callers try to copy a directory source, so the
>> shutil.copytree() stuff isn't necessary.
>
> I'd like to use it for installing docs outside of the normal install
> target. Of course we could add the ability at a later point, but that
> seems a bit pointless back-forth to me.

I figured it could be useful as a general installation tool, but the current
script has specific command-line options for this specific purpose, so I
don't think it would work for your purpose anyway.

For the purpose here, we really just need something that does

    for src in sys.argv[1:-1]:
        shutil.copy2(src, sys.argv[-1])

But we need to call it twice for different sets of files and destinations,
and since we can't have more than one command per test, we either need to
write two "tests" or write a wrapper script like the one we have here.

I don't know what the best way to slice this is, but it's not a lot of code
that we couldn't move around again in the future.
Re: Fix GUC_NO_SHOW_ALL test scenario in 003_check_guc.pl
On 14/02/2023 03:42, Kyotaro Horiguchi wrote:
> At Mon, 13 Feb 2023 12:18:07 +0900, Michael Paquier wrote in
>> On Mon, Feb 13, 2023 at 11:27:58AM +0900, Kyotaro Horiguchi wrote:
>>> I think currently the output of --describe-config can be used only for
>>> consulting while editing a (possibly broken) config file. Thus I think
>>> it's no use showing GUC_DISALLOW_IN_FILE items there unless we use
>>> help_config() for an on-session use.
>>>
>>> On the other hand, don't we need to remove the condition
>>> GUC_NOT_IN_SAMPLE from displayStruct? I think that help_config() should
>>> show a value if it is marked as !GUC_DISALLOW_IN_FILE even if it is
>>> GUC_NOT_IN_SAMPLE. I'm not sure whether there's any variable that is
>>> marked that way, though.
>>
>> As in marked with GUC_NOT_IN_SAMPLE but not GUC_DISALLOW_IN_FILE? There
>> are quite a lot, developer GUCs being one (say ignore_invalid_pages). We
>> don't want to list them in the sample file so that common users don't
>> play with them, but they still make sense if listed in a file.
>
> Ah, right. I think I faintly had them in my mind.
>
>> If you add a check meaning that GUC_DISALLOW_IN_FILE implies
>> GUC_NOT_IN_SAMPLE, where one change would need to be applied to
>> config_file as all the other GUC_DISALLOW_IN_FILE GUCs already do that,
>> you could remove GUC_DISALLOW_IN_FILE. However, GUC_NOT_IN_SAMPLE should
>> be around so as not to expose options we don't want common users to know
>> too much about.
>
> Okay, I thought that "postgres --help-config" was a sort of developer
> option, but your explanation above makes sense.
>
>> The question about how much people rely on --describe-config these days
>> is a good one, so perhaps there could be an argument in removing
>
> Yeah, that's the reason I thought it was a developer option...
>
>> GUC_NOT_IN_SAMPLE from the set. TBH, I would be really surprised if
>> anybody able to use a developer option writes a configuration file in an
>> incorrect format and needs to use this option, though :)
>
> Hmm. I didn't directly link GUC_NOT_IN_SAMPLE to being a developer option.
> But on second thought, it seems that it is. So, the current code looks
> good to me now. Thanks for the explanation.

The first patch was committed, and there's not much enthusiasm for
disallowing (GUC_DISALLOW_IN_FILE && !GUC_NOT_IN_SAMPLE), so I am marking
this as Committed in the commitfest app.

Thanks!

- Heikki
Re: ANY_VALUE aggregate
On 09.02.23 10:42, Vik Fearing wrote:
> v4 attached.
>
> I am putting this back to Needs Review in the commitfest app, but these
> changes were editorial so it is probably RfC like Peter had set it. I will
> let you be the judge of that.

I have committed this. I made a few small last-minute tweaks:

- Changed "non-deterministic" to "arbitrary", as suggested by Maciek Sakrejda
  nearby. This seemed like a handier and less jargony term.

- Removed trailing whitespace in misc.c.

- Changed the function description in pg_proc.dat. Apparently, we are using
  'aggregate transition function' there for all aggregate functions (instead
  of 'any_value transition function' etc.).

- Made the tests a bit more interesting by feeding in more rows and a mix of
  null and nonnull values.
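For context, the transition function behind such an aggregate can be tiny. A minimal sketch, written here from memory rather than copied from the commit, so treat the details as an assumption:

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(any_value_transfn_sketch);

/*
 * Because the aggregate is declared STRICT, the first non-null input
 * automatically becomes the transition state; on every later call we simply
 * keep the state we already have, so the result is one arbitrary non-null
 * input value.
 */
Datum
any_value_transfn_sketch(PG_FUNCTION_ARGS)
{
	PG_RETURN_DATUM(PG_GETARG_DATUM(0));
}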
Re: Amcheck verification of GiST and GIN
Hi,

On Thu, Feb 02, 2023 at 12:56:47PM -0800, Nikolay Samokhvalov wrote:
> On Thu, Feb 2, 2023 at 12:43 PM Peter Geoghegan wrote:
> > I think that that problem should be solved at a higher level, in the
> > program that runs amcheck. Note that pg_amcheck will already do this
> > for B-Tree indexes.
>
> That's a great tool, and it's great it supports parallelization, very
> useful on large machines.

Right, but unfortunately not an option on managed services. It's clear that
this restriction should not be a general guideline for Postgres development,
but it makes the amcheck extension (which is now shipped everywhere due to
being in-core, I believe) somewhat less useful for the use case of checking
your whole database for corruption.

Michael
Re: Weird failure with latches in curculio on v15
On Tue, Feb 21, 2023 at 5:50 PM Nathan Bossart wrote:
> On Tue, Feb 21, 2023 at 09:03:27AM +0900, Michael Paquier wrote:
> > Perhaps beginning a new thread with a patch and a summary would be
> > better at this stage? Another thing I am wondering is if it could be
> > possible to test that rather reliably. I have been playing with a few
> > scenarios like holding the system() call for a bit with hardcoded
> > sleep()s, without much success. I'll try harder on that part.. It's
> > been mentioned as well that we could just move away from system() in
> > the long-term.
>
> I'm happy to create a new thread if needed, but I can't tell if there is
> any interest in this stopgap/back-branch fix. Perhaps we should just jump
> straight to the long-term fix that Thomas is looking into.

Unfortunately the latch-friendly subprocess module proposal I was talking
about would be for 17. I may post a thread fairly soon with design ideas +
list of problems and decision points as I see them, and hopefully some sketch
code, but it won't be a proposal for [/me checks calendar] next week's
commitfest, and probably wouldn't be appropriate in a final commitfest
anyway, and I also have some other existing stuff to clear first. So please
do continue with the stopgap ideas.

BTW here's an idea (untested) about how to reproduce the problem. You could
copy the source from a system() implementation, call it doomed_system(), and
insert kill(-getppid(), SIGQUIT) in between sigprocmask(SIG_SETMASK, &omask,
NULL) and exec*(). Parent and self will handle the signal and both reach the
proc_exit().

The systems that failed are running code like this:

https://github.com/openbsd/src/blob/master/lib/libc/stdlib/system.c
https://github.com/DragonFlyBSD/DragonFlyBSD/blob/master/lib/libc/stdlib/system.c

I'm pretty sure these other implementations could fail in just the same way
(they restore the handler before unblocking, so can run it just before exec()
replaces the image):

https://github.com/freebsd/freebsd-src/blob/main/lib/libc/stdlib/system.c
https://github.com/lattera/glibc/blob/master/sysdeps/posix/system.c

The glibc one is a bit busier and, huh, has a lock (I guess maybe
deadlockable if proc_exit() also calls system(), but hopefully it doesn't),
and uses fork() instead of vfork(), but I don't think that's a material
difference here (with fork(), parent and child run concurrently, while with
vfork() the parent is suspended until the child exits or execs, and then
processes its pending signals).
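A rough, untested sketch of that reproducer idea follows, with the signal handling heavily simplified compared to a real system() implementation (no SIGINT/SIGQUIT ignoring in the parent, no vfork), so treat every detail as an assumption; it only illustrates where the race lives.

#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int
doomed_system(const char *command)
{
	sigset_t	mask,
				omask;
	pid_t		pid;
	int			status = -1;

	sigemptyset(&mask);
	sigaddset(&mask, SIGCHLD);
	sigprocmask(SIG_BLOCK, &mask, &omask);

	pid = fork();
	if (pid == 0)
	{
		/* child: restore the original signal mask, as system() does */
		sigprocmask(SIG_SETMASK, &omask, NULL);

		/* simulate the unlucky timing: signal handlers run before exec() */
		kill(-getppid(), SIGQUIT);

		execl("/bin/sh", "sh", "-c", command, (char *) NULL);
		_exit(127);
	}
	else if (pid < 0)
	{
		sigprocmask(SIG_SETMASK, &omask, NULL);
		return -1;
	}

	/* parent: wait for the child, then restore the mask */
	while (waitpid(pid, &status, 0) < 0)
	{
		if (errno != EINTR)
		{
			status = -1;
			break;
		}
	}
	sigprocmask(SIG_SETMASK, &omask, NULL);
	return status;
}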
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Feb 22, 2023 at 4:35 PM John Naylor wrote:
>
> On Wed, Feb 22, 2023 at 1:16 PM Masahiko Sawada wrote:
> >
> > On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada wrote:
> > >
> > > Yeah, I did a similar thing in an earlier version of tidstore patch.
>
> Okay, if you had checks against the old array lookup in development, that
> gives us better confidence.
>
> > > Since we're trying to introduce two new components: radix tree and
> > > tidstore, I sometimes find it hard to investigate failures happening
> > > during lazy (parallel) vacuum due to a bug either in tidstore or radix
> > > tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
> > > it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
> > > with another macro, say DEBUG_TIDSTORE) builds. For example, TidStore
> > > stores tids to both the radix tree and array, and checks if the
> > > results match on lookup or iteration. It will use more memory but it
> > > would not be a big problem in USE_ASSERT_CHECKING builds. It would
> > > also be great if we can enable such checks on some bf animals.
> >
> > I've tried this idea. Enabling this check on all debug builds (i.e.,
> > with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
> > special macro for that, TIDSTORE_DEBUG. I think we can define this
> > macro on some bf animals (or possibly a new bf animal).
>
> I don't think any vacuum calls in regression tests would stress any of
> this code very much, so it's not worth carrying the old way forward. I
> was thinking of only doing this as a short-time sanity check for testing
> a real-world workload.

I guess that it would also be helpful at least until the GA release. People
will be able to test them easily on their workloads or their custom test
scenarios.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
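As a concrete illustration of that debug cross-check idea, here is a minimal sketch; all type and function names are made up, and this is not the tidstore patch itself.

#include "postgres.h"

#ifdef TIDSTORE_DEBUG

typedef struct TidStoreDebug
{
	uint64	   *array;			/* slow but obviously-correct copy */
	int			nitems;
} TidStoreDebug;

/* every insertion also goes into the plain array */
static void
tidstore_debug_insert(TidStoreDebug *ts, uint64 key)
{
	ts->array[ts->nitems++] = key;
}

/* lookups assert that the radix-tree answer matches the array answer */
static bool
tidstore_debug_lookup(TidStoreDebug *ts, uint64 key, bool radix_result)
{
	bool		array_result = false;

	for (int i = 0; i < ts->nitems; i++)
	{
		if (ts->array[i] == key)
		{
			array_result = true;
			break;
		}
	}

	Assert(array_result == radix_result);
	return radix_result;
}

#endif							/* TIDSTORE_DEBUG */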
pgindent vs. git whitespace check
Commit e4602483e95 accidentally introduced a situation where pgindent
disagrees with the git whitespace check. The code is

    conn = libpqsrv_connect_params(keywords, values,
                                    /* expand_dbname = */ false,
                                   PG_WAIT_EXTENSION);

where the current source file has 4 spaces before the /*, and the whitespace
check says that that should be a tab. I think it should actually be 3 spaces,
so that the "/*..." lines up with the "keywords..." and "PG_WAIT..." above
and below.

I suppose this isn't going to be a quick fix in pgindent, but if someone is
keeping track, maybe this could be added to the to-consider list.

In the meantime, I suggest we work around this, perhaps by

    conn = libpqsrv_connect_params(keywords, values, /* expand_dbname = */ false,
                                   PG_WAIT_EXTENSION);

which appears to be robust for both camps.