Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY
On Mon, Jun 17, 2024 at 07:00:51PM +0200, Michail Nikolaev wrote:
> The same issue may happen in case of CREATE/DROP INDEX CONCURRENTLY as well.

While looking at all that, I've been also curious about this specific point, and it is indeed possible to finish in a state where a duplicate key would be found in one of indexes selected by the executor during an INSERT ON CONFLICT while a concurrent set of CICs and DICs are run, so you don't really need a REINDEX. See for example the attached test.
--
Michael

diff --git a/src/test/modules/test_misc/t/009_cdic_concurrently_unique_fail.pl b/src/test/modules/test_misc/t/009_cdic_concurrently_unique_fail.pl
new file mode 100644
index 00..bd37c797a3
--- /dev/null
+++ b/src/test/modules/test_misc/t/009_cdic_concurrently_unique_fail.pl
@@ -0,0 +1,46 @@
+
+# Copyright (c) 2024-2024, PostgreSQL Global Development Group
+
+# Test CREATE/DROP INDEX CONCURRENTLY with concurrent INSERT
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->start;
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE UNIQUE INDEX tbl_pkey_2 ON tbl (i)));
+
+$node->pgbench(
+	'--no-vacuum --client=100 -j 2 --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent INSERTs, UPDATES and CIC/DIC',
+	{
+		'01_updates' => q(
+INSERT INTO tbl VALUES(13,now()) ON CONFLICT(i) DO UPDATE SET updated_at = now();
+		),
+		'02_reindex' => q(
+SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+\if :gotlock
+	DROP INDEX CONCURRENTLY tbl_pkey_2;
+	CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_2 ON tbl (i);
+	SELECT pg_advisory_unlock(42);
+\endif
+		),
+	});
+$node->stop;
+done_testing();
Re: Logical Replication of sequences
On Thu, 20 Jun 2024 at 18:45, Amit Kapila wrote: > > On Wed, Jun 19, 2024 at 8:33 PM vignesh C wrote: > > > > On Tue, 18 Jun 2024 at 16:10, Amit Kapila wrote: > > > > > > > > > Agreed and I am not sure which is better because there is a value in > > > keeping the state name the same for both sequences and tables. We > > > probably need more comments in code and doc updates to make the > > > behavior clear. We can start with the sequence state as 'init' for > > > 'needs-to-be-synced' and 'ready' for 'synced' and can change if others > > > feel so during the review. > > > > Here is a patch which does the sequence synchronization in the > > following lines from the above discussion: > > > > Thanks for summarizing the points discussed. I would like to confirm > whether the patch replicates new sequences that are created > implicitly/explicitly for a publication defined as ALL SEQUENCES. Currently, for a FOR ALL SEQUENCES publication, both explicitly created sequences and implicitly created sequences will be synchronized during the creation of the subscription (using CREATE SUBSCRIPTION) and when refreshing publication sequences (using ALTER SUBSCRIPTION ... REFRESH PUBLICATION SEQUENCES). Therefore, the explicitly created sequence seq1: CREATE SEQUENCE seq1; and the implicitly created sequence seq_test2_c2_seq for the seq_test2 table: CREATE TABLE seq_test2 (c1 int, c2 SERIAL); will both be synchronized. Regards, Vignesh
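A minimal sketch of the scenario described above, using the syntax proposed by this patch set (FOR ALL SEQUENCES and REFRESH PUBLICATION SEQUENCES are not part of released PostgreSQL, and the connection string is a placeholder):

-- publisher
CREATE SEQUENCE seq1;                        -- explicitly created sequence
CREATE TABLE seq_test2 (c1 int, c2 SERIAL);  -- implicitly creates seq_test2_c2_seq
CREATE PUBLICATION pub_all_seq FOR ALL SEQUENCES;

-- subscriber: both sequences are synchronized when the subscription is created ...
CREATE SUBSCRIPTION sub1
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub_all_seq;
-- ... or when the sequence list is refreshed later
ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION SEQUENCES;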
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Hi, Melanie! I'm glad to hear that you have found the root cause of the problem) Thank you for that! On 21.06.2024 02:42, Melanie Plageman wrote: Hi, If vacuum fails to remove a tuple with xmax older than VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed, it will ERROR out when determining whether or not to freeze the tuple with "cannot freeze committed xmax". In back branches starting with 14, failing to remove tuples older than OldestXmin during pruning caused vacuum to infinitely loop in lazy_scan_prune(), as investigated on this [1] thread. On master, after 1ccc1e05ae removed the retry loop in lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang could no longer happen, but we can still attempt to freeze dead tuples visibly killed before OldestXmin -- resulting in an ERROR. Pruning may fail to remove dead tuples with xmax before OldestXmin if the tuple is not considered removable by GlobalVisState. For vacuum, the GlobalVisState is initially calculated at the beginning of vacuuming the relation -- at the same time and with the same value as VacuumCutoffs->OldestXmin. A backend's GlobalVisState may be updated again when it is accessed if a new snapshot is taken or if something caused ComputeXidHorizons() to be called. This can happen, for example, at the end of a round of index vacuuming when GetOldestNonRemovableTransactionId() is called. Normally this may result in GlobalVisState's horizon moving forward -- potentially allowing more dead tuples to be removed. However, if a disconnected standby with a running transaction older than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum initially calculates GlobalVisState and OldestXmin but before GlobalVisState is updated, the value of GlobalVisState->maybe_needed could go backwards. If this happens in the middle of vacuum's first pruning and freezing pass, it is possible that pruning/freezing could then encounter a tuple whose xmax is younger than GlobalVisState->maybe_needed and older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum() would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it. But the heap_pre_freeze_checks() would ERROR out with "cannot freeze committed xmax". This check is to avoid freezing dead tuples. We can fix this by always removing tuples considered dead before VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby has a transaction that sees that tuple as alive, because it will simply wait to replay the removal until it would be correct to do so or recovery conflict handling will cancel the transaction that sees the tuple as alive and allow replay to continue. This is an interesting and difficult case) I noticed that when initializing the cluster, in my opinion, we provide excessive freezing. Initialization takes a long time, which can lead, for example, to longer test execution. I got rid of this by adding the check that OldestMxact is not FirstMultiXactId, and it works fine. if (prstate->cutoffs && TransactionIdIsValid(prstate->cutoffs->OldestXmin) && prstate->cutoffs->OldestMxact != FirstMultiXactId && NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin)) return HEAPTUPLE_DEAD; Can I keep it? Attached is the suggested fix for master plus a repro. I wrote it as a recovery suite TAP test, but I am _not_ proposing we add it to the ongoing test suite. It is, amongst other things, definitely prone to flaking. I also had to use loads of data to force two index vacuuming passes now that we have TIDStore, so it is a slow test.
If you want to run the repro with meson, you'll have to add 't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with: meson test postgresql:recovery / recovery/099_vacuum_hang If you use autotools, you can run it with: make check PROVE_TESTS="t/099_vacuum_hang.pl" The repro forces a round of index vacuuming after the standby reconnects and before pruning a dead tuple whose xmax is older than OldestXmin. At the end of the round of index vacuuming, _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId(), thereby updating the backend's GlobalVisState and moving maybe_needed backwards. Then vacuum's first pass will continue with pruning and find our later inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin. I make sure that the standby reconnects between vacuum_get_cutoffs() and pruning because I have a cursor on the page keeping VACUUM FREEZE from getting a cleanup lock. See the repro for step-by-step explanations of how it works. I have a modified version of this that repros the infinite loop on 14-16 with substantially less data. See it here [2]. Also, the repro attached to this mail won't work on 14 and 15 because of changes to background_psql. [1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee [2] https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4
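As a rough illustration of the cursor trick mentioned above (this is not the actual test, which also needs a standby with hot_standby_feedback and careful timing; the table name here is made up):

-- session 1: pin the page so VACUUM (FREEZE) cannot get a cleanup lock on it
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM vac_test;
FETCH FROM c;   -- keeps the scan's current buffer pinned

-- session 2: VACUUM (FREEZE) vac_test;  -- has to wait for the cleanup lock,
-- which opens the window in which the standby reconnects in the real repro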
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On 21/06/2024 03:02, Peter Geoghegan wrote: On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman wrote: If vacuum fails to remove a tuple with xmax older than VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed, it will ERROR out when determining whether or not to freeze the tuple with "cannot freeze committed xmax". In back branches starting with 14, failing to remove tuples older than OldestXmin during pruning caused vacuum to infinitely loop in lazy_scan_prune(), as investigated on this [1] thread. This is a great summary. +1 We can fix this by always removing tuples considered dead before VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby has a transaction that sees that tuple as alive, because it will simply wait to replay the removal until it would be correct to do so or recovery conflict handling will cancel the transaction that sees the tuple as alive and allow replay to continue. I think that this is the right general approach. +1 The repro forces a round of index vacuuming after the standby reconnects and before pruning a dead tuple whose xmax is older than OldestXmin. At the end of the round of index vacuuming, _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId(), thereby updating the backend's GlobalVisState and moving maybe_needed backwards. Right. I saw details exactly consistent with this when I used GDB against a production instance. I'm glad that you were able to come up with a repro that involves exactly the same basic elements, including index page deletion. Would it be possible to make it robust so that we could always run it with "make check"? This seems like an important corner case to regression test. -- Heikki Linnakangas Neon (https://neon.tech)
Re: New function normal_rand_array function to contrib/tablefunc.
It looks useful; for example, it can be used in sorting tests to make them more interesting. I just have one question. Why are you using SRF_IS_FIRSTCALL() and not _PG_init? Best regards, Stepan Neretin.
Re: Proposal: Division operator for (interval / interval => double precision)
On Sun, 2024-06-23 at 17:57 -0700, Gurjeet Singh wrote: > Is there a desire to have a division operator / that takes dividend > and divisor of types interval, and results in a quotient of type > double precision. > > This would be helpful in calculating how many times the divisor > interval can fit into the dividend interval. > > To complement this division operator, it would be desirable to also > have a remainder operator %. > > For example, > > ('365 days'::interval / '5 days'::interval) => 73 > ('365 days'::interval % '5 days'::interval) => 0 > > ('365 days'::interval / '3 days'::interval) => 121 > ('365 days'::interval % '3 days'::interval) => 2 I think that is a good idea in principle, but I have one complaint, and one thing should be discussed. The complaint is that the result should be double precision or numeric. I'd want the result of '1 minute' / '8 seconds' to be 7.5. That would match how the multiplication operator works. What should be settled is how to handle divisions that are not well defined. For example, what is '1 year' / '1 day'? - 365.24217, because that is the number of solar days in a solar year? - 365, because we don't consider leap years? - 360, because we use the usual conversion of 1 month -> 30 days? Yours, Laurenz Albe
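For illustration, the double-precision behaviour described above can already be approximated by dividing the intervals' epoch values (this is only an emulation, not the proposed operator):

SELECT EXTRACT(epoch FROM '1 minute'::interval)
     / EXTRACT(epoch FROM '8 seconds'::interval);        -- 7.5
SELECT floor(EXTRACT(epoch FROM '365 days'::interval)
           / EXTRACT(epoch FROM '5 days'::interval));    -- 73
SELECT EXTRACT(epoch FROM '365 days'::interval)::numeric
     % EXTRACT(epoch FROM '3 days'::interval)::numeric;  -- 172800 seconds, i.e. 2 days

Note that EXTRACT(epoch FROM ...) resolves '1 year' to 365.25 days and '1 month' to 30 days, which is yet another possible answer to the question above.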
Re: Meson far from ready on Windows
Hi Tristan, On Fri, 21 Jun 2024 at 18:16, Tristan Partin wrote: > > Hey Dave, > > I'm a maintainer for Meson, and am happy to help you in any way that > I reasonably can. > Thank you! > > Let's start with the state of Windows support in Meson. If I were to > rank Meson support for platforms, I would do something like this: > > - Linux > - BSDs > - Solaris/IllumOS > - ... > - Apple > - Windows > > As you can see Windows is the bottom of the totem pole. We don't have > Windows people coming along to contribute very often for whatever > reason. Thus admittedly, Windows support can be very lackluster at > times. > > Meson is not a project which sees a lot of funding. (Do any build > tools?) The projects that support Meson in any way are Mesa and > GStreamer, which don't have a lot of incentive to do anything with > Windows, generally. > > I'm not even sure any of the regular contributors to Meson have > Windows development machines. I surely don't have access to a Windows > machine. > To be very clear, my comments - in particular the subject line of this thread - are not referring to Meson itself, rather our use of it on Windows. I've been quite impressed with Meson in general, and am coming to like it a lot. > > All that being said, I would like to help you solve your Windows > dependencies issue, or at least mitigate them. I think it is probably > best to just look at one dependency at a time. Here is how lz4 is picked > up in the Postgres Meson build: > > > lz4opt = get_option('lz4') > > if not lz4opt.disabled() > > lz4 = dependency('liblz4', required: lz4opt) > > > > if lz4.found() > > cdata.set('USE_LZ4', 1) > > cdata.set('HAVE_LIBLZ4', 1) > > endif > > > > else > > lz4 = not_found_dep > > endif > > As you are well aware, dependency() looks largely at pkgconfig and cmake > to find the dependencies. In your case, that is obviously not working. > > I think there are two ways to solve your problem. A first solution would > look like this: > > > lz4opt = get_option('lz4') > > if not lz4opt.disabled() > > lz4 = dependency('liblz4', required: false) > > if not lz4.found() > > lz4 = cc.find_library('lz4', required: lz4opt, dirs: extra_lib_dirs) > > endif > > > > if lz4.found() > > cdata.set('USE_LZ4', 1) > > cdata.set('HAVE_LIBLZ4', 1) > > end > > else > > lz4 = not_found_dep > > endif > Yes, that's the pattern I think we should generally be using: - It supports the design goals, allowing for configurations we don't know about to be communicated through pkgconfig or cmake files. - It provides a fallback method to detect the dependencies as we do in the old MSVC build system, which should work with most dependencies built with their "standard" configuration on Windows. To address Andres' concerns around mis-detection of dependencies, or other oddities such as required compiler flags not being included, I would suggest that a) that's happened very rarely, if ever, in the past, and b) we can always spit out an obvious warning if we've not been able to use cmake or pkgconfig for any particular dependencies. > > Another solution that could work alongside the previous suggestion is to > use Meson subprojects[0] for managing Postgres dependencies. 
I don't > know if we would want this in the Postgres repo or a patch that > downstream packagers would need to apply, but essentially, add the wrap > file: > > > [wrap-file] > > directory = lz4-1.9.4 > > source_url = https://github.com/lz4/lz4/archive/v1.9.4.tar.gz > > source_filename = lz4-1.9.4.tgz > > source_hash = > 0b0e3aa07c8c063ddf40b082bdf7e37a1562bda40a0ff5272957f3e987e0e54b > > patch_filename = lz4_1.9.4-2_patch.zip > > patch_url = https://wrapdb.mesonbuild.com/v2/lz4_1.9.4-2/get_patch > > patch_hash = > 4f33456cce986167d23faf5d28a128e773746c10789950475d2155a7914630fb > > wrapdb_version = 1.9.4-2 > > > > [provide] > > liblz4 = liblz4_dep > > into subprojects/lz4.wrap, and Meson should be able to automagically > pick up the dependency. Do this for all the projects that Postgres > depends on, and you'll have an entire build managed by Meson. Note that > Meson subprojects don't have to use Meson themselves. They can also use > CMake[1] or Autotools[2], but your results may vary. > That's certainly interesting functionality. I'm not sure we'd want to try to use it here, simply because building all of PostgreSQL's dependencies on Windows requires multiple different toolchains and environments and is not trivial to setup. That's largely why I started working on https://github.com/dpage/winpgbuild. Thanks! > > Happy to hear your thoughts. I think if our goal is to enable more > people to work on Postgres, we should probably add subproject wraps to > the source tree, but we also need to be forgiving like in the Meson DSL > snippet above. > > Let me know your thoughts! > > [0]: https://mesonbuild.com/Wrap-dependency-system-manual.html > [1]: > https://github.com/hse-project/hse/blob/6d5207f88044a3bd9b3539260074395317e2
Re: Proposal: Division operator for (interval / interval => double precision)
On Mon, Jun 24, 2024 at 2:04 PM Laurenz Albe wrote: > On Sun, 2024-06-23 at 17:57 -0700, Gurjeet Singh wrote: > > Is there a desire to have a division operator / that takes dividend > > and divisor of types interval, and results in a quotient of type > > double precision. > > > > This would be helpful in calculating how many times the divisor > > interval can fit into the dividend interval. > > > > To complement this division operator, it would be desirable to also > > have a remainder operator %. > > > > For example, > > > > ('365 days'::interval / '5 days'::interval) => 73 > > ('365 days'::interval % '5 days'::interval) => 0 > > > > ('365 days'::interval / '3 days'::interval) => 121 > > ('365 days'::interval % '3 days'::interval) => 2 > > I think that is a good idea in principle, but I have one complaint, > and one thing should be discussed. > > The complaint is that the result should be double precision or numeric. > I'd want the result of '1 minute' / '8 seconds' to be 7.5. > That would match how the multiplication operator works. > > What should be settled is how to handle divisions that are not well > defined. > For example, what is '1 year' / '1 day'? > - 365.24217, because that is the number of solar days in a solar year? > - 365, because we don't consider leap years? > - 360, because we use the usual conversion of 1 month -> 30 days? > We will need to go back to first principles, I guess. Result of division is quotient, which is how many times a divisor can be subtracted from dividend, and remainder, which is the what remains after so many subtractions. Since day to hours and month to days conversions are not constants, interval/interval will result in an integer quotient and interval remainder. That looks painful. -- Best Wishes, Ashutosh Bapat
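For what it's worth, the existing interval comparison operators already commit to the "1 month = 30 days" and "1 day = 24 hours" conventions, which is part of what makes defining a quotient awkward:

SELECT '1 month'::interval = '30 days'::interval;   -- true
SELECT '1 day'::interval   = '24 hours'::interval;  -- true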
Re: Meson far from ready on Windows
Hi On Sat, 22 Jun 2024 at 17:32, Andres Freund wrote: > > I don't think it's unreasonable to not support static linking, but I take > > your point. > > Separately from this thread: ISTM that on windows it'd be quite beneficial > to > link a few things statically, given how annoying dealing with dlls can be? > There's also the perf issue addressed further down. > Dealing with DLLs largely just boils down to copying them into the right place when packaging. The perf issue is a much more compelling reason to look at static linking imho. > > > > My assumption all along was that Meson would replace autoconf etc. before > > anything happened with MSVC, precisely because that's the type of > > environment all the Postgres devs work in primarily. Instead we seem to > > have taken what I think is a flawed approach of entirely replacing the > > build system on the platform none of the devs use, whilst leaving the new > > system as an experimental option on the platforms it will have had orders > > of magnitude more testing. > > The old system was a major bottleneck. For one, there was no way to run all > tests. And even the tests that one could run, would run serially, leading > to > exceedingly long tests times. While that could partially be addressed by > having both buildsystems in parallel, the old one would frequently break > in a > way that one couldn't reproduce on other systems. And resource wise it > wasn't > feasible to test both old and new system for cfbot/CI. > Hmm, I've found that running the tests under Meson takes notably longer than the old system - maybe 5 - 10x longer ("meson test" vs. "vcregress check"). I haven't yet put any effort into figuring out a cause for that yet. > FWIW, dynamic linking has a noticeable overhead on other platforms too. A > non-dependencies-enabled postgres can do about 2x the > connections-per-second > than a fully kitted out postgres can (basically due to more memory mapping > metadata being copied). But on windows the overhead is larger because so > much > more happens for every new connections, including loading all dlls from > scratch. > > I suspect linking a few libraries statically would be quite worth it on > windows. On other platforms it'd be quite inadvisable to statically link > libraries, due to security updates, but for stuff like the EDB windows > installer dynamic linking doesn't really help with that afaict? > Correct - we're shipping the dependencies ourselves, so we have to rewrap/retest anyway. -- Dave Page pgAdmin: https://www.pgadmin.org PostgreSQL: https://www.postgresql.org EDB: https://www.enterprisedb.com
Re: Partial aggregates pushdown
On Sun, 23 Jun 2024 at 10:24, fujii.y...@df.mitsubishielectric.co.jp wrote: > I attached the POC patch(the second one) which supports avg(int8) whose > standard format is _numeric type. Okay, very interesting. So instead of defining the serialization/deserialization functions to text/binary, you serialize the internal type to an existing non-internal type, which then in turn gets serialized to text. In the specific case of avg(int8) this is done to an array of numeric (with length 2). > However, I do not agree that I modify the internal transtype to the native > data type. The reasons are the following three. > 1. Generality > I believe we should develop a method that can theoretically apply to any > aggregate function, even if we cannot implement it immediately. However, I > find it exceptionally challenging to demonstrate that any internal transtype > can be universally converted to a native data type for aggregate functions > that are parallel-safe under the current parallel query feature. > Specifically, proving this for user-defined aggregate functions presents a > significant difficulty in my view. > On the other hand, I think that the usage of export and import functions can > theoretically apply to any aggregate functions. The only thing required when doing CREATE TYPE is having an INPUT and OUTPUT function for the type, which (de)serialize the type to text format. As far as I can tell by definition that requirement should be fine for any aggregates that we can do partial aggregation pushdown for. To clarify I'm not suggesting we should change any of the internal representation of the type for the current internal aggregates. I'm suggesting we create new native types (i.e. do CREATE TYPE) for those internal representations and then use the name of that type instead of internal. > 2. Amount of codes. > It could need more codes. I think it would be about the same as your proposal. Instead of serializing to an intermediary existing type you serialize to string straight away. I think it would probably be slightly less code actually, since you don't have to add code to handle the new aggpartialexportfn and aggpartialimportfn columns. > 3. Concern about performance > I'm concerned that altering the current internal data types could impact > performance. As explained above in my proposal all the aggregation code would remain unchanged, only new native types will be added. Thus performance won't be affected, because all aggregation code will be the same. The only thing that's changed is that the internal type now has a name and an INPUT and OUTPUT function. > I know. In my proposal, the standard format is not seriarized data by > serialfn, instead, is text or other native data type. > Just to clarify, I'm writing this to avoid any potential misunderstanding. Ah alright, that definitely clarifies the proposal. I was looking at the latest patch file on the thread and that one was still using serialfn. Your new one indeed doesn't, so this is fine. > Basically responding to your advice, > for now, I prepare two POC patches. Great! I definitely think this makes the review/discussion easier. > The first supports case a, currently covering only avg(int4) and other > aggregate functions that do not require import or export functions, such as > min, max, and count. Not a full review but some initial notes: 1. Why does this patch introduce aggpartialpushdownsafe? I'd have expected that any type with a non-pseudo/internal type as aggtranstype would be safe to partially push down. 2. 
It seems the int4_avg_import function shouldn't be part of this patch (but maybe of a future one). 3. I think splitting this patch in two pieces would make it even easier to review: First adding support for the new PARTIAL_AGGREGATE keyword (adds the new feature). Second, using PARTIAL_AGGREGATE in postgres_fdw (starts using the new feature). Citus would only need the first patch not the second one, so I think the PARTIAL_AGGREGATE feature has merit to be added on its own, even without the postgres_fdw usage. 4. Related to 3, I think it would be good to have some tests of PARTIAL_AGGREGATE that don't involve postgres_fdw at all. I also spotted some comments too that mention FDW, even though they apply to the "pure" PARTIAL_AGGREGATE code. 5. This comment now seems incorrect: -* Apply the agg's finalfn if one is provided, else return transValue. +* If the agg's finalfn is provided and PARTIAL_AGGREGATE keyword is +* not specified, apply the agg's finalfn. +* If PARTIAL_AGGREGATE keyword is specified and the transValue type +* is internal, apply the agg's serialfn. In this case the agg's +* serialfn must not be invalid. Otherwise return transValue. 6. These errors are not on purpose afaict (if they are a comment in the test would be good to explain why) +SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b ORDER BY 1; +ERROR: could not connect to server "loopback" +DETAIL: inval
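As a small illustration of the serialization idea discussed above (not the patch's actual code): the transition state of avg(int8) is essentially a (count, sum) pair, which can be carried as a two-element numeric array that another node can parse and combine later:

SELECT ARRAY[count(i), sum(i)]::numeric[] AS partial_state,  -- {10,55}
       sum(i)::numeric / count(i)         AS finalized_avg   -- 5.5
FROM generate_series(1, 10) AS s(i);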
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Mon, 24 Jun 2024 at 04:38, wrote: > > In my local PoC patch, I have modified the output as follows, what do you > think? > > =# EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM test WHERE id1 = 1 AND id2 = 101; >QUERY PLAN > - > Index Scan using test_idx on ikedamsh.test (cost=0.42..8.45 rows=1 > width=18) (actual time=0.082..0.086 rows=1 loops=1) >Output: id1, id2, id3, value >Index Cond: ((test.id1 = 1) AND (test.id2 = 101)) -- If it’s efficient, > the output won’t change. > Planning Time: 5.088 ms > Execution Time: 0.162 ms > (5 rows) > > =# EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM test WHERE id1 = 1 AND id3 = 101; > QUERY PLAN > --- > Index Scan using test_idx on ikedamsh.test (cost=0.42..12630.10 rows=1 > width=18) (actual time=0.175..279.819 rows=1 loops=1) >Output: id1, id2, id3, value >Index Cond: (test.id1 = 1) -- Change the output. Show only > the bound quals. >Index Filter: (test.id3 = 101) -- New. Output quals which are > not used as the bound quals I think this is too easy to confuse with the pre-existing 'Filter' condition, which you'll find on indexes with INCLUDE-d columns or filters on non-index columns. Furthermore, I think this is probably not helpful (maybe even harmful) for index types like GIN and BRIN, where index searchkey order is mostly irrelevant to the index shape and performance. Finally, does this change the index AM API? Does this add another scankey argument to ->amrescan? >Rows Removed by Index Filter: 49-- New. Output when ANALYZE option > is specified Separate from the changes to Index Cond/Index Filter output changes I think this can be useful output, though I'd probably let the AM specify what kind of filter data to display. E.g. BRIN may well want to display how many ranges matched the predicate, vs how many ranges were unsummarized and thus returned; two conditions which aren't as easy to differentiate but can be important debugging query performance. > Planning Time: 0.354 ms > Execution Time: 279.908 ms > (7 rows) Was this a test against the same dataset as the one you'd posted your measurements of your first patchset with? The execution time seems to have slown down quite significantly, so if the testset is the same then this doesn't bode well for your patchset. Kind regards, Matthias van de Meent
Re: Meson far from ready on Windows
Hi, On 2024-06-24 09:54:51 +0100, Dave Page wrote: > > The old system was a major bottleneck. For one, there was no way to run all > > tests. And even the tests that one could run, would run serially, leading > > to > > exceedingly long tests times. While that could partially be addressed by > > having both buildsystems in parallel, the old one would frequently break > > in a > > way that one couldn't reproduce on other systems. And resource wise it > > wasn't > > feasible to test both old and new system for cfbot/CI. > > > > Hmm, I've found that running the tests under Meson takes notably longer > than the old system - maybe 5 - 10x longer ("meson test" vs. "vcregress > check"). I haven't yet put any effort into figuring out a cause for that > yet. That's because vcregress check only runs a small portion of the tests (just the main pg_regress checks, no tap tests, no extension). Which is pretty much my point. To run a decent, but still far from complete, portion of the tests you needed to do this: https://github.com/postgres/postgres/blob/REL_15_STABLE/.cirrus.tasks.yml#L402-L440 If you want to run just the main regression tests with meson, you can: meson test --suite setup --suite regress To see the list of all tests meson test --list Greetings, Andres Freund
Re: strange context message in spi.c?
Hi! Looks good to me! Best regards, Stepan Neretin.
Re: Improve EXPLAIN output for multicolumn B-Tree Index
+1 for the idea. On Mon, 24 Jun 2024 at 11:11, Matthias van de Meent wrote: > I think this is too easy to confuse with the pre-existing 'Filter' > condition, which you'll find on indexes with INCLUDE-d columns or > filters on non-index columns. Why not combine them? And both call them Filter? In a sense this filtering acts very similar to INCLUDE based filtering (for btrees at least). Although I might be wrong about that, because when I try to confirm the same perf using the following script I do get quite different timings (maybe you have an idea what's going on here). But even if it does mean something slightly different perf wise, I think using Filter for both is unlikely to confuse anyone. Since, while allowed, it seems extremely unlikely in practice that someone will use the same column as part of the indexed columns and as part of the INCLUDE-d columns (why would you store the same info twice). CREATE TABLE test (id1 int, id2 int, id3 int, value varchar(32)); INSERT INTO test (SELECT i % 10, i % 1000, i, 'hello' FROM generate_series(1,100) s(i)); vacuum freeze test; CREATE INDEX test_idx_include ON test(id1, id2) INCLUDE (id3); ANALYZE test; EXPLAIN (VERBOSE, ANALYZE, BUFFERS) SELECT id1, id3 FROM test WHERE id1 = 1 AND id3 = 101; CREATE INDEX test_idx ON test(id1, id2, id3); ANALYZE test; EXPLAIN (VERBOSE, ANALYZE, BUFFERS) SELECT id1, id3 FROM test WHERE id1 = 1 AND id3 = 101; QUERY PLAN ─── Index Only Scan using test_idx_include on public.test (cost=0.42..3557.09 rows=1 width=8) (actual time=0.708..6.639 rows=1 loops=1) Output: id1, id3 Index Cond: (test.id1 = 1) Filter: (test.id3 = 101) Rows Removed by Filter: 9 Heap Fetches: 0 Buffers: shared hit=1 read=386 Query Identifier: 471139784017641093 Planning: Buffers: shared hit=8 read=1 Planning Time: 0.091 ms Execution Time: 6.656 ms (12 rows) Time: 7.139 ms QUERY PLAN ─ Index Only Scan using test_idx on public.test (cost=0.42..2591.77 rows=1 width=8) (actual time=0.238..2.110 rows=1 loops=1) Output: id1, id3 Index Cond: ((test.id1 = 1) AND (test.id3 = 101)) Heap Fetches: 0 Buffers: shared hit=1 read=386 Query Identifier: 471139784017641093 Planning: Buffers: shared hit=10 read=1 Planning Time: 0.129 ms Execution Time: 2.128 ms (10 rows) Time: 2.645 ms
Re: Support "Right Semi Join" plan shapes
Thank you for reviewing. On Mon, Jun 24, 2024 at 1:27 PM Li Japin wrote: > + /* > +* For now we do not support RIGHT_SEMI join in mergejoin or nestloop > +* join. > +*/ > + if (jointype == JOIN_RIGHT_SEMI) > + return; > + > > How about adding some reasons here? I've included a brief explanation in select_mergejoin_clauses. > + * this is a right-semi join, or this is a right/right-anti/full join and > + * there are nonmergejoinable join clauses. The executor's mergejoin > > Maybe we can put the right-semi join together with the right/right-anti/full > join. Is there any other significance by putting it separately? I don't think so. The logic is different: for right-semi join we will always set *mergejoin_allowed to false, while for right/right-anti/full join it is set to false only if there are nonmergejoinable join clauses. > Maybe the following comments also should be updated. Right? Correct. And there are a few more places where we need to mention JOIN_RIGHT_SEMI, such as in reduce_outer_joins_pass2 and in the comment for SpecialJoinInfo. I noticed that this patch changes the plan of a query in join.sql from a semi join to right semi join, compromising the original purpose of this query, which was to test the fix for neqjoinsel's behavior for semijoins (see commit 7ca25b7d). -- -- semijoin selectivity for <> -- explain (costs off) select * from int4_tbl i4, tenk1 a where exists(select * from tenk1 b where a.twothousand = b.twothousand and a.fivethous <> b.fivethous) and i4.f1 = a.tenthous; So I've changed this test case a bit so that it is still testing what it is supposed to test. In passing, I've also updated the commit message to clarify that this patch does not address the support of "Right Semi Join" for merge joins. Thanks Richard v6-0001-Support-Right-Semi-Join-plan-shapes.patch Description: Binary data
sql/json miscellaneous issue
hi. the following two queries should return the same result? SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb); SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb); I've tried a patch to implement it. (i raised the issue at https://www.postgresql.org/message-id/CACJufxFWiCnG3Q7f0m_GdrytPbv29A5OWngCDwKVjcftwzHbTA%40mail.gmail.com i think a new thread would be more appropriate). current json_value doc: "Note that scalar strings returned by json_value always have their quotes removed, equivalent to specifying OMIT QUOTES in json_query." i think there are two exceptions: when the returning data types are jsonb or json.
Injection point locking
InjectionPointRun() acquires InjectionPointLock, looks up the hash entry, and releases the lock: LWLockAcquire(InjectionPointLock, LW_SHARED); entry_by_name = (InjectionPointEntry *) hash_search(InjectionPointHash, name, HASH_FIND, &found); LWLockRelease(InjectionPointLock); Later, it reads fields from the entry it looked up: /* not found in local cache, so load and register */ snprintf(path, MAXPGPATH, "%s/%s%s", pkglib_path, entry_by_name->library, DLSUFFIX); Isn't that a straightforward race condition, if the injection point is detached in between? Another thing: I wanted to use injection points to inject an error early at backend startup, to write a test case for the scenarios that Jacob points out at https://www.postgresql.org/message-id/CAOYmi%2Bnwvu21mJ4DYKUa98HdfM_KZJi7B1MhyXtnsyOO-PB6Ww%40mail.gmail.com. But I can't do that, because InjectionPointRun() requires a PGPROC entry, because it uses an LWLock. That also makes it impossible to use injection points in the postmaster. Any chance we could allow injection points to be triggered without a PGPROC entry? Could we use a simple spinlock instead? With a fast path for the case that no injection points are attached or something? -- Heikki Linnakangas Neon (https://neon.tech)
Re: Buildfarm animal caiman showing a plperl test issue with newer Perl versions
On 2024-06-24 Mo 12:00 AM, Alexander Lakhin wrote: Hello hackers, As recent caiman failures ([1], [2], ...) show, plperl.sql is incompatible with Perl 5.40. (The last successful test runs took place when cayman had Perl 5.38.2 installed: [3].) FWIW, I've found an already-existing fix for the issue [4] and a note describing the change for Perl 5.39.10 [5]. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2024-06-24%2001%3A34%3A23 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2024-06-24%2000%3A15%3A16 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2024-05-02%2021%3A57%3A17 [4] https://git.alpinelinux.org/aports/tree/community/postgresql14/fix-test-plperl-5.8-pragma.patch?id=28aeb872811f59a7f646aa29ed7c9dc30e698e65 [5] https://metacpan.org/release/PEVANS/perl-5.39.10/changes#Selected-Bug-Fixes It's a very odd bug. I guess we should just backpatch the removal of that redundant version check in plc_perlboot.pl, probably all the way down to 9.2 since godwit builds and tests with plperl that far back, and some day in the not too distant future it will upgrade to perl 5.40. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Track the amount of time waiting due to cost_delay
Hi, On Sat, Jun 22, 2024 at 12:48:33PM +, Bertrand Drouvot wrote: > 1. vacuuming indexes time has been longer on master because with v2, the > leader > has been interrupted 342605 times while waiting, then making v2 "faster". > > 2. the leader being interrupted while waiting is also already happening on > master > due to the pgstat_progress_parallel_incr_param() calls in > parallel_vacuum_process_one_index() (that have been added in > 46ebdfe164). It has been the case "only" 36 times during my test case. > > I think that 2. is less of a concern but I think that 1. is something that > needs > to be addressed because the leader process is not honouring its cost delay > wait > time in a noticeable way (at least during my test case). > > I did not think of a proposal yet, just sharing my investigation as to why > v2 has been faster than master during the vacuuming indexes phase. I think that a reasonable approach is to make the reporting from the parallel workers to the leader less aggressive (means occur less frequently). Please find attached v3, that: - ensures that there is at least 1 second between 2 reports, per parallel worker, to the leader. - ensures that the reported delayed time is still correct (keep track of the delayed time between 2 reports). - does not add any extra pg_clock_gettime_ns() calls (as compare to v2). Remarks: 1. Having a time based only approach to throttle the reporting of the parallel workers sounds reasonable. I don't think that the number of parallel workers has to come into play as: 1.1) the more parallel workers is used, the less the impact of the leader on the vacuum index phase duration/workload is (because the repartition is done on more processes). 1.2) the less parallel workers is, the less the leader will be interrupted ( less parallel workers would report their delayed time). 2. The throttling is not based on the cost limit as that value is distributed proportionally among the parallel workers (so we're back to the previous point). 3. The throttling is not based on the actual cost delay value because the leader could be interrupted at the beginning, the midle or whatever part of the wait and we are more interested about the frequency of the interrupts. 3. A 1 second reporting "throttling" looks a reasonable threshold as: 3.1 the idea is to have a significant impact when the leader could have been interrupted say hundred/thousand times per second. 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum multiple times per second (so a one second reporting granularity seems ok). With this approach in place, v3 attached applied, during my test case: - the leader has been interrupted about 2500 times (instead of about 345000 times with v2) - the vacuum index phase duration is very close to the master one (it has been 4 seconds faster (over a 8 minutes 40 seconds duration time), instead of 3 minutes faster with v2). Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 99f417c0bcd7c29e126fdccdd6214ea37db67379 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Mon, 24 Jun 2024 08:43:26 + Subject: [PATCH v3] Report the total amount of time that vacuum has been delayed due to cost delay This commit adds one column: time_delayed to the pg_stat_progress_vacuum system view to show the total amount of time in milliseconds that vacuum has been delayed. This uses the new parallel message type for progress reporting added by f1889729dd. 
In case of parallel worker, to avoid the leader to be interrupted too frequently (while it might be sleeping for cost delay), the report is done only if the last report has been done more than 1 second ago. Having a time based only approach to throttle the reporting of the parallel workers sounds reasonable. Indeed when deciding about the throttling: 1. The number of parallel workers should not come into play: 1.1) the more parallel workers is used, the less the impact of the leader on the vacuum index phase duration/workload is (because the repartition is done on more processes). 1.2) the less parallel workers is, the less the leader will be interrupted ( less parallel workers would report their delayed time). 2. The cost limit should not come into play as that value is distributed proportionally among the parallel workers (so we're back to the previous point). 3. The cost delay does not come into play as the leader could be interrupted at the beginning, the midle or whatever part of the wait and we are more interested about the frequency of the interrupts. 3. A 1 second reporting "throttling" looks a reasonable threshold as: 3.1 the idea is to have a significant impact when the leader could have been interrupted say hundred/thousand times per second. 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum multiple times per second (so
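For context, the kind of once-per-second sampling that point 3.2 refers to might look like this (hypothetical query: time_delayed is the column proposed by this patch and does not exist in released PostgreSQL):

SELECT p.pid, p.relid::regclass AS relation, p.phase, p.time_delayed
FROM pg_stat_progress_vacuum AS p;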
Re: long-standing data loss bug in initial sync of logical replication
On Sun, Nov 19, 2023 at 7:48 AM Andres Freund wrote: > > On 2023-11-19 02:15:33 +0100, Tomas Vondra wrote: > > > > If understand correctly, with the current code (which only gets > > ShareUpdateExclusiveLock), we may end up in a situation like this > > (sessions A and B): > > > > A: starts "ALTER PUBLICATION p ADD TABLE t" and gets the SUE lock > > A: writes the invalidation message(s) into WAL > > B: inserts into table "t" > > B: commit > > A: commit > > I don't think this the problematic sequence - at least it's not what I had > reproed in > https://postgr.es/m/20231118025445.crhaeeuvoe2g5dv6%40awork3.anarazel.de > > Adding line numbers: > > 1) S1: CREATE TABLE d(data text not null); > 2) S1: INSERT INTO d VALUES('d1'); > 3) S2: BEGIN; INSERT INTO d VALUES('d2'); > 4) S1: ALTER PUBLICATION pb ADD TABLE d; > 5) S2: COMMIT > 6) S2: INSERT INTO d VALUES('d3'); > 7) S1: INSERT INTO d VALUES('d4'); > 8) RL: > > The problem with the sequence is that the insert from 3) is decoded *after* 4) > and that to decode the insert (which happened before the ALTER) the catalog > snapshot and cache state is from *before* the ALTER TABLE. Because the > transaction started in 3) doesn't actually modify any catalogs, no > invalidations are executed after decoding it. The result is that the cache > looks like it did at 3), not like after 4). Undesirable timetravel... > > It's worth noting that here the cache state is briefly correct, after 4), it's > just that after 5) it stays the old state. > > If 4) instead uses a SRE lock, then S1 will be blocked until S2 commits, and > everything is fine. > I agree, your analysis looks right to me. > > > > > I'm not sure there are any cases where using SRE instead of AE would cause > > > problems for logical decoding, but it seems very hard to prove. I'd be > > > very > > > surprised if just using SRE would not lead to corrupted cache contents in > > > some > > > situations. The cases where a lower lock level is ok are ones where we > > > just > > > don't care that the cache is coherent in that moment. > > > Are you saying it might break cases that are not corrupted now? How > > could obtaining a stronger lock have such effect? > > No, I mean that I don't know if using SRE instead of AE would have negative > consequences for logical decoding. I.e. whether, from a logical decoding POV, > it'd suffice to increase the lock level to just SRE instead of AE. > > Since I don't see how it'd be correct otherwise, it's kind of a moot question. > We lost track of this thread and the bug is still open. IIUC, the conclusion is to use SRE in OpenTableList() to fix the reported issue. Andres, Tomas, please let me know if my understanding is wrong, otherwise, let's proceed and fix this issue. -- With Regards, Amit Kapila.
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Mon, 24 Jun 2024 at 11:58, Jelte Fennema-Nio wrote: > > +1 for the idea. > > On Mon, 24 Jun 2024 at 11:11, Matthias van de Meent > wrote: > > I think this is too easy to confuse with the pre-existing 'Filter' > > condition, which you'll find on indexes with INCLUDE-d columns or > > filters on non-index columns. > > Why not combine them? And both call them Filter? In a sense this > filtering acts very similar to INCLUDE based filtering (for btrees at > least). It does not really behave similar: index scan keys (such as the id3=101 scankey) don't require visibility checks in the btree code, while the Filter condition _does_ require a visibility check, and delegates the check to the table AM if the scan isn't Index-Only, or if the VM didn't show all-visible during the check. Furthermore, the index could use the scankey to improve the number of keys to scan using "skip scans"; by realising during a forward scan that if you've reached tuple (1, 2, 3) and looking for (1, _, 1) you can skip forward to (1, 3, _), rather than having to go through tuples (1, 2, 4), (1, 2, 5), ... (1, 2, n). This is not possible for INCLUDE-d columns, because their datatypes and structure are opaque to the index AM; the AM cannot assume anything about or do anything with those values. > Although I might be wrong about that, because when I try to > confirm the same perf using the following script I do get quite > different timings (maybe you have an idea what's going on here). But > even if it does mean something slightly different perf wise, I think > using Filter for both is unlikely to confuse anyone. I don't want A to to be the plan, while showing B' to the user, as the performance picture for the two may be completely different. And, as I mentioned upthread, the differences between AMs in the (lack of) meaning in index column order also makes it quite wrong to generally separate prefixes equalities from the rest of the keys. > Since, while > allowed, it seems extremely unlikely in practice that someone will use > the same column as part of the indexed columns and as part of the > INCLUDE-d columns (why would you store the same info twice). Yeah, people don't generally include the same index column more than once in the same index. > CREATE INDEX test_idx_include ON test(id1, id2) INCLUDE (id3); > CREATE INDEX test_idx ON test(id1, id2, id3); > > QUERY PLAN > ─── > Index Only Scan using test_idx_include on public.test [...] > Time: 7.139 ms > QUERY PLAN > ─ > Index Only Scan using test_idx on public.test (cost=0.42..2591.77 [...] > Time: 2.645 ms As you can see, there's a huge difference in performance. Putting both non-bound and "normal" filter clauses in the same Filter clause will make it more difficult to explain performance issues based on only the explain output. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: sql/json miscellaneous issue
On Mon, Jun 24, 2024 at 5:05 PM jian he wrote: > hi. > the following two queries should return the same result? > > SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb); > SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb); > > I've tried a patch to implement it. > (i raised the issue at > > https://www.postgresql.org/message-id/CACJufxFWiCnG3Q7f0m_GdrytPbv29A5OWngCDwKVjcftwzHbTA%40mail.gmail.com > i think a new thread would be more appropriate). > > > > current json_value doc: > "Note that scalar strings returned by json_value always have their > quotes removed, equivalent to specifying OMIT QUOTES in json_query." > > i think there are two exceptions: when the returning data types are > jsonb or json. > > > Hi! I also noticed a very strange difference in behavior in these two queries, it seems to me that although it returns a string by default, for the boolean operator it is necessary to return true or false SELECT * FROM JSON_value (jsonb '1', '$ == "1"' returning jsonb); json_value (1 row) SELECT * FROM JSON_value (jsonb 'null', '$ == "1"' returning jsonb); json_value false (1 row) Best regards, Stepan Neretin.
Re: sql/json miscellaneous issue
Hi, On Mon, Jun 24, 2024 at 8:02 PM Stepan Neretin wrote: > Hi! > > I also noticed a very strange difference in behavior in these two queries, it > seems to me that although it returns a string by default, for the boolean > operator it is necessary to return true or false > SELECT * FROM JSON_value (jsonb '1', '$ == "1"' returning jsonb); > json_value > > > (1 row) > > SELECT * FROM JSON_value (jsonb 'null', '$ == "1"' returning jsonb); > json_value > > false > (1 row) Hmm, that looks sane to me when comparing the above two queries with their jsonb_path_query() equivalents: select jsonb_path_query(jsonb '1', '$ == "1"'); jsonb_path_query -- null (1 row) select jsonb_path_query(jsonb 'null', '$ == "1"'); jsonb_path_query -- false (1 row) -- Thanks, Amit Langote
Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)
Em dom., 23 de jun. de 2024 às 23:56, Richard Guo escreveu: > On Mon, Jun 24, 2024 at 7:51 AM Ranier Vilela wrote: > > In src/include/access/xlogbackup.h, the field *name* > > has one byte extra to store null-termination. > > > > But, in the function *do_pg_backup_start*, > > I think that is a mistake in the line (8736): > > > > memcpy(state->name, backupidstr, strlen(backupidstr)); > > > > memcpy with strlen does not copy the whole string. > > strlen returns the exact length of the string, without > > the null-termination. > > I noticed that the two callers of do_pg_backup_start both allocate > BackupState with palloc0. Can we rely on this to ensure that the > BackupState.name is initialized with null-termination? > I do not think so. It seems to me the best solution is to use Michael's suggestion, strlcpy + sizeof. Currently we have this: memcpy(state->name, "longlongpathexample1", strlen("longlongpathexample1")); printf("%s\n", state->name); longlongpathexample1 Next random call: memcpy(state->name, "longpathexample2", strlen("longpathexample2")); printf("%s\n", state->name); longpathexample2ple1 It's not a good idea to use memcpy with strlen. best regards, Ranier Vilela
Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)
Em seg., 24 de jun. de 2024 às 00:27, Yugo NAGATA escreveu: > On Sun, 23 Jun 2024 22:34:03 -0300 > Ranier Vilela wrote: > > > Em dom., 23 de jun. de 2024 às 22:14, Ranier Vilela > > > escreveu: > > > > > Em dom., 23 de jun. de 2024 às 22:05, Ranier Vilela < > ranier...@gmail.com> > > > escreveu: > > > > > >> Em dom., 23 de jun. de 2024 às 21:54, Michael Paquier < > > >> mich...@paquier.xyz> escreveu: > > >> > > >>> On Sun, Jun 23, 2024 at 09:34:45PM -0300, Ranier Vilela wrote: > > >>> > It's not critical code, so I think it's ok to use strlen, even > because > > >>> the > > >>> > result of strlen will already be available using modern compilers. > > >>> > > > >>> > So, I think it's ok to use memcpy with strlen + 1. > > >>> > > >>> It seems to me that there is a pretty good argument to just use > > >>> strlcpy() for the same reason as the one you cite: this is not a > > >>> performance-critical code, and that's just safer. > > >>> > > >> Yeah, I'm fine with strlcpy. I'm not against it. > > >> > > > Perhaps, like the v2? > > > > > > Either v1 or v2, to me, looks good. > > > > > Thinking about, does not make sense the field size MAXPGPATH + 1. > > all other similar fields are just MAXPGPATH. > > > > If we copy MAXPGPATH + 1, it will also be wrong. > > So it is necessary to adjust logbackup.h as well. > > I am not sure whether we need to change the size of the field, > but if change it, I wonder it is better to modify the following > message from MAXPGPATH to MAXPGPATH -1. > > errmsg("backup label too long (max %d > bytes)", > MAXPGPATH))); > Or perhaps, is it better to show the too long label? errmsg("backup label too long (%s)", backupidstr))); best regards, Ranier Vilela > > > > > So, I think that v3 is ok to fix. > > > > best regards, > > Ranier Vilela > > > > > > > > best regards, > > > Ranier Vilela > > > > > >> > > > -- > Yugo NAGATA >
Re: Conflict detection and logging in logical replication
On Mon, Jun 24, 2024 at 7:39 AM Zhijie Hou (Fujitsu) wrote: > > When testing the patch, I noticed a bug that when reporting the conflict > after calling ExecInsertIndexTuples(), we might find the tuple that we > just inserted and report it.(we should only report conflict if there are > other conflict tuples which are not inserted by us) Here is a new patch > which fixed this and fixed a compile warning reported by CFbot. > Thanks for the patch. Few comments: 1) Few typos: Commit msg of patch001: iolates--> violates execIndexing.c: ingored --> ignored 2) Commit msg of stats patch: "The commit adds columns in view pg_stat_subscription_workers to shows" --"pg_stat_subscription_workers" --> "pg_stat_subscription_stats" 3) I feel, chapter '31.5. Conflicts' in docs should also mention about detection or point to the page where it is already mentioned. thanks Shveta
Re: sql/json miscellaneous issue
Hi, On Mon, Jun 24, 2024 at 7:04 PM jian he wrote: > > hi. > the following two queries should return the same result? > > SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb); > SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb); I get this with HEAD: SELECT * FROM JSON_query (jsonb 'null', '$' returning jsonb); json_query null (1 row) Time: 734.587 ms SELECT * FROM JSON_value (jsonb 'null', '$' returning jsonb); json_value (1 row) Much like: SELECT JSON_QUERY('{"key": null}', '$.key'); json_query null (1 row) Time: 2.975 ms SELECT JSON_VALUE('{"key": null}', '$.key'); json_value (1 row) Which makes sense to me, because JSON_QUERY() is supposed to return a JSON null in both cases and JSON_VALUE() is supposed to return a SQL NULL for a JSON null. -- Thanks, Amit Langote
Re: speed up a logical replica setup
On Sun, Jun 23, 2024 at 11:52 AM Noah Misch wrote: > > > +static void > > +create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo) > > +{ > > > + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES", > > + ipubname_esc); > > This tool's documentation says it "guarantees that no transaction will be > lost." I tried to determine whether achieving that will require something > like the fix from > https://postgr.es/m/flat/de52b282-1166-1180-45a2-8d8917ca7...@enterprisedb.com. > (Not exactly the fix from that thread, since that thread has not discussed the > FOR ALL TABLES version of its race condition.) I don't know. On the one > hand, pg_createsubscriber benefits from creating a logical slot after creating > the publication. That snapbuild.c process will wait for running XIDs. On the > other hand, an INSERT/UPDATE/DELETE acquires its RowExclusiveLock and builds > its relcache entry before assigning an XID, so perhaps the snapbuild.c process > isn't enough to prevent that thread's race condition. What do you think? > I am not able to imagine how the race condition discussed in the thread you quoted can impact this patch. The problem discussed is mainly the interaction when we are processing the changes in logical decoding w.r.t concurrent DDL (Alter Publication ... Add Table). The problem happens because we use the old cache state. I am missing your point about the race condition mentioned in the thread you quoted with snapbuild.c. Can you please elaborate a bit more? -- With Regards, Amit Kapila.
Re: Inconsistent Parsing of Offsets with Seconds
On Jun 22, 2024, at 14:10, David E. Wheeler wrote: > I believe the former issue is caused by the latter: The jsonpath > implementation uses the formatting strings to parse the timestamps[1], and > since there is no formatting to support offsets with seconds, it doesn’t work > at all in JSON timestamp parsing. > > [1]: > https://github.com/postgres/postgres/blob/70a845c/src/backend/utils/adt/jsonpath_exec.c#L2420-L2442 A side-effect of this implementation of date/time parsing using the to_char templates is that only time zone offsets and abbreviations are supported. I find the behavior a little surprising TBH: david=# select to_timestamp('2024-06-03 12:35:00America/New_York', '-MM-DD HH24:MI:SSTZ'); ERROR: invalid value "America/New_York" for "TZ" DETAIL: Time zone abbreviation is not recognized. Unless the SQL standard only supports offsets and abbreviations, I wonder if we’d be better off updating the above parsing code to also try the various date/time input functions, as well as the custom formats that *are* defined by the standard. Best,
Re: Changing default -march landscape
Hi, sorry for the delayed reply, I suck at prioritizing things. Re: Thomas Munro > OK let me CC Christoph and ask the question this way: hypothetically, > if RHEL users' PostgreSQL packages became automatically faster than > Debian users' packages because of default -march choice in the > toolchain, what would the Debian project think about that, and what > should we do about it? The options discussed so far were: > > 1. Consider a community apt repo that is identical to the one except > targeting the higher microarch levels, that users can point a machine > to if they want to. There are sub-variations of that: Don't use -march in Debian for strict baseline compatiblity, but use -march=something in apt.postgresql.org; bump to -march=x86-64-v2 for the server build (but not libpq and psql) saying that PG servers need better hardware; ... I'd rather want to avoid adding yet another axis to the matrix we target on apt.postgresql.org, it's already complex enough. So ideally there should be only one server-build. Or if we decide it's worth to have several, extension builds should still be compatible with either. > 2. Various ideas for shipping multiple versions of the code at a > higher granularity than we're doing to day (a callback for a single > instruction! the opposite extreme being the whole executable + > libraries), probably using some of techniques mentioned at > https://wiki.debian.org/InstructionSelection. I don't have enough experience with that to say anything about the trade-offs, or if the online instruction selectors themselves add too much overhead. Not to mention that testing things against all instruction variants is probably next to impossible in practice. > I would guess that 1 is about 100x easier but I haven't looked into it. Are there any numbers about what kind of speedup we could expect? If the online selection isn't feasible/worthwhile, I'd be willing to bump the baseline for the server packages. There are already packages doing that, and there's even infrastructure in the "isa-support" package that lets packages declare a dependency on CPU features. Or Debian might just bump the baseline. PostgreSQL asking for it might just be the reason we wanted to hear to make it happen. Christoph
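As a concrete illustration of option 1, the rebuild itself could be as small as overriding the compiler flags for an otherwise unchanged source package (a sketch, not an existing recipe; x86-64-v2 is just the level mentioned later in the thread):

# On a Debian build host, append a raised baseline to the default flags
# and rebuild the unmodified PostgreSQL source package.
export DEB_CFLAGS_APPEND='-march=x86-64-v2'
export DEB_CXXFLAGS_APPEND='-march=x86-64-v2'
dpkg-buildpackage -us -uc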
Re: Meson far from ready on Windows
On Sat, 22 Jun 2024 at 17:35, Andres Freund wrote: > Hi, > > On 2024-06-21 15:36:56 +0100, Dave Page wrote: > > For giggles, I took a crack at doing that, manually creating .pc files > for > > everything I've been working with so far. > > Cool! > > > > It seems to work as expected, except that unlike everything else libintl > is > > detected entirely based on whether the header and library can be found. I > > had to pass extra lib and include dirs: > > Yea, right now libintl isn't using dependency detection because I didn't > see > any platform where it's distributed with a .pc for or such. It'd be just a > line or two to make it use one... > I think it should, for consistency if nothing else - especially if we're adding our own pc/cmake files to prebuilt dependencies. -- Dave Page pgAdmin: https://www.pgadmin.org PostgreSQL: https://www.postgresql.org EDB: https://www.enterprisedb.com
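For reference, a hand-written pkg-config file for libintl of the kind described above might look like this (paths and version are purely illustrative):

prefix=C:/build64/gettext
libdir=${prefix}/lib
includedir=${prefix}/include

Name: libintl
Description: GNU gettext runtime library
Version: 0.22
Libs: -L${libdir} -lintl
Cflags: -I${includedir}

With PKG_CONFIG_PATH pointing at the directory containing it, a dependency('intl') lookup in meson.build could then behave like the other dependencies, which seems to be the "line or two" of work Andres refers to.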
Re: Changing default -march landscape
Re: To Thomas Munro > Or Debian might just bump the baseline. PostgreSQL asking for it might > just be the reason we wanted to hear to make it happen. Which level would PostgreSQL specifically want? x86-64-v2 or even x86-64-v3? Christoph
[PATCH] Add ACL (Access Control List) acronym
Hello hackers, This patch is based on a suggestion from a separate thread [1]: On Mon, Jun 24, 2024, at 01:46, Michael Paquier wrote: > Rather unrelated to this patch, still this patch makes the situation > more complicated in the docs, but wouldn't it be better to add ACL as > a term in acronyms.sql, and reuse it here? It would be a doc-only > patch that applies on top of the rest (could be on a new thread of its > own), with some markups added where needed. [1] https://postgr.es/m/Zniz1n7qa3_i4iac%40paquier.xyz /Joel 0001-Add-ACL-Access-Control-List-acronym.patch Description: Binary data
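The attached patch is not reproduced inline, but an entry in doc/src/sgml/acronyms.sgml generally has the following shape (this is a sketch of the expected markup, not the patch text):

 <varlistentry>
  <term><acronym>ACL</acronym></term>
  <listitem>
   <para>
    <ulink url="https://en.wikipedia.org/wiki/Access_Control_List">Access
    Control List</ulink>
   </para>
  </listitem>
 </varlistentry>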
Re: Conflict detection and logging in logical replication
On Mon, Jun 24, 2024 at 7:39 AM Zhijie Hou (Fujitsu) wrote: > > When testing the patch, I noticed a bug that when reporting the conflict > after calling ExecInsertIndexTuples(), we might find the tuple that we > just inserted and report it.(we should only report conflict if there are > other conflict tuples which are not inserted by us) Here is a new patch > which fixed this and fixed a compile warning reported by CFbot. > Thank you for the patch! A review comment: The patch does not detect 'update_differ' conflicts when the Publisher has a non-partitioned table and the Subscriber has a partitioned version. Here’s a simple failing test case: Pub: create table tab (a int primary key, b int not null, c varchar(5)); Sub: create table tab (a int not null, b int not null, c varchar(5)) partition by range (b); alter table tab add constraint tab_pk primary key (a, b); create table tab_1 partition of tab for values from (minvalue) to (100); create table tab_2 partition of tab for values from (100) to (maxvalue); With the above setup, in case the Subscriber table has a tuple with its own origin, the incoming remote update from the Publisher fails to detect the 'update_differ' conflict. -- Thanks, Nisha
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Mon, 24 Jun 2024 at 13:02, Matthias van de Meent wrote: > It does not really behave similar: index scan keys (such as the > id3=101 scankey) don't require visibility checks in the btree code, > while the Filter condition _does_ require a visibility check, and > delegates the check to the table AM if the scan isn't Index-Only, or > if the VM didn't show all-visible during the check. Any chance you could point me in the right direction for the code/docs/comment about this? I'd like to learn a bit more about why that is the case, because I didn't realize visibility checks worked differently for index scan keys and Filter keys. > Furthermore, the index could use the scankey to improve the number of > keys to scan using "skip scans"; by realising during a forward scan > that if you've reached tuple (1, 2, 3) and looking for (1, _, 1) you > can skip forward to (1, 3, _), rather than having to go through tuples > (1, 2, 4), (1, 2, 5), ... (1, 2, n). This is not possible for > INCLUDE-d columns, because their datatypes and structure are opaque to > the index AM; the AM cannot assume anything about or do anything with > those values. Does Postgres actually support this currently? I thought skip scans were not available (yet). > I don't want A to to be the plan, while showing B' to the user, as the > performance picture for the two may be completely different. And, as I > mentioned upthread, the differences between AMs in the (lack of) > meaning in index column order also makes it quite wrong to generally > separate prefixes equalities from the rest of the keys. Yeah, that makes sense. These specific explain lines probably only/mostly make sense for btree. So yeah we'd want the index AM to be able to add some stuff to the explain plan. > As you can see, there's a huge difference in performance. Putting both > non-bound and "normal" filter clauses in the same Filter clause will > make it more difficult to explain performance issues based on only the > explain output. Fair enough, that's of course the main point of this patch in the first place: being able to better interpret the explain plan when you don't have access to the schema. Still I think Filter is the correct keyword for both, so how about we make it less confusing by making the current "Filter" more specific by calling it something like "Non-key Filter" or "INCLUDE Filter" and then call the other something like "Index Filter" or "Secondary Bound Filter".
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Mon, Jun 24, 2024 at 8:08 AM wrote: > > I am unable to decide whether reporting the bound quals is just enough > to decide the efficiency of index without knowing the difference in the > number of index tuples selectivity and heap tuple selectivity. The > difference seems to be a better indicator of index efficiency whereas the > bound quals will help debug the in-efficiency, if any. > > Also, do we want to report bound quals even if they are the same as > index conditions or just when they are different? > > Thank you for your comment. After receiving your comment, I thought it > would be better to also report information that would make the difference > in selectivity understandable. One idea I had is to output the number of > index tuples inefficiently extracted, like “Rows Removed by Filter”. Users > can check the selectivity and efficiency by looking at the number. > > Also, I thought it would be better to change the way bound quals are > reported to align with the "Filter". I think it would be better to modify > it so that it does not output when the bound quals are the same as the > index conditions. > > In my local PoC patch, I have modified the output as follows, what do you > think? > > =# EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM test WHERE id1 = 1 AND id2 = > 101; >QUERY PLAN > > > - > Index Scan using test_idx on ikedamsh.test (cost=0.42..8.45 rows=1 > width=18) (actual time=0.082..0.086 rows=1 loops=1) >Output: id1, id2, id3, value >Index Cond: ((test.id1 = 1) AND (test.id2 = 101)) -- If it’s > efficient, the output won’t change. > Planning Time: 5.088 ms > Execution Time: 0.162 ms > (5 rows) > This looks fine. We may highlight in the documentation that lack of Index bound quals in EXPLAIN output indicate that they are same as Index Cond:. Other idea is to use Index Cond and bound quals as property name but that's too long. > > =# EXPLAIN (VERBOSE, ANALYZE) SELECT * FROM test WHERE id1 = 1 AND id3 = > 101; > QUERY PLAN > > > --- > Index Scan using test_idx on ikedamsh.test (cost=0.42..12630.10 rows=1 > width=18) (actual time=0.175..279.819 rows=1 loops=1) >Output: id1, id2, id3, value >Index Cond: (test.id1 = 1) -- Change the output. Show > only the bound quals. >Index Filter: (test.id3 = 101) -- New. Output quals which > are not used as the bound quals >Rows Removed by Index Filter: 49-- New. Output when ANALYZE > option is specified > Planning Time: 0.354 ms > Execution Time: 279.908 ms > (7 rows) > I don't think we want to split these clauses. Index Cond should indicate the conditions applied to the index scan. Bound quals should be listed separately even though they will have an intersection with Index Cond. I am not sure whether Index Filter is the right name, maybe Index Bound Cond: But I don't know this area enough to make a final call. About Rows Removed by Index Filter: it's good to provide a number when ANALYZE is specified, but it will be also better to specify what was estimated. We do that for (cost snd rows etc.) but doing that somewhere in the plan output may not have a precedent. I think we should try that and see what others think. -- Best Wishes, Ashutosh Bapat
Re: replace strtok()
On 24.06.24 02:34, Michael Paquier wrote: On Sat, Jun 22, 2024 at 11:48:21AM -0400, Tom Lane wrote: Peter Eisentraut writes: On 18.06.24 13:43, Ranier Vilela wrote: I found another implementation of strsep, it seems lighter to me. I will attach it for consideration, however, I have not done any testing. Yeah, surely there are many possible implementations. I'm thinking, since we already took other str*() functions from OpenBSD, it makes sense to do this here as well, so we have only one source to deal with. Why not use strpbrk? That's equally thread-safe, it's been there since C89, and it doesn't have the problem that you can't find out which of the delimiter characters was found. Yeah, strpbrk() has been used in the tree as far as 2003 without any port/ implementation. The existing uses of strpbrk() are really just checking whether some characters exist in a string, more like an enhanced strchr(). I don't see any uses for tokenizing a string like strtok() or strsep() would do. I think that would look quite cumbersome. So I think a simpler and more convenient abstraction like strsep() would still be worthwhile.
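For comparison, a minimal tokenizing loop built on strsep() looks like the following (a standalone sketch, assuming a platform or the proposed src/port fallback that provides strsep(); it is not code from the patch):

#include <stdio.h>
#include <string.h>

static void
print_tokens(char *list)
{
    char   *token;

    /* strsep() consumes the string in place, handing back one token per call. */
    while ((token = strsep(&list, ", ")) != NULL)
    {
        if (*token == '\0')
            continue;           /* adjacent delimiters yield empty tokens */
        printf("token: %s\n", token);
    }
}

Doing the same with strpbrk() would mean locating each delimiter and splitting manually in the caller, which is the cumbersomeness being referred to here.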
Re: Pgoutput not capturing the generated columns
On Fri, 21 Jun 2024 at 12:51, Peter Smith wrote: > > Hi, Here are some review comments for patch v9-0003 > > == > Commit Message > > /fix/fixes/ Fixed > == > 1. > General. Is tablesync enough? > > I don't understand why is the patch only concerned about tablesync? > Does it make sense to prevent VIRTUAL column replication during > tablesync, if you aren't also going to prevent VIRTUAL columns from > normal logical replication (e.g. when copy_data = false)? Or is this > already handled somewhere? I checked the behaviour during incremental changes. I saw during decoding 'null' values are present for Virtual Generated Columns. I made the relevant changes to not support replication of Virtual generated columns. > ~~~ > > 2. > General. Missing test. > > Add some test cases to verify behaviour is different for STORED versus > VIRTUAL generated columns I have not added the tests as it would give an error in cfbot. I have added a TODO note for the same. This can be done once the VIRTUAL generated columns patch is committted. > == > src/sgml/ref/create_subscription.sgml > > NITPICK - consider rearranging as shown in my nitpicks diff > NITPICK - use sgml markup for STORED Fixed > == > src/backend/replication/logical/tablesync.c > > 3. > - if ((walrcv_server_version(LogRepWorkerWalRcvConn) >= 12 && > - walrcv_server_version(LogRepWorkerWalRcvConn) <= 16) || > - !MySubscription->includegencols) > + if (walrcv_server_version(LogRepWorkerWalRcvConn) < 17) > + { > + if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 12) > appendStringInfo(&cmd, " AND a.attgenerated = ''"); > + } > + else if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 17) > + { > + if(MySubscription->includegencols) > + appendStringInfo(&cmd, " AND a.attgenerated != 'v'"); > + else > + appendStringInfo(&cmd, " AND a.attgenerated = ''"); > + } > > IMO this logic is too tricky to remain uncommented -- please add some > comments. > Also, it seems somewhat complex. I think you can achieve the same more simply: > > SUGGESTION > > if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 12) > { > bool gencols_allowed = walrcv_server_version(LogRepWorkerWalRcvConn) >= > 17 > && MySubscription->includegencols; > if (gencols_allowed) > { > /* Replication of generated cols is supported, but not VIRTUAL cols. */ > appendStringInfo(&cmd, " AND a.attgenerated != 'v'"); > } > else > { > /* Replication of generated cols is not supported. */ > appendStringInfo(&cmd, " AND a.attgenerated = ''"); > } > } Fixed > == > > 99. > Please refer also to my attached nitpick diffs and apply those if you agree. Applied the changes. I have attached the updated patch v10 here in [1]. [1]: https://www.postgresql.org/message-id/CANhcyEUMCk6cCbw0vVZWo8FRd6ae9CmKG%3DgKP-9Q67jLn8HqtQ%40mail.gmail.com Thanks and Regards, Shlok Kyal
RE: Partial aggregates pushdown
Hi. Jelte, hackers. Thank you for your proposal and comments. > From: Jelte Fennema-Nio > Sent: Monday, June 24, 2024 6:09 PM > > 1. Generality > > I believe we should develop a method that can theoretically apply to any > aggregate function, even if we cannot implement it immediately. However, I > find > it exceptionally challenging to demonstrate that any internal transtype can be > universally converted to a native data type for aggregate functions that are > parallel-safe under the current parallel query feature. Specifically, proving > this > for user-defined aggregate functions presents a significant difficulty in my > view. > > On the other hand, I think that the usage of export and import functions can > theoretically apply to any aggregate functions. > > The only thing required when doing CREATE TYPE is having an INPUT and > OUTPUT function for the type, which (de)serialize the type to text format. As > far as I can tell by definition that requirement should be fine for any > aggregates > that we can do partial aggregation pushdown for. To clarify I'm not suggesting > we should change any of the internal representation of the type for the > current > internal aggregates. I'm suggesting we create new native types (i.e. do CREATE > TYPE) for those internal representations and then use the name of that type > instead of internal. I see. I maybe got your proposal. Refer to your proposal, for avg(int8), I create a new native type like state_int8_avg with the new typsend/typreceive functions and use them to transmit the state value, right? That might seem to be a more fundamental solution because I can avoid adding export/import functions of my proposal, which are the new components of aggregate function. I have never considered the solution. I appreciate your proposal. However, I still have the following two questions. 1. Not necessary components of new native types Refer to pg_type.dat, typinput and typoutput are required. I think that in your proposal they are not necessary, so waste. I think that it is not acceptable. How can I resolve the problem? 2. Many new native types I think that, basically, each aggregate function does need a new native type. For example, avg(int8), avg(numeric), and var_pop(int4) has the same transtype, PolyNumAggState. You said that it is enough to add only one native type like state_poly_num_agg for supporting them, right? But the combine functions of them might have quite different expectation on the data items of PolyNumAggState like the range of N(means count) and the true/false of calcSumX2 (the flag of calculating sum of squares). The final functions of them have the similar expectation. So, I think that, responded to your proposal, each of them need a native data type like state_int8_avg, state_numeric_avg, for safety. And, we do need a native type for an aggregate function whose transtype is not internal and not pseudo. For avg(int4), the current transtype is _int8. However, I do need a validation check on the number of the array And the positiveness of count(the first element of the array). Responded to your proposal, I do need a new native type like state_int4_avg. Consequently, I think that, responded to your proposal, finally each of aggregate functions need a new native type with typinput and typoutput. That seems need the same amount of codes and more catalog data, right? > > 2. Amount of codes. > > It could need more codes. > > I think it would be about the same as your proposal. 
Instead of serializing > to an > intermediary existing type you serialize to string straight away. I think it > would > probably be slightly less code actually, since you don't have to add code to > handle the new aggpartialexportfn and aggpartialimportfn columns. > > > 3. Concern about performance > > I'm concerned that altering the current internal data types could impact > performance. > > As explained above in my proposal all the aggregation code would remain > unchanged, only new native types will be added. Thus performance won't be > affected, because all aggregation code will be the same. The only thing that's > changed is that the internal type now has a name and an INPUT and OUTPUT > function. I got it. Thank you. > Not a full review but some initial notes: Thank you. I don't have time today, so I'll answer after tomorrow. Best regards, Yuki Fujii -- Yuki Fujii Information Technology R&D Center, Mitsubishi Electric Corporation
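For what it's worth, the shape of the CREATE TYPE route being debated above is roughly the following (all names are hypothetical, the I/O functions would still have to be written in C, and in core this would be catalog entries rather than extension-style SQL):

-- Give the avg(int8) transition state a named shell type, then text I/O
-- functions, then the full type definition.
CREATE TYPE state_int8_avg;

CREATE FUNCTION state_int8_avg_in(cstring) RETURNS state_int8_avg
    AS 'MODULE_PATHNAME', 'state_int8_avg_in' LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION state_int8_avg_out(state_int8_avg) RETURNS cstring
    AS 'MODULE_PATHNAME', 'state_int8_avg_out' LANGUAGE C IMMUTABLE STRICT;

CREATE TYPE state_int8_avg (
    INPUT = state_int8_avg_in,
    OUTPUT = state_int8_avg_out,
    INTERNALLENGTH = VARIABLE
);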
basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE
Hi, While doing some additional testing of (incremental) backups, I ran into a couple regular failures. After pulling my hair for a couple days, I realized the issue seems to affect regular backups, and incremental backups (which I've been trying to test) are likely innocent. I'm using a simple (and admittedly not very pretty) bash scripts that takes and verified backups, concurrently with this workload: 1) initialize a cluster 2) initialize pgbench in database 'db' 3) run short pgbench on 'db' 4) maybe do vacuum [full] on 'db' 5) drop a database 'db_copy' if it exists 6) create a database 'db_copy' by copying 'db' using one of the available strategies (file_copy, wal_log) 7) run short pgbench on 'db_copy' 8) maybe do vacuum [full] on 'db_copy' And concurrently with this, it takes a basebackup, starts a cluster on it (on a different port, ofc), and does various checks on that: a) verify checksums using pg_checksums (cluster has them enabled) b) run amcheck on tables/indexes on both databases c) SQL check (we expect all tables to be 'consistent' as if we did a PITR - in particular sum(balance) is expected to be the same value on all pgbench tables) on both databases I believe those are reasonable expectations - that we get a database with valid checksums, with non-broken tables/indexes, and that the database looks as a snapshot taken at a single instant. Unfortunately it doesn't take long for the tests to start failing with various strange symptoms on the db_copy database (I'm yet to see an issue on the 'db' database): i) amcheck fails with 'heap tuple lacks matching index tuple' ERROR: heap tuple (116195,22) from table "pgbench_accounts" lacks matching index tuple within index "pgbench_accounts_pkey" HINT: Retrying verification using the function bt_index_parent_check() might provide a more specific error. I've seen this with other tables/indexes too, e.g. system catalogs pg_statitics or toast tables, but 'accounts' is most common. ii) amcheck fails with 'could not open file' ERROR: could not open file "base/18121/18137": No such file or directory LINE 9: lateral verify_heapam(relation => c.oid, on_error_stop => f... ^ ERROR: could not open file "base/18121/18137": No such file or directory iii) failures in the SQL check, with different tables have different balance sums SQL check fails (db_copy) (account 156142 branches 136132 tellers 136132 history -42826) Sometimes this is preceded by amcheck issue, but not always. I guess this is not the behavior we expect :-( I've reproduced all of this on PG16 - I haven't tried with older releases, but I have no reason to assume pre-16 releases are not affected. With incremental backups I've observed a couple more symptoms, but those are most likely just fallout of this - not realizing the initial state is a bit wrong, and making it worse by applying the increments. The important observation is that this only happens if a database is created while the backup is running, and that it only happens with the FILE_COPY strategy - I've never seen this with WAL_LOG (which is the default since PG15). I don't recall any reports of similar issues from pre-15 releases, where FILE_COPY was the only available option - I'm not sure why is that. Either it didn't have this issue back then, or maybe people happen to not create databases concurrently with a backup very often. It's a race condition / timing issue, essentially. I have no ambition to investigate this part of the code much deeper, or invent a fix myself, at least not in foreseeable future. 
But it seems like something we probably should fix - subtly broken backups are not a great thing. I see there have been a couple threads proposing various improvements to FILE_COPY, that might make it more efficient/faster, namely using the filesystem cloning [1] or switching pg_upgrade to use it [2]. But having something that's (maybe) faster but not quite correct does not seem like a winning strategy to me ... Alternatively, if we don't have clear desire to fix it, maybe the right solution would be get rid of it? regards [1] https://www.postgresql.org/message-id/ca+hukglm+t+swbu-chemuxjcogbxshlgzutv5zcwy4qrcce...@mail.gmail.com [2] https://www.postgresql.org/message-id/Zl9ta3FtgdjizkJ5%40nathan -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company basebackup-test-scripts.tgz Description: application/compressed-tar
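For readers who do not want to open the tarball, the workload half of the test boils down to a loop like this (a condensed sketch; the backup half concurrently takes a base backup, starts a cluster on it, and runs the pg_checksums/amcheck/SQL checks described above):

#!/bin/bash
# Initialize, then churn; database names and sizes are illustrative.
createdb db
pgbench -i -s 50 db
while true; do
    pgbench -n -T 10 -c 4 db
    psql -q -c 'VACUUM' db
    dropdb --if-exists db_copy
    psql -q -c 'CREATE DATABASE db_copy TEMPLATE db STRATEGY FILE_COPY' postgres
    pgbench -n -T 10 -c 4 db_copy
    psql -q -c 'VACUUM' db_copy
done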
Re: long-standing data loss bug in initial sync of logical replication
On 6/24/24 12:54, Amit Kapila wrote: > ... >> I'm not sure there are any cases where using SRE instead of AE would cause problems for logical decoding, but it seems very hard to prove. I'd be very surprised if just using SRE would not lead to corrupted cache contents in some situations. The cases where a lower lock level is ok are ones where we just don't care that the cache is coherent in that moment. >> >>> Are you saying it might break cases that are not corrupted now? How >>> could obtaining a stronger lock have such effect? >> >> No, I mean that I don't know if using SRE instead of AE would have negative >> consequences for logical decoding. I.e. whether, from a logical decoding POV, >> it'd suffice to increase the lock level to just SRE instead of AE. >> >> Since I don't see how it'd be correct otherwise, it's kind of a moot >> question. >> > > We lost track of this thread and the bug is still open. IIUC, the > conclusion is to use SRE in OpenTableList() to fix the reported issue. > Andres, Tomas, please let me know if my understanding is wrong, > otherwise, let's proceed and fix this issue. > It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I don't think we 'lost track' of it, but it's true we haven't done much progress recently. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:10 AM Alena Rybakina wrote: > > We can fix this by always removing tuples considered dead before > VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby > has a transaction that sees that tuple as alive, because it will > simply wait to replay the removal until it would be correct to do so > or recovery conflict handling will cancel the transaction that sees > the tuple as alive and allow replay to continue. > > This is an interesting and difficult case) I noticed that when initializing > the cluster, in my opinion, we provide excessive freezing. Initialization > takes a long time, which can lead, for example, to longer test execution. I > got rid of this by adding the OldestMxact checkbox is not FirstMultiXactId, > and it works fine. > > if (prstate->cutoffs && > TransactionIdIsValid(prstate->cutoffs->OldestXmin) && > prstate->cutoffs->OldestMxact != FirstMultiXactId && > NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin)) > return HEAPTUPLE_DEAD; > > Can I keep it? This looks like an addition to the new criteria I added to heap_prune_satisfies_vacuum(). Is that what you are suggesting? If so, it looks like it would only return HEAPTUPLE_DEAD (and thus only remove) a subset of the tuples my original criteria would remove. When vacuum calculates OldestMxact as FirstMultiXactId, it would not remove those tuples deleted before OldestXmin. It seems like OldestMxact will equal FirstMultiXactID sometimes right after initdb and after transaction ID wraparound. I'm not sure I totally understand the criteria. One thing I find confusing about this is that this would actually remove less tuples than with my criteria -- which could lead to more freezing. When vacuum calculates OldestMxact == FirstMultiXactID, we would not remove tuples deleted before OldestXmin and thus return HEAPTUPLE_RECENTLY_DEAD for those tuples. Then we would consider freezing them. So, it seems like we would do more freezing by adding this criteria. Could you explain more about how the criteria you are suggesting works? Are you saying it does less freezing than master or less freezing than with my patch? > Attached is the suggested fix for master plus a repro. I wrote it as a > recovery suite TAP test, but I am _not_ proposing we add it to the > ongoing test suite. It is, amongst other things, definitely prone to > flaking. I also had to use loads of data to force two index vacuuming > passes now that we have TIDStore, so it is a slow test. -- snip -- > I have a modified version of this that repros the infinite loop on > 14-16 with substantially less data. See it here [2]. Also, the repro > attached to this mail won't work on 14 and 15 because of changes to > background_psql. > > I couldn't understand why the replica is necessary here. Now I am digging why > I got the similar behavior without replica when I have only one instance. I'm > still checking this in my test, but I believe this patch fixes the original > problem because the symptoms were the same. Did you get similar behavior on master or on back branches? Was the behavior you observed the infinite loop or the error during heap_prepare_freeze_tuple()? In my examples, the replica is needed because something has to move the horizon on the primary backwards. 
When a standby reconnects with an older oldest running transaction ID than any of the running transactions on the primary and the vacuuming backend recomputes its RecentXmin, the horizon may move backwards when compared to the horizon calculated at the beginning of the vacuum. Vacuum does not recompute cutoffs->OldestXmin during vacuuming a relation but it may recompute the values in the GlobalVisState it uses for pruning. We knew of only one other way that the horizon could move backwards which Matthias describes here [1]. However, this is thought to be its own concurrency-related bug in the commit-abort path that should be fixed -- as opposed to the standby reconnecting with an older oldest running transaction ID which can be expected. Do you know if you were seeing the effects of the scenario Matthias describes? - Melanie [1] https://www.postgresql.org/message-id/CAEze2WjMTh4KS0%3DQEQB-Jq%2BtDLPR%2B0%2BzVBMfVwSPK5A%3DWZa95Q%40mail.gmail.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas wrote: > > On 21/06/2024 03:02, Peter Geoghegan wrote: > > On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman > > wrote: > > > >> The repro forces a round of index vacuuming after the standby > >> reconnects and before pruning a dead tuple whose xmax is older than > >> OldestXmin. > >> > >> At the end of the round of index vacuuming, _bt_pendingfsm_finalize() > >> calls GetOldestNonRemovableTransactionId(), thereby updating the > >> backend's GlobalVisState and moving maybe_needed backwards. > > > > Right. I saw details exactly consistent with this when I used GDB > > against a production instance. > > > > I'm glad that you were able to come up with a repro that involves > > exactly the same basic elements, including index page deletion. > > Would it be possible to make it robust so that we could always run it > with "make check"? This seems like an important corner case to > regression test. I'd have to look into how to ensure I can stabilize some of the parts that seem prone to flaking. I can probably stabilize the vacuum bit with a query of pg_stat_activity making sure it is waiting to acquire the cleanup lock. I don't, however, see a good way around the large amount of data required to trigger more than one round of index vacuuming. I could generate the data more efficiently than I am doing here (generate_series() in the from clause). Perhaps with a copy? I know it is too slow now to go in an ongoing test, but I don't have an intuition around how fast it would have to be to be acceptable. Is there a set of additional tests that are slower that we don't always run? I didn't follow how the wraparound test ended up, but that seems like one that would have been slow. - Melanie
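As an example of the stabilization idea mentioned above, the test could poll pg_stat_activity until the VACUUM session is stuck waiting to acquire the cleanup lock (the wait-event name here is the buffer-pin wait; matching on the query text is an assumption of this sketch):

SELECT count(*) > 0
FROM pg_stat_activity
WHERE query LIKE 'VACUUM%'
  AND wait_event_type = 'BufferPin';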
Re: Injection point locking
Heikki Linnakangas writes: > ... I can't do that, because InjectionPointRun() requires a PGPROC > entry, because it uses an LWLock. That also makes it impossible to use > injection points in the postmaster. Any chance we could allow injection > points to be triggered without a PGPROC entry? Could we use a simple > spinlock instead? With a fast path for the case that no injection points > are attached or something? Even taking a spinlock in the postmaster is contrary to project policy. Maybe we could look the other way for debug-only code, but it seems like a pretty horrible precedent. Given your point that the existing locking is just a fig leaf anyway, maybe we could simply not have any? Then it's on the test author to be sure that the injection point won't be getting invoked when it's about to be removed. Or just rip out the removal capability, and say that once installed an injection point is there until postmaster shutdown (or till shared memory reinitialization, anyway). regards, tom lane
Re: scalability bottlenecks with (many) partitions (and more)
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra wrote: > For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The > LWLock table has 16 partitions by default - it's quite possible that on > machine with many cores and/or many partitions, we can easily hit this. > So I bumped this 4x to 64 partitions. I think this probably makes sense. I'm a little worried that we're just kicking the can down the road here where maybe we should be solving the problem in some more fundamental way, and I'm also a little worried that we might be reducing single-core performance. But it's probably fine. > What I ended up doing is having a hash table of 16-element arrays. There > are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we > have now. Each OID is mapped to exactly one of these parts as if in a > hash table, and in each of those 16-element parts we do exactly the same > thing we do now (linear search, removal, etc.). This works great, the > locality is great, etc. The one disadvantage is this makes PGPROC > larger, but I did a lot of benchmarks and I haven't seen any regression > that I could attribute to this. (More about this later.) I think this is a reasonable approach. Some comments: - FastPathLocalUseInitialized seems unnecessary to me; the contents of an uninitialized local variable are undefined, but an uninitialized global variable always starts out zeroed. - You need comments in various places, including here, where someone is certain to have questions about the algorithm and choice of constants: +#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND) When I originally coded up the fast-path locking stuff, I supposed that we couldn't make the number of slots too big because the algorithm requires a linear search of the whole array. But with this one trick (a partially-associative cache), there's no real reason that I can think of why you can't make the number of slots almost arbitrarily large. At some point you're going to use too much memory, and probably before that point you're going to make the cache big enough that it doesn't fit in the CPU cache of an individual core, at which point possibly it will stop working as well. But honestly ... I don't quite see why this approach couldn't be scaled quite far. Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value of 64 to say 65536, would that still perform well? I'm not saying we should do that, because that's probably a silly amount of memory to use for this, but the point is that when you start to have enough partitions that you run out of lock slots, performance is going to degrade, so you can imagine wanting to try to have enough lock groups to make that unlikely. Which leads me to wonder if there's any particular number of lock groups that is clearly "too many" or whether it's just about how much memory we want to use. -- Robert Haas EDB: http://www.enterprisedb.com
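To make the scheme concrete, the lookup path under the proposal is essentially the following (a sketch reusing the constants from the quoted macro; the actual field layout in PGPROC and the helper name are assumptions, not the patch):

#include "postgres.h"
#include "storage/proc.h"

#define FP_LOCK_GROUPS_PER_BACKEND  64
#define FP_LOCK_SLOTS_PER_GROUP     16
#define FAST_PATH_LOCK_REL_GROUP(rel) \
    (((uint64) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)

/* Return the fast-path slot caching "relid", or -1 if it is not cached. */
static int
FastPathFindSlot(PGPROC *proc, Oid relid)
{
    uint32      group = FAST_PATH_LOCK_REL_GROUP(relid);
    uint32      base = group * FP_LOCK_SLOTS_PER_GROUP;

    /* Same linear search as today, but confined to one 16-entry group. */
    for (uint32 i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
    {
        if (proc->fpRelId[base + i] == relid)
            return (int) (base + i);
    }
    return -1;
}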
Re: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE
On Mon, Jun 24, 2024 at 04:12:38PM +0200, Tomas Vondra wrote: > The important observation is that this only happens if a database is > created while the backup is running, and that it only happens with the > FILE_COPY strategy - I've never seen this with WAL_LOG (which is the > default since PG15). My first thought is that this sounds related to the large comment in CreateDatabaseUsingFileCopy(): /* * We force a checkpoint before committing. This effectively means that * committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to be * replayed (at least not in ordinary crash recovery; we still have to * make the XLOG entry for the benefit of PITR operations). This avoids * two nasty scenarios: * * #1: When PITR is off, we don't XLOG the contents of newly created * indexes; therefore the drop-and-recreate-whole-directory behavior of * DBASE_CREATE replay would lose such indexes. * * #2: Since we have to recopy the source database during DBASE_CREATE * replay, we run the risk of copying changes in it that were committed * after the original CREATE DATABASE command but before the system crash * that led to the replay. This is at least unexpected and at worst could * lead to inconsistencies, eg duplicate table names. * * (Both of these were real bugs in releases 8.0 through 8.0.3.) * * In PITR replay, the first of these isn't an issue, and the second is * only a risk if the CREATE DATABASE and subsequent template database * change both occur while a base backup is being taken. There doesn't * seem to be much we can do about that except document it as a * limitation. * * See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE * strategy that avoids these problems. */ > I don't recall any reports of similar issues from pre-15 releases, where > FILE_COPY was the only available option - I'm not sure why is that. > Either it didn't have this issue back then, or maybe people happen to > not create databases concurrently with a backup very often. It's a race > condition / timing issue, essentially. If it requires concurrent activity on the template database, I wouldn't be surprised at all that this is rare. > I see there have been a couple threads proposing various improvements to > FILE_COPY, that might make it more efficient/faster, namely using the > filesystem cloning [1] or switching pg_upgrade to use it [2]. But having > something that's (maybe) faster but not quite correct does not seem like > a winning strategy to me ... > > Alternatively, if we don't have clear desire to fix it, maybe the right > solution would be get rid of it? It would be unfortunate if we couldn't use this for pg_upgrade, especially if it is unaffected by these problems. -- nathan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On 6/24/24 16:53, Melanie Plageman wrote: > On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas wrote: >> >> On 21/06/2024 03:02, Peter Geoghegan wrote: >>> On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman >>> wrote: >>> The repro forces a round of index vacuuming after the standby reconnects and before pruning a dead tuple whose xmax is older than OldestXmin. At the end of the round of index vacuuming, _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId(), thereby updating the backend's GlobalVisState and moving maybe_needed backwards. >>> >>> Right. I saw details exactly consistent with this when I used GDB >>> against a production instance. >>> >>> I'm glad that you were able to come up with a repro that involves >>> exactly the same basic elements, including index page deletion. >> >> Would it be possible to make it robust so that we could always run it >> with "make check"? This seems like an important corner case to >> regression test. > > I'd have to look into how to ensure I can stabilize some of the parts > that seem prone to flaking. I can probably stabilize the vacuum bit > with a query of pg_stat_activity making sure it is waiting to acquire > the cleanup lock. > > I don't, however, see a good way around the large amount of data > required to trigger more than one round of index vacuuming. I could > generate the data more efficiently than I am doing here > (generate_series() in the from clause). Perhaps with a copy? I know it > is too slow now to go in an ongoing test, but I don't have an > intuition around how fast it would have to be to be acceptable. Is > there a set of additional tests that are slower that we don't always > run? I didn't follow how the wraparound test ended up, but that seems > like one that would have been slow. > I think it depends on what is the impact on the 'make check' duration. If it could be added to one of the existing test groups, then it depends on how long the slowest test in that group is. If the new test needs to be in a separate group, it probably needs to be very fast. But I was wondering how much time are we talking about, so I tried creating a table, filling it with 300k rows => 250ms creating an index => 180ms delete 90% data => 200ms vacuum t => 130ms which with m_w_m=1MB does two rounds of index cleanup. That's ~760ms. I'm not sure how much more stuff does the test need to do, but this would be pretty reasonable, if we could add it to an existing group. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
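Spelled out, the sequence above is just the following, which per the quoted timings completes in well under a second while still forcing two index-cleanup passes with maintenance_work_mem at its 1MB floor (a sketch; the deleted fraction is arbitrary):

SET maintenance_work_mem = '1MB';
CREATE TABLE t (a int);
INSERT INTO t SELECT i FROM generate_series(1, 300000) i;
CREATE INDEX ON t (a);
DELETE FROM t WHERE a % 10 <> 0;   -- remove ~90% of the rows
VACUUM t;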
Re: POC, WIP: OR-clause support for indexes
On Fri, Jun 21, 2024 at 6:52 PM Alena Rybakina wrote: > It's hard to tell, but I think it might be one of the good places to apply > transformation. Let me describe a brief conclusion on the two approaches. This explanation is somewhat difficult for me to follow. For example: > In the first approach, we definitely did not process the extra "OR" > expressions in the first approach, since they were packaged as an Array. It > could lead to the fact that less planning time would be spent on the > optimizer. I don't know what the "first approach" refers to, or what processing the extra "OR" expressions means, or what it would mean to package OR expressions as an array. If you made them into an SAOP then you'd have an array *instead of* OR expressions, not OR expressions "packaged as an array" but even then, they'd still be processed somewhere, unless the patch was just wrong. I think you should try writing this summary again and see if you can make it a lot clearer and more precise. I'm suspicious based that we should actually be postponing the transformation even further. If, for example, the transformation is advantageous for index scans and disadvantageous for bitmap scans, or the other way around, then this approach can't help much: it either does the transformation and all scan types are affected, or it doesn't do it and no scan types are affected. But if you decided for each scan whether to transform the quals, then you could handle that. Against that, there might be increased planning cost. But, perhaps that could be avoided somehow. > What exactly is the strategy around OR-clauses with type differences? > If I'm reading the code correctly, the first loop requires an exact > opno match, which presumably implies that the constant-type elements > are of the same type. But then why does the second loop need to use > coerce_to_common_type? > > It needs to transform all similar constants to one type, because some > constants of "OR" expressions can belong others, like the numeric and int > types. Due to the fact that array structure demands that all types must be > belonged to one type, so for this reason we applied this procedure. The alternative that should be considered is not combining things if the types don't match. If we're going to combine such things, we need to be absolutely certain that type conversion cannot fail. > I do not think this is acceptable. We should find a way to get the > right operator into the ScalarArrayOpExpr without translating the OID > back into a name and then back into an OID. > > I don’t really understand the reason why it’s better not to do this. Can you > explain please? One reason is that it is extra work to convert things to a name and then back to an OID. It's got to be slower than using the OID you already have. The other reason is that it's error-prone. If somehow the second lookup doesn't produce the same OID as the first lookup, bad things will happen, possibly including security vulnerabilities. I see you've taken steps to avoid that, like nailing down the schema, and that's good, but it's not a good enough reason to do it like this. If we don't have a function that can construct the node we need with the OID rather than the name as an argument, we should invent one, not do this sort of thing. > Also, why is the array built with eval_const_expressions instead of > something like makeArrayResult? There should be no need for general > expression evaluation here if we are just dealing with constants. 
> > I'm not ready to answer this question right now, I need time to study the use > of the makeArrayResult function in the optimizer. OK. An important consideration here is that eval_const_expressions() is prone to *fail* because it can call user-defined functions. We really don't want this optimization to cause planner failure (or queries to error out at any other stage, either). We also don't want to end up with any security problems, which is another possible danger when we call a function that can execute arbitrary code. It's better to keep it simple and only do things that we know are simple and safe, like assembling a bunch of datums that we already have into an array. -- Robert Haas EDB: http://www.enterprisedb.com
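For readers following along, the core of the transformation being discussed is this kind of rewrite (an illustrative example, not the patch's exact output):

-- As written by the user:
SELECT * FROM mytable WHERE columna = 1 OR columna = 2 OR columna = 3;

-- What the planner effectively works with after the OR-to-SAOP rewrite:
SELECT * FROM mytable WHERE columna = ANY (ARRAY[1, 2, 3]);

The questions in this subthread are about where in planning that rewrite should happen and how the constants are safely coerced into the array's element type.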
Re: basebackups seem to have serious issues with FILE_COPY in CREATE DATABASE
On 6/24/24 17:14, Nathan Bossart wrote: > On Mon, Jun 24, 2024 at 04:12:38PM +0200, Tomas Vondra wrote: >> The important observation is that this only happens if a database is >> created while the backup is running, and that it only happens with the >> FILE_COPY strategy - I've never seen this with WAL_LOG (which is the >> default since PG15). > > My first thought is that this sounds related to the large comment in > CreateDatabaseUsingFileCopy(): > > /* >* We force a checkpoint before committing. This effectively means that >* committed XLOG_DBASE_CREATE_FILE_COPY operations will never need to > be >* replayed (at least not in ordinary crash recovery; we still have to >* make the XLOG entry for the benefit of PITR operations). This avoids >* two nasty scenarios: >* >* #1: When PITR is off, we don't XLOG the contents of newly created >* indexes; therefore the drop-and-recreate-whole-directory behavior of >* DBASE_CREATE replay would lose such indexes. >* >* #2: Since we have to recopy the source database during DBASE_CREATE >* replay, we run the risk of copying changes in it that were committed >* after the original CREATE DATABASE command but before the system > crash >* that led to the replay. This is at least unexpected and at worst > could >* lead to inconsistencies, eg duplicate table names. >* >* (Both of these were real bugs in releases 8.0 through 8.0.3.) >* >* In PITR replay, the first of these isn't an issue, and the second is >* only a risk if the CREATE DATABASE and subsequent template database >* change both occur while a base backup is being taken. There doesn't >* seem to be much we can do about that except document it as a >* limitation. >* >* See CreateDatabaseUsingWalLog() for a less cheesy CREATE DATABASE >* strategy that avoids these problems. >*/ > Perhaps, the mentioned risks certainly seem like it might be related to the issues I'm observing. >> I don't recall any reports of similar issues from pre-15 releases, where >> FILE_COPY was the only available option - I'm not sure why is that. >> Either it didn't have this issue back then, or maybe people happen to >> not create databases concurrently with a backup very often. It's a race >> condition / timing issue, essentially. > > If it requires concurrent activity on the template database, I wouldn't be > surprised at all that this is rare. > Right. Although, "concurrent" here means a somewhat different thing. AFAIK there can't be a any changes concurrent with the CREATE DATABASE directly, because we make sure there are no connections: createdb: error: database creation failed: ERROR: source database "test" is being accessed by other users DETAIL: There is 1 other session using the database. But per the comment, it'd be a problem if there is activity after the database gets copied, but before the backup completes (which is where the replay will happen). >> I see there have been a couple threads proposing various improvements to >> FILE_COPY, that might make it more efficient/faster, namely using the >> filesystem cloning [1] or switching pg_upgrade to use it [2]. But having >> something that's (maybe) faster but not quite correct does not seem like >> a winning strategy to me ... >> >> Alternatively, if we don't have clear desire to fix it, maybe the right >> solution would be get rid of it? > > It would be unfortunate if we couldn't use this for pg_upgrade, especially > if it is unaffected by these problems. > Yeah. I wouldn't mind using FILE_COPY in contexts where we know it's safe, like pg_upgrade. 
I just don't want to let users to unknowingly step on this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman wrote: > We can fix this by always removing tuples considered dead before > VacuumCutoffs->OldestXmin. I don't have a great feeling about this fix. It's not that I think it's wrong. It's just that the underlying problem here is that we have heap_page_prune_and_freeze() getting both GlobalVisState *vistest and struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge of deciding what gets pruned, but that doesn't actually work, because as I pointed out in http://postgr.es/m/ca+tgmob1btwcp6r5-tovhb5wqhasptsr2tjkcdcutmzauyb...@mail.gmail.com it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your fix is to consider both variables, which again may be totally correct, but wouldn't it be a lot better if we didn't have two variables fighting for control of the same behavior? (I'm not trying to be a nuisance here -- I think it's great that you've done the work to pin this down and perhaps there is no better fix than what you've proposed.) -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Add ACL (Access Control List) acronym
On Mon, Jun 24, 2024 at 02:32:27PM +0200, Joel Jacobson wrote: > This patch is based on a suggestion from a separate thread [1]: > > On Mon, Jun 24, 2024, at 01:46, Michael Paquier wrote: >> Rather unrelated to this patch, still this patch makes the situation >> more complicated in the docs, but wouldn't it be better to add ACL as >> a term in acronyms.sql, and reuse it here? It would be a doc-only >> patch that applies on top of the rest (could be on a new thread of its >> own), with some markups added where needed. Sounds reasonable to me. + <ulink url="https://en.wikipedia.org/wiki/Access_Control_List">Access Control List</ulink>, i.e. privileges list I think we could omit "i.e. privileges list." -- nathan
Re: Improve EXPLAIN output for multicolumn B-Tree Index
On Mon, 24 Jun 2024 at 14:42, Jelte Fennema-Nio wrote: > > On Mon, 24 Jun 2024 at 13:02, Matthias van de Meent > wrote: > > It does not really behave similar: index scan keys (such as the > > id3=101 scankey) don't require visibility checks in the btree code, > > while the Filter condition _does_ require a visibility check, and > > delegates the check to the table AM if the scan isn't Index-Only, or > > if the VM didn't show all-visible during the check. > > Any chance you could point me in the right direction for the > code/docs/comment about this? I'd like to learn a bit more about why > that is the case, because I didn't realize visibility checks worked > differently for index scan keys and Filter keys. This can be derived by combining how Filter works (it only filters the returned live tuples) and how Index-Only scans work (return the index tuple, unless !ALL_VISIBLE, in which case the heap tuple is projected). There have been several threads more or less recently that also touch this topic and closely related topics, e.g. [0][1]. > > Furthermore, the index could use the scankey to improve the number of > > keys to scan using "skip scans"; by realising during a forward scan > > that if you've reached tuple (1, 2, 3) and looking for (1, _, 1) you > > can skip forward to (1, 3, _), rather than having to go through tuples > > (1, 2, 4), (1, 2, 5), ... (1, 2, n). This is not possible for > > INCLUDE-d columns, because their datatypes and structure are opaque to > > the index AM; the AM cannot assume anything about or do anything with > > those values. > > Does Postgres actually support this currently? I thought skip scans > were not available (yet). Peter Geoghegan has been working on it as project after PG17's IN()-list improvements were committed, and I hear he has the basics working but the further details need fleshing out. > > As you can see, there's a huge difference in performance. Putting both > > non-bound and "normal" filter clauses in the same Filter clause will > > make it more difficult to explain performance issues based on only the > > explain output. > > Fair enough, that's of course the main point of this patch in the > first place: being able to better interpret the explain plan when you > don't have access to the schema. Still I think Filter is the correct > keyword for both, so how about we make it less confusing by making the > current "Filter" more specific by calling it something like "Non-key > Filter" or "INCLUDE Filter" and then call the other something like > "Index Filter" or "Secondary Bound Filter". I'm not sure how debuggable explain plans are without access to the schema, especially when VERBOSE isn't configured, so I would be hesitant to accept that as an argument here. Kind regards, Matthias van de Meent Neon (https://neon.tech) [0] https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU%3D%40yamlcoder.me [1] https://www.postgresql.org/message-id/flat/cf85f46f-b02f-05b2-5248-5000b894ebab%40enterprisedb.com
Re: [PATCH] Add ACL (Access Control List) acronym
On Mon, Jun 24, 2024 at 8:44 AM Nathan Bossart wrote: > On Mon, Jun 24, 2024 at 02:32:27PM +0200, Joel Jacobson wrote: > > This patch is based on a suggestion from a separate thread [1]: > > > > On Mon, Jun 24, 2024, at 01:46, Michael Paquier wrote: > >> Rather unrelated to this patch, still this patch makes the situation > >> more complicated in the docs, but wouldn't it be better to add ACL as > >> a term in acronyms.sql, and reuse it here? It would be a doc-only > >> patch that applies on top of the rest (could be on a new thread of its > >> own), with some markups added where needed. > > Sounds reasonable to me. > +1 > + <ulink url="https://en.wikipedia.org/wiki/Access_Control_List">Access Control List</ulink>, i.e. privileges list > > I think we could omit "i.e. privileges list." > > Agreed. Between the docs and code we say "privileges list" once, and that refers to the dumputils description of the arguments to grant. As the acronym page now defines the term using fundamentals, introducing another term not used elsewhere seems undesirable. Observations: We are referencing a disambiguation page. We never actually spell out ACL anywhere, so we might as well just reference what Wikipedia believes is the expected spelling. The page we link to uses "permissions" while we consistently use "privileges" to describe the contents of the list. This seems like an obvious synonym, but as the point of these entries is to formally define things, pointing out this equivalence is worth considering. David J.
Re: POC, WIP: OR-clause support for indexes
On Mon, Jun 24, 2024 at 11:28 AM Robert Haas wrote: > > It needs to transform all similar constants to one type, because some > > constants of "OR" expressions can belong to other types, like the numeric and int > > types. Because an array requires all of its elements to be of one > > type, we applied this procedure. > > The alternative that should be considered is not combining things if > the types don't match. If we're going to combine such things, we need > to be absolutely certain that type conversion cannot fail. But what about cases like this: SELECT * FROM mytable WHERE columna = 1_000_000_000 or columna = 5_000_000_000; -- columna is int4 This is using two types, of course. 1_000_000_000 is int4, while 5_000_000_000 is bigint. If the transformation suddenly failed to work when a constant above INT_MAX was used for the first time, then I'd say that that's pretty surprising. That's what happens currently if you write the same query as "WHERE columna = any('{1_000_000_000,5_000_000_000}')", due to the way the coercion works. That seems less surprising to me, because the user is required to construct their own array, and users expect arrays to always have one element type. It would probably be okay to make the optimization not combine things/not apply when the user gratuitously mixes different syntaxes. For example, if a numeric constant was used, rather than an integer constant. Maybe it would be practical to do something with the B-Tree operator class for each of the types involved in the optimization. You could probably find a way for a SAOP to work against a "heterogeneously-typed array" while still getting B-Tree index scans -- provided the types all came from the same operator family. I'm assuming that the index has an index column whose input opclass was a member of that same family. That would necessitate changing the general definition of SAOP, and adding new code to nbtree that worked with that. But that seems doable. I was already thinking about doing something like this, to support index scans for "IS NOT DISTINCT FROM", or on constructs like "columna = 5 OR columna IS NULL". That is more or less a SAOP with two values, except that one of the values is the value NULL. I've already implemented "nbtree SAOPs where one of the elements is a NULL" for skip scan, which could be generalized to support these other cases. Admittedly I'm glossing over a lot of important details here. Does it just work for the default opclass for the type, or can we expect it to work with a non-default opclass when that's the salient opclass (the one used by our index)? I don't know what you'd do about stuff like that. -- Peter Geoghegan
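For what it's worth, the coercion failure described above is easy to reproduce by hand; a minimal sketch, assuming a table mytable with an int4 column columna:

```sql
-- Written as OR, this works today (and is one of the cases the patch
-- aims to transform into a SAOP):
SELECT * FROM mytable
WHERE columna = 1000000000 OR columna = 5000000000;

-- The hand-written SAOP fails outright, because the unknown-typed array
-- literal is coerced to the column's type (int4):
SELECT * FROM mytable
WHERE columna = ANY('{1000000000,5000000000}');
-- ERROR:  value "5000000000" is out of range for type integer
```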
Re: RFC: Additional Directory for Extensions
On Thu, Apr 11, 2024 at 01:52:26PM -0400, David E. Wheeler wrote: > I realize this probably isn´t going to happen for 17, given the freeze, > but I would very much welcome feedback and pointers to address concerns > about providing a second directory for extensions and DSOs. Quite a few > people have talked about the need for this in the Extension Mini > Summits[1], so I´m sure I could get some collaborators to make > improvements or look at a different approach. At first glance, the general idea seems reasonable to me. I'm wondering whether there is a requirement for this directory to be prepended or if it could be appended to the end. That way, the existing ones would take priority, which might be desirable from a security standpoint. -- nathan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 11:44 AM Robert Haas wrote: > I don't have a great feeling about this fix. It's not that I think > it's wrong. It's just that the underlying problem here is that we have > heap_page_prune_and_freeze() getting both GlobalVisState *vistest and > struct VacuumCutoffs *cutoffs, and the vistest wants to be in charge > of deciding what gets pruned, but that doesn't actually work, because > as I pointed out in > http://postgr.es/m/ca+tgmob1btwcp6r5-tovhb5wqhasptsr2tjkcdcutmzauyb...@mail.gmail.com > it's not properly synchronized with vacrel->cutoffs.OldestXmin. Your > fix is to consider both variables, which again may be totally correct, > but wouldn't it be a lot better if we didn't have two variables > fighting for control of the same behavior? Why would it be better? It's to our advantage to have vistest prune away extra tuples when possible. Andres placed a lot of emphasis on that during the snapshot scalability work -- vistest can be updated on the fly. The problem here is that OldestXmin is supposed to be more conservative than vistest, which it almost always is, except in this one edge case. I don't think that plugging that hole changes the basic fact that there is one source of truth about what *needs* to be pruned. There is such a source of truth: OldestXmin. -- Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan wrote: > The problem here is that OldestXmin is supposed to be more > conservative than vistest, which it almost always is, except in this > one edge case. I don't think that plugging that hole changes the basic > fact that there is one source of truth about what *needs* to be > pruned. There is such a source of truth: OldestXmin. Well, another approach could be to make it so that OldestXmin actually is always more conservative than vistest rather than almost always. I agree with you that letting the pruning horizon move forward during vacuum is desirable. I'm just wondering if having the vacuum code need to know a second horizon is really the best way to address that. -- Robert Haas EDB: http://www.enterprisedb.com
Re: POC, WIP: OR-clause support for indexes
On Mon, Jun 24, 2024 at 12:09 PM Peter Geoghegan wrote: > But what about cases like this: > > SELECT * FROM mytable WHERE columna = 1_000_000_000 or columna = > 5_000_000_000; -- columna is int4 > > This is using two types, of course. 1_000_000_000 is int4, while > 5_000_000_000 is bigint. If the transformation suddenly failed to work > when a constant above INT_MAX was used for the first time, then I'd > say that that's pretty surprising. That's what happens currently if > you write the same query as "WHERE columna = > any('{1_000_000_000,5_000_000_000}')", due to the way the coercion > works. That seems less surprising to me, because the user is required > to construct their own array, and users expect arrays to always have > one element type. I am not against handling this kind of case if we can do it, but it's more important that the patch doesn't cause gratuitous failures than that it handles more cases. > Maybe it would be practical to do something with the B-Tree operator > class for each of the types involved in the optimization. You could > probably find a way for a SAOP to work against a > "heterogeneously-typed array" while still getting B-Tree index scans > -- provided the types all came from the same operator family. I'm > assuming that the index has an index column whose input opclass was a > member of that same family. That would necessitate changing the > general definition of SAOP, and adding new code to nbtree that worked > with that. But that seems doable. I agree that something based on operator families might be viable. Why would that require changing the definition of an SAOP? > Admittedly I'm glossing over a lot of important details here. Does it > just work for the default opclass for the type, or can we expect it to > work with a non-default opclass when that's the salient opclass (the > one used by our index)? I don't know what you'd do about stuff like > that. It seems to me that it just depends on the opclasses in the query. If the user says WHERE column op1 const1 OR column op2 const2 ...then if op1 and op2 are in the same operator family and if we can convert one of const1 and const2 to the type of the other without risk of failure, then we can rewrite this as an SAOP with whichever of the two operators pertains to the target type, e.g. column1 op1 ANY[const1,converted_const2] I don't think the default opclass matters here, or the index properties either. -- Robert Haas EDB: http://www.enterprisedb.com
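For reference, the cross-type equality operator in this example is already grouped the way such a rewrite would need; a quick catalog sketch (runnable against any stock installation) shows that =(integer, bigint) belongs to the btree integer_ops family:

```sql
-- Which btree operator family provides the cross-type =(integer, bigint)?
SELECT opf.opfname, am.amname,
       amop.amoplefttype::regtype  AS lefttype,
       amop.amoprighttype::regtype AS righttype
FROM pg_amop amop
JOIN pg_opfamily opf ON opf.oid = amop.amopfamily
JOIN pg_am am ON am.oid = amop.amopmethod
WHERE amop.amopopr = '=(integer,bigint)'::regoperator
  AND am.amname = 'btree';
-- expected: integer_ops | btree | integer | bigint
```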
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 1:05 PM Robert Haas wrote: > On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan wrote: > > The problem here is that OldestXmin is supposed to be more > > conservative than vistest, which it almost always is, except in this > > one edge case. I don't think that plugging that hole changes the basic > > fact that there is one source of truth about what *needs* to be > > pruned. There is such a source of truth: OldestXmin. > > Well, another approach could be to make it so that OldestXmin actually > is always more conservative than vistest rather than almost always. If we did things like that then it would still be necessary to write a patch like the one Melanie came up with, on the grounds that we'd really need to be paranoid about having missed some subtlety. We might as well just rely on the mechanism directly. I just don't think that it makes much difference. -- Peter Geoghegan
Re: POC, WIP: OR-clause support for indexes
On Mon, Jun 24, 2024 at 1:29 PM Robert Haas wrote: > I am not against handling this kind of case if we can do it, but it's > more important that the patch doesn't cause gratuitous failures than > that it handles more cases. I agree, with the proviso that "avoid gratuitous failures" should include cases where a query that got the optimization suddenly fails to get the optimization, due only to some very innocuous looking change. Such as a change from using a constant 1_000_000_000 to a constant 5_000_000_000 in the query text. That is a POLA violation. > > Maybe it would be practical to do something with the B-Tree operator > > class for each of the types involved in the optimization. You could > > probably find a way for a SAOP to work against a > > "heterogeneously-typed array" while still getting B-Tree index scans > > -- provided the types all came from the same operator family. I'm > > assuming that the index has an index column whose input opclass was a > > member of that same family. That would necessitate changing the > > general definition of SAOP, and adding new code to nbtree that worked > > with that. But that seems doable. > > I agree that something based on operator families might be viable. Why > would that require changing the definition of an SAOP? Maybe it doesn't. My point was only that the B-Tree code doesn't necessarily need to use just one rhs type for the same column input opclass. The definition of SAOP works (or could work) in basically the same way, provided the "OR condition" were provably disjunct. We could for example mix different operators for the same nbtree scan key (with some work in nbtutils.c), just as we could support "where mycol = 5 OR mycol IS NULL" with much effort. BTW, did you know MySQL has long supported the latter? It has a <=> operator, which is basically a non-standard spelling of IS NOT DISTINCT FROM. Importantly, it is indexable, whereas right now Postgres doesn't support indexing IS NOT DISTINCT FROM. If you're interested in working on this problem within the scope of this patch, or some follow-up patch, I can take care of the nbtree side of things. > > Admittedly I'm glossing over a lot of important details here. Does it > > just work for the default opclass for the type, or can we expect it to > > work with a non-default opclass when that's the salient opclass (the > > one used by our index)? I don't know what you'd do about stuff like > > that. > > It seems to me that it just depends on the opclasses in the query. If > the user says > > WHERE column op1 const1 OR column op2 const2 > > ...then if op1 and op2 are in the same operator family and if we can > convert one of const1 and const2 to the type of the other without risk > of failure, then we can rewrite this as an SAOP with whichever of the > two operators pertains to the target type, e.g. > > column1 op1 ANY[const1,converted_const2] > > I don't think the default opclass matters here, or the index properties > either. Okay, good. The docs do say "Another requirement for a multiple-data-type family is that any implicit or binary-coercion casts that are defined between data types included in the operator family must not change the associated sort ordering" [1]. There must be precedent for this sort of thing. Probably for merge joins. [1] https://www.postgresql.org/docs/devel/btree.html#BTREE-BEHAVIOR -- Peter Geoghegan
Re: RFC: Additional Directory for Extensions
On Wed, Apr 3, 2024 at 3:13 AM Alvaro Herrera wrote: > I support the idea of there being a second location from where to load > shared libraries ... but I don't like the idea of making it > runtime-configurable. If we want to continue to tighten up what > superuser can do, then one of the things that has to go away is the > ability to load shared libraries from arbitrary locations > (dynamic_library_path). I think we should instead look at making those > locations hardcoded at compile time. The packager can then decide where > those things go, and the superuser no longer has the ability to load > arbitrary code from arbitrary locations. Is "tighten up what the superuser can do" on our list of objectives? Personally, I think we should be focusing mostly, and maybe entirely, on letting non-superusers do more things, with appropriate security controls. The superuser can ultimately do anything, since they can cause shell commands to be run and load arbitrary code into the backend and write code in untrusted procedural languages and mutilate the system catalogs and lots of other terrible things. Now, I think there are environments where people have used things like containers to try to lock down the superuser, and while I'm not sure that can ever be particularly water-tight, if it were the case that this patch would make it a whole lot easier for a superuser to bypass the kinds of controls that people are imposing today, that might be an argument against this patch. But ... off-hand, I'm not seeing such an exposure. On the patch itself, I find the documentation for this to be fairly hard to understand. I think it could benefit from an example. I'm confused about whether this is intended to let me search for extensions in /my/temp/root/usr/lib/postgresql/... by setting extension_directory=/my/temp/dir, or whether it's intended me to search both /usr/lib/postgresql as I normally would and also /some/other/place. If the latter, I wonder why we don't handle shared libraries by setting dynamic_library_path and then just have an analogue of that for control files. -- Robert Haas EDB: http://www.enterprisedb.com
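For reference, the dynamic_library_path facility mentioned above is already a superuser-settable, list-valued GUC; a hypothetical control-file analogue would presumably be used the same way. A small sketch of how the existing GUC is handled today (the extra path is an arbitrary example, not from the thread):

```sql
SHOW dynamic_library_path;   -- '$libdir' by default
-- $libdir stands for the compiled-in package library directory; additional
-- locations can be listed, colon-separated on Unix:
ALTER SYSTEM SET dynamic_library_path = '$libdir:/opt/extra/pg/lib';
SELECT pg_reload_conf();
```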
Re: POC, WIP: OR-clause support for indexes
On Mon, Jun 24, 2024 at 1:46 PM Peter Geoghegan wrote: > BTW, did you know MySQL has long supported the latter? It has a <=> > operator, which is basically a non-standard spelling of IS NOT > DISTINCT FROM. Importantly, it is indexable, whereas right now > Postgres doesn't support indexing IS NOT DISTINCT FROM. If you're > interested in working on this problem within the scope of this patch, > or some follow-up patch, I can take care of the nbtree side of things. To be clear, I meant that we could easily support "where mycol = 5 OR mycol IS NULL" and have nbtree handle that efficiently, by making it a SAOP internally. Separately, we could also make IS NOT DISTINCT FROM indexable, though that probably wouldn't need any work in nbtree. -- Peter Geoghegan
Apparent bug in WAL summarizer process (hung state)
Hello, Hope you are doing well. I've been playing a bit with the incremental backup feature which might come as part of the 17 release, and I think I hit a possible bug in the WAL summarizer process. The issue I'm facing is the summarizer process getting into a hung state. When the issue is triggered, it gets stuck in an infinite loop trying to process a WAL file that no longer exists. It apparently comes up only when I perform changes to the `summarize_wal` GUC and reload Postgres, while there is some load in Postgres which makes it recycle WAL files. I'm running Postgres 17 in a Rockylinux 9 VM. In order to have fewer WAL files available in `pg_wal` and make it easier to reproduce the issue, I'm using a low value for `max_wal_size` ('100MB'). You can find below the steps that I took to reproduce this problem, assuming this small `max_wal_size`, and `summarize_wal` initially enabled:

```bash
# Assume we initially have max_wal_size = '100MB' and summarize_wal = on

# Create a table of ~ 100MB
psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_1

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental full_backup_1/backup_manifest

# Disable summarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO off"
psql -c "SELECT pg_reload_conf()"

# Recreate a table of ~ 100MB
psql -c "DROP TABLE test"
psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"

# Re-enable summarize_wal
psql -c "ALTER SYSTEM SET summarize_wal TO on"
psql -c "SELECT pg_reload_conf()"

# Take a full backup
pg_basebackup -X none -c fast -P -D full_backup_2

# Take an incremental backup
pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental full_backup_2/backup_manifest
```

I'm able to reproduce the issue most of the time when running these steps manually. It's harder to reproduce if I attempt to run those commands as a bash script.
This is the sample output of a run of those commands:

```console
(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"
SELECT 300
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D full_backup_1
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
331785/331785 kB (100%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"
DROP TABLE
(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"
SELECT 300
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental full_backup_1/backup_manifest
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
111263/331720 kB (33%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO off"
ALTER SYSTEM
(barman) [postgres@barmandevhost ~]$ psql -c "SELECT pg_reload_conf()"
 pg_reload_conf
 t
(1 row)

(barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"
DROP TABLE
(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 300)"
SELECT 300
(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO on"
ALTER SYSTEM
(barman) [postgres@barmandevhost ~]$ psql -c "SELECT pg_reload_conf()"
 pg_reload_conf
 t
(1 row)

(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D full_backup_2
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
331734/331734 kB (100%), 1/1 tablespace
(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental full_backup_2/backup_manifest
WARNING: still waiting for WAL summarization through 2/C128 after 10 seconds
DETAIL: Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING: still waiting for WAL summarization through 2/C128 after 20 seconds
DETAIL: Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING: still waiting for WAL summarization through 2/C128 after 30 seconds
DETAIL: Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING: still waiting for WAL summarization through 2/C128 after 40 seconds
DETAIL: Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING: still waiting for WAL summarization through 2/C128 after 50 seconds
DETAIL: Summarization has reached 2/B3D8 on disk and 2/B3D8 in memory.
WARNING: still waiting for WAL summarization through 2/C128 after 60 seconds
DETAIL: Summarization has reached 2/B3D8 on disk and
```
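For anyone reproducing this, the summarizer's progress can also be watched from SQL using the PG17 monitoring functions pg_get_wal_summarizer_state() and pg_available_wal_summaries(); a small sketch:

```sql
-- Current summarizer position: summarized_tli, summarized_lsn,
-- pending_lsn, and the summarizer's PID.
SELECT * FROM pg_get_wal_summarizer_state();

-- Summary files already written under pg_wal/summaries.
SELECT * FROM pg_available_wal_summaries() ORDER BY end_lsn DESC LIMIT 5;
```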
Re: Apparent bug in WAL summarizer process (hung state)
I'm attaching the files which I missed in the original email.

> 19:34:17.437626 epoll_wait(5, [], 1, 8161) = 0 <8.171542>
19:34:25.610176 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.000334>
19:34:25.611012 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000323>
19:34:25.611657 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.000406>
19:34:25.612327 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:25.612"..., 132) = 132 <0.000203>
19:34:25.612873 epoll_wait(5, [], 1, 1) = 0 <10.010950>
19:34:35.623942 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.12>
19:34:35.624120 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = -1 ENOENT (No such file or directory) <0.18>
19:34:35.624191 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.07>
19:34:35.624237 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:35.624"..., 132) = 132 <0.52>
19:34:35.624341 epoll_wait(5, [], 1, 1) = 0 <10.010941>
19:34:45.635408 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.16>
19:34:45.635602 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = -1 ENOENT (No such file or directory) <0.62>
19:34:45.635765 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.09>
19:34:45.635830 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:45.635"..., 132) = 132 <0.000495>
19:34:45.636390 epoll_wait(5, [], 1, 1) = 0 <10.008785>
19:34:55.645305 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.16>
19:34:55.645520 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = -1 ENOENT (No such file or directory) <0.32>
19:34:55.645645 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.27>
19:34:55.646214 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:34:55.645"..., 132) = 132 <0.000136>
19:34:55.646445 epoll_wait(5, [], 1, 1) = 0 <10.011731>
19:35:05.658366 rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.18>
19:35:05.658591 openat(AT_FDCWD, "pg_wal/0001000200B3", O_RDONLY) = -1 ENOENT (No such file or directory) <0.28>
19:35:05.658689 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0 <0.10>
19:35:05.658750 write(2, "\0\0{\0\357\23\1\0\0212024-06-24 17:35:05.658"..., 132) = 132 <0.000521>
19:35:05.659335 epoll_wait(5,

#0 0x7fb51c50e1da in epoll_wait (epfd=5, events=0x1409fd8, maxevents=1, timeout=timeout@entry=1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00922bad in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffe5492aba0, cur_timeout=, set=0x1409f70) at storage/ipc/latch.c:1570
#2 WaitEventSetWait (set=0x1409f70, timeout=1, occurred_events=0x7ffe5492aba0, nevents=1, wait_event_info=) at storage/ipc/latch.c:1516
#3 0x00920e65 in WaitLatch (latch=, wakeEvents=, timeout=, wait_event_info=150994953) at storage/ipc/latch.c:538
#4 0x008aa7e4 in WalSummarizerMain (startup_data=, startup_data_len=) at postmaster/walsummarizer.c:317
#5 0x008a46a0 in postmaster_child_launch (client_sock=0x0, startup_data_len=0, startup_data=0x0, child_type=B_WAL_SUMMARIZER) at postmaster/launch_backend.c:265
#6 postmaster_child_launch (client_sock=0x0, startup_data_len=0, startup_data=0x0, child_type=B_WAL_SUMMARIZER) at postmaster/launch_backend.c:226
#7 StartChildProcess (type=type@entry=B_WAL_SUMMARIZER) at postmaster/postmaster.c:3928
#8 0x008a4dcf in MaybeStartWalSummarizer () at postmaster/postmaster.c:4075
#9 MaybeStartWalSummarizer () at postmaster/postmaster.c:4070
#10 0x008a7f8f in ServerLoop () at postmaster/postmaster.c:1746
#11 0x0089f5e0 in PostmasterMain (argc=3, argv=0x14097b0) at postmaster/postmaster.c:1372
#12 0x005381aa in main (argc=3, argv=0x14097b0) at main/main.c:197

No symbol "full" in current context.

#0 0x7fb51c50e1da in epoll_wait (epfd=5, events=0x1409fd8, maxevents=1, timeout=timeout@entry=1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
        sc_ret = -4
        sc_ret =
#1 0x00922bad in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffe5492aba0, cur_timeout=, set=0x1409f70) at storage/ipc/latch.c:1570
        returned_events = 0
        rc =
        cur_event =
        cur_epoll_event =
        returned_events =
        rc =
        cur_event =
        cur_epoll_event =
        __func__ = { }
        __errno_location =
#2 WaitEventSetWait (set=0x1409f70, timeout=1, occurred_events=0x7ffe5492aba0, nevents=1, wait_event_info=) at storage/ipc/latch.c:1516
        rc =
        returned_events = 0
        start_time = {ticks = 49310570173263}
        cur_time = {ticks = }
        cur_timeout =
#3 0x00920
PostgreSQL does not compile on macOS SDK 15.0
Xcode 16 beta was released on 2024-06-10 and ships with macOS SDK 15.0 [1]. It appears PostgreSQL does not compile due to typedef redefinition errors in the regex library:

/tmp/Xcode-beta.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -Wno-cast-function-type-strict -O2 -I../../../src/include -isysroot /tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk -c -o regcomp.o regcomp.c
In file included from regcomp.c:2647:
In file included from ./regc_pg_locale.c:21:
In file included from ../../../src/include/utils/pg_locale.h:16:
In file included from /tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h:45:
In file included from /tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale/_regex.h:27:
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/_regex.h:107:24: error: typedef redefinition with different types ('__darwin_off_t' (aka 'long long') vs 'long')
  107 | typedef __darwin_off_t regoff_t;
      |                        ^
../../../src/include/regex/regex.h:48:14: note: previous definition is here
   48 | typedef long regoff_t;
      |              ^
In file included from regcomp.c:2647:
In file included from ./regc_pg_locale.c:21:
In file included from ../../../src/include/utils/pg_locale.h:16:
In file included from /tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h:45:
In file included from /tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale/_regex.h:27:
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/_regex.h:114:3: error: typedef redefinition with different types ('struct regex_t' vs 'struct regex_t')
  114 | } regex_t;
      |   ^
../../../src/include/regex/regex.h:82:3: note: previous definition is here
   82 | } regex_t;
      |   ^
In file included from regcomp.c:2647:
In file included from ./regc_pg_locale.c:21:
In file included from ../../../src/include/utils/pg_locale.h:16:
In file included from /tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h:45:
In file included from /tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale/_regex.h:27:
/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/_regex.h:119:3: error: typedef redefinition with different types ('struct regmatch_t' vs 'struct regmatch_t')
  119 | } regmatch_t;
      |   ^
../../../src/include/regex/regex.h:89:3: note: previous definition is here
   89 | } regmatch_t;
      |   ^
3 errors generated.
make[3]: *** [regcomp.o] Error 1
make[2]: *** [regex-recursive] Error 2
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2

I've reproduced this issue by:
1. Download the XCode 16 beta 2 ZIP file: https://developer.apple.com/services-account/download?path=/Developer_Tools/Xcode_16_beta/Xcode_16_beta.xip
2. Extract this to `/tmp`.
3.
Then I ran:

export PATH=/tmp/Xcode-beta.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin:$PATH
export SDKROOT=/tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk
export XCODE_DIR=/tmp/Xcode-beta.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain
export CC="$XCODE_DIR/usr/bin/clang"
export CXX="$XCODE_DIR/usr/bin/clang++"
./configure CC="$CC" CXX="$CXX"
make

The compilation goes through if I comment out the "#include <xlocale/_regex.h>" from /tmp/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/xlocale.h. However, even on macOS SDK 14.5 I see that include statement. I'm still trying to figure out what changed here.

[1] - https://developer.apple.com/macos/
Re: POC, WIP: OR-clause support for indexes
On Mon, Jun 24, 2024 at 1:47 PM Peter Geoghegan wrote: > I agree, with the proviso that "avoid gratuitous failures" should > include cases where a query that got the optimization suddenly fails > to get the optimization, due only to some very innocuous looking > change. Such as a change from using a constant 1_000_000_000 to a > constant 5_000_000_000 in the query text. That is a POLA violation. Nope, I don't agree with that at all. If you imagine that we can either have the optimization apply to one of those cases but not the other, or on the other hand we can have some cases that outright fail, I think it's entirely clear that the former is better. > Maybe it doesn't. My point was only that the B-Tree code doesn't > necessarily need to use just one rhs type for the same column input > opclass. The definition of SAOP works (or could work) in basically the > same way, provided the "OR condition" were provably disjunct. We could > for example mix different operators for the same nbtree scan key (with > some work in nbtutils.c), just as we could support "where mycol = 5 OR > mycol IS NULL" with much effort. > > BTW, did you know MySQL has long supported the latter? It has a <=> > operator, which is basically a non-standard spelling of IS NOT > DISTINCT FROM. Importantly, it is indexable, whereas right now > Postgres doesn't support indexing IS NOT DISTINCT FROM. If you're > interested in working on this problem within the scope of this patch, > or some follow-up patch, I can take care of the nbtree side of things. I was assuming this patch shouldn't be changing the way indexes work at all, just making use of the facilities that we have today. More could be done, but that might make it harder to get anything committed. Before we get too deep into arguing about hypotheticals, I don't think there's any problem here that we can't solve with the infrastructure we already have. For instance, consider this:

robert.haas=# explain select * from foo where a in (1, 1000000000000);
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on foo1 foo  (cost=0.00..25.88 rows=13 width=36)
   Filter: (a = ANY ('{1,1000000000000}'::bigint[]))
(2 rows)

I don't know exactly what's happening here, but it seems very similar to what we need to have happen for this patch to work. pg_typeof(1) is integer, and pg_typeof(1000000000000) is bigint, and we're able to figure out that it's OK to put both of those in an array of a single type without having any type conversion failures. If you replace 1000000000000 with 2, then the array ends up being of type integer[] rather than type bigint[], so clearly the system is able to reason its way through these kinds of scenarios already. It's even possible, in my mind at least, that the patch is already doing exactly the right things here. Even if it isn't, the problem doesn't seem to be fundamental, because if this example can work (and it does) then what the patch is trying to do should be workable, too. We just have to make sure we're plugging all the pieces properly together, and that we have comments that adequately explain what is happening and test cases that verify it. My feeling is that the patch doesn't meet that standard today, but I think that just means it needs some more work. I'm not arguing we have to throw the whole thing out, or invent a lot of new infrastructure, or anything like that. -- Robert Haas EDB: http://www.enterprisedb.com
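The contrasting case mentioned above is just as easy to check by hand; a sketch against the same kind of table, with the expected Filter line shown as a comment:

```sql
EXPLAIN SELECT * FROM foo WHERE a IN (1, 2);
--   Filter: (a = ANY ('{1,2}'::integer[]))
```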
Re: improve predefined roles documentation
On Fri, Jun 21, 2024 at 11:40 AM Nathan Bossart wrote: > Done. If you look at how the varlistentries begin, there are three separate patterns: * Some document a single role and start with "Allow doing blah blah blah". * Some document a couple of roles so there are several paragraphs, each beginning with "name_of_role". -- Robert Haas EDB: http://www.enterprisedb.com
Re: Proposal: Document ABI Compatibility
On Mon, Jun 17, 2024 at 6:38 PM David E. Wheeler wrote: > Is it? ISTM that there is the intention not to break things that don’t need > to be broken, though that doesn’t rule out interface improvements. I suppose that it's true that we try to avoid gratuitous breakage, but I feel like it would be weird to document that. Sometimes I go to a store and I see a sign that says "shoplifters will be prosecuted." But I have yet to see a store with a sign that says "people who appear to be doing absolutely nothing wrong will not be prosecuted." If I did see such a sign, I would frankly be a little concerned. -- Robert Haas EDB: http://www.enterprisedb.com
Re: POC, WIP: OR-clause support for indexes
On Mon, Jun 24, 2024 at 2:28 PM Robert Haas wrote: > On Mon, Jun 24, 2024 at 1:47 PM Peter Geoghegan wrote: > > I agree, with the proviso that "avoid gratuitous failures" should > > include cases where a query that got the optimization suddenly fails > > to get the optimization, due only to some very innocuous looking > > change. Such as a change from using a constant 1_000_000_000 to a > > constant 5_000_000_000 in the query text. That is a POLA violation. > > Nope, I don't agree with that at all. If you imagine that we can > either have the optimization apply to one of those cases but not the other, > or on the other hand we can have some cases that outright fail, I > think it's entirely clear that the former is better. I'm just saying that not having the optimization apply to a query very similar to one where it does apply is a POLA violation. That's another kind of failure, for all practical purposes. Weird performance cliffs like that are bad. It's very easy to imagine code that generates a query text, that at some point randomly and mysteriously gets a sequential scan. Or a much less efficient index scan. > I was assuming this patch shouldn't be changing the way indexes work > at all, just making use of the facilities that we have today. More > could be done, but that might make it harder to get anything > committed. I was just pointing out that there is currently no good way to make nbtree efficiently execute a qual "WHERE a = 5 OR a IS NULL", which is almost entirely (though not quite entirely) due to a lack of any way of expressing that idea through SQL, in a way that'll get pushed down to the index scan node. You can write "WHERE a = any('{5,NULL}')", of course, but that doesn't treat NULL as just another array element to match against via an IS NULL qual (due to NULL semantics). Yeah, this is nonessential. But it's quite a nice optimization, and seems entirely doable within the framework of the patch. It would be a natural follow-up. All that I'd need on the nbtree side is to get an input scan key that directly embodies "WHERE mycol = 5 OR mycol IS NULL". That would probably just be a scan key with sk_flags "SK_SEARCHARRAY | SK_SEARCHNULL", that was otherwise identical to the current SK_SEARCHARRAY scan keys. Adapting the nbtree array index scan code to work with this would be straightforward. SK_SEARCHNULL scan keys basically already work like regular equality scan keys at execution time, so all that this optimization requires on the nbtree side is teaching _bt_advance_array_keys to treat NULL as a distinct array condition (evaluated as IS NULL, not as = NULL). > It's even possible, in my mind at least, that the patch is already > doing exactly the right things here. Even if it isn't, the problem > doesn't seem to be fundamental, because if this example can work (and > it does) then what the patch is trying to do should be workable, too. > We just have to make sure we're plugging all the pieces properly > together, and that we have comments that adequately explain what is > happening and test cases that verify it. My feeling is that the patch > doesn't meet that standard today, but I think that just means it needs > some more work. I'm not arguing we have to throw the whole thing out, > or invent a lot of new infrastructure, or anything like that. I feel like my point about the potential for POLA violations is pretty much just common sense. I'm not particular about the exact mechanism by which we avoid it; only that we avoid it. -- Peter Geoghegan
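A tiny sketch of the NULL-semantics point, runnable as-is:

```sql
-- A NULL array element is not matched via IS NULL semantics:
SELECT 1 WHERE NULL::int = ANY ('{5,NULL}'::int[]);   -- zero rows (result is NULL)
-- ...whereas the qual we would like the index scan to absorb behaves like this:
SELECT 1 WHERE NULL::int = 5 OR NULL::int IS NULL;    -- one row
```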
Re: strange context message in spi.c?
> On 24 Jun 2024, at 11:14, Stepan Neretin wrote: > > Hi! Looks good to me! Thanks for review. I have this on my TODO for when the tree branches, it doesn't seem like anything worth squeezing in before then. -- Daniel Gustafsson
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
On Wed, 2024-06-19 at 08:53 +0530, Ashutosh Sharma wrote: > For standalone functions, users can easily adjust the search_path > settings as needed. However, managing this becomes challenging for > functions created via extensions. Extensions are relocatable, making > it difficult to determine and apply search_path settings in advance > within the extension_name--*.sql file when defining functions or > procedures. A special search_path variable "$extension_schema" would be a better solution to this problem. We need something like that anyway, in case the extension is relocated, so that we don't have to dig through the catalog and update all the settings. > Introducing a new control flag option allows users to set > implicit search_path settings for functions created by the extension, > if needed. The main aim here is to enhance security for functions > created by extensions by setting search_path at the function level. > This ensures precise control over how objects are accessed within > each > function, making behavior more predictable and minimizing security > risks, That leaves me wondering whether it's being addressed at the right level. For instance, did you consider just having a GUC to control the default search_path for a function? That may be too magical, but if so, why would an extension-only setting be better? > especially for SECURITY DEFINER functions associated with > extensions created by superusers. I'm not sure that it's specific to SECURITY DEFINER functions. Regards, Jeff Davis
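For concreteness, this is the shape of what extension scripts have to do by hand today; a minimal sketch with hypothetical schema, function, and table names (the proposed $extension_schema token would stand in for the hardcoded schema name):

```sql
CREATE FUNCTION my_ext.bump_counter() RETURNS void
LANGUAGE sql
SECURITY DEFINER
-- Pin the search_path so unqualified names cannot be captured by schemas the
-- caller controls; a $extension_schema setting would replace the hardcoded
-- schema here if the extension were relocated.
SET search_path = my_ext, pg_temp
AS $$ UPDATE my_ext.counter SET n = n + 1 $$;
```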
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 1:05 PM Robert Haas wrote: > > On Mon, Jun 24, 2024 at 12:43 PM Peter Geoghegan wrote: > > The problem here is that OldestXmin is supposed to be more > > conservative than vistest, which it almost always is, except in this > > one edge case. I don't think that plugging that hole changes the basic > > fact that there is one source of truth about what *needs* to be > > pruned. There is such a source of truth: OldestXmin. > > Well, another approach could be to make it so that OldestXmin actually > is always more conservative than vistest rather than almost always. For the purposes of pruning, we are effectively always using the more conservative of the two with this patch. Are you more concerned about having a single horizon for pruning or about having a horizon that does not move backwards after being established at the beginning of vacuuming the relation? Right now, in master, we do use a single horizon when determining what is pruned -- that from GlobalVisState. OldestXmin is only used for freezing and full page visibility determinations. Using a different horizon for pruning by vacuum than freezing is what is causing the error on master. > I agree with you that letting the pruning horizon move forward during > vacuum is desirable. I'm just wondering if having the vacuum code need > to know a second horizon is really the best way to address that. I was thinking about this some more and I realized I don't really get why we think using GlobalVisState for pruning will let us remove more tuples in the common case. I had always thought it was because the vacuuming backend's GlobalVisState will get updated periodically throughout vacuum and so, assuming the oldest running transaction changes, our horizon for vacuum would change. But, in writing this repro, it is actually quite hard to get GlobalVisState to update. Our backend's RecentXmin needs to have changed. And there aren't very many places where we take a new snapshot after starting to vacuum a relation. One of those is at the end of index vacuuming, but that can only affect the pruning horizon if we have to do multiple rounds of index vacuuming. Is that really the case we are thinking of when we say we want the pruning horizon to move forward during vacuum? - Melanie
Re: PostgreSQL does not compile on macOS SDK 15.0
It appears in macOS SDK 14.5, there were include guards in $SDK_ROOT/usr/include/xlocale/_regex.h:

#ifndef _XLOCALE__REGEX_H_
#define _XLOCALE__REGEX_H_
#ifndef _REGEX_H_
#include <_regex.h>
#endif // _REGEX_H_
#include <_xlocale.h>

In macOS SDK 15.0, these include guards are gone:

#ifndef _XLOCALE__REGEX_H_
#define _XLOCALE__REGEX_H_
#include <_regex.h>
#include <__xlocale.h>

Since _REGEX_H_ was defined locally in PostgreSQL's version of src/include/regex/regex.h, these include guards prevented duplicate definitions from /usr/include/_regex.h (not to be confused with /usr/include/xlocale/_regex.h). If I hack the PostgreSQL src/include/regex/regex.h to include the double underscore include guard of __REGEX_H_, the build succeeds:

```
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index d08113724f..734172167a 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -1,3 +1,6 @@
+#ifndef __REGEX_H_
+#define __REGEX_H_ /* never again */
+
 #ifndef _REGEX_H_
 #define _REGEX_H_ /* never again */
 /*
@@ -187,3 +190,5 @@ extern bool RE_compile_and_execute(text *text_re, char *dat, int dat_len,
 		int nmatch, regmatch_t *pmatch);
 
 #endif /* _REGEX_H_ */
+
+#endif /* __REGEX_H_ */
```

Any better ideas here?
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman wrote: > I had always thought it was because the vacuuming backend's > GlobalVisState will get updated periodically throughout vacuum and so, > assuming the oldest running transaction changes, our horizon for > vacuum would change. I believe that it's more of an aspirational thing at this point. That it is currently aspirational (it happens to some extent but isn't ever particularly useful) shouldn't change the analysis about how to fix this bug. > One of those is at the > end of index vacuuming, but that can only affect the pruning horizon > if we have to do multiple rounds of index vacuuming. Is that really > the case we are thinking of when we say we want the pruning horizon to > move forward during vacuum? No, that's definitely not what we were thinking of. It's just an accident that it's almost the only thing that'll do that. -- Peter Geoghegan
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman wrote: > > If vacuum fails to remove a tuple with xmax older than > VacuumCutoffs->OldestXmin and younger than > GlobalVisState->maybe_needed, it will ERROR out when determining > whether or not to freeze the tuple with "cannot freeze committed > xmax". One thing I don't understand is why it is okay to freeze the xmax of a dead tuple just because it is from an aborted update. heap_prepare_freeze_tuple() is called on HEAPTUPLE_RECENTLY_DEAD tuples with normal xmaxes (non-multis) so that it can freeze tuples from aborted updates. The only case in which we freeze dead tuples with a non-multi xmax is if the xmax is from before OldestXmin and is also not committed (so from an aborted update). Freezing dead tuples replaces their xmax with InvalidTransactionId -- which would make them look alive. So, it makes sense we don't do this for dead tuples in the common case. But why is it 1) okay and 2) desirable to freeze xmaxes of tuples from aborted updates? Won't it make them look alive again? - Melanie
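A tiny illustration of the aborted-update case in question, using a throwaway table (xmin and xmax are the hidden system columns):

```sql
CREATE TABLE t (i int);
INSERT INTO t VALUES (1);

BEGIN;
UPDATE t SET i = 2;   -- stamps xmax on the original tuple version
ROLLBACK;             -- the updater aborts; the original version stays live

-- The surviving tuple still carries the aborted transaction's xid in xmax;
-- freezing replaces that xmax with InvalidTransactionId.
SELECT xmin, xmax, i FROM t;
```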
Re: RFC: Additional Directory for Extensions
On Jun 24, 2024, at 1:53 PM, Robert Haas wrote: > Is "tighten up what the superuser can do" on our list of objectives? > Personally, I think we should be focusing mostly, and maybe entirely, > on letting non-superusers do more things, with appropriate security > controls. The superuser can ultimately do anything, since they can > cause shell commands to be run and load arbitrary code into the > backend and write code in untrusted procedural languages and mutilate > the system catalogs and lots of other terrible things. I guess the question then is what security controls are appropriate for this feature, which after all tells the postmaster what directories to read files from. It feels a little outside the scope of a regular user to even be aware of the file system undergirding the service. But perhaps there’s a non-superuser role for whom it is appropriate? > Now, I think there are environments where people have used things like > containers to try to lock down the superuser, and while I'm not sure > that can ever be particularly water-tight, if it were the case that > this patch would make it a whole lot easier for a superuser to bypass > the kinds of controls that people are imposing today, that might be an > argument against this patch. But ... off-hand, I'm not seeing such an > exposure. Yeah I’m not even sure I follow. Containers are immutable, other than mutable mounted volumes --- which is one use case this patch is attempting to enable. > On the patch itself, I find the documentation for this to be fairly > hard to understand. I think it could benefit from an example. I'm > confused about whether this is intended to let me search for > extensions in /my/temp/root/usr/lib/postgresql/... by setting > extension_directory=/my/temp/dir, or whether it's intended me to > search both /usr/lib/postgresql as I normally would and also > /some/other/place. I sketched them quickly, so agree they can be better. Reading the code, I now see that it appears to be the former case. I’d like to advocate for the latter. > If the latter, I wonder why we don't handle shared > libraries by setting dynamic_library_path and then just have an > analogue of that for control files. The challenge is that it applies not just to shared object libraries and control files, but also extension SQL files and any other SHAREDIR files an extension might include. But also, I think it should support all the pg_config installation targets that extensions might use, including: BINDIR DOCDIR HTMLDIR PKGINCLUDEDIR LOCALEDIR MANDIR I can imagine an extension wanting or needing to use any and all of these. Best, David
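For what it's worth, those installation paths are also visible from SQL through the pg_config system view, which can be handy when reasoning about where an extension's files are expected to land; for example:

```sql
SELECT name, setting
FROM pg_config
WHERE name IN ('BINDIR', 'DOCDIR', 'HTMLDIR', 'PKGINCLUDEDIR',
               'LOCALEDIR', 'MANDIR', 'SHAREDIR', 'PKGLIBDIR');
```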
Unusually long checkpoint time on version 16, and 17beta1 running in a docker container
Greetings,

While testing pgjdbc I noticed the following

pgdb-1 | Will execute command on database postgres:
pgdb-1 | SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE slot_name = 'replica_one';
pgdb-1 | DROP USER IF EXISTS replica_one;
pgdb-1 | CREATE USER replica_one WITH REPLICATION PASSWORD 'test';
pgdb-1 | SELECT * FROM pg_create_physical_replication_slot('replica_one');
pgdb-1 |
pgdb-1 | NOTICE: role "replica_one" does not exist, skipping
pgdb-1 | pg_drop_replication_slot
pgdb-1 | --
pgdb-1 | (0 rows)
pgdb-1 |
pgdb-1 | DROP ROLE
pgdb-1 | CREATE ROLE
pgdb-1 | slot_name | lsn
pgdb-1 | -+-
pgdb-1 | replica_one |
pgdb-1 | (1 row)
pgdb-1 |
pgdb-1 | waiting for checkpoint
pgdb-1 | 2024-06-24 19:07:18.569 UTC [66] LOG: checkpoint starting: force wait
pgdb-1 | 2024-06-24 19:11:48.008 UTC [66] LOG: checkpoint complete: wrote 6431 buffers (39.3%); 0 WAL file(s) added, 0 removed, 3 recycled; write=269.438 s, sync=0.001 s, total=269.439 s; sync files=0, longest=0.000 s, average=0.000 s; distance=44140 kB, estimate=44140 kB; lsn=0/4B8, redo lsn=0/428

Note that it takes 4 minutes 48 seconds to do the checkpoint. This seems ridiculously long? If I add a checkpoint before doing anything there is no delay

Will execute command on database postgres:
pgdb-1 | checkpoint;
pgdb-1 | SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE slot_name = 'replica_one';
pgdb-1 | DROP USER IF EXISTS replica_one;
pgdb-1 | CREATE USER replica_one WITH REPLICATION PASSWORD 'test';
pgdb-1 | SELECT * FROM pg_create_physical_replication_slot('replica_one');
pgdb-1 |
pgdb-1 | 2024-06-24 19:19:57.498 UTC [66] LOG: checkpoint starting: immediate force wait
pgdb-1 | 2024-06-24 19:19:57.558 UTC [66] LOG: checkpoint complete: wrote 6431 buffers (39.3%); 0 WAL file(s) added, 0 removed, 2 recycled; write=0.060 s, sync=0.001 s, total=0.061 s; sync files=0, longest=0.000 s, average=0.000 s; distance=29947 kB, estimate=29947 kB; lsn=0/3223BA0, redo lsn=0/3223B48
===> pgdb-1 | CHECKPOINT
pgdb-1 | pg_drop_replication_slot
pgdb-1 | --
pgdb-1 | (0 rows)
pgdb-1 |
pgdb-1 | DROP ROLE
pgdb-1 | NOTICE: role "replica_one" does not exist, skipping
pgdb-1 | CREATE ROLE
pgdb-1 | slot_name | lsn
pgdb-1 | -+-
pgdb-1 | replica_one |
pgdb-1 | (1 row)
pgdb-1 |
pgdb-1 | waiting for checkpoint
pgdb-1 | 2024-06-24 19:19:57.614 UTC [66] LOG: checkpoint starting: force wait
pgdb-1 | 2024-06-24 19:19:57.915 UTC [66] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.301 s, sync=0.001 s, total=0.302 s; sync files=0, longest=0.000 s, average=0.000 s; distance=14193 kB, estimate=28372 kB; lsn=0/480, redo lsn=0/428

This starts in version 16; versions up to and including 15 do not impose the wait.

Dave Cramer
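One possibly relevant detail: the slow run logs a plain "checkpoint starting: force wait" rather than "immediate force wait", i.e. a spread checkpoint, and spread checkpoints are throttled by the usual checkpoint settings. A quick way to check those on the affected instance:

```sql
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'checkpoint_completion_target',
               'max_wal_size', 'checkpoint_flush_after');
```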
Re: [PATCH] Add ACL (Access Control List) acronym
On Mon, Jun 24, 2024, at 18:02, David G. Johnston wrote: > On Mon, Jun 24, 2024 at 8:44 AM Nathan Bossart > wrote: >> I think we could omit "i.e. privileges list." >> > > Agreed. Between the docs and code we say "privileges list" once and > that refers to the dumputIls description of the arguments to grant. As > the acronym page now defines the term using fundamentals, introducing > another term not used elsewhere seems undesirable. New version attached. > Observations: > We are referencing a disambiguation page. We never actually spell out > ACL anywhere so we might as well just reference what Wikipedia believes > is the expected spelling. > > The page we link to uses "permissions" while we consistently use > "privileges" to describe the contents of the list. This seems like an > obvious synonym, but as the point of these is to formally define > things, pointing this equivalence is worth considering. I like this idea. How could this be implemented in the docs? Maybe a ... for ACL in acronyms.sgml? /Joel v2-0001-Add-ACL-Access-Control-List-acronym.patch Description: Binary data
Re: Proposal: Document ABI Compatibility
On Jun 19, 2024, at 05:42, Peter Eisentraut wrote: >>> https://postgr.es/m/CAH2-Wzm-W6hSn71sUkz0Rem=qdeu7tnfmc7_jg2djrlfef_...@mail.gmail.com >>> >>> Theoretically anybody can do this themselves. In practice they don't. >>> So something as simple as providing automated reports about ABI >>> changes might well move the needle here. >> What would be required to make such a thing? Maybe it’d make a good Summer >> of Code project. > > The above thread contains a lengthy discussion about what one could do. I somehow missed that GSoC 2024 is already going with contributors. Making a mental note to add an item for 2025. D
Re: Proposal: Document ABI Compatibility
On Jun 24, 2024, at 14:51, Robert Haas wrote: > I suppose that it's true that we try to avoid gratuitous breakage, but > I feel like it would be weird to document that. I see how that can seem weird to a committer deeply familiar with the development process and how things happen. But people outside the -hackers bubble have very little idea. It’s fair to say it needn’t be a long statement for major versions: a single sentence such as “we try to avoid gratuitous breakage” is a perfectly reasonable framing. But I’d say, in the interest of completeness, it would be useful to document the policy for major release *as well as* minor releases. Best, David
Re: [PATCH] Add ACL (Access Control List) acronym
On Mon, Jun 24, 2024 at 12:46 PM Joel Jacobson wrote: > On Mon, Jun 24, 2024, at 18:02, David G. Johnston wrote: > > > The page we link to uses "permissions" while we consistently use > > "privileges" to describe the contents of the list. This seems like an > > obvious synonym, but as the point of these is to formally define > > things, pointing this equivalence is worth considering. > > I like this idea. How could this be implemented in the docs? Maybe a > ... for ACL in acronyms.sgml? > > Add a second under the one holding the link? David J.
Re: Proposal: Document ABI Compatibility
On Jun 19, 2024, at 05:41, Peter Eisentraut wrote: > This is probably a bit confusing. This might as well mean client application > code against libpq. Better something like "server plugin code that uses the > PostgreSQL server APIs". That works. > But now we're talking about API. That might be subject of another document > or another section in this one, but it seems confusing to mix this with the > ABI discussion. Hrm. They’re super closely-related in my mind, as an extension developer. I need to know both! I guess I’m taking of this policy as what I can expect may be changed (and how to adapt to it) and what won’t. That said, I’m fine to remove the API stuff if there’s consensus objecting to it, to be defined in a separate policy (perhaps on the same doc page). >> PostgreSQL avoids unnecessary API changes in major releases, but usually >> ships a few necessary API changes, including deprecation, renaming, and >> argument variation. > > Obviously, as a practical matter, there won't be random pointless changes. > But I wouldn't go as far as writing anything down about how these APIs are > developed. Fair enough, was trying to give some idea of the sorts of changes. Don’t have to include them. >> In such cases the incompatible changes will be listed in the Release Notes. > > I don't think anyone is signing up to do that. It needn’t be comprehensive. Just mention that an ABI or API changed in the release note item. Unless they almost *all* make such changes. >> Minor Releases >> -- > I think one major problem besides actively avoiding or managing such > minor-version ABI breaks is some automated detection. Otherwise, this just > means "we try, but who knows”. I think you *do* try, and the fact that there are so few issues means you succeed at that. I’m not advocating for an ABI guarantee here, just a description of the policy committees already follow. Here’s an update based on all the feedback, framing things more from the perspective of “do I need to recompile this or change my code”. Many thanks! ``` md ABI Policy == Changes to the the PostgreSQL server APIs may require recompilation of server plugin code that uses them. This policy describes the core team's approach to such changes, and what server API users can expect. Major Releases -- Applications that use server APIs must be compiled for each major release supported by the application. The inclusion of `PG_MODULE_MAGIC` ensures that code compiled for one major version will rejected by other major versions. Developers needing to support multiple versions of PostgreSQL with incompatible APIs should use the `PG_VERSION_NUM` constant to adjust code as appropriate. For example: ``` c #if PG_VERSION_NUM >= 16 #include "varatt.h" #endif ``` The core team avoids unnecessary breakage, but users of the server APIs should expect and be prepared to make adjustments and recompile for every major release. Minor Releases -- PostgreSQL makes an effort to avoid server API and ABI breaks in minor releases. In general, an application compiled against any minor release will work with any other minor release, past or future. In the absence of automated detection of such changes, this is not a guarantee, but history such breaking changes have been extremely rare. When a change *is* required, PostgreSQL will choose the least invasive change possible, for example by squeezing a new field into padding space or appending it to the end of a struct. 
This sort of change should not impact dependent applications unless they use `sizeof(the struct)` on a struct with an appended field, or create their own instances of such structs --- patterns best avoided. In rare cases, however, even such non-invasive changes may be impractical or impossible. In such an event, the change will be documented in the Release Notes, and details on the issue will also be posted to [TBD; mail list? Blog post? News item?]. To minimize issues and catch changes early, the project strongly recommends that developers adopt continuous integration testing at least for the latest minor release of all major versions of Postgres they support. ``` Best, David
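To make the `sizeof` caveat in the draft above concrete, here is a minimal standalone C sketch (the struct and field names are invented for illustration; this is not PostgreSQL code) showing why appending a field to the end of a struct leaves existing member offsets alone but changes the struct's size:

``` c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical layout an extension was compiled against. */
typedef struct ExampleNodeV1
{
	int		type;
	double	cost;
} ExampleNodeV1;

/* The same struct after a minor release appends a field at the end. */
typedef struct ExampleNodeV2
{
	int		type;
	double	cost;
	int		extra_flag;		/* appended field */
} ExampleNodeV2;

int
main(void)
{
	/* Offsets of pre-existing members are unchanged, so field access still works. */
	printf("offsetof(cost): v1=%zu v2=%zu\n",
		   offsetof(ExampleNodeV1, cost), offsetof(ExampleNodeV2, cost));

	/* But the size differs, so code that allocates or copies instances using
	 * the old sizeof() under-allocates once the server uses the new layout. */
	printf("sizeof: v1=%zu v2=%zu\n",
		   sizeof(ExampleNodeV1), sizeof(ExampleNodeV2));
	return 0;
}
```

Code that only dereferences members through pointers handed to it by the server keeps working across such a change; code that stack-allocates, copies, or sizes arrays of the struct does not.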
Re: RFC: Additional Directory for Extensions
On Mon, Jun 24, 2024 at 3:37 PM David E. Wheeler wrote: > I guess the question then is what security controls are appropriate for this > feature, which after all tells the postmaster what directories to read files > from. It feels a little outside the scope of a regular user to even be aware > of the file system undergirding the service. But perhaps there’s a > non-superuser role for whom it is appropriate? As long as the GUC is superuser-only, I'm not sure what else there is to do here. The only question is whether there's some reason to disallow this even from the superuser, but I'm not quite seeing such a reason. > > On the patch itself, I find the documentation for this to be fairly > > hard to understand. I think it could benefit from an example. I'm > > confused about whether this is intended to let me search for > > extensions in /my/temp/root/usr/lib/postgresql/... by setting > > extension_directory=/my/temp/dir, or whether it's intended me to > > search both /usr/lib/postgresql as I normally would and also > > /some/other/place. > > I sketched them quickly, so agree they can be better. Reading the code, I now > see that it appears to be the former case. I’d like to advocate for the > latter. Sounds good. > > If the latter, I wonder why we don't handle shared > > libraries by setting dynamic_library_path and then just have an > > analogue of that for control files. > > The challenge is that it applies not just to shared object libraries and > control files, but also extension SQL files and any other SHAREDIR files an > extension might include. But also, I think it should support all the > pg_config installation targets that extensions might use, including: > > BINDIR > DOCDIR > HTMLDIR > PKGINCLUDEDIR > LOCALEDIR > MANDIR > > I can imagine an extension wanting or needing to use any and all of these. Are these really all relevant to backend code? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Direct SSL connection and ALPN loose ends
On 21/06/2024 02:32, Jacob Champion wrote: On Thu, Jun 20, 2024 at 4:13 PM Heikki Linnakangas wrote: By "negotiation" I mean the server's response to the startup packet. I.e. "supported"/"not supported"/"error". Ok, I'm still a little confused, probably a terminology issue. The server doesn't respond with "supported" or "not supported" to the startup packet, that happens earlier. I think you mean the SSLRequst / GSSRequest packet, which is sent *before* the startup packet? Yes, sorry. (I'm used to referring to those as startup packets too, ha.) Yeah I'm not sure what the right term would be. Hmm, right, GSS encryption was introduced in v12, and older versions respond with an error to a GSSRequest. We probably could make the same assumption for GSS as we did for TLS in a49fbaaf, i.e. that an error means that something's wrong with the server, rather than that it's just very old and doesn't support GSS. But the case for that is a lot weaker case than with TLS. There are still pre-v12 servers out there in the wild. Right. Since we default to gssencmode=prefer, if you have Kerberos creds in your environment, I think this could potentially break existing software that connects to v11 servers once you upgrade libpq. When you connect to a V11 server and attempt to perform GSSAPI authentication, it will respond with a V3 error that says: "unsupported frontend protocol 1234.5680: server supports 2.0 to 3.0". That was a surprise to me until I tested it just now. I thought that it would respond with a protocol V2 error, but it is not so. The backend sets FrontendProtocol to 1234.5680 before sending the error, and because it is >= 3, the error is sent with protocol version 3. Given that, I think it is a good thing to fail the connection completely on receiving a V2 error. Attached is a patch to fix the other issue, with falling back from SSL to plaintext. And some tests and comment fixes I spotted while at it. 0001: A small comment fix 0002: This is the main patch that fixes the SSL fallback issue 0003: This adds fault injection tests to exercise these early error codepaths. It is not ready to be merged, as it contains a hack to skip locking. See thread at https://www.postgresql.org/message-id/e1ffb822-054e-4006-ac06-50532767f75b%40iki.fi. 0004: More tests, for what happens if the server sends an error after responding "yes" to the SSLRequest or GSSRequest, but before performing the SSL/GSS handshake. Attached is also a little stand-alone perl program that listens on a socket, and when you connect to it, it immediately sends a V2 or V3 error, depending on the argument. That's useful for testing. It could be used as an alternative strategy to the injection points I used in the 0003-0004 patches, but for now I just used it for manual testing. -- Heikki Linnakangas Neon (https://neon.tech) From 4b988b1c74066fe8b5f98acb4ecd20150a47bc33 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Mon, 24 Jun 2024 20:41:41 +0300 Subject: [PATCH 1/4] Fix outdated comment after removal of direct SSL fallback The option to fall back from direct SSL to negotiated SSL or a plaintext connection was removed in commit fb5718f35f. 
--- src/interfaces/libpq/fe-connect.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 071b1b34aa1..e003279fb6c 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -3564,8 +3564,8 @@ keep_going: /* We will come back to here until there is if (pollres == PGRES_POLLING_FAILED) { /* - * Failed direct ssl connection, possibly try a new - * connection with postgres negotiation + * SSL handshake failed. We will retry with a plaintext + * connection, if permitted by sslmode. */ CONNECTION_FAILED(); } -- 2.39.2 From 7df581e902b76665de4c751ddbc1309143e6c0b8 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Mon, 24 Jun 2024 18:14:52 +0300 Subject: [PATCH 2/4] Fix fallback behavior when server sends an ERROR early at startup With sslmode=prefer, the desired behavior is to completely fail the connection attempt, *not* fall back to a plaintext connection, if the server responds to the SSLRequest with an error ('E') response instead of rejecting SSL with an 'N' response. This was broken in commit 05fd30c0e7. Discussion: https://www.postgresql.org/message-id/CAOYmi%2Bnwvu21mJ4DYKUa98HdfM_KZJi7B1MhyXtnsyOO-PB6Ww%40mail.gmail.com --- src/interfaces/libpq/fe-connect.c | 17 + 1 file changed, 17 insertions(+) diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index e003279fb6c..360d9a45476 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -3521,6 +3521,12 @@ keep_going: /* We will come back to here until there is * byte here. */ conn->status = CONNECTION_AWAIT
Re: POC, WIP: OR-clause support for indexes
Hi! Let me join the review process. I am no expert in execution plans, so I would not be of much help in doing even better optimization. But I can read the code as a person who is not familiar with this area, and help make it clear even to a person like me. So, I am reading v25-0001-Transform-OR-clauses-to-ANY-expression.patch that was posted some time ago, and especially the transform_or_to_any function. > @@ -38,7 +45,6 @@ > int from_collapse_limit; > int join_collapse_limit; > > - > /* > * deconstruct_jointree requires multiple passes over the join tree, > because we > * need to finish computing JoinDomains before we start distributing quals. I do not think that removing an empty line should be part of the patch. > + /* > + * If the const node's (right side of operator > expression) type > + * don't have “true” array type, then we cannnot do > the > + * transformation. We simply concatenate the expression > node. > + */ I guess using Unicode double quotes is not the best idea here... Now to the first part of `transform_or_to_any` signature.asc Description: This is a digitally signed message part.
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 3:23 PM Melanie Plageman wrote: > Are you more concerned about having a single horizon for pruning or > about having a horizon that does not move backwards after being > established at the beginning of vacuuming the relation? I'm not sure I understand. The most important thing here is fixing the bug. But if we have a choice of how to fix the bug, I'd prefer to do it by having the pruning code test one horizon that is always correct, rather than (as I think the patch does) having it test against two horizons as a way of covering possible discrepancies between those values. > Right now, in master, we do use a single horizon when determining what > is pruned -- that from GlobalVisState. OldestXmin is only used for > freezing and full page visibility determinations. Using a different > horizon for pruning by vacuum than freezing is what is causing the > error on master. Agreed. > I had always thought it was because the vacuuming backend's > GlobalVisState will get updated periodically throughout vacuum and so, > assuming the oldest running transaction changes, our horizon for > vacuum would change. But, in writing this repro, it is actually quite > hard to get GlobalVisState to update. Our backend's RecentXmin needs > to have changed. And there aren't very many places where we take a new > snapshot after starting to vacuum a relation. One of those is at the > end of index vacuuming, but that can only affect the pruning horizon > if we have to do multiple rounds of index vacuuming. Is that really > the case we are thinking of when we say we want the pruning horizon to > move forward during vacuum? I thought the idea was that the GlobalVisTest stuff would force a recalculation now and then, but maybe it doesn't actually do that? Suppose process A begins a transaction, acquires an XID, and then goes idle. Process B now begins a giant vacuum. At some point in the middle of the vacuum, A ends the transaction. Are you saying that B's GlobalVisTest never really notices that this has happened? -- Robert Haas EDB: http://www.enterprisedb.com
Re: RFC: Additional Directory for Extensions
On Jun 24, 2024, at 4:28 PM, Robert Haas wrote: > As long as the GUC is superuser-only, I'm not sure what else there is > to do here. The only question is whether there's some reason to > disallow this even from the superuser, but I'm not quite seeing such a > reason. I can switch it back from requiring a restart to allowing a superuser to set it. >> I sketched them quickly, so agree they can be better. Reading the code, I >> now see that it appears to be the former case. I’d like to advocate for the >> latter. > > Sounds good. Yeah, though then I have a harder time deciding how it should work. pg_config’s paths are absolute. With your first example, we just use them exactly as they are, but prefix them with the destination directory. So if it’s set to `/my/temp/root/`, then files go into /my/temp/root/$(pg_config --sharedir) /my/temp/root/$(pg_config --pkglibdir) /my/temp/root/$(pg_config --bindir) # etc. Which is exactly how RPM and Apt packages are built, but seems like an odd configuration for general use. >> BINDIR >> DOCDIR >> HTMLDIR >> PKGINCLUDEDIR >> LOCALEDIR >> MANDIR >> >> I can imagine an extension wanting or needing to use any and all of these. > > Are these really all relevant to backend code? Oh I think so. Especially BINDIR; lots of extensions ship with binary applications. And most ship with docs, too (PGXS puts items listed in DOCS into DOCDIR). Some might also produce man pages (for their binaries), HTML docs, and other stuff. Maybe an FTS extension would include locale files? I find it pretty easy to imagine use cases for all of them. So much so that I wrote an extension binary distribution RFC[1] and its POC[2] around them. Best, David [1]: https://github.com/orgs/pgxn/discussions/2 [2]: https://justatheory.com/2024/06/trunk-poc/
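For what it's worth, a tiny standalone C sketch (the directory names are made up; this is not the patch's code) of the DESTDIR-style composition described above, where the absolute paths reported by pg_config are simply prefixed with the destination directory:

``` c
#include <stdio.h>

/*
 * Toy illustration: compose a hypothetical destination directory with the
 * absolute paths reported by pg_config, the way DESTDIR-style packaging does.
 */
static void
show(const char *destdir, const char *pg_config_path)
{
	/* pg_config paths are absolute, so plain concatenation is enough; a
	 * trailing '/' on destdir would only produce a harmless "//". */
	printf("%s%s\n", destdir, pg_config_path);
}

int
main(void)
{
	show("/my/temp/root", "/usr/share/postgresql/17");		/* sharedir */
	show("/my/temp/root", "/usr/lib/postgresql/17/lib");	/* pkglibdir */
	show("/my/temp/root", "/usr/lib/postgresql/17/bin");	/* bindir */
	return 0;
}
```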
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman wrote: > One thing I don't understand is why it is okay to freeze the xmax of a > dead tuple just because it is from an aborted update. We don't do that with XID-based xmaxs. Though perhaps we should, since we'll already prune-away the successor tuple, and so might as well go one tiny step further and clear the xmax for the original tuple via freezing/setting it InvalidTransactionId. Instead we just leave the original tuple largely undisturbed, with its original xmax. We do something like that with Multi-based xmax fields, though not with the specific goal of cleaning up after aborts in mind (we can also remove lockers that are no longer running, regardless of where they are relative to OldestXmin, stuff like that). The actual goal with that is to enforce MultiXactCutoff, independently of whether or not their member XIDs are < FreezeLimit yet. > The only case in which we freeze dead tuples > with a non-multi xmax is if the xmax is from before OldestXmin and is > also not committed (so from an aborted update). Perhaps I misunderstand, but: we simply don't freeze DEAD (not RECENTLY_DEAD) tuples in the first place, because we don't have to (pruning removes them instead). It doesn't matter if they're DEAD due to being from aborted transactions or DEAD due to being deleted/updated by a transaction that committed (committed and < OldestXmin). The freezing related code paths in heapam.c don't particularly care whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin. Just as long as it's not fully DEAD (then it should have been pruned). -- Peter Geoghegan
Re: Partial aggregates pushdown
On Mon, 24 Jun 2024 at 15:03, fujii.y...@df.mitsubishielectric.co.jp wrote: > I see. I maybe got your proposal. > Refer to your proposal, for avg(int8), > I create a new native type like state_int8_avg > with the new typsend/typreceive functions > and use them to transmit the state value, right? Yes, that's roughly what I had in mind indeed. > That might seem to be a more fundamental solution > because I can avoid adding export/import functions of my proposal, > which are the new components of aggregate function. > I have never considered the solution. > I appreciate your proposal. Thank you :) > However, I still have the following two questions. > > 1. Not necessary components of new native types > Refer to pg_type.dat, typinput and typoutput are required. > I think that in your proposal they are not necessary, > so waste. I think that it is not acceptable. > How can I resolve the problem? I think requiring typinput/typoutput is a benefit personally, because that makes it possible to do PARTIAL_AGGREGATE pushdown to a different PG major version. Also it makes it easier to debug the partial aggregate values when using psql/pg_regress. So yes, it requires implementing both binary (send/receive) and text (input/output) serialization, but it also has some benefits. And in theory you might be able to skip implementing the binary serialization, and rely purely on the text serialization to send partial aggregates between servers. > 2. Many new native types > I think that, basically, each aggregate function does need a new native type. > For example, > avg(int8), avg(numeric), and var_pop(int4) has the same transtype, > PolyNumAggState. > You said that it is enough to add only one native type like state_poly_num_agg > for supporting them, right? Yes, correct. That's what I had in mind. > But the combine functions of them might have quite different expectation > on the data items of PolyNumAggState like > the range of N(means count) and the true/false of calcSumX2 > (the flag of calculating sum of squares). > The final functions of them have the similar expectation. > So, I think that, responded to your proposal, > each of them need a native data type > like state_int8_avg, state_numeric_avg, for safety. > > And, we do need a native type for an aggregate function > whose transtype is not internal and not pseudo. > For avg(int4), the current transtype is _int8. > However, I do need a validation check on the number of the array > And the positiveness of count(the first element of the array). > Responded to your proposal, > I do need a new native type like state_int4_avg. To help avoid confusion let me try to restate what I think you mean here: You're worried about someone passing in a bogus native type into the final/combine functions and then getting crashes and/or wrong results. With an internal type, people cannot do this because they cannot manually call the combinefunc/finalfunc because the argument type is internal. To solve this problem your suggestion is to make the type specific to the specific aggregate such that send/receive or input/output can validate the input as reasonable. But this would then mean that we have many native types (and also many deserialize/serialize functions). Assuming that's indeed what you meant, that's an interesting thought, I didn't consider that much indeed. My thinking was that we only need to implement send/receive & input/output functions for these types, and even though their meaning is very different we can serialize them in the same way. 
As you say though, something like that is already true for avg(int4) today. The way avg(int4) handles this issue is by doing some input validation for every call to its trans/final/combinefunc (see int4_avg_accum, int4_avg_combine, and int8_avg). It checks the length of the array there, but it doesn't check the positiveness of the count. I think that makes sense. IMHO these functions only need to protect against crashes (e.g. null pointer dereferences). But I don't think there is a good reason for them to protect the user against passing in weird data. These functions aren't really meant to be called manually in the first place anyway, so if the user does that and they pass in weird data then I'm fine with them getting a weird result back, even errors are fine (only crashes are not). So as long as our input/output & send/receive functions for state_poly_num_agg handle all the inconsistencies that could cause crashes later on (which I think is pretty simple to do for PolyNumAggState), then I don't think we need state_int8_avg, state_numeric_avg, etc. > > Not a full review but some initial notes: > Thank you. I don't have time today, so I'll answer after tomorrow. Sure, no rush.
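As a rough standalone sketch of that "crash-safety only" level of validation (the AvgState struct and its text format are invented here, not the patch's actual serialization), an input function can check just enough structure to avoid undefined behavior downstream, while letting semantically odd values such as a negative count through:

``` c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Invented stand-in for an avg-style transition state: {count, sum}. */
typedef struct AvgState
{
	int64_t	count;
	double	sum;
} AvgState;

/*
 * Parse a text representation of the form "(count,sum)".  Only malformed
 * input that could cause problems later is rejected; weird-but-harmless
 * values (e.g. count < 0) are accepted, per the reasoning above.
 */
static bool
avgstate_in(const char *str, AvgState *out)
{
	long long	count;
	double		sum;
	int			consumed = 0;

	if (sscanf(str, " ( %lld , %lf ) %n", &count, &sum, &consumed) != 2 ||
		str[consumed] != '\0')
		return false;			/* structurally malformed: reject */

	out->count = (int64_t) count;
	out->sum = sum;
	return true;
}

int
main(void)
{
	AvgState	st;

	printf("%d\n", avgstate_in("(3,42.5)", &st));	/* 1: well-formed */
	printf("%d\n", avgstate_in("(-1,0)", &st));		/* 1: weird but accepted */
	printf("%d\n", avgstate_in("bogus", &st));		/* 0: rejected */
	return 0;
}
```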
Re: [PATCH] Add ACL (Access Control List) acronym
On Mon, Jun 24, 2024, at 21:51, David G. Johnston wrote: > On Mon, Jun 24, 2024 at 12:46 PM Joel Jacobson wrote: >> On Mon, Jun 24, 2024, at 18:02, David G. Johnston wrote: >> >> > The page we link to uses "permissions" while we consistently use >> > "privileges" to describe the contents of the list. This seems like an >> > obvious synonym, but as the point of these is to formally define >> > things, pointing this equivalence is worth considering. >> >> I like this idea. How could this be implemented in the docs? Maybe a >> ... for ACL in acronyms.sgml? >> > > Add a second under the one holding the link? How about? + + The linked page uses "permissions" while we consistently use the synonym + "privileges", to describe the contents of the list. For avoidance of + doubt and clarity, these two terms are equivalent in the + PostgreSQL documentation. + /Joel v3-0001-Add-ACL-Access-Control-List-acronym.patch Description: Binary data
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:42 PM Peter Geoghegan wrote: > > On Mon, Jun 24, 2024 at 3:36 PM Melanie Plageman > wrote: > > One thing I don't understand is why it is okay to freeze the xmax of a > > dead tuple just because it is from an aborted update. > > We don't do that with XID-based xmaxs. Though perhaps we should, since > we'll already prune-away the successor tuple, and so might as well go > one tiny step further and clear the xmax for the original tuple via > freezing/setting it InvalidTransactionId. Instead we just leave the > original tuple largely undisturbed, with its original xmax. I thought that was the case too, but we call heap_prepare_freeze_tuple() on HEAPTUPLE_RECENTLY_DEAD tuples and then else if (TransactionIdIsNormal(xid)) { /* Raw xmax is normal XID */ freeze_xmax = TransactionIdPrecedes(xid, cutoffs->OldestXmin); } And then later we if (freeze_xmax) frz->xmax = InvalidTransactionId; and then when we execute freezing the tuple in heap_execute_freeze_tuple() HeapTupleHeaderSetXmax(tuple, frz->xmax); Which sets the xmax to InvalidTransactionId. Or am I missing something? > > The only case in which we freeze dead tuples > > with a non-multi xmax is if the xmax is from before OldestXmin and is > > also not committed (so from an aborted update). > > Perhaps I misunderstand, but: we simply don't freeze DEAD (not > RECENTLY_DEAD) tuples in the first place, because we don't have to > (pruning removes them instead). It doesn't matter if they're DEAD due > to being from aborted transactions or DEAD due to being > deleted/updated by a transaction that committed (committed and < > OldestXmin). Right, I'm talking about HEAPTUPLE_RECENTLY_DEAD tuples. HEAPTUPLE_DEAD tuples are pruned away. But we can't replace the xmax of a tuple that has been deleted or updated by a transaction that committed with InvalidTransactionId. And it seems like the code does that? Why even call heap_prepare_freeze_tuple() on HEAPTUPLE_RECENTLY_DEAD tuples? Is it mainly to handle MultiXact freezing? > The freezing related code paths in heapam.c don't particularly care > whether a tuple xmax is RECENTLY_DEAD or LIVE to HTSV + OldestXmin. > Just as long as it's not fully DEAD (then it should have been pruned). But it just seems like we shouldn't freeze RECENTLY_DEAD either. - Melanie
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:36 PM Robert Haas wrote: > I'm not sure I understand. The most important thing here is fixing the > bug. But if we have a choice of how to fix the bug, I'd prefer to do > it by having the pruning code test one horizon that is always correct, > rather than (as I think the patch does) having it test against two > horizons because as a way of covering possible discrepancies between > those values. Your characterizing of OldestXmin + vistest as two horizons seems pretty arbitrary to me. I know what you mean, of course, but it seems like a distinction without a difference. > I thought the idea was that the GlobalVisTest stuff would force a > recalculation now and then, but maybe it doesn't actually do that? It definitely can do that. Just not in a way that meaningfully increases the number of heap tuples that we can recognize as DEAD and remove. At least not currently. > Suppose process A begins a transaction, acquires an XID, and then goes > idle. Process B now begins a giant vacuum. At some point in the middle > of the vacuum, A ends the transaction. Are you saying that B's > GlobalVisTest never really notices that this has happened? That's my understanding, yes. That is, vistest is approximately the same thing as OldestXmin anyway. At least for now. -- Peter Geoghegan
Re: POC, WIP: OR-clause support for indexes
Hi! Let me join the review process. I am no expert in execution plans, so I would not be of much help in doing even better optimization. But I can read the code as a person who is not familiar with this area, and help make it clear even to a person like me. So, I am reading v25-0001-Transform-OR-clauses-to-ANY-expression.patch that was posted some time ago, and especially the transform_or_to_any function. > @@ -38,7 +45,6 @@ > int from_collapse_limit; > int join_collapse_limit; > > - > /* > * deconstruct_jointree requires multiple passes over the join tree, because we > * need to finish computing JoinDomains before we start distributing quals. I do not think that removing an empty line should be part of the patch. > + /* > + * If the const node's (right side of operator expression) type > + * don't have “true” array type, then we cannnot do the > + * transformation. We simply concatenate the expression node. > + */ I guess using Unicode double quotes is not the best idea here... Now to the first part of `transform_or_to_any`: > + List *entries = NIL; I guess the idea of entries should be explained from the start. What kind of entries are accumulated there... I see they are added there all around the code, but the purpose of that is not quite clear when you read it. In the first part of the `transform_or_to_any` function, you constantly repeat two lines, like a mantra: > + entries = lappend(entries, rinfo); > + continue; "If something is wrong -- do that mantra" From my perspective, if you have to repeat the same code again and again, then most probably you have some issues with the architecture of the code. If you repeat some code again and again, you need to try to rewrite the code in such a way that the part is repeated only once. In that case I would try to move most of the first loop of `transform_or_to_any` to a separate function (let's say its name is prepare_single_or) that will do all the checks of whether this OR is good for us; it would return NULL if it does not suit our purposes (and in this case we do "entries = lappend(entries, rinfo); continue" in the main code, but only once) or return a pointer to some useful data if this OR clause is good for our purposes. This, I guess, will make that part clearer and easier to read, without repeating the same "lappend mantra" again and again. I will continue digging into the code tomorrow. P.S. Sorry for sending a partly finished email. Pressed Ctrl+Enter accidentally... With no way to take it back :-((( signature.asc Description: This is a digitally signed message part.
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Mon, Jun 24, 2024 at 4:51 PM Peter Geoghegan wrote: > > On Mon, Jun 24, 2024 at 4:36 PM Robert Haas wrote: > > I thought the idea was that the GlobalVisTest stuff would force a > > recalculation now and then, but maybe it doesn't actually do that? > > It definitely can do that. Just not in a way that meaningfully > increases the number of heap tuples that we can recognize as DEAD and > remove. At least not currently. > > > Suppose process A begins a transaction, acquires an XID, and then goes > > idle. Process B now begins a giant vacuum. At some point in the middle > > of the vacuum, A ends the transaction. Are you saying that B's > > GlobalVisTest never really notices that this has happened? > > That's my understanding, yes. That is, vistest is approximately the > same thing as OldestXmin anyway. At least for now. Exactly. Something has to cause this backend to update its view of the horizon. At the end of index vacuuming, GetOldestNonRemovableTransactionId() will explicitly ComputeXidHorizons() which will update our backend's GlobalVisStates. Otherwise, if our backend's RecentXmin is updated, by taking a new snapshot, then we may update our GlobalVisStates. See GlobalVisTestShouldUpdate() for the conditions under which we would update our GlobalVisStates during the normal visibility checks happening during pruning. Vacuum used to open indexes after calculating horizons before starting its first pass. This led to a recomputation of the horizon. But, in master, there aren't many obvious places where such a thing would be happening. - Melanie
Re: [PATCH] Add ACL (Access Control List) acronym
On Mon, Jun 24, 2024 at 1:49 PM Joel Jacobson wrote: > How about? > > + > + The linked page uses "permissions" while we consistently use the > synonym > + "privileges", to describe the contents of the list. For avoidance of > + doubt and clarity, these two terms are equivalent in the > + PostgreSQL documentation. > + > > /Joel I really dislike "For avoidance of doubt and clarity" - and in terms of being equivalent the following seems like a more accurate description of reality. The PostgreSQL documentation, and code, refers to the specifications within the ACL as "privileges". This has the same meaning as "permissions" on the linked page. Generally if we say "permissions" we are referring to something that is not covered by the ACL. In routine communication the two words are often used interchangeably. David J.
Re: RFC: Additional Directory for Extensions
On Thu, 11 Apr 2024 at 19:52, David E. Wheeler wrote: > I realize this probably isn’t going to happen for 17, given the freeze, but I > would very much welcome feedback and pointers to address concerns about > providing a second directory for extensions and DSOs. Quite a few people have > talked about the need for this in the Extension Mini Summits[1], so I’m sure > I could get some collaborators to make improvements or look at a different > approach. Overall +1 for the idea. We're running into this same limitation (only a single place to put extension files) at Microsoft at the moment. +and to the '$libdir' directive when loading modules +that back functions. I feel like this is a bit strange. Either its impact is too wide, or it's not wide enough depending on your intent. If you want to only change $libdir during CREATE EXTENSION (or ALTER EXTENSION UPDATE), then why not just change it there? And really you'd only want to change it when creating an extension whose control file is coming from extension_destdir. However, I can also see a case for really always changing $libdir. Because some extensions in shared_preload_libraries might want to trigger loading other libraries that they ship with dynamically. And these libraries are probably also in extension_destdir.
Re: RFC: Additional Directory for Extensions
On Mon, 24 Jun 2024 at 18:11, Nathan Bossart wrote: > At first glance, the general idea seems reasonable to me. I'm wondering > whether there is a requirement for this directory to be prepended or if it > could be appended to the end. That way, the existing ones would take > priority, which might be desirable from a security standpoint. Citus does ship with some override library for pgoutput to make logical replication/CDC work correctly with sharded tables. Right now using this override library requires changing dynamic_library_path. It would be nice if that wasn't necessary. But this is obviously a small thing. And I definitely agree that there's a security angle to this as well, but honestly that seems rather small too. If an attacker can put shared libraries into the extension_destdir, I'm pretty sure you've lost already, no matter if extension_destdir is prepended or appended to the existing $libdir.
Re: RFC: Additional Directory for Extensions
On Jun 24, 2024, at 17:17, Jelte Fennema-Nio wrote: > If you want to only change $libdir during CREATE EXTENSION (or ALTER > EXTENSION UPDATE), then why not just change it there. And really you'd > only want to change it when creating an extension from which the > control file is coming from extension_destdir. IIUC, the postmaster needs to load an extension on first use in every session unless it’s in shared_preload_libraries. > However, I can also see a case for really always changing $libdir. > Because some extensions in shared_preload_libraries, might want to > trigger loading other libraries that they ship with dynamically. And > these libraries are probably also in extension_destdir. Right, it can be more than just the DSOs for the extension itself. Best, David
Re: RFC: Additional Directory for Extensions
On Mon, 24 Jun 2024 at 22:42, David E. Wheeler wrote: > >> BINDIR > >> DOCDIR > >> HTMLDIR > >> PKGINCLUDEDIR > >> LOCALEDIR > >> MANDIR > >> > >> I can imagine an extension wanting or needing to use any and all of these. > > > > Are these really all relevant to backend code? > > Oh I think so. Especially BINDIR; lots of extensions ship with binary > applications. And most ship with docs, too (PGXS puts items listed in DOCS > into DOCDIR). Some might also produce man pages (for their binaries), HTML > docs, and other stuff. Maybe an FTE extension would include locale files? > > I find it pretty easy to imagine use cases for all of them. So much so that I > wrote an extension binary distribution RFC[1] and its POC[2] around them. Definitely agreed on BINDIR needing to be supported. And while lots of extensions ship with docs, I expect this feature to mostly be used in production environments to make deploying extensions easier. And I'm not sure that many people care about deploying docs to production (honestly lots of people would probably want to strip them). Still, for the sake of completeness it might make sense to support this whole list in extension_destdir. (assuming it's easy to do)
[PATCH] Fix type redefinition build errors with macOS SDK 15.0
Prior to macOS SDK 15, there were include guards in $SDK_ROOT/usr/include/xlocale/_regex.h: #ifndef _REGEX_H_ #include <_regex.h> #endif // _REGEX_H_ #include <_xlocale.h> In macOS SDK 15.0, these include guards are gone: #include <_regex.h> #include <_xlocale.h> Because _REGEX_H_ was defined locally in PostgreSQL's version of src/include/regex/regex.h, these include guards prevented duplicate definitions from $SDK_ROOT/usr/include/_regex.h (not to be confused with $SDK_ROOT/usr/include/xlocale/_regex.h). To fix this build issue, define __REGEX_H_ to prevent macOS from including the header that contains redefinitions of the local regex structures. Discussion: https://www.postgresql.org/message-id/CAMBWrQ%3DF9SSPfsFtCv%3DJT51WGK2VcgLA%2BiiJJOmjN0zbbufOEA%40mail.gmail.com --- src/include/regex/regex.h | 14 ++ 1 file changed, 14 insertions(+) diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h index d08113724f..045ac626cc 100644 --- a/src/include/regex/regex.h +++ b/src/include/regex/regex.h @@ -32,6 +32,20 @@ * src/include/regex/regex.h */ +#if defined(__darwin__) +/* + * macOS SDK 15.0 removed the _REGEX_H_ include guards in + * $SDK_ROOT/usr/include/xlocale/_regex.h, so now + * $SDK_ROOT/usr/include/_regex.h is always included. That file defines + * the same types as below. To guard against type redefinition errors, + * define __REGEX_H_. + */ +#ifndef __REGEX_H_ +#define __REGEX_H_ + +#endif /* __REGEX_H_ */ +#endif + /* * Add your own defines, if needed, here. */ -- 2.45.0
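The mechanism the patch relies on can be shown with a self-contained toy C example (the THIRD_PARTY_EXAMPLE_H macro and the typedefs are invented; the real case involves Apple's _regex.h and its __REGEX_H_ guard): pre-defining a header's include guard makes the compiler skip the guarded body, so its conflicting definitions are never seen.

``` c
#include <stdio.h>

/* Pre-define the guard, the same way the patch pre-defines __REGEX_H_. */
#define THIRD_PARTY_EXAMPLE_H

/*
 * Stand-in for a hypothetical third_party_example.h whose body would
 * otherwise conflict with our own definition below.
 */
#ifndef THIRD_PARTY_EXAMPLE_H
#define THIRD_PARTY_EXAMPLE_H
typedef long example_off_t;		/* conflicting definition, now skipped */
#endif

/* Our own definition, analogous to the types in src/include/regex/regex.h. */
typedef long long example_off_t;

int
main(void)
{
	example_off_t off = 42;

	printf("no redefinition error, off = %lld\n", (long long) off);
	return 0;
}
```

Removing the initial #define reproduces the kind of type-redefinition error the SDK change causes.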