Unwanted file mode modification?
Hi, Commit 216a784829 changed the src/backend/replication/logical/worker.c file mode from 0644 to 0755, which is unwanted, right? [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=216a784829c2c5f03ab0c43e009126cbb819e9b2 -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: doc: add missing "id" attributes to extension packaging page
On 09.01.2023 at 03:38, vignesh C wrote:
> There are a couple of commitfest entries for this:
> https://commitfest.postgresql.org/41/4041/
> https://commitfest.postgresql.org/41/4042/
> Can one of them be closed?

I've split the initial patch into two parts upon Álvaro's request in [1] so that we can discuss them separately.

https://commitfest.postgresql.org/41/4041/ is tracking the patch you've been trying to apply and that I've just sent a rebased version for. It only adds (invisible) ids to the HTML documentation and can be closed once you've applied the patch.

https://commitfest.postgresql.org/41/4042/ is tracking a different patch that makes the ids and the corresponding links discoverable at the HTML surface. Hover over one of the psql options in [2] to see the behavior. This one still needs reviewing and there is no discussion around it yet.

Regards, Brar

[1] https://www.postgresql.org/message-id/20221206083809.3kaygnh2xswoxslj%40alvherre.pgsql
[2] https://pgdocs.piening.info/app-psql.html#APP-PSQL-OPTION-PORT
Re: Logical replication timeout problem
On Fri, Jan 6, 2023 at 12:35 PM Ashutosh Bapat wrote:

> +/*
> + * We don't want to try sending a keepalive message after processing each
> + * change as that can have overhead. Tests revealed that there is no
> + * noticeable overhead in doing it after continuously processing 100 or so
> + * changes.
> + */
> +#define CHANGES_THRESHOLD 100
>
> I think a time-based threshold makes more sense. What if the timeout was nearing and those 100 changes just took a little more time, causing a timeout? We already have a time-based threshold in WalSndKeepaliveIfNecessary(). And that function is invoked after reading every WAL record in WalSndLoop(). So it does not look like it's an expensive function. If it is expensive we might want to worry about WalSndLoop as well. Does it make more sense to remove this threshold?

We have previously tried this for every change [1] and it brings noticeable overhead. In fact, even doing it for every 10 changes also had some overhead, which is why we reached this threshold number. I don't think it can lead to a timeout due to skipping changes, but sure, if we see any such report we can further fine-tune this setting or try to make it time-based; for now I feel it would be safe to use this threshold.

> +/*
> + * After continuously processing CHANGES_THRESHOLD changes, we
> + * try to send a keepalive message if required.
> + */
> +if (++changes_count >= CHANGES_THRESHOLD)
> +{
> +ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
> +changes_count = 0;
> +}
>
> On the other thread, I mentioned that we don't have a TAP test for it. I agree with Amit's opinion there that it's hard to create a test which will timeout everywhere. I think what we need is a way to control the time required for decoding a transaction.
>
> A rough idea is to induce a small sleep after decoding every change.
> The amount of sleep * number of changes will help us estimate and control the amount of time taken to decode a transaction. Then we create a transaction which will take longer than the timeout threshold to decode. But that's significant code. I don't think PostgreSQL has a facility to induce a delay at a particular place in the code.

Yeah, I don't know how to induce such a delay while decoding changes. One more thing: I think it would be better to expose a new callback API via the reorder buffer, as suggested previously [2], similar to other reorder buffer APIs, instead of directly using the reorderbuffer API to invoke the plugin API.

[1] - https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/CAA4eK1%2BfQjndoBOFUn9Wy0hhm3MLyUWEpcT9O7iuCELktfdBiQ%40mail.gmail.com

-- With Regards, Amit Kapila.
Re: missing indexes in indexlist with partitioned tables
On Tue, 6 Dec 2022 at 13:43, Arne Roland wrote: > That being said, attached patch should fix the issue reported below. I took a look over the v10 patch and ended up making adjustments to the tests. I didn't quite see the need for the test to be as extensive as you had them in v10. Neither join removals nor unique joins treat partitioned tables any differently from normal tables, so I think it's fine just to have a single test that makes sure join removals work on partitioned tables. I didn't feel inclined to add a test for unique joins. The test I added is mainly just there to make sure something fails if someone decides partitioned tables don't need the indexlist populated at some point in the future. The other changes I made were just cosmetic. I pushed the result. David
Re: MultiXact\SLRU buffers configuration
On Tue, Jan 3, 2023 at 5:02 AM vignesh C wrote: > does not apply on top of HEAD as in [1], please post a rebased patch: > Thanks! Here's the rebase. Best regards, Andrey Borodin. v24-0001-Divide-SLRU-buffers-into-8-associative-banks.patch Description: Binary data
Re: Infinite Interval
On Sun, Jan 8, 2023 at 4:22 AM Joseph Koshakow wrote:
> On Sat, Jan 7, 2023 at 3:05 PM Joseph Koshakow wrote:
> > On Sat, Jan 7, 2023 at 3:04 PM Joseph Koshakow wrote:
> > > I think this patch is just about ready for review, except for the following two questions:
> > > 1. Should finite checks on intervals only look at months or all three fields?
> > > 2. Should we make the error messages for adding/subtracting infinite values more generic or leave them as is?
> > > My opinions are
> > > 1. We should only look at months.
> > > 2. We should make the errors more generic.
> > > Anyone else have any thoughts?
> Here's a patch with the more generic error messages.
> - Joe

Hi. I just found another problem.

select * from generate_series(timestamp '-infinity', timestamp 'infinity', interval 'infinity');
ERROR: timestamp out of range

select * from generate_series(timestamp '-infinity', timestamp 'infinity', interval '-infinity');
-- returns no rows: generate_series - (0 rows)

select * from generate_series(timestamp 'infinity', timestamp 'infinity', interval 'infinity');
-- will run all the time.

select * from generate_series(timestamp 'infinity', timestamp 'infinity', interval '-infinity');
ERROR: timestamp out of range

select * from generate_series(timestamp '-infinity', timestamp '-infinity', interval 'infinity');
ERROR: timestamp out of range

select * from generate_series(timestamp '-infinity', timestamp '-infinity', interval '-infinity');
-- will run all the time.

-- I recommend David Deutsch's <>
Jian
Re: Amcheck verification of GiST and GIN
On Sun, Jan 8, 2023 at 8:05 PM Andrey Borodin wrote: > > Please find the attached new version. In this patchset heapallindexed > flag is removed from GIN checks. > Uh... sorry, git-formatted wrong branch. Here's the correct version. Double checked. Best regards, Andrey Borodin. v19-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch Description: Binary data v19-0001-Refactor-amcheck-to-extract-common-locking-routi.patch Description: Binary data v19-0002-Add-gist_index_parent_check-function-to-verify-G.patch Description: Binary data
Re: Amcheck verification of GiST and GIN
Hi Jose, thank you for the review, and sorry for the long delay in answering.

On Wed, Dec 14, 2022 at 4:19 AM Jose Arthur Benetasso Villanova wrote:
> On Sun, 27 Nov 2022, Andrey Borodin wrote:
> > On Sun, Nov 27, 2022 at 1:29 PM Andrey Borodin wrote:
> > I was wrong. GIN check does a similar gin_refind_parent() to lock pages in a bottom-up manner and truly verify the downlink-child_page invariant.
> Does this mean that we need the adjustment in docs?

It seems to me that the gin_index_parent_check() docs are correct.

> > Here's v17. The only difference is that I added progress reporting to GiST verification. I still did not implement heapallindexed for GIN. Existence of pending lists makes this just too difficult for a weekend coding project :(
> > Thank you!
> > Best regards, Andrey Borodin.
> I'm a bit lost here. I tried your patch again and indeed the heapallindexed inside gin_check_parent_keys_consistency has a TODO comment, but it's unclear to me if you are going to implement it or if the patch "needs review". Right now it's "Waiting on Author".

Please find the attached new version. In this patchset the heapallindexed flag is removed from GIN checks.

Thank you!

Best regards, Andrey Borodin.

v18-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch Description: Binary data
v18-0002-Add-gist_index_parent_check-function-to-verify-G.patch Description: Binary data
v18-0001-Refactor-amcheck-to-extract-common-locking-routi.patch Description: Binary data
Strengthen pg_waldump's --save-fullpage tests
Hi, A recent commit [1] added the --save-fullpage option to pg_waldump to extract full page images (FPI) from WAL records and save them into files (one file per FPI) under a specified directory. While it added tests to check the LSN from the FPI file name and the FPI file contents, it missed further checks of the FPI contents, such as the tuples on the page. I'm attaching a patch that basically reads the FPI file contents (saved by pg_waldump) and the raw page from the table file (using the pageinspect extension) and compares the tuples from both of them. This test ensures that pg_waldump outputs the correct FPI. This idea is also discussed elsewhere [2]. Thoughts?

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d497093cbecccf6df26365e06a5f8f8614b591c8
[2] https://www.postgresql.org/message-id/calj2acxesn9dtjgsekm8fig7cxhhxqfqp4fcisjgcmp9wrz...@mail.gmail.com

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

From f714f13f47f98c7b9a4f88539c76862cacaffe49 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Sat, 7 Jan 2023 08:29:28 + Subject: [PATCH v1] Strengthen pg_waldump's --save-fullpage tests --- src/bin/pg_waldump/Makefile | 2 ++ src/bin/pg_waldump/t/002_save_fullpage.pl | 34 +-- 2 files changed, 34 insertions(+), 2 deletions(-) diff --git a/src/bin/pg_waldump/Makefile b/src/bin/pg_waldump/Makefile index d6459e17c7..924fb6ad20 100644 --- a/src/bin/pg_waldump/Makefile +++ b/src/bin/pg_waldump/Makefile @@ -3,6 +3,8 @@ PGFILEDESC = "pg_waldump - decode and display WAL" PGAPPICON=win32 +EXTRA_INSTALL = contrib/pageinspect + subdir = src/bin/pg_waldump top_builddir = ../../..
include $(top_builddir)/src/Makefile.global diff --git a/src/bin/pg_waldump/t/002_save_fullpage.pl b/src/bin/pg_waldump/t/002_save_fullpage.pl index 5072703a3d..193887a115 100644 --- a/src/bin/pg_waldump/t/002_save_fullpage.pl +++ b/src/bin/pg_waldump/t/002_save_fullpage.pl @@ -38,16 +38,24 @@ max_wal_senders = 4 }); $node->start; +$node->safe_psql('postgres', q(CREATE EXTENSION pageinspect)); + # Generate data/WAL to examine that will have full pages in them. $node->safe_psql( 'postgres', "SELECT 'init' FROM pg_create_physical_replication_slot('regress_pg_waldump_slot', true, false); -CREATE TABLE test_table AS SELECT generate_series(1,100) a; +CREATE TABLE test_table AS SELECT generate_series(1,2) a; -- Force FPWs on the next writes. CHECKPOINT; -UPDATE test_table SET a = a + 1; +UPDATE test_table SET a = a * 100 WHERE a = 1; "); +# Get raw page from the table. +my $page_from_table = $node->safe_psql( + 'postgres', + q(SELECT get_raw_page('test_table', 0)) +); + ($walfile_name, $blocksize) = split '\|' => $node->safe_psql('postgres', "SELECT pg_walfile_name(pg_switch_wal()), current_setting('block_size')"); @@ -108,4 +116,26 @@ for my $fullpath (glob "$tmp_folder/raw/*") ok($file_count > 0, 'verify that at least one block has been saved'); +# Check that the pg_waldump saves FPI file. +my @fpi_files = glob "$tmp_folder/raw/*_main"; +is(scalar(@fpi_files), 1, 'one FPI file was created'); + +# Read the content of the saved FPI file. +my $page_from_fpi_file = $node->safe_psql( + 'postgres', + qq{SELECT pg_read_binary_file('$fpi_files[0]')} +); + +# Compare the raw page from table and FPI file saved, they must contain same rows. 
+my $result = $node->safe_psql( + 'postgres', + qq{SELECT tuple_data_split('test_table'::regclass, t_data, t_infomask, t_infomask2, t_bits) + FROM heap_page_items('$page_from_table') + EXCEPT + SELECT tuple_data_split('test_table'::regclass, t_data, t_infomask, t_infomask2, t_bits) + FROM heap_page_items('$page_from_fpi_file')} +); + +is($result, '', 'verify that rows in raw page from table and FPI file saved are same'); + done_testing(); -- 2.34.1
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Sun, Jan 8, 2023 at 4:27 PM Peter Geoghegan wrote: > We're vulnerable to allowing "all-frozen but not all-visible" > inconsistencies because of two factors: this business with not passing > VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to > visibilitymap_set(), *and* the use of all_visible_according_to_vm to > set the VM (a local variable that can go stale). We sort of assume > that all_visible_according_to_vm cannot go stale here without our > detecting it. That's almost always the case, but it's not quite > guaranteed. On further reflection even v2 won't repair the page-level PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we might actually leave the all-frozen bit set in the VM, while both the all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset. Again, all due to the approach we take with all_visible_according_to_vm, which can go stale independently of both the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my example problem scenario). FWIW I don't have this remaining problem in my VACUUM freezing/scanning strategies patchset. It just gets rid of all_visible_according_to_vm altogether, which makes things a lot simpler at the point that we set VM bits at the end of lazy_scan_heap -- there is nothing left that can become stale. Quite a lot of the code is just removed; there is exactly one call to visibilitymap_set() at the end of lazy_scan_heap with the patchset, that does everything we need. The patchset also has logic for setting PD_ALL_VISIBLE when it needs to be set, which isn't (and shouldn't) be conditioned on whether we're doing a "all visible -> all frozen " transition or a "neither -> all visible" transition. What it actually needs to be conditioned on is whether it's unset now, and so needs to be set in passing, as part of setting one or both VM bits -- simple as that. -- Peter Geoghegan
Re: doc: add missing "id" attributes to extension packaging page
On Mon, 9 Jan 2023 at 08:01, vignesh C wrote:
> On Wed, 4 Jan 2023 at 04:13, Karl O. Pinc wrote:
> > On Tue, 3 Jan 2023 21:35:09 +0100 Brar Piening wrote:
> > > On 02.01.2023 at 21:53, Karl O. Pinc wrote:
> > > > If the author will look over my version of the patch I believe it can be approved and sent on to the committers.
> > > LGTM.
> > Approved for committer!
> The patch does not apply on top of HEAD as in [1], please post a rebased patch:
> === Applying patches on top of PostgreSQL commit ID cd9479af2af25d7fa9bfd24dd4dcf976b360f077 ===
> === applying patch ./add_html_ids_v2.patch
> patching file doc/src/sgml/ref/pgbench.sgml
> patching file doc/src/sgml/ref/psql-ref.sgml
> Hunk #73 FAILED at 1824.
> 2 out of 208 hunks FAILED -- saving rejects to file doc/src/sgml/ref/psql-ref.sgml.rej
> [1] - http://cfbot.cputube.org/patch_41_4041.log

There are a couple of commitfest entries for this:
https://commitfest.postgresql.org/41/4041/
https://commitfest.postgresql.org/41/4042/
Can one of them be closed?

Regards, Vignesh
Re: doc: add missing "id" attributes to extension packaging page
On Wed, 4 Jan 2023 at 04:13, Karl O. Pinc wrote:
> On Tue, 3 Jan 2023 21:35:09 +0100 Brar Piening wrote:
> > On 02.01.2023 at 21:53, Karl O. Pinc wrote:
> > > If the author will look over my version of the patch I believe it can be approved and sent on to the committers.
> > LGTM.
> Approved for committer!

The patch does not apply on top of HEAD as in [1]; please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID cd9479af2af25d7fa9bfd24dd4dcf976b360f077 ===
=== applying patch ./add_html_ids_v2.patch
patching file doc/src/sgml/ref/pgbench.sgml
patching file doc/src/sgml/ref/psql-ref.sgml
Hunk #73 FAILED at 1824.
2 out of 208 hunks FAILED -- saving rejects to file doc/src/sgml/ref/psql-ref.sgml.rej

[1] - http://cfbot.cputube.org/patch_41_4041.log

Regards, Vignesh
Re: [PATCH] random_normal function
I wrote: > Also, I tried running the new random.sql regression cases over > and over, and found that the "not all duplicates" test fails about > one time in 10 or so. We could probably tolerate that given > that the random test is marked "ignore" in parallel_schedule, but > I thought it best to add one more iteration so we could knock the > odds down. Hmm ... it occurred to me to try the same check on the existing random() tests (attached), and darn if they don't fail even more often, usually within 50K iterations. So maybe we should rethink that whole thing. regards, tom lane \timing on create table if not exists random_tbl (random bigint); do $$ begin for i in 1..100 loop TRUNCATE random_tbl; INSERT INTO RANDOM_TBL (random) SELECT count(*) AS random FROM onek WHERE random() < 1.0/10; -- select again, the count should be different INSERT INTO RANDOM_TBL (random) SELECT count(*) FROM onek WHERE random() < 1.0/10; -- select again, the count should be different INSERT INTO RANDOM_TBL (random) SELECT count(*) FROM onek WHERE random() < 1.0/10; -- select again, the count should be different INSERT INTO RANDOM_TBL (random) SELECT count(*) FROM onek WHERE random() < 1.0/10; -- now test that they are different counts if (select true FROM RANDOM_TBL GROUP BY random HAVING count(random) > 3) then raise notice 'duplicates at iteration %', i; exit; end if; -- approximately check expected distribution if (select true FROM RANDOM_TBL HAVING AVG(random) NOT BETWEEN 80 AND 120) then raise notice 'range at iteration %', i; exit; end if; end loop; end $$;
Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation
On Thu, Dec 29, 2022 at 7:01 PM Peter Geoghegan wrote: > Attached is v2, which is just to fix bitrot. Attached is v3. We no longer apply vacuum_failsafe_age when determining the cutoff for antiwraparound autovacuuming -- the new approach is a bit simpler. This is a fairly small change overall. Now any "table age driven" autovacuum will also be antiwraparound when its relfrozenxid/relminmxid attains an age that's either double the relevant setting (either autovacuum_freeze_max_age or effective_multixact_freeze_max_age), or 1 billion XIDs/MXIDs -- whichever is less. That makes it completely impossible to disable antiwraparound protections (the special antiwrap autocancellation behavior) for table-age-driven autovacuums once table age exceeds 1 billion XIDs/MXIDs. It's still possible to increase autovacuum_freeze_max_age to well over a billion, of course. It just won't be possible to do that while also avoiding the no-auto-cancellation behavior for those autovacuums that are triggered due to table age crossing the autovacuum_freeze_max_age/effective_multixact_freeze_max_age threshold. -- Peter Geoghegan From 6d78e62226608683d9882b4507d37442164267a1 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Fri, 25 Nov 2022 11:23:20 -0800 Subject: [PATCH v3] Add "table age" trigger concept to autovacuum. Teach autovacuum.c to launch "table age" autovacuums at the same point that it previously triggered antiwraparound autovacuums. Antiwraparound autovacuums are retained, but are only used as a true option of last resort, when regular autovacuum has presumably tried and failed to advance relfrozenxid (likely because the auto-cancel behavior kept cancelling regular autovacuums triggered based on table age). The special auto-cancellation behavior applied by antiwraparound autovacuums is known to cause problems in the field, so it makes sense to avoid it, at least until the point where it starts to look like a proportionate response. 
Besides, the risk of the system eventually triggering xidStopLimit because of cancellations is a lot lower than it was back when the current auto-cancellation behavior was added by commit acac68b2. For example, there was no visibility map, so restarting antiwraparound autovacuum meant that the next autovacuum would get very little benefit from the work performed by earlier cancelled autovacuums. Also add new instrumentation that lists a triggering condition in the server log whenever an autovacuum is logged. This reports "table age" as the triggering criteria when regular (not antiwraparound) autovacuum runs because we need to advance the table's age. In other cases it will report an autovacuum was launched due to the tuple insert thresholds or the dead tuple thresholds. Note that pg_stat_activity doesn't have any special instrumentation for regular autovacuums that happen to have been triggered based on table age (since it really is just another form of autovacuum, and follows exactly the same rules in vacuumlazy.c and in proc.c). 
Author: Peter Geoghegan Reviewed-By: Jeff Davis Discussion: https://postgr.es/m/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com --- src/include/commands/vacuum.h | 18 ++- src/include/storage/proc.h | 2 +- src/backend/access/heap/vacuumlazy.c| 14 ++ src/backend/access/heap/visibilitymap.c | 5 +- src/backend/access/transam/multixact.c | 4 +- src/backend/commands/vacuum.c | 18 ++- src/backend/postmaster/autovacuum.c | 207 +--- src/backend/storage/lmgr/proc.c | 4 +- 8 files changed, 197 insertions(+), 75 deletions(-) diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h index 689dbb770..b70f69fd9 100644 --- a/src/include/commands/vacuum.h +++ b/src/include/commands/vacuum.h @@ -191,6 +191,21 @@ typedef struct VacAttrStats #define VACOPT_SKIP_DATABASE_STATS 0x100 /* skip vac_update_datfrozenxid() */ #define VACOPT_ONLY_DATABASE_STATS 0x200 /* only vac_update_datfrozenxid() */ +/* + * Values used by autovacuum.c to tell vacuumlazy.c about the specific + * threshold type that triggered an autovacuum worker. + * + * AUTOVACUUM_NONE is used when VACUUM isn't running in an autovacuum worker. + */ +typedef enum AutoVacType +{ + AUTOVACUUM_NONE = 0, + AUTOVACUUM_TABLE_XID_AGE, + AUTOVACUUM_TABLE_MXID_AGE, + AUTOVACUUM_DEAD_TUPLES, + AUTOVACUUM_INSERTED_TUPLES, +} AutoVacType; + /* * Values used by index_cleanup and truncate params. * @@ -222,7 +237,8 @@ typedef struct VacuumParams * use default */ int multixact_freeze_table_age; /* multixact age at which to scan * whole table */ - bool is_wraparound; /* force a for-wraparound vacuum */ + bool is_wraparound; /* antiwraparound autovacuum? */ + AutoVacType trigger; /* Autovacuum trigger condition, if any */ int log_min_duration; /* minimum execution threshold in ms at * which autovacuum is logged, -1 to use * defaul
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Sun, Jan 8, 2023 at 3:53 PM Andres Freund wrote: > How? See the commit message for the scenario I have in mind, which involves a concurrent HOT update that aborts. We're vulnerable to allowing "all-frozen but not all-visible" inconsistencies because of two factors: this business with not passing VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to visibilitymap_set(), *and* the use of all_visible_according_to_vm to set the VM (a local variable that can go stale). We sort of assume that all_visible_according_to_vm cannot go stale here without our detecting it. That's almost always the case, but it's not quite guaranteed. > visibilitymap_set() just adds flags, it doesn't remove any already > existing bits: I know. The concrete scenario I have in mind is very subtle (if the problem was this obvious I'm sure somebody would have noticed it by now, since we do hit this visibilitymap_set() call site reasonably often). A concurrent HOT update will certainly clear all the bits for us, which is enough. > You have plenty of changes like this, which are afaict entirely unrelated to > the issue the commit is fixing, in here. It just makes it hard to review the > patch. I didn't think that it was that big of a deal to tweak the style of one or two details in and around lazy_vacuum_heap_rel() in passing, for consistency with lazy_scan_heap(), since the patch already needs to do some of that. I do take your point, though. -- Peter Geoghegan
Re: [PATCH] random_normal function
I wrote: > So the problem in this patch is that it's trying to include > utils/float.h in a src/common file, where we have not included > postgres.h. Question is, why did you do that? (Ah, for M_PI ... but our practice is just to duplicate that #define where needed outside the backend.) I spent some time reviewing this patch. I'm on board with inventing random_normal(): the definition seems solid and the use-case for it seems reasonably well established. I'm not necessarily against inventing similar functions for other distributions, but this patch is not required to do so. We can leave that discussion until somebody is motivated to submit a patch for one. On the other hand, I'm much less on board with inventing random_string(): we don't have any clear field demand for it and the appropriate definitional details are a lot less obvious (for example, whether it needs to be based on pg_strong_random() rather than the random() sequence). I think we should leave that out, and I have done so in the attached updated patch. I noted several errors in the submitted patch. It was creating the function as PARALLEL SAFE which is just wrong, and the whole business with checking PG_NARGS is useless because it will always be 2. (That's not how default arguments work.) The business with checking against DBL_EPSILON seems wrong too. All we need is to ensure that u1 > 0 so that log(u1) will not choke; per spec, log() is defined for any positive input. I see that that seems to have been modeled on the C++ code in the Wikipedia page, but I'm not sure that C++'s epsilon means the same thing, and if it does then their example code is just wrong. See the discussion about "tails truncation" immediately above it: artificially constraining the range of u1 just limits how much of the tail of the distribution we can reproduce. So that led me to doing it the same way as in the existing Box-Muller code in pgbench, which I then deleted per Fabien's advice. 
BTW, the pgbench code was using sin() not cos(), which I duplicated because using cos() causes the expected output of the pgbench tests to change. I'm not sure whether there was any hard reason to prefer one or the other, and we can certainly change the expected output if there's some reason to prefer cos(). I concur with not worrying about the Inf/NaN cases that Mark pointed out. It's not obvious that the results the proposed code produces are wrong, and it's even less obvious that anyone will ever care. Also, I tried running the new random.sql regression cases over and over, and found that the "not all duplicates" test fails about one time in 10 or so. We could probably tolerate that given that the random test is marked "ignore" in parallel_schedule, but I thought it best to add one more iteration so we could knock the odds down. Also I changed the test iterations so they weren't all invoking random_normal() in exactly the same way. This version seems committable to me, barring objections. regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 3bf8d021c3..b67dc26a35 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -1815,6 +1815,28 @@ repeat('Pg', 4) PgPgPgPg + + + + random_normal + + + random_normal ( + mean double precision + , stddev double precision ) + double precision + + +Returns a random value from the normal distribution with the given +parameters; mean defaults to 0.0 +and stddev defaults to 1.0 + + +random_normal(0.0, 1.0) +0.051285419 + + + @@ -1824,7 +1846,8 @@ repeat('Pg', 4) PgPgPgPg void -Sets the seed for subsequent random() calls; +Sets the seed for subsequent random() and +random_normal() calls; argument must be between -1.0 and 1.0, inclusive @@ -1848,6 +1871,7 @@ repeat('Pg', 4) PgPgPgPg Without any prior setseed() call in the same session, the first random() call obtains a seed from a platform-dependent source of random bits. + These remarks hold equally for random_normal(). 
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql index f2470708e9..83ca893444 100644 --- a/src/backend/catalog/system_functions.sql +++ b/src/backend/catalog/system_functions.sql @@ -66,6 +66,13 @@ CREATE OR REPLACE FUNCTION bit_length(text) IMMUTABLE PARALLEL SAFE STRICT COST 1 RETURN octet_length($1) * 8; +CREATE OR REPLACE FUNCTION + random_normal(mean float8 DEFAULT 0, stddev float8 DEFAULT 1) + RETURNS float8 + LANGUAGE internal + VOLATILE PARALLEL RESTRICTED STRICT COST 1 +AS 'drandom_normal'; + CREATE OR REPLACE FUNCTION log(numeric) RETURNS numeric LANGUAGE sql diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
Hi, On 2023-01-08 14:39:19 -0800, Peter Geoghegan wrote: > One of the calls to visibilitymap_set() during VACUUM's initial heap > pass could unset a page's all-visible bit during the process of setting > the same page's all-frozen bit. How? visibilitymap_set() just adds flags, it doesn't remove any already existing bits: map[mapByte] |= (flags << mapOffset); It'll afaict lead to potentially unnecessary WAL records though, which does seem buggy: if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS)) here we check for *equivalence*, but then below we just or-in flags. So visibilitymap_set() with just one of the flags bits set in the parameters, but both set in the page would end up WAL logging unnecessarily. > @@ -2388,8 +2398,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel) > static void > lazy_vacuum_heap_rel(LVRelState *vacrel) > { > - int index; > - BlockNumber vacuumed_pages; > + int index = 0; > + BlockNumber vacuumed_pages = 0; > Buffer vmbuffer = InvalidBuffer; > LVSavedErrInfo saved_err_info; > > @@ -2406,42 +2416,42 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) > > VACUUM_ERRCB_PHASE_VACUUM_HEAP, >InvalidBlockNumber, > InvalidOffsetNumber); > > - vacuumed_pages = 0; > - > - index = 0; > @@ -2473,12 +2484,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) > */ > static int > lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer, > - int index, Buffer *vmbuffer) > + int index, Buffer vmbuffer) > { > VacDeadItems *dead_items = vacrel->dead_items; > Pagepage = BufferGetPage(buffer); > OffsetNumber unused[MaxHeapTuplesPerPage]; > - int uncnt = 0; > + int nunused = 0; > TransactionId visibility_cutoff_xid; > boolall_frozen; > LVSavedErrInfo saved_err_info; > @@ -2508,10 +2519,10 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber > blkno, Buffer buffer, > > Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid)); > ItemIdSetUnused(itemid); > - unused[uncnt++] = toff; > + unused[nunused++] = toff; > } > > - Assert(uncnt > 0); > 
+ Assert(nunused > 0); > > /* Attempt to truncate line pointer array now */ > PageTruncateLinePointerArray(page); > @@ -2527,13 +2538,13 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber > blkno, Buffer buffer, > xl_heap_vacuum xlrec; > XLogRecPtr recptr; > > - xlrec.nunused = uncnt; > + xlrec.nunused = nunused; > > XLogBeginInsert(); > XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum); > > XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); > - XLogRegisterBufData(0, (char *) unused, uncnt * > sizeof(OffsetNumber)); > + XLogRegisterBufData(0, (char *) unused, nunused * > sizeof(OffsetNumber)); > > recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM); > You have plenty of changes like this, which are afaict entirely unrelated to the issue the commit is fixing, in here. It just makes it hard to review the patch. Greetings, Andres Freund
Re: Missing update of all_hasnulls in BRIN opclasses
Thanks Justin! I've applied all the fixes you proposed, and (hopefully) improved a couple other comments. I've been working on this over the past couple days, trying to polish and commit it over the weekend - both into master and backbranches. Sadly, the backpatching part turned out to be a bit more complicated than I expected, because of the BRIN reworks in PG14 (done by me, as foundation for the new opclasses, so ... well). Anyway, I got it done, but it's a bit uglier than I hoped for and I don't feel like pushing this on Sunday midnight. I think it's correct, but maybe another pass to polish it a bit more is better. So here are two patches - one for 11-13, the other for 14-master. There's also a separate patch with pageinspect tests, but only as a demonstration of various (non)broken cases, not for commit. And then also a bash script generating indexes with random data, randomized summarization etc. - on unpatched systems this happens to fail in about 1/3 of the runs (at least for me). I haven't seen any failures with the patches attached (on any branch). As for the issue / fix, I don't think there's a better solution than what the patch does - we need to distinguish empty / all-nulls ranges, but we can't add a flag because of on-disk format / ABI. So using the existing flags seems like the only option - I haven't heard any other ideas so far, and I couldn't come up with any myself either. I've also thought about alternative "encodings" into allnulls/hasnulls, instead of treating (true,true) as "empty" - but none of that ended up being any simpler, quite the opposite actually, as it would change what the individual flags mean etc. So AFAICS this is the best / least disruptive option. I went over all the places touching these flags, to double check if any of those needs some tweaks (similar to union_tuples, which I missed for a long time). But I haven't found anything else, so I think this version of the patches is complete. 
As for assessing how many indexes are affected - in principle, any index on columns with NULLs may be broken. But it only matters if the index is used for IS NULL queries; other queries are not affected. I also realized that this only affects insertion of individual tuples into existing all-null summaries, not "bulk" summarization that sees all values at once. This happens because in the bulk case add_values_to_range sets hasnulls=true for the first (NULL) value, and then calls the addValue procedure for the second (non-NULL) one, which resets the allnulls flag to false. When inserting individual rows, however, we first set hasnulls=true, but brin_form_tuple ignores that because of allnulls=true. And then when inserting the second row, we start with hasnulls=false again, and the opclass quietly resets the allnulls flag. I guess this further reduces the number of broken indexes, especially for data sets with small null_frac, or for append-only (or -mostly) tables where most of the summarization is bulk. I still feel a bit uneasy about this, but I think the patch is solid. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 033286cff9733c24fdc7c3f774d947c9f1072aa0 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Sun, 8 Jan 2023 23:42:41 +0100 Subject: [PATCH] extra tests --- contrib/pageinspect/Makefile| 2 +- contrib/pageinspect/expected/brin-fails.out | 152 contrib/pageinspect/sql/brin-fails.sql | 85 +++ 3 files changed, 238 insertions(+), 1 deletion(-) create mode 100644 contrib/pageinspect/expected/brin-fails.out create mode 100644 contrib/pageinspect/sql/brin-fails.sql diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile index 95e030b396..69a28a6d3d 100644 --- a/contrib/pageinspect/Makefile +++ b/contrib/pageinspect/Makefile @@ -22,7 +22,7 @@ DATA = pageinspect--1.11--1.12.sql pageinspect--1.10--1.11.sql \ pageinspect--1.0--1.1.sql PGFILEDESC = "pageinspect - functions to inspect contents of database pages" -REGRESS = page btree brin gin gist hash checksum oldextversions +REGRESS = page btree brin gin gist hash checksum oldextversions brin-fails ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/pageinspect/expected/brin-fails.out b/contrib/pageinspect/expected/brin-fails.out new file mode 100644 index 00..a12c761fc8 --- /dev/null +++ b/contrib/pageinspect/expected/brin-fails.out @@ -0,0 +1,152 @@ +create table t (a int); +create extension pageinspect; +-- works +drop index if exists t_a_idx; +NOTICE: index "t_a_idx" does not exist, skipping +truncate t; +insert into t values (null), (1); +create index on t using brin (a); +select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass); + itemoffset | blknum | attnum | allnulls | hasnulls | placeholder | value ++++--+--+-+-- + 1 | 0 | 1 | f| t| f
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
On Mon, Jan 2, 2023 at 10:31 AM Peter Geoghegan wrote: > Would be helpful if I could get a +1 on > v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is > somewhat more substantial than the others. There has been no response on this thread for over a full week at this point. I'm CC'ing Robert now, since the bug is from his commit a892234f83. Attached revision of the "don't unset all-visible bit while unsetting all-frozen bit" patch adds some assertions that verify that visibility_cutoff_xid is InvalidTransactionId as expected when we go to set any page all-frozen in the VM. It also broadens an existing nearby test for corruption, which gives us some chance of detecting and repairing corruption of this sort that might have slipped in in the field. My current plan is to commit something like this in another week or so, barring any objections. -- Peter Geoghegan v2-0001-Don-t-accidentally-unset-all-visible-bit-in-VM.patch Description: Binary data
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Mon, 9 Jan 2023 at 06:17, Ankit Kumar Pandey wrote:
> I have addressed all points 1-6 in the attached patch.

A few more things:

1. You're still using the 'i' variable in the foreach loop. foreach_current_index() will work.

2. I think the "index" variable needs a better name. sort_pushdown_idx maybe.

3. I don't think you need to set "index" on every loop. Why not just set it to foreach_current_index(l) - 1; then break;

4. You're still setting orderby_pathkeys in the foreach loop. That's already been set above and it won't have changed.

5. I don't think there's any need to check pathkeys_contained_in() in the foreach loop anymore. With #3 the index will be -1 if the optimisation cannot apply. You could likely also get rid of try_sort_pushdown too and just make the condition "if (sort_pushdown_idx == foreach_current_index(l))".

I'm a little unsure why there's still the is_sorted check there. Shouldn't that always be false now that you're looping until the pathkeys don't match in the foreach_reverse loop? Correct me if I'm wrong as I've not tested, but I think the new code in the foreach loop can just become:

if (sort_pushdown_idx == foreach_current_index(l))
{
    Assert(!is_sorted);
    window_pathkeys = orderby_pathkeys;
    is_sorted = pathkeys_count_contained_in(window_pathkeys,
                                            path->pathkeys,
                                            &presorted_keys);
}

> I have one doubt regarding runCondition: do we only need to ensure
> that a window function whose sort clause is a subset of the main
> query's should not have a runCondition, or should none of the window
> functions contain a runCondition? I have gone with the latter case but
> wanted to clarify.

Actually, maybe it's ok just to check the top-level WindowClause for runConditions. It's only that one that'll filter rows. That probably simplifies the code quite a bit. Lower-level runConditions only serve to halt the evaluation of WindowFuncs when the runCondition is no longer met.

>> Also, do you want to have a go at coding up the sort bound pushdown
>> too? It'll require removing the limitCount restriction and adjusting
>> ExecSetTupleBound() to recurse through a WindowAggState. I think it's
>> pretty easy. You could try it then play around with it to make sure it
>> works and we get the expected performance.
>
> I tried this in the patch but kept getting `retrieved too many tuples
> in a bounded sort`.
>
> Added the following code in ExecSetTupleBound, which correctly found
> the sortstate and set the bound value:
>
> else if (IsA(child_node, WindowAggState))
> {
>     WindowAggState *winstate = (WindowAggState *) child_node;
>
>     if (outerPlanState(winstate))
>         ExecSetTupleBound(tuples_needed, outerPlanState(winstate));
> }
>
> I think the problem is that we are not using a limit clause inside the
> window function (which may need to scan all tuples), so passing the
> bound value to WindowAggState->sortstate is not working as we might
> expect. Or maybe I am getting it wrong? I was trying to have a top-N
> sort for the limit clause if the ORDER BY pushdown happens.

hmm, perhaps the Limit would have to be put between the WindowAgg and Sort for it to work. Maybe that's more complexity than it's worth.

David
Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Hi, Mason!

Thank you very much for your review.

On Sun, Jan 8, 2023 at 9:33 PM Mason Sharp wrote:
> On Fri, Jan 6, 2023 at 4:46 AM Pavel Borisov wrote:
>> Besides, the new version has only some minor changes in the comments
>> and the commit message.
>
> It looks good, and the greater the concurrency the greater the benefit
> will be. Just a few minor suggestions regarding comments.
>
> "ExecDeleteAct() have already locked the old tuple for us", change
> "have" to "has".
>
> The comments in heapam_tuple_delete() and heapam_tuple_update() might
> be a little clearer with something like:
>
> "If the tuple has been concurrently updated, get lock already so that
> on retry it will succeed, provided that the caller asked to do this by
> providing a lockedSlot."

Thank you. These changes are incorporated into v6 of the patch.

> Also, not too important, but perhaps better clarify in the commit
> message that the repeated work is driven by ExecUpdate and ExecDelete
> and can happen multiple times depending on the concurrency.

Hmm... It can't happen an arbitrary number of times. If the tuple was concurrently updated, then we lock it. Once we lock it, nobody can change it until we finish our work. So, I think no changes are needed.

I'm going to push this if there are no objections.

--
Regards,
Alexander Korotkov

0001-Allow-locking-updated-tuples-in-tuple_update-and--v6.patch
Description: Binary data
Re: on placeholder entries in view rule action query's range table
Amit Langote writes: > I've attached just the patch that we should move forward with, as > Alvaro might agree. I've looked at this briefly but don't like it very much, specifically the business about retroactively adding an RTE and RTEPermissionInfo into the view's replacement subquery. That seems expensive and bug-prone: if you're going to do that you might as well just leave the OLD entry in place, as the earlier patch did, because you're just reconstructing that same state of the parsetree a bit later on. Furthermore, if that's where we end up then I'm not really sure this is worth doing at all. The idea driving this was that if we could get rid of *both* OLD and NEW RTE entries then we'd not have O(N^2) behavior in deep subquery pull-ups due to the rangetable getting longer with each one. But getting it down from two extra entries to one extra entry isn't going to fix that big-O problem. (The patch as presented still has O(N^2) planning time for the nested-views example discussed earlier.) Conceivably we could make it work by allowing RTE_SUBQUERY RTEs to carry a relation OID and associated RTEPermissionInfo, so that when a view's RTE_RELATION RTE is transmuted to an RTE_SUBQUERY RTE it still carries the info needed to let us lock and permission-check the view. That might be a bridge too far from the ugliness perspective ... although certainly the business with OLD and/or NEW RTEs isn't very pretty either. Anyway, if you don't feel like tackling that then I'd go back to the patch version that kept the OLD RTE. (Maybe we could rename that to something else, though, such as "*VIEW*"?) BTW, I don't entirely understand why this patch is passing regression tests, because it's failed to deal with numerous places that have hard-wired knowledge about these extra RTEs. Look for references to PRS2_OLD_VARNO and PRS2_NEW_VARNO. 
I think the original rationale for UpdateRangeTableOfViewParse was that we needed to keep the rtables of ON SELECT rules looking similar to those of other types of rules. Maybe we've cleaned up all the places that used to depend on that, but I'm not convinced. regards, tom lane
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Mon, 9 Jan 2023 at 05:06, Vik Fearing wrote:
> +EXPLAIN (COSTS OFF)
> +SELECT empno,
> +       depname,
> +       min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
> +       sum(salary) OVER (PARTITION BY depname) depsalary,
> +       count(*) OVER (ORDER BY enroll_date DESC) c
> +FROM empsalary
> +ORDER BY depname, empno, enroll_date;
> +                         QUERY PLAN
> +--------------------------------------------------
> + WindowAgg
> +   ->  WindowAgg
> +         ->  Sort
> +               Sort Key: depname, empno, enroll_date
> +               ->  WindowAgg
> +                     ->  Sort
> +                           Sort Key: enroll_date DESC
> +                           ->  Seq Scan on empsalary
>
> Why aren't min() and sum() calculated on the same WindowAgg run?

We'd need to have an ORDER BY per WindowFunc rather than per WindowClause to do that. The problem is that when there is no ORDER BY, all rows are peers. There are likely a bunch more optimisations we could do in that area. I think all the built-in window functions (not aggregates being used as window functions) don't care about peer rows, so it may be possible to merge the WindowClauses when the WindowClause being merged only has window functions that don't care about peer rows. Not for this patch though.

David
[PATCH] basebackup: support zstd long distance matching
In-Reply-To: <20220327205020.gm28...@telsasoft.com> On Sun, Mar 27, 2022 at 03:50:20PM -0500, Justin Pryzby wrote: > Here's a patch for zstd --long mode. Rebased. I'll add this to the CF. >From 8385f278e95d7bce7e5eed6bb8fccc1e1db9483d Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sun, 27 Mar 2022 11:55:01 -0500 Subject: [PATCH] basebackup: support zstd long distance matching First proposed here: 20220327205020.gm28...@telsasoft.com --- doc/src/sgml/protocol.sgml| 10 +++- doc/src/sgml/ref/pg_basebackup.sgml | 4 +- src/backend/backup/basebackup_zstd.c | 12 src/bin/pg_basebackup/bbstreamer_zstd.c | 13 + src/bin/pg_basebackup/t/010_pg_basebackup.pl | 9 ++- src/bin/pg_verifybackup/t/008_untar.pl| 8 +++ src/bin/pg_verifybackup/t/010_client_untar.pl | 8 +++ src/common/compression.c | 57 ++- src/include/common/compression.h | 2 + 9 files changed, 118 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index 03312e07e25..e07d050517e 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -2747,7 +2747,8 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;" level. Otherwise, it should be a comma-separated list of items, each of the form keyword or keyword=value. Currently, the supported - keywords are level and workers. + keywords are level, long and + workers. @@ -2764,6 +2765,13 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;" 3). + + The long keyword enables long-distance matching + mode, for improved compression ratio, at the expense of higher memory + use. Long-distance mode is supported only for + zstd. + + The workers keyword sets the number of threads that should be used for parallel compression. 
Parallel compression diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index db3ad9cd5eb..79d3e657c32 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -424,8 +424,8 @@ PostgreSQL documentation level. Otherwise, it should be a comma-separated list of items, each of the form keyword or keyword=value. -Currently, the supported keywords are level -and workers. +Currently, the supported keywords are level, +long, and workers. The detail string cannot be used when the compression method is specified as a plain integer. diff --git a/src/backend/backup/basebackup_zstd.c b/src/backend/backup/basebackup_zstd.c index ac6cac178a0..1bb5820c884 100644 --- a/src/backend/backup/basebackup_zstd.c +++ b/src/backend/backup/basebackup_zstd.c @@ -118,6 +118,18 @@ bbsink_zstd_begin_backup(bbsink *sink) compress->workers, ZSTD_getErrorName(ret))); } + if ((compress->options & PG_COMPRESSION_OPTION_LONG_DISTANCE) != 0) + { + ret = ZSTD_CCtx_setParameter(mysink->cctx, + ZSTD_c_enableLongDistanceMatching, + compress->long_distance); + if (ZSTD_isError(ret)) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("could not set compression flag for %s: %s", + "long", ZSTD_getErrorName(ret))); + } + /* * We need our own buffer, because we're going to pass different data to * the next sink than what gets passed to us. 
diff --git a/src/bin/pg_basebackup/bbstreamer_zstd.c b/src/bin/pg_basebackup/bbstreamer_zstd.c index fe17d6df4ef..fba391e2a0f 100644 --- a/src/bin/pg_basebackup/bbstreamer_zstd.c +++ b/src/bin/pg_basebackup/bbstreamer_zstd.c @@ -106,6 +106,19 @@ bbstreamer_zstd_compressor_new(bbstreamer *next, pg_compress_specification *comp compress->workers, ZSTD_getErrorName(ret)); } + if ((compress->options & PG_COMPRESSION_OPTION_LONG_DISTANCE) != 0) + { + ret = ZSTD_CCtx_setParameter(streamer->cctx, + ZSTD_c_enableLongDistanceMatching, + compress->long_distance); + if (ZSTD_isError(ret)) + { + pg_log_error("could not set compression flag for %s: %s", + "long", ZSTD_getErrorName(ret)); + exit(1); + } + } + /* Initialize the ZSTD output buffer. */ streamer->zstd_outBuf.dst = streamer->base.bbs_buffer.data; streamer->zstd_outBuf.size = streamer->base.bbs_buffer.maxlen; diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index b60cb78a0d5..4d130a7f944 100644 --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl @@ -139,7 +139,14 @@ SKIP: 'gzip:workers=3', 'invalid compression specification: compression algorithm "gzip" does not accept a worker count', 'failure on worker count for gzip' - ],); + ],
Re: drop postmaster symlink
On Sat, 7 Jan 2023 22:29:35 -0600 "Karl O. Pinc" wrote: > The only way I could think of to review a patch > that removes something is to report all the places > I looked where a reference to the symlink might be. I forgot to report that I also tried a `make install` and 'make uninstall`, with no problems. FWIW, I suspect the include/backend/postmaster/ directory would today be named include/backend/postgres/. Now would be the time to change the name, but I don't see it being worth the work. And while such a change wouldn't break pg, that kind of change would break something for somebody. Regards, Karl Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: MERGE ... RETURNING
On Sun, 8 Jan 2023 at 07:28, Dean Rasheed wrote: So playing around with it (and inspired by the WITH ORDINALITY syntax > for SRFs), I had the idea of allowing "WITH WHEN CLAUSE" at the end of > the returning list, which adds an integer column to the list, whose > value is set to the index of the when clause executed, as in the > attached very rough patch. > Would it be useful to have just the action? Perhaps "WITH ACTION"? My idea is that this would return an enum of INSERT, UPDATE, DELETE (so is "action" the right word?). It seems to me in many situations I would be more likely to care about which of these 3 happened rather than the exact clause that applied. This isn't necessarily meant to be instead of your suggestion because I can imagine wanting to know the exact clause, just an alternative that might suffice in many situations. Using it would also avoid problems arising from editing the query in a way which changes the numbers of the clauses. So, quoting an example from the tests, this allows things like: > > WITH t AS ( > MERGE INTO sq_target t USING v ON tid = sid > WHEN MATCHED AND tid > 2 THEN UPDATE SET balance = t.balance + delta > WHEN NOT MATCHED THEN INSERT (balance, tid) VALUES (balance + delta, > sid) > WHEN MATCHED AND tid < 2 THEN DELETE > RETURNING t.* WITH WHEN CLAUSE > ) > SELECT CASE when_clause > WHEN 1 THEN 'UPDATE' > WHEN 2 THEN 'INSERT' > WHEN 3 THEN 'DELETE' >END, * > FROM t; > > case | tid | balance | when_clause > +-+-+- > INSERT | -1 | -11 | 2 > DELETE | 1 | 100 | 3 > (2 rows) > > 1 row is returned for each merge action executed (other than DO > NOTHING actions), and as usual, the values represent old target values > for DELETE actions, and new target values for INSERT/UPDATE actions. > Would it be feasible to allow specifying old.column or new.column? These would always be NULL for INSERT and DELETE respectively but more useful with UPDATE. Actually I've been meaning to ask this question about UPDATE … RETURNING. 
Actually, even with DELETE/INSERT, I can imagine wanting, for example, to get only the new values associated with INSERT or UPDATE and not the values removed by a DELETE. So I can imagine specifying new.column to get NULLs for any row that was deleted but still get the new values for other rows. It's also possible to return the source values, and a bare "*" in the > returning list expands to all the source columns, followed by all the > target columns. > Does this lead to a problem in the event there are same-named columns between source and target? The name of the added column, if included, can be changed by > specifying "WITH WHEN CLAUSE [AS] col_alias". I chose the syntax "WHEN > CLAUSE" and "when_clause" as the default column name because those > match the existing terminology used in the docs. >
Re: [RFC] Add jit deform_counter
Hi

> I'm not sure why, but pgss jit metrics are always nulls for explain
> analyze queries. I have noticed this with surprise myself, when recently
> was reviewing the lazy jit patch, but haven't yet figured out what is the
> reason. Anyway, without "explain analyze" you'll get correct deforming
> numbers in pgss.

It is working, although I am not sure if it is correct. When I run EXPLAIN ANALYZE for the query `explain analyze select count(length(prosrc) > 0) from pg_proc;` I got this plan and times:

 QUERY PLAN
 Aggregate (cost=154.10..154.11 rows=1 width=8) (actual time=134.450..134.451 rows=1 loops=1)
   ->  Seq Scan on pg_proc (cost=0.00..129.63 rows=3263 width=16) (actual time=0.013..0.287 rows=3266 loops=1)
 Planning Time: 0.088 ms
 JIT:
   Functions: 3
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.631 ms, Deforming 0.396 ms, Inlining 10.026 ms, Optimization 78.608 ms, Emission 44.915 ms, Total 134.181 ms
 Execution Time: 135.173 ms
(8 rows)

Deforming is 0.396 ms. But when I run the mentioned query and then look at the pg_stat_statements table, I see different times - deforming is about 10 ms:

wal_bytes              │ 0
jit_functions          │ 9
jit_generation_time    │ 1.90404099
jit_deform_count       │ 3
jit_deform_time        │ 36.395131
jit_inlining_count     │ 3
jit_inlining_time      │ 256.104205
jit_optimization_count │ 3
jit_optimization_time  │ 132.4536130002
jit_emission_count     │ 3
jit_emission_time      │ 1.210633

The counts are correct, but the times are strange - they are not consistent with the values from EXPLAIN. When I run this query on master, the values are correct:

jit_functions          │ 6
jit_generation_time    │ 1.350521
jit_inlining_count     │ 2
jit_inlining_time      │ 24.0183823
jit_optimization_count │ 2
jit_optimization_time  │ 173.405792
jit_emission_count     │ 2
jit_emission_time      │ 91.226655

 JIT:
   Functions: 3
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.636 ms, Inlining 9.309 ms, Optimization 89.653 ms, Emission 45.812 ms, Total 145.410 ms
 Execution Time: 146.410 ms

Regards

Pavel
Re: Fix gin index cost estimation
On Sun, Jan 8, 2023 at 7:08 PM Tom Lane wrote:
> Alexander Korotkov writes:
> > I'm going to push this and backpatch to all supported versions if no
> > objections.
>
> Push yes, but I'd counsel against back-patching. People don't
> generally like unexpected plan changes in stable versions, and
> that's what a costing change could produce. There's no argument
> that we are fixing a failure or wrong answer here, so it doesn't
> seem like back-patch material.

Pushed to master, thank you. I've noted the reason for not back-patching in the commit message.

--
Regards,
Alexander Korotkov
Re: Add LZ4 compression in pg_dump
On Thu, Dec 22, 2022 at 11:08:59AM -0600, Justin Pryzby wrote: > There's a couple of lz4 bits which shouldn't be present in 002: file > extension and comments. There were "LZ4" comments and file extension stuff in the preparatory commit. But now it seems like you *removed* them in the LZ4 commit (where it actually belongs) rather than *moving* it from the prior/parent commit *to* the lz4 commit. I recommend to run something like "git diff @{1}" whenever doing this kind of patch surgery. + if (AH->compression_spec.algorithm != PG_COMPRESSION_NONE && + AH->compression_spec.algorithm == PG_COMPRESSION_GZIP && This looks wrong/redundant. The gzip part should be removed, right ? Maybe other places that check if (compression==PG_COMPRESSION_GZIP) should maybe change to say compression!=NONE? _PrepParallelRestore() references ".gz", so I think it needs to be retrofitted to handle .lz4. Ideally, that's built into a struct or list of file extensions to try. Maybe compression.h should have a function to return the file extension of a given algorithm. I'm planning to send a patch for zstd, and hoping its changes will be minimized by these preparatory commits. + errno = errno ? : ENOSPC; "?:" is a GNU extension (not the ternary operator, but the ternary operator with only 2 args). It's not in use anywhere else in postgres. You could instead write it with 3 "errno"s or as "if (errno==0): errno=ENOSPC" You wrote "eol_flag == false" and "eol_flag == 0" and true. But it's cleaner to test it as a boolean: if (eol_flag) / if (!eol_flag). Both LZ4File_init() and its callers check "inited". Better to do it in one place than 3. It's a static function, so I think there's no performance concern. Gzip_close() still has a useless save_errno (or rebase issue?). I think it's confusing to have two functions, one named InitCompressLZ4() and InitCompressorLZ4(). pg_compress_specification is being passed by value, but I think it should be passed as a pointer, as is done everywhere else. 
pg_compress_algorithm is being written directly into the pg_dump header. Currently, I think that's not an externally-visible value (it could be renumbered, theoretically even in a minor release). Maybe there should be a "private" enum for encoding the pg_dump header, similar to WAL_COMPRESSION_LZ4 vs BKPIMAGE_COMPRESS_LZ4? Or else a comment there should warn that the values are encoded in pg_dump, and must never be changed.

+ Verify that data files where compressed

typo: s/where/were/

Also:
s/occurance/occurrence/
s/begining/beginning/
s/Verfiy/Verify/
s/nessary/necessary/

BTW I noticed that cfdopen() was accidentally committed to compress_io.h in master without being defined anywhere.

--
Justin
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 1/8/23 18:05, Ankit Kumar Pandey wrote:
> On 08/01/23 21:36, Vik Fearing wrote:
>> On 1/8/23 11:21, Ankit Kumar Pandey wrote:
>>> Please find attached patch with addressed issues mentioned before.
>>
>> I am curious about this plan:
>>
>> +-- ORDER BY's in multiple Window functions can be combined into one
>> +-- if they are subset of QUERY's ORDER BY
>> +EXPLAIN (COSTS OFF)
>> +SELECT empno,
>> +       depname,
>> +       min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
>> +       sum(salary) OVER (PARTITION BY depname) depsalary,
>> +       count(*) OVER (ORDER BY enroll_date DESC) c
>> +FROM empsalary
>> +ORDER BY depname, empno, enroll_date;
>> +                         QUERY PLAN
>> +--------------------------------------------------
>> + WindowAgg
>> +   ->  WindowAgg
>> +         ->  Sort
>> +               Sort Key: depname, empno, enroll_date
>> +               ->  WindowAgg
>> +                     ->  Sort
>> +                           Sort Key: enroll_date DESC
>> +                           ->  Seq Scan on empsalary
>> +(8 rows)
>> +
>> Why aren't min() and sum() calculated on the same WindowAgg run?
>
> Isn't that exactly what is happening here? First count() with sort on
> enroll_date is run and then min() and sum()?

No, there are two passes over the window for those two, but I don't see that there needs to be.

> Only difference between this and plan generated by master (given below)
> is a sort in the end.

Then this is probably not this patch's job to fix.
--
Vik Fearing
Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
On Fri, Jan 6, 2023 at 4:46 AM Pavel Borisov wrote: > Hi, Alexander! > > On Thu, 5 Jan 2023 at 15:11, Alexander Korotkov > wrote: > > > > On Wed, Jan 4, 2023 at 5:05 PM Alexander Korotkov > wrote: > > > On Wed, Jan 4, 2023 at 3:43 PM Pavel Borisov > wrote: > > > > One more update of a patchset to avoid compiler warnings. > > > > > > Thank you for your help. I'm going to provide the revised version of > > > patch with comments and commit message in the next couple of days. > > > > The revised patch is attached. It contains describing commit message, > > comments and some minor code improvements. > > I've looked through the patch once again. It seems in a nice state to > be committed. > I also noticed that in tableam level and NodeModifyTable function > calls we have a one-to-one correspondence between *lockedSlot и bool > lockUpdated, but no checks on this in case something changes in the > code in the future. I'd propose combining these variables to remain > free from these checks. See v5 of a patch. Tests are successfully > passed. > Besides, the new version has only some minor changes in the comments > and the commit message. > > Kind regards, > Pavel Borisov, > Supabase. > It looks good, and the greater the concurrency the greater the benefit will be. Just a few minor suggestions regarding comments. "ExecDeleteAct() have already locked the old tuple for us", change "have" to "has". The comments in heapam_tuple_delete() and heapam_tuple_update() might be a little clearer with something like: "If the tuple has been concurrently updated, get lock already so that on retry it will succeed, provided that the caller asked to do this by providing a lockedSlot." Also, not too important, but perhaps better clarify in the commit message that the repeated work is driven by ExecUpdate and ExecDelete and can happen multiple times depending on the concurrency. Best Regards, Mason
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 08/01/23 16:33, David Rowley wrote: On Sun, 8 Jan 2023 at 23:21, Ankit Kumar Pandey wrote: Please find attached patch with addressed issues mentioned before. Here's a quick review: 1. You can use foreach_current_index(l) to obtain the index of the list element. 2. I'd rather see you looping backwards over the list in the first loop. I think it'll be more efficient to loop backwards as you can just break out the loop when you see the pathkeys are not contained in the order by pathkreys. When the optimisation does not apply that means you only need to look at the last item in the list. You could maybe just invent foreach_reverse() for this purpose and put it in pg_list.h. That'll save having to manually code up the loop. 3. I don't think you should call the variable enable_order_by_pushdown. We've a bunch of planner related GUCs that start with enable_. That might cause a bit of confusion. Maybe just try_sort_pushdown. 4. You should widen the scope of orderby_pathkeys and set it within the if (enable_order_by_pushdown) scope. You can reuse this again in the 2nd loop too. Just initialise it to NIL 5. You don't seem to be checking all WindowClauses for a runCondtion. If you do #2 above then you can start that process in the initial reverse loop so that you've checked them all by the time you get around to that WindowClause again in the 2nd loop. 6. The test with "+-- Do not perform additional sort if column is presorted". I don't think "additional" is the correct word here. I think you want to ensure that we don't push down the ORDER BY below the WindowAgg for this case. There is no ability to reduce the sorts here, only move them around, which we agreed we don't want to do for this case. I have addressed all points 1-6 in the attached patch. I have one doubt regarding runCondition, do we only need to ensure that window function which has subset sort clause of main query should not have runCondition or none of the window functions should not contain runCondition? 
I have gone with the latter case but wanted to clarify. Also, do you want to have a go at coding up the sort bound pushdown too? It'll require removing the limitCount restriction and adjusting ExecSetTupleBound() to recurse through a WindowAggState. I think it's pretty easy. You could try it then play around with it to make sure it works and we get the expected performance. I tried this in the patch but kept getting `retrieved too many tuples in a bounded sort`. I added the following code in ExecSetTupleBound, which correctly found the sortstate and set the bound value:

    else if (IsA(child_node, WindowAggState))
    {
        WindowAggState *winstate = (WindowAggState *) child_node;

        if (outerPlanState(winstate))
            ExecSetTupleBound(tuples_needed, outerPlanState(winstate));
    }

I think the problem is that we are not applying the limit clause inside the window function (which may need to scan all tuples), so passing the bound value down to WindowAggState->sortstate does not work as we might expect. Or maybe I am getting it wrong? I was trying to get a top-N sort for the LIMIT clause if the ORDER BY pushdown happens. You'll likely want to add a few more rows than the last performance test you did or run the query with pgbench. Running a query once that only takes 1-2ms is likely not a reliable way to test the performance of something. I did some profiling.
CREATE TABLE empsalary1 (
    depname varchar,
    empno bigint,
    salary int,
    enroll_date date
);

INSERT INTO empsalary1 (depname, empno, salary, enroll_date)
SELECT string_agg(substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
                         ceil(random() * 62)::integer, 1000), '') AS depname,
       generate_series(1, 1000) AS empno,
       ceil(random() * 1)::integer AS salary,
       NOW() + (random() * (interval '90 days')) + '30 days' AS enroll_date;

1) Optimization case

EXPLAIN (ANALYZE)
SELECT empno,
       depname,
       min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
       sum(salary) OVER (PARTITION BY depname) depsalary,
       count(*) OVER (ORDER BY enroll_date DESC) c
FROM empsalary1
ORDER BY depname, empno, enroll_date;

EXPLAIN (ANALYZE)
SELECT empno,
       depname,
       min(salary) OVER (PARTITION BY depname) depminsalary
FROM empsalary1
ORDER BY depname, empno;

In patched version:

                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=93458.04..93458.08 rows=1 width=61) (actual time=149996.006..156756.995 rows=1000 loops=1)
   ->  WindowAgg  (cost=93458.04..93458.06 rows=1 width=57) (actual time=108559.126..135892.188 rows=1000 loops=1)
         ->  Sort  (cost=93458.04..93458.04 rows=1 width=53) (actual time=108554.213..112564.168 rows=1000 loops=1)
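As a toy illustration of why the bound pushdown discussed above can go wrong, here is a hedged sketch of ExecSetTupleBound-style recursion that only forwards the bound through a WindowAgg when that node does not need to consume its whole input first. The NodeKind enum, the needs_whole_input flag, and set_tuple_bound() are invented for this example; they are not PostgreSQL's real PlanState structures.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for executor plan-state nodes. */
typedef enum { NODE_SORT, NODE_WINDOWAGG } NodeKind;

typedef struct PlanNode
{
    NodeKind kind;
    int      bound;             /* -1 = unbounded */
    int      needs_whole_input; /* e.g. an aggregate over the full partition */
    struct PlanNode *child;
} PlanNode;

static void
set_tuple_bound(PlanNode *node, int tuples_needed)
{
    if (node == NULL)
        return;
    if (node->kind == NODE_SORT)
        node->bound = tuples_needed;    /* enables a top-N sort */
    else if (node->kind == NODE_WINDOWAGG && !node->needs_whole_input)
        set_tuple_bound(node->child, tuples_needed);
    /* a WindowAgg that must see every row swallows the bound */
}
```

The "retrieved too many tuples in a bounded sort" failure above matches the case where the recursion forwards the bound even though the WindowAgg still needs every input row.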
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 08/01/23 21:36, Vik Fearing wrote: On 1/8/23 11:21, Ankit Kumar Pandey wrote: Please find attached patch with addressed issues mentioned before. I am curious about this plan:

+-- ORDER BY's in multiple Window functions can be combined into one
+-- if they are subset of QUERY's ORDER BY
+EXPLAIN (COSTS OFF)
+SELECT empno,
+       depname,
+       min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
+       sum(salary) OVER (PARTITION BY depname) depsalary,
+       count(*) OVER (ORDER BY enroll_date DESC) c
+FROM empsalary
+ORDER BY depname, empno, enroll_date;
+                   QUERY PLAN
+------------------------------------------------
+ WindowAgg
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: depname, empno, enroll_date
+               ->  WindowAgg
+                     ->  Sort
+                           Sort Key: enroll_date DESC
+                           ->  Seq Scan on empsalary
+(8 rows)
+

Why aren't min() and sum() calculated on the same WindowAgg run? Isn't that exactly what is happening here? First count() with a sort on enroll_date is run, and then min() and sum(). The only difference between this and the plan generated by master (given below) is a sort at the end.

                    QUERY PLAN
--------------------------------------------------
 Incremental Sort
   Sort Key: depname, empno, enroll_date
   Presorted Key: depname, empno
   ->  WindowAgg
         ->  WindowAgg
               ->  Sort
                     Sort Key: depname, empno
                     ->  WindowAgg
                           ->  Sort
                                 Sort Key: enroll_date DESC
                                 ->  Seq Scan on empsalary

Let me know if I am missing anything. Thanks. -- Regards, Ankit Kumar Pandey
Re: Progress report of CREATE INDEX for nested partitioned tables
On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote:
> We have the common problem of too many patches.
>
> https://www.postgresql.org/message-id/a15f904a70924ffa4ca25c3c744cff31e0e6e143.camel%40gmail.com
> This changes the progress reporting to show indirect children as
> "total", and adds a global variable to track recursion into
> DefineIndex(), allowing it to be incremented without the value being
> lost to the caller.
>
> https://www.postgresql.org/message-id/20221211063334.GB27893%40telsasoft.com
> This also counts indirect children, but only increments the progress
> reporting in the parent. This has the disadvantage that when
> intermediate partitions are in use, the done_partitions counter will
> "jump" from (say) 20 to 30 without ever hitting 21-29.
>
> https://www.postgresql.org/message-id/20221213044331.GJ27893%40telsasoft.com
> This has two alternate patches:
> - One patch changes to only update progress reporting of *direct*
>   children. This is minimal, but discourages any future plan to track
>   progress involving intermediate partitions with finer granularity.
> - An alternative patch adds IndexStmt.nparts_done, and allows reporting
>   fine-grained progress involving intermediate partitions.
>
> https://www.postgresql.org/message-id/flat/039564d234fc3d014c555a7ee98be69a9e724836.ca...@gmail.com
> This also reports progress of intermediate children. The first patch
> does it by adding an argument to DefineIndex() (which isn't okay to
> backpatch). And an alternate patch does it by adding to IndexStmt.
>
> @committers: Is it okay to add nparts_done to IndexStmt ?

Any hints about this? This should be resolved before the "CIC on partitioned tables" patch, which I think is otherwise done.
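The "global variable to track recursion into DefineIndex()" idea from the first linked patch can be sketched roughly as below. The PartNode tree, parts_done counter, and define_index() are toy stand-ins invented for this example, not the real PostgreSQL symbols: the point is only that a file-level counter survives the recursion into intermediate partitions, whereas a by-value parameter would lose the running total on return.

```c
#include <assert.h>
#include <stddef.h>

/* Toy partition hierarchy: a node with no children is a leaf partition. */
typedef struct PartNode
{
    int nchildren;
    struct PartNode **children;
} PartNode;

/* File-level counter: incremented inside the recursion, visible to the
 * outermost caller, so progress isn't lost across recursive calls. */
static int parts_done = 0;

static void
define_index(PartNode *part)
{
    if (part->nchildren == 0)
    {
        parts_done++;           /* leaf: one real index actually built */
        return;
    }
    /* intermediate partitioned table: recurse, adding nothing itself */
    for (int i = 0; i < part->nchildren; i++)
        define_index(part->children[i]);
}
```

With per-leaf increments like this, the done_partitions counter can advance 21, 22, ... instead of "jumping" from 20 to 30 when intermediate partitions are involved.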
Re: Fix gin index cost estimation
Alexander Korotkov writes: > I'm going to push this and backpatch to all supported versions if no > objections. Push yes, but I'd counsel against back-patching. People don't generally like unexpected plan changes in stable versions, and that's what a costing change could produce. There's no argument that we are fixing a failure or wrong answer here, so it doesn't seem like back-patch material. regards, tom lane
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 1/8/23 11:21, Ankit Kumar Pandey wrote: Please find attached patch with addressed issues mentioned before. I am curious about this plan:

+-- ORDER BY's in multiple Window functions can be combined into one
+-- if they are subset of QUERY's ORDER BY
+EXPLAIN (COSTS OFF)
+SELECT empno,
+       depname,
+       min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
+       sum(salary) OVER (PARTITION BY depname) depsalary,
+       count(*) OVER (ORDER BY enroll_date DESC) c
+FROM empsalary
+ORDER BY depname, empno, enroll_date;
+                   QUERY PLAN
+------------------------------------------------
+ WindowAgg
+   ->  WindowAgg
+         ->  Sort
+               Sort Key: depname, empno, enroll_date
+               ->  WindowAgg
+                     ->  Sort
+                           Sort Key: enroll_date DESC
+                           ->  Seq Scan on empsalary
+(8 rows)
+

Why aren't min() and sum() calculated on the same WindowAgg run? -- Vik Fearing
Re: [Commitfest 2023-01] has started
On Tue, 3 Jan 2023 at 13:13, vignesh C wrote: > > Hi All, > > Just a reminder that Commitfest 2023-01 has started. > There are many patches based on the latest run from [1] which require > a) Rebased on top of head b) Fix compilation failures c) Fix test > failure, please have a look and rebase it so that it is easy for the > reviewers and committers: > 1. TAP output format for pg_regress > 2. Add BufFileRead variants with short read and EOF detection > 3. Add SHELL_EXIT_CODE variable to psql > 4. Add foreign-server health checks infrastructure > 5. Add last_vacuum_index_scans in pg_stat_all_tables > 6. Add index scan progress to pg_stat_progress_vacuum > 7. Add the ability to limit the amount of memory that can be allocated > to backends. > 8. Add tracking of backend memory allocated to pg_stat_activity > 9. CAST( ... ON DEFAULT) > 10. CF App: add "Returned: Needs more interest" close status > 11. CI and test improvements > 12. Cygwin cleanup > 13. Expand character set for ltree labels > 14. Fix tab completion MERGE > 15. Force streaming every change in logical decoding > 16. More scalable multixacts buffers and locking > 17. Move SLRU data into the regular buffer pool > 18. Move extraUpdatedCols out of RangeTblEntry > 19.New [relation] options engine > 20. Optimizing Node Files Support > 21. PGDOCS - Stats views and functions not in order? > 22. POC: Lock updated tuples in tuple_update() and tuple_delete() > 23. Parallelize correlated subqueries that execute within each worker > 24. Pluggable toaster > 25. Prefetch the next tuple's memory during seqscans > 26. Pulling up direct-correlated ANY_SUBLINK > 27. Push aggregation down to base relations and joins > 28. Reduce timing overhead of EXPLAIN ANALYZE using rdtsc > 29. Refactor relation extension, faster COPY > 30. Remove NEW placeholder entry from stored view query range table > 31. TDE key management patches > 32. Use AF_UNIX for tests on Windows (ie drop fallback TCP code) > 33. 
Windows filesystem support improvements
> 34. making relfilenodes 56 bit
> 35. postgres_fdw: commit remote (sub)transactions in parallel during
> pre-commit
> 36. recovery modules
>
> Commitfest status as of now:
> Needs review:            177
> Waiting on Author:        47
> Ready for Committer:      20
> Committed:                31
> Withdrawn:                 4
> Rejected:                  0
> Returned with Feedback:    0
> Total:                   279
>
> We will be needing more members to actively review the patches to get
> more patches to the committed state. I would like to remind you that
> each patch submitter is expected to review at least one patch from
> another submitter during the CommitFest; those members who have not
> picked up a patch for review, please pick someone else's patch to review
> as soon as you can.
> I'll send out reminders this week to get your patches rebased and
> update the status of the patch accordingly.
>
> [1] - http://cfbot.cputube.org/

Hi Hackers, Here's a quick status report after the first week (I think only about 9 commits happened during the week, the rest were pre-CF activity):

 status                  | 3rd Jan | w1
-------------------------+---------+-----
 Needs review            |     177 | 149
 Waiting on Author       |      47 |  60
 Ready for Committer     |      20 |  23
 Committed               |      31 |  40
 Withdrawn               |       4 |   7
 Rejected                |       0 |   0
 Returned with Feedback  |       0 |   0
 Total                   |     279 | 279

Here is a list of "Needs review" entries for which there has not been much communication on the thread and which need help in proceeding further.
Please pick one of these and help us on how to proceed further:

pgbench: using prepared BEGIN statement in a pipeline could cause an error | Yugo Nagata
Fix dsa_free() to re-bin segment | Dongming Liu
pg_rewind: warn when checkpoint hasn't happened after promotion | James Coleman
Work around non-atomic read of read of control file on ext4 | Thomas Munro
Rethinking the implementation of ts_headline | Tom Lane
Fix GetWALAvailability function code comments for WALAVAIL_REMOVED return value | sirisha chamarti
Function to log backtrace of postgres processes | vignesh C, Bharath Rupireddy
disallow HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_LOCKED_ONLY | Nathan Bossart
New hooks in the connection path | Bertrand Drouvot
Check consistency of GUC defaults between .sample.conf and pg_settings.boot_val | Justin Pryzby
Add <> support to sepgsql_restorecon | Joe Conway
pg_stat_statements and "IN" conditions | Dmitry Dolgov
Patch to implement missing join selectivity estimation for range types | Zhicheng Luo, Maxime Schoemans, Diogo Repas, Mahmoud SAKR
Operation log for major operations | Dmitry Koval
Consider parallel for LATERAL subqueries having LIMIT/OFFSET | James Coleman
Using each rel as both outer and inner for anti-joins | Richard Guo
partInde
MERGE ... RETURNING
I've been thinking about adding RETURNING support to MERGE in order to let the user see what changed. I considered allowing a separate RETURNING list at the end of each action, but rapidly dismissed that idea. Firstly, it introduces shift/reduce conflicts to the grammar. These can be resolved by making the "AS" before column aliases non-optional, but that's pretty ugly, and there may be a better way. More serious drawbacks are that this syntax is much more cumbersome for the end user, having to repeat the RETURNING clause several times, and the implementation is likely to be pretty complex, so I didn't pursue it. A much simpler approach is to just have a single RETURNING list at the end of the command. That's much easier to implement, and easier for the end user. The main drawback is that it's impossible for the user to work out from the values returned which action was actually taken, and I think that's a pretty essential piece of information (at least it seems pretty limiting to me, not being able to work that out). So playing around with it (and inspired by the WITH ORDINALITY syntax for SRFs), I had the idea of allowing "WITH WHEN CLAUSE" at the end of the returning list, which adds an integer column to the list, whose value is set to the index of the when clause executed, as in the attached very rough patch. 
So, quoting an example from the tests, this allows things like:

WITH t AS (
    MERGE INTO sq_target t USING v ON tid = sid
    WHEN MATCHED AND tid > 2 THEN UPDATE SET balance = t.balance + delta
    WHEN NOT MATCHED THEN INSERT (balance, tid) VALUES (balance + delta, sid)
    WHEN MATCHED AND tid < 2 THEN DELETE
    RETURNING t.* WITH WHEN CLAUSE
)
SELECT CASE when_clause
         WHEN 1 THEN 'UPDATE'
         WHEN 2 THEN 'INSERT'
         WHEN 3 THEN 'DELETE'
       END, *
FROM t;

  case  | tid | balance | when_clause
--------+-----+---------+-------------
 INSERT |  -1 |     -11 |           2
 DELETE |   1 |     100 |           3
(2 rows)

1 row is returned for each merge action executed (other than DO NOTHING actions), and as usual, the values represent old target values for DELETE actions, and new target values for INSERT/UPDATE actions. It's also possible to return the source values, and a bare "*" in the returning list expands to all the source columns, followed by all the target columns. The name of the added column, if included, can be changed by specifying "WITH WHEN CLAUSE [AS] col_alias". I chose the syntax "WHEN CLAUSE" and "when_clause" as the default column name because those match the existing terminology used in the docs. Anyway, this feels like a good point to stop playing around and get feedback on whether this seems useful, or if anyone has other ideas. Regards, Dean

diff --git a/doc/src/sgml/ref/merge.sgml b/doc/src/sgml/ref/merge.sgml
new file mode 100644
index 0995fe0..4fc0a65
--- a/doc/src/sgml/ref/merge.sgml
+++ b/doc/src/sgml/ref/merge.sgml
@@ -25,6 +25,8 @@ PostgreSQL documentation
 MERGE INTO [ ONLY ] target_table_name [ * ] [ [ AS ] target_alias ]
 USING data_source ON join_condition
 when_clause [...]
+[ RETURNING * | output_expression [ [ AS ] output_name ] [, ...]
+    [ WITH WHEN CLAUSE [ [ AS ] col_alias ] ] ]

where data_source is:

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
new file mode 100644
index e34f583..aa3cca0
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -274,12 +274,6 @@ DoCopy(ParseState *pstate, const CopyStm
 	{
 		Assert(stmt->query);
 
-		/* MERGE is allowed by parser, but unimplemented. Reject for now */
-		if (IsA(stmt->query, MergeStmt))
-			ereport(ERROR,
-					errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					errmsg("MERGE not supported in COPY"));
-
 		query = makeNode(RawStmt);
 		query->stmt = stmt->query;
 		query->stmt_location = stmt_location;
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
new file mode 100644
index 8043b4e..e02d7d0
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -510,7 +510,8 @@ BeginCopyTo(ParseState *pstate,
 	{
 		Assert(query->commandType == CMD_INSERT ||
 			   query->commandType == CMD_UPDATE ||
-			   query->commandType == CMD_DELETE);
+			   query->commandType == CMD_DELETE ||
+			   query->commandType == CMD_MERGE);
 
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
new file mode 100644
index 651ad24..3391269
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -612,6 +612,9 @@ ExecInitPartitionInfo(ModifyTableState *
 	 * case or in the case of UPDATE tuple routing where we didn't find a
 	 * result rel to reuse.
 	 */
+
+	/* XXX: What about the MERGE case ??? */
+
 	if (node && node->returningLists != NIL)
 	{
 		TupleTableSlot *slot;
@@ -877,6 +880,7 @@ ExecInitPartitionInfo(ModifyTableState *
 	List	   *firstMergeActionList = linitial(node->mergeAct
Re: Fix gin index cost estimation
On Tue, Dec 6, 2022 at 1:22 PM Ronan Dunklau wrote:
> On Friday 2 December 2022 at 13:58:27 CET, Ronan Dunklau wrote:
> > On Friday 2 December 2022 at 12:33:33 CET, Alexander Korotkov wrote:
> > > Hi, Ronan! Thank you for your patch. Couple of quick questions.
> > > 1) What magic number 50.0 stands for? I think we at least should make
> > > it a macro.
> >
> > This is what is used in other tree-descending estimation functions, so I
> > used that too. Maybe a DEFAULT_PAGE_CPU_COST macro would work for both? If
> > so I'll separate this into two patches, one introducing the macro for the
> > other estimation functions, and this patch for gin.
>
> The 0001 patch does this.
>
> > > 2) "We only charge one data page for the startup cost" – should this
> > > be dependent on number of search entries?
>
> In fact there was another problem. The current code estimates two different
> paths for fetching data pages: in the case of a partial match, it takes into
> account that all the data pages will have to be fetched. So this is now
> taken into account for the CPU cost as well.
>
> For the regular search, we scale the number of data pages by the number of
> search entries.

Now the patch looks good to me. I made some tests.
# create extension pg_trgm;
# create extension btree_gin;
# create table test1 as (select random() as val from generate_series(1,100) i);
# create index test1_gin_idx on test1 using gin (val);
# explain select * from test1 where val between 0.1 and 0.2;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test1  (cost=1186.21..7089.57 rows=98557 width=8)
   Recheck Cond: ((val >= '0.1'::double precision) AND (val <= '0.2'::double precision))
   ->  Bitmap Index Scan on test1_gin_idx  (cost=0.00..1161.57 rows=98557 width=0)
         Index Cond: ((val >= '0.1'::double precision) AND (val <= '0.2'::double precision))
(4 rows)

# create index test1_btree_idx on test1 using btree (val);
# explain select * from test1 where val between 0.1 and 0.2;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Index Only Scan using test1_btree_idx on test1  (cost=0.42..3055.57 rows=98557 width=8)
   Index Cond: ((val >= '0.1'::double precision) AND (val <= '0.2'::double precision))
(2 rows)

Looks reasonable. In this case GIN is much more expensive, because it can't handle a range query properly and overlaps two partial matches.

# create table test2 as (select 'samplestring' || i as val from generate_series(1,100) i);
# create index test2_gin_idx on test2 using gin (val);
# explain select * from test2 where val = 'samplestring50';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on test2  (cost=20.01..24.02 rows=1 width=18)
   Recheck Cond: (val = 'samplestring50'::text)
   ->  Bitmap Index Scan on test2_gin_idx  (cost=0.00..20.01 rows=1 width=0)
         Index Cond: (val = 'samplestring50'::text)
(4 rows)

# create index test2_btree_idx on test2 using btree (val);
# explain select * from test2 where val = 'samplestring50';
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Index Only Scan using test2_btree_idx on test2  (cost=0.42..4.44 rows=1 width=18)
   Index Cond: (val = 'samplestring50'::text)
(2 rows)

This also looks reasonable. GIN is not terribly bad for this case, but B-tree is much cheaper. I'm going to push this and backpatch to all supported versions if no objections. -- Regards, Alexander Korotkov
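The DEFAULT_PAGE_CPU_COST macro floated earlier in this thread could look something like the sketch below. The 50.0-times-cpu_operator_cost convention comes from the other tree-descending estimation functions the thread mentions; the macro name was only proposed here, and CPU_OPERATOR_COST is a stand-in for the cpu_operator_cost GUC default, so none of this is committed PostgreSQL code.

```c
#include <assert.h>

/*
 * Hedged sketch of the DEFAULT_PAGE_CPU_COST idea: descending one page of
 * a search tree is charged 50x cpu_operator_cost, matching the magic
 * number already used by other index cost estimators in the thread.
 */
#define CPU_OPERATOR_COST 0.0025    /* stand-in for the GUC default */
#define DEFAULT_PAGE_CPU_COST (50.0 * CPU_OPERATOR_COST)

static double
descent_cpu_cost(double pages_descended)
{
    /* CPU cost of walking pages_descended pages down the tree */
    return pages_descended * DEFAULT_PAGE_CPU_COST;
}
```

Factoring the constant out this way would let both the existing estimators and gincostestimate share one symbol instead of repeating the bare 50.0.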
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Sun, 8 Jan 2023 at 23:21, Ankit Kumar Pandey wrote:
> Please find attached patch with addressed issues mentioned before.

Here's a quick review:

1. You can use foreach_current_index(l) to obtain the index of the list element.

2. I'd rather see you looping backwards over the list in the first loop. I think it'll be more efficient to loop backwards as you can just break out of the loop when you see the pathkeys are not contained in the ORDER BY pathkeys. When the optimisation does not apply, that means you only need to look at the last item in the list. You could maybe just invent foreach_reverse() for this purpose and put it in pg_list.h. That'll save having to manually code up the loop.

3. I don't think you should call the variable enable_order_by_pushdown. We've a bunch of planner-related GUCs that start with enable_. That might cause a bit of confusion. Maybe just try_sort_pushdown.

4. You should widen the scope of orderby_pathkeys and set it within the if (enable_order_by_pushdown) scope. You can reuse this again in the 2nd loop too. Just initialise it to NIL.

5. You don't seem to be checking all WindowClauses for a runCondition. If you do #2 above then you can start that process in the initial reverse loop so that you've checked them all by the time you get around to that WindowClause again in the 2nd loop.

6. The test with "+-- Do not perform additional sort if column is presorted". I don't think "additional" is the correct word here. I think you want to ensure that we don't push down the ORDER BY below the WindowAgg for this case. There is no ability to reduce the sorts here, only move them around, which we agreed we don't want to do for this case.

Also, do you want to have a go at coding up the sort bound pushdown too? It'll require removing the limitCount restriction and adjusting ExecSetTupleBound() to recurse through a WindowAggState. I think it's pretty easy.
You could try it then play around with it to make sure it works and we get the expected performance. You'll likely want to add a few more rows than the last performance test you did or run the query with pgbench. Running a query once that only takes 1-2ms is likely not a reliable way to test the performance of something. David
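The foreach_reverse() David suggests inventing could be sketched as below. The List/ListCell structs here are simplified stand-ins for pg_list.h (where Lists have been flat arrays since v13), so treat the macro as an illustration of the idea rather than the eventual implementation.

```c
#include <assert.h>

/* Simplified stand-ins for PostgreSQL's List/ListCell (pg_list.h). */
typedef struct ListCell { int int_value; } ListCell;
typedef struct List { int length; ListCell *elements; } List;

/*
 * Hypothetical foreach_reverse(): since a List is a flat array of cells,
 * reverse iteration is just a descending pointer walk from the last cell.
 * Assumes a non-empty list, like this sketch's caller guarantees.
 */
#define foreach_reverse(cell, lst) \
    for ((cell) = &(lst)->elements[(lst)->length - 1]; \
         (cell) >= (lst)->elements; \
         (cell)--)

/* Example use: find the last element below a limit, breaking out early. */
static int
last_element_below(List *lst, int limit)
{
    ListCell *lc;

    foreach_reverse(lc, lst)
    {
        if (lc->int_value < limit)
            return lc->int_value;   /* first match scanning from the tail */
    }
    return -1;                      /* no element below the limit */
}
```

The early break mirrors the review's point #2: when scanning from the tail, the loop can stop as soon as the interesting condition is decided, without visiting the whole list.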
Re: [RFC] Add jit deform_counter
ne 8. 1. 2023 v 11:57 odesílatel Dmitry Dolgov <9erthali...@gmail.com> napsal: > > On Sat, Jan 07, 2023 at 07:09:11PM +0100, Pavel Stehule wrote: > > so 7. 1. 2023 v 16:48 odesílatel Dmitry Dolgov <9erthali...@gmail.com> > > napsal: > > > > > > On Fri, Jan 06, 2023 at 09:42:09AM +0100, Pavel Stehule wrote: > > > > The explain part is working, the part of pg_stat_statements doesn't > > > > > > > > set jit_above_cost to 10; > > > > set jit_optimize_above_cost to 10; > > > > set jit_inline_above_cost to 10; > > > > > > > > (2023-01-06 09:08:59) postgres=# explain analyze select > > > > count(length(prosrc) > 0) from pg_proc; > > > > > > > > ┌┐ > > > > │ QUERY > PLAN > > > > │ > > > > > > > > ╞╡ > > > > │ Aggregate (cost=154.10..154.11 rows=1 width=8) (actual > > > > time=132.320..132.321 rows=1 loops=1) > > > │ > > > > │ -> Seq Scan on pg_proc (cost=0.00..129.63 rows=3263 width=16) > > > (actual > > > > time=0.013..0.301 rows=3266 loops=1) │ > > > > │ Planning Time: 0.070 ms > > > > │ > > > > │ JIT: > > > > │ > > > > │ Functions: 3 > > > > │ > > > > │ Options: Inlining true, Optimization true, Expressions true, > > > Deforming > > > > true │ > > > > │ Timing: Generation 0.597 ms, Deforming 0.407 ms, Inlining 8.943 > ms, > > > > Optimization 79.403 ms, Emission 43.091 ms, Total 132.034 ms │ > > > > │ Execution Time: 132.986 ms > > > > │ > > > > > > > > └┘ > > > > (8 rows) > > > > > > > > I see the result of deforming in explain analyze, but related values > in > > > > pg_stat_statements are 0. > > > > > > I'm not sure why, but pgss jit metrics are always nulls for explain > > > analyze queries. I have noticed this with surprise myself, when > recently > > > was reviewing the lazy jit patch, but haven't yet figure out what is > the > > > reason. Anyway, without "explain analyze" you'll get correct deforming > > > numbers in pgss. 
> > > It was really strange, because I tested the queries without EXPLAIN ANALYZE
> > > too, and new columns were always zero on my comp. Other jit columns were
> > > filled. But I didn't do a deeper investigation.
> >
> > Interesting. I've verified it once more with the query and the
> > parameters you've posted, got the following:
> >
> > jit_functions          | 3
> > jit_generation_time    | 1.257522
> > jit_deform_count       | 1
> > jit_deform_time        | 10.381345
> > jit_inlining_count     | 1
> > jit_inlining_time      | 71.628168
> > jit_optimization_count | 1
> > jit_optimization_time  | 48.146447
> > jit_emission_count     | 1
> > jit_emission_time      | 0.737822
> >
> > Maybe there is anything else special about how you run it?

I hope not, but I'll see. I will recheck the updated patch.

> Otherwise addressed the rest of commentaries, thanks.
Re: [RFC] Add jit deform_counter
> On Sat, Jan 07, 2023 at 07:09:11PM +0100, Pavel Stehule wrote: > so 7. 1. 2023 v 16:48 odesílatel Dmitry Dolgov <9erthali...@gmail.com> > napsal: > > > > On Fri, Jan 06, 2023 at 09:42:09AM +0100, Pavel Stehule wrote: > > > The explain part is working, the part of pg_stat_statements doesn't > > > > > > set jit_above_cost to 10; > > > set jit_optimize_above_cost to 10; > > > set jit_inline_above_cost to 10; > > > > > > (2023-01-06 09:08:59) postgres=# explain analyze select > > > count(length(prosrc) > 0) from pg_proc; > > > > > ┌┐ > > > │ QUERY PLAN > > > │ > > > > > ╞╡ > > > │ Aggregate (cost=154.10..154.11 rows=1 width=8) (actual > > > time=132.320..132.321 rows=1 loops=1) > > │ > > > │ -> Seq Scan on pg_proc (cost=0.00..129.63 rows=3263 width=16) > > (actual > > > time=0.013..0.301 rows=3266 loops=1) │ > > > │ Planning Time: 0.070 ms > > > │ > > > │ JIT: > > > │ > > > │ Functions: 3 > > > │ > > > │ Options: Inlining true, Optimization true, Expressions true, > > Deforming > > > true │ > > > │ Timing: Generation 0.597 ms, Deforming 0.407 ms, Inlining 8.943 ms, > > > Optimization 79.403 ms, Emission 43.091 ms, Total 132.034 ms │ > > > │ Execution Time: 132.986 ms > > > │ > > > > > └┘ > > > (8 rows) > > > > > > I see the result of deforming in explain analyze, but related values in > > > pg_stat_statements are 0. > > > > I'm not sure why, but pgss jit metrics are always nulls for explain > > analyze queries. I have noticed this with surprise myself, when recently > > was reviewing the lazy jit patch, but haven't yet figure out what is the > > reason. Anyway, without "explain analyze" you'll get correct deforming > > numbers in pgss. > > > > It was really strange, because I tested the queries without EXPLAIN ANALYZE > too, and new columns were always zero on my comp. Other jit columns were > filled. But I didn't do a deeper investigation. Interesting. 
I've verified it once more with the query and the parameters you've posted, got the following:

jit_functions          | 3
jit_generation_time    | 1.257522
jit_deform_count       | 1
jit_deform_time        | 10.381345
jit_inlining_count     | 1
jit_inlining_time      | 71.628168
jit_optimization_count | 1
jit_optimization_time  | 48.146447
jit_emission_count     | 1
jit_emission_time      | 0.737822

Maybe there is anything else special about how you run it? Otherwise addressed the rest of commentaries, thanks.

>From 93d739e9258f79474df5644831a1f82fc97742dc Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Sat, 11 Jun 2022 11:54:40 +0200
Subject: [PATCH v3 1/2] Add deform_counter

At the moment generation_counter includes time spent on both JITing
expressions and tuple deforming. Those are configured independently via
corresponding options (jit_expressions and jit_tuple_deforming), which
means there is no way to find out what fraction of time tuple deforming
alone is taking. Add deform_counter dedicated to tuple deforming, which
allows seeing more directly the influence jit_tuple_deforming is having
on the query.
---
 src/backend/commands/explain.c      | 7 ++-
 src/backend/jit/jit.c               | 1 +
 src/backend/jit/llvm/llvmjit_expr.c | 6 ++
 src/include/jit/jit.h               | 3 +++
 4 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f86983c660..2f23602a87 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -882,6 +882,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji)
 	/* calculate total time */
 	INSTR_TIME_SET_ZERO(total_time);
+	/* don't add deform_counter, it's included into generation_counter */
 	INSTR_TIME_ADD(total_time, ji->generation_counter);
 	INSTR_TIME_ADD(total_time, ji->inlining_counter);
 	INSTR_TIME_ADD(total_time, ji->optimization_counter);
@@ -909,8 +910,9 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji)
 	{
 		ExplainIndentText(es);
 		appendStringInfo(es->str,
-						 "Timing: %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms\n",
+						 "Timing: %s %.3f
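The accounting rule in the hunk above (deform time is a subset of generation time, so it must not be added into the total a second time) can be illustrated with plain doubles standing in for instr_time. JitTimes and jit_total_time() are names invented for this sketch; the sample values are the Timing numbers from the EXPLAIN output quoted earlier in the thread.

```c
#include <assert.h>
#include <math.h>

/* Plain-double stand-in for JitInstrumentation's instr_time counters. */
typedef struct JitTimes
{
    double generation;   /* includes deform time */
    double deform;
    double inlining;
    double optimization;
    double emission;
} JitTimes;

static double
jit_total_time(const JitTimes *t)
{
    /* don't add t->deform: it is already part of t->generation */
    return t->generation + t->inlining + t->optimization + t->emission;
}
```

Adding deform into the sum as well would double-count it, which is exactly what the new comment in ExplainPrintJIT() guards against.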
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
Hi David, Please find attached patch with addressed issues mentioned before.

Things resolved:

1. The correct position of the window function from which ORDER BY pushdown can happen is now determined; this fixes the issue mentioned in case #1. While writing test cases, I found that the optimization did not happen for case #1 (which is a prime candidate for such an operation), like:

EXPLAIN (COSTS OFF)
SELECT empno,
       depname,
       min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
       sum(salary) OVER (PARTITION BY depname) depsalary
FROM empsalary
ORDER BY depname, empno, enroll_date

2. Point #2 as in the above discussions:

a) looks like the best plan to me. What's the point of pushing the sort below the WindowAgg in this case? The point of this optimisation is to reduce the number of sorts, not to push them as deep into the plan as possible. We should only be pushing them down when it can reduce the number of sorts. There's no reduction in the number of sorts in the above plan.

Works as mentioned. All test cases (newly added and existing ones) are green.
-- Regards, Ankit Kumar Pandey

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 000d757bdd..fd3fcd43a0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4423,6 +4423,12 @@ create_one_window_path(PlannerInfo *root,
 	PathTarget *window_target;
 	ListCell   *l;
 	List	   *topqual = NIL;
+	bool		has_runcondition = false;
+
+	List	   *window_pathkeys;
+	int			index = 0;
+	int			i;
+	bool		enable_order_by_pushdown = false;
 
 	/*
 	 * Since each window clause could require a different sort order, we stack
@@ -4441,10 +4447,45 @@ create_one_window_path(PlannerInfo *root,
 	 */
 	window_target = input_target;
 
+	enable_order_by_pushdown = (root->parse->distinctClause == NIL &&
+								root->parse->sortClause != NIL &&
+								root->parse->limitCount == NULL &&
+								!contain_window_function((Node *) get_sortgrouplist_exprs(root->parse->sortClause,
+																						  root->processed_tlist)));
+
+	i = 0;
+	if (enable_order_by_pushdown)
+	{
+		/*
+		 * Search for the window function whose path keys are
+		 * subset of orderby_path keys, this allows us to perform
+		 * order by pushdown from this position of window function onwards
+		 */
+		foreach(l, activeWindows)
+		{
+			List	   *orderby_pathkeys;
+			WindowClause *wc = lfirst_node(WindowClause, l);
+
+			orderby_pathkeys = make_pathkeys_for_sortclauses(root,
+															 root->parse->sortClause,
+															 root->processed_tlist);
+			window_pathkeys = make_pathkeys_for_window(root,
+													   wc,
+													   root->processed_tlist);
+			if (pathkeys_contained_in(window_pathkeys, orderby_pathkeys))
+			{
+				index = i;
+				break;
+			}
+			i++;
+		}
+	}
+
+	i = 0;
 	foreach(l, activeWindows)
 	{
 		WindowClause *wc = lfirst_node(WindowClause, l);
-		List	   *window_pathkeys;
 		int			presorted_keys;
 		bool		is_sorted;
 		bool		topwindow;
@@ -4457,6 +4498,38 @@ create_one_window_path(PlannerInfo *root,
 										path->pathkeys,
 										&presorted_keys);
 
+		has_runcondition |= (wc->runCondition != NIL);
+
+		/*
+		 * If not sorted already and the query has an ORDER BY clause, then we
+		 * try to push the entire ORDER BY below the final WindowAgg.  We
+		 * don't do this when the query has a DISTINCT clause as the planner
+		 * might opt to do a Hash Aggregate and spoil our efforts here.  We
+		 * also can't do this when the ORDER BY contains any WindowFuncs as we
+		 * obviously can't sort something that's yet to be evaluated.  We also
+		 * don't bother with this when there is a WindowClauses with a
+		 * runCondition as those can reduce the number of rows to sort in the
+		 * ORDER BY and we don't want add complexity to the WindowAgg sort if
+		 * the final ORDER BY sort is going to have far fewer rows to sort.
+		 * For the same reason, don't bother if the query has a LIMIT clause.
+		 */
+		if (enable_order_by_pushdown && !is_sorted && !has_runcondition && (i == index))
+		{
+			List	   *orderby_pathkeys;
+
+			orderby_pathkeys = make_pathkeys_for_sortclauses(root,
+															 root->parse->sortClause,
+															 root->processed_tlist);
+
+			if (pathkeys_contained_in(window_pathkeys, orderby_pathkeys))
+				window_pathkeys = orderby_pathkeys;
+
+			is_sorted = pathkeys_count_contained_in(window_pathkeys,
+													path->pathkeys,
+													&presorted_keys);
+		}
+		i++;
+
 		/* Sort if necessary */
 		if (!is_sorted)
 		{
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index 90e89fb5b6..08663fd914 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3982,7 +3982,194 @@ ORDER BY depname, empno;
    ->  Seq Scan on empsalary
 (11 rows)
 
+-- Do not perform additional sort if column is presorted
+CREATE INDEX depname_idx ON empsalary(