Unwanted file mode modification?

2023-01-08 Thread Japin Li


Hi,

Commit 216a784829 changed the src/backend/replication/logical/worker.c file mode
from 0644 to 0755, which is unwanted, right?

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=216a784829c2c5f03ab0c43e009126cbb819e9b2

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: doc: add missing "id" attributes to extension packaging page

2023-01-08 Thread Brar Piening

On 09.01.2023 at 03:38, vignesh C wrote:

There are a couple of commitfest entries for this:
https://commitfest.postgresql.org/41/4041/
https://commitfest.postgresql.org/41/4042/
Can one of them be closed?

I've split the initial patch into two parts upon Álvaro's request in [1]
so that we can discuss them separately.

https://commitfest.postgresql.org/41/4041/ is tracking the patch you've
been trying to apply and that I've just sent a rebased version for. It
only adds (invisible) ids to the HTML documentation and can be closed
once you've applied the patch.
https://commitfest.postgresql.org/41/4042/ is tracking a different patch
that makes the ids and the corresponding links discoverable at the HTML
surface. Hover over one of the psql options in [2] to see the behavior. This
one still needs reviewing and there is no discussion around it yet.

Regards,
Brar

[1]
https://www.postgresql.org/message-id/20221206083809.3kaygnh2xswoxslj%40alvherre.pgsql
[2] https://pgdocs.piening.info/app-psql.html#APP-PSQL-OPTION-PORT




Re: Logical replication timeout problem

2023-01-08 Thread Amit Kapila
On Fri, Jan 6, 2023 at 12:35 PM Ashutosh Bapat wrote:
>
> +
> +/*
> + * We don't want to try sending a keepalive message after processing each
> + * change as that can have overhead. Tests revealed that there is no
> + * noticeable overhead in doing it after continuously processing 100 or so
> + * changes.
> + */
> +#define CHANGES_THRESHOLD 100
>
> I think a time-based threshold makes more sense. What if the timeout was
> nearing and those 100 changes just took a little more time, causing a timeout?
> We already have a time-based threshold in WalSndKeepaliveIfNecessary(). And
> that function is invoked after reading every WAL record in WalSndLoop(). So it
> does not look like it's an expensive function. If it is expensive we might
> want to worry about WalSndLoop as well. Does it make more sense to remove this
> threshold?
>

We have previously tried this for every change [1] and it brings
noticeable overhead. In fact, even doing it for every 10 changes also
had some overhead, which is why we arrived at this threshold number. I
don't think it can lead to a timeout due to skipping changes, but if
we see any such report we can further fine-tune this setting or try to
make it time-based; for now I feel it is safe to use this threshold.

> +
> +/*
> + * After continuously processing CHANGES_THRESHOLD changes, we
> + * try to send a keepalive message if required.
> + */
> +if (++changes_count >= CHANGES_THRESHOLD)
> +{
> +ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
> +changes_count = 0;
> +}
> +}
> +
>
> On the other thread, I mentioned that we don't have a TAP test for it.
> I agree with Amit's opinion there that it's hard to create a test which will
> time out everywhere. I think what we need is a way to control the time
> required for decoding a transaction.
>
> A rough idea is to induce a small sleep after decoding every change. The
> amount of sleep * number of changes will help us estimate and control the
> amount of time taken to decode a transaction. Then we create a transaction
> which will take longer than the timeout threshold to decode. But that's a
> significant amount of code. I don't think PostgreSQL has a facility to induce
> a delay at a particular place in the code.
>

Yeah, I don't know how to induce such a delay while decoding changes.

One more thing: I think it would be better to expose a new callback
API via the reorder buffer, as suggested previously [2], similar to other
reorder buffer APIs, instead of directly using the reorderbuffer API to
invoke the plugin API.


[1] - 
https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[2] - 
https://www.postgresql.org/message-id/CAA4eK1%2BfQjndoBOFUn9Wy0hhm3MLyUWEpcT9O7iuCELktfdBiQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: missing indexes in indexlist with partitioned tables

2023-01-08 Thread David Rowley
On Tue, 6 Dec 2022 at 13:43, Arne Roland  wrote:
> That being said, attached patch should fix the issue reported below.

I took a look over the v10 patch and ended up making adjustments to
the tests. I didn't quite see the need for the tests to be as extensive
as the ones you had in v10.  Neither join removals nor unique joins treat
partitioned tables any differently from normal tables, so I think it's
fine just to have a single test that makes sure join removals work on
partitioned tables. I didn't feel inclined to add a test for unique
joins. The test I added is mainly just there to make sure something
fails if someone decides partitioned tables don't need the indexlist
populated at some point in the future.

The other changes I made were just cosmetic. I pushed the result.

David




Re: MultiXact\SLRU buffers configuration

2023-01-08 Thread Andrey Borodin
On Tue, Jan 3, 2023 at 5:02 AM vignesh C  wrote:
> does not apply on top of HEAD as in [1], please post a rebased patch:
>
Thanks! Here's the rebase.

Best regards, Andrey Borodin.


v24-0001-Divide-SLRU-buffers-into-8-associative-banks.patch
Description: Binary data


Re: Infinite Interval

2023-01-08 Thread jian he
On Sun, Jan 8, 2023 at 4:22 AM Joseph Koshakow  wrote:

> On Sat, Jan 7, 2023 at 3:05 PM Joseph Koshakow  wrote:
> >
> > On Sat, Jan 7, 2023 at 3:04 PM Joseph Koshakow 
> wrote:
> > >
> > > I think this patch is just about ready for review, except for the
> > > following two questions:
> > >   1. Should finite checks on intervals only look at months or all three
> > >   fields?
> > >   2. Should we make the error messages for adding/subtracting infinite
> > >   values more generic or leave them as is?
> > >
> > > My opinions are
> > >   1. We should only look at months.
> > >   2. We should make the errors more generic.
> > >
> > > Anyone else have any thoughts?
>
> Here's a patch with the more generic error messages.
>
> - Joe
>

Hi,

I just found another problem.

select * from  generate_series(timestamp'-infinity', timestamp 'infinity',
interval 'infinity');
ERROR:  timestamp out of range

select * from  generate_series(timestamp'-infinity',timestamp 'infinity',
interval '-infinity'); -- returns the following:

 generate_series
-----------------
(0 rows)


select * from generate_series(timestamp 'infinity',timestamp 'infinity',
interval 'infinity');
-- runs forever (never returns).

select * from  generate_series(timestamp 'infinity',timestamp 'infinity',
interval '-infinity');
ERROR:  timestamp out of range

 select * from  generate_series(timestamp'-infinity',timestamp'-infinity',
interval 'infinity');
ERROR:  timestamp out of range

select * from  generate_series(timestamp'-infinity',timestamp'-infinity',
interval '-infinity');
-- runs forever (never returns).

-- 
 I recommend David Deutsch's <>

  Jian


Re: Amcheck verification of GiST and GIN

2023-01-08 Thread Andrey Borodin
On Sun, Jan 8, 2023 at 8:05 PM Andrey Borodin  wrote:
>
> Please find the attached new version. In this patchset heapallindexed
> flag is removed from GIN checks.
>
Uh... sorry, git-formatted wrong branch.
Here's the correct version. Double checked.

Best regards, Andrey Borodin.


v19-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


v19-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


v19-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


Re: Amcheck verification of GiST and GIN

2023-01-08 Thread Andrey Borodin
Hi Jose, thank you for the review and sorry for the long delay in answering.

On Wed, Dec 14, 2022 at 4:19 AM Jose Arthur Benetasso Villanova wrote:
>
>
> On Sun, 27 Nov 2022, Andrey Borodin wrote:
>
> > On Sun, Nov 27, 2022 at 1:29 PM Andrey Borodin wrote:
> >>
> > I was wrong. GIN check does similar gin_refind_parent() to lock pages
> > in bottom-up manner and truly verify downlink-child_page invariant.
>
> Does this mean that we need the adjustment in docs?
It seems to me that gin_index_parent_check() docs are correct.

>
> > Here's v17. The only difference is that I added progress reporting to
> > GiST verification.
> > I still did not implement heapallindexed for GIN. Existence of pending
> > lists makes this just too difficult for a weekend coding project :(
> >
> > Thank you!
> >
> > Best regards, Andrey Borodin.
> >
>
> I'm a bit lost here. I tried your patch again and indeed the
> heapallindexed inside gin_check_parent_keys_consistency has a TODO
> comment, but it's unclear to me if you are going to implement it or if the
> patch "needs review". Right now it's "Waiting on Author".
>

Please find the attached new version. In this patchset heapallindexed
flag is removed from GIN checks.

Thank you!

Best regards, Andrey Borodin.


v18-0003-Add-gin_index_parent_check-to-verify-GIN-index.patch
Description: Binary data


v18-0002-Add-gist_index_parent_check-function-to-verify-G.patch
Description: Binary data


v18-0001-Refactor-amcheck-to-extract-common-locking-routi.patch
Description: Binary data


Strengthen pg_waldump's --save-fullpage tests

2023-01-08 Thread Bharath Rupireddy
Hi,

A recent commit [1] added a --save-fullpage option to pg_waldump to
extract full page images (FPI) from WAL records and save them into
files (one file per FPI) under a specified directory. While it added
tests to check the LSN from the FPI file name and the FPI file
contents, it did not go further and check the FPI contents, such as
the tuples on the page. I'm attaching a patch that reads the contents
of the FPI file (saved by pg_waldump) and the raw page from the table
(using the pageinspect extension) and compares the tuples from both.
This test ensures that pg_waldump outputs the correct FPI. This idea
is also discussed elsewhere [2].

Thoughts?

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d497093cbecccf6df26365e06a5f8f8614b591c8
[2] 
https://www.postgresql.org/message-id/calj2acxesn9dtjgsekm8fig7cxhhxqfqp4fcisjgcmp9wrz...@mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From f714f13f47f98c7b9a4f88539c76862cacaffe49 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 7 Jan 2023 08:29:28 +
Subject: [PATCH v1] Strengthen pg_waldump's --save-fullpage tests

---
 src/bin/pg_waldump/Makefile   |  2 ++
 src/bin/pg_waldump/t/002_save_fullpage.pl | 34 +--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_waldump/Makefile b/src/bin/pg_waldump/Makefile
index d6459e17c7..924fb6ad20 100644
--- a/src/bin/pg_waldump/Makefile
+++ b/src/bin/pg_waldump/Makefile
@@ -3,6 +3,8 @@
 PGFILEDESC = "pg_waldump - decode and display WAL"
 PGAPPICON=win32
 
+EXTRA_INSTALL = contrib/pageinspect
+
 subdir = src/bin/pg_waldump
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
diff --git a/src/bin/pg_waldump/t/002_save_fullpage.pl b/src/bin/pg_waldump/t/002_save_fullpage.pl
index 5072703a3d..193887a115 100644
--- a/src/bin/pg_waldump/t/002_save_fullpage.pl
+++ b/src/bin/pg_waldump/t/002_save_fullpage.pl
@@ -38,16 +38,24 @@ max_wal_senders = 4
 });
 $node->start;
 
+$node->safe_psql('postgres', q(CREATE EXTENSION pageinspect));
+
 # Generate data/WAL to examine that will have full pages in them.
 $node->safe_psql(
 	'postgres',
 	"SELECT 'init' FROM pg_create_physical_replication_slot('regress_pg_waldump_slot', true, false);
-CREATE TABLE test_table AS SELECT generate_series(1,100) a;
+CREATE TABLE test_table AS SELECT generate_series(1,2) a;
 -- Force FPWs on the next writes.
 CHECKPOINT;
-UPDATE test_table SET a = a + 1;
+UPDATE test_table SET a = a * 100 WHERE a = 1;
 ");
 
+# Get raw page from the table.
+my $page_from_table = $node->safe_psql(
+	'postgres',
+	q(SELECT get_raw_page('test_table', 0))
+);
+
 ($walfile_name, $blocksize) = split '\|' => $node->safe_psql('postgres',
 	"SELECT pg_walfile_name(pg_switch_wal()), current_setting('block_size')");
 
@@ -108,4 +116,26 @@ for my $fullpath (glob "$tmp_folder/raw/*")
 
 ok($file_count > 0, 'verify that at least one block has been saved');
 
+# Check that the pg_waldump saves FPI file.
+my @fpi_files = glob "$tmp_folder/raw/*_main";
+is(scalar(@fpi_files), 1, 'one FPI file was created');
+
+# Read the content of the saved FPI file.
+my $page_from_fpi_file = $node->safe_psql(
+	'postgres',
+	qq{SELECT pg_read_binary_file('$fpi_files[0]')}
+);
+
+# Compare the raw page from table and FPI file saved, they must contain same rows.
+my $result = $node->safe_psql(
+	'postgres',
+	qq{SELECT tuple_data_split('test_table'::regclass, t_data, t_infomask, t_infomask2, t_bits)
+		FROM heap_page_items('$page_from_table')
+	  EXCEPT
+	  SELECT tuple_data_split('test_table'::regclass, t_data, t_infomask, t_infomask2, t_bits)
+		FROM heap_page_items('$page_from_fpi_file')}
+);
+
+is($result, '', 'verify that rows in raw page from table and FPI file saved are same');
+
 done_testing();
-- 
2.34.1



Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-08 Thread Peter Geoghegan
On Sun, Jan 8, 2023 at 4:27 PM Peter Geoghegan  wrote:
> We're vulnerable to allowing "all-frozen but not all-visible"
> inconsistencies because of two factors: this business with not passing
> VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to
> visibilitymap_set(), *and* the use of all_visible_according_to_vm to
> set the VM (a local variable that can go stale). We sort of assume
> that all_visible_according_to_vm cannot go stale here without our
> detecting it. That's almost always the case, but it's not quite
> guaranteed.

On further reflection even v2 won't repair the page-level
PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we
might actually leave the all-frozen bit set in the VM, while both the
all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset.
Again, all due to the approach we take with
all_visible_according_to_vm, which can go stale independently of both
the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my
example problem scenario).

FWIW I don't have this remaining problem in my VACUUM
freezing/scanning strategies patchset. It just gets rid of
all_visible_according_to_vm altogether, which makes things a lot
simpler at the point that we set VM bits at the end of lazy_scan_heap
-- there is nothing left that can become stale. Quite a lot of the
code is just removed; there is exactly one call to visibilitymap_set()
at the end of lazy_scan_heap with the patchset, that does everything
we need.

The patchset also has logic for setting PD_ALL_VISIBLE when it needs
to be set, which isn't (and shouldn't be) conditioned on whether we're
doing an "all-visible -> all-frozen" transition or a "neither ->
all-visible" transition. What it actually needs to be conditioned on is
whether it's unset now, and so needs to be set in passing, as part of
setting one or both VM bits -- simple as that.

-- 
Peter Geoghegan




Re: doc: add missing "id" attributes to extension packaging page

2023-01-08 Thread vignesh C
On Mon, 9 Jan 2023 at 08:01, vignesh C  wrote:
>
> On Wed, 4 Jan 2023 at 04:13, Karl O. Pinc  wrote:
> >
> > On Tue, 3 Jan 2023 21:35:09 +0100
> > Brar Piening  wrote:
> >
> > > On 02.01.2023 at 21:53, Karl O. Pinc wrote:
> > > > If the author will look over my version of the patch I believe it
> > > > can be approved and sent on to the committers.
> > >
> > > LGTM.
> >
> > Approved for committer!
>
> The patch does not apply on top of HEAD as in [1], please post a rebased 
> patch:
> === Applying patches on top of PostgreSQL commit ID
> cd9479af2af25d7fa9bfd24dd4dcf976b360f077 ===
> === applying patch ./add_html_ids_v2.patch
> 
> patching file doc/src/sgml/ref/pgbench.sgml
> patching file doc/src/sgml/ref/psql-ref.sgml
> Hunk #73 FAILED at 1824.
> 
> 2 out of 208 hunks FAILED -- saving rejects to file
> doc/src/sgml/ref/psql-ref.sgml.rej
>
> [1] - http://cfbot.cputube.org/patch_41_4041.log

There are a couple of commitfest entries for this:
https://commitfest.postgresql.org/41/4041/
https://commitfest.postgresql.org/41/4042/

Can one of them be closed?

Regards,
Vignesh




Re: doc: add missing "id" attributes to extension packaging page

2023-01-08 Thread vignesh C
On Wed, 4 Jan 2023 at 04:13, Karl O. Pinc  wrote:
>
> On Tue, 3 Jan 2023 21:35:09 +0100
> Brar Piening  wrote:
>
> > On 02.01.2023 at 21:53, Karl O. Pinc wrote:
> > > If the author will look over my version of the patch I believe it
> > > can be approved and sent on to the committers.
> >
> > LGTM.
>
> Approved for committer!

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
cd9479af2af25d7fa9bfd24dd4dcf976b360f077 ===
=== applying patch ./add_html_ids_v2.patch

patching file doc/src/sgml/ref/pgbench.sgml
patching file doc/src/sgml/ref/psql-ref.sgml
Hunk #73 FAILED at 1824.

2 out of 208 hunks FAILED -- saving rejects to file
doc/src/sgml/ref/psql-ref.sgml.rej

[1] - http://cfbot.cputube.org/patch_41_4041.log

Regards,
Vignesh




Re: [PATCH] random_normal function

2023-01-08 Thread Tom Lane
I wrote:
> Also, I tried running the new random.sql regression cases over
> and over, and found that the "not all duplicates" test fails about
> one time in 10 or so.  We could probably tolerate that given
> that the random test is marked "ignore" in parallel_schedule, but
> I thought it best to add one more iteration so we could knock the
> odds down.

Hmm ... it occurred to me to try the same check on the existing
random() tests (attached), and darn if they don't fail even more
often, usually within 50K iterations.  So maybe we should rethink
that whole thing.

regards, tom lane

\timing on

create table if not exists random_tbl (random bigint);

do $$
begin
for i in 1..100 loop
TRUNCATE random_tbl;

INSERT INTO RANDOM_TBL (random)
  SELECT count(*) AS random
  FROM onek WHERE random() < 1.0/10;

-- select again, the count should be different
INSERT INTO RANDOM_TBL (random)
  SELECT count(*)
  FROM onek WHERE random() < 1.0/10;

-- select again, the count should be different
INSERT INTO RANDOM_TBL (random)
  SELECT count(*)
  FROM onek WHERE random() < 1.0/10;

-- select again, the count should be different
INSERT INTO RANDOM_TBL (random)
  SELECT count(*)
  FROM onek WHERE random() < 1.0/10;

-- now test that they are different counts
if (select true FROM RANDOM_TBL
  GROUP BY random HAVING count(random) > 3) then
raise notice 'duplicates at iteration %', i;
exit;
end if;

-- approximately check expected distribution
if (select true FROM RANDOM_TBL
  HAVING AVG(random) NOT BETWEEN 80 AND 120) then
raise notice 'range at iteration %', i;
exit;
end if;

end loop;
end $$;


Re: Decoupling antiwraparound autovacuum from special rules around auto cancellation

2023-01-08 Thread Peter Geoghegan
On Thu, Dec 29, 2022 at 7:01 PM Peter Geoghegan  wrote:
> Attached is v2, which is just to fix bitrot.

Attached is v3. We no longer apply vacuum_failsafe_age when
determining the cutoff for antiwraparound autovacuuming -- the new
approach is a bit simpler.

This is a fairly small change overall. Now any "table age driven"
autovacuum will also be antiwraparound when its
relfrozenxid/relminmxid attains an age that's either double the
relevant setting (either autovacuum_freeze_max_age or
effective_multixact_freeze_max_age), or 1 billion XIDs/MXIDs --
whichever is less.
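
To illustrate the rule as described (a hand-rolled sketch with made-up
names, not the patch's actual code):

#include <stdint.h>

/*
 * A table-age-driven autovacuum also becomes antiwraparound once the
 * relfrozenxid/relminmxid age reaches double the configured
 * *_freeze_max_age, capped at 1 billion XIDs/MXIDs.
 */
static int64_t
antiwraparound_trigger_age(int64_t freeze_max_age)
{
	int64_t		doubled = freeze_max_age * 2;

	return (doubled < 1000000000) ? doubled : 1000000000;
}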

That makes it completely impossible to disable antiwraparound
protections (the special antiwrap autocancellation behavior) for
table-age-driven autovacuums once table age exceeds 1 billion
XIDs/MXIDs. It's still possible to increase autovacuum_freeze_max_age
to well over a billion, of course. It just won't be possible to do
that while also avoiding the no-auto-cancellation behavior for those
autovacuums that are triggered due to table age crossing the
autovacuum_freeze_max_age/effective_multixact_freeze_max_age
threshold.

--
Peter Geoghegan
From 6d78e62226608683d9882b4507d37442164267a1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Fri, 25 Nov 2022 11:23:20 -0800
Subject: [PATCH v3] Add "table age" trigger concept to autovacuum.

Teach autovacuum.c to launch "table age" autovacuums at the same point
that it previously triggered antiwraparound autovacuums.  Antiwraparound
autovacuums are retained, but are only used as a true option of last
resort, when regular autovacuum has presumably tried and failed to
advance relfrozenxid (likely because the auto-cancel behavior kept
cancelling regular autovacuums triggered based on table age).

The special auto-cancellation behavior applied by antiwraparound
autovacuums is known to cause problems in the field, so it makes sense
to avoid it, at least until the point where it starts to look like a
proportionate response.  Besides, the risk of the system eventually
triggering xidStopLimit because of cancellations is a lot lower than it
was back when the current auto-cancellation behavior was added by commit
acac68b2.  For example, there was no visibility map, so restarting
antiwraparound autovacuum meant that the next autovacuum would get very
little benefit from the work performed by earlier cancelled autovacuums.

Also add new instrumentation that lists a triggering condition in the
server log whenever an autovacuum is logged.  This reports "table age"
as the triggering criteria when regular (not antiwraparound) autovacuum
runs because we need to advance the table's age.  In other cases it will
report an autovacuum was launched due to the tuple insert thresholds or
the dead tuple thresholds.  Note that pg_stat_activity doesn't have any
special instrumentation for regular autovacuums that happen to have been
triggered based on table age (since it really is just another form of
autovacuum, and follows exactly the same rules in vacuumlazy.c and in
proc.c).

Author: Peter Geoghegan 
Reviewed-By: Jeff Davis 
Discussion: https://postgr.es/m/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com
---
 src/include/commands/vacuum.h   |  18 ++-
 src/include/storage/proc.h  |   2 +-
 src/backend/access/heap/vacuumlazy.c|  14 ++
 src/backend/access/heap/visibilitymap.c |   5 +-
 src/backend/access/transam/multixact.c  |   4 +-
 src/backend/commands/vacuum.c   |  18 ++-
 src/backend/postmaster/autovacuum.c | 207 +---
 src/backend/storage/lmgr/proc.c |   4 +-
 8 files changed, 197 insertions(+), 75 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb770..b70f69fd9 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -191,6 +191,21 @@ typedef struct VacAttrStats
 #define VACOPT_SKIP_DATABASE_STATS 0x100	/* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x200	/* only vac_update_datfrozenxid() */
 
+/*
+ * Values used by autovacuum.c to tell vacuumlazy.c about the specific
+ * threshold type that triggered an autovacuum worker.
+ *
+ * AUTOVACUUM_NONE is used when VACUUM isn't running in an autovacuum worker.
+ */
+typedef enum AutoVacType
+{
+	AUTOVACUUM_NONE = 0,
+	AUTOVACUUM_TABLE_XID_AGE,
+	AUTOVACUUM_TABLE_MXID_AGE,
+	AUTOVACUUM_DEAD_TUPLES,
+	AUTOVACUUM_INSERTED_TUPLES,
+} AutoVacType;
+
 /*
  * Values used by index_cleanup and truncate params.
  *
@@ -222,7 +237,8 @@ typedef struct VacuumParams
 			 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 			 * whole table */
-	bool		is_wraparound;	/* force a for-wraparound vacuum */
+	bool		is_wraparound;	/* antiwraparound autovacuum? */
+	AutoVacType	trigger;		/* Autovacuum trigger condition, if any */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 	 * which autovacuum is logged, -1 to use
 	 * defaul

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-08 Thread Peter Geoghegan
On Sun, Jan 8, 2023 at 3:53 PM Andres Freund  wrote:
> How?

See the commit message for the scenario I have in mind, which involves
a concurrent HOT update that aborts.

We're vulnerable to allowing "all-frozen but not all-visible"
inconsistencies because of two factors: this business with not passing
VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to
visibilitymap_set(), *and* the use of all_visible_according_to_vm to
set the VM (a local variable that can go stale). We sort of assume
that all_visible_according_to_vm cannot go stale here without our
detecting it. That's almost always the case, but it's not quite
guaranteed.

> visibilitymap_set() just adds flags, it doesn't remove any already
> existing bits:

I know. The concrete scenario I have in mind is very subtle (if the
problem was this obvious I'm sure somebody would have noticed it by
now, since we do hit this visibilitymap_set() call site reasonably
often). A concurrent HOT update will certainly clear all the bits for
us, which is enough.

> You have plenty of changes like this, which are afaict entirely unrelated to
> the issue the commit is fixing, in here. It just makes it hard to review the
> patch.

I didn't think that it was that big of a deal to tweak the style of
one or two details in and around lazy_vacuum_heap_rel() in passing,
for consistency with lazy_scan_heap(), since the patch already needs
to do some of that. I do take your point, though.

-- 
Peter Geoghegan




Re: [PATCH] random_normal function

2023-01-08 Thread Tom Lane
I wrote:
> So the problem in this patch is that it's trying to include
> utils/float.h in a src/common file, where we have not included
> postgres.h.  Question is, why did you do that?

(Ah, for M_PI ... but our practice is just to duplicate that #define
where needed outside the backend.)

I spent some time reviewing this patch.  I'm on board with
inventing random_normal(): the definition seems solid and
the use-case for it seems reasonably well established.
I'm not necessarily against inventing similar functions for
other distributions, but this patch is not required to do so.
We can leave that discussion until somebody is motivated to
submit a patch for one.

On the other hand, I'm much less on board with inventing
random_string(): we don't have any clear field demand for it
and the appropriate definitional details are a lot less obvious
(for example, whether it needs to be based on pg_strong_random()
rather than the random() sequence).  I think we should leave that
out, and I have done so in the attached updated patch.

I noted several errors in the submitted patch.  It was creating
the function as PARALLEL SAFE which is just wrong, and the whole
business with checking PG_NARGS is useless because it will always
be 2.  (That's not how default arguments work.)

The business with checking against DBL_EPSILON seems wrong too.
All we need is to ensure that u1 > 0 so that log(u1) will not
choke; per spec, log() is defined for any positive input.  I see that
that seems to have been modeled on the C++ code in the Wikipedia
page, but I'm not sure that C++'s epsilon means the same thing, and
if it does then their example code is just wrong.  See the discussion
about "tails truncation" immediately above it: artificially
constraining the range of u1 just limits how much of the tail
of the distribution we can reproduce.  So that led me to doing
it the same way as in the existing Box-Muller code in pgbench,
which I then deleted per Fabien's advice.

BTW, the pgbench code was using sin() not cos(), which I duplicated
because using cos() causes the expected output of the pgbench tests
to change.  I'm not sure whether there was any hard reason to prefer
one or the other, and we can certainly change the expected output
if there's some reason to prefer cos().
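
For readers following along, here is a minimal standalone sketch of that
Box-Muller formulation (hypothetical names, plain libc rand() instead of
the backend PRNG; not the actual float.c code): u1 is constrained only to
be strictly positive, with no epsilon clamp.

#include <math.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* uniform deviate in (0, 1]: skip zero so that log(u1) is well-defined */
static double
uniform_nonzero(void)
{
	double		u;

	do
	{
		u = (double) rand() / (double) RAND_MAX;
	} while (u <= 0.0);

	return u;
}

/* Box-Muller: mean + stddev * sqrt(-2 ln u1) * sin(2 pi u2) */
static double
random_normal_sketch(double mean, double stddev)
{
	double		u1 = uniform_nonzero();
	double		u2 = (double) rand() / (double) RAND_MAX;

	return mean + stddev * sqrt(-2.0 * log(u1)) * sin(2.0 * M_PI * u2);
}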

I concur with not worrying about the Inf/NaN cases that Mark
pointed out.  It's not obvious that the results the proposed code
produces are wrong, and it's even less obvious that anyone will
ever care.

Also, I tried running the new random.sql regression cases over
and over, and found that the "not all duplicates" test fails about
one time in 10 or so.  We could probably tolerate that given
that the random test is marked "ignore" in parallel_schedule, but
I thought it best to add one more iteration so we could knock the
odds down.  Also I changed the test iterations so they weren't
all invoking random_normal() in exactly the same way.

This version seems committable to me, barring objections.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3bf8d021c3..b67dc26a35 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1815,6 +1815,28 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ random_normal
+
+
+ random_normal (
+  mean double precision
+ , stddev double precision  )
+ double precision
+   
+   
+Returns a random value from the normal distribution with the given
+parameters; mean defaults to 0.0
+and stddev defaults to 1.0
+   
+   
+random_normal(0.0, 1.0)
+0.051285419
+   
+  
+
   

 
@@ -1824,7 +1846,8 @@ repeat('Pg', 4) PgPgPgPg
 void


-Sets the seed for subsequent random() calls;
+Sets the seed for subsequent random() and
+random_normal() calls;
 argument must be between -1.0 and 1.0, inclusive


@@ -1848,6 +1871,7 @@ repeat('Pg', 4) PgPgPgPg
Without any prior setseed() call in the same
session, the first random() call obtains a seed
from a platform-dependent source of random bits.
+   These remarks hold equally for random_normal().
   
 
   
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index f2470708e9..83ca893444 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -66,6 +66,13 @@ CREATE OR REPLACE FUNCTION bit_length(text)
  IMMUTABLE PARALLEL SAFE STRICT COST 1
 RETURN octet_length($1) * 8;
 
+CREATE OR REPLACE FUNCTION
+ random_normal(mean float8 DEFAULT 0, stddev float8 DEFAULT 1)
+ RETURNS float8
+ LANGUAGE internal
+ VOLATILE PARALLEL RESTRICTED STRICT COST 1
+AS 'drandom_normal';
+
 CREATE OR REPLACE FUNCTION log(numeric)
  RETURNS numeric
  LANGUAGE sql
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-08 Thread Andres Freund
Hi,

On 2023-01-08 14:39:19 -0800, Peter Geoghegan wrote:
> One of the calls to visibilitymap_set() during VACUUM's initial heap
> pass could unset a page's all-visible bit during the process of setting
> the same page's all-frozen bit.

How? visibilitymap_set() just adds flags, it doesn't remove any already
existing bits:

map[mapByte] |= (flags << mapOffset);

It'll afaict lead to potentially unnecessary WAL records though, which does
seem buggy:
if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))

here we check for *equivalence*, but then below we just OR in the flags. So
visibilitymap_set() with just one of the flag bits set in the parameters,
but both set on the page, would end up WAL-logging unnecessarily.




> @@ -2388,8 +2398,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
>  static void
>  lazy_vacuum_heap_rel(LVRelState *vacrel)
>  {
> - int index;
> - BlockNumber vacuumed_pages;
> + int index = 0;
> + BlockNumber vacuumed_pages = 0;
>   Buffer  vmbuffer = InvalidBuffer;
>   LVSavedErrInfo saved_err_info;
>  
> @@ -2406,42 +2416,42 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
>
> VACUUM_ERRCB_PHASE_VACUUM_HEAP,
>InvalidBlockNumber, 
> InvalidOffsetNumber);
>  
> - vacuumed_pages = 0;
> -
> - index = 0;


> @@ -2473,12 +2484,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
>   */
>  static int
>  lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
> -   int index, Buffer *vmbuffer)
> +   int index, Buffer vmbuffer)
>  {
>   VacDeadItems *dead_items = vacrel->dead_items;
>   Pagepage = BufferGetPage(buffer);
>   OffsetNumber unused[MaxHeapTuplesPerPage];
> - int uncnt = 0;
> + int nunused = 0;
>   TransactionId visibility_cutoff_xid;
>   boolall_frozen;
>   LVSavedErrInfo saved_err_info;
> @@ -2508,10 +2519,10 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber 
> blkno, Buffer buffer,
>  
>   Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
>   ItemIdSetUnused(itemid);
> - unused[uncnt++] = toff;
> + unused[nunused++] = toff;
>   }
>  
> - Assert(uncnt > 0);
> + Assert(nunused > 0);
>  
>   /* Attempt to truncate line pointer array now */
>   PageTruncateLinePointerArray(page);
> @@ -2527,13 +2538,13 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber 
> blkno, Buffer buffer,
>   xl_heap_vacuum xlrec;
>   XLogRecPtr  recptr;
>  
> - xlrec.nunused = uncnt;
> + xlrec.nunused = nunused;
>  
>   XLogBeginInsert();
>   XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
>  
>   XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
> - XLogRegisterBufData(0, (char *) unused, uncnt * 
> sizeof(OffsetNumber));
> + XLogRegisterBufData(0, (char *) unused, nunused * 
> sizeof(OffsetNumber));
>  
>   recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
>  

You have plenty of changes like this, which are afaict entirely unrelated to
the issue the commit is fixing, in here. It just makes it hard to review the
patch.

Greetings,

Andres Freund




Re: Missing update of all_hasnulls in BRIN opclasses

2023-01-08 Thread Tomas Vondra
Thanks Justin! I've applied all the fixes you proposed, and (hopefully)
improved a couple other comments.

I've been working on this over the past couple days, trying to polish
and commit it over the weekend - both into master and backbranches.
Sadly, the backpatching part turned out to be a bit more complicated
than I expected, because of the BRIN reworks in PG14 (done by me, as
foundation for the new opclasses, so ... well).

Anyway, I got it done, but it's a bit uglier than I hoped for and I
don't feel like pushing this on Sunday midnight. I think it's correct,
but maybe another pass to polish it a bit more is better.

So here are two patches - one for 11-13, the other for 14-master.

There's also a separate patch with pageinspect tests, but only as a
demonstration of various (non)broken cases, not for commit. And then
also a bash script generating indexes with random data, randomized
summarization etc. - on unpatched systems this happens to fail in about
1/3 of the runs (at least for me). I haven't seen any failures with the
patches attached (on any branch).

As for the issue / fix, I don't think there's a better solution than
what the patch does - we need to distinguish empty / all-nulls ranges,
but we can't add a flag because of on-disk format / ABI. So using the
existing flags seems like the only option - I haven't heard any other
ideas so far, and I couldn't come up with any myself either.

I've also thought about alternative "encodings" into allnulls/hasnulls,
instead of treating (true,true) as "empty" - but none of that ended up
being any simpler, quite the opposite actually, as it would change what
the individual flags mean etc. So AFAICS this is the best / least
disruptive option.
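
As a rough illustration of that encoding (a hypothetical helper, not the
actual BRIN code):

#include <stdbool.h>

/*
 * With no room for a new on-disk flag, "empty range" is represented by the
 * otherwise-unused combination allnulls = true, hasnulls = true.
 */
typedef enum BrinRangeState
{
	RANGE_REGULAR,				/* allnulls = false */
	RANGE_ALL_NULLS,			/* allnulls = true, hasnulls = false */
	RANGE_EMPTY					/* allnulls = true, hasnulls = true */
} BrinRangeState;

static BrinRangeState
brin_range_state(bool allnulls, bool hasnulls)
{
	if (!allnulls)
		return RANGE_REGULAR;

	return hasnulls ? RANGE_EMPTY : RANGE_ALL_NULLS;
}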

I went over all the places touching these flags, to double check if any
of those needs some tweaks (similar to union_tuples, which I missed for
a long time). But I haven't found anything else, so I think this version
of the patches is complete.

As for assessing how many indexes are affected - in principle, any index
on columns with NULLs may be broken. But it only matters if the index is
used for IS NULL queries; other queries are not affected.

I also realized that this only affects insertion of individual tuples
into existing all-null summaries, not "bulk" summarization that sees all
values at once. This happens because in this case add_values_to_range
sets hasnulls=true for the first (NULL) value, and then calls the
addValue procedure for the second (non-NULL) one, which resets the
allnulls flag to false.

But when inserting individual rows, we first set hasnulls=true, but
brin_form_tuple ignores that because of allnulls=true. And then when
inserting the second row, we start with hasnulls=false again, and the
opclass quietly resets the allnulls flag.

I guess this further reduces the number of broken indexes, especially
for data sets with small null_frac, or for append-only (or -mostly)
tables where most of the summarization is bulk.

I still feel a bit uneasy about this, but I think the patch is solid.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 033286cff9733c24fdc7c3f774d947c9f1072aa0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 8 Jan 2023 23:42:41 +0100
Subject: [PATCH] extra tests

---
 contrib/pageinspect/Makefile|   2 +-
 contrib/pageinspect/expected/brin-fails.out | 152 
 contrib/pageinspect/sql/brin-fails.sql  |  85 +++
 3 files changed, 238 insertions(+), 1 deletion(-)
 create mode 100644 contrib/pageinspect/expected/brin-fails.out
 create mode 100644 contrib/pageinspect/sql/brin-fails.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index 95e030b396..69a28a6d3d 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -22,7 +22,7 @@ DATA =  pageinspect--1.11--1.12.sql pageinspect--1.10--1.11.sql \
 	pageinspect--1.0--1.1.sql
 PGFILEDESC = "pageinspect - functions to inspect contents of database pages"
 
-REGRESS = page btree brin gin gist hash checksum oldextversions
+REGRESS = page btree brin gin gist hash checksum oldextversions brin-fails
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pageinspect/expected/brin-fails.out b/contrib/pageinspect/expected/brin-fails.out
new file mode 100644
index 00..a12c761fc8
--- /dev/null
+++ b/contrib/pageinspect/expected/brin-fails.out
@@ -0,0 +1,152 @@
+create table t (a int);
+create extension pageinspect;
+-- works
+drop index if exists t_a_idx;
+NOTICE:  index "t_a_idx" does not exist, skipping
+truncate t;
+insert into t values (null), (1);
+create index on t using brin (a);
+select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx'::regclass);
+ itemoffset | blknum | attnum | allnulls | hasnulls | placeholder |  value   
++++--+--+-+--
+  1 |  0 |  1 | f| t| f

Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

2023-01-08 Thread Peter Geoghegan
On Mon, Jan 2, 2023 at 10:31 AM Peter Geoghegan  wrote:
> Would be helpful if I could get a +1 on
> v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is
> somewhat more substantial than the others.

There has been no response on this thread for over a full week at this
point. I'm CC'ing Robert now, since the bug is from his commit
a892234f83.

Attached revision of the "don't unset all-visible bit while unsetting
all-frozen bit" patch adds some assertions that verify that
visibility_cutoff_xid is InvalidTransactionId as expected when we go
to set any page all-frozen in the VM. It also broadens an existing
nearby test for corruption, which gives us some chance of detecting
and repairing corruption of this sort that might have slipped in in
the field.

My current plan is to commit something like this in another week or
so, barring any objections.

-- 
Peter Geoghegan


v2-0001-Don-t-accidentally-unset-all-visible-bit-in-VM.patch
Description: Binary data


Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-08 Thread David Rowley
On Mon, 9 Jan 2023 at 06:17, Ankit Kumar Pandey  wrote:
> I have addressed all points 1-6 in the attached patch.

A few more things:

1. You're still using the 'i' variable in the foreach loop.
foreach_current_index() will work.

2. I think the "index" variable needs a better name. sort_pushdown_idx maybe.

3. I don't think you need to set "index" on every loop. Why not just
set it to foreach_current_index(l) - 1; break;

4. You're still setting orderby_pathkeys in the foreach loop. That's
already been set above and it won't have changed.

5. I don't think there's any need to check pathkeys_contained_in() in
the foreach loop anymore. With #3 the index will be -1 if the
optimisation cannot apply. You could likely also get rid of
try_sort_pushdown too and just make the condition "if
(sort_pushdown_idx == foreach_current_index(l))".  I'm a little unsure
why there's still the is_sorted check there.  Shouldn't that always be
false now that you're looping until the pathkeys don't match in the
foreach_reverse loop?

Correct me if I'm wrong as I've not tested, but I think the new code
in the foreach loop can just become:

if (sort_pushdown_idx == foreach_current_index(l))
{
Assert(!is_sorted);
window_pathkeys = orderby_pathkeys;
is_sorted = pathkeys_count_contained_in(window_pathkeys,
path->pathkeys, &presorted_keys);
}


> I have one doubt regarding runCondition: do we only need to ensure
> that the window function whose sort clause is a subset of the main query's
> should not have a runCondition, or that none of the window functions should
> contain a runCondition? I have gone with the latter case but wanted to clarify.

Actually, maybe it's ok just to check the top-level WindowClause for
runConditions. It's only that one that'll filter rows.  That probably
simplifies the code quite a bit. Lower-level runConditions only serve
to halt the evaluation of WindowFuncs when the runCondition is no
longer met.

>
>
> > Also, do you want to have a go at coding up the sort bound pushdown
> > too?  It'll require removing the limitCount restriction and adjusting
> > ExecSetTupleBound() to recurse through a WindowAggState. I think it's
> > pretty easy. You could try it then play around with it to make sure it
> > works and we get the expected performance.
>
> I tried this in the patch but kept getting `retrieved too many tuples in
> a bounded sort`.
>
> Added the following code in ExecSetTupleBound, which correctly found the
> sortstate and set the bound value:
>
> else if (IsA(child_node, WindowAggState))
> {
>     WindowAggState *winstate = (WindowAggState *) child_node;
>
>     if (outerPlanState(winstate))
>         ExecSetTupleBound(tuples_needed, outerPlanState(winstate));
> }
>
> I think the problem is that we are not using a limit clause inside the window
> function (which may need to scan all tuples), so passing the bound value to
> WindowAggState->sortstate is not working as we might expect. Or maybe I am
> getting it wrong? I was trying to have a top-N sort for the limit clause if
> the order-by pushdown happens.

hmm, perhaps the Limit would have to be put between the WindowAgg and
Sort for it to work.  Maybe that's more complexity than it's worth.

David




Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-01-08 Thread Alexander Korotkov
Hi, Mason!

Thank you very much for your review.

On Sun, Jan 8, 2023 at 9:33 PM Mason Sharp  wrote:
> On Fri, Jan 6, 2023 at 4:46 AM Pavel Borisov  wrote:
>> Besides, the new version has only some minor changes in the comments
>> and the commit message.
> It looks good, and the greater the concurrency the greater the benefit will 
> be. Just a few minor suggestions regarding comments.
>
> "ExecDeleteAct() have already locked the old tuple for us", change "have" to 
> "has".
>
> The comments in heapam_tuple_delete() and heapam_tuple_update() might be a 
> little clearer with something like:
>
> "If the tuple has been concurrently updated, get lock already so that on
> retry it will succeed, provided that the caller asked to do this by
> providing a lockedSlot."

Thank you.  These changes are incorporated into v6 of the patch.

> Also, not too important, but perhaps better clarify in the commit message 
> that the repeated work is driven by ExecUpdate and ExecDelete and can happen 
> multiple times depending on the concurrency.

Hmm... It can't happen an arbitrary number of times.  If the tuple was
concurrently updated, then we lock it.  Once we lock it, nobody can change
it until we finish our work.  So, I think no changes are needed.

I'm going to push this if no objections.

--
Regards,
Alexander Korotkov


0001-Allow-locking-updated-tuples-in-tuple_update-and--v6.patch
Description: Binary data


Re: on placeholder entries in view rule action query's range table

2023-01-08 Thread Tom Lane
Amit Langote  writes:
> I've attached just the patch that we should move forward with, as
> Alvaro might agree.

I've looked at this briefly but don't like it very much, specifically
the business about retroactively adding an RTE and RTEPermissionInfo
into the view's replacement subquery.  That seems expensive and bug-prone:
if you're going to do that you might as well just leave the OLD entry
in place, as the earlier patch did, because you're just reconstructing
that same state of the parsetree a bit later on.

Furthermore, if that's where we end up then I'm not really sure this is
worth doing at all.  The idea driving this was that if we could get rid
of *both* OLD and NEW RTE entries then we'd not have O(N^2) behavior in
deep subquery pull-ups due to the rangetable getting longer with each one.
But getting it down from two extra entries to one extra entry isn't going
to fix that big-O problem.  (The patch as presented still has O(N^2)
planning time for the nested-views example discussed earlier.)

Conceivably we could make it work by allowing RTE_SUBQUERY RTEs to
carry a relation OID and associated RTEPermissionInfo, so that when a
view's RTE_RELATION RTE is transmuted to an RTE_SUBQUERY RTE it still
carries the info needed to let us lock and permission-check the view.
That might be a bridge too far from the ugliness perspective ...
although certainly the business with OLD and/or NEW RTEs isn't very
pretty either.

Anyway, if you don't feel like tackling that then I'd go back to the
patch version that kept the OLD RTE.  (Maybe we could rename that to
something else, though, such as "*VIEW*"?)

BTW, I don't entirely understand why this patch is passing regression
tests, because it's failed to deal with numerous places that have
hard-wired knowledge about these extra RTEs.  Look for references to
PRS2_OLD_VARNO and PRS2_NEW_VARNO.  I think the original rationale
for UpdateRangeTableOfViewParse was that we needed to keep the rtables
of ON SELECT rules looking similar to those of other types of rules.
Maybe we've cleaned up all the places that used to depend on that,
but I'm not convinced.

regards, tom lane




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-08 Thread David Rowley
On Mon, 9 Jan 2023 at 05:06, Vik Fearing  wrote:
> +EXPLAIN (COSTS OFF)
> +SELECT empno,
> +   depname,
> +   min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
> +   sum(salary) OVER (PARTITION BY depname) depsalary,
> +   count(*) OVER (ORDER BY enroll_date DESC) c
> +FROM empsalary
> +ORDER BY depname, empno, enroll_date;
> +  QUERY PLAN
> +--
> + WindowAgg
> +   ->  WindowAgg
> + ->  Sort
> +   Sort Key: depname, empno, enroll_date
> +   ->  WindowAgg
> + ->  Sort
> +   Sort Key: enroll_date DESC
> +   ->  Seq Scan on empsalary

> Why aren't min() and sum() calculated on the same WindowAgg run?

We'd need to have an ORDER BY per WindowFunc rather than per
WindowClause to do that.  The problem is when there is no ORDER BY,
all rows are peers.

There are likely a bunch more optimisations we could do in that
area. I think all the builtin window functions (not aggregates being
used as window functions) don't care about peer rows, so it may be
possible to merge the WindowClauses when the WindowClause being merged
only has window functions that don't care about peer rows.  Not for
this patch though.

David




[PATCH] basebackup: support zstd long distance matching

2023-01-08 Thread Justin Pryzby
In-Reply-To: <20220327205020.gm28...@telsasoft.com>

On Sun, Mar 27, 2022 at 03:50:20PM -0500, Justin Pryzby wrote:
> Here's a patch for zstd --long mode.

Rebased.  I'll add this to the CF.
>From 8385f278e95d7bce7e5eed6bb8fccc1e1db9483d Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 27 Mar 2022 11:55:01 -0500
Subject: [PATCH] basebackup: support zstd long distance matching

First proposed here:
20220327205020.gm28...@telsasoft.com
---
 doc/src/sgml/protocol.sgml| 10 +++-
 doc/src/sgml/ref/pg_basebackup.sgml   |  4 +-
 src/backend/backup/basebackup_zstd.c  | 12 
 src/bin/pg_basebackup/bbstreamer_zstd.c   | 13 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |  9 ++-
 src/bin/pg_verifybackup/t/008_untar.pl|  8 +++
 src/bin/pg_verifybackup/t/010_client_untar.pl |  8 +++
 src/common/compression.c  | 57 ++-
 src/include/common/compression.h  |  2 +
 9 files changed, 118 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 03312e07e25..e07d050517e 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2747,7 +2747,8 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
level.  Otherwise, it should be a comma-separated list of items,
each of the form keyword or
keyword=value. Currently, the supported
-   keywords are level and workers.
+   keywords are level, long and
+   workers.
   
 
   
@@ -2764,6 +2765,13 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
3).
   
 
+  
+   The long keyword enables long-distance matching
+   mode, for improved compression ratio, at the expense of higher memory
+   use.  Long-distance mode is supported only for
+   zstd.
+  
+
   
The workers keyword sets the number of threads
that should be used for parallel compression. Parallel compression
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index db3ad9cd5eb..79d3e657c32 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -424,8 +424,8 @@ PostgreSQL documentation
 level.  Otherwise, it should be a comma-separated list of items,
 each of the form keyword or
 keyword=value.
-Currently, the supported keywords are level
-and workers.
+Currently, the supported keywords are level,
+long, and workers.
 The detail string cannot be used when the compression method
 is specified as a plain integer.

diff --git a/src/backend/backup/basebackup_zstd.c b/src/backend/backup/basebackup_zstd.c
index ac6cac178a0..1bb5820c884 100644
--- a/src/backend/backup/basebackup_zstd.c
+++ b/src/backend/backup/basebackup_zstd.c
@@ -118,6 +118,18 @@ bbsink_zstd_begin_backup(bbsink *sink)
 		   compress->workers, ZSTD_getErrorName(ret)));
 	}
 
+	if ((compress->options & PG_COMPRESSION_OPTION_LONG_DISTANCE) != 0)
+	{
+		ret = ZSTD_CCtx_setParameter(mysink->cctx,
+	 ZSTD_c_enableLongDistanceMatching,
+	 compress->long_distance);
+		if (ZSTD_isError(ret))
+			ereport(ERROR,
+	errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+	errmsg("could not set compression flag for %s: %s",
+		   "long", ZSTD_getErrorName(ret)));
+	}
+
 	/*
 	 * We need our own buffer, because we're going to pass different data to
 	 * the next sink than what gets passed to us.
diff --git a/src/bin/pg_basebackup/bbstreamer_zstd.c b/src/bin/pg_basebackup/bbstreamer_zstd.c
index fe17d6df4ef..fba391e2a0f 100644
--- a/src/bin/pg_basebackup/bbstreamer_zstd.c
+++ b/src/bin/pg_basebackup/bbstreamer_zstd.c
@@ -106,6 +106,19 @@ bbstreamer_zstd_compressor_new(bbstreamer *next, pg_compress_specification *comp
 	 compress->workers, ZSTD_getErrorName(ret));
 	}
 
+	if ((compress->options & PG_COMPRESSION_OPTION_LONG_DISTANCE) != 0)
+	{
+		ret = ZSTD_CCtx_setParameter(streamer->cctx,
+	 ZSTD_c_enableLongDistanceMatching,
+	 compress->long_distance);
+		if (ZSTD_isError(ret))
+		{
+			pg_log_error("could not set compression flag for %s: %s",
+		 "long", ZSTD_getErrorName(ret));
+			exit(1);
+		}
+	}
+
 	/* Initialize the ZSTD output buffer. */
 	streamer->zstd_outBuf.dst = streamer->base.bbs_buffer.data;
 	streamer->zstd_outBuf.size = streamer->base.bbs_buffer.maxlen;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b60cb78a0d5..4d130a7f944 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -139,7 +139,14 @@ SKIP:
 			'gzip:workers=3',
 			'invalid compression specification: compression algorithm "gzip" does not accept a worker count',
 			'failure on worker count for gzip'
-		],);
+		],

Re: drop postmaster symlink

2023-01-08 Thread Karl O. Pinc
On Sat, 7 Jan 2023 22:29:35 -0600
"Karl O. Pinc"  wrote:

> The only way I could think of to review a patch
> that removes something is to report all the places
> I looked where a reference to the symlink might be.

I forgot to report that I also tried a `make install`
and 'make uninstall`, with no problems.

FWIW, I suspect the include/backend/postmaster/ directory
would today be named include/backend/postgres/.   Now
would be the time to change the name, but I don't see
it being worth the work.  And while such a change
wouldn't break pg, that kind of change would break something
for somebody.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: MERGE ... RETURNING

2023-01-08 Thread Isaac Morland
On Sun, 8 Jan 2023 at 07:28, Dean Rasheed  wrote:

> So playing around with it (and inspired by the WITH ORDINALITY syntax
> for SRFs), I had the idea of allowing "WITH WHEN CLAUSE" at the end of
> the returning list, which adds an integer column to the list, whose
> value is set to the index of the when clause executed, as in the
> attached very rough patch.
>

Would it be useful to have just the action? Perhaps "WITH ACTION"? My idea
is that this would return an enum of INSERT, UPDATE, DELETE (so is "action"
the right word?). It seems to me in many situations I would be more likely
to care about which of these 3 happened rather than the exact clause that
applied. This isn't necessarily meant to be instead of your suggestion
because I can imagine wanting to know the exact clause, just an alternative
that might suffice in many situations. Using it would also avoid problems
arising from editing the query in a way which changes the numbers of the
clauses.

> So, quoting an example from the tests, this allows things like:
>
> WITH t AS (
>   MERGE INTO sq_target t USING v ON tid = sid
> WHEN MATCHED AND tid > 2 THEN UPDATE SET balance = t.balance + delta
> WHEN NOT MATCHED THEN INSERT (balance, tid) VALUES (balance + delta,
> sid)
> WHEN MATCHED AND tid < 2 THEN DELETE
> RETURNING t.* WITH WHEN CLAUSE
> )
> SELECT CASE when_clause
>  WHEN 1 THEN 'UPDATE'
>  WHEN 2 THEN 'INSERT'
>  WHEN 3 THEN 'DELETE'
>END, *
> FROM t;
>
>   case  | tid | balance | when_clause
> +-+-+-
>  INSERT |  -1 | -11 |   2
>  DELETE |   1 | 100 |   3
> (2 rows)
>
> 1 row is returned for each merge action executed (other than DO
> NOTHING actions), and as usual, the values represent old target values
> for DELETE actions, and new target values for INSERT/UPDATE actions.
>

Would it be feasible to allow specifying old.column or new.column? These
would always be NULL for INSERT and DELETE respectively but more useful
with UPDATE. Actually I've been meaning to ask this question about UPDATE …
RETURNING.

Actually, even with DELETE/INSERT, I can imagine wanting, for example, to
get only the new values associated with INSERT or UPDATE and not the values
removed by a DELETE. So I can imagine specifying new.column to get NULLs
for any row that was deleted but still get the new values for other rows.

> It's also possible to return the source values, and a bare "*" in the
> returning list expands to all the source columns, followed by all the
> target columns.
>

Does this lead to a problem in the event there are same-named columns
between source and target?

> The name of the added column, if included, can be changed by
> specifying "WITH WHEN CLAUSE [AS] col_alias". I chose the syntax "WHEN
> CLAUSE" and "when_clause" as the default column name because those
> match the existing terminology used in the docs.
>


Re: [RFC] Add jit deform_counter

2023-01-08 Thread Pavel Stehule
Hi

>> I'm not sure why, but pgss jit metrics are always nulls for explain
>> analyze queries. I have noticed this with surprise myself, when I was
>> recently reviewing the lazy jit patch, but haven't yet figured out what
>> the reason is. Anyway, without "explain analyze" you'll get correct
>> deforming numbers in pgss.

It is working, although I am not sure if it is correct.

When I run `explain analyze select count(length(prosrc) > 0) from pg_proc;`,
I get the following plan and timings:

                            QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=154.10..154.11 rows=1 width=8) (actual time=134.450..134.451 rows=1 loops=1)
   ->  Seq Scan on pg_proc  (cost=0.00..129.63 rows=3263 width=16) (actual time=0.013..0.287 rows=3266 loops=1)
 Planning Time: 0.088 ms
 JIT:
   Functions: 3
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.631 ms, Deforming 0.396 ms, Inlining 10.026 ms, Optimization 78.608 ms, Emission 44.915 ms, Total 134.181 ms
 Execution Time: 135.173 ms
(8 rows)

Deforming is 0.396 ms.

When I run the mentioned query and then look at the pg_stat_statements
view, I see different times: deforming is about 10 ms.
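
(For reference, I am reading the values simply with

select * from pg_stat_statements where query like '%prosrc%';

where the jit_deform_* columns are the new ones added by this patch.)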

wal_bytes  │ 0
jit_functions  │ 9
jit_generation_time│ 1.90404099
jit_deform_count   │ 3
jit_deform_time│ 36.395131
jit_inlining_count │ 3
jit_inlining_time  │ 256.104205
jit_optimization_count │ 3
jit_optimization_time  │ 132.4536130002
jit_emission_count │ 3
jit_emission_time  │ 1.210633

The counts are correct, but the times are strange - they are not consistent
with the values from EXPLAIN.

When I run this query on master, the values are correct

 jit_functions  │ 6
 jit_generation_time│ 1.350521
 jit_inlining_count │ 2
 jit_inlining_time  │ 24.0183823
 jit_optimization_count │ 2
 jit_optimization_time  │ 173.405792
 jit_emission_count │ 2
 jit_emission_time  │ 91.226655

 JIT:
   Functions: 3
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.636 ms, Inlining 9.309 ms, Optimization 89.653 ms, Emission 45.812 ms, Total 145.410 ms
 Execution Time: 146.410 ms

Regards

Pavel


Re: Fix gin index cost estimation

2023-01-08 Thread Alexander Korotkov
On Sun, Jan 8, 2023 at 7:08 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > I'm going to push this and backpatch to all supported versions if no 
> > objections.
>
> Push yes, but I'd counsel against back-patching.  People don't
> generally like unexpected plan changes in stable versions, and
> that's what a costing change could produce.  There's no argument
> that we are fixing a failure or wrong answer here, so it doesn't
> seem like back-patch material.

Pushed to master, thank you.  I've noted the reason for not back-patching
in the commit message.

--
Regards,
Alexander Korotkov




Re: Add LZ4 compression in pg_dump

2023-01-08 Thread Justin Pryzby
On Thu, Dec 22, 2022 at 11:08:59AM -0600, Justin Pryzby wrote:
> There's a couple of lz4 bits which shouldn't be present in 002: file
> extension and comments.

There were "LZ4" comments and file extension stuff in the preparatory
commit.  But now it seems like you *removed* them in the LZ4 commit
(where it actually belongs) rather than *moving* it from the
prior/parent commit *to* the lz4 commit.  I recommend to run something
like "git diff @{1}" whenever doing this kind of patch surgery.

+   if (AH->compression_spec.algorithm != PG_COMPRESSION_NONE &&
+   AH->compression_spec.algorithm == PG_COMPRESSION_GZIP &&

This looks wrong/redundant.  The gzip part should be removed, right ?
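
I.e. presumably that should read just

    if (AH->compression_spec.algorithm != PG_COMPRESSION_NONE &&

(assuming the intent is to cover any compression method, not only gzip).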

Maybe other places that check if (compression == PG_COMPRESSION_GZIP)
should change to say compression != NONE?

_PrepParallelRestore() references ".gz", so I think it needs to be
retrofitted to handle .lz4.  Ideally, that's built into a struct or list
of file extensions to try.  Maybe compression.h should have a function
to return the file extension of a given algorithm.  I'm planning to send
a patch for zstd, and hoping its changes will be minimized by these
preparatory commits.
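
Roughly what I have in mind (function name and placement are just a
suggestion, untested):

/*
 * Return the canonical filename suffix for a compression algorithm,
 * e.g. "" for none, ".gz" for gzip, ".lz4" for lz4.
 */
const char *
compress_file_extension(pg_compress_algorithm algorithm)
{
	switch (algorithm)
	{
		case PG_COMPRESSION_NONE:
			return "";
		case PG_COMPRESSION_GZIP:
			return ".gz";
		case PG_COMPRESSION_LZ4:
			return ".lz4";
		default:
			return "";			/* keep compiler quiet */
	}
}

Then _PrepParallelRestore() could loop over the algorithms and try each
suffix instead of hard-coding ".gz".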

+ errno = errno ? : ENOSPC;

"?:" is a GNU extension (not the ternary operator, but the ternary
operator with only 2 args).  It's not in use anywhere else in postgres.
You could instead write it with 3 "errno"s or as "if (errno == 0)
errno = ENOSPC;".

You wrote "eol_flag == false" and "eol_flag == 0" and true.  But it's
cleaner to test it as a boolean: if (eol_flag) / if (!eol_flag).

Both LZ4File_init() and its callers check "inited".  Better to do it in
one place than 3.  It's a static function, so I think there's no
performance concern.

Gzip_close() still has a useless save_errno (or rebase issue?).

I think it's confusing to have two functions, one named
InitCompressLZ4() and InitCompressorLZ4().

pg_compress_specification is being passed by value, but I think it
should be passed as a pointer, as is done everywhere else.

pg_compress_algorithm is being written directly into the pg_dump header.
Currently, I think that's not an externally-visible value (it could be
renumbered, theoretically even in a minor release).  Maybe there should
be a "private" enum for encoding the pg_dump header, similar to
WAL_COMPRESSION_LZ4 vs BKPIMAGE_COMPRESS_LZ4 ?  Or else a comment there
should warn that the values are encoded in pg_dump, and must never be
changed.
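
Something like this is what I am picturing (all names invented):

/* On-disk encoding of the compression method in the pg_dump header. */
typedef enum DumpCompressionCode
{
	DUMP_COMPRESS_NONE = 0,
	DUMP_COMPRESS_GZIP = 1,
	DUMP_COMPRESS_LZ4 = 2
} DumpCompressionCode;

plus tiny to/from pg_compress_algorithm conversion helpers, so the header
format stays stable even if the public enum is ever renumbered.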

+ Verify that data files where compressed
typo: s/where/were/

Also:
s/occurance/occurrence/
s/begining/beginning/
s/Verfiy/Verify/
s/nessary/necessary/

BTW I noticed that cfdopen() was accidentally committed to compress_io.h
in master without being defined anywhere.

-- 
Justin




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-08 Thread Vik Fearing

On 1/8/23 18:05, Ankit Kumar Pandey wrote:



On 08/01/23 21:36, Vik Fearing wrote:



On 1/8/23 11:21, Ankit Kumar Pandey wrote:


Please find attached patch with addressed issues mentioned before.




I am curious about this plan:



+-- ORDER BY's in multiple Window functions can be combined into one
+-- if they are subset of QUERY's ORDER BY
+EXPLAIN (COSTS OFF)
+SELECT empno,
+   depname,
+   min(salary) OVER (PARTITION BY depname ORDER BY empno) 
depminsalary,

+   sum(salary) OVER (PARTITION BY depname) depsalary,
+   count(*) OVER (ORDER BY enroll_date DESC) c
+FROM empsalary
+ORDER BY depname, empno, enroll_date;
+  QUERY PLAN
+--
+ WindowAgg
+   ->  WindowAgg
+ ->  Sort
+   Sort Key: depname, empno, enroll_date
+   ->  WindowAgg
+ ->  Sort
+   Sort Key: enroll_date DESC
+   ->  Seq Scan on empsalary
+(8 rows)
+




Why aren't min() and sum() calculated on the same WindowAgg run?


Isn't that exactly what is happening here? First count() with a sort on
enroll_date is run, and then min() and sum()?


No, there are two passes over the window for those two but I don't see 
that there needs to be.


The only difference between this and the plan generated by master (given
below) is a sort at the end.


Then this is probably not this patch's job to fix.
--
Vik Fearing





Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

2023-01-08 Thread Mason Sharp
On Fri, Jan 6, 2023 at 4:46 AM Pavel Borisov  wrote:

> Hi, Alexander!
>
> On Thu, 5 Jan 2023 at 15:11, Alexander Korotkov 
> wrote:
> >
> > On Wed, Jan 4, 2023 at 5:05 PM Alexander Korotkov 
> wrote:
> > > On Wed, Jan 4, 2023 at 3:43 PM Pavel Borisov 
> wrote:
> > > > One more update of a patchset to avoid compiler warnings.
> > >
> > > Thank you for your help.  I'm going to provide the revised version of
> > > patch with comments and commit message in the next couple of days.
> >
> > The revised patch is attached.  It contains describing commit message,
> > comments and some minor code improvements.
>
> I've looked through the patch once again. It seems in a nice state to
> be committed.
> I also noticed that in tableam level and NodeModifyTable function
> calls we have a one-to-one correspondence between *lockedSlot and bool
> lockUpdated, but no checks on this in case something changes in the
> code in the future. I'd propose combining these variables to remain
> free from these checks. See v5 of a patch. Tests are successfully
> passed.
> Besides, the new version has only some minor changes in the comments
> and the commit message.
>
> Kind regards,
> Pavel Borisov,
> Supabase.
>

It looks good, and the greater the concurrency the greater the benefit will
be. Just a few minor suggestions regarding comments.

"ExecDeleteAct() have already locked the old tuple for us", change "have"
to "has".

The comments in heapam_tuple_delete() and heapam_tuple_update() might be a
little clearer with something like:

"If the tuple has been concurrently updated, get lock already so that on
retry it will succeed, provided that the caller asked to do this by
providing a lockedSlot."

Also, not too important, but perhaps better clarify in the commit message
that the repeated work is driven by ExecUpdate and ExecDelete and can
happen multiple times depending on the concurrency.

Best Regards,

Mason


Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-08 Thread Ankit Kumar Pandey



On 08/01/23 16:33, David Rowley wrote:



On Sun, 8 Jan 2023 at 23:21, Ankit Kumar Pandey  wrote:

Please find attached patch with addressed issues mentioned before.


Here's a quick review:

1. You can use foreach_current_index(l) to obtain the index of the list element.

2. I'd rather see you looping backwards over the list in the first
loop. I think it'll be more efficient to loop backwards as you can
just break out of the loop when you see the pathkeys are not contained in
the ORDER BY pathkeys.  When the optimisation does not apply that
means you only need to look at the last item in the list.  You could
maybe just invent foreach_reverse() for this purpose and put it in
pg_list.h.  That'll save having to manually code up the loop.

3. I don't think you should call the variable
enable_order_by_pushdown. We've a bunch of planner related GUCs that
start with enable_. That might cause a bit of confusion.  Maybe just
try_sort_pushdown.

4. You should widen the scope of orderby_pathkeys and set it within
the if (enable_order_by_pushdown) scope. You can reuse this again in
the 2nd loop too.  Just initialise it to NIL

5. You don't seem to be checking all WindowClauses for a runCondition.
If you do #2 above then you can start that process in the initial
reverse loop so that you've checked them all by the time you get
around to that WindowClause again in the 2nd loop.

6. The test with "+-- Do not perform additional sort if column is
presorted". I don't think "additional" is the correct word here.  I
think you want to ensure that we don't push down the ORDER BY below
the WindowAgg for this case. There is no ability to reduce the sorts
here, only move them around, which we agreed we don't want to do for
this case.



I have addressed all points 1-6 in the attached patch.

I have one doubt regarding runCondition: do we only need to ensure that
the window function whose sort clause is a subset of the main query's
ORDER BY has no runCondition, or that none of the window functions have
a runCondition? I have gone with the latter, but wanted to clarify.



Also, do you want to have a go at coding up the sort bound pushdown
too?  It'll require removing the limitCount restriction and adjusting
ExecSetTupleBound() to recurse through a WindowAggState. I think it's
pretty easy. You could try it then play around with it to make sure it
works and we get the expected performance.


I tried this in the patch but kept getting `retrieved too many tuples in 
a bounded sort`.


I added the following code in ExecSetTupleBound(), which correctly found
the sortstate and set the bound value:

else if (IsA(child_node, WindowAggState))
{
    WindowAggState *winstate = (WindowAggState *) child_node;

    if (outerPlanState(winstate))
        ExecSetTupleBound(tuples_needed, outerPlanState(winstate));
}

I think the problem is that we are not using the limit clause inside the
window function itself (which may need to scan all tuples), so passing the
bound value down to WindowAggState->sortstate does not work as we might
expect. Or maybe I am getting it wrong? I was trying to get a top-N sort
for the LIMIT clause when the ORDER BY pushdown happens.


You'll likely want to add a few more rows than the last performance test you 
did or run the query
with pgbench. Running a query once that only takes 1-2ms is likely not
a reliable way to test the performance of something.


I did some profiling.

CREATE TABLE empsalary1 (
depname varchar,
empno bigint,
salary int,
enroll_date date
);
INSERT INTO empsalary1 (depname, empno, salary, enroll_date)
SELECT string_agg(substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', ceil(random() * 62)::integer, 1000), '') AS depname,
       generate_series(1, 1000) AS empno,
       ceil(random()*1)::integer AS salary,
       NOW() + (random() * (interval '90 days')) + '30 days' AS enroll_date;

1) Optimization case

EXPLAIN (ANALYZE)
SELECT empno,
   depname,
   min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
   sum(salary) OVER (PARTITION BY depname) depsalary,
   count(*) OVER (ORDER BY enroll_date DESC) c
FROM empsalary1
ORDER BY depname, empno, enroll_date;

EXPLAIN (ANALYZE)
SELECT empno,
   depname,
   min(salary) OVER (PARTITION BY depname) depminsalary
FROM empsalary1
ORDER BY depname, empno;



In patched version:

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=93458.04..93458.08 rows=1 width=61) (actual time=149996.006..156756.995 rows=1000 loops=1)
   ->  WindowAgg  (cost=93458.04..93458.06 rows=1 width=57) (actual time=108559.126..135892.188 rows=1000 loops=1)
         ->  Sort  (cost=93458.04..93458.04 rows=1 width=53) (actual time=108554.213..112564.168 rows=1000 loops=1)

Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-08 Thread Ankit Kumar Pandey




On 08/01/23 21:36, Vik Fearing wrote:



On 1/8/23 11:21, Ankit Kumar Pandey wrote:


Please find attached patch with addressed issues mentioned before.




I am curious about this plan:



+-- ORDER BY's in multiple Window functions can be combined into one
+-- if they are subset of QUERY's ORDER BY
+EXPLAIN (COSTS OFF)
+SELECT empno,
+   depname,
+   min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
+   sum(salary) OVER (PARTITION BY depname) depsalary,
+   count(*) OVER (ORDER BY enroll_date DESC) c
+FROM empsalary
+ORDER BY depname, empno, enroll_date;
+  QUERY PLAN
+--
+ WindowAgg
+   ->  WindowAgg
+ ->  Sort
+   Sort Key: depname, empno, enroll_date
+   ->  WindowAgg
+ ->  Sort
+   Sort Key: enroll_date DESC
+   ->  Seq Scan on empsalary
+(8 rows)
+




Why aren't min() and sum() calculated on the same WindowAgg run?


Isn't that exactly what is happening here? First count() with a sort on
enroll_date is run, and then min() and sum()?


The only difference between this and the plan generated by master (given
below) is a sort at the end.


 QUERY PLAN

 Incremental Sort
   Sort Key: depname, empno, enroll_date
   Presorted Key: depname, empno
   ->  WindowAgg
 ->  WindowAgg
   ->  Sort
 Sort Key: depname, empno
 ->  WindowAgg
   ->  Sort
 Sort Key: enroll_date DESC
 ->  Seq Scan on empsalary

Let me know if I am missing anything. Thanks.

--
Regards,
Ankit Kumar Pandey





Re: Progress report of CREATE INDEX for nested partitioned tables

2023-01-08 Thread Justin Pryzby
On Sat, Dec 17, 2022 at 08:30:02AM -0600, Justin Pryzby wrote:
> We have the common problem of too many patches.
> 
> https://www.postgresql.org/message-id/a15f904a70924ffa4ca25c3c744cff31e0e6e143.camel%40gmail.com
> This changes the progress reporting to show indirect children as
> "total", and adds a global variable to track recursion into
> DefineIndex(), allowing it to be incremented without the value being
> lost to the caller.
> 
> https://www.postgresql.org/message-id/20221211063334.GB27893%40telsasoft.com
> This also counts indirect children, but only increments the progress
> reporting in the parent.  This has the disadvantage that when
> intermediate partitions are in use, the done_partitions counter will
> "jump" from (say) 20 to 30 without ever hitting 21-29.
> 
> https://www.postgresql.org/message-id/20221213044331.GJ27893%40telsasoft.com
> This has two alternate patches:
> - One patch changes to only update progress reporting of *direct*
>   children.  This is minimal, but discourages any future plan to track
>   progress involving intermediate partitions with finer granularity.
> - An alternative patch adds IndexStmt.nparts_done, and allows reporting
>   fine-grained progress involving intermediate partitions.
> 
> https://www.postgresql.org/message-id/flat/039564d234fc3d014c555a7ee98be69a9e724836.ca...@gmail.com
> This also reports progress of intermediate children.  The first patch
> does it by adding an argument to DefineIndex() (which isn't okay to
> backpatch).  And an alternate patch does it by adding to IndexStmt.
> 
> @committers: Is it okay to add nparts_done to IndexStmt ?

Any hint about this ?

This should be resolved before the "CIC on partitioned tables" patch,
which I think is otherwise done.




Re: Fix gin index cost estimation

2023-01-08 Thread Tom Lane
Alexander Korotkov  writes:
> I'm going to push this and backpatch to all supported versions if no 
> objections.

Push yes, but I'd counsel against back-patching.  People don't
generally like unexpected plan changes in stable versions, and
that's what a costing change could produce.  There's no argument
that we are fixing a failure or wrong answer here, so it doesn't
seem like back-patch material.

regards, tom lane




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-08 Thread Vik Fearing

On 1/8/23 11:21, Ankit Kumar Pandey wrote:


Please find attached patch with addressed issues mentioned before.



I am curious about this plan:

+-- ORDER BY's in multiple Window functions can be combined into one
+-- if they are subset of QUERY's ORDER BY
+EXPLAIN (COSTS OFF)
+SELECT empno,
+   depname,
+   min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
+   sum(salary) OVER (PARTITION BY depname) depsalary,
+   count(*) OVER (ORDER BY enroll_date DESC) c
+FROM empsalary
+ORDER BY depname, empno, enroll_date;
+  QUERY PLAN
+--
+ WindowAgg
+   ->  WindowAgg
+ ->  Sort
+   Sort Key: depname, empno, enroll_date
+   ->  WindowAgg
+ ->  Sort
+   Sort Key: enroll_date DESC
+   ->  Seq Scan on empsalary
+(8 rows)
+


Why aren't min() and sum() calculated on the same WindowAgg run?
--
Vik Fearing





Re: [Commitfest 2023-01] has started

2023-01-08 Thread vignesh C
On Tue, 3 Jan 2023 at 13:13, vignesh C  wrote:
>
> Hi All,
>
> Just a reminder that Commitfest 2023-01 has started.
> There are many patches based on the latest run from [1] which require
> a) Rebased on top of head b) Fix compilation failures c) Fix test
> failure, please have a look and rebase it so that it is easy for the
> reviewers and committers:
> 1. TAP output format for pg_regress
> 2. Add BufFileRead variants with short read and EOF detection
> 3. Add SHELL_EXIT_CODE variable to psql
> 4. Add foreign-server health checks infrastructure
> 5. Add last_vacuum_index_scans in pg_stat_all_tables
> 6. Add index scan progress to pg_stat_progress_vacuum
> 7. Add the ability to limit the amount of memory that can be allocated
> to backends.
> 8. Add tracking of backend memory allocated to pg_stat_activity
> 9. CAST( ... ON DEFAULT)
> 10. CF App: add "Returned: Needs more interest" close status
> 11. CI and test improvements
> 12. Cygwin cleanup
> 13. Expand character set for ltree labels
> 14. Fix tab completion MERGE
> 15. Force streaming every change in logical decoding
> 16. More scalable multixacts buffers and locking
> 17. Move SLRU data into the regular buffer pool
> 18. Move extraUpdatedCols out of RangeTblEntry
> 19.New [relation] options engine
> 20. Optimizing Node Files Support
> 21. PGDOCS - Stats views and functions not in order?
> 22. POC: Lock updated tuples in tuple_update() and tuple_delete()
> 23. Parallelize correlated subqueries that execute within each worker
> 24. Pluggable toaster
> 25. Prefetch the next tuple's memory during seqscans
> 26. Pulling up direct-correlated ANY_SUBLINK
> 27. Push aggregation down to base relations and joins
> 28. Reduce timing overhead of EXPLAIN ANALYZE using rdtsc
> 29. Refactor relation extension, faster COPY
> 30. Remove NEW placeholder entry from stored view query range table
> 31. TDE key management patches
> 32. Use AF_UNIX for tests on Windows (ie drop fallback TCP code)
> 33. Windows filesystem support improvements
> 34. making relfilenodes 56 bit
> 35. postgres_fdw: commit remote (sub)transactions in parallel during 
> pre-commit
> 36.recovery modules
>
> Commitfest status as of now:
> Needs review:177
> Waiting on Author:   47
> Ready for Committer:  20
> Committed:  31
> Withdrawn:4
> Rejected:   0
> Returned with Feedback:  0
> Total:  279
>
> We will be needing more members to actively review the patches to get
> more patches to the committed state. I would like to remind you that
> each patch submitter is expected to review at least one patch from
> another submitter during the CommitFest, those members who have not
> picked up patch for review please pick someone else's patch to review
> as soon as you can.
> I'll send out reminders this week to get your patches rebased and
> update the status of the patch accordingly.
>
> [1] - http://cfbot.cputube.org/

Hi Hackers,

Here's a quick status report after the first week (I think only about
9 commits happened during the week, the rest were pre-CF activity):

status   |   3rd Jan |  w1
-+---+-
Needs review:|   177 | 149
Waiting on Author:   |47 |  60
Ready for Committer: |20 |  23
Committed:   |31 |  40
Withdrawn:   | 4 |   7
Rejected:| 0 |   0
Returned with Feedback:  | 0 |   0
Total:   |   279 | 279

Here is a list of "Needs review" entries for which there has not been
much communication on the thread and which need help in proceeding further.
Please pick one of these and help us move it forward:
pgbench: using prepared BEGIN statement in a pipeline could cause an
error | Yugo Nagata
Fix dsa_free() to re-bin segment | Dongming Liu
pg_rewind: warn when checkpoint hasn't happened after promotion | James Coleman
Work around non-atomic read of read of control file on ext4 | Thomas Munro
Rethinking the implementation of ts_headline | Tom Lane
Fix GetWALAvailability function code comments for WALAVAIL_REMOVED
return value | sirisha chamarti
Function to log backtrace of postgres processes | vignesh C, Bharath Rupireddy
disallow HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_LOCKED_ONLY | Nathan Bossart
New hooks in the connection path | Bertrand Drouvot
Check consistency of GUC defaults between .sample.conf and
pg_settings.boot_val | Justin Pryzby
Add <> support to sepgsql_restorecon | Joe Conway
pg_stat_statements and "IN" conditions | Dmitry Dolgov
Patch to implement missing join selectivity estimation for range types
| Zhicheng Luo, Maxime Schoemans, Diogo Repas, Mahmoud SAKR
Operation log for major operations | Dmitry Koval
Consider parallel for LATERAL subqueries having LIMIT/OFFSET | James Coleman
Using each rel as both outer and inner for anti-joins | Richard Guo
partInde

MERGE ... RETURNING

2023-01-08 Thread Dean Rasheed
I've been thinking about adding RETURNING support to MERGE in order to
let the user see what changed.

I considered allowing a separate RETURNING list at the end of each
action, but rapidly dismissed that idea. Firstly, it introduces
shift/reduce conflicts to the grammar. These can be resolved by making
the "AS" before column aliases non-optional, but that's pretty ugly,
and there may be a better way. More serious drawbacks are that this
syntax is much more cumbersome for the end user, having to repeat the
RETURNING clause several times, and the implementation is likely to be
pretty complex, so I didn't pursue it.

A much simpler approach is to just have a single RETURNING list at the
end of the command. That's much easier to implement, and easier for
the end user. The main drawback is that it's impossible for the user
to work out from the values returned which action was actually taken,
and I think that's a pretty essential piece of information (at least
it seems pretty limiting to me, not being able to work that out).

So playing around with it (and inspired by the WITH ORDINALITY syntax
for SRFs), I had the idea of allowing "WITH WHEN CLAUSE" at the end of
the returning list, which adds an integer column to the list, whose
value is set to the index of the when clause executed, as in the
attached very rough patch.

So, quoting an example from the tests, this allows things like:

WITH t AS (
  MERGE INTO sq_target t USING v ON tid = sid
WHEN MATCHED AND tid > 2 THEN UPDATE SET balance = t.balance + delta
WHEN NOT MATCHED THEN INSERT (balance, tid) VALUES (balance + delta, sid)
WHEN MATCHED AND tid < 2 THEN DELETE
RETURNING t.* WITH WHEN CLAUSE
)
SELECT CASE when_clause
 WHEN 1 THEN 'UPDATE'
 WHEN 2 THEN 'INSERT'
 WHEN 3 THEN 'DELETE'
   END, *
FROM t;

  case  | tid | balance | when_clause
+-+-+-
 INSERT |  -1 | -11 |   2
 DELETE |   1 | 100 |   3
(2 rows)

1 row is returned for each merge action executed (other than DO
NOTHING actions), and as usual, the values represent old target values
for DELETE actions, and new target values for INSERT/UPDATE actions.

It's also possible to return the source values, and a bare "*" in the
returning list expands to all the source columns, followed by all the
target columns.

The name of the added column, if included, can be changed by
specifying "WITH WHEN CLAUSE [AS] col_alias". I chose the syntax "WHEN
CLAUSE" and "when_clause" as the default column name because those
match the existing terminology used in the docs.
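
So for example the query above could instead end with

    RETURNING t.* WITH WHEN CLAUSE AS action_no

(action_no being an arbitrary alias).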

Anyway, this feels like a good point to stop playing around and get
feedback on whether this seems useful, or if anyone has other ideas.

Regards,
Dean
diff --git a/doc/src/sgml/ref/merge.sgml b/doc/src/sgml/ref/merge.sgml
new file mode 100644
index 0995fe0..4fc0a65
--- a/doc/src/sgml/ref/merge.sgml
+++ b/doc/src/sgml/ref/merge.sgml
@@ -25,6 +25,8 @@ PostgreSQL documentation
 MERGE INTO [ ONLY ] target_table_name [ * ] [ [ AS ] target_alias ]
 USING data_source ON join_condition
 when_clause [...]
+[ RETURNING * | output_expression [ [ AS ] output_name ] [, ...]
+  [ WITH WHEN CLAUSE [ [ AS ] col_alias ] ] ]
 
 where data_source is:
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
new file mode 100644
index e34f583..aa3cca0
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -274,12 +274,6 @@ DoCopy(ParseState *pstate, const CopyStm
 	{
 		Assert(stmt->query);
 
-		/* MERGE is allowed by parser, but unimplemented. Reject for now */
-		if (IsA(stmt->query, MergeStmt))
-			ereport(ERROR,
-	errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	errmsg("MERGE not supported in COPY"));
-
 		query = makeNode(RawStmt);
 		query->stmt = stmt->query;
 		query->stmt_location = stmt_location;
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
new file mode 100644
index 8043b4e..e02d7d0
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -510,7 +510,8 @@ BeginCopyTo(ParseState *pstate,
 		{
 			Assert(query->commandType == CMD_INSERT ||
    query->commandType == CMD_UPDATE ||
-   query->commandType == CMD_DELETE);
+   query->commandType == CMD_DELETE ||
+   query->commandType == CMD_MERGE);
 
 			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
new file mode 100644
index 651ad24..3391269
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -612,6 +612,9 @@ ExecInitPartitionInfo(ModifyTableState *
 	 * case or in the case of UPDATE tuple routing where we didn't find a
 	 * result rel to reuse.
 	 */
+
+	/* XXX: What about the MERGE case ??? */
+
 	if (node && node->returningLists != NIL)
 	{
 		TupleTableSlot *slot;
@@ -877,6 +880,7 @@ ExecInitPartitionInfo(ModifyTableState *
 		List	   *firstMergeActionList = linitial(node->mergeAct

Re: Fix gin index cost estimation

2023-01-08 Thread Alexander Korotkov
On Tue, Dec 6, 2022 at 1:22 PM Ronan Dunklau  wrote:
> On Friday, December 2, 2022 at 13:58:27 CET, Ronan Dunklau wrote:
> > On Friday, December 2, 2022 at 12:33:33 CET, Alexander Korotkov wrote:
> > > Hi, Ronan!
> > > Thank you for your patch.  Couple of quick questions.
> > > 1) What magic number 50.0 stands for?  I think we at least should make
> > > it a macro.
> >
> > This is what is used in other tree-descending  estimation functions, so I
> > used that too. Maybe a DEFAULT_PAGE_CPU_COST macro would work for both ? If
> > so I'll separate this into two patches, one introducing the macro for the
> > other estimation functions, and this patch for gin.
>
> The 0001 patch does this.
>
> >
> > > 2) "We only charge one data page for the startup cost" – should this
> > > be dependent on number of search entries?
>
> In fact there was another problem. The current code estimates two different
> paths for fetching data pages: in the case of a partial match, it takes into
> account that all the data pages will have to be fetched. So this is now
> taken into account for the CPU cost as well.
>
> For the regular search, we scale the number of data pages by the number of
> search entries.

Now the patch looks good to me.  I ran some tests.

# create extension pg_trgm;
# create extension btree_gin;
# create table test1 as (select random() as val from
generate_series(1,100) i);
# create index test1_gin_idx on test1 using gin (val);

# explain select * from test1 where val between 0.1 and 0.2;
 QUERY PLAN
-
 Bitmap Heap Scan on test1  (cost=1186.21..7089.57 rows=98557 width=8)
   Recheck Cond: ((val >= '0.1'::double precision) AND (val <=
'0.2'::double precision))
   ->  Bitmap Index Scan on test1_gin_idx  (cost=0.00..1161.57
rows=98557 width=0)
 Index Cond: ((val >= '0.1'::double precision) AND (val <=
'0.2'::double precision))
(4 rows)

# create index test1_btree_idx on test1 using btree (val);
# explain select * from test1 where val between 0.1 and 0.2;
   QUERY PLAN
-
 Index Only Scan using test1_btree_idx on test1  (cost=0.42..3055.57
rows=98557 width=8)
   Index Cond: ((val >= '0.1'::double precision) AND (val <=
'0.2'::double precision))
(2 rows)

Looks reasonable.  In this case GIN is much more expensive, because it
can't handle range query properly and overlaps two partial matches.

# create table test2 as (select 'samplestring' || i  as val from
generate_series(1,100) i);
# create index test2_gin_idx on test2 using gin (val);
# explain select * from test2 where val = 'samplestring50';
 QUERY PLAN
-
 Bitmap Heap Scan on test2  (cost=20.01..24.02 rows=1 width=18)
   Recheck Cond: (val = 'samplestring50'::text)
   ->  Bitmap Index Scan on test2_gin_idx  (cost=0.00..20.01 rows=1 width=0)
 Index Cond: (val = 'samplestring50'::text)
(4 rows)

# create index test2_btree_idx on test2 using btree (val);
# explain select * from test2 where val = 'samplestring50';
QUERY PLAN
---
 Index Only Scan using test2_btree_idx on test2  (cost=0.42..4.44
rows=1 width=18)
   Index Cond: (val = 'samplestring50'::text)
(2 rows)

This also looks reasonable.  GIN is not terribly bad for this case,
but B-tree is much cheaper.

I'm going to push this and backpatch to all supported versions if no objections.

--
Regards,
Alexander Korotkov




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-08 Thread David Rowley
On Sun, 8 Jan 2023 at 23:21, Ankit Kumar Pandey  wrote:
> Please find attached patch with addressed issues mentioned before.

Here's a quick review:

1. You can use foreach_current_index(l) to obtain the index of the list element.

2. I'd rather see you looping backwards over the list in the first
loop. I think it'll be more efficient to loop backwards as you can
just break out of the loop when you see the pathkeys are not contained in
the ORDER BY pathkeys.  When the optimisation does not apply that
means you only need to look at the last item in the list.  You could
maybe just invent foreach_reverse() for this purpose and put it in
pg_list.h.  That'll save having to manually code up the loop.
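
Untested sketch of roughly what I mean, mirroring the existing foreach()
macro (details may need adjusting):

#define foreach_reverse(cell, lst)	\
	for (ForEachState cell##__state = {(lst), list_length(lst) - 1}; \
		 (cell##__state.l != NIL && \
		  cell##__state.i >= 0) ? \
		 (cell = &cell##__state.l->elements[cell##__state.i], true) : \
		 (cell = NULL, false); \
		 cell##__state.i--)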

3. I don't think you should call the variable
enable_order_by_pushdown. We've a bunch of planner related GUCs that
start with enable_. That might cause a bit of confusion.  Maybe just
try_sort_pushdown.

4. You should widen the scope of orderby_pathkeys and set it within
the if (enable_order_by_pushdown) scope. You can reuse this again in
the 2nd loop too.  Just initialise it to NIL

5. You don't seem to be checking all WindowClauses for a runCondition.
If you do #2 above then you can start that process in the initial
reverse loop so that you've checked them all by the time you get
around to that WindowClause again in the 2nd loop.

6. The test with "+-- Do not perform additional sort if column is
presorted". I don't think "additional" is the correct word here.  I
think you want to ensure that we don't push down the ORDER BY below
the WindowAgg for this case. There is no ability to reduce the sorts
here, only move them around, which we agreed we don't want to do for
this case.

Also, do you want to have a go at coding up the sort bound pushdown
too?  It'll require removing the limitCount restriction and adjusting
ExecSetTupleBound() to recurse through a WindowAggState. I think it's
pretty easy. You could try it then play around with it to make sure it
works and we get the expected performance. You'll likely want to add a
few more rows than the last performance test you did or run the query
with pgbench. Running a query once that only takes 1-2ms is likely not
a reliable way to test the performance of something.
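
For example, put the query in a file and run something like

    pgbench -n -f window_query.sql -T 30 yourdb

(file and database names are placeholders) against both master and the
patched build, then compare the reported latency averages.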

David




Re: [RFC] Add jit deform_counter

2023-01-08 Thread Pavel Stehule
On Sun, Jan 8, 2023 at 11:57, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Sat, Jan 07, 2023 at 07:09:11PM +0100, Pavel Stehule wrote:
> > On Sat, Jan 7, 2023 at 16:48, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> >
> > > > On Fri, Jan 06, 2023 at 09:42:09AM +0100, Pavel Stehule wrote:
> > > > The explain part is working, the part of pg_stat_statements doesn't
> > > >
> > > > set jit_above_cost to 10;
> > > > set jit_optimize_above_cost to 10;
> > > > set jit_inline_above_cost to 10;
> > > >
> > > > (2023-01-06 09:08:59) postgres=# explain analyze select
> > > > count(length(prosrc) > 0) from pg_proc;
> > > >
> > >
> > > > QUERY PLAN
> > > > -----------------------------------------------------------------
> > > > Aggregate  (cost=154.10..154.11 rows=1 width=8) (actual time=132.320..132.321 rows=1 loops=1)
> > > >   ->  Seq Scan on pg_proc  (cost=0.00..129.63 rows=3263 width=16) (actual time=0.013..0.301 rows=3266 loops=1)
> > > > Planning Time: 0.070 ms
> > > > JIT:
> > > >   Functions: 3
> > > >   Options: Inlining true, Optimization true, Expressions true, Deforming true
> > > >   Timing: Generation 0.597 ms, Deforming 0.407 ms, Inlining 8.943 ms, Optimization 79.403 ms, Emission 43.091 ms, Total 132.034 ms
> > > > Execution Time: 132.986 ms
> > > > (8 rows)
> > > >
> > > > I see the result of deforming in explain analyze, but related values in
> > > > pg_stat_statements are 0.
> > >
> > > I'm not sure why, but pgss jit metrics are always nulls for explain
> > > analyze queries. I have noticed this with surprise myself, when recently
> > > was reviewing the lazy jit patch, but haven't yet figured out what is the
> > > reason. Anyway, without "explain analyze" you'll get correct deforming
> > > numbers in pgss.
> > >
> >
> > It was really strange, because I tested the queries without EXPLAIN
> ANALYZE
> > too, and new columns were always zero on my comp. Other jit columns were
> > filled.  But I didn't do a deeper investigation.
>
> Interesting. I've verified it once more with the query and the
> parameters you've posted, got the following:
>
> jit_functions  | 3
> jit_generation_time| 1.257522
> jit_deform_count   | 1
> jit_deform_time| 10.381345
> jit_inlining_count | 1
> jit_inlining_time  | 71.628168
> jit_optimization_count | 1
> jit_optimization_time  | 48.146447
> jit_emission_count | 1
> jit_emission_time  | 0.737822
>
> Maybe there is something else special about how you run it?
>

I hope not, but I'll see. I'll recheck the updated patch.



>
> Otherwise I've addressed the rest of the comments, thanks.
>


Re: [RFC] Add jit deform_counter

2023-01-08 Thread Dmitry Dolgov
> On Sat, Jan 07, 2023 at 07:09:11PM +0100, Pavel Stehule wrote:
> On Sat, Jan 7, 2023 at 16:48, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > > On Fri, Jan 06, 2023 at 09:42:09AM +0100, Pavel Stehule wrote:
> > > The explain part is working, the part of pg_stat_statements doesn't
> > >
> > > set jit_above_cost to 10;
> > > set jit_optimize_above_cost to 10;
> > > set jit_inline_above_cost to 10;
> > >
> > > (2023-01-06 09:08:59) postgres=# explain analyze select
> > > count(length(prosrc) > 0) from pg_proc;
> > >
> > > QUERY PLAN
> > > -----------------------------------------------------------------
> > > Aggregate  (cost=154.10..154.11 rows=1 width=8) (actual time=132.320..132.321 rows=1 loops=1)
> > >   ->  Seq Scan on pg_proc  (cost=0.00..129.63 rows=3263 width=16) (actual time=0.013..0.301 rows=3266 loops=1)
> > > Planning Time: 0.070 ms
> > > JIT:
> > >   Functions: 3
> > >   Options: Inlining true, Optimization true, Expressions true, Deforming true
> > >   Timing: Generation 0.597 ms, Deforming 0.407 ms, Inlining 8.943 ms, Optimization 79.403 ms, Emission 43.091 ms, Total 132.034 ms
> > > Execution Time: 132.986 ms
> > > (8 rows)
> > >
> > > I see the result of deforming in explain analyze, but related values in
> > > pg_stat_statements are 0.
> >
> > I'm not sure why, but pgss jit metrics are always nulls for explain
> > analyze queries. I have noticed this with surprise myself, when recently
> > was reviewing the lazy jit patch, but haven't yet figured out what is the
> > reason. Anyway, without "explain analyze" you'll get correct deforming
> > numbers in pgss.
> >
>
> It was really strange, because I tested the queries without EXPLAIN ANALYZE
> too, and new columns were always zero on my comp. Other jit columns were
> filled.  But I didn't do a deeper investigation.

Interesting. I've verified it once more with the query and the
parameters you've posted, got the following:

jit_functions  | 3
jit_generation_time| 1.257522
jit_deform_count   | 1
jit_deform_time| 10.381345
jit_inlining_count | 1
jit_inlining_time  | 71.628168
jit_optimization_count | 1
jit_optimization_time  | 48.146447
jit_emission_count | 1
jit_emission_time  | 0.737822

Maybe there is something else special about how you run it?

Otherwise I've addressed the rest of the comments, thanks.
>From 93d739e9258f79474df5644831a1f82fc97742dc Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Sat, 11 Jun 2022 11:54:40 +0200
Subject: [PATCH v3 1/2] Add deform_counter

At the moment generation_counter includes time spent on both JITing
expressions and tuple deforming. Those are configured independently via
corresponding options (jit_expressions and jit_tuple_deforming), which
means there is no way to find out what fraction of time tuple deforming
alone is taking.

Add deform_counter dedicated to tuple deforming, which allows seeing
more directly the influence jit_tuple_deforming is having on the query.
---
 src/backend/commands/explain.c  | 7 ++-
 src/backend/jit/jit.c   | 1 +
 src/backend/jit/llvm/llvmjit_expr.c | 6 ++
 src/include/jit/jit.h   | 3 +++
 4 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f86983c660..2f23602a87 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -882,6 +882,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji)
 
 	/* calculate total time */
 	INSTR_TIME_SET_ZERO(total_time);
+	/* don't add deform_counter, it's included into generation_counter */
 	INSTR_TIME_ADD(total_time, ji->generation_counter);
 	INSTR_TIME_ADD(total_time, ji->inlining_counter);
 	INSTR_TIME_ADD(total_time, ji->optimization_counter);
@@ -909,8 +910,9 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji)
 		{
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
-			 "Timing: %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms\n",
+			 "Timing: %s %.3f 

Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-01-08 Thread Ankit Kumar Pandey

Hi David,

Please find attached patch with addressed issues mentioned before.

Things resolved:

1. The correct position of the window function from which ORDER BY pushdown
can happen is now determined; this fixes the issue mentioned in case #1.


While writing test cases, I found that the optimization did not happen for
case #1 (which is a prime candidate for such an operation), e.g.:

EXPLAIN (COSTS OFF)
SELECT empno,
 depname,
 min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary,
 sum(salary) OVER (PARTITION BY depname) depsalary
FROM empsalary
ORDER BY depname, empno, enroll_date


2. Point #2 as in above discussions


a) looks like the best plan to me.  What's the point of pushing the
sort below the WindowAgg in this case? The point of this optimisation
is to reduce the number of sorts not to push them as deep into the
plan as possible. We should only be pushing them down when it can
reduce the number of sorts. There's no reduction in the number of
sorts in the above plan.


Works as mentioned.

All test cases (newly added and existing ones) are green.

--
Regards,
Ankit Kumar Pandey
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 000d757bdd..fd3fcd43a0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4423,6 +4423,12 @@ create_one_window_path(PlannerInfo *root,
 	PathTarget *window_target;
 	ListCell   *l;
 	List	   *topqual = NIL;
+	bool		has_runcondition = false;
+
+	List	   *window_pathkeys;
+	int			index = 0;
+	int			i;
+	bool		enable_order_by_pushdown = false;
 
 	/*
 	 * Since each window clause could require a different sort order, we stack
@@ -4441,10 +4447,45 @@ create_one_window_path(PlannerInfo *root,
 	 */
 	window_target = input_target;
 
+	enable_order_by_pushdown = (root->parse->distinctClause == NIL &&
+root->parse->sortClause != NIL &&
+root->parse->limitCount == NULL &&
+!contain_window_function((Node *) get_sortgrouplist_exprs(root->parse->sortClause,
+	  root->processed_tlist)));
+
+
+	i=0;
+	if (enable_order_by_pushdown)
+	{
+		/*
+		 * Search for the window function whose path keys are
+		 * subset of orderby_path keys, this allows us to perform
+		 * order by pushdown from this position of window function onwards
+		 */
+		foreach(l, activeWindows)
+		{
+			List	   *orderby_pathkeys;
+			WindowClause *wc = lfirst_node(WindowClause, l);
+			orderby_pathkeys = make_pathkeys_for_sortclauses(root,
+			 root->parse->sortClause,
+			 root->processed_tlist);
+			window_pathkeys = make_pathkeys_for_window(root,
+		wc,
+		root->processed_tlist);
+			if (pathkeys_contained_in(window_pathkeys, orderby_pathkeys))
+			{
+
+index = i;
+break;
+			}
+			i++;
+		}
+	}
+
+	i=0;
 	foreach(l, activeWindows)
 	{
 		WindowClause *wc = lfirst_node(WindowClause, l);
-		List	   *window_pathkeys;
 		int			presorted_keys;
 		bool		is_sorted;
 		bool		topwindow;
@@ -4457,6 +4498,38 @@ create_one_window_path(PlannerInfo *root,
 path->pathkeys,
 &presorted_keys);
 
+		has_runcondition |= (wc->runCondition != NIL);
+
+		/*
+		 * If not sorted already and the query has an ORDER BY clause, then we
+		 * try to push the entire ORDER BY below the final WindowAgg.  We
+		 * don't do this when the query has a DISTINCT clause as the planner
+		 * might opt to do a Hash Aggregate and spoil our efforts here.  We
+		 * also can't do this when the ORDER BY contains any WindowFuncs as we
+		 * obviously can't sort something that's yet to be evaluated.  We also
+		 * don't bother with this when there is a WindowClauses with a
+		 * runCondition as those can reduce the number of rows to sort in the
+		 * ORDER BY and we don't want add complexity to the WindowAgg sort if
+		 * the final ORDER BY sort is going to have far fewer rows to sort.
+		 * For the same reason, don't bother if the query has a LIMIT clause.
+		 */
+		if (enable_order_by_pushdown && !is_sorted && !has_runcondition && (i == index))
+		{
+			List	   *orderby_pathkeys;
+
+			orderby_pathkeys = make_pathkeys_for_sortclauses(root,
+			 root->parse->sortClause,
+			 root->processed_tlist);
+
+			if (pathkeys_contained_in(window_pathkeys, orderby_pathkeys))
+window_pathkeys = orderby_pathkeys;
+
+			is_sorted = pathkeys_count_contained_in(window_pathkeys,
+	path->pathkeys,
+	&presorted_keys);
+		}
+		i++;
+
 		/* Sort if necessary */
 		if (!is_sorted)
 		{
diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out
index 90e89fb5b6..08663fd914 100644
--- a/src/test/regress/expected/window.out
+++ b/src/test/regress/expected/window.out
@@ -3982,7 +3982,194 @@ ORDER BY depname, empno;
  ->  Seq Scan on empsalary
 (11 rows)
 
+-- Do not perform additional sort if column is presorted
+CREATE INDEX depname_idx ON empsalary(