Re: Parallel INSERT (INTO ... SELECT ...)

2021-03-14 Thread Amit Kapila
On Mon, Mar 15, 2021 at 11:34 AM Justin Pryzby  wrote:
>
> On Mon, Mar 15, 2021 at 11:25:26AM +0530, Amit Kapila wrote:
> > On Fri, Mar 12, 2021 at 3:01 PM houzj.f...@fujitsu.com wrote:
> > >
> > > Attaching new version patch with this change.
> >
> > Thanks, the patch looks good to me. I have made some minor cosmetic
> > changes in the attached. I am planning to push this by tomorrow unless
> > you or others have any more comments or suggestions for this patch.
>
> I still wonder if it should be possible for the GUC to be off, and then
> selectively enable parallel inserts on specific tables.
>

I think that could be inconvenient for users because for most tables
such a restriction won't need to be applied.

-- 
With Regards,
Amit Kapila.




Re: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW

2021-03-14 Thread Amit Kapila
On Mon, Mar 15, 2021 at 10:39 AM Thomas Munro  wrote:
>
> On Mon, Jan 4, 2021 at 9:05 PM Luc Vlaming  wrote:
>
> While reading some back history, I saw that commit e9baa5e9 introduced
> parallelism for CREATE MATERIALIZED VIEW, but REFRESH was ripped out of the original
> patch by Robert, who said:
>
> > The problem with a case like REFRESH MATERIALIZED VIEW is that there's
> > nothing to prevent something that gets run in the course of the query
> > from trying to access the view (and the heavyweight lock won't prevent
> > that, due to group locking). That's probably a stupid thing to do,
> > but it can't be allowed to break the world. The other cases are safe
> > from that particular problem because the table doesn't exist yet.
>
> Hmmm.
>

I am not sure, but we might want to add this to the comments of the
refresh materialized view function so that it would be easier for future
readers to understand why we have not enabled parallelism for this
case.

-- 
With Regards,
Amit Kapila.




Re: PATCH: Attempt to make dbsize a bit more consistent

2021-03-14 Thread Michael Paquier
On Wed, Feb 24, 2021 at 02:35:51PM +, gkokola...@pm.me wrote:
> Now with attachment. Apologies for the chatter.

The patch has no documentation for the two new functions, so it is a
bit hard to understand what value is brought here and what the goal
is just by reading the patch, except that, judging from the regression
tests, it switches the size reported for views to NULL instead of
zero bytes.

Please note that reporting zero is not wrong for views IMO, as these
have no physical storage; while it is possible to argue that a size
of zero could wrongly suggest that the relation has storage, that
value will never change, so I'd rather leave this behavior as it is.
I would bet, additionally, that changing it could break existing use
cases.
Reporting NULL, on the contrary, as your patch does, makes things
worse in some ways because that's what pg_relation_size would report
when a relation does not exist anymore.  Imagine for example the case
of a DROP TABLE run in parallel with a scan of pg_class using
pg_relation_size().  So it becomes impossible to know if a relation
has been removed or not.  This joins some points raised by Soumyadeep
upthread.
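
To make the ambiguity concrete, here is a quick illustration (object
names are arbitrary; the commented results reflect the behavior
described above, and the patched result is an assumption based on its
regression tests):

CREATE VIEW some_view AS SELECT 1 AS x;
SELECT pg_relation_size('some_view');   -- 0 today; NULL with the patch as proposed
SELECT pg_relation_size(0::regclass);   -- NULL: how an already-dropped relation reports

With the patch, a monitoring query scanning pg_class could no longer
tell the second case apart from a plain view.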

Anyway, as mentioned by other people upthread, I am not really
convinced either that we should have more flavors of size functions,
particularly depending on the relkind as this would be confusing for
the end-user.  pg_relation_size() can already do this job for all
relkinds that use segment files, so it should still be able to hold
its ground and maintain a consistent behavior with what it does
currently.

+static int64
+calculate_table_fork_size(Relation rel, ForkNumber forkNum)
+{
+   return (int64)table_relation_size(rel, forkNum);
+}
Why isn't that just unsigned, like table_relation_size()?  This is an
internal function so it does not matter to change its signature, but
I think that you had better do a cast at a higher level instead.

for (int i = 0; i < MAX_FORKNUM; i++)
-   nblocks += smgrnblocks(rel->rd_smgr, i);
+   nblocks += smgrexists(rel->rd_smgr, i)
+   ? smgrnblocks(rel->rd_smgr, i)
+   : 0;
Is that actually the part for views?  Why is that done this way?

+   if (rel->rd_rel->relkind == RELKIND_RELATION ||
+   rel->rd_rel->relkind == RELKIND_TOASTVALUE ||
+   rel->rd_rel->relkind == RELKIND_MATVIEW)
+   size = calculate_table_fork_size(rel,
+forkname_to_number(text_to_cstring(forkName)));
+   else if (rel->rd_rel->relkind == RELKIND_INDEX)
+   size = calculate_relation_size(&(rel->rd_node), rel->rd_backend,
+forkname_to_number(text_to_cstring(forkName)));
+   else
+   {
+   relation_close(rel, AccessShareLock);
+   PG_RETURN_NULL();
+   }
The approach you are taking to bring some consistency in all that by
making the size estimations go through table AMs via
table_relation_size() is promising.  However, this code breaks the
size estimation for sequences, which is not good.  If attempting to
use an evaluation that's based on a table AM, shouldn't this code use
a check based on rd_tableam rather than the relkind then?  We assume
that rd_tableam is set depending on the relkind in relcache.c, for
one.
--
Michael


signature.asc
Description: PGP signature


Re: SQL-standard function body

2021-03-14 Thread Jaime Casanova
On Tue, Mar 9, 2021 at 7:27 AM Peter Eisentraut
 wrote:
>
>
> I see.  The problem is that we don't have serialization and
> deserialization support for most utility statements.  I think I'll need
> to add that eventually.  For now, I have added code to prevent utility
> statements.  I think it's still useful that way for now.
>

Great! thanks!

I found another problem when using CASE expressions:

CREATE OR REPLACE FUNCTION foo_case()
RETURNS boolean
LANGUAGE SQL
BEGIN ATOMIC
select case when random() > 0.5 then true else false end;
END;

Apparently, the END in the CASE expression is interpreted as the END of
the function.
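
For what it's worth, if the single-expression RETURN form of the new
syntax is available in this patch version (I have not verified that),
it involves no trailing END at all and might sidestep the ambiguity:

CREATE OR REPLACE FUNCTION foo_case()
RETURNS boolean
LANGUAGE SQL
RETURN (CASE WHEN random() > 0.5 THEN true ELSE false END);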

-- 
Jaime Casanova
Director de Servicios Profesionales
SYSTEMGUARDS - Consultores de PostgreSQL




Re: Parallel INSERT (INTO ... SELECT ...)

2021-03-14 Thread Justin Pryzby
On Mon, Mar 15, 2021 at 11:25:26AM +0530, Amit Kapila wrote:
> On Fri, Mar 12, 2021 at 3:01 PM houzj.f...@fujitsu.com wrote:
> >
> > Attaching new version patch with this change.
> 
> Thanks, the patch looks good to me. I have made some minor cosmetic
> changes in the attached. I am planning to push this by tomorrow unless
> you or others have any more comments or suggestions for this patch.

I still wonder if it should be possible for the GUC to be off, and then
selectively enable parallel inserts on specific tables.  In that case, the GUC
should be called something other than enable_*, or maybe it should be an enum
with values like "off" and "allowed"/enabled/defer/???
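
For illustration, the workflow being suggested might look something like
this (the GUC and reloption names here are assumptions based on the patch
under discussion, and per the point above the GUC might well end up with a
different name or an enum form; the table names are invented):

SET enable_parallel_insert = off;                    -- or e.g. an enum value such as 'off'
ALTER TABLE big_target SET (parallel_insert_enabled = true);
INSERT INTO big_target SELECT * FROM staging;        -- the SELECT part may then use a parallel plan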

-- 
Justin




Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-14 Thread Kyotaro Horiguchi
At Sun, 14 Mar 2021 17:59:59 +0900, Michael Paquier  wrote 
in 
> On Thu, Mar 04, 2021 at 05:10:36PM +0900, Kyotaro Horiguchi wrote:
> > read_local_xlog_page is *designed* to maintain ThisTimeLineID.
> > Currently it doesn't seem utilized but I think it's sufficiently
> > reasonable that the function maintains ThisTimeLineID.
> 
> I don't quite follow this line of thoughts.  ThisTimeLineID is
> designed to remain 0 while recovery is running in most processes
> (at the close exception of a WAL sender with a cascading setup,

The reason for the "0" is they just aren't interested in the value.
Checkpointer temporarily uses it then restore to 0 soon.

> physical or logical, of course), so why is there any business for
> read_local_xlog_page() to touch this field at all while in recovery to
> begin with?

Logical decoding stuff is (I think) designed to turn any backend into
a walsender, which may need to maintain ThisTimeLineID.  It seems to
me that the logical decoding code intends to maintain ThisTimeLineID of
such backends when reading a WAL record. logical_read_xlog_page also
updates ThisTimeLineID, and pg_logical_slot_get_changes_guts(),
pg_replication_slot_advance() (and maybe other functions) update
ThisTimeLineID. So it is natural that read_local_xlog_page() updates
it since it is intended to be used in logical decoding plugins.

> I equally find confusing that XLogReadDetermineTimeline() relies on a
> specific value of ThisTimeLineID in its own logic, while it clearly
> states that all its callers have to read the current active TLI
> beforehand.  So I think that the existing logic is pretty weak, and
> that resetting the field is an incorrect approach?  It seems to me

It is initialized by IdentifySystem(). And the logical walsender
intends to maintain ThisTimeLineID by subsequent calls to
GetStandbyFlushRecPtr(), which happen in logical_read_xlog_page().

> that we had better not change ThisTimeLineID actively while in
> recovery in this code path and just let others do the job, like
> RecoveryInProgress() once recovery finishes, or
> GetStandbyFlushRecPtr() for a WAL sender.  And finally, we should
> store the current TLI used for replay in a separate variable that gets
> passed down to the XLogReadDetermineTimeline() as argument.

I agree that it's better for the replay TLI to be stored in a separate
variable. That is what I was complaining about in the previous mails.
(It might not have been so obvious, though.)

> While going through it, I have simplified a bit the proposed TAP tests
> (thanks for replacing the sleep() call, Soumyadeep.  This would have
> made the test slower for nothing on fast machines, and it would cause
> failures on very slow machines).
> 
> The attached fixes the original issue for me, keeping all the records
> in their correct timeline.  And I have not been able to break
> cascading setups.  If it happens that such cases actually break, we
> have holes in our existing test coverage that should be improved.  I
> cannot see anything fancy missing on this side, though.
> 
> Any thoughts?

I don't think there's any actual user of the function for that
purpose, but...  Anyway, if we remove the update of ThisTimeLineID from
read_local_xlog_page, I think we should remove or rewrite the
following comment for the function, since it would no longer work as
advertised:

> * Public because it would likely be very helpful for someone writing another
> * output method outside walsender, e.g. in a bgworker.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Parallel INSERT (INTO ... SELECT ...)

2021-03-14 Thread Amit Kapila
On Fri, Mar 12, 2021 at 3:01 PM houzj.f...@fujitsu.com
 wrote:
>
> Attaching new version patch with this change.
>

Thanks, the patch looks good to me. I have made some minor cosmetic
changes in the attached. I am planning to push this by tomorrow unless
you or others have any more comments or suggestions for this patch.

-- 
With Regards,
Amit Kapila.


v28-0001-Add-a-new-GUC-and-a-reloption-to-enable-inserts-.patch
Description: Binary data


Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?

2021-03-14 Thread torikoshia

On 2021-03-07 19:16, Bharath Rupireddy wrote:

On Fri, Feb 5, 2021 at 5:15 PM Bharath Rupireddy
 wrote:


pg_terminate_backend and pg_cancel_backend with the postmaster PID produce
a "PID  is not a PostgreSQL server process" warning [1], which
basically implies that the postmaster is not a PostgreSQL process at
all. This is a bit misleading because the postmaster is the parent of
all PostgreSQL processes. Should we improve the warning message if the
given PID is the postmaster's PID?


+1. I felt it was a bit confusing when reviewing a thread[1].



If yes, how about  a generic message for both of the functions -
"signalling postmaster process is not allowed" or "cannot signal
postmaster process" or some other better suggestion?

[1] 2471176 ---> is postmaster PID.
postgres=# select pg_terminate_backend(2471176);
WARNING:  PID 2471176 is not a PostgreSQL server process
 pg_terminate_backend
----------------------
 f
(1 row)
postgres=# select pg_cancel_backend(2471176);
WARNING:  PID 2471176 is not a PostgreSQL server process
 pg_cancel_backend
-------------------
 f
(1 row)


I'm attaching a small patch that emits a warning "signalling
postmaster with PID %d is not allowed" for postmaster and "signalling
PostgreSQL server process with PID %d is not allowed" for auxiliary
processes such as checkpointer, background writer, walwriter.

However, for the stats collector and syslogger processes, we still get
the "PID X is not a PostgreSQL server process" warning because they
don't have PGPROC entries(??). So BackendPidGetProc and
AuxiliaryPidGetProc will not help, and even pg_stat_activity does not
have these processes' PIDs.


I also ran into the same problem while creating a patch in [2].

I'm now wondering about changing the message to something like
"PID  is not a PostgreSQL backend process".

"Backend process" is now defined as "Process of an instance
which acts on behalf of a client session and handles its
requests." in the glossary appendix.


[1] 
https://www.postgresql.org/message-id/CALDaNm3ZzmFS-%3Dr7oDUzj7y7BgQv%2BN06Kqyft6C3xZDoKnk_6w%40mail.gmail.com


[2] 
https://www.postgresql.org/message-id/0271f440ac77f2a4180e0e56ebd944d1%40oss.nttdata.com



Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION




Re: Regression tests vs SERIALIZABLE

2021-03-14 Thread Thomas Munro
On Mon, Mar 15, 2021 at 6:14 PM Bharath Rupireddy
 wrote:
> On Mon, Mar 15, 2021 at 9:54 AM Thomas Munro  wrote:
> > While reviewing the patch for parallel REFRESH MATERIALIZED VIEW, I
> > noticed that select_parallel.sql and write_parallel.sql believe that
> > (1) the tests are supposed to work with serializable as a default
> > isolation level, and (2) parallelism would be inhibited by that, so
> > they'd better use something else explicitly.  Here's a patch to update
> > that second thing in light of commit bb16aba5.  I don't think it
> > matters enough to bother back-patching it.
>
> +1, patch basically LGTM. I have one point - do we also need to remove
> "begin isolation level repeatable read;" in aggreates.sql, explain.sql
> and insert_parallel.sql? And in insert_parallel.sql, the comment also
> says "Serializable isolation would disable parallel query", which is
> not true after bb16aba5. Do we need to change that too?

Yeah, you're right.  That brings us to the attached.
From 19be76b8903ad6f179463bc058474338e9d53f8c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 15 Mar 2021 17:08:25 +1300
Subject: [PATCH v2] Parallel query regression tests don't need SERIALIZABLE
 workaround.

SERIALIZABLE no longer inhibits parallelism, so the comment and
workaround in various regression tests are obsolete since commit
bb16aba5 (release 12).

Also fix a typo.

Reviewed-by: Bharath Rupireddy 
Discussion: https://postgr.es/m/CA%2BhUKGJUaHeK%3DHLATxF1JOKDjKJVrBKA-zmbPAebOM0Se2FQRg%40mail.gmail.com
---
 src/test/regress/expected/aggregates.out  | 4 ++--
 src/test/regress/expected/explain.out | 4 +---
 src/test/regress/expected/insert_parallel.out | 4 +---
 src/test/regress/expected/select_parallel.out | 4 +---
 src/test/regress/expected/write_parallel.out  | 6 ++
 src/test/regress/sql/aggregates.sql   | 4 ++--
 src/test/regress/sql/explain.sql  | 4 +---
 src/test/regress/sql/insert_parallel.sql  | 4 +---
 src/test/regress/sql/select_parallel.sql  | 4 +---
 src/test/regress/sql/write_parallel.sql   | 6 ++
 10 files changed, 14 insertions(+), 30 deletions(-)

diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 2c818d9253..1ae0e5d939 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2411,7 +2411,7 @@ ROLLBACK;
 -- Secondly test the case of a parallel aggregate combiner function
 -- returning NULL. For that use normal transition function, but a
 -- combiner function returning NULL.
-BEGIN ISOLATION LEVEL REPEATABLE READ;
+BEGIN;
 CREATE FUNCTION balkifnull(int8, int8)
 RETURNS int8
 PARALLEL SAFE
@@ -2453,7 +2453,7 @@ SELECT balk(hundred) FROM tenk1;
 
 ROLLBACK;
 -- test coverage for aggregate combine/serial/deserial functions
-BEGIN ISOLATION LEVEL REPEATABLE READ;
+BEGIN;
 SET parallel_setup_cost = 0;
 SET parallel_tuple_cost = 0;
 SET min_parallel_table_scan_size = 0;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc7ab2ce8b..791eba8511 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -293,9 +293,7 @@ rollback;
 -- actually get (maybe none at all), we can't examine the "Workers" output
 -- in any detail.  We can check that it parses correctly as JSON, and then
 -- remove it from the displayed results.
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
diff --git a/src/test/regress/expected/insert_parallel.out b/src/test/regress/expected/insert_parallel.out
index d5fae79031..3599c21eba 100644
--- a/src/test/regress/expected/insert_parallel.out
+++ b/src/test/regress/expected/insert_parallel.out
@@ -52,9 +52,7 @@ insert into test_data select * from generate_series(1,10);
 --
 -- END: setup some tables and data needed by the tests.
 --
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 9b0c418db7..05ebcb284a 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -3,9 +3,7 @@
 --
 create function sp_parallel_restricted(int) returns int as
   $$begin return $1; end$$ language plpgsql parallel restricted;
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
diff --git a/src/test/regress/expected/write_parallel.out b/src/test/

Re: Regression tests vs SERIALIZABLE

2021-03-14 Thread Bharath Rupireddy
On Mon, Mar 15, 2021 at 9:54 AM Thomas Munro  wrote:
> While reviewing the patch for parallel REFRESH MATERIALIZED VIEW, I
> noticed that select_parallel.sql and write_parallel.sql believe that
> (1) the tests are supposed to work with serializable as a default
> isolation level, and (2) parallelism would be inhibited by that, so
> they'd better use something else explicitly.  Here's a patch to update
> that second thing in light of commit bb16aba5.  I don't think it
> matters enough to bother back-patching it.

+1, patch basically LGTM. I have one point - do we also need to remove
"begin isolation level repeatable read;" in aggreates.sql, explain.sql
and insert_parallel.sql? And in insert_parallel.sql, the comment also
says "Serializable isolation would disable parallel query", which is
not true after bb16aba5. Do we need to change that too?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW

2021-03-14 Thread Thomas Munro
On Mon, Jan 4, 2021 at 9:05 PM Luc Vlaming  wrote:
> The new status of this patch is: Ready for Committer

I think the comments above this might as well be removed, because they
aren't very convincing:

+-- Allow parallel planning of the underlying query for refresh materialized
+-- view. We can be ensured that parallelism will be picked because of the
+-- enforcement done at the beginning of the test.
+refresh materialized view parallel_mat_view;

If you just leave the REFRESH command, at least it'll be exercised,
and I know you have a separate CF entry to add EXPLAIN support for
REFRESH.  So I'd just rip these weasel words out and then in a later
commit you can add the EXPLAIN there where it's obviously missing.

While reading some back history, I saw that commit e9baa5e9 introduced
parallelism for CREATE MATERIALIZED VIEW, but REFRESH was ripped out of the original
patch by Robert, who said:

> The problem with a case like REFRESH MATERIALIZED VIEW is that there's
> nothing to prevent something that gets run in the course of the query
> from trying to access the view (and the heavyweight lock won't prevent
> that, due to group locking). That's probably a stupid thing to do,
> but it can't be allowed to break the world. The other cases are safe
> from that particular problem because the table doesn't exist yet.

Hmmm.




Re: A new function to wait for the backend exit after termination

2021-03-14 Thread Fujii Masao




On 2021/03/15 12:27, Bharath Rupireddy wrote:

On Sun, Mar 7, 2021 at 2:39 PM Bharath Rupireddy
 wrote:

Attaching v7 patch for further review.


Attaching v8 patch after rebasing on to the latest master.


Thanks for rebasing the patch!

-   WAIT_EVENT_XACT_GROUP_UPDATE
+   WAIT_EVENT_XACT_GROUP_UPDATE,
+   WAIT_EVENT_BACKEND_TERMINATION

These should be listed in alphabetical order.

In pg_wait_until_termination's do-while loop, ResetLatch() should be called.
Otherwise, it would enter a busy loop after any signal arrives, because the
latch is kept set and WaitLatch() always exits immediately in that case.

+   /*
+    * Wait in steps of waittime milliseconds until this function exits or
+    * timeout.
+    */
+   int64   waittime = 10;

10 ms per cycle seems too frequent?

+   ereport(WARNING,
+           (errmsg("timeout cannot be negative or zero: %lld",
+                   (long long int) timeout)));
+
+   result = false;

IMO the parameter should be verified before doing the actual thing.

Why is WARNING thrown in this case? Isn't it better to throw ERROR like 
pg_promote() does?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Procedures versus the "fastpath" API

2021-03-14 Thread Michael Paquier
On Tue, Mar 09, 2021 at 02:33:47PM -0500, Joe Conway wrote:
> My vote would be to reject using fastpath for procedures in all relevant
> branches.
> If someday someone cares enough to make it work, it is a new feature for a new
> major release.

FWIW, my vote would go for issuing an error if attempting to use a
procedure in the fast path for all the branches.  The lack of
complaint about the error you are mentioning sounds like a pretty good
argument to fail properly on existing branches, and work on this as a
new feature in the future if there is anybody willing to make a case
for it.
--
Michael


signature.asc
Description: PGP signature


Re: Extensions not dumped when --schema is used

2021-03-14 Thread Michael Paquier
On Sun, Feb 21, 2021 at 08:14:45AM +0900, Michael Paquier wrote:
> Okay, that sounds fine to me.  Thanks for confirming.

Guillaume, it has been a couple of weeks since your last update.  Are
you planning to send a new version of the patch?
--
Michael


signature.asc
Description: PGP signature


Re: Improvements and additions to COPY progress reporting

2021-03-14 Thread Michael Paquier
On Wed, Mar 10, 2021 at 09:35:10AM +0100, Matthias van de Meent wrote:
> There are examples in which pg_stat_progress_* views report
> inaccurate data. I think it is fairly reasonable to at least validate
> some part of the progress reporting, as it is one of the few methods
> for administrators to look at the state of currently running
> administrative tasks, and as such, this user-visible api should be
> validated.

Looking closer at 0002, the size numbers are actually incorrect on
Windows for the second query.  The CRLFs at the end of each line of
emp.data add three bytes to the report of COPY FROM, so this finishes
with 82 bytes for bytes_total and bytes_processed instead of 79.
Let's make this useful but simpler here, so I propose to check that
the counters are higher than zero instead of an exact number.  Let's
also add the relation name relid::regclass while at it.

The tests introduced are rather limited, but you are right that
something is better than nothing here, and I have slightly updated
the tests sent previously, as per the attached.  What do you
think?
--
Michael
diff --git a/src/test/regress/input/copy.source b/src/test/regress/input/copy.source
index a1d529ad36..9e4766898d 100644
--- a/src/test/regress/input/copy.source
+++ b/src/test/regress/input/copy.source
@@ -201,3 +201,65 @@ select * from parted_copytest where b = 1;
 select * from parted_copytest where b = 2;
 
 drop table parted_copytest;
+
+--
+-- Progress reporting for COPY
+--
+create table tab_progress_reporting (
+	name text,
+	age int4,
+	location point,
+	salary int4,
+	manager name
+);
+
+-- Add a trigger to catch and print the contents of the catalog view
+-- pg_stat_progress_copy during data insertion.  This allows to test
+-- the validation of some progress reports for COPY FROM where the trigger
+-- would fire.
+create function notice_after_tab_progress_reporting() returns trigger AS
+$$
+declare report record;
+begin
+	-- The fields ignored here are the ones that may not remain
+	-- consistent across multiple runs.  The sizes reported may differ
+	-- across platforms, so just check if these are strictly positive.
+	with progress_data as (
+	  select
+	 relid::regclass::text as relname,
+	 command,
+	 type,
+	 bytes_processed > 0 as has_bytes_processed,
+	 bytes_total > 0 as has_bytes_total,
+	 tuples_processed,
+	 tuples_excluded
+	from pg_stat_progress_copy
+	where pid = pg_backend_pid())
+	select into report (to_jsonb(r)) as value
+		from progress_data r;
+
+	raise info 'progress: %', report.value::text;
+	return new;
+end;
+$$ language plpgsql;
+
+create trigger check_after_tab_progress_reporting
+	after insert on tab_progress_reporting
+	for each statement
+	execute function notice_after_tab_progress_reporting();
+
+-- Generate COPY FROM report with PIPE.
+copy tab_progress_reporting from stdin;
+sharon	25	(15,12)	1000	sam
+sam	30	(10,5)	2000	bill
+bill	20	(11,10)	1000	sharon
+\.
+
+-- Generate COPY FROM report with FILE, with some excluded tuples.
+truncate tab_progress_reporting;
+copy tab_progress_reporting from '@abs_srcdir@/data/emp.data'
+	where (salary < 2000);
+
+drop trigger check_after_tab_progress_reporting on tab_progress_reporting;
+drop function notice_after_tab_progress_reporting();
+drop table tab_progress_reporting;
diff --git a/src/test/regress/output/copy.source b/src/test/regress/output/copy.source
index 938d3551da..8cf37dc257 100644
--- a/src/test/regress/output/copy.source
+++ b/src/test/regress/output/copy.source
@@ -165,3 +165,57 @@ select * from parted_copytest where b = 2;
 (1 row)
 
 drop table parted_copytest;
+--
+-- Progress reporting for COPY
+--
+create table tab_progress_reporting (
+	name text,
+	age int4,
+	location point,
+	salary int4,
+	manager name
+);
+-- Add a trigger to catch and print the contents of the catalog view
+-- pg_stat_progress_copy during data insertion.  This allows to test
+-- the validation of some progress reports for COPY FROM where the trigger
+-- would fire.
+create function notice_after_tab_progress_reporting() returns trigger AS
+$$
+declare report record;
+begin
+	-- The fields ignored here are the ones that may not remain
+	-- consistent across multiple runs.  The sizes reported may differ
+	-- across platforms, so just check if these are strictly positive.
+	with progress_data as (
+	  select
+	 relid::regclass::text as relname,
+	 command,
+	 type,
+	 bytes_processed > 0 as has_bytes_processed,
+	 bytes_total > 0 as has_bytes_total,
+	 tuples_processed,
+	 tuples_excluded
+	from pg_stat_progress_copy
+	where pid = pg_backend_pid())
+	select into report (to_jsonb(r)) as value
+		from progress_data r;
+
+	raise info 'progress: %', report.value::text;
+	return new;
+end;
+$$ language plpgsql;
+create trigger check_after_tab_progress_reporting
+	after insert on tab_progress_reporting
+	for each statement
+	execute function notice_after_tab_progress_reporting();
+-- Generate CO

Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-14 Thread Pavel Stehule
On Mon, Mar 15, 2021 at 4:54 AM, Julien Rouhaud  wrote:

> On Sun, Mar 14, 2021 at 08:41:15PM +0100, Pavel Stehule wrote:
> >
> > My English is good enough for taking beer everywhere in the world :) . It
> > is not good, but a lot of people who don't understand English understand
> > my simplified fork of the English language.
>
> I have the same problem.  We can talk about it and other stuff while
> having a beer sometimes :)
>

:)


> > updated patch attached
>
> Thanks, it looks good to me, I'm marking the patch as RFC!
>

Thank you very much

Regards

Pavel


Re: Change JOIN tutorial to focus more on explicit joins

2021-03-14 Thread Pavel Stehule
On Mon, Mar 15, 2021 at 3:48 AM, Thomas Munro  wrote:

> On Thu, Mar 11, 2021 at 2:06 AM David Steele  wrote:
> > On 12/1/20 3:38 AM, Jürgen Purtz wrote:
> > > OK. Patch attached.
>
> +Queries which access multiple tables (including repeats) at once are called
>
> I'd write "Queries that" here (that's is a transatlantic difference in
> usage; I try to proofread these things in American mode for
> consistency with the rest of the language in this project, which I
> probably don't entirely succeed at but this one I've learned...).
>
> Maybe instead of "(including repeats)" it could say "(or multiple
> instances of the same table)"?
>
> +For example, to return all the weather records together with the
> +location of the associated city, the database compares the city
> +column of each row of the weather table with the name column of all
> +rows in the cities table, and select the pairs of rows where these
> +values match.
>
> Here "select" should agree with "the database" and take an -s, no?
>
> +This syntax pre-dates the JOIN and ON
> +keywords.  The tables are simply listed in the FROM,
> +comma-separated, and the comparison expression added to the
> +WHERE clause.
>
> Could we mention SQL92 somewhere?  Like maybe "This syntax pre-dates
> the JOIN and ON keywords, which were introduced by SQL-92".  (That's a
> "non-restrictive which", I think the clue is the comma?)
>

The previous syntax should be mentioned too. A reader can find this syntax
in thousands of applications.
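
For example, the tutorial could show both forms side by side, something
like this (using the tutorial's weather and cities tables):

SELECT w.city, w.temp_lo, w.temp_hi, c.location
    FROM weather w JOIN cities c ON w.city = c.name;

-- The older, pre-SQL-92 form discussed above:
SELECT w.city, w.temp_lo, w.temp_hi, c.location
    FROM weather w, cities c
    WHERE w.city = c.name;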

Pavel


>


Regression tests vs SERIALIZABLE

2021-03-14 Thread Thomas Munro
Hi,

While reviewing the patch for parallel REFRESH MATERIALIZED VIEW, I
noticed that select_parallel.sql and write_parallel.sql believe that
(1) the tests are supposed to work with serializable as a default
isolation level, and (2) parallelism would be inhibited by that, so
they'd better use something else explicitly.  Here's a patch to update
that second thing in light of commit bb16aba5.  I don't think it
matters enough to bother back-patching it.

However, since commit 862ef372d6b, there *is* one test that fails if
you run make installcheck against a cluster running with -c
default_transaction_isolation=serializable: transaction.sql.  Is that
a mistake?  Is it a goal to be able to run this test suite against all
3 isolation levels?

@@ -1032,7 +1032,7 @@
 SHOW transaction_isolation;  -- out of transaction block
  transaction_isolation
 -----------------------
- read committed
+ serializable
 (1 row)
From 21115a0721aedec9c39a1d41bc38ec2b96960ff1 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 15 Mar 2021 17:08:25 +1300
Subject: [PATCH] Parallel regression tests don't need workaround for
 SERIALIZABLE.

SERIALIZABLE no longer inhibits parallelism, so the comment and
workaround in {select,write}_parallel.sql are obsolete since commit
bb16aba5 (release 12).

Also fix a nearby typo.
---
 src/test/regress/expected/select_parallel.out | 4 +---
 src/test/regress/expected/write_parallel.out  | 6 ++
 src/test/regress/sql/select_parallel.sql  | 4 +---
 src/test/regress/sql/write_parallel.sql   | 6 ++
 4 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 9b0c418db7..05ebcb284a 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -3,9 +3,7 @@
 --
 create function sp_parallel_restricted(int) returns int as
   $$begin return $1; end$$ language plpgsql parallel restricted;
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
diff --git a/src/test/regress/expected/write_parallel.out b/src/test/regress/expected/write_parallel.out
index 0c4da2591a..77705f9a70 100644
--- a/src/test/regress/expected/write_parallel.out
+++ b/src/test/regress/expected/write_parallel.out
@@ -1,16 +1,14 @@
 --
 -- PARALLEL
 --
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
 set min_parallel_table_scan_size=0;
 set max_parallel_workers_per_gather=4;
 --
--- Test write operations that has an underlying query that is eligble
+-- Test write operations that has an underlying query that is eligible
 -- for parallel plans
 --
 explain (costs off) create table parallel_write as
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 5a01a98b26..d31e290ec2 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -5,9 +5,7 @@
 create function sp_parallel_restricted(int) returns int as
   $$begin return $1; end$$ language plpgsql parallel restricted;
 
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
diff --git a/src/test/regress/sql/write_parallel.sql b/src/test/regress/sql/write_parallel.sql
index 78b479cedf..a5d63112c9 100644
--- a/src/test/regress/sql/write_parallel.sql
+++ b/src/test/regress/sql/write_parallel.sql
@@ -2,9 +2,7 @@
 -- PARALLEL
 --
 
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
@@ -13,7 +11,7 @@ set min_parallel_table_scan_size=0;
 set max_parallel_workers_per_gather=4;
 
 --
--- Test write operations that has an underlying query that is eligble
+-- Test write operations that has an underlying query that is eligible
 -- for parallel plans
 --
 explain (costs off) create table parallel_write as
-- 
2.30.1



Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-14 Thread Julien Rouhaud
On Sun, Mar 14, 2021 at 08:41:15PM +0100, Pavel Stehule wrote:
> 
> My English is good enough for taking beer everywhere in the world :) . It
> is not good, but a lot of people who don't understand English understand
> my simplified fork of the English language.

I have the same problem.  We can talk about it and other stuff while having a
beer sometimes :)

> updated patch attached

Thanks, it looks good to me, I'm marking the patch as RFC!




Re: REINDEX backend filtering

2021-03-14 Thread Julien Rouhaud
Please find attached v7, with the following changes:

- all typos reported by Michael and Mark are fixed
- REINDEX (OUTDATED) INDEX will now ignore the index if it doesn't have any
  outdated dependency.  Partitioned indexes are correctly handled (see the
  example below).
- REINDEX (OUTDATED, VERBOSE) will now inform the caller of ignored indexes,
  with lines of the form:

NOTICE:  index "index_name" has no outdated dependency

- updated regression tests to cover all those changes.  I kept the current
  approach of using simple SQL tests listing the ignored indexes.  I also added
  the OUTDATED option to the collate.icu.utf8 tests so that we also check that
  both REINDEX and REINDEX (OUTDATED) work as expected.
- moved pg_index_has_outdated_dependency to 0002

I didn't remove index_has_outdated_collation() for now.
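
For reference, the syntax exercised by these changes looks like this
(relation names are placeholders):

-- Only rebuild indexes that have an outdated dependency (currently, an
-- outdated collation version):
REINDEX (OUTDATED) INDEX some_index;

-- Same over a whole table, reporting the indexes that were skipped:
REINDEX (OUTDATED, VERBOSE) TABLE some_table;
-- NOTICE:  index "some_other_index" has no outdated dependency
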
From 91bcd6e4565164314eb6444635ca274695de3748 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Thu, 3 Dec 2020 15:54:42 +0800
Subject: [PATCH v7 1/2] Add a new OUTDATED filtering facility for REINDEX
 command.

OUTDATED is added a new unreserved keyword.

When used, REINDEX will only process indexes that have an outdated dependency.
For now, only dependency on collations are supported but we'll likely support
other kind of dependency in the future.

Author: Julien Rouhaud 
Reviewed-by: Michael Paquier, Mark Dilger
Discussion: https://postgr.es/m/20201203093143.GA64934%40nol
---
 doc/src/sgml/ref/reindex.sgml | 12 +++
 src/backend/access/index/indexam.c| 59 ++
 src/backend/catalog/index.c   | 79 ++-
 src/backend/commands/indexcmds.c  | 37 -
 src/backend/parser/gram.y |  4 +-
 src/backend/utils/cache/relcache.c| 47 +++
 src/bin/psql/tab-complete.c   |  2 +-
 src/include/access/genam.h|  1 +
 src/include/catalog/index.h   |  3 +
 src/include/parser/kwlist.h   |  1 +
 src/include/utils/relcache.h  |  1 +
 .../regress/expected/collate.icu.utf8.out | 12 ++-
 src/test/regress/expected/create_index.out| 27 +++
 src/test/regress/sql/collate.icu.utf8.sql | 12 ++-
 src/test/regress/sql/create_index.sql | 18 +
 15 files changed, 301 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index ff4dba8c36..c4749c338b 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -26,6 +26,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 where option can be one of:
 
 CONCURRENTLY [ boolean ]
+OUTDATED [ boolean ]
 TABLESPACE new_tablespace
 VERBOSE [ boolean ]
 
@@ -188,6 +189,17 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+OUTDATED
+
+ 
+  This option can be used to filter the list of indexes to rebuild and only
+  process indexes that have outdated dependencies.  Fow now, the only
+  dependency handled currently is the collation provider version.
+ 
+
+   
+

 TABLESPACE
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3d2dbed708..dc1c85cf0d 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -145,6 +145,65 @@ index_open(Oid relationId, LOCKMODE lockmode)
 	return r;
 }
 
+/* 
+ *		try_index_open - open an index relation by relation OID
+ *
+ *		Same as index_open, except return NULL instead of failing
+ *		if the index does not exist.
+ * 
+ */
+Relation
+try_index_open(Oid relationId, LOCKMODE lockmode)
+{
+	Relation	r;
+
+	Assert(lockmode >= NoLock && lockmode < MAX_LOCKMODES);
+
+	/* Get the lock first */
+	if (lockmode != NoLock)
+		LockRelationOid(relationId, lockmode);
+
+	/*
+	 * Now that we have the lock, probe to see if the relation really exists
+	 * or not.
+	 */
+	if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relationId)))
+	{
+		/* Release useless lock */
+		if (lockmode != NoLock)
+			UnlockRelationOid(relationId, lockmode);
+
+		return NULL;
+	}
+
+	/* Should be safe to do a relcache load */
+	r = RelationIdGetRelation(relationId);
+
+	if (!RelationIsValid(r))
+		elog(ERROR, "could not open relation with OID %u", relationId);
+
+	/* If we didn't get the lock ourselves, assert that caller holds one */
+	Assert(lockmode != NoLock ||
+		   CheckRelationLockedByMe(r, AccessShareLock, true));
+
+	if (r->rd_rel->relkind != RELKIND_INDEX &&
+		r->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+	{
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not an index",
+		RelationGetRelationName(r;
+	}
+
+	/* Make note that we've accessed a temporary relation */
+	if (RelationUsesLocalBuffers(r))
+		MyXactFlags |= XACT_FLAGS_ACCESSEDTEMPNAMESPACE;
+
+	pgstat_initstats(r);
+
+	return r;
+}
+
 /* 
  *		index_close - close an index relation
  *
diff --git a/src/backend/catalog/index.c b/src/b

Re: EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW

2021-03-14 Thread Bharath Rupireddy
On Sat, Mar 13, 2021 at 7:00 AM Japin Li  wrote:
>
> On Mon, 08 Mar 2021 at 12:28, Bharath Rupireddy 
>  wrote:
> > On Sun, Mar 7, 2021 at 10:13 PM Zhihong Yu  wrote:
> >> Hi,
> >>
> >> +* EXPLAIN ANALYZE CREATE TABLE AS or REFRESH MATERIALIZED VIEW
> >> +* WITH NO DATA is weird.
> >>
> >> Maybe it is clearer to spell out WITH NO DATA for both statements, instead 
> >> of sharing it.
> >
> > Done that way.
> >
> >> -   if (!stmt->skipData)
> >> +   if (!stmt->skipData && !explainInfo)
> >> ...
> >> +   else if (explainInfo)
> >>
> >> It would be cleaner to put the 'if (explainInfo)' as the first check. That 
> >> way, the check for skipData can be simplified.
> >
> > Changed.
> >
> > Thanks for review comments. Attaching v7 patch set with changes only
> > in 0002 patch. Please have a look.
> >
>
> The v7 patch looks good to me, and there is no other advice, so I change
> the status to "Ready for Committer".

Thanks for the review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: A new function to wait for the backend exit after termination

2021-03-14 Thread Bharath Rupireddy
On Sun, Mar 7, 2021 at 2:39 PM Bharath Rupireddy
 wrote:
> Attaching v7 patch for further review.

Attaching v8 patch after rebasing on to the latest master.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v8-0001-pg_terminate_backend-with-wait-and-timeout.patch
Description: Binary data


Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-03-14 Thread Peter Geoghegan
On Mon, Sep 21, 2020 at 1:11 PM Andres Freund  wrote:
> > I'm not sure I really understand how that's happening, because surely
> > HOT chains and non-HOT chains are pruned by the same code, but it
> > doesn't sound good.
>
> Not necessarily, unfortunately:
>
> case HEAPTUPLE_DEAD:

> So if e.g. a transaction aborts between the heap_page_prune and this
> check the pruning behaviour depends on whether the tuple is part of a
> HOT chain or not.

I have a proposal that includes removing this "tupgone = true" special case:

https://postgr.es/m/CAH2-Wzm7Y=_g3fjvhv7a85afubusydggdneqn1hodveoctl...@mail.gmail.com

Of course this won't change the fact that vacuumlazy.c can disagree
with pruning about what is dead -- that is a necessary consequence of
having multiple HTSV calls for the same tuple in vacuumlazy.c (it can
change in the presence of concurrently aborted transactions). But
removing the "tupgone = true" special case does seem much more
consistent, and simpler overall. We have lots of code that is only
needed to make that special case work. For example, the whole idea of
a dedicated XLOG_HEAP2_CLEANUP_INFO record for recovery conflicts can
go -- we can get by with only XLOG_HEAP2_CLEAN records from pruning,
IIRC.

Have I missed something? The special case in question seems pretty
awful to me, so I have to wonder why somebody else didn't remove it
long ago...

--
Peter Geoghegan




Re: [HACKERS] logical decoding of two-phase transactions

2021-03-14 Thread Amit Kapila
On Fri, Mar 12, 2021 at 2:09 PM Peter Smith  wrote:
>
> Please find attached the latest patch set v58*
>

In this patch-series, I see a problem with synchronous replication
when the GUC 'synchronous_standby_names' is configured to use the subscriber.
This will allow Prepares and Commits to wait for the subscriber to
finish. Before this patch, we never sent Prepare, as two-phase was not
enabled by the subscriber, so it won't wait for it; rather, it will make
progress because we send keep-alive messages. But after this patch, it
will start waiting for the Prepare to finish. Now, without the spool-file
logic, it will work because prepares are decoded on the subscriber and a
corresponding ack will be sent to the publisher, but for the spool-file
case, we will wait for the publisher to send commit prepared, and on the
publisher the prepare is not finished because we are waiting for its ack.
So, it will create a sort of deadlock. This is related to the problem
mentioned in the below comments in the patch:
+ * A future release may be able to detect when all tables are READY and set
+ * a flag to indicate this subscription/slot is ready for two_phase
+ * decoding. Then at the publisher-side, we could enable wait-for-prepares
+ * only when all the slots of WALSender have that flag set.

The difference is that it can happen right now: prepares
automatically wait if 'synchronous_standby_names' is set. Now, we can
imagine a solution where, after spooling to a file the changes that
can't be applied during the sync-up phase, we update the flush location
so that the publisher can proceed with that prepare. But I think that won't
work because once we have updated the flush location those prepares
won't be sent again and it is quite possible that we don't have
complete relation information as the schema is not sent with each
transaction. Now, we can go one step further and try to remember the
schema information the first time it is sent so that it can be reused
after restart but I think that will complicate the patch and overall
design.

I think there is a simpler solution to these problems. The idea is to
enable two_phase after the initial sync is over (all relations are in
a READY state).  If we switch on 2PC only after all the relations
come to the READY state, then we shouldn't get any prepare before the
sync point. However, it is quite possible that before reaching the
sync point, the slot corresponding to the apply-worker has skipped
prepares because 2PC was not enabled, and afterward, a prepare would be
skipped because by then start_decoding_at might have moved. See the
explanation in an
email: 
https://www.postgresql.org/message-id/CAA4eK1LuK4t-ZYYCY7k9nMoYP%2Bdwi-JyqUdtcffQMoB_g5k6Hw%40mail.gmail.com.
Now, even the initial_consistent_point won't help because for
apply-worker, it will be different from tablesync slot's
initial_consistent_point and we would have reached initial consistency
earlier for apply-workers.

To solve the main problem (how to detect the prepares that are skipped
when we toggled the two_pc option) in the above idea, we can mark an
LSN position in the slot (two_phase_at, this will be the same as
start_decoding_at point when we receive slot with 2PC option) where we
enable two_pc. If we encounter any commit prepared whose prepare LSN
is less than two_phase_at, then we need to send prepare for the
transaction along with commit prepared.

For this solution on the subscriber-side, I think we need a tri-state
column (two_phase) in pg_subscription. It can have three values
'disable', 'can_enable', 'enable'. By default, it will be 'disable'.
If the user enables 2PC, then we can set it to 'can_enable' and once
we see all relations are in a READY state, restart the apply-worker
and this time while starting the streaming, send the two_pc option and
then we can change the state to 'enable' so that future restarts won't
send this option again. Now on the publisher side, if this option is
present, it will change the value of two_phase_at in the slot to
start_decoding_at. I think something on these lines should be much
easier than the spool-file implementation unless we see any problem
with this idea.

-- 
With Regards,
Amit Kapila.




Re: Change JOIN tutorial to focus more on explicit joins

2021-03-14 Thread Thomas Munro
On Thu, Mar 11, 2021 at 2:06 AM David Steele  wrote:
> On 12/1/20 3:38 AM, Jürgen Purtz wrote:
> > OK. Patch attached.

+Queries which access multiple tables (including repeats) at once are called

I'd write "Queries that" here (that's is a transatlantic difference in
usage; I try to proofread these things in American mode for
consistency with the rest of the language in this project, which I
probably don't entirely succeed at but this one I've learned...).

Maybe instead of "(including repeats)" it could say "(or multiple
instances of the same table)"?

+For example, to return all the weather records together with the
+location of the associated city, the database compares the city
+column of each row of the weather table with the name column of all
+rows in the cities table, and select the pairs of rows where these
+values match.

Here "select" should agree with "the database" and take an -s, no?

+This syntax pre-dates the JOIN and ON
+keywords.  The tables are simply listed in the FROM,
+comma-separated, and the comparison expression added to the
+WHERE clause.

Could we mention SQL92 somewhere?  Like maybe "This syntax pre-dates
the JOIN and ON keywords, which were introduced by SQL-92".  (That's a
"non-restrictive which", I think the clue is the comma?)




Re: Postgres crashes at memcopy() after upgrade to PG 13.

2021-03-14 Thread Peter Geoghegan
On Sun, Mar 14, 2021 at 6:54 PM Avinash Kumar
 wrote:
> Following may be helpful to understand what I meant.
>
> I have renamed the table and index names before adding it here.

It should be possible to run amcheck on your database, which will
detect corrupt posting list tuples on Postgres 13. It's a contrib
extension, so you must first run "CREATE EXTENSION amcheck;". From
there, you can run a query like the following (you may want to
customize this):

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

If this query takes too long to complete, you may find it useful to add
something to limit the indexes checked, such as: AND n.nspname =
'public' -- that change to the SQL will make the query test only
indexes from the public schema.

Do "SET client_min_messages=DEBUG1 " to get a kind of rudimentary
progress indicator, if that seems useful to you.

The docs have further information on what this bt_index_parent_check
function does, should you need it:
https://www.postgresql.org/docs/13/amcheck.html

-- 
Peter Geoghegan




Re: New IndexAM API controlling index vacuum strategies

2021-03-14 Thread Peter Geoghegan
On Thu, Mar 11, 2021 at 8:31 AM Robert Haas  wrote:
> But even if not, I'm not sure this
> helps much with the situation you're concerned about, which involves
> non-HOT tuples.

Attached is a POC-quality revision of Masahiko's
skip_index_vacuum.patch [1]. There is an improved design for skipping
index vacuuming (improved over the INDEX_CLEANUP stuff from Postgres
12). I'm particularly interested in your perspective on this
refactoring stuff, Robert, because you ran into the same issues after
initial commit of the INDEX_CLEANUP reloption feature [2] -- you ran
into issues with the "tupgone = true" special case. This is the case
where VACUUM considers a tuple dead that was not marked LP_DEAD by
pruning, and so needs to be killed in the second heap scan in
lazy_vacuum_heap() instead. You'll recall that these issues were fixed
by your commit dd695979888 from May 2019. I think that we need to go
further than you did in dd695979888 for this -- we ought to get rid of
the special case entirely.

ISTM that any new code that skips index vacuuming really ought to be
structured as a dynamic version of the "VACUUM (INDEX_CLEANUP OFF)"
mechanism. Or vice versa. The important thing is to recognize that
they're essentially the same thing, and to structure the code such
that they become exactly the same mechanism internally. That's not
trivial right now. But removing the awful "tupgone = true" special
case seems to buy us a lot -- it makes unifying everything relatively
straightforward. In particular, it makes it possible to delay the
decision to vacuum indexes until the last moment, which seems
essential to making index vacuuming optional. And so I have removed
the tupgone/XLOG_HEAP2_CLEANUP_INFO crud in the patch -- that's what
all of the changes relate to. This results in a net negative line
count, which is a nice bonus!

I've CC'd Noah, because my additions to this revision (of Masahiko's
patch) are loosely based on an abandoned 2013 patch from Noah [3] --
Noah didn't care for the "tupgone = true" special case either. I think
that it's fair to say that Tom doesn't much care for it either [4], or
at least was distressed by its lack of test coverage as of a couple of
years ago -- which is a problem that still exists today. Honestly, I'm
surprised that somebody else hasn't removed the code in question
already, long ago -- what possible argument can be made for it now?

This patch makes the "VACUUM (INDEX_CLEANUP OFF)" mechanism no longer
get invoked as if it was like the "no indexes on table so do it all in
one heap pass" optimization. This seems a lot clearer -- INDEX_CLEANUP
OFF isn't able to call lazy_vacuum_page() at all (for the obvious
reason), so any similarity between the two cases was always
superficial -- skipping index vacuuming should not be confused with
doing a one-pass VACUUM/having no indexes at all. The original
INDEX_CLEANUP structure (from commits a96c41fe and dd695979) always
seemed confusing to me for this reason, FWIW.

Note that I've merged multiple existing functions in vacuumlazy.c into
one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap()
into a single function named vacuum_indexes_mark_unused() (note also
that lazy_vacuum_page() has been renamed to mark_unused_page() to
reflect the fact that it is now strictly concerned with making LP_DEAD
line pointers LP_UNUSED). The big idea is that there is one choke
point that decides whether index vacuuming is needed at all at one
point in time, dynamically. vacuum_indexes_mark_unused() decides this
for us at the last moment. This can only happen during a VACUUM that
has enough memory to fit all TIDs -- otherwise we won't skip anything
dynamically.

We may in the future add additional criteria for skipping index
vacuuming. That can now just be added to the beginning of this new
vacuum_indexes_mark_unused() function. We may even teach
vacuum_indexes_mark_unused() to skip some indexes but not others in a
future release, a possibility that was already discussed at length
earlier in this thread. This new structure has all the context it
needs to do all of these things.

I wonder if we can add some kind of emergency anti-wraparound vacuum
logic to what I have here, for Postgres 14. Can we come up with logic
that has us skip index vacuuming because XID wraparound is on the
verge of causing an outage? That seems like a strategically important
thing for Postgres, so perhaps we should try to get something like
that in. Practically every post mortem blog post involving Postgres
also involves anti-wraparound vacuum.

One consequence of my approach is that we now call
lazy_cleanup_all_indexes(), even when we've skipped index vacuuming
itself. We should at least "check-in" with the indexes IMV. To an
index AM, this will be indistinguishable from a VACUUM that never had
tuples for it to delete, and so never called ambulkdelete() before
calling amvacuumcleanup().  This seems logical to me: why should there
be any significant behavioral divergence betwee

Re: cleanup temporary files after crash

2021-03-14 Thread Thomas Munro
On Wed, Mar 10, 2021 at 1:31 AM Michael Paquier  wrote:
> On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote:
> > Let's move this patch forward. Based on the responses, I agree the
> > default behavior should be to remove the temp files, and I think we
> > should have the GUC (on the off chance that someone wants to preserve
> > the temporary files for debugging or whatever other reason).
>
> Thanks for taking care of this.  I am having some second-thoughts
> about changing this behavior by default, still that's much more useful
> this way.

+1 for having it on by default.

I was also just looking at this patch and came here to say LGTM except
for two cosmetic things, below.

> > I propose to rename the GUC to remove_temp_files_after_crash, I think
> > "remove" is a bit clearer than "cleanup". I've also reworded the sgml
> > docs a little bit.
>
> "remove" sounds fine to me.

+1

> > Attached is a patch with those changes. Barring objections, I'll get
> > this committed in the next couple days.
>
> +When set to on, PostgreSQL will automatically
> Nit: using a <literal> markup for the "on" value.

Maybe it should say "which is the default", like other similar things?

> +#remove_temp_files_after_crash = on# remove temporary files after
> +#  # backend crash?
> The indentation of the second line is incorrect here (Incorrect number
> of spaces in tabs perhaps?), and there is no need for the '#' at the
> beginning of the line.

Yeah, that's wrong.  For some reason that one file uses a tab size of
8, unlike the rest of the tree (I guess because people will read that
file in software with the more common setting of 8).  If you do :set
tabstop=8 in vim, suddenly it all makes sense, but it is revealed that
this patch has it wrong, as you said.  (Perhaps this file should have
some of those special Vim/Emacs control messages so we don't keep
getting this wrong?)




Re: Postgres crashes at memcopy() after upgrade to PG 13.

2021-03-14 Thread Avinash Kumar
Hi,

On Sun, Mar 14, 2021 at 10:17 PM Thomas Munro 
wrote:

> [Dropping pgsql-general@ from the CC, because cross-posting triggers
> moderation; sorry I didn't notice that on my first reply]
>
> On Mon, Mar 15, 2021 at 2:05 PM Avinash Kumar
>  wrote:
> > On Sun, Mar 14, 2021 at 10:01 PM Avinash Kumar <
> avinash.vallar...@gmail.com> wrote:
> >> Also the datatype is bigint
>
> Ok.  Collation changes are the most common cause of index problems
> when upgrading OSes, but here we can rule that out if your index is on
> bigint.  So it seems like this is some other kind of corruption in
> your database, or a bug in the deduplication code.
>
I suspect the same.
When I tried to perform a pg_filedump to see the entry of the ID in the
index, it was strange that the entry did not exist in the Index. But, the
SELECT using an Index only scan was still working okay. I have chosen the
start and end page perfectly and there should not be any mistake there.

Following may be helpful to understand what I meant.

I have renamed the table and index names before adding it here.

=# select pg_size_pretty(pg_relation_size('idx_id_mtime')) as size,
relpages from pg_class where relname = 'idx_id_mtime';
 size  | relpages
---+--
 71 MB | 8439

=# select pg_relation_filepath('idx_id_mtime');
 pg_relation_filepath
--
 base/16404/346644309

=# \d+ idx_id_mtime
  Index "public.idx_id_mtime"
  Column   |   Type   | Key? | Definition | Storage | Stats
target
---+--+--++-+--
 sometable_id | bigint   | yes  | sometable_id  | plain   |
 mtime  | timestamp with time zone | yes  | mtime  | plain   |
btree, for table "public.sometable"

$ pg_filedump -R 1 8439 -D bigint,timestamp
/flash/berta13/base/16404/346644309 > 12345.txt

$ cat 12345.txt | grep -w 70334
--> No Output.

We don't see the entry for the ID: 70334 in the output of pg_filedump.
But, the SELECT statement is still using the same Index.

=*# EXPLAIN select * from sometable where sometable_id = 70334;
   QUERY PLAN


 Index Scan using idx_id_mtime on sometable  (cost=0.43..2.45 rows=1
width=869)
   Index Cond: (sometable_id = 70334)
(2 rows)

=*# EXPLAIN ANALYZE select * from sometable where sometable_id = 70334;
QUERY PLAN

--
 Index Scan using idx_id_mtime on sometable  (cost=0.43..2.45 rows=1
width=869) (actual time=0.166..0.168 rows=1 loops=1)
   Index Cond: (sometable_id = 70334)
 Planning Time: 0.154 ms
 Execution Time: 0.195 ms
(4 rows)

=*# update sometable set sometable_id = 70334 where sometable_id = 70334;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

Now, let us see the next ID. Here, the entry is visible in the output of
pg_filedump.

$ cat 12345.txt | grep -w 10819
COPY: 10819 2018-03-21 15:16:41.202277

The update still fails with the same error.

=*# update sometable set sometable_id = 10819 where sometable_id = 10819;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>


Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-03-14 Thread Masahiro Ikeda

On 2021/03/11 21:29, Masahiro Ikeda wrote:

On 2021-03-11 11:52, Fujii Masao wrote:

On 2021/03/11 9:38, Masahiro Ikeda wrote:

On 2021-03-10 17:08, Fujii Masao wrote:
IIUC the stats collector basically exits after checkpointer and 
walwriter exit.

But there seems no guarantee that the stats collector processes
all the messages that other processes have sent during the
shutdown of the server.


Thanks, I understood the above postmaster behaviors.

PMState manages the status, and after the checkpointer has exited, the
postmaster sends the SIGQUIT signal to the stats collector if the shutdown
mode is smart or fast.

(IIUC, although the postmaster kills the walsender, the archiver and
the stats collector at the same time, it's OK because the walsender
and the archiver don't send stats to the stats collector now.)

But there might be a corner case where stats sent by background workers like
the checkpointer just before they exit are lost (although this is not
implemented yet).


For example,

1. checkpointer sends the stats before it exits
2. stats collector receives the signal and breaks before processing
    the stats message from checkpointer. In this case, 1's message is lost.

3. stats collector writes the stats to the stats files and exits

Why don't you recheck that no message is pending just before the
2nd step?

(v17-0004-guarantee-to-collect-last-stats-messages.patch)


Yes, I was thinking the same. This is the straight-forward fix for 
this issue.

The stats collector should process all the outstanding messages when
normal shutdown is requested, as the patch does. On the other hand,
if immediate shutdown is requested or emergency bailout (by 
postmaster death)
is requested, maybe the stats collector should skip those processings
and exit immediately.

But if we implement that, we would need to teach the stats collector
the shutdown type (i.e., normal shutdown or immediate one). Because
currently SIGQUIT is sent to the collector whichever shutdown is 
requested,
and so the collector cannot distinguish the shutdown type. I'm afraid that
change is a bit overkill for now.

BTW, I found that the collector calls pgstat_write_statsfiles() even in
the emergency bailout case, before exiting. It's not necessary to save
the stats to the file in that case because subsequent server startup does
crash recovery and clears that stats file. So it's better to make
the collector exit immediately without calling pgstat_write_statsfiles()
in the emergency bailout case? Probably this should be discussed in another
thread because it's a different topic from the feature we're discussing here,
though.


IIUC, only the stats collector has another handler for SIGQUIT although
other background processes have a common handler for it and just call _exit(2).
I thought to guarantee that, when TerminateChildren(SIGTERM) is invoked, the
stats collector is not shut down before the other background processes are
shut down.

I will make another thread to discuss whether the stats collector should
know the shutdown type or not. If it should, it's better to make the stats
collector exit as soon as possible if the shutdown is an immediate one, and
to avoid losing the remaining stats if it's a normal one.


+1


I researched the past discussion related to writing the stats files when an
immediate shutdown is requested. I found the following thread ([1]), although
the discussion stopped on 12/1/2016.

IIUC, the thread's consensus is:

(1) Killing the stats collector quickly, without writing the stats file, is
needed in some cases, because otherwise it may take a long time until the
failover happens. Possible reasons are that the disk write speed is slow,
the stats files are big, and so on.


(2) The behavior of removing all stats files when the startup process does
crash recovery needs to change, because autovacuum uses the stats.

(3) It's OK for the stats collector to exit without calling
pgstat_write_statsfiles() if the stats file is written every X minutes (using
WAL or another mechanism) and the startup process can restore the stats with
slightly lower freshness.

(4) A way is needed to handle (2)'s stats file when deletion on PITR, a
rewind, or a stats collector crash happens.

So, I need to ping the threads. But I don't have any idea how to handle (4)
yet...


[1] 
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F5EF25A%40G01JPEXMBYT05









I measured the timing of the above on my Linux laptop using
v17-measure-timing.patch.
I don't have any strong opinion on handling this case, since the result shows
that receiving and processing the messages takes very little time (less than
1ms), although the stats collector receives the shutdown signal 5 msec
(099->104) after the checkpointer process exits.


Agreed.



```
1615421204.556 [checkpointer] DEBUG:  received shutdown request signal
1615421208.099 [checkpointer] DEBUG:  proc_exit(-1): 0 callbacks to make  # exit and send the mess

Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-03-14 Thread Masahiro Ikeda

On 2021-03-12 12:39, Fujii Masao wrote:

On 2021/03/11 21:29, Masahiro Ikeda wrote:

In addition, I rebased the patch for WAL receiver.
(v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)


Thanks! Will review this later.


Thanks a lot!


I read through the 0003 patch. Here are some comments for that.

With the patch, walreceiver's stats are counted as wal_write,
wal_sync, wal_write_time and wal_sync_time in pg_stat_wal. But they
should be counted as different columns because WAL IO is different
between walreceiver and other processes like a backend? For example, if
open_sync or open_datasync is chosen as wal_sync_method, those other
processes use the O_DIRECT flag to open WAL files, but walreceiver does
not. Also, those other processes write WAL data in block units,
but walreceiver does not. So I'm concerned that mixing different WAL
IO stats in the same columns would confuse the users. Thought? I'd
like to hear more opinions about how to expose walreceiver's stats to
users.


Thanks, I understood that get_sync_bit() checks the sync flags, and that
the write unit of generated WAL data and replicated WAL data is different.

(It's interesting optimization whether to use kernel cache or not.)

OK. Although I agree to separate the stats for the walreceiver,
I want to hear opinions from other people too. I didn't change the patch.


Please feel free to send your comments.





+int
+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset)

This common function writes WAL data and measures IO timing. IMO we
can refactor the code further by making this function handle the
case where pg_pwrite() reports an error. In other words, I think that
the function should do what the do-while loop block in XLogWrite() does.
Thought?


OK. I agree.

I wonder whether we should change the error checks depending on who calls
this function. Now, only the walreceiver checks (1) errno==0 and doesn't
check (2) errno==EINTR.

Other processes are the opposite.

IIUC, it's appropriate for every process to check both (1) and (2).
Please let me know if my understanding is wrong.




BTW, currently XLogWrite() increments IO timing even when pg_pwrite()
reports an error. But this is useless. Probably IO timing should be
incremented after the return code of pg_pwrite() is checked, instead?


Yes, I agree. I fixed it.
(v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
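
For illustration, here is a minimal sketch of the kind of common write helper
being discussed, modeled on the do-while block that the attached diff removes
from XLogWrite(). It is a sketch, not the v18 patch itself: it retries partial
writes and EINTR, counts the I/O timing only after pg_pwrite() succeeds, and
leaves error reporting to the caller, which knows the segment and timeline for
a useful message.

#include "postgres.h"

#include <unistd.h>

#include "access/xlog.h"              /* track_wal_io_timing */
#include "pgstat.h"                   /* WalStats */
#include "portability/instr_time.h"
#include "utils/wait_event.h"         /* WAIT_EVENT_WAL_WRITE */

static int
XLogWriteFileSketch(int fd, const char *buf, size_t nbytes, off_t offset)
{
    while (nbytes > 0)
    {
        instr_time  start;
        ssize_t     written;

        errno = 0;

        /* Measure I/O timing to write WAL data, if enabled */
        if (track_wal_io_timing)
            INSTR_TIME_SET_CURRENT(start);

        pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
        written = pg_pwrite(fd, buf, nbytes, offset);
        pgstat_report_wait_end();

        if (written <= 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted: retry the same chunk */
            return -1;          /* caller reports the error with context */
        }

        /* Count the timing and the write only after pg_pwrite() succeeded */
        if (track_wal_io_timing)
        {
            instr_time  duration;

            INSTR_TIME_SET_CURRENT(duration);
            INSTR_TIME_SUBTRACT(duration, start);
            WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
        }
        WalStats.m_wal_write++;

        buf += written;
        nbytes -= written;
        offset += written;
    }

    return 0;
}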



BTW, thanks for your comments in person that the bgwriter process will
generate WAL data.
I checked that it generates WAL to take a snapshot via LogStandbySnapshot().

I attached the patch for bgwriter to send the wal stats.
(v18-0005-send-stats-for-bgwriter-when-shutdown.patch)

This patch includes the following changes.

(1) introduce pgstat_send_bgwriter(), a mechanism to send the stats
only if PGSTAT_STAT_INTERVAL msec has passed, like pgstat_send_wal(),
to avoid overloading the stats collector because "bgwriter_delay"
can be set to 10 msec or more (a minimal sketch of this throttling
pattern follows below).

(2) remove pgstat_report_wal() and integrate it with pgstat_send_wal(),
because bgwriter sends WalStats.m_wal_records and to avoid overloading
(see (1)). IIUC, although the benefit of keeping them separate is to
reduce the calculation cost of WalUsageAccumDiff(), the impact is limited.

(3) make a new signal handler for bgwriter to force sending the remaining
stats during shutdown because of (1), and remove HandleMainLoopInterrupts()
because there are no processes left that use it.
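
A minimal sketch of the throttling pattern described in (1), modeled on how
pgstat_send_wal() is described above. PGSTAT_STAT_INTERVAL and
PgStat_MsgBgWriter are existing pgstat definitions; everything else here is
illustrative, not the actual v18-0005 patch.

#include "postgres.h"

#include "pgstat.h"
#include "utils/timestamp.h"

static void
pgstat_send_bgwriter_sketch(bool force)
{
    static TimestampTz last_report = 0;
    TimestampTz now = GetCurrentTimestamp();

    /* Send at most once per PGSTAT_STAT_INTERVAL msec, unless forced. */
    if (!force &&
        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
        return;

    last_report = now;

    /* ... fill in and send the PgStat_MsgBgWriter message here ... */
}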



Regards,
--
Masahiro Ikeda
NTT DATA CORPORATIONdiff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9b2eb0d10b..c7bda9127f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2536,7 +2536,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2544,49 +2543,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			nleft = nbytes;
 			do
 			{
-errno = 0;
+written = XLogWriteFile(openLogFile, from, nleft, (off_t) startoffset,
+		ThisTimeLineID, openLogSegNo, wal_segment_size);
 
-/* Measure I/O timing to write WAL data */
-if (track_wal_io_timing)
-	INSTR_TIME_SET_CURRENT(start);
-
-pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-written = pg_pwrite(openLogFile, from, nleft, startoffset);
-pgstat_report_wait_end();
-
-/*
- * Increment the I/O timing and the number of times WAL data
- * were written out to disk.
- */
-if (track_wal_io_timing)
-{
-	instr_time	duration;
-
-	INSTR_TIME_SET_CURRENT(duration);
-	INSTR_TIME_SUBTRACT(duration, start);
-	WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
-}
-
-WalStats.m_wal_write++;
-
-if (written <= 0)
-{
-	char		xlogfname[MAXFNAMELEN];
-	int			save_errno;
-
-	if (errno == EINTR)
-		continue;
-
-	save_errno = errno;
-	XLogFileName(xlogfn

Re: Collation versioning

2021-03-14 Thread Thomas Munro
On Fri, Nov 6, 2020 at 10:56 AM Thomas Munro  wrote:
> On Wed, Nov 4, 2020 at 9:11 PM Michael Paquier  wrote:
> > On Wed, Nov 04, 2020 at 08:44:15AM +0100, Juan José Santamaría Flecha wrote:
> > >  We could create a static table with the conversion based on what was
> > > discussed for commit a169155, please find attached a spreadsheet with the
> > > comparison. This would require maintenance as new LCIDs are released [1].

> > I am honestly not a fan of something like that as it has good chances
> > to rot.

> No opinion on that, other than that we'd surely want a machine
> readable version.  As for *when* we use that information, I'm
> wondering if it would make sense to convert datcollate to a language
> tag in initdb, and also change pg_upgrade's equivalent_locale()
> function to consider "English_United States.*" and "en-US" to be
> equivalent when upgrading to 14 (which would then be the only point
> you'd ever have to have faith that we can convert the old style names
> to the new names correctly).  I'm unlikely to work on this myself as I
> have other operating systems to fix, but I'll certainly be happy if
> somehow we can get versioning for default on Windows in PG14 and not
> have to come up with weasel words in the manual.

FYI I have added this as an open item for PostgreSQL 14.  My default
action will be to document this limitation, if we can't come up with
something better in time.




Re: Postgres crashes at memcopy() after upgrade to PG 13.

2021-03-14 Thread Thomas Munro
[Dropping pgsql-general@ from the CC, because cross-posting triggers
moderation; sorry I didn't notice that on my first reply]

On Mon, Mar 15, 2021 at 2:05 PM Avinash Kumar
 wrote:
> On Sun, Mar 14, 2021 at 10:01 PM Avinash Kumar  
> wrote:
>> Also the datatype is bigint

Ok.  Collation changes are the most common cause of index problems
when upgrading OSes, but here we can rule that out if your index is on
bigint.  So it seems like this is some other kind of corruption in
your database, or a bug in the deduplication code.




Re: Different compression methods for FPI

2021-03-14 Thread Justin Pryzby
@cfbot: Resending without duplicate patches
>From 581884bb12f666a1eb6a7f4bd769b1edeba4c563 Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Sat, 27 Feb 2021 09:03:50 +0500
Subject: [PATCH 01/10] Allow alternate compression methods for wal_compression

TODO: bump XLOG_PAGE_MAGIC
---
 doc/src/sgml/config.sgml  | 17 +
 src/backend/Makefile  |  2 +-
 src/backend/access/transam/xlog.c | 10 +++
 src/backend/access/transam/xloginsert.c   | 52 +--
 src/backend/access/transam/xlogreader.c   | 63 ++-
 src/backend/utils/misc/guc.c  | 11 
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h |  1 +
 src/include/access/xlog_internal.h|  8 +++
 src/include/access/xlogreader.h   |  1 +
 src/include/access/xlogrecord.h   |  9 +--
 11 files changed, 163 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a218d78bef..7fb2a84626 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3072,6 +3072,23 @@ include_dir 'conf.d'
   
  
 
+ 
+  wal_compression_method (enum)
+  
+   wal_compression_method configuration parameter
+  
+  
+  
+   
+This parameter selects the compression method used to compress WAL when
+wal_compression is enabled.
+The supported methods are pglz and zlib.
+The default value is pglz.
+Only superusers can change this setting.
+   
+  
+ 
+
  
   wal_init_zero (boolean)
   
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 0da848b1fd..3af216ddfc 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -48,7 +48,7 @@ OBJS = \
 LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS)
 
 # The backend doesn't need everything that's in LIBS, however
-LIBS := $(filter-out -lz -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
+LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
 
 ifeq ($(with_systemd),yes)
 LIBS += -lsystemd
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f4d1ce5dea..15da91a8dd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -99,6 +99,7 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+int			wal_compression_method = WAL_COMPRESSION_PGLZ;
 char	   *wal_consistency_checking_string = NULL;
 bool	   *wal_consistency_checking = NULL;
 bool		wal_init_zero = true;
@@ -180,6 +181,15 @@ const struct config_enum_entry recovery_target_action_options[] = {
 	{NULL, 0, false}
 };
 
+/* Note that due to conditional compilation, offsets within the array are not static */
+const struct config_enum_entry wal_compression_options[] = {
+	{"pglz", WAL_COMPRESSION_PGLZ, false},
+#ifdef  HAVE_LIBZ
+	{"zlib", WAL_COMPRESSION_ZLIB, false},
+#endif
+	{NULL, 0, false}
+};
+
 /*
  * Statistics for current checkpoint are collected in this global struct.
  * Because only the checkpointer or a stand-alone backend can perform
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245e..34e1227381 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -33,6 +33,10 @@
 #include "storage/proc.h"
 #include "utils/memutils.h"
 
+#ifdef HAVE_LIBZ
+#include 
+#endif
+
 /* Buffer size required to store a compressed version of backup block image */
 #define PGLZ_MAX_BLCKSZ PGLZ_MAX_OUTPUT(BLCKSZ)
 
@@ -113,7 +117,8 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
 	   XLogRecPtr RedoRecPtr, bool doPageWrites,
 	   XLogRecPtr *fpw_lsn, int *num_fpi);
 static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
-	uint16 hole_length, char *dest, uint16 *dlen);
+	uint16 hole_length, char *dest,
+	uint16 *dlen, WalCompression compression);
 
 /*
  * Begin constructing a WAL record. This must be called before the
@@ -630,11 +635,12 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 			 */
 			if (wal_compression)
 			{
+bimg.compression_method = wal_compression_method;
 is_compressed =
 	XLogCompressBackupBlock(page, bimg.hole_offset,
 			cbimg.hole_length,
 			regbuf->compressed_page,
-			&compressed_len);
+			&compressed_len, bimg.compression_method);
 			}
 
 			/*
@@ -827,7 +833,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
  */
 static bool
 XLogCompressBackupBlock(char *page, uint16 hole_offset, uint16 hole_length,
-		char *dest, uint16 *dlen)
+		char *dest, uint16 *dlen, WalCompression compression)
 {
 	int32		orig_len = BLCKSZ - hole_length;
 	int32		len;
@@ -853,12 +859,48 @@ XLogCompressBackupBlock(char *page, ui

Re: REINDEX backend filtering

2021-03-14 Thread Julien Rouhaud
Hi Mark,

On Sun, Mar 14, 2021 at 05:01:20PM -0700, Mark Dilger wrote:
> 
> > On Mar 14, 2021, at 12:10 AM, Julien Rouhaud  wrote:
> 
> I'm coming to this patch quite late, perhaps too late to change design 
> decision in time for version 14.

Thanks for looking at it!

> + if (outdated && PQserverVersion(conn) < 14)
> + {
> + PQfinish(conn);
> + pg_log_error("cannot use the \"%s\" option on server versions 
> older than PostgreSQL %s",
> +  "outdated", "14");
> + exit(1);
> + }
> 
> If detection of outdated indexes were performed entirely in the frontend 
> (reindexdb) rather than in the backend (reindex command), would reindexdb be 
> able to connect to older servers?  Looking quickly that the catalogs, it 
> appears pg_index, pg_depend, pg_collation and a call to the SQL function 
> pg_collation_actual_version() compared against pg_depend.refobjversion would 
> be enough to construct a list of indexes in need of reindexing.  Am I missing 
> something here?

There won't be any need to connect to older servers if the patch is committed
in this commitfest, as refobjversion was also added in pg14.

> I understand that wouldn't help somebody wanting to reindex from psql.  Is 
> that the whole reason you went a different direction with this feature?

This was already discussed with Magnus and Michael.  The main reasons for that
are:

- no need for a whole new infrastructure to be able to process a list of
  indexes in parallel which would be required if getting the list of indexes in
  the client

- if done in the backend, then the ability is immediately available for all
  user scripts, compared to the burden of writing the needed query (with the
  usual caveats like quoting, qualifying all objects if the search_path isn't
  safe and such) and looping though all the results.

> + printf(_("  --outdated   only process indexes having 
> outdated depencies\n"));  
> 
> typo.
> 
> + bool outdated;  /* depends on at least on deprected collation? */
> 
> typo.

Thanks! I'll fix those.




Re: REINDEX backend filtering

2021-03-14 Thread Julien Rouhaud
On Mon, Mar 15, 2021 at 08:56:00AM +0900, Michael Paquier wrote:
> On Sun, Mar 14, 2021 at 10:57:37PM +0800, Julien Rouhaud wrote:
> > 
> > Is there really a use case for reindexing a specific index and at the same 
> > time
> > asking for possibly ignoring it?  I think we should just forbid REINDEX
> > (OUTDATED) INDEX.  What do you think?
> 
> I think that there would be cases to be able to handle that, say if a
> user wants to works on a specific set of indexes one-by-one.

If a user wants to work on a specific set of indexes one at a time, then the
list of indexes is probably already retrieved from some SQL query, and there's
already all the necessary infrastructure to properly filter the non-outdated
indexes.

> There is
> also the argument of inconsistency with the other commands.

Yes, but the other commands dynamically construct a list of indexes.

The only use case I see would be to process a partitioned index if some of the
underlying indexes have already been processed.  IMO this is better addressed
by REINDEX TABLE.

Anyway, I'll make REINDEX (OUTDATED) INDEX possibly reindex the explicitly
stated index name, since you think that's a better behavior.

> 
> > I was thinking that users would be more interested in the list of indexes 
> > being
> > processed rather than the full list of indexes and a mention of whether it 
> > was
> > processed or not.  I can change that if you prefer.
> 
> How invasive do you think it would be to add a note in the verbose
> output when indexes are skipped?

Probably not too invasive, but the verbose output is already inconsistent:

# reindex (verbose) table tt;
NOTICE:  0: table "tt" has no indexes to reindex

But a REINDEX (VERBOSE) DATABASE won't emit such a message.  I'm assuming that
it's because it doesn't make sense to warn in that case, as the user didn't
explicitly specify the table name.  We have the same behavior for now when
using the OUTDATED option if no indexes are processed.  Should that be changed
too?

> > Did you mean index_has_outdated_collation() and
> > index_has_outdated_dependency()?  It's just to keep things separated, mostly
> > for future improvements on that infrastructure.  I can get rid of that 
> > function
> > and put back the code in index_has_outadted_dependency() if that's overkill.
> 
> Yes, sorry.  I meant index_has_outdated_collation() and
> index_has_outdated_dependency().

And are you ok with this function?




Re: Different compression methods for FPI

2021-03-14 Thread Justin Pryzby
On Sat, Mar 13, 2021 at 08:48:33PM +0500, Andrey Borodin wrote:
> > 13 марта 2021 г., в 06:28, Justin Pryzby  написал(а):
> > Updated patch with a minor fix to configure.ac to avoid warnings on OSX.
> > And 2ndary patches from another thread to allow passing recovery tests.
> > Renamed to WAL_COMPRESSION_*
> > Split LZ4 support to a separate patch and support zstd.  These show that
> > changes needed for a new compression method have been minimized, although 
> > not
> > yet moved to a separate, abstracted compression/decompression function.
> 
> Thanks! Awesome work!
> 
> > These two patches are a prerequisite for this patch to progress:
> > * Run 011_crash_recovery.pl with wal_level=minimal
> > * Make sure published XIDs are persistent
> > 
> > I don't know if anyone will consider this patch for v14 - if not, it should 
> > be
> > set to v15 and revisit in a month.  
> 
> I want to note, that fixes for 011_crash_recovery.pl are not strictly 
> necessary for this patch set.
> The problem in tests arises if we turn on wal_compression, absolutely 
> independently from wal compression method.
> We turn on wal_compression in this test only for CI purposes.

I rearranged the patches to reflect this.
Change to zlib and zstd to level=1.
Add support for negative "zstd fast" levels.
Use correct length accounting for "hole" in LZ4 and ZSTD.
Fixed Solution.pm for zstd on windows.
Switch to zstd by default (for CI).
Add docs.

It occurred to me that the generic "compression API" might be of more
significance if we supported compression of all WAL (not just FPI).  I imagine
that might use streaming compression.
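
For illustration, here is a rough sketch of the per-method dispatch that such
a compression API implies. The WAL_COMPRESSION_* values mirror the patch above
(they are redeclared locally so the sketch stands alone), and the use of
zlib's one-shot compress2() at level 1 is an assumption made for this sketch
rather than necessarily what the patch does.

#include "postgres.h"

#include "common/pg_lzcompress.h"
#ifdef HAVE_LIBZ
#include <zlib.h>
#endif

typedef enum WalCompressionSketch
{
    WAL_COMPRESSION_PGLZ,       /* mirrors the patch's enum values */
    WAL_COMPRESSION_ZLIB
} WalCompressionSketch;

/*
 * Compress slen bytes from source into dest (destlen bytes available; for
 * pglz the caller must provide at least PGLZ_MAX_OUTPUT(slen)).  Returns the
 * compressed length, or -1 if compression failed or did not help.
 */
static int32
compress_fpi_sketch(const char *source, int32 slen,
                    char *dest, int32 destlen, int method)
{
    switch (method)
    {
        case WAL_COMPRESSION_PGLZ:
            return pglz_compress(source, slen, dest, PGLZ_strategy_default);

#ifdef HAVE_LIBZ
        case WAL_COMPRESSION_ZLIB:
            {
                uLongf      len = (uLongf) destlen;

                if (compress2((Bytef *) dest, &len, (const Bytef *) source,
                              (uLong) slen, 1) != Z_OK)
                    return -1;
                return (int32) len;
            }
#endif

        default:
            return -1;
    }
}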

-- 
Justin
>From 9288683238a72ae379d1b444e497285de5c8c28a Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Sat, 27 Feb 2021 09:03:50 +0500
Subject: [PATCH 01/10] Allow alternate compression methods for wal_compression

TODO: bump XLOG_PAGE_MAGIC
---
 doc/src/sgml/config.sgml  | 17 +
 src/backend/Makefile  |  2 +-
 src/backend/access/transam/xlog.c | 10 +++
 src/backend/access/transam/xloginsert.c   | 52 +--
 src/backend/access/transam/xlogreader.c   | 63 ++-
 src/backend/utils/misc/guc.c  | 11 
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h |  1 +
 src/include/access/xlog_internal.h|  8 +++
 src/include/access/xlogreader.h   |  1 +
 src/include/access/xlogrecord.h   |  9 +--
 11 files changed, 163 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a218d78bef..7fb2a84626 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3072,6 +3072,23 @@ include_dir 'conf.d'
   
  
 
+ 
+  wal_compression_method (enum)
+  
+   wal_compression_method configuration parameter
+  
+  
+  
+   
+This parameter selects the compression method used to compress WAL when
+wal_compression is enabled.
+The supported methods are pglz and zlib.
+The default value is pglz.
+Only superusers can change this setting.
+   
+  
+ 
+
  
   wal_init_zero (boolean)
   
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 0da848b1fd..3af216ddfc 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -48,7 +48,7 @@ OBJS = \
 LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS)
 
 # The backend doesn't need everything that's in LIBS, however
-LIBS := $(filter-out -lz -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
+LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
 
 ifeq ($(with_systemd),yes)
 LIBS += -lsystemd
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e04250f4e9..04192b7add 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -99,6 +99,7 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+int			wal_compression_method = WAL_COMPRESSION_PGLZ;
 char	   *wal_consistency_checking_string = NULL;
 bool	   *wal_consistency_checking = NULL;
 bool		wal_init_zero = true;
@@ -180,6 +181,15 @@ const struct config_enum_entry recovery_target_action_options[] = {
 	{NULL, 0, false}
 };
 
+/* Note that due to conditional compilation, offsets within the array are not static */
+const struct config_enum_entry wal_compression_options[] = {
+	{"pglz", WAL_COMPRESSION_PGLZ, false},
+#ifdef  HAVE_LIBZ
+	{"zlib", WAL_COMPRESSION_ZLIB, false},
+#endif
+	{NULL, 0, false}
+};
+
 /*
  * Statistics for current checkpoint are collected in this global struct.
  * Because only the checkpointer or a stand-alone backend can perform
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245e..34e1227381 100644
--

Postgres crashes at memcopy() after upgrade to PG 13.

2021-03-14 Thread Avinash Kumar
Hi ,

In one of the environments, using pg_upgrade with hard links, PostgreSQL 12
has been upgraded to PostgreSQL 13.1. The OS was Ubuntu 16.04.7 LTS (Xenial
Xerus). pg_repack was used to rebuild all the tables across the database
right after the upgrade to PG 13.

A new server with Ubuntu 20.04.1 LTS was later provisioned. Streaming
replication was set up from the Old Server running on Ubuntu 16 to New
Server on Ubuntu 20 - same PG versions 13.1.

Replication was running fine, but after the failover to the New Server, an
UPDATE on a few random rows (not on the same page) was causing a segmentation
fault and crashing Postgres.
Selecting the records using the Index or directly from the table works
absolutely fine. But, when the same records are updated, it gets into the
following error.

2021-03-12 17:20:01.979 CET p#7 s#604b8fa9.7 t#0 LOG:  terminating any
other active server processes
2021-03-12 17:20:01.979 CET p#41 s#604b9212.29 t#0 WARNING:  terminating
connection because of crash of another server process
2021-03-12 17:20:01.979 CET p#41 s#604b9212.29 t#0 DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2021-03-12 17:20:01.979 CET p#41 s#604b9212.29 t#0 HINT:  In a moment you
should be able to reconnect to the database and repeat your command.

gdb backtrace looks like following with the debug symbols.

(gdb) bt
#0  __memmove_avx_unaligned_erms () at
../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:533
#1  0x55b72761c370 in memmove (__len=,
__src=0x55b72930e9c7, __dest=)
at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:40
#2  _bt_swap_posting (newitem=newitem@entry=0x55b7292010c0,
oposting=oposting@entry=0x7f3b46f94778,
postingoff=postingoff@entry=2) at
./build/../src/backend/access/nbtree/nbtdedup.c:796
#3  0x55b72761d40b in _bt_insertonpg (rel=0x7f3acd8a49c0,
itup_key=0x55b7292bc6a8, buf=507, cbuf=0, stack=0x55b7292d5f98,
itup=0x55b7292010c0, itemsz=32, newitemoff=48, postingoff=2,
split_only_page=false)
at ./build/../src/backend/access/nbtree/nbtinsert.c:1167
#4  0x55b72761eae9 in _bt_doinsert (rel=rel@entry=0x7f3acd8a49c0,
itup=itup@entry=0x55b7292bc848,
checkUnique=checkUnique@entry=UNIQUE_CHECK_NO, heapRel=heapRel@entry
=0x7f3acd894f70)
at ./build/../src/backend/access/nbtree/nbtinsert.c:1009
#5  0x55b727621e2e in btinsert (rel=0x7f3acd8a49c0, values=, isnull=, ht_ctid=0x55b7292d4578,
heapRel=0x7f3acd894f70, checkUnique=UNIQUE_CHECK_NO,
indexInfo=0x55b7292bc238)
at ./build/../src/backend/access/nbtree/nbtree.c:210
#6  0x55b727757487 in ExecInsertIndexTuples
(slot=slot@entry=0x55b7292d4548,
estate=estate@entry=0x55b7291ff1f8,
noDupErr=noDupErr@entry=false, specConflict=specConflict@entry=0x0,
arbiterIndexes=arbiterIndexes@entry=0x0)
at ./build/../src/backend/executor/execIndexing.c:393
#7  0x55b7277807a8 in ExecUpdate (mtstate=0x55b7292bb2c8,
tupleid=0x7fff45ea318a, oldtuple=0x0, slot=0x55b7292d4548,
planSlot=0x55b7292c04e8, epqstate=0x55b7292bb3c0,
estate=0x55b7291ff1f8, canSetTag=true)
at ./build/../src/backend/executor/nodeModifyTable.c:1479
#8  0x55b727781655 in ExecModifyTable (pstate=0x55b7292bb2c8) at
./build/../src/backend/executor/nodeModifyTable.c:2253
#9  0x55b727758424 in ExecProcNode (node=0x55b7292bb2c8) at
./build/../src/include/executor/executor.h:248
#10 ExecutePlan (execute_once=, dest=0x55b7292c1728,
direction=, numberTuples=0,
sendTuples=, operation=CMD_UPDATE,
use_parallel_mode=, planstate=0x55b7292bb2c8,
estate=0x55b7291ff1f8) at
./build/../src/backend/executor/execMain.c:1632
#11 standard_ExecutorRun (queryDesc=0x55b7292ba578, direction=, count=0, execute_once=)
at ./build/../src/backend/executor/execMain.c:350
#12 0x55b7278bebf7 in ProcessQuery (plan=,
sourceText=0x55b72919efa8 "\031)\267U", params=0x0, queryEnv=0x0,
dest=0x55b7292c1728, qc=0x7fff45ea34c0) at
./build/../src/backend/tcop/pquery.c:160
#13 0x55b7278bedf9 in PortalRunMulti (portal=portal@entry=0x55b729254128,
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55b7292c1728,
altdest=altdest@entry=0x55b7292c1728,
qc=qc@entry=0x7fff45ea34c0) at ./build/../src/backend/tcop/pquery.c:1265
#14 0x55b7278bf847 in PortalRun (portal=portal@entry=0x55b729254128,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55b7292c1728,
--Type  for more, q to quit, c to continue without paging--


Is this expected when replication is happening between PostgreSQL databases
hosted on different OS versions like Ubuntu 16 and Ubuntu 20? Or do we
think this is some sort of corruption?

-- 
Regards,
Avi.


Re: A qsort template

2021-03-14 Thread Thomas Munro
On Sun, Mar 14, 2021 at 5:03 PM Zhihong Yu  wrote:
> For 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch :
>
> + * Remove duplicates from an array.  Return the new size.
> + */
> +ST_SCOPE size_t
> +ST_UNIQUE(ST_ELEMENT_TYPE *array,
>
> The array is supposed to be sorted, right ?
> The comment should mention this.

Good point, will update.  Thanks!




Re: REINDEX backend filtering

2021-03-14 Thread Mark Dilger



> On Mar 14, 2021, at 12:10 AM, Julien Rouhaud  wrote:
> 
> v6 attached, rebase only due to conflict with recent commit.

Hi Julien,

I'm coming to this patch quite late, perhaps too late to change design decisions
in time for version 14.


+   if (outdated && PQserverVersion(conn) < 14)
+   {
+   PQfinish(conn);
+   pg_log_error("cannot use the \"%s\" option on server versions 
older than PostgreSQL %s",
+"outdated", "14");
+   exit(1);
+   }

If detection of outdated indexes were performed entirely in the frontend 
(reindexdb) rather than in the backend (reindex command), would reindexdb be 
able to connect to older servers?  Looking quickly at the catalogs, it 
appears pg_index, pg_depend, pg_collation and a call to the SQL function 
pg_collation_actual_version() compared against pg_depend.refobjversion would be 
enough to construct a list of indexes in need of reindexing.  Am I missing 
something here?

I understand that wouldn't help somebody wanting to reindex from psql.  Is that 
the whole reason you went a different direction with this feature?



+   printf(_("  --outdated   only process indexes having 
outdated depencies\n"));  

typo.



+   bool outdated;  /* depends on at least on deprected collation? */

typo.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: REINDEX backend filtering

2021-03-14 Thread Michael Paquier
On Sun, Mar 14, 2021 at 10:57:37PM +0800, Julien Rouhaud wrote:
> On Sun, Mar 14, 2021 at 08:54:11PM +0900, Michael Paquier wrote:
>> In ReindexRelationConcurrently(), there is no filtering done for the
>> index themselves.  The operation is only done on the list of indexes
>> fetched from the parent relation.  Why?  This means that a REINDEX
>> (OUTDATED) INDEX would actually rebuild an index even if this index
>> has no out-of-date collations, like a catalog.  I think that's
>> confusing.
>> 
>> Same comment for the non-concurrent case, as of the code paths of
>> reindex_index().
> 
> Yes, I'm not sure what we should do in that case.  I thought I put a comment
> about that but it apparently disappeared during some rewrite.
> 
> Is there really a use case for reindexing a specific index and at the same 
> time
> asking for possibly ignoring it?  I think we should just forbid REINDEX
> (OUTDATED) INDEX.  What do you think?

I think that there would be cases where being able to handle that is useful,
say if a user wants to work on a specific set of indexes one-by-one.  There is
also the argument of inconsistency with the other commands.

> I was thinking that users would be more interested in the list of indexes 
> being
> processed rather than the full list of indexes and a mention of whether it was
> processed or not.  I can change that if you prefer.

How invasive do you think it would be to add a note in the verbose
output when indexes are skipped?

> Did you mean index_has_outdated_collation() and
> index_has_outdated_dependency()?  It's just to keep things separated, mostly
> for future improvements on that infrastructure.  I can get rid of that 
> function
> and put back the code in index_has_outadted_dependency() if that's overkill.

Yes, sorry.  I meant index_has_outdated_collation() and
index_has_outdated_dependency().
--
Michael


signature.asc
Description: PGP signature


Re: fdatasync performance problem with large number of DB files

2021-03-14 Thread Thomas Munro
On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro  wrote:
> Time being of the essence, here is the patch I posted last year, this
> time with a GUC and some docs.  You can set sync_after_crash to
> "fsync" (default) or "syncfs" if you have it.

Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos.
From 5b4a97bd99ba0ce789770c1dea48ea5544e175c7 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 27 Sep 2020 17:22:10 +1300
Subject: [PATCH v3] Optionally use syncfs() for SyncDataDirectory() on Linux.

Opening every file can be slow, as the first step before crash recovery
can begin.  Provide an alternative method, where we call syncfs() on
every possibly different filesystem under the data directory.  This
avoids faulting in inodes for potentially very many inodes.

The option can be controlled with a new GUC:

sync_after_crash={fsync,syncfs}

Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 configure |  2 +-
 configure.ac  |  1 +
 doc/src/sgml/config.sgml  | 28 +
 src/backend/storage/file/fd.c | 62 ++-
 src/backend/utils/misc/guc.c  | 17 +
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/pg_config.h.in|  3 +
 src/include/storage/fd.h  |  8 +++
 src/tools/msvc/Solution.pm|  1 +
 9 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 3fd4cecbeb..6a2051da9d 100755
--- a/configure
+++ b/configure
@@ -15239,7 +15239,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2f1585adc0..dd819ab2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strchrnul
 	strsignal
 	symlink
+	syncfs
 	sync_file_range
 	uselocale
 	wcstombs_l
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a218d78bef..cc3843ac81 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9679,6 +9679,34 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
   
  
 
+ 
+  sync_after_crash (enum)
+   
+sync_after_crash configuration parameter
+   
+  
+  
+   
+When set to fsync, which is the default,
+PostgreSQL will recursively open and fsync
+all files in the data directory before crash recovery begins.  This
+is intended to make sure that all WAL and data files are durably stored
+on disk before replaying changes.
+   
+   
+On Linux, syncfs may be used instead, to ask the
+operating system to flush the whole file system.  This may be a lot
+faster, because it doesn't need to open each file one by one.  On the
+other hand, it may be slower if the file system is shared by other
+applications that modify a lot of files, since those files will also
+be flushed to disk.  Furthermore, on versions of Linux before 5.8, I/O
+errors encountered while writing data to disk may not be reported to
+PostgreSQL, and relevant error messages may
+appear only in kernel logs.
+   
+  
+ 
+
 
 

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b58502837a..461a1a5b07 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -72,9 +72,11 @@
 
 #include "postgres.h"
 
+#include 
 #include 
 #include 
 #include 
+#include 
 #ifndef WIN32
 #include 
 #endif
@@ -158,6 +160,9 @@ int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
+/* How SyncDataDirectory() should do its job. */
+int			sync_after_crash = SYNC_AFTER_CRASH_FSYNC;
+
 /* Debugging */
 
 #ifdef FDDEBUG
@@ -3263,9 +3268,27 @@ looks_like_temp_rel_name(const char *name)
 	return true;
 }
 
+#ifde

Re: Boundary value check in lazy_tid_reaped()

2021-03-14 Thread Thomas Munro
On Wed, Sep 9, 2020 at 7:33 AM Peter Geoghegan  wrote:
> On Tue, Sep 8, 2020 at 1:23 AM Masahiko Sawada
>  wrote:
> > > > I wonder if you would also see a speed-up with a bsearch() replacement
> > > > that is inlineable, so it can inline the comparator (instead of
> > > > calling it through a function pointer).  I wonder if something more
> > > > like (lblk << 32 | loff) - (rblk << 32 | roff) would go faster than
> > > > the branchy comparator.
> > >
> > > Erm, off course that expression won't work... should be << 16, but
> > > even then it would only work with a bsearch that uses int64_t
> > > comparators, so I take that part back.
> >
> > Yeah, it seems to worth benchmarking the speed-up with an inlining.
> > I'll do some performance tests with/without inlining on top of
> > checking boundary values.
>
> It sounds like Thomas was talking about something like
> itemptr_encode() + itemptr_decode(). In case you didn't know, we
> actually do something like this for the TID tuplesort used for CREATE
> INDEX CONCURRENTLY.

BTW I got around to trying this idea out for a specialised
bsearch_itemptr() using a wide comparator, over here:

https://www.postgresql.org/message-id/CA%2BhUKGKztHEWm676csTFjYzortziWmOcf8HDss2Zr0muZ2xfEg%40mail.gmail.com
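
To make the quoted encoding idea concrete, here is a small sketch of a wide,
inlineable TID comparator using (block << 16) | offset; the specialised
bsearch in the linked message may differ in the details.

#include "postgres.h"

#include "storage/itemptr.h"

/*
 * Encode a TID as one int64.  OffsetNumber is 16 bits, so
 * (block << 16) | offset preserves the (block, offset) sort order.
 */
static inline int64
tid_encode(ItemPointer tid)
{
    return ((int64) ItemPointerGetBlockNumber(tid) << 16) |
        (int64) ItemPointerGetOffsetNumber(tid);
}

/*
 * Inlineable comparator: no function-pointer call through bsearch(), and the
 * comparison itself is branch-free.
 */
static inline int
tid_cmp(ItemPointer a, ItemPointer b)
{
    int64       ka = tid_encode(a);
    int64       kb = tid_encode(b);

    return (ka > kb) - (ka < kb);
}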




Re: fdatasync performance problem with large number of DB files

2021-03-14 Thread Thomas Munro
On Thu, Mar 11, 2021 at 2:32 PM Thomas Munro  wrote:
> On Thu, Mar 11, 2021 at 2:25 PM Tom Lane  wrote:
> > Trolling the net, I found a newer-looking version of the man page,
> > and behold it says
> >
> >In mainline kernel versions prior to 5.8, syncfs() will fail only
> >when passed a bad file descriptor (EBADF).  Since Linux 5.8,
> >syncfs() will also report an error if one or more inodes failed
> >to be written back since the last syncfs() call.
> >
> > So this means that in less-than-bleeding-edge kernels, syncfs can
> > only be regarded as a dangerous toy.  If we expose an option to use
> > it, there had better be large blinking warnings in the docs.
>
> Agreed.  Perhaps we could also try to do something programmatic about that.

Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs.  You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.

I would plan to extend that to include a third option as already
discussed in the other thread, maybe something like "wal" (= sync WAL
files and then do extra analysis of WAL data to sync only data
modified since checkpoint but not replayed), but that'd be material
for PG15.
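
For readers unfamiliar with syncfs(), here is a standalone sketch of the idea
in the commit message -- one syncfs() call per potentially distinct filesystem
under the data directory. The patch does this inside SyncDataDirectory() with
GUC control and proper error reporting, so treat this only as an illustration;
the tablespace path below is hypothetical.

#define _GNU_SOURCE             /* for syncfs() on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Flush one whole filesystem, identified by any file descriptor on it. */
static void
syncfs_path(const char *path)
{
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
    {
        perror(path);
        return;
    }
    if (syncfs(fd) < 0)         /* Linux-only system call */
        perror(path);
    close(fd);
}

int
main(void)
{
    /* The data directory itself (run from $PGDATA). */
    syncfs_path(".");

    /*
     * Each pg_tblspc entry is a symlink that may point to a different
     * filesystem, so each one gets its own syncfs() call; pg_wal needs the
     * same treatment when it is a symlink.  Directory iteration is omitted
     * here, and 16385 is a made-up tablespace OID.
     */
    syncfs_path("pg_tblspc/16385");

    return 0;
}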
From e1a14240a08b802b669c7a4d2a7cacc1004a7de4 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 27 Sep 2020 17:22:10 +1300
Subject: [PATCH v2] Optionally use syncfs() for SyncDataDirectory() on Linux.

Opening every file can be slow, as the first step before crash recovery
can begin.  Provide an alternative method, where we call syncfs() on
every possibly different filesystem under the data directory.  This
avoids faulting in inodes for potentially very many inodes.

The option can be controlled with a new GUC:

sync_after_crash={fsync,syncfs}

Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 configure |  2 +-
 configure.ac  |  1 +
 doc/src/sgml/config.sgml  | 28 +
 src/backend/storage/file/fd.c | 62 ++-
 src/backend/utils/misc/guc.c  | 17 +
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/pg_config.h.in|  3 +
 src/include/storage/fd.h  |  8 +++
 8 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 3fd4cecbeb..6a2051da9d 100755
--- a/configure
+++ b/configure
@@ -15239,7 +15239,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2f1585adc0..dd819ab2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strchrnul
 	strsignal
 	symlink
+	syncfs
 	sync_file_range
 	uselocale
 	wcstombs_l
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a218d78bef..f8bd82a729 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9679,6 +9679,34 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
   
  
 
+ 
+  sync_after_crash (enum)
+   
+sync_after_crash configuration parameter
+   
+  
+  
+   
+When set to fsync, which is the default,
+PostgreSQL will recursively open and fsync
+all files in the data directory before crash recovery begins.  This
+is intended to make sure that all WAL and data files are durably stored
+on disk before replaying changes.
+   
+   
+On Linux, syncfs may be used instead, to ask the
+operating system to flush the whole file system.  This may be a lot
+faster, because it doesn't need to open each file one by one.  On the
+other hand, it may be slower if the file system is shared by other
+applications that modify a lot of files, since those files will also
+be flushed to disk.  Furthermore, on version of Linux before 5.8, I/O
+errors encountered while writing data to disk may not be reported to
+   

A Case For Inlining Immediate Referential Integrity Checks

2021-03-14 Thread Corey Huinker
A Case For Inlining Immediate Referential Integrity Checks
--

The following is an overview of how Postgres currently implements
referential integrity, some problems with that architecture, attempted
solutions for those problems, and a suggestion of another possible solution.

Notes On Notation and Referential Integrity In General
--

All referential integrity is ultimately of this form:

R(X) => T(Y)

Where one referencing table R has a set of columns X that references a set
of columns Y which comprise a unique constraint on a target table T. Note
that the Y-set of columns is usually the primary key of T, but does not
have to be.

The basic referential integrity checks fall into two basic categories,
Insert and Delete, which can be checked Immediately following the
statement, or can be Deferred to the end of the transaction.

The Insert check is fairly straightforward. Any insert to R, or update of R
that modifies [1] any column in X, is checked to see if all of the X
columns are NOT NULL, and if so, a lookup is done on T to find a matching
row tuple of Y. If none is found, then an error is raised.

The Update check is more complicated, as it covers any UPDATE operation
that modifies [1] any column in Y, where all of the values of Y are NOT
NULL, as well as any DELETE operation where all of the columns of Y are NOT
NULL. For any Update check, the table R is scanned for any X tuples matching
the previous Y values, and for any matches found, an action is
taken. That action can be to fail the operation (NO ACTION, RESTRICT),
update the X values to fixed values (SET NULL, SET DEFAULT), or to delete
those rows in R (CASCADE).


Current Implementation
--

Currently, these operations are handled via per-row triggers. In our
general case, one trigger is placed on R for INSERT operations, and one
trigger is placed on T for DELETE operations, and an additional trigger is
placed on T for UPDATE operations that affect any column of Y.

These Insert trigger functions invoke the C function RI_FKey_check() [2].
The trigger is fired unconditionally, and the trigger itself determines if
there is a referential integrity constraint to be checked or not. Ultimately
this trigger invokes an SPI query of the form SELECT 1 FROM <target table>
WHERE (<key match conditions>) FOR KEY SHARE. This query is generally quite
straightforward to the planner, as it becomes either a scan of a single
unique index, or a partition search followed by a scan of a single unique
index. The operation succeeds if a row is found, and fails if it does not.

The Update trigger functions are implemented with a set of C functions
RI_[noaction|restrict|cascade|setnull|setdefault]_[upd|del]() [3]. These
functions each generate a variation of SPI query in one of the following
forms

 cascade: DELETE FROM <referencing table> WHERE <key match conditions>
 restrict/noaction: SELECT 1 FROM <referencing table> WHERE <key match conditions> FOR KEY SHARE
 setnull: UPDATE <referencing table> SET x1 = NULL, ... WHERE <key match conditions>
 setdefault: UPDATE <referencing table> SET x1 = DEFAULT, ... WHERE <key match conditions>

These triggers are either executed at statement time (Immediate) or are
queued for execution as a part of the transaction commit (Deferred).

Problems With The Current Implementation


The main problems with this architecture come down to visibility and
performance.

The foremost problem with this implementation is that these extra queries
are not visible to the end user in any way. It is possible to infer which
functions were executed by looking at the constraint definitions and
comparing pg_stat_user_tables or pg_stat_user_indexes before and after the
operation, but in general the time spent in these functions accrues to the
DML statement (Immediate) or COMMIT statement (Deferred) without any insight
into what took place. This is especially vexing in situations where an
operation as simple as "DELETE FROM highly_referenced_table WHERE id = 1"
hits the primary key index, but takes several seconds to run.

The performance of Insert operations is generally not too bad, in that the
query boils down to an Index Scan for a single row. The problem, however,
is that this query must be executed for every row inserted. The query
itself is only planned once, and that query plan is cached for later
re-use. That removes some of the query overhead, but also incurs a growing
cache of plans which can create memory pressure if the number of foreign
keys is large, and indeed this has become a problem for at least one
customer [4]. Some profiling of the RI check indicated that about half of
the time of the insert was spent in SPI functions that could be bypassed if
the C function called index_beginscan and index_rescan directly [5]. And
these indications bore out when Amit Langote wrote a patch [6] which finds
the designated index from the constraint (with some drilling through
partitions if need be) and then invokes the scan functions. This method
showed about a halving of the t
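
As a rough illustration of the index_beginscan()/index_rescan() approach
mentioned above: the sketch below checks whether a single-column key exists in
a unique index without going through SPI. It ignores row locking (FOR KEY
SHARE), snapshot selection and partition handling, which the actual patch [6]
has to deal with, and the function and its parameters are invented for this
sketch.

#include "postgres.h"

#include "access/genam.h"
#include "access/relscan.h"
#include "access/sdir.h"
#include "access/skey.h"
#include "access/stratnum.h"
#include "access/tableam.h"
#include "executor/tuptable.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/*
 * Does a row matching "key" exist in the unique index "idx" on "heap"?
 * eqproc is the btree equality support function for the key's type.
 */
static bool
ri_key_exists_sketch(Relation heap, Relation idx, Datum key, RegProcedure eqproc)
{
    ScanKeyData skey;
    IndexScanDesc scan;
    TupleTableSlot *slot;
    bool        found;

    ScanKeyInit(&skey, 1, BTEqualStrategyNumber, eqproc, key);

    scan = index_beginscan(heap, idx, GetActiveSnapshot(), 1, 0);
    slot = table_slot_create(heap, NULL);

    index_rescan(scan, &skey, 1, NULL, 0);
    found = index_getnext_slot(scan, ForwardScanDirection, slot);

    ExecDropSingleTupleTableSlot(slot);
    index_endscan(scan);

    return found;
}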

Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-14 Thread Pavel Stehule
Hi

so 13. 3. 2021 v 12:48 odesílatel Julien Rouhaud 
napsal:

> On Sat, Mar 13, 2021 at 12:10:29PM +0100, Pavel Stehule wrote:
> >
> > so 13. 3. 2021 v 9:53 odesílatel Julien Rouhaud 
> napsal:
> > >
> > > I don't think that it makes sense to have multiple occurences of this
> > > command,
> > > and we should simply error out if
> plpgsql_curr_compile->root_ns->itemno is
> > > PLPGSQL_LABEL_REPLACED.  If any case this should also be covered in the
> > > regression tests.
> > >
> >
> > I did it. Thank you for check
>
> Thanks, LGTM.
>
> > I wrote few sentences to documentation
>
> Great!
>
> I just had a few cosmetic comments:
>
> +  block can be changed by inserting special command at the start of
> the function
> [...]
> + arguments) is possible with option routine_label:
> [...]
> +errhint("The option \"routine_label\" can
> be used only once in rutine."),
> [...]
> +-- Check root namespace renaming (routine_label option)
>
> You're sometimes using "command" and "sometimes" option.  Should we always
> use
> the same term, maybe "command" as it's already used for #variable_conflict
> documentation?
>
> Also
>
> +  variables can be qualified with the function's name. The name of
> this outer
> +  block can be changed by inserting special command at the start of
> the function
> +  #routine_label new_name.
>
> It's missing a preposition before "special command".  Maybe
>
> +  variables can be qualified with the function's name. The name of
> this outer
> +  block can be changed by inserting a special command
> +  #routine_label new_name at the start of the
> function.
>
>
> + The function's argument can be qualified by function name:
>
> Should be "qualified with the function name"
>
> + Sometimes the function name is too long and can be more practical to
> use
> + some shorter label. An change of label of top namespace (with
> functions's
> + arguments) is possible with option routine_label:
>
> Few similar issues.  How about:
>
> Sometimes the function name is too long and *it* can be more practical to
> use
> some shorter label. *The top namespace label can be changed* (*along* with
> *the* functions' arguments) *using the option*
> routine_label:
>
> I'm not a native English speaker so the proposes changed may be wrong or
> not
> enough.
>

My English is good enough for taking beer everywhere in the world :) . It
is not good, but a lot of people who don't understand English understand
my simplified fork of the English language.

Thank you for check

updated patch attached

Pavel
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index 9242c54329..1af56eef86 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -292,7 +292,10 @@ $$ LANGUAGE plpgsql;
   special variables such as FOUND (see
   ).  The outer block is
   labeled with the function's name, meaning that parameters and special
-  variables can be qualified with the function's name.
+  variables can be qualified with the function's name. The name of this outer
+  block can be changed by inserting special command 
+  #routine_label new_name at the start of the function.
+  .
  
 
 
@@ -435,6 +438,31 @@ $$ LANGUAGE plpgsql;
  
 
 
+
+ The function's argument can be qualified with the function name:
+
+CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$
+BEGIN
+RETURN sales_tax.subtotal * 0.06;
+END;
+$$ LANGUAGE plpgsql;
+
+
+
+
+ Sometimes the function name is too long and it can be more practical to use
+ some shorter label. The top namespace label can be changed (along with
+ the functions' arguments) using the option routine_label:
+
+CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$
+#routine_label s
+BEGIN
+RETURN s.subtotal * 0.06;
+END;
+$$ LANGUAGE plpgsql;
+
+
+
  
   Some more examples:
 
diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index ce8d97447d..d32e050c32 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -378,6 +378,10 @@ do_compile(FunctionCallInfo fcinfo,
 	 */
 	plpgsql_ns_init();
 	plpgsql_ns_push(NameStr(procStruct->proname), PLPGSQL_LABEL_BLOCK);
+
+	/* save top ns for possibility to alter top label */
+	function->root_ns = plpgsql_ns_top();
+
 	plpgsql_DumpExecTree = false;
 	plpgsql_start_datums();
 
diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c
index 919b968826..9dc07f9e58 100644
--- a/src/pl/plpgsql/src/pl_funcs.c
+++ b/src/pl/plpgsql/src/pl_funcs.c
@@ -105,6 +105,40 @@ plpgsql_ns_additem(PLpgSQL_nsitem_type itemtype, int itemno, const char *name)
 	ns_top = nse;
 }
 
+/*
+ * Replace ns item of label type by creating new entry and redirect
+ * old entry to new one.
+ */
+void
+plpgsql_ns_replace_root_label(PLpgSQL_nsitem *nse,
+			  const char *name,
+			  int location)
+{
+	PLpgSQL_nsitem *new_nse;
+
+

Re: New IndexAM API controlling index vacuum strategies

2021-03-14 Thread Peter Geoghegan
On Sat, Mar 13, 2021 at 7:23 PM Peter Geoghegan  wrote:
> In other words, I am not worried about debt, exactly. Debt is normal
> in moderation. Healthy, even. I am worried about bankruptcy, perhaps
> following a rare and extreme event. It's okay to be imprecise, but all
> of the problems must be survivable. The important thing to me for a
> maintenance_work_mem threshold is that there is *some* limit. At the
> same time, it may totally be worth accepting 2 or 3 index scans during
> some eventual VACUUM operation if there are many more VACUUM
> operations that don't even touch the index -- that's a good deal!
> Also, it may actually be inherently necessary to accept a small risk
> of having a future VACUUM operation that does multiple scans of each
> index -- that is probably a necessary part of skipping index vacuuming
> each time.
>
> Think about the cost of index vacuuming (the amount of I/O and the
> duration of index vacuuming) as less and less memory is available for
> TIDs. It's non-linear. The cost explodes once we're past a certain
> point. The truly important thing is to "never get killed by the
> explosion".

I just remembered this blog post, which gives a nice high level
summary of my mental model for things like this:

https://jessitron.com/2021/01/18/when-costs-are-nonlinear-keep-it-small/

This patch should eliminate inefficient index vacuuming involving very
small "batch sizes" (i.e. a small number of TIDs/index tuples to
delete from indexes). At the same time, it should not allow the batch
size to get too large because that's also inefficient. Perhaps larger
batch sizes are not exactly inefficient -- maybe they're risky. Though
risky is actually kind of the same thing as inefficient, at least to
me.

So IMV what we want to do here is to recognize cases where "batch
size" is so small that index vacuuming couldn't possibly be efficient.
We don't need to truly understand how that might change over time in
each case -- this is relatively easy.

There is some margin for error here, even with this reduced-scope
version that just does the SKIP_VACUUM_PAGES_RATIO thing. The patch
can afford to make suboptimal decisions about the scheduling of index
vacuuming over time (relative to the current approach), provided the
additional cost is at least *tolerable* -- that way we are still very
likely to win in the aggregate, over time. However, the patch cannot
be allowed to create a new risk of significantly worse performance for
any one VACUUM operation.
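
For context, the SKIP_VACUUM_PAGES_RATIO mechanism mentioned above boils down
to a check of roughly the following shape. This is a simplified sketch only;
the constant's value and the variable names are assumptions, not the patch's
actual code:

#include <stdbool.h>

#define SKIP_VACUUM_PAGES_RATIO		0.01	/* assumed value, see the patch */

static bool
should_do_index_vacuuming(long pages_with_dead_items, long total_heap_pages)
{
	/*
	 * Only pay the (nonlinear, per-index-scan) cost of index vacuuming once
	 * the batch of work has grown past a small fraction of the table;
	 * otherwise leave the dead TIDs for a later VACUUM to deal with.
	 */
	return pages_with_dead_items >
		(long) (SKIP_VACUUM_PAGES_RATIO * total_heap_pages);
}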

--
Peter Geoghegan




Re: Transactions involving multiple postgres foreign servers, take 2

2021-03-14 Thread Ibrar Ahmed
On Thu, Feb 11, 2021 at 6:25 PM Masahiko Sawada 
wrote:

> On Fri, Feb 5, 2021 at 2:45 PM Masahiko Sawada 
> wrote:
> >
> > On Tue, Feb 2, 2021 at 5:18 PM Fujii Masao 
> wrote:
> > >
> > >
> > >
> > > On 2021/01/27 14:08, Masahiko Sawada wrote:
> > > > On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao
> > > >  wrote:
> > > >>
> > > >>
> > > >> You fixed some issues. But maybe you forgot to attach the latest
> patches?
> > > >
> > > > Yes, I've attached the updated patches.
> > >
> > > Thanks for updating the patch! I tried to review 0001 and 0002 as the
> self-contained change.
> > >
> > > + * An FDW that implements both commit and rollback APIs can request
> to register
> > > + * the foreign transaction by FdwXactRegisterXact() to participate it
> to a
> > > + * group of distributed tranasction.  The registered foreign
> transactions are
> > > + * identified by OIDs of server and user.
> > >
> > > I'm afraid that the combination of OIDs of server and user is not
> unique. IOW, more than one foreign transactions can have the same
> combination of OIDs of server and user. For example, the following two
> SELECT queries start the different foreign transactions but their user OID
> is the same. OID of user mapping should be used instead of OID of user?
> > >
> > >  CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw;
> > >  CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user
> 'postgres');
> > >  CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user
> 'postgres');
> > >  CREATE TABLE t(i int);
> > >  CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS
> (table_name 't');
> > >  BEGIN;
> > >  SELECT * FROM ft;
> > >  DROP USER MAPPING FOR postgres SERVER loopback ;
> > >  SELECT * FROM ft;
> > >  COMMIT;
> >
> > Good catch. I've considered using user mapping OID or a pair of user
> > mapping OID and server OID as a key of foreign transactions but I
> > think it also has a problem if an FDW caches the connection by pair of
> > server OID and user OID whereas the core identifies them by user
> > mapping OID. For instance, mysql_fdw manages connections by pair of
> > server OID and user OID.
> >
> > For example, let's consider the following execution:
> >
> > BEGIN;
> > SET ROLE user_A;
> > INSERT INTO ft1 VALUES (1);
> > SET ROLE user_B;
> > INSERT INTO ft1 VALUES (1);
> > COMMIT;
> >
> > Suppose that an FDW identifies the connections by {server OID, user
> > OID} and the core GTM identifies the transactions by user mapping OID,
> > and user_A and user_B use the public user mapping to connect server_X.
> > In the FDW, there are two connections identified by {user_A, sever_X}
> > and {user_B, server_X} respectively, and therefore opens two
> > transactions on each connection, while GTM has only one FdwXact entry
> > because the two connections refer to the same user mapping OID. As a
> > result, at the end of the transaction, GTM ends only one foreign
> > transaction, leaving another one.
> >
> > Using user mapping OID seems natural to me but I'm concerned that
> > changing role in the middle of a transaction is more likely to happen than
> > dropping the public user mapping, but I'm not sure. We would need to find
> > a better way.
>
> After more thought, I'm inclined to think it's better to identify
> foreign transactions by user mapping OID. The main reason is, I think
> FDWs that manages connection caches by pair of user OID and server OID
> potentially has a problem with the scenario Fujii-san mentioned. If an
> FDW has to use another user mapping (i.g., connection information) due
> to the currently used user mapping being removed, it would have to
> disconnect the previous connection because it has to use the same
> connection cache. But at that time it doesn't know whether the transaction
> will be committed or aborted.
>
> Also, such FDW has the same problem that postgres_fdw used to have; a
> backend establishes multiple connections with the same connection
> information if multiple local users use the public user mapping. Even
> from the perspective of foreign transaction management, it more makes
> sense that foreign transactions correspond to the connections to
> foreign servers, not to the local connection information.
>
> I can see that some FDW implementations such as mysql_fdw and
> firebird_fdw identify connections by pair of server OID and user OID
> but I think this is because they consulted to old postgres_fdw code. I
> suspect that there is no use case where FDW needs to identify
> connections in that way. If the core GTM identifies them by user
> mapping OID, we could enforce those FDWs to change their way but I
> think that change would be the right improvement.
>
> Regards,
>
> --
> Masahiko Sawada
> EDB:  https://www.enterprisedb.com/
>
>
>
The regression tests are failing; can you please take a look?

https://cirrus-ci.com/task/5522445932167168


t/080_pg_isready.pl ... ok
# Failed test 'parallel reindexdb for system with --concurrently skips
catalog

Re: pgbench - add pseudo-random permutation function

2021-03-14 Thread Fabien COELHO


Hello Alvaro,


+   /*-
+* Apply 4 rounds of bijective transformations using key updated
+* at each stage:
+*
+* (1) whiten: partial xors on overlapping power-of-2 subsets
+* for instance with v in 0 .. 14 (i.e. with size == 15):
+* if v is in 0 .. 7 do v = (v ^ k) % 8
+* if v is in 7 .. 14 do v = 14 - ((14-v) ^ k) % 8
+* note that because of the overlap (here 7), v may be changed 
twice.
+* this transformation if bijective because the condition to apply 
it
+* is still true after applying it, and xor itself is bijective on a
+* power-of-2 size.
+*
+* (2) scatter: linear modulo
+* v = (v * p + k) % size
+* this transformation is bijective is p & size are prime, which is
+* ensured in the code by the while loop which discards primes when
+* size is a multiple of it.
+*
+*/


My main question on this now is, do you have a scholar reference for
this algorithm?


Nope, otherwise I would have put a reference. I'm a scholar though, if 
it helps:-)


I could not find any algorithm that fitted the bill. The usual approach 
(e.g. in benchmarking designs) is to use some hash function and assume that it 
is not a permutation, too bad.


Basically the algorithm mimics a few rounds of cryptographic encryption 
adapted to any size and simple operators, whereas encryption functions are 
restricted to power-of-two blocks (eg the Feistel network). The structure 
is the same as AES: AddRoundKey is the xor-ing stage (adapted to non 
power-of-two sizes in whiten above), MixColumns does the scattering, and 
for key expansion I used Donald Knuth's generator. Basically one could say 
that permute is an inexpensive and insecure AES :-)


We could add a reference to AES for the structure of the algorithm itself, 
but otherwise I just iterated designs till I was satisfied with the 
result (again: inexpensive and constant cost, any size, permutation…).
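
For illustration, a toy standalone sketch of one round of that structure
(whiten then scatter). The key schedule, number of rounds, prime selection and
overflow handling are all simplified here and differ from the actual pgbench
code:

#include <stdint.h>

static uint64_t
permute_round(uint64_t v, uint64_t size, uint64_t key, uint64_t prime)
{
	uint64_t	m = 1;

	/* largest power of two not exceeding size */
	while (m <= size / 2)
		m *= 2;

	/* whiten: xor within two (possibly overlapping) power-of-2 subsets */
	if (v < m)					/* subset [0, m) */
		v = (v ^ key) % m;
	if (v >= size - m)			/* subset [size - m, size) */
		v = (size - 1) - (((size - 1 - v) ^ key) % m);

	/* scatter: bijective as long as gcd(prime, size) == 1 */
	v = (v * prime + key) % size;	/* 64-bit overflow ignored in this toy */

	return v;
}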


--
Fabien.

Re: More time spending with "delete pending"

2021-03-14 Thread Alexander Lakhin
Hello hackers,

22.11.2020 18:59, Alexander Lakhin wrote:
> 19.11.2020 01:28, Tom Lane wrote:
>
>> Hmm, that is an interesting question isn't it.  Here's a search going
>> back a full year.  There are a few in v12 --- interestingly, all on
>> the statistics file, none from pg_ctl --- and none in v13.  Of course,
>> v13 has only existed as a separate branch since 2020-06-07.
>>
>> There's also a buildfarm test-coverage artifact involved.  The bulk
>> of the HEAD reports are from dory and walleye, neither of which are
>> building v13 as yet, because of unclear problems [1].  I think those
>> two animals build much more frequently than our other Windows animals,
>> too, so the fact that they have more may be just because of that and
>> not because they're somehow more susceptible.  Because of that, I'm not
>> sure that we have enough evidence to say that v13 is better than HEAD.
>> If there is some new bug, it's post-v12, but maybe not post-v13.  But
>> v12 is evidently not perfect either.
> The failures on HEAD (on dory and walleye) need another investigation
> (maybe they are caused by modified stat()...).
To revive this topic and make sure that the issue still persists, I've
searched through the buildfarm logs (for HEAD and REL_13_STABLE) and
collected the following failures of the misc_functions test with the
"Permission denied" error:

bowerbird
HEAD (13 runs last week, 500 total, 18 failed)
-

REL_13_STABLE (4 runs last week, 237 total, 8 failed)
-

drongo
HEAD (25 runs last week, 500 total, 24 failed)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2020-12-27%2019%3A01%3A50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2021-01-18%2001%3A01%3A55

REL_13_STABLE (5 runs last week, 255 total, 2 failed)
-

hamerkop
HEAD - (8 runs last week, 501 total, 20 failed)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2021-02-22%2010%3A20%3A00

fairywren
HEAD - (27 runs last week, 500 total, 28 failed)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-02-22%2012%3A03%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-01-28%2000%3A04%3A50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-01-19%2023%3A29%3A38
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-01-17%2018%3A08%3A11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-11-30%2009%3A57%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-11-19%2006%3A02%3A22
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-11-18%2018%3A02%3A04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-09-16%2019%3A48%3A21

REL_13_STABLE (5 runs last week, 302 total, 5 failed)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-02-26%2003%3A03%3A42
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-10-20%2007%3A50%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-09-10%2006%3A02%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-09-04%2019%3A58%3A39

woodlouse
HEAD (33 runs last week, 500 total, 5 failed)
-

REL_13_STABLE (4 runs last week, 325 total, 1 failed)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=woodlouse&dt=2021-01-20%2011%3A42%3A30

dory
HEAD (52 runs last week, 500 total, 25 failed)
-

whelk
HEAD (31 runs last week, 500 total, 11 failed)
-

REL_13_STABLE (7 runs last week, 224 total, 1 failed)
- 

(I've not considered lorikeet, jacana, and walleye as they are using
alternative compilers.)
So on some machines there wasn't a single failure (probably it depends
on hardware), but where it fails, such failures account for a
significant share of all failures.
I believe that the patch attached to [1] should fix this issue. The
patch still applies to master and makes the demotest (attached to [2])
pass. Also I've prepared a trivial patch that makes pgwin32_open() use
the original stat() function (as in the proposed change for _pgstat64()).

[1] https://www.postgresql.org/message-id/979f4ee6-9960-e37f-f3ee-f458e8b0%40gmail.com
[2] https://www.postgresql.org/message-id/c3427edf-d7c0-ff57-90f6-b5de3bb62709%40gmail.com

Best regards,
Alexander
diff --git a/src/port/open.c b/src/port/open.c
index 14c6debba9..4bd11ca9d9 100644
--- a/src/port/open.c
+++ b/src/port/open.c
@@ -157,9 +157,9 @@ pgwin32_open(const char *fileName, int fileFlags,...)
 		{
 			if (loops < 10)
 			{
-struct stat st;
+struct microsoft_native_stat st;
 
-if (stat(fileName, &st) != 0)
+if (microsoft_native_stat(fileName, &st) != 0)
 {
 	pg_usleep(10);
 	loops++;


Re: REINDEX backend filtering

2021-03-14 Thread Julien Rouhaud
On Sun, Mar 14, 2021 at 08:54:11PM +0900, Michael Paquier wrote:
> On Sun, Mar 14, 2021 at 04:10:07PM +0800, Julien Rouhaud wrote:
> 
> +   booloutdated_filter = false;
> Wouldn't it be better to rename that "outdated" instead for
> consistency with the other options?

I agree.

> In ReindexRelationConcurrently(), there is no filtering done for the
> index themselves.  The operation is only done on the list of indexes
> fetched from the parent relation.  Why?  This means that a REINDEX
> (OUTDATED) INDEX would actually rebuild an index even if this index
> has no out-of-date collations, like a catalog.  I think that's
> confusing.
> 
> Same comment for the non-concurrent case, as of the code paths of
> reindex_index().

Yes, I'm not sure what we should do in that case.  I thought I put a comment
about that but it apparently disappeared during some rewrite.

Is there really a use case for reindexing a specific index and at the same time
asking for possibly ignoring it?  I think we should just forbid REINDEX
(OUTDATED) INDEX.  What do you think?

> Would it be better to inform the user which indexes are getting
> skipped in the verbose output if REINDEXOPT_VERBOSE is set?

I was thinking that users would be more interested in the list of indexes being
processed rather than the full list of indexes and a mention of whether it was
processed or not.  I can change that if you prefer.

> +   
> +Check if the specified index has any outdated dependency. For now 
> only
> +dependency on collations are supported.
> +   
> [...]
> +OUTDATED
> +
> + 
> +  This option can be used to filter the list of indexes to rebuild and 
> only
> +  process indexes that have outdated dependencies.  Fow now, the only
> +  handle dependency is for the collation provider version.
> + 
> Do we really need to be this specific in this part of the
> documentation with collations?

I think it's important to document what this option really does, and I don't
see a better place to document it.

> The last sentence of this paragraph
> sounds weird.  Don't you mean instead to write "the only dependency
> handled currently is the collation provider version"?

Indeed, I'll fix!

> +\set VERBOSITY terse \\ -- suppress machine-dependent details
> +-- no suitable index should be found
> +REINDEX (OUTDATED) TABLE reindex_coll;
> What are those details?

That's just the same comment as the previous occurrence in the file; I kept it
for consistency.

> And wouldn't it be more stable to just check
> after the relfilenode of the indexes instead?

Agreed, I'll add additional tests for that.

> " ORDER BY sum(ci.relpages)"
> Schema qualification here, twice.

Well, this isn't actually mandatory, per comment at the top:

/*
 * The queries here are using a safe search_path, so there's no need to
 * fully qualify everything.
 */

But I think it's a better style to fully qualify objects, so I'll fix.

> +   rel = try_relation_open(indexOid, AccessShareLock);
> +
> +   if (rel == NULL)
> +   PG_RETURN_NULL();
> Let's introduce a try_index_open() here.

Good idea!

> What's the point in having both index_has_outdated_collation() and
> index_has_outdated_collation()?

Did you mean index_has_outdated_collation() and
index_has_outdated_dependency()?  It's just to keep things separated, mostly
for future improvements on that infrastructure.  I can get rid of that function
and put back the code in index_has_outdated_dependency() if that's overkill.

> It seems to me that 0001 should be split into two patches:
> - One for the backend OUTDATED option.
> - One for pg_index_has_outdated_dependency(), which only makes sense
> in-core once reindexdb is introduced.

I thought it would be better to add the backend part in a single commit, and
then build the client part on top of that in a different commit.  I can
rearrange things if you want, but in that case should
index_has_outdated_dependency() be in a different patch as you mention, or
simply merged with 0002 (the --outdated option for reindexdb)?




Re: pgbench - add pseudo-random permutation function

2021-03-14 Thread Alvaro Herrera
On 2021-Mar-14, Fabien COELHO wrote:

> + /*-
> +  * Apply 4 rounds of bijective transformations using key updated
> +  * at each stage:
> +  *
> +  * (1) whiten: partial xors on overlapping power-of-2 subsets
> +  * for instance with v in 0 .. 14 (i.e. with size == 15):
> +  * if v is in 0 .. 7 do v = (v ^ k) % 8
> +  * if v is in 7 .. 14 do v = 14 - ((14-v) ^ k) % 8
> +  * note that because of the overlap (here 7), v may be changed 
> twice.
> +  * this transformation if bijective because the condition to apply 
> it
> +  * is still true after applying it, and xor itself is bijective on a
> +  * power-of-2 size.
> +  *
> +  * (2) scatter: linear modulo
> +  * v = (v * p + k) % size
> +  * this transformation is bijective is p & size are prime, which is
> +  * ensured in the code by the while loop which discards primes when
> +  * size is a multiple of it.
> +  *
> +  */

My main question on this now is, do you have a scholar reference for
this algorithm?

-- 
Álvaro Herrera   Valdivia, Chile
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."  (Brian Kernighan)




Re: Using COPY FREEZE in pgbench

2021-03-14 Thread Tatsuo Ishii
> Ok. ISTM that the option should be triggered as soon as it is
> available, but you do as you wish.

Can you elaborate why you think that using COPY FREEZE before 14 is
beneficial? Or do you want to standardize to use COPY FREEZE?

> I'm unsure how this could happen ever. The benefit of not caring is
> less lines of codes in pgbench and a later call will catch the issue
> anyway, if libpq is corrupted.

I have looked at the code of PQprotocolVersion. The only case in which
it returns 0 is when there's no connection. Yes, you are right. Once the
connection has been successfully established, there's no chance it
fails. So I agree with you.
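
For concreteness, the kind of gate being discussed could look roughly like
this. It is a sketch only; the helper name, table handling and the version
threshold are assumptions, not the actual pgbench code:

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

/* start the accounts COPY, using FREEZE only where it actually helps */
static void
start_accounts_copy(PGconn *con)
{
	const char *copycmd =
		PQserverVersion(con) >= 140000 ?
		"copy pgbench_accounts from stdin with (freeze on)" :
		"copy pgbench_accounts from stdin";
	PGresult   *res = PQexec(con, copycmd);

	if (PQresultStatus(res) != PGRES_COPY_IN)
	{
		fprintf(stderr, "unexpected COPY response: %s", PQerrorMessage(con));
		exit(1);
	}
	PQclear(res);
}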

>>> Question: should there be a word about copy using the freeze option?
>>> I'd say yes, in the section describing "g".
>>
>> Can you elaborate about "section describing "g"? I am not sure what
>> you mean.
> 
> The "g" item in the section describing initialization steps
> (i.e. option -I). I'd suggest just to replace "COPY" with "COPY
> FREEZE" in the sentence.

Ok. The section needs to be modified.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Fallback table AM for relkinds without storage

2021-03-14 Thread Michael Paquier
On Wed, Feb 24, 2021 at 11:51:36AM +0900, Michael Paquier wrote:
> For the file name, using something like pseudo_handler.c or similar
> would be fine, I guess.  However, if we go down the path of one AM per
> relkind for the slot callback, then why not just calling the AMs
> foreign_table_am, view_am and part_table_am?  This could be coupled
> with sanity checks to make sure that AMs assigned to those relations
> are the expected ones.

I am still not quite sure what needs to be done here and this needs
more thoughts, so this has been marked as returned with feedback for
now.  Instead of pushing forward with this patch, I'll just spend more
cycles on stuff that has more chances to make it into 14.
--
Michael




Re: REINDEX backend filtering

2021-03-14 Thread Michael Paquier
On Sun, Mar 14, 2021 at 04:10:07PM +0800, Julien Rouhaud wrote:
> v6 attached, rebase only due to conflict with recent commit.

I have read through the patch.

+   booloutdated_filter = false;
Wouldn't it be better to rename that "outdated" instead for
consistency with the other options?

In ReindexRelationConcurrently(), there is no filtering done for the
index themselves.  The operation is only done on the list of indexes
fetched from the parent relation.  Why?  This means that a REINDEX
(OUTDATED) INDEX would actually rebuild an index even if this index
has no out-of-date collations, like a catalog.  I think that's
confusing.

Same comment for the non-concurrent case, as of the code paths of
reindex_index().

Would it be better to inform the user which indexes are getting
skipped in the verbose output if REINDEXOPT_VERBOSE is set?

+   
+Check if the specified index has any outdated dependency. For now only
+dependency on collations are supported.
+   
[...]
+OUTDATED
+
+ 
+  This option can be used to filter the list of indexes to rebuild and only
+  process indexes that have outdated dependencies.  Fow now, the only
+  handle dependency is for the collation provider version.
+ 
Do we really need to be this specific in this part of the
documentation with collations?  The last sentence of this paragraph
sounds weird.  Don't you mean instead to write "the only dependency
handled currently is the collation provider version"?

+\set VERBOSITY terse \\ -- suppress machine-dependent details
+-- no suitable index should be found
+REINDEX (OUTDATED) TABLE reindex_coll;
What are those details?  And wouldn't it be more stable to just check
after the relfilenode of the indexes instead?

" ORDER BY sum(ci.relpages)"
Schema qualification here, twice.

+   printf(_("  --outdated   only process indexes
having outdated depencies\n"));
s/depencies/dependencies/.

+   rel = try_relation_open(indexOid, AccessShareLock);
+
+   if (rel == NULL)
+   PG_RETURN_NULL();
Let's introduce a try_index_open() here.

What's the point in having both index_has_outdated_collation() and
index_has_outdated_collation()?

It seems to me that 0001 should be split into two patches:
- One for the backend OUTDATED option.
- One for pg_index_has_outdated_dependency(), which only makes sense
in-core once reindexdb is introduced.
--
Michael




Re: pgbench - add pseudo-random permutation function

2021-03-14 Thread Fabien COELHO



See attached v22.


v23:
 - replaces remaining occurences of "pr_perm" in the documentation
 - removes duplicated includes

--
Fabien.
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 299d93b241..9f49a6a78d 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1057,7 +1057,7 @@ pgbench  options  d
 
   
 default_seed 
-   seed used in hash functions by default
+   seed used in hash and pseudo-random permutation functions by default
   
 
   
@@ -1842,6 +1842,22 @@ SELECT 4 AS four \; SELECT 5 AS five \aset

   
 
+  
+   
+permute ( i, size [, seed ] )
+integer
+   
+   
+Permuted value of i, in [0,size).
+This is the new position of i in a pseudo-random rearrangement of
+size first integers parameterized by seed.
+   
+   
+permute(0, 4)
+an integer between 0 and 3
+   
+  
+
   

 pi ()
@@ -1960,6 +1976,20 @@ SELECT 4 AS four \; SELECT 5 AS five \aset
 shape of the distribution.

 
+   
+
+  When designing a benchmark which selects rows non-uniformly, be aware
+  that using pgbench non-uniform random functions
+  directly leads to unduly correlations: rows with close ids come out with
+  similar probability, skewing performance measures because they also reside
+  in close pages.
+
+
+  You must use permute to shuffle the selected ids, or
+  some other additional step with similar effect, to avoid these skewing impacts.
+
+   
+

 
  
@@ -2054,24 +2084,54 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
 hash_fnv1a accept an input value and an optional seed parameter.
 In case the seed isn't provided the value of :default_seed
 is used, which is initialized randomly unless set by the command-line
--D option. Hash functions can be used to scatter the
-distribution of random functions such as random_zipfian or
-random_exponential. For instance, the following pgbench
-script simulates possible real world workload typical for social media and
+-D option.
+  
+
+  
+Function permute implements a parametric pseudo-random permutation
+on an arbitrary size.
+It allows to shuffle output of non uniform random functions so that 
+values drawn more often are not trivially correlated.
+It permutes integers in [0, size) using a seed by applying rounds of
+simple invertible functions, similarly to an encryption function,
+although beware that it is not at all cryptographically secure.
+Compared to hash functions discussed above, the function
+ensures that a perfect permutation is applied: there are neither collisions
+nor holes in the output values.
+Values outside the interval are interpreted modulo the size.
+The function raises an error if size is not positive.
+If no seed is provided, :default_seed is used.
+
+For instance, the following pgbench script
+simulates possible real world workload typical for social media and
 blogging platforms where few accounts generate excessive load:
 
 
-\set r random_zipfian(0, 1, 1.07)
-\set k abs(hash(:r)) % 100
+\set size 100
+\set r random_zipfian(1, :size, 1.07)
+\set k 1 + permute(:r, :size)
 
 
 In some cases several distinct distributions are needed which don't correlate
 with each other and this is when implicit seed parameter comes in handy:
 
 
-\set k1 abs(hash(:r, :default_seed + 123)) % 100
-\set k2 abs(hash(:r, :default_seed + 321)) % 100
+\set k1 1 + permute(:r, :size, :default_seed + 123)
+\set k2 1 + permute(:r, :size, :default_seed + 321)
 
+
+An similar behavior can also be approximated with
+hash:
+
+
+\set size 100
+\set r random_zipfian(1, 100 * :size, 1.07)
+\set k 1 + abs(hash(:r)) % :size
+
+
+As this hash-modulo version generates collisions, some
+id would not be reachable and others would come more often
+than the target distribution.
   
 
   
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index 4d529ea550..73514a0a47 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -19,6 +19,7 @@
 #define PGBENCH_NARGS_VARIABLE	(-1)
 #define PGBENCH_NARGS_CASE		(-2)
 #define PGBENCH_NARGS_HASH		(-3)
+#define PGBENCH_NARGS_PERMUTE	(-4)
 
 PgBenchExpr *expr_parse_result;
 
@@ -370,6 +371,9 @@ static const struct
 	{
 		"hash_fnv1a", PGBENCH_NARGS_HASH, PGBENCH_HASH_FNV1A
 	},
+	{
+		"permute", PGBENCH_NARGS_PERMUTE, PGBENCH_PERMUTE
+	},
 	/* keep as last array element */
 	{
 		NULL, 0, 0
@@ -482,6 +486,19 @@ make_func(yyscan_t yyscanner, int fnumber, PgBenchExprList *args)
 			}
 			break;
 
+		/* pseudo-random permutation function with optional seed argument */
+		case PGBENCH_NARGS_PERMUTE:
+			if (len < 2 || len > 3)
+expr_yyerror_more(yyscanner, "unexpected number of arguments"

Re: psql - add SHOW_ALL_RESULTS option

2021-03-14 Thread Fabien COELHO


Hello Peter,


This reply was two months ago, and nothing has happened, so I have
marked the patch as RwF.


Given the ongoing work on returning multiple result sets from stored 
procedures[0], I went to dust off this patch.


Based on the feedback, I put back the titular SHOW_ALL_RESULTS option, 
but set the default to on.  I fixed the test failure in 
013_crash_restart.pl.  I also trimmed back the test changes a bit so 
that the resulting test output changes are visible better.  (We could 
make those stylistic changes separately in another patch.)  I'll put 
this back into the commitfest for another look.


Thanks a lot for the fixes and pushing it forward!

My 0.02€: I tested this updated version and do not have any comments on
it. From my point of view it could be committed. I would not
bother to separate the test style adjustments.


--
Fabien.

Re: PITR promote bug: Checkpointer writes to older timeline

2021-03-14 Thread Michael Paquier
On Thu, Mar 04, 2021 at 05:10:36PM +0900, Kyotaro Horiguchi wrote:
> read_local_xlog_page is *designed* to maintain ThisTimeLineID.
> Currently it doesn't seem utilized but I think it's sufficiently
> reasonable that the function maintains ThisTimeLineID.

I don't quite follow this line of thought.  ThisTimeLineID is
designed to remain 0 while recovery is running in most processes
(at the close exception of a WAL sender with a cascading setup,
physical or logical, of course), so why is there any business for
read_local_xlog_page() to touch this field at all while in recovery to
begin with?

I equally find confusing that XLogReadDetermineTimeline() relies on a
specific value of ThisTimeLineID in its own logic, while it clearly
states that all its callers have to read the current active TLI
beforehand.  So I think that the existing logic is pretty weak, and
that resetting the field is an incorrect approach?  It seems to me
that we had better not change ThisTimeLineID actively while in
recovery in this code path and just let others do the job, like
RecoveryInProgress() once recovery finishes, or
GetStandbyFlushRecPtr() for a WAL sender.  And finally, we should
store the current TLI used for replay in a separate variable that gets
passed down to the XLogReadDetermineTimeline() as argument.

While going through it, I have simplified a bit the proposed TAP tests
(thanks for replacing the sleep() call, Soumyadeep.  This would have
made the test slower for nothing on fast machines, and it would cause
failures on very slow machines).

The attached fixes the original issue for me, keeping all the records
in their correct timeline.  And I have not been able to break
cascading setups.  If it happens that such cases actually break, we
have holes in our existing test coverage that should be improved.  I
cannot see anything fancy missing on this side, though.

Any thoughts?
--
Michael
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 9ac602b674..34d6286d75 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -56,7 +56,9 @@ extern void wal_segment_open(XLogReaderState *state,
 extern void wal_segment_close(XLogReaderState *state);
 
 extern void XLogReadDetermineTimeline(XLogReaderState *state,
-	  XLogRecPtr wantPage, uint32 wantLength);
+	  XLogRecPtr wantPage,
+	  uint32 wantLength,
+	  TimeLineID readtli);
 
 extern void WALReadRaiseError(WALReadError *errinfo);
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f46..9695653bc2 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -658,6 +658,13 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  * wantLength to the amount of the page that will be read, up to
  * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ.
  *
+ * readtli is the current timeline as found by the caller of this routine.
+ * When not in recovery, this should be ThisTimeLineID.  As ThisTimeLineID
+ * remains set to 0 for most processes while in recovery, the caller ought
+ * to provide the timeline number given as a result of GetXLogReplayRecPtr()
+ * instead (for a WAL sender this would actually be ThisTimeLineID as the
+ * field gets updated in a cascading WAL sender).
+ *
  * We switch to an xlog segment from the new timeline eagerly when on a
  * historical timeline, as soon as we reach the start of the xlog segment
  * containing the timeline switch.  The server copied the segment to the new
@@ -679,12 +686,13 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
  *
  * The caller must also make sure it doesn't read past the current replay
  * position (using GetXLogReplayRecPtr) if executing in recovery, so it
- * doesn't fail to notice that the current timeline became historical. The
- * caller must also update ThisTimeLineID with the result of
- * GetXLogReplayRecPtr and must check RecoveryInProgress().
+ * doesn't fail to notice that the current timeline became historical.
  */
 void
-XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength)
+XLogReadDetermineTimeline(XLogReaderState *state,
+		  XLogRecPtr wantPage,
+		  uint32 wantLength,
+		  TimeLineID readtli)
 {
 	const XLogRecPtr lastReadPage = (state->seg.ws_segno *
 	 state->segcxt.ws_segsize + state->segoff);
@@ -712,12 +720,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa
 	 * just carry on. (Seeking backwards requires a check to make sure the
 	 * older page isn't on a prior timeline).
 	 *
-	 * ThisTimeLineID might've become historical since we last looked, but the
+	 * readtli might've become historical since we last looked, but the
 	 * caller is required not to read past the flush limit it saw at the time
 	 * it looked up the timeline. There's nothing we can do about it if
 	 * StartupXLOG() rename

Re: Using COPY FREEZE in pgbench

2021-03-14 Thread Fabien COELHO



Hello,


After giving it some thought, ISTM that there could also be a
performance improvement with copy freeze from older version, so I'd
suggest to add it after 9.3 where the option was added, i.e. 90300.


You misunderstand about COPY FREEZE. Pre-13 COPY FREEZE does not
contribute a performance improvement. See discussions for more details.
https://www.postgresql.org/message-id/camku%3d1w3osjj2fneelhhnrlxfzitdgp9fphee08nt2fqfmz...@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg%40mail.gmail.com


Ok. ISTM that the option should be triggered as soon as it is available, 
but you do as you wish.



Also, I do not think it is worth to fail on a 0 pqserverversion, just
keep the number and consider it a very old version?


If pqserverversion fails, then the following PQ calls are likely to fail
too. What's the benefit of continuing after pqserverversion returns 0?


I'm unsure how this could happen ever. The benefit of not caring is less 
lines of codes in pgbench and a later call will catch the issue anyway, if 
libpq is corrupted.



Question: should there be a word about copy using the freeze option?
I'd say yes, in the section describing "g".


Can you elaborate about "section describing "g"? I am not sure what
you mean.


The "g" item in the section describing initialization steps (i.e. option 
-I). I'd suggest just to replace "COPY" with "COPY FREEZE" in the 
sentence.


--
Fabien.




Re: REINDEX backend filtering

2021-03-14 Thread Julien Rouhaud
On Wed, Mar 03, 2021 at 01:56:59PM +0800, Julien Rouhaud wrote:
> 
> Please find attached v5 which address all previous comments:
> 
> - consistently use "outdated"
> - use REINDEX (OUTDATED) grammar (with a new unreserved OUTDATED keyword)
> - new --outdated option to reindexdb
> - expose a new "pg_index_has_outdated_dependency(regclass)" SQL function
> - use that function in reindexdb --outdated to sort tables by total
>   indexes-to-be-processed size

v6 attached, rebase only due to conflict with recent commit.
>From d2e05e6f64c88b0d5074b9963586bf6999276762 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Thu, 3 Dec 2020 15:54:42 +0800
Subject: [PATCH v6 1/2] Add a new OUTDATED filtering facility for REINDEX
 command.

OUTDATED is added a new unreserved keyword.

When used, REINDEX will only process indexes that have an outdated dependency.
For now, only dependency on collations are supported but we'll likely support
other kind of dependency in the future.

Also add a new pg_index_has_outdated_dependency(regclass) SQL function, so
client code can filter such indexes if needed.  This function will also be used
in a following commit to teach reindexdb to use this new OUTDATED option and
order the tables by the amount of work that will actually be done.

Catversion (should be) bumped.

Author: Julien Rouhaud 
Reviewed-by:
Discussion: https://postgr.es/m/20201203093143.GA64934%40nol
---
 doc/src/sgml/func.sgml |  27 --
 doc/src/sgml/ref/reindex.sgml  |  12 +++
 src/backend/catalog/index.c| 107 -
 src/backend/commands/indexcmds.c   |  12 ++-
 src/backend/parser/gram.y  |   4 +-
 src/backend/utils/cache/relcache.c |  40 
 src/bin/psql/tab-complete.c|   2 +-
 src/include/catalog/index.h|   3 +
 src/include/catalog/pg_proc.dat|   4 +
 src/include/parser/kwlist.h|   1 +
 src/include/utils/relcache.h   |   1 +
 src/test/regress/expected/create_index.out |  10 ++
 src/test/regress/sql/create_index.sql  |  10 ++
 13 files changed, 221 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 9492a3c6b9..0eda6678ac 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26448,12 +26448,13 @@ SELECT pg_size_pretty(sum(pg_relation_size(relid))) AS total_size
 

  shows the functions
-available for index maintenance tasks.  (Note that these maintenance
-tasks are normally done automatically by autovacuum; use of these
-functions is only required in special cases.)
-These functions cannot be executed during recovery.
-Use of these functions is restricted to superusers and the owner
-of the given index.
+available for index maintenance tasks.  (Note that the maintenance
+tasks performing actions on indexes are normally done automatically by
+autovacuum; use of these functions is only required in special cases.)
+The functions performing actions on indexes cannot be executed during
+recovery.
+Use of the functions performing actions on indexes is restricted to
+superusers and the owner of the given index.

 

@@ -26538,6 +26539,20 @@ SELECT pg_size_pretty(sum(pg_relation_size(relid))) AS total_size
 option.

   
+
+  
+   
+
+ pg_index_has_outdated_dependency
+
+pg_index_has_outdated_dependency ( index regclass )
+boolean
+   
+   
+Check if the specified index has any outdated dependency.  For now only
+dependency on collations are supported.
+   
+  
  
 

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index ff4dba8c36..aa66e6461f 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -26,6 +26,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 where option can be one of:
 
 CONCURRENTLY [ boolean ]
+OUTDATED [ boolean ]
 TABLESPACE new_tablespace
 VERBOSE [ boolean ]
 
@@ -188,6 +189,17 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+OUTDATED
+
+ 
+  This option can be used to filter the list of indexes to rebuild and only
+  process indexes that have outdated dependencies.  Fow now, the only
+  handle dependency is for the collation provider version.
+ 
+
+   
+

 TABLESPACE
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4ef61b5efd..571feac5db 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -100,6 +100,12 @@ typedef struct
 	Oid			pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
 } SerializedReindexState;
 
+typedef struct
+{
+	Oid relid;	/* targetr index oid */
+	bool outdated;	/* depends on at least on deprected collation? */
+} IndexHasOutdatedColl;
+
 /* non-export function prototypes */
 static bool relationHasPrimaryKey(

Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-14 Thread Julien Rouhaud
Recent conflict, thanks to cfbot.  v18 attached.
>From fa94eba58ee0ca098cfde0d17de72dc230ee471c Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Wed, 14 Oct 2020 02:11:37 +0800
Subject: [PATCH v18 1/3] Move pg_stat_statements query jumbling to core.

A new compute_queryid GUC is also added, to control whether the queryid should
be computed.  It's now possible to disable core queryid computation and use
pg_stat_statements with a different algorithm to compute the queryid by using
third-party module.

Author: Julien Rouhaud
Reviewed-by:
Discussion: https://postgr.es/m/CA+8PKvQnMfOE-c3YLRwxOsCYXQDyP8VXs6CDtMZp1V4=d4l...@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   | 805 +
 .../pg_stat_statements.conf   |   1 +
 doc/src/sgml/config.sgml  |  18 +
 src/backend/parser/analyze.c  |  14 +-
 src/backend/tcop/postgres.c   |   6 +-
 src/backend/utils/misc/Makefile   |   1 +
 src/backend/utils/misc/guc.c  |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/backend/utils/misc/queryjumble.c  | 834 ++
 src/include/parser/analyze.h  |   4 +-
 src/include/utils/guc.h   |   1 +
 src/include/utils/queryjumble.h   |  58 ++
 12 files changed, 969 insertions(+), 784 deletions(-)
 create mode 100644 src/backend/utils/misc/queryjumble.c
 create mode 100644 src/include/utils/queryjumble.h

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 62cccbfa44..99bc7184cb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -8,24 +8,9 @@
  * a shared hashtable.  (We track only as many distinct queries as will fit
  * in the designated amount of shared memory.)
  *
- * As of Postgres 9.2, this module normalizes query entries.  Normalization
- * is a process whereby similar queries, typically differing only in their
- * constants (though the exact rules are somewhat more subtle than that) are
- * recognized as equivalent, and are tracked as a single entry.  This is
- * particularly useful for non-prepared queries.
- *
- * Normalization is implemented by fingerprinting queries, selectively
- * serializing those fields of each query tree's nodes that are judged to be
- * essential to the query.  This is referred to as a query jumble.  This is
- * distinct from a regular serialization in that various extraneous
- * information is ignored as irrelevant or not essential to the query, such
- * as the collations of Vars and, most notably, the values of constants.
- *
- * This jumble is acquired at the end of parse analysis of each query, and
- * a 64-bit hash of it is stored into the query's Query.queryId field.
- * The server then copies this value around, making it available in plan
- * tree(s) generated from the query.  The executor can then use this value
- * to blame query costs on the proper queryId.
+ * As of Postgres 9.2, this module normalizes query entries.  As of Postgres
+ * 14, the normalization is done by the core, if compute_queryid is enabled, or
+ * by third-party modules if enabled.
  *
  * To facilitate presenting entries to users, we create "representative" query
  * strings in which constants are replaced with parameter symbols ($n), to
@@ -114,8 +99,6 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 #define USAGE_DEALLOC_PERCENT	5	/* free this % of entries at once */
 #define IS_STICKY(c)	((c.calls[PGSS_PLAN] + c.calls[PGSS_EXEC]) == 0)
 
-#define JUMBLE_SIZE1024	/* query serialization buffer size */
-
 /*
  * Extension version number, for supporting older extension versions' objects
  */
@@ -235,40 +218,6 @@ typedef struct pgssSharedState
 	pgssGlobalStats stats;		/* global statistics for pgss */
 } pgssSharedState;
 
-/*
- * Struct for tracking locations/lengths of constants during normalization
- */
-typedef struct pgssLocationLen
-{
-	int			location;		/* start offset in query text */
-	int			length;			/* length in bytes, or -1 to ignore */
-} pgssLocationLen;
-
-/*
- * Working state for computing a query jumble and producing a normalized
- * query string
- */
-typedef struct pgssJumbleState
-{
-	/* Jumble of current query tree */
-	unsigned char *jumble;
-
-	/* Number of bytes used in jumble[] */
-	Size		jumble_len;
-
-	/* Array of locations of constants that should be removed */
-	pgssLocationLen *clocations;
-
-	/* Allocated length of clocations array */
-	int			clocations_buf_size;
-
-	/* Current number of valid entries in clocations array */
-	int			clocations_count;
-
-	/* highest Param id we've seen, in order to start normalization correctly */
-	int			highest_extern_param_id;
-} pgssJumbleState;
-
 /* Local variables */
 
 /* Current nesting depth of ExecutorRun+ProcessUtility calls */
@@ -342,7 +291,8 @@ PG_FUNCTION_INFO_V1(pg_stat_sta