Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-09 Thread Kyotaro Horiguchi
At Sat, 7 Nov 2020 19:31:21 +0530, Bharath Rupireddy 
 wrote in 
> The main reason for having SetLatch() in
> SignalHandlerForConfigReload() is to wake up the calling process if
> waiting in WaitLatchOrSocket() or WaitLatch() and reload the new
> config file and use the reloaded config variables. Maybe we should
> give a thought on the scenarios in which the walreceiver process
> waits, and what happens in case the latch is set when SIGHUP is
> received.

The difference is whether the config file is processed at the next
wakeup of walreceiver (by a force-reply request or SIGTERM) or
immediately. If server reloads happened frequently, say, several times
per second(?), we should consider reducing the useless reloading, but
that's not actually the case.
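
For reference, the standard handler only sets a flag and the latch, and
the process acts on the flag at its next wakeup, roughly like this
(a sketch from memory, abridged):

    /* signal handler (see src/backend/postmaster/interrupt.c) */
    void
    SignalHandlerForConfigReload(SIGNAL_ARGS)
    {
        int         save_errno = errno;

        ConfigReloadPending = true;
        SetLatch(MyLatch);

        errno = save_errno;
    }

    /* ...and in the process's main loop, after WaitLatch() returns: */
    if (ConfigReloadPending)
    {
        ConfigReloadPending = false;
        ProcessConfigFile(PGC_SIGHUP);
    }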

> And also, I think it's worth having a look at the commit 40f908bdcdc7
> that introduced WalRcvSigHupHandler() and 597a87ccc that replaced
> custom latch with procLatch.

At the time of the first patch, PostgreSQL processes each used an
arbitrarily implemented SIGHUP handler of their own, so it was natural
that walreceiver had its own as well.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Temporary tables versus wraparound... again

2020-11-09 Thread Masahiko Sawada
On Mon, Nov 9, 2020 at 3:23 PM Greg Stark  wrote:
>
> On Mon, 9 Nov 2020 at 00:17, Noah Misch  wrote:
> >
> > > 2) adding the dependency on heapam.h to heap.c makes sense because of
> > > heap_inplace_update but it may be a bit annoying because I suspect
> > > that's a useful sanity check that the tableam stuff hasn't been
> > > bypassed
> >
> > That is not terrible.  How plausible would it be to call 
> > vac_update_relstats()
> > for this, instead of reimplementing part of it?
>
> It didn't seem worth it to change its API to add boolean flags to skip
> setting some of the variables (I was originally only doing
> relfrozenxid and minmmxid). Now that I'm doing most of the variables
> maybe it makes a bit more sense.
>
> > > @@ -3340,6 +3383,7 @@ heap_truncate_one_rel(Relation rel)
> > >
> > >   /* Truncate the underlying relation */
> > >   table_relation_nontransactional_truncate(rel);
> > > + ResetVacStats(rel);
> >
> > I didn't test, but I expect this will cause a stats reset for the second
> > TRUNCATE here:
> >
> > CREATE TABLE t ();
> > ...
> > BEGIN;
> > TRUNCATE t;
> > TRUNCATE t;  -- inplace relfrozenxid reset
> > ROLLBACK;  -- inplace reset survives
> >
> > Does that indeed happen?
>
> Apparently no, see below.  I have to say I was pretty puzzled by the
> actual behaviour which is that the rollback actually does roll back
> the inplace update. But I *think* what is happening is that the first
> truncate does an MVCC update so the inplace update happens only to the
> newly created tuple which is never committed.

I think the in-place update that the patch introduces is not used
because TRUNCATE doesn't go through heap_truncate_one_rel() in that
scenario. Instead, it does an MVCC update of the pg_class tuple for the
new relfilenode with the new relfrozenxid and other stats; see
RelationSetNewRelfilenode(). If we create and truncate a table within
the same transaction it does use the in-place update that the patch
introduces, but I think that's no problem in this case either.
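
To spell the two cases out (a sketch of my understanding of which code
path is taken, not tested output):

    -- Pre-existing table: the first TRUNCATE assigns a new relfilenode
    -- and updates pg_class transactionally (RelationSetNewRelfilenode());
    -- the second TRUNCATE can then truncate in place, but only the
    -- uncommitted pg_class tuple version is touched, so ROLLBACK
    -- discards it all.
    CREATE TABLE t ();
    BEGIN;
    TRUNCATE t;
    TRUNCATE t;
    ROLLBACK;

    -- Table created in the same transaction: TRUNCATE goes through
    -- heap_truncate_one_rel(), where the patch's in-place reset applies.
    BEGIN;
    CREATE TABLE t2 ();
    TRUNCATE t2;
    COMMIT;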

>
> Thinking about things a bit this does worry me a bit. I wonder if
> inplace update is really safe outside of vacuum where we know we're
> not in a transaction that can be rolled back. But IIRC doing a
> non-inplace update on pg_class for these columns breaks other things.
> I don't know if that's still true.

heap_truncate_one_rel() is not a transaction-safe operation. Doing
in-place updates during that operation seems okay to me unless I'm
missing something.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: abstract Unix-domain sockets

2020-11-09 Thread Michael Paquier
On Mon, Nov 09, 2020 at 09:04:24AM +0100, Peter Eisentraut wrote:
> On 2020-11-09 07:08, Michael Paquier wrote:
> The @ is the standard way of representing this in the user interface and the
> configuration, so it seems sensible to me that way.

Ok.

> Can you sketch how you would structure this?  I realize it's not very
> elegant, but I couldn't come up with a better way that didn't involve having
> to duplicate some of the error messages into multiple branches.

I think that I would use a StringInfo to build each sentence of the
hint separately.  The first sentence, "Is another postmaster already
running on port %d?" is already known.  Then the second sentence could
be built depending on the two other conditions.  FWIW, I think that it
is confusing for the hint to suggest removing a socket file that cannot
be removed.
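
Roughly something like this (an untested sketch; the condition and
variable names are just placeholders for what that code path has at
hand):

    StringInfoData hint;

    initStringInfo(&hint);
    appendStringInfo(&hint,
                     _("Is another postmaster already running on port %d?"),
                     (int) portNumber);
    if (is_unix_socket && !is_abstract_socket)
        appendStringInfo(&hint,
                         _(" If not, remove socket file \"%s\" and retry."),
                         sock_path);
    else
        appendStringInfoString(&hint,
                               _(" If not, wait a few seconds and retry."));

    /* then: errhint("%s", hint.data) */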
--
Michael




Re: PATCH: Report libpq version and configuration

2020-11-09 Thread Craig Ringer
On Tue, Nov 10, 2020 at 12:33 AM Tom Lane  wrote:

> Alvaro Herrera  writes:
> > Well, if we can make it run in more systems than just Linux, then it
> > seems worth having.  The submitted patch seems a little bit on the
> > naughty side.
>
> I agree that the facility seems possibly useful, as long as we can
> minimize its platform dependency.  Just embedding some strings, as
> I suggested upthread, seems like it'd go far in that direction.
> Yeah, you could spend a lot of effort to make it a bit more user
> friendly, but is the effort really going to be repaid?  The use
> case for this isn't that large, I don't think.
>

The reason I want to expose symbols is mainly for tracing tooling - perf,
systemtap, etc.

I thought it'd make sense to also provide another way to identify the libpq
binary.

My other hesitation about using a "strings libpq.so" approach is that it's
not something I'd be super happy about automating and relying on in scripts
etc. It could break depending on how the compiler decides to arrange things
or due to unrelated changes in libpq that create similar-looking strings
later. I'd prefer to do it deterministically. You can already use "strings"
to identify an unstripped binary built with -ggdb3 (macros in DWARF
debuginfo), but we don't compile the PG_VERSION into the binary, so you
can't even get the basic version string like "postgres (PostgreSQL) 11.9"
from 'strings'.

The whole PQlibInfo() thing came about because I thought it'd be
potentially useful. I've had issues before with applications being built
against a newer version of libpq than what they're linked against, and it's
been rather frustrating to make the app tolerant of that. But it can be
solved (clumsily) using various existing workarounds.

The main things I'd really like to get in place are a way to get the
version as an ELF data symbol, and a simple way to ID the binary.

So the minimal change would be to declare:

const char LIBPQ_VERSION_STR[] = PG_VERSION_STR;
const int LIBPQ_VERSION_NUM = PG_VERSION_NUM;

then change PQgetVersion() to return LIBPQ_VERSION_NUM and add a
PQgetVersionStr() that returns LIBPQ_VERSION_STR.
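
With that in place, a client or tracing tool could read the version
straight from the data symbols, e.g. (a sketch that assumes the
declarations above get added to libpq; nothing like this exists today):

    #include <stdio.h>

    extern const char LIBPQ_VERSION_STR[];
    extern const int LIBPQ_VERSION_NUM;

    int
    main(void)
    {
        /* prints whatever libpq we actually linked against */
        printf("libpq %s (%d)\n", LIBPQ_VERSION_STR, LIBPQ_VERSION_NUM);
        return 0;
    }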

That OK with you?


Re: MultiXact\SLRU buffers configuration

2020-11-09 Thread Andrey Borodin



> On 10 Nov 2020, at 05:13, Tomas Vondra wrote:
> After the issue reported in [1] got fixed, I've restarted the multi-xact
> stress test, hoping to reproduce the issue. But so far no luck :-(


Tomas, many thanks for looking into this. I figured out that to make multixact
sets bigger, transactions must hang for a while and lock a large set of tuples,
but not a contiguous range, to avoid contention on buffer_content locks.
I did not manage to implement this via pgbench, which is why I was trying to
hack on a separate Go program. But, essentially, no luck there either.
I was observing something resembling it, though:

Friday, 8 May 2020 15:08:37 (every 1s)

  pid  |         wait_event         | wait_event_type | state  |                      query
-------+----------------------------+-----------------+--------+-------------------------------------------------
 41344 | ClientRead                 | Client          | idle   | insert into t1 select generate_series(1,100,1)
 41375 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41377 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41378 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41379 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41381 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41383 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41385 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
(8 rows)

but this picture was not stable.

How do you collect wait events for aggregation? just insert into some table 
with cron?
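
Something as crude as this, run every second or so, is roughly what I
have in mind (sketch; table name is arbitrary):

    CREATE TABLE wait_event_samples AS
        SELECT now() AS sample_time, pid, wait_event_type, wait_event, state, query
        FROM pg_stat_activity
        WITH NO DATA;

    -- run periodically (cron, or psql \watch):
    INSERT INTO wait_event_samples
        SELECT now(), pid, wait_event_type, wait_event, state, query
        FROM pg_stat_activity
        WHERE backend_type = 'client backend';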

Thanks!

Best regards, Andrey Borodin.



RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-09 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> So I proceeded to update the patches using the "cached" parameter and
> updated the corresponding comments to it in 0002.

OK, I'm in favor of the name "cached" now, although I first agreed with
Horiguchi-san that it's better to use a name that represents the nature
(accurate) of the information rather than the implementation (cached).  On
second thought, since smgr is a component that manages relation files on
storage (the file system), lseek(SEEK_END) is the accurate value for smgr.  The
cached value holds a possibly stale size up to which the relation has been
extended.


The patch looks almost good except for the minor ones:

(1)
+extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum,
+  bool *accurate);

The parameter is still named "accurate" here.
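Presumably it should now read:

+extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum,
+  bool *cached);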


(2)
+ * the buffer pool is sequentially scanned. Since buffers must not 
be
+ * left behind, the latter way is executed unless the sizes of all 
the
+ * involved forks are already cached. See smgrnblocks() for more 
details.
+ * This is only called in recovery when the block count of any 
fork is
+ * cached and the total number of to-be-invalidated blocks per 
relation

count of any fork is
-> counts of all forks are


(3)
In 0004, I thought you would add the invalidated block counts of all relations 
to determine if the optimization is done, as Horiguchi-san suggested.  But I 
find the current patch okay too.


Regards
Takayuki Tsunakawa





Re: logical streaming of xacts via test_decoding is broken

2020-11-09 Thread Dilip Kumar
On Tue, Nov 10, 2020 at 10:52 AM Amit Kapila  wrote:
>
> On Tue, Nov 10, 2020 at 10:26 AM Dilip Kumar  wrote:
> >
> > On Tue, Nov 10, 2020 at 8:14 AM Amit Kapila  wrote:
> > >
> > > On Mon, Nov 9, 2020 at 6:00 PM Dilip Kumar  wrote:
> > > >
> > > > On Mon, Nov 9, 2020 at 5:37 PM Dilip Kumar  
> > > > wrote:
> > > > >
> > > > > On Mon, Nov 9, 2020 at 4:21 PM Amit Kapila  
> > > > > wrote:
> > > > >
> > > > >  The bigger question is do we want to give users an option
> > > > > > for skip_empty_streams similar to skip_empty_xacts? I would again
> > > > > > prefer to give a separate option to the user as well. What do you
> > > > > > think?
> > > > >
> > > > > Yeah, I think giving an option would be better.
> > > >
> > > > I think we should also think about the combinations of the
> > > > skip_empty_xacts and skip_empty_streams.  For example, if the user
> > > > passes the skip_empty_xacts to false and skip_empty_streams to true
> > > > then what should be the behavior, if the complete transaction
> > > > option1: It should not print any stream_start/stream_stop and just
> > > > print commit stream because skip_empty_xacts is false and
> > > > skip_empty_streams is true.
> > > > option2: It should print the stream_start message for the very first
> > > > stream because it is the first stream if the txn and skip_empty_xacts
> > > > is false so print it and later it will print-stream commit.
> > > > option3: Or for the first stream we first put the BEGIN message i.e
> > > > stream begin
> > > > stream start
> > > > stream stop
> > > > stream commit
> > > > option4: the user should not be allowed to pass skip_empty_xacts =
> > > > false with skip_empty_streams to true.  Because if the streaming mode
> > > > is on then we can not print the xact without printing streams.
> > > >
> > > > What is your opinion on this?
> > > >
> > >
> > > I would prefer option-4 and in addition to that we can ensure that if
> > > skip_empty_xacts = true then by default skip_empty_streams is also
> > > true.
> >
> > But then it will behave as a single option only, right? because if
> > 1. skip_empty_xacts = true, then we set skip_empty_streams = true
> >
>
> For this case, users can use skip_empty_xacts = true and
> skip_empty_streams = false. I am just asking if the user has only used
> skip_empty_xacts = true and didn't use the 'skip_empty_streams'
> option.

Ok, thanks for the clarification.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-09 Thread Peter Smith
FYI - I have cross-checked all the v18 patch code against the v18 code
coverage [1] resulting from running the tests.

The purpose of this study was to identify where there may be any gaps
in the testing of this patch - e.g is there some v18 code not
currently getting executed by the tests?

I found almost all of the normal (not error) code paths are getting executed.

For details, please see the attached study results (MS Excel file).

===

[1] 
https://www.postgresql.org/message-id/CAHut%2BPu4BpUr0GfCLqJjXc%3DDcaKSvjDarSN89-4W2nxBeae9hQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia


v18-patch-test-coverage-20201110.xlsx
Description: MS-Excel 2007 spreadsheet


Add docs stub for recovery.conf

2020-11-09 Thread Craig Ringer
Hi all

I noticed that when recovery.conf was removed in 2dedf4d9a8 (yay!) the docs
for it were removed completely as well. That's largely sensible, but is
confusing when users have upgraded and are trying to find out what
happened, or how to configure equivalent functionality.

https://www.postgresql.org/docs/11/recovery-config.html just vanished for
/12/, and as a result
https://www.postgresql.org/docs/current/recovery-config.html is a 404. I
think that's unhelpful since we encourage people to use /current/ links.

The attached patch restores the recovery-config.html page with a short note
explaining why it's gone and what to do instead. It's added to a new
appendix "Obsolete or renamed features, settings and files".

I found it remarkably hard to find out from the docs what exactly makes a
"standby server" actually be a standby server, so I have added a couple of
cross-references that make the role of the standby.signal file much more
discoverable from the relevant locations.

I propose a policy that we preserve our existing documentation ids. We
should move them to an "obsolete" section at the end, like the one I
created here, and provide stubs for them instead of removing them. That'll
help prevent us from breaking links on the wider web, in 3rd party
documentation, etc.
From 9d6db17c8bff093c08a90129d0ba181554b5d3e2 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 10 Nov 2020 10:51:20 +0800
Subject: [PATCH] Link the old recovery.conf docs chapter to the new docs

Add a new docs section for obsolete topics. In it, add a section for
the removed recovery.conf documentation. Explain that recovery.conf
has been removed in Pg 12 and where to find the replacement
configuration.

This way when someone running PostgreSQL 12+ searches for
"recovery.conf" and finds the /11/ documentation for it, the website
docs will have a "latest" link that explains where it went. Right now
the docs file just ends, with no link for later versions. This also
helps people following /current/ web links from articles, blogs, etc,
who would currently get a 404.

Appropriate index terms are added to direct searchers to the right
place too.
---
 .../appendix-obsolete-recovery-config.sgml| 61 +++
 doc/src/sgml/appendix-obsolete.sgml   | 19 ++
 doc/src/sgml/config.sgml  |  4 +-
 doc/src/sgml/filelist.sgml|  4 ++
 doc/src/sgml/high-availability.sgml   | 16 -
 doc/src/sgml/postgres.sgml|  1 +
 doc/src/sgml/ref/pg_basebackup.sgml   |  5 +-
 7 files changed, 105 insertions(+), 5 deletions(-)
 create mode 100644 doc/src/sgml/appendix-obsolete-recovery-config.sgml
 create mode 100644 doc/src/sgml/appendix-obsolete.sgml

diff --git a/doc/src/sgml/appendix-obsolete-recovery-config.sgml b/doc/src/sgml/appendix-obsolete-recovery-config.sgml
new file mode 100644
index 00..98e4b2711d
--- /dev/null
+++ b/doc/src/sgml/appendix-obsolete-recovery-config.sgml
@@ -0,0 +1,61 @@
+
+
+
+
+  The recovery.conf file
+
+   
+ obsolete
+ recovery.conf
+   
+
+   
+PostgreSQL 11 and below used a configuration file named
+recovery.conf
+recovery.conf
+to manage replicas and standbys. Support for this file was removed in PostgreSQL 12. See
+the release notes for PostgreSQL 12 for details
+on this change.
+   
+
+   
+On PostgreSQL 12 and above,
+archive recovery, streaming replication, and PITR
+are configured using
+normal server configuration parameters.
+These are set in postgresql.conf or via
+ALTER SYSTEM
+like any other parameter.
+   
+
+   
+The server will not start if a recovery.conf exists.
+   
+
+   
+The
+trigger_file
+
+ trigger_file
+ promote_trigger_file
+
+setting has been renamed to
+
+   
+
+   
+The
+standby_mode
+
+ standby_mode
+ standby.signal
+
+setting has been removed. A standby.signal file in the data directory
+is used instead. See  for details.
+   
+
+
diff --git a/doc/src/sgml/appendix-obsolete.sgml b/doc/src/sgml/appendix-obsolete.sgml
new file mode 100644
index 00..deebbd083d
--- /dev/null
+++ b/doc/src/sgml/appendix-obsolete.sgml
@@ -0,0 +1,19 @@
+
+
+
+ Obsolete or renamed features, settings and files
+
+ 
+   obsolete
+ 
+
+ 
+   Functionality is sometimes removed from PostgreSQL or documentation for it
+   moves to different places. This section directs users coming from old
+   versions of the documentation or from external links to the appropriate
+   new location for the information they need.
+ 
+
+ &obsolete-recovery-config;
+
+
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e31..8148b19790 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4227,7 +4227,9 @@ ANY num_sync ( standby server
+  that is
   to receive replication data.  Their values on the primary server
   are irrelevant.
  
diff --git a/doc/src/sgml/filelist

Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-09 Thread Thomas Munro
On Tue, Nov 10, 2020 at 6:18 PM Amit Kapila  wrote:
> Do we always truncate all the blocks? What if the vacuum has cleaned
> last N (say 100) blocks then how do we handle it?

Oh, hmm.  Yeah, that idea only works for DROP, not for truncate last N.




[PATCH] TAP test showing that pg_replication_slot_advance() works on standby

2020-11-09 Thread Craig Ringer
Hi all

The attached patch adds a test to the TAP logical decoding suite to show
that pg_replication_slot_advance() works on a server in standby mode.

I didn't add a full demonstration of how to do failover with logical slots
because we're still missing a way to "sync" a logical slot from a primary
to a standby, or a way to directly create such a slot safely on a standby
in a way that enforces a safe catalog_xmin etc.

You can't replay from the slot unless the server is promoted, so I don't
test that.

I'm not sure if anyone's going to find it worth committing, but it's here
so searchers can find it at least.
From 218c607b64db493b975bb398528d6a952beeb32f Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 10 Nov 2020 13:06:41 +0800
Subject: [PATCH] Extend TAP test for pg_replication_slot_advance() to cover
 standby

Show that pg_replication_slot_advance() works on a standby server
too.
---
 src/test/recovery/t/006_logical_decoding.pl | 37 +++--
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/src/test/recovery/t/006_logical_decoding.pl 
b/src/test/recovery/t/006_logical_decoding.pl
index ffc3e77f83..46d521a6a0 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 14;
+use Test::More tests => 17;
 use Config;
 
 # Initialize primary node
@@ -185,19 +185,34 @@ chomp($logical_restart_lsn_post);
 ok(($logical_restart_lsn_pre cmp $logical_restart_lsn_post) == 0,
"logical slot advance persists across restarts");
 
-# done with the node
+# Show that advancing the slot is supported when in recovery (standby)
 $node_primary->stop;
-
-$node_primary->append_conf('postgresql.conf',qq[
-hot_standby = on
-primary_conninfo = 'bogus'
-]);
-
+$node_primary->set_standby_mode();
 $node_primary->start;
 is($node_primary->safe_psql('postgres', 'SELECT 
pg_is_in_recovery();'),'t','started up in recovery');
-$current_lsn = $node_primary->lsn('insert');
+$current_lsn = $node_primary->lsn('replay');
 ($result, $stdout, $stderr) = $node_primary->psql('postgres',
-   "SELECT pg_replication_slot_advance('$logical_slot', 
'$current_lsn'::pg_lsn);"
+   "SELECT end_lsn FROM pg_replication_slot_advance('$logical_slot', 
'$current_lsn'::pg_lsn);"
 );
-ok($result, "pg_replication_slot_advance() on standby succeeded");
+chomp($result);
+is($result, 0, "pg_replication_slot_advance() on standby succeeded");
 is($stderr, '', "pg_replication_slot_advance() on standby produced no stderr");
+
+# The slot actually advanced
+is($node_primary->safe_psql('postgres', "SELECT pg_lsn '$result' >= pg_lsn 
'$current_lsn';"),
+   't', 'pg_replication_slot_advance() reported slot moved forwards');
+is($node_primary->safe_psql('postgres',
+ "SELECT confirmed_flush_lsn = pg_lsn '$current_lsn' from 
pg_replication_slots WHERE slot_name = '$logical_slot';"),
+   't', 'pg_replication_slot_advance() return value matches 
pg_replication_slots on standby');
+
+# Slot advance should persist across clean restarts on standby too
+$node_primary->restart;
+$logical_restart_lsn_post = $node_primary->safe_psql('postgres',
+   "SELECT restart_lsn from pg_replication_slots WHERE slot_name = 
'$logical_slot';"
+);
+chomp($logical_restart_lsn_post);
+ok(($logical_restart_lsn_pre cmp $logical_restart_lsn_post) == 0,
+   "logical slot advance persists across restarts");
+
+# done with the node
+$node_primary->stop;
-- 
2.26.2



Re: [BUG]: segfault during update

2020-11-09 Thread Tom Lane
Andres Freund  writes:
> On 2020-11-08 12:46:44 -0500, Tom Lane wrote:
>> In v12 we end up using the junkfilter's output
>> slot, which does not have a sufficiently accurate tupdesc to deal with
>> an on-disk tuple rather than one constructed by the executor.

> I really wonder if we ought to redesign the column default logic to
> really be a property of the Attr, instead of the constraint stuff it is
> now.

Yeah, I don't much like treating it as a constraint either.  Not quite
sure that it's worth the work to move it somewhere else, though.

> Is it worth adding Bertrand's testcase to the regression suite in some
> form?

I did add an equivalent test case.

regards, tom lane




Re: logical streaming of xacts via test_decoding is broken

2020-11-09 Thread Amit Kapila
On Tue, Nov 10, 2020 at 10:26 AM Dilip Kumar  wrote:
>
> On Tue, Nov 10, 2020 at 8:14 AM Amit Kapila  wrote:
> >
> > On Mon, Nov 9, 2020 at 6:00 PM Dilip Kumar  wrote:
> > >
> > > On Mon, Nov 9, 2020 at 5:37 PM Dilip Kumar  wrote:
> > > >
> > > > On Mon, Nov 9, 2020 at 4:21 PM Amit Kapila  
> > > > wrote:
> > > >
> > > >  The bigger question is do we want to give users an option
> > > > > for skip_empty_streams similar to skip_empty_xacts? I would again
> > > > > prefer to give a separate option to the user as well. What do you
> > > > > think?
> > > >
> > > > Yeah, I think giving an option would be better.
> > >
> > > I think we should also think about the combinations of the
> > > skip_empty_xacts and skip_empty_streams.  For example, if the user
> > > passes the skip_empty_xacts to false and skip_empty_streams to true
> > > then what should be the behavior, if the complete transaction
> > > option1: It should not print any stream_start/stream_stop and just
> > > print commit stream because skip_empty_xacts is false and
> > > skip_empty_streams is true.
> > > option2: It should print the stream_start message for the very first
> > > stream because it is the first stream if the txn and skip_empty_xacts
> > > is false so print it and later it will print-stream commit.
> > > option3: Or for the first stream we first put the BEGIN message i.e
> > > stream begin
> > > stream start
> > > stream stop
> > > stream commit
> > > option4: the user should not be allowed to pass skip_empty_xacts =
> > > false with skip_empty_streams to true.  Because if the streaming mode
> > > is on then we can not print the xact without printing streams.
> > >
> > > What is your opinion on this?
> > >
> >
> > I would prefer option-4 and in addition to that we can ensure that if
> > skip_empty_xacts = true then by default skip_empty_streams is also
> > true.
>
> But then it will behave as a single option only, right? because if
> 1. skip_empty_xacts = true, then we set skip_empty_streams = true
>

For this case, users can use skip_empty_xacts = true and
skip_empty_streams = false. I am just asking if the user has only used
skip_empty_xacts = true and didn't use the 'skip_empty_streams'
option.
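
IOW, a call like the one below would still be allowed (a sketch;
'skip-empty-streams' is the proposed option name, not something that
exists today):

    SELECT data FROM pg_logical_slot_get_changes('slot', NULL, NULL,
        'stream-changes', '1',
        'skip-empty-xacts', '1',
        'skip-empty-streams', '0');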

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-09 Thread Amit Kapila
On Tue, Nov 10, 2020 at 10:00 AM Thomas Munro  wrote:
>
> On Sat, Nov 7, 2020 at 12:40 AM Amit Kapila  wrote:
> > I think one of the problems is returning fewer rows and that too
> > without any warning or error, so maybe that is a bigger problem but we
> > seem to be okay with it as that is already a known thing though I
> > think that is not documented anywhere.
>
> I'm not OK with it, and I'm not sure it's widely known or understood,
>

Yeah, it is quite possible, but maybe because we don't see many field
reports nobody has thought of doing anything about it.

> though I think we've made some progress in this thread.  Perhaps, as a
> separate project, we need to solve several related problems with a
> shmem table of relation sizes from not-yet-synced files so that
> smgrnblocks() is fast and always sees all preceding smgrextend()
> calls.  If we're going to need something like that anyway, and if we
> can come up with a simple way to detect and report this type of
> failure in the meantime, maybe this fast DROP project should just go
> ahead and use the existing smgrnblocks() function without the weird
> caching bandaid that only works in recovery?
>

I am not sure it would be easy to detect all such failures, and we
might end up opening another can of worms, but if there is some simpler
way then sure, we can consider it. OTOH, until we have a shared cache
of relation sizes (which I think would be good for multiple things),
relying on the cache during recovery seems the safe way to proceed.
And it is not that we can't change this once we have a shared
relation-size solution.

> > > The main argument I can think of against the idea of using plain old
> > > smgrnblocks() is that the current error messages on smgrwrite()
> > > failure for stray blocks would be indistinguishible from cases where
> > > an external actor unlinked the file.  I don't mind getting an error
> > > that prevents checkpointing -- your system is in big trouble! -- but
> > > it'd be nice to be able to detect that *we* unlinked the file,
> > > implying the filesystem and bufferpool are out of sync, and spit out a
> > > special diagnostic message.  I suppose if it's the checkpointer doing
> > > the writing, it could check if the relfilenode is on the
> > > queued-up-for-delete-after-the-checkpoint list, and if so, it could
> > > produce a different error message just for this edge case.
> > > Unfortunately that's not a general solution, because any backend might
> > > try to write a buffer out and they aren't synchronised with
> > > checkpoints.
> >
> > Yeah, but I am not sure if we can consider manual (external actor)
> > tinkering with the files the same as something that happened due to
> > the database server relying on the wrong information.
>
> Here's a rough idea I thought of to detect this case; I'm not sure if
> it has holes.  When unlinking a relation, currently we truncate
> segment 0 and unlink all the rest of the segments, and tell the
> checkpointer to unlink segment 0 after the next checkpoint.
>

Do we always truncate all the blocks? What if the vacuum has cleaned
last N (say 100) blocks then how do we handle it?

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-09 Thread k.jami...@fujitsu.com
On Tuesday, November 10, 2020 12:27 PM, Horiguchi-san wrote:
> 
> At Tue, 10 Nov 2020 08:33:26 +0530, Amit Kapila 
> wrote in
> > On Tue, Nov 10, 2020 at 8:19 AM k.jami...@fujitsu.com
> >  wrote:
> > >
> > > I repeated the recovery performance test for vacuum. (I made a
> > > mistake previously in NBuffers/128) The 3 kinds of thresholds are almost
> equally performant. I chose NBuffers/256 for this patch.
> > >
> > > | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> > > |---||--|--|--|
> > > | 128MB | 1.006  | 1.007| 1.007| 1.007|
> > > | 1GB   | 0.706  | 0.606| 0.606| 0.606|
> > > | 20GB  | 1.907  | 0.606| 0.606| 0.606|
> > > | 100GB | 7.013  | 0.706| 0.606| 0.606|
> > >
> >
> > I think this data is not very clear. What is the unit of time? What is
> > the size of the relation used for the test? Did the test use an
> > optimized path for all cases? If at 128MB, there is no performance
> > gain, can we consider the size of shared buffers as 256MB as well for
> > the threshold?
> 
> In the previous testing, it was shown as:
> 
> Recovery Time (in seconds)
> | s_b   | master | patched | %reg   |
> |---||-||
> | 128MB | 3.043  | 2.977   | -2.22% |
> | 1GB   | 3.417  | 3.41| -0.21% |
> | 20GB  | 20.597 | 2.409   | -755%  |
> | 100GB | 66.862 | 2.409   | -2676% |
> 
> 
> So... The numbers seems to be in seconds, but the master gets about 10
> times faster than this result for uncertain reasons. It seems that the result
> contains something different from the difference by this patch as a larger
> part.

The unit is in seconds.
The results that Horiguchi-san mentioned were from the old test case I used,
where I vacuumed a database with 1000 relations that had been deleted.
I used a new test case for my latest results, which is why they're smaller:
VACUUM of 1 parent table (350 MB) and 100 child partition tables (6 MB each)
in separate transactions after deleting the tables. After vacuum, the
parent table became 16kB and each child table was 2224kB.

I added a test for 256MB shared_buffers, and the performance there is also
almost the same as master. We gain performance benefits at the larger
shared_buffers settings.

| s_b    | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|--------|--------|--------------|--------------|--------------|
| 128MB  | 1.006  | 1.007        | 1.007        | 1.007        |
| 256MB  | 1.006  | 1.006        | 0.906        | 0.906        |
| 1GB    | 0.706  | 0.606        | 0.606        | 0.606        |
| 20GB   | 1.907  | 0.606        | 0.606        | 0.606        |
| 100GB  | 7.013  | 0.706        | 0.606        | 0.606        |
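
To make it concrete, the check this threshold feeds into is essentially
of the following shape (just a sketch to illustrate the discussion; the
macro and function names are placeholders, not the exact patch code):

    #include <stdbool.h>
    #include <stdint.h>

    /* placeholder for the patch's threshold: 1/256 of shared buffers */
    #define BUF_DROP_FULL_SCAN_THRESHOLD(nbuffers) ((uint64_t) (nbuffers) / 256)

    /*
     * Use the optimized per-relation invalidation only when every fork's
     * size is known from the cache and the total number of blocks to drop
     * is small relative to shared_buffers; otherwise fall back to a full
     * scan of the buffer pool.
     */
    static bool
    use_optimized_drop(bool sizes_cached, uint64_t nblocks_to_invalidate,
                       int nbuffers)
    {
        return sizes_cached &&
            nblocks_to_invalidate < BUF_DROP_FULL_SCAN_THRESHOLD(nbuffers);
    }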

Regards,
Kirk Jamison




Re: [BUG]: segfault during update

2020-11-09 Thread Andres Freund
Hi,

On 2020-11-08 12:46:44 -0500, Tom Lane wrote:
> This logic is entirely gone in v12, which confirms my instinct that
> there was something about Andres' slot-manipulation changes that
> broke this scenario.

Entirely possible :(. In my defense, it wasn't exactly obvious or
documented that the fast default code relied on this
ExecSetSlotDescriptor()...


> In v12 we end up using the junkfilter's output
> slot, which does not have a sufficiently accurate tupdesc to deal with
> an on-disk tuple rather than one constructed by the executor.

I really wonder if we ought to redesign the column default logic to
really be a property of the Attr, instead of the constraint stuff it is
now.

> So I'll go see about back-patching 20d3fe900.

Thanks.

Is it worth adding Bertrand's testcase to the regression suite in some
form?

Greetings,

Andres Freund




Re: logical streaming of xacts via test_decoding is broken

2020-11-09 Thread Dilip Kumar
On Tue, Nov 10, 2020 at 8:14 AM Amit Kapila  wrote:
>
> On Mon, Nov 9, 2020 at 6:00 PM Dilip Kumar  wrote:
> >
> > On Mon, Nov 9, 2020 at 5:37 PM Dilip Kumar  wrote:
> > >
> > > On Mon, Nov 9, 2020 at 4:21 PM Amit Kapila  
> > > wrote:
> > >
> > >  The bigger question is do we want to give users an option
> > > > for skip_empty_streams similar to skip_empty_xacts? I would again
> > > > prefer to give a separate option to the user as well. What do you
> > > > think?
> > >
> > > Yeah, I think giving an option would be better.
> >
> > I think we should also think about the combinations of the
> > skip_empty_xacts and skip_empty_streams.  For example, if the user
> > passes the skip_empty_xacts to false and skip_empty_streams to true
> > then what should be the behavior, if the complete transaction
> > option1: It should not print any stream_start/stream_stop and just
> > print commit stream because skip_empty_xacts is false and
> > skip_empty_streams is true.
> > option2: It should print the stream_start message for the very first
> > stream because it is the first stream if the txn and skip_empty_xacts
> > is false so print it and later it will print-stream commit.
> > option3: Or for the first stream we first put the BEGIN message i.e
> > stream begin
> > stream start
> > stream stop
> > stream commit
> > option4: the user should not be allowed to pass skip_empty_xacts =
> > false with skip_empty_streams to true.  Because if the streaming mode
> > is on then we can not print the xact without printing streams.
> >
> > What is your opinion on this?
> >
>
> I would prefer option-4 and in addition to that we can ensure that if
> skip_empty_xacts = true then by default skip_empty_streams is also
> true.

But then it will behave as a single option only, right? because if
1. skip_empty_xacts = true, then we set skip_empty_streams = true
2. skip_empty_xacts = false then skip_empty_streams can not be set to true

So as per the state machine either both will be true or both will be
false, Am I missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Allow some recovery parameters to be changed with reload

2020-11-09 Thread Kyotaro Horiguchi
At Sat, 07 Nov 2020 00:36:33 +0300, Sergei Kornilov  wrote in 
> Hello
> 
> > I'm wondering if it's safe to allow restore_command to be emptied during
> > archive recovery. Even when it's emptied, archive recovery can proceed
> > by reading WAL files from pg_wal directory. This is the same behavior as
> > when restore_command is set to, e.g., /bin/false.
> 
> I am always confused by this implementation detail. restore_command fails? 
> Fine, let's just read file from pg_wal. But this is different topic...

--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -114,6 +114,11 @@ StartupRereadConfig(void)
 
if (conninfoChanged || slotnameChanged || tempSlotChanged)
StartupRequestWalReceiverRestart();
+
+   /*
+* Check the combination of new parameters
+*/
+   validateRecoveryParameters();

If someone changes restore_command to '' and then reloads while crash
recovery is running, the server stops for no valid reason. If
restore_command is set to 'hoge' (literally:p, that is, anything
unexecutable) and SIGHUP is sent while archive recovery is running, the
server also stops.  I think we need to handle these cases more
gracefully. That said, I think we should keep the current behavior that
the server stops if the same thing happens just after server start.

If someone changes restore_command by mistake to something executable
that fails to return the specified file even when it exists, the running
archive recovery finishes and then switches timeline unexpectedly.  By
the same reasoning as in the discussion about unexecutable contents just
above, that behavior seems valid when the variable has not changed
since startup, but I'm not sure what to do if that happens via a reload
while (archive|crash) recovery is proceeding.

> I do not know the history of this fatal ereport. It looks like "must specify 
> restore_command when standby mode is not enabled" check is only intended to 
> protect the user from misconfiguration and the rest code will treat empty 
> restore_command correctly, just like /bin/false. Did not notice anything 
> around StandbyMode conditions.

If restore_command is not changeable after server start, it would be
valid for startup to stop on unexecutable content in the variable,
since there's no way to proceed with recovery.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Online verification of checksums

2020-11-09 Thread Michael Paquier
On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> I was referring to the patch I sent on this thread that fixes the
> detection of a corruption for the zero-only case and where pd_lsn
> and/or pg_upper are trashed by a corruption of the page header.  Both
> cases allow a base backup to complete on HEAD, while sending pages
> that could be corrupted, which is wrong.  Once you make the page
> verification rely only on pd_checksum, as the patch does because the
> checksum is the only source of truth in the page header, corrupted
> pages are correctly detected, causing pg_basebackup to complain as it
> should.  However, it has also the risk to cause pg_basebackup to fail
> *and* to report as broken pages that are in the process of being
> written, depending on how slow a disk is able to finish a 8kB write.
> That's a different kind of wrongness, and users have two more reasons
> to be pissed.  Note that if a page is found as torn we have a
> consistent page header, meaning that on HEAD the PageIsNew() and
> PageGetLSN() would pass, but the checksum verification would fail as
> the contents at the end of the page does not match the checksum.

Magnus, as the original committer of 4eb77d5, do you have an opinion
to share?
--
Michael




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-09 Thread David Rowley
On Tue, 10 Nov 2020 at 15:38, Andy Fan  wrote:
> While I have interest about what caused the tiny difference,  I admit that 
> what direction
> this patch should go is more important.  Not sure if anyone is convinced that
> v8 and v9 have a similar performance.  The current data show it is similar. I 
> want to
> profile/read code more, but I don't know what part I should pay attention to. 
>  So I think
> any hints on why v9 should be better at a noticeable level  in theory should 
> be very
> helpful.  After that, I'd like to read the code or profile more carefully.

The thought was that by putting the cache code directly inside
nodeNestloop.c, the overhead of fetching a tuple from a subnode
could be eliminated when we get a cache hit.

A cache hit on v8 looks like:

Nest loop -> Fetch new outer row
Nest loop -> Fetch inner row
Result Cache -> cache hit return first cached tuple
Nest loop -> eval qual and return tuple if matches

With v9 it's more like:

Nest Loop -> Fetch new outer row
Nest loop -> cache hit return first cached tuple
Nest loop -> eval qual and return tuple if matches

So 1 less hop between nodes.

In reality, the hop is not that expensive, so might not be a big
enough factor to slow the execution down.

There's some extra complexity in v9 around the slot type of the inner
tuple. A cache hit means the slot type is Minimal. But a miss means
the slot type is whatever type the inner node's slot is. So some code
exists to switch the qual and projection info around depending on if
we get a cache hit or a miss.

I did some calculations on how costly pulling a tuple through a node in [1].

David

[1] 
https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-09 Thread Thomas Munro
On Sat, Nov 7, 2020 at 12:40 AM Amit Kapila  wrote:
> I think one of the problems is returning fewer rows and that too
> without any warning or error, so maybe that is a bigger problem but we
> seem to be okay with it as that is already a known thing though I
> think that is not documented anywhere.

I'm not OK with it, and I'm not sure it's widely known or understood,
though I think we've made some progress in this thread.  Perhaps, as a
separate project, we need to solve several related problems with a
shmem table of relation sizes from not-yet-synced files so that
smgrnblocks() is fast and always sees all preceding smgrextend()
calls.  If we're going to need something like that anyway, and if we
can come up with a simple way to detect and report this type of
failure in the meantime, maybe this fast DROP project should just go
ahead and use the existing smgrnblocks() function without the weird
caching bandaid that only works in recovery?

> > The main argument I can think of against the idea of using plain old
> > smgrnblocks() is that the current error messages on smgrwrite()
> > failure for stray blocks would be indistinguishible from cases where
> > an external actor unlinked the file.  I don't mind getting an error
> > that prevents checkpointing -- your system is in big trouble! -- but
> > it'd be nice to be able to detect that *we* unlinked the file,
> > implying the filesystem and bufferpool are out of sync, and spit out a
> > special diagnostic message.  I suppose if it's the checkpointer doing
> > the writing, it could check if the relfilenode is on the
> > queued-up-for-delete-after-the-checkpoint list, and if so, it could
> > produce a different error message just for this edge case.
> > Unfortunately that's not a general solution, because any backend might
> > try to write a buffer out and they aren't synchronised with
> > checkpoints.
>
> Yeah, but I am not sure if we can consider manual (external actor)
> tinkering with the files the same as something that happened due to
> the database server relying on the wrong information.

Here's a rough idea I thought of to detect this case; I'm not sure if
it has holes.  When unlinking a relation, currently we truncate
segment 0 and unlink all the rest of the segments, and tell the
checkpointer to unlink segment 0 after the next checkpoint.  What if
we also renamed segment 0 to "$X.dropped" (to be unlinked by the
checkpointer), and taught GetNewRelFileNode() to also skip anything
for which "$X.dropped" exists?  Then mdwrite() could use
_mdfd_getseg(EXTENSION_RETURN_NULL), and if it gets NULL (= no file),
then it checks if "$X.dropped" exists, and if so it knows that it is
trying to write a stray block from a dropped relation in the buffer
pool.  Then we panic, or warn but drop the write.  The point of the
renaming is that (1) mdwrite() for segment 0 will detect the missing
file (not just higher segments), (2) every backend can see that a
relation has been recently dropped, while also interlocking with the
checkpointer through buffer locks.
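
In pseudo-C, the mdwrite()-side check would be something like this
(names invented for illustration, not actual md.c code; assume it runs
when _mdfd_getseg(..., EXTENSION_RETURN_NULL) found no file):

    static void
    report_missing_segment(const char *path)
    {
        char        marker[MAXPGPATH];

        snprintf(marker, sizeof(marker), "%s.dropped", path);

        if (access(marker, F_OK) == 0)
        {
            /*
             * We dropped this relation ourselves: buffer pool and
             * filesystem are out of sync -- panic (or warn and drop the
             * write).
             */
            elog(PANIC, "stray write to dropped relation file \"%s\"", path);
        }
        else
        {
            /* Someone else removed the file: keep today's error. */
            elog(ERROR, "could not open file \"%s\": %m", path);
        }
    }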

> One vague idea could be to develop pg_test_seek which can detect such
> problems but not sure if we can rely on such a tool to always give us
> the right answer. Were you able to consistently reproduce the lseek
> problem on the system where you have tried?

Yeah, I can reproduce that reliably, but it requires quite a bit of
set-up as root so it might be tricky to package up in easy to run
form.  It might be quite nice to prepare an easy-to-use "gallery of
weird buffered I/O effects" project, including some of the
local-filesystem-with-fault-injection stuff that Craig Ringer and
others were testing with a couple of years ago, but maybe not in the
pg repo.




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-09 Thread Masahiko Sawada
On Mon, Nov 9, 2020 at 8:21 PM Amit Kapila  wrote:
>
> On Mon, Nov 9, 2020 at 1:38 PM Masahiko Sawada  wrote:
> >
> > On Mon, Nov 9, 2020 at 3:23 PM Peter Smith  wrote:
> > >
> >
> > I've looked at the patches and done some tests. Here is my comment and
> > question I realized during testing and reviewing.
> >
> > +static void
> > +DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
> > + xl_xact_parsed_prepare *parsed)
> > +{
> > +   XLogRecPtr  origin_lsn = parsed->origin_lsn;
> > +   TimestampTz commit_time = parsed->origin_timestamp;
> >
> >  static void
> >  DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
> > -   xl_xact_parsed_abort *parsed, TransactionId xid)
> > +   xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared)
> >  {
> > int i;
> > +   XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
> > +   TimestampTz commit_time = 0;
> > +   XLogRecPtr  origin_id = XLogRecGetOrigin(buf->record);
> >
> > -   for (i = 0; i < parsed->nsubxacts; i++)
> > +   if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
> > {
> > -   ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
> > -  buf->record->EndRecPtr);
> > +   origin_lsn = parsed->origin_lsn;
> > +   commit_time = parsed->origin_timestamp;
> > }
> >
> > In the above two changes, parsed->origin_timestamp is used as
> > commit_time. But in DecodeCommit() we use parsed->xact_time instead.
> > Therefore, if a transaction didn't have replorigin_session_origin, the
> > timestamp in the logical decoding output generated by test_decoding with
> > the 'include-timestamp' option is invalid. Is that intentional?
> >
>
> I think all three DecodePrepare/DecodeAbort/DecodeCommit should have
> same handling for this with the exception that at DecodePrepare time
> we can't rely on XACT_XINFO_HAS_ORIGIN but instead we need to check if
> origin_timestamp is non-zero then we will overwrite commit_time with
> it. Does that make sense to you?

Yeah, that makes sense to me.

> > 'git show --check' of v18-0002 reports some warnings.
> >
>
> I have also noticed this. Actually, I have already started making some
> changes to these patches apart from what you have reported so I'll
> take care of these things as well.

Ok.

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Refactor MD5 implementations and switch to EVP for OpenSSL

2020-11-09 Thread Michael Paquier
On Fri, Nov 06, 2020 at 04:34:34PM +0900, Michael Paquier wrote:
> The attached patch set does a bit of rework to make the Postgres code
> more consistent with OpenSSL, similarly to the work I did for all the
> SHA2 implementations with EVP in [1]:
> - 0001 is something stolen from the SHA2 set, adding to resowner.c
> control of EVP contexts, so as it is possible to clean up anything
> allocated by OpenSSL.
> - 0002 is the central piece, that moves the duplicated
> implementation.  src/common/ and pgcrypto/ use the same code, but I
> have reused pgcrypto as it was already doing the init/update/final
> split similarly to PostgreSQL.  New APIs are designed to control MD5
> contexts, similarly to the work done for SHA2.  Upon using this patch,
> note that pgcrypto+OpenSSL uses our in-core implementation instead of
> OpenSSL's one, but that's fixed in 0003.  We have a set of three
> convenience routines used to generate MD5-hashed passwords, that I
> have moved to a new file in src/common/md5_common.c, aimed at being
> shared between all the implementations.
> - 0003 adds the MD5 implementation based on OpenSSL's EVP, ending the
> work.

The CF bot has been complaining on Windows and this issue is fixed in
the attached.  A refresh of src/tools/msvc for pgcrypto was just
missing.
--
Michael
From f81ff379f10bb18cb2a44607262d0f27050afc3e Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 5 Nov 2020 15:53:00 +0900
Subject: [PATCH v2 1/3] Add APIs to control EVP contexts for resource owners

This will be used by a set of upcoming patches for EVP contexts with
SHA2 and MD5.
---
 src/include/utils/resowner_private.h  |  7 +++
 src/backend/utils/resowner/resowner.c | 65 +++
 2 files changed, 72 insertions(+)

diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a781a7a2aa..5ce6fcf882 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -95,4 +95,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
    Datum handle);
 
+/* support for EVP context management */
+extern void ResourceOwnerEnlargeEVP(ResourceOwner owner);
+extern void ResourceOwnerRememberEVP(ResourceOwner owner,
+	 Datum handle);
+extern void ResourceOwnerForgetEVP(ResourceOwner owner,
+   Datum handle);
+
 #endif			/* RESOWNER_PRIVATE_H */
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 8bc2c4e9ea..1efb5e98b4 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -20,6 +20,10 @@
  */
 #include "postgres.h"
 
+#ifdef USE_OPENSSL
+#include 
+#endif
+
 #include "common/hashfn.h"
 #include "jit/jit.h"
 #include "storage/bufmgr.h"
@@ -128,6 +132,7 @@ typedef struct ResourceOwnerData
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
 	ResourceArray jitarr;		/* JIT contexts */
+	ResourceArray evparr;		/* EVP contexts */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -175,6 +180,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintEVPLeakWarning(Datum handle);
 
 
 /*
@@ -444,6 +450,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->evparr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -553,6 +560,17 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
 			jit_release_context(context);
 		}
+
+		/* Ditto for EVP contexts */
+		while (ResourceArrayGetAny(&(owner->evparr), &foundres))
+		{
+			if (isCommit)
+PrintEVPLeakWarning(foundres);
+#ifdef USE_OPENSSL
+			EVP_MD_CTX_destroy((EVP_MD_CTX *) DatumGetPointer(foundres));
+#endif
+			ResourceOwnerForgetEVP(owner, foundres);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -725,6 +743,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
 	Assert(owner->jitarr.nitems == 0);
+	Assert(owner->evparr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -752,6 +771,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
 	ResourceArrayFree(&(owner->jitarr));
+	ResourceArrayFree(&(owner->evparr));
 
 	pfree(owner);
 }
@@ -1370,3 +1390,48 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
 		elog(ERROR, "J

Re: upcoming API changes for LLVM 12

2020-11-09 Thread Tom Lane
Andres Freund  writes:
> I pushed the change to master.

Thanks!

> If that doesn't show any problems, I'll
> backpatch in a week or so. Seawasp runs only on master, so it should
> satisfy the buildfarm at least.

Yeah, sounds like a good plan.  FWIW, master builds clean for me.

regards, tom lane




Re: upcoming API changes for LLVM 12

2020-11-09 Thread Andres Freund
Hi,

On 2020-11-08 18:22:50 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > Yea, I'll try to do that in the next few days

I pushed the change to master. If that doesn't show any problems, I'll
backpatch in a week or so. Seawasp runs only on master, so it should
satisfy the buildfarm at least.


> > (was plannin to last week,
> > but due to a hand injury I was typing one handed last week - makes it
> > pretty annoying to clean up code. But I just started being able to at
> > least use my left thumb again...).
> 
> Ouch.  Get well soon, and don't overstress your hand --- that's a
> good recipe for long-term problems.

Thanks! I *am* planning not to write all that much for a while. But it's
frustrating / hard, as many other activities are even less an option...

Greetings,

Andres Freund




Re: list of extended statistics on psql

2020-11-09 Thread Tatsuro Yamada

Hi Tomas,


I took a look at this today, and I think the code is ready, but the
regression test needs a bit more work:


Thanks for taking your time. :-D



1) It's probably better to use somewhat more specific names for the
objects, especially when created in public schema. It decreases the
chance of a collision with other tests (which may be hard to notice
because of timing). I suggest we use "stts_" prefix or something like
that, per the attached 0002 patch. (0001 is just the v7 patch)


I agree with your comment. Thanks.




2) The test is failing intermittently because it's executed in parallel
with stats_ext test, which is also creating extended statistics. So
depending on the timing the \dX may list some of the stats_ext stuff.
I'm not sure what to do about this. Either this part needs to be moved
to a separate test executed in a different group, or maybe we should
simply move it to stats_ext.


I thought all tests related to meta-commands lived in psql.sql, but I
realized that's not true. For example, the test of \dRp does not exist in
psql.sql. Therefore, I moved the regression test of \dX to stats_ext.sql
to avoid the test failing when run in parallel.

Attached patches is following:
 - 0001-v8-Add-dX-command-on-psql.patch
 - 0002-Add-regression-test-of-dX-to-stats_ext.sql.patch

However, I feel the test of \dX is not elegant, so I'm going to try
creating another one, since it would be better for it to fit the context
of the existing extended stats tests.

Regards,
Tatsuro Yamada


From 11c088cb43cdb13e445e246d0529d5721cf3bb10 Mon Sep 17 00:00:00 2001
From: Tatsuro Yamada 
Date: Tue, 10 Nov 2020 11:39:14 +0900
Subject: [PATCH] Add regression test of \dX to stats_ext.sql

---
 src/test/regress/expected/stats_ext.out | 94 +
 src/test/regress/sql/stats_ext.sql  | 31 +++
 2 files changed, 125 insertions(+)

diff --git a/src/test/regress/expected/stats_ext.out 
b/src/test/regress/expected/stats_ext.out
index 4c3edd213f..0ec4e24960 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -1550,6 +1550,100 @@ INSERT INTO tststats.priv_test_tbl
 CREATE STATISTICS tststats.priv_test_stats (mcv) ON a, b
   FROM tststats.priv_test_tbl;
 ANALYZE tststats.priv_test_tbl;
+-- Check printing info about extended statistics by \dX
+create table stts_t1 (a int, b int);
+create statistics stts_1 (ndistinct) on a, b from stts_t1;
+create statistics stts_2 (ndistinct, dependencies) on a, b from stts_t1;
+create statistics stts_3 (ndistinct, dependencies, mcv) on a, b from stts_t1;
+create table stts_t2 (a int, b int, c int);
+create statistics stts_4 on b, c from stts_t2;
+create table stts_t3 (col1 int, col2 int, col3 int);
+create statistics stts_hoge on col1, col2, col3 from stts_t3;
+create schema stts_s1;
+create schema stts_s2;
+create statistics stts_s1.stts_foo on col1, col2 from stts_t3;
+create statistics stts_s2.stts_yama (dependencies, mcv) on col1, col3 from 
stts_t3;
+insert into stts_t1 select i,i from generate_series(1,100) i;
+analyze stts_t1;
+\dX
+  List of extended statistics
+  Schema  |          Name          |              Definition               | N_distinct | Dependencies |   Mcv   
+----------+------------------------+---------------------------------------+------------+--------------+---------
+ public   | func_deps_stat         | a, b, c FROM functional_dependencies  |            | built        | 
+ public   | mcv_lists_arrays_stats | a, b, c FROM mcv_lists_arrays         |            |              | built
+ public   | mcv_lists_bool_stats   | a, b, c FROM mcv_lists_bool           |            |              | built
+ public   | mcv_lists_stats        | a, b, d FROM mcv_lists                |            |              | built
+ public   | stts_1                 | a, b FROM stts_t1                     | built      |              | 
+ public   | stts_2                 | a, b FROM stts_t1                     | built      | built        | 
+ public   | stts_3                 | a, b FROM stts_t1                     | built      | built        | built
+ public   | stts_4                 | b, c FROM stts_t2                     | defined    | defined      | defined
+ public   | stts_hoge              | col1, col2, col3 FROM stts_t3         | defined    | defined      | defined
+ stts_s1  | stts_foo               | col1, col2 FROM stts_t3               | defined    | defined      | defined
+ stts_s2  | stts_yama              | col1, col3 FROM stts_t3               |            | defined      | defined
+ tststats | priv_test_stats        | a, b FROM tststats.priv_test_tbl      |            |              | built
+(12 rows)
+
+\dX stts_?
+List of extended statistics
+ Schema |  Name  |Definition | N_distinct | Dependencies |   Mcv   
+++---++--+-
+ public | stts_1 | a, b FROM stts_t1 | built

Re: warn_unused_results

2020-11-09 Thread Michael Paquier
On Mon, Nov 09, 2020 at 08:23:31AM +0100, Peter Eisentraut wrote:
> On 2020-11-09 07:56, Michael Paquier wrote:
>> This is accepted by clang, and MSVC has visibly an equivalent for
>> that, as of VS 2012:
>> #elif defined(_MSC_VER) && (_MSC_VER >= 1700)
>> #define pg_nodiscard _Check_return_
>> We don't care about the 1700 condition as we support only >= 1800 on
>> HEAD, and in this case the addition of pg_nodiscard would be required
>> on the definition and the declaration.  Should it be added?  It is
>> much more invasive than the gcc/clang equivalent though..
> 
> AFAICT from the documentation, this only applies for special "analyze" runs,
> not as a normal compiler warning.  Do we have any support for analyze runs
> with MSVC?

You can run them by passing down /p:RunCodeAnalysis=true to MSBFLAGS
when calling the build script.  There are more options like
/p:CodeAnalysisRuleSet to define a set of rules.  By default the
output is rather noisy though now that I look at it.  And having to
add the flag to the definition and the declaration is annoying, so
what you are doing would be enough without MSVC.

>> I am not sure
>> about the addition of repalloc(), as it is quite obvious that one has
>> to use its result.  Lists are fine, these are proper to PG internals
>> and beginners tend to be easily confused in the way to use them.
> 
> realloc() is listed in the GCC documentation as the reason this option
> exists, and glibc tags its realloc() with this attribute, so doing the same
> for repalloc() seems sensible.

Good point.
--
Michael


signature.asc
Description: PGP signature


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-09 Thread Kyotaro Horiguchi
At Tue, 10 Nov 2020 08:33:26 +0530, Amit Kapila  wrote 
in 
> On Tue, Nov 10, 2020 at 8:19 AM k.jami...@fujitsu.com
>  wrote:
> >
> > I repeated the recovery performance test for vacuum. (I made a mistake 
> > previously in NBuffers/128)
> > The 3 kinds of thresholds are almost equally performant. I chose 
> > NBuffers/256 for this patch.
> >
> > | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> > |---||--|--|--|
> > | 128MB | 1.006  | 1.007| 1.007| 1.007|
> > | 1GB   | 0.706  | 0.606| 0.606| 0.606|
> > | 20GB  | 1.907  | 0.606| 0.606| 0.606|
> > | 100GB | 7.013  | 0.706| 0.606| 0.606|
> >
> 
> I think this data is not very clear. What is the unit of time? What is
> the size of the relation used for the test? Did the test use an
> optimized path for all cases? If at 128MB, there is no performance
> gain, can we consider the size of shared buffers as 256MB as well for
> the threshold?

In the previous testing, it was shown as:

Recovery Time (in seconds)
| s_b   | master | patched | %reg   | 
|---||-|| 
| 128MB | 3.043  | 2.977   | -2.22% | 
| 1GB   | 3.417  | 3.41| -0.21% | 
| 20GB  | 20.597 | 2.409   | -755%  | 
| 100GB | 66.862 | 2.409   | -2676% |


So... the numbers seem to be in seconds, but master is about 10 times
faster than in this earlier result, for unclear reasons. It seems that a
large part of what is being measured here is something other than the
difference made by this patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-09 Thread Tom Lane
Alvaro Herrera  writes:
> Yeah ... it would be much better if we can make it use atomics instead.

I was thinking more like "do we need any locking at all".

Assuming that a proc's vacuumFlags can be set by only the process itself,
there's no write conflicts to worry about.  On the read side, there's a
hazard that onlookers will not see the PROC_IN_SAFE_IC flag set; but
that's not any different from what the outcome would be if they looked
just before this stanza executes.  And even if they don't see it, at worst
we lose the optimization being proposed.

There is a question of whether it's important that both copies of the flag
appear to update atomically ... but that just begs the question "why in
heaven's name are there two copies?"

regards, tom lane




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-09 Thread Amit Kapila
On Tue, Nov 10, 2020 at 8:19 AM k.jami...@fujitsu.com
 wrote:
>
> > From: k.jami...@fujitsu.com 
> > On Thursday, October 22, 2020 3:15 PM, Kyotaro Horiguchi
> >  wrote:
> > > I'm not sure about the exact steps of the test, but it can be expected
> > > if we have many small relations to truncate.
> > >
> > > Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to Nbuffers / 512,
> > which
> > > is quite arbitrary that comes from a wild guess.
> > >
> > > Perhaps we need to run benchmarks that drops one relation with several
> > > different ratios between the number of buffers to-be-dropped and
> > > Nbuffers, and preferably both on spinning rust and SSD.
> >
> > Sorry to get back to you on this just now.
> > Since we're prioritizing the vacuum patch, we also need to finalize which
> > threshold value to use.
> > I proceeded testing with my latest set of patches because Amit-san's
> > comments on the code, the ones we addressed, don't really affect the
> > performance. I'll post the updated patches for 0002 & 0003 after we come up
> > with the proper boolean parameter name for smgrnblocks and the buffer full
> > scan threshold value.
> >
> > Test the VACUUM performance with the following thresholds:
> >NBuffers/512, NBuffers/256, NBuffers/128, and determine which of the
> > ratio has the best performance in terms of speed.
> >
> > I tested this on my machine (CPU 4v, 8GB memory, ext4) running on SSD.
> > Configure streaming replication environment.
> > shared_buffers = 100GB
> > autovacuum = off
> > full_page_writes = off
> > checkpoint_timeout = 30min
> >
> > [Steps]
> > 1. Create TABLE
> > 2. INSERT data
> > 3. DELETE from TABLE
> > 4. Pause WAL replay on standby
> > 5. VACUUM. Stop the primary.
> > 6. Resume WAL replay and promote standby.
> >
> > With 1 relation, there were no significant changes that we can observe:
> > (In seconds)
> > | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> > |---||--|--|--|
> > | 128MB | 0.106  | 0.105| 0.105| 0.105|
> > | 100GB | 0.106  | 0.105| 0.105| 0.105|
> >
> > So I tested with 100 tables and got more convincing measurements:
> >
> > | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> > |---||--|--|--|
> > | 128MB | 1.006  | 1.007| 1.006| 0.107|
> > | 1GB   | 0.706  | 0.606| 0.606| 0.605|
> > | 20GB  | 1.907  | 0.606| 0.606| 0.605|
> > | 100GB | 7.013  | 0.706| 0.606| 0.607|
> >
> > The threshold NBuffers/128 has the best performance for default
> > shared_buffers (128MB) with 0.107 s, and equally performing with large
> > shared_buffers up to 100GB.
> >
> > We can use NBuffers/128 for the threshold, although I don't have a
> > measurement for HDD yet.
> > However, I wonder if the above method would suffice to determine the final
> > threshold that we can use. If anyone has suggestions on how we can come
> > up with the final value, like if I need to modify some steps above, I'd
> > appreciate it.
> >
> > Regarding the parameter name. Instead of accurate, we can use "cached" as
> > originally intended from the early versions of the patch, since it is the
> > smgr that handles smgrnblocks to get the block count from smgr_cached_nblocks.
> > "accurate" may confuse us because the cached value may not actually be
> > accurate.
>
> Hi,
>
> So I proceeded to update the patches using the "cached" parameter and updated
> the corresponding comments to it in 0002.
>
> I've addressed the suggestions and comments of Amit-san on 0003:
> 1. For readability, I moved the code block to a new static function 
> FindAndDropRelFileNodeBuffers()
> 2. Initialize the bool cached with false.
> 3. It's also decided that we don't need the extra pre-checking of RelFileNode
> when locking the bufhdr in FindAndDropRelFileNodeBuffers
>
> I repeated the recovery performance test for vacuum. (I made a mistake 
> previously in NBuffers/128)
> The 3 kinds of thresholds are almost equally performant. I chose NBuffers/256 
> for this patch.
>
> | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> |---||--|--|--|
> | 128MB | 1.006  | 1.007| 1.007| 1.007|
> | 1GB   | 0.706  | 0.606| 0.606| 0.606|
> | 20GB  | 1.907  | 0.606| 0.606| 0.606|
> | 100GB | 7.013  | 0.706| 0.606| 0.606|
>

I think this data is not very clear. What is the unit of time? What is
the size of the relation used for the test? Did the test use an
optimized path for all cases? If at 128MB, there is no performance
gain, can we consider the size of shared buffers as 256MB as well for
the threshold?

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-09 Thread k.jami...@fujitsu.com
> From: k.jami...@fujitsu.com 
> On Thursday, October 22, 2020 3:15 PM, Kyotaro Horiguchi
>  wrote:
> > I'm not sure about the exact steps of the test, but it can be expected
> > if we have many small relations to truncate.
> >
> > Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to Nbuffers / 512,
> which
> > is quite arbitrary that comes from a wild guess.
> >
> > Perhaps we need to run benchmarks that drops one relation with several
> > different ratios between the number of buffers to-be-dropped and
> > Nbuffers, and preferably both on spinning rust and SSD.
> 
> Sorry to get back to you on this just now.
> Since we're prioritizing the vacuum patch, we also need to finalize which
> threshold value to use.
> I proceeded testing with my latest set of patches because Amit-san's
> comments on the code, the ones we addressed, don't really affect the
> performance. I'll post the updated patches for 0002 & 0003 after we come up
> with the proper boolean parameter name for smgrnblocks and the buffer full
> scan threshold value.
> 
> Test the VACUUM performance with the following thresholds:
>NBuffers/512, NBuffers/256, NBuffers/128, and determine which of the
> ratio has the best performance in terms of speed.
> 
> I tested this on my machine (CPU 4v, 8GB memory, ext4) running on SSD.
> Configure streaming replication environment.
> shared_buffers = 100GB
> autovacuum = off
> full_page_writes = off
> checkpoint_timeout = 30min
> 
> [Steps]
> 1. Create TABLE
> 2. INSERT data
> 3. DELETE from TABLE
> 4. Pause WAL replay on standby
> 5. VACUUM. Stop the primary.
> 6. Resume WAL replay and promote standby.
> 
> With 1 relation, there were no significant changes that we can observe:
> (In seconds)
> | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> |---||--|--|--|
> | 128MB | 0.106  | 0.105| 0.105| 0.105|
> | 100GB | 0.106  | 0.105| 0.105| 0.105|
> 
> So I tested with 100 tables and got more convincing measurements:
> 
> | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> |---||--|--|--|
> | 128MB | 1.006  | 1.007| 1.006| 0.107|
> | 1GB   | 0.706  | 0.606| 0.606| 0.605|
> | 20GB  | 1.907  | 0.606| 0.606| 0.605|
> | 100GB | 7.013  | 0.706| 0.606| 0.607|
> 
> The threshold NBuffers/128 has the best performance for default
> shared_buffers (128MB) with 0.107 s, and equally performing with large
> shared_buffers up to 100GB.
> 
> We can use NBuffers/128 for the threshold, although I don't have a
> measurement for HDD yet.
> However, I wonder if the above method would suffice to determine the final
> threshold that we can use. If anyone has suggestions on how we can come
> up with the final value, like if I need to modify some steps above, I'd
> appreciate it.
> 
> Regarding the parameter name. Instead of accurate, we can use "cached" as
> originally intended from the early versions of the patch, since it is the
> smgr that handles smgrnblocks to get the block count from smgr_cached_nblocks.
> "accurate" may confuse us because the cached value may not actually be
> accurate.

Hi, 

So I proceeded to update the patches using the "cached" parameter and updated
the corresponding comments to it in 0002.

I've addressed the suggestions and comments of Amit-san on 0003:
1. For readability, I moved the code block to a new static function 
FindAndDropRelFileNodeBuffers()
2. Initialize the bool cached with false.
3. It's also decided that we don't need the extra pre-checking of RelFileNode
when locking the bufhdr in FindAndDropRelFileNodeBuffers

I repeated the recovery performance test for vacuum. (I made a mistake 
previously in NBuffers/128)
The 3 kinds of thresholds are almost equally performant. I chose NBuffers/256 
for this patch.

| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 | 
|---||--|--|--| 
| 128MB | 1.006  | 1.007| 1.007| 1.007| 
| 1GB   | 0.706  | 0.606| 0.606| 0.606| 
| 20GB  | 1.907  | 0.606| 0.606| 0.606| 
| 100GB | 7.013  | 0.706| 0.606| 0.606|

Although we said that we'll prioritize vacuum optimization first, I've also 
updated the 0004 patch
(truncate optimization) which addresses the previous concern of slower truncate 
due to
redundant lookup of already-dropped buffers. In the new patch, we initially 
drop relation buffers
using the optimized DropRelFileNodeBuffers() if the buffers do not exceed the 
full-scan threshold,
then later on we drop the remaining buffers using full-scan.
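
For clarity, the per-relation decision is roughly like the sketch below
(simplified; the name of the full-scan helper is only a placeholder, not the
actual code in the patch):

    /* Simplified sketch of the threshold decision, not the exact patch. */
    nblocks = smgrnblocks(smgr_reln, forkNum, &cached);

    if (cached && nblocks < BUF_DROP_FULL_SCAN_THRESHOLD)
        FindAndDropRelFileNodeBuffers(rnode.node, forkNum,
                                      nblocks, firstDelBlock);
    else
        DropRelFileNodeBuffersByFullScan(rnode.node, forkNum,
                                         firstDelBlock);   /* placeholder name */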

Regards,
Kirk Jamison


v30-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v30-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v30-0002-Add-bool-param-in-smgrnblo

Re: logical streaming of xacts via test_decoding is broken

2020-11-09 Thread Amit Kapila
On Mon, Nov 9, 2020 at 6:00 PM Dilip Kumar  wrote:
>
> On Mon, Nov 9, 2020 at 5:37 PM Dilip Kumar  wrote:
> >
> > On Mon, Nov 9, 2020 at 4:21 PM Amit Kapila  wrote:
> >
> >  The bigger question is do we want to give users an option
> > > for skip_empty_streams similar to skip_empty_xacts? I would again
> > > prefer to give a separate option to the user as well. What do you
> > > think?
> >
> > Yeah, I think giving an option would be better.
>
> I think we should also think about the combinations of the
> skip_empty_xacts and skip_empty_streams.  For example, if the user
> passes the skip_empty_xacts to false and skip_empty_streams to true
> then what should be the behavior, if the complete transaction
> option1: It should not print any stream_start/stream_stop and just
> print commit stream because skip_empty_xacts is false and
> skip_empty_streams is true.
> option2: It should print the stream_start message for the very first
> stream because it is the first stream if the txn and skip_empty_xacts
> is false so print it and later it will print-stream commit.
> option3: Or for the first stream we first put the BEGIN message i.e
> stream begin
> stream start
> stream stop
> stream commit
> option4: the user should not be allowed to pass skip_empty_xacts =
> false with skip_empty_streams to true.  Because if the streaming mode
> is on then we can not print the xact without printing streams.
>
> What is your opinion on this?
>

I would prefer option-4 and in addition to that we can ensure that if
skip_empty_xacts = true then by default skip_empty_streams is also
true.
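
For illustration, a minimal sketch of how that could look in test_decoding's
option handling (the *_given flag and the error wording are assumptions, not
actual code):

    /* Sketch only: skip_empty_streams defaults to skip_empty_xacts,
     * and the invalid combination (option 4) is rejected. */
    if (!skip_empty_streams_given)
        data->skip_empty_streams = data->skip_empty_xacts;
    else if (data->skip_empty_streams && !data->skip_empty_xacts)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("cannot enable skip_empty_streams when skip_empty_xacts is disabled")));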

-- 
With Regards,
Amit Kapila.




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-09 Thread Michael Paquier
On Mon, Nov 09, 2020 at 11:31:15PM -0300, Alvaro Herrera wrote:
> Yeah ... it would be much better if we can make it use atomics instead.
> Currently it's an uint8, and in PGPROC itself it's probably not a big
> deal to enlarge that, but I fear that quadrupling the size of the
> mirroring array in PROC_HDR might be bad for performance.  However,
> maybe if we use atomics to access it, then we don't need to mirror it
> anymore?  That would need some benchmarking of GetSnapshotData.

Hmm.  If you worry about the performance impact here, it is possible
to do a small performance test without this patch.  vacuum_rel() sets
the flag for a non-full VACUUM, so one backend running a manual VACUUM
in a loop on an empty relation could apply some pressure on any
benchmark, even a simple pgbench.
--
Michael


signature.asc
Description: PGP signature


Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-09 Thread Andy Fan
On Tue, Nov 10, 2020 at 7:55 AM David Rowley  wrote:

> On Tue, 10 Nov 2020 at 12:49, Tom Lane  wrote:
> >
> > Alvaro Herrera  writes:
> > > Are you taking into account the possibility that generated machine code
> > > is a small percent slower out of mere bad luck?  I remember someone
> > > suggesting that they can make code 2% faster or so by inserting random
> > > no-op instructions in the binary, or something like that.  So if the
> > > difference between v8 and v9 is that small, then it might be due to
> this
> > > kind of effect.
> >
> > Yeah.  I believe what this arises from is good or bad luck about relevant
> > tight loops falling within or across cache lines, and that sort of thing.
> > We've definitely seen performance changes up to a couple percent with
> > no apparent change to the relevant code.
>
> I do happen to prefer having the separate Result Cache node (v8), so
> from my point of view, even if the performance was equal, I'd rather
> have v8. I understand that some others feel different though.
>
>
While I am interested in what caused the tiny difference, I admit that
the direction this patch should go is more important. I'm not sure
whether anyone is convinced that v8 and v9 have similar performance;
the current data show they are similar. I want to profile and read the
code more, but I don't know which part to pay attention to, so any
hints on why v9 should in theory be noticeably better would be very
helpful. After that, I'd like to read the code or profile more
carefully.

-- 
Best Regards
Andy Fan


Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-09 Thread Alvaro Herrera
On 2020-Nov-09, Tom Lane wrote:

> Michael Paquier  writes:
> > On Mon, Nov 09, 2020 at 08:32:13PM -0500, Tom Lane wrote:
> >> Do we really need exclusive lock on the ProcArray to make this flag
> >> change?  That seems pretty bad from a concurrency standpoint.
> 
> > Any place where we update vacuumFlags acquires an exclusive LWLock on
> > ProcArrayLock.  That's held for a very short time, so IMO it won't
> > matter much in practice, particularly if you compare that with the
> > potential gains related to the existing wait phases.
> 
> Not sure I believe that it doesn't matter much in practice.  If there's
> a steady stream of shared ProcArrayLock acquisitions (for snapshot
> acquisition) then somebody wanting exclusive lock will create a big
> hiccup, whether they hold it for a short time or not.

Yeah ... it would be much better if we can make it use atomics instead.
Currently it's an uint8, and in PGPROC itself it's probably not a big
deal to enlarge that, but I fear that quadrupling the size of the
mirroring array in PROC_HDR might be bad for performance.  However,
maybe if we use atomics to access it, then we don't need to mirror it
anymore?  That would need some benchmarking of GetSnapshotData.
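
For reference, a rough sketch of what lock-free access could look like with
pg_atomic_uint32, shown here against the mirrored array (the statusFlags name
and the atomic layout are assumptions, not existing code):

    /* Writer side: a backend sets its own flag, no ProcArrayLock needed. */
    pg_atomic_fetch_or_u32(&ProcGlobal->statusFlags[MyProc->pgxactoff],
                           PROC_IN_SAFE_IC);

    /* Reader side, e.g. GetSnapshotData(): a plain atomic read suffices. */
    uint32  flags = pg_atomic_read_u32(&ProcGlobal->statusFlags[pgxactoff]);
    bool    safe_ic = (flags & PROC_IN_SAFE_IC) != 0;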





Avoiding useless SHA256 initialization with backup manifests, breaking base backups with FIPS

2020-11-09 Thread Michael Paquier
Hi all,

Trying to use OpenSSL with FIPS breaks if one attempts to call the
low-level SHA2 routines we currently use in sha2_openssl.c (upstream
calls that OpenSSLDie()), forcing a crash of PG.  The actual way to
fix that is to use EVP as I solved here:
https://commitfest.postgresql.org/30/2762/

Unfortunately, this enforces an ABI breakage so this is not
backpatchable material.  Now, if one attempts to use OpenSSL with
FIPS, the initialization of backup manifests in
InitializeBackupManifest() enforces a call to pg_sha256_init() for the
manifest file itself even if pg_basebackup, or anything requesting a
base backup with the replication protocol, does *not* want a backup
manifest.  One can, for example, request that no backup manifest be used
with --no-manifest in pg_basebackup, but even if you specify that, base
backups cause the backend to crash on HEAD when using OpenSSL with FIPS.

Looking at the code, the checksum of the manifest file is updated or
finalized only if IsManifestEnabled() is satisfied, meaning that if the
caller does not want a manifest we still do the initialization even
though we have no use for it.

Attached is a patch that I would like to back-patch down to v13 to
avoid this useless initialization, giving users the possibility to
take base backups with FIPS when not using a backup manifest.  Without
the solution in the first paragraph, you cannot make use of backup 
manifests at all with OpenSSL+FIPS (one can still enforce the use of
the in-core SHA2 implementation even if building with OpenSSL), but at
least it gives an escape route with 13.

Thoughts?
--
Michael
diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index 556e6b5040..bab5e2f53b 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -57,12 +57,17 @@ InitializeBackupManifest(backup_manifest_info *manifest,
 		 backup_manifest_option want_manifest,
 		 pg_checksum_type manifest_checksum_type)
 {
+	memset(manifest, 0, sizeof(backup_manifest_info));
+	manifest->checksum_type = manifest_checksum_type;
+
 	if (want_manifest == MANIFEST_OPTION_NO)
 		manifest->buffile = NULL;
 	else
+	{
 		manifest->buffile = BufFileCreateTemp(false);
-	manifest->checksum_type = manifest_checksum_type;
-	pg_sha256_init(&manifest->manifest_ctx);
+		pg_sha256_init(&manifest->manifest_ctx);
+	}
+
 	manifest->manifest_size = UINT64CONST(0);
 	manifest->force_encode = (want_manifest == MANIFEST_OPTION_FORCE_ENCODE);
 	manifest->first_file = true;


signature.asc
Description: PGP signature


Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-09 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Nov 09, 2020 at 08:32:13PM -0500, Tom Lane wrote:
>> Do we really need exclusive lock on the ProcArray to make this flag
>> change?  That seems pretty bad from a concurrency standpoint.

> Any place where we update vacuumFlags acquires an exclusive LWLock on
> ProcArrayLock.  That's held for a very short time, so IMO it won't
> matter much in practice, particularly if you compare that with the
> potential gains related to the existing wait phases.

Not sure I believe that it doesn't matter much in practice.  If there's
a steady stream of shared ProcArrayLock acquisitions (for snapshot
acquisition) then somebody wanting exclusive lock will create a big
hiccup, whether they hold it for a short time or not.

regards, tom lane




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-09 Thread Michael Paquier
On Mon, Nov 09, 2020 at 08:32:13PM -0500, Tom Lane wrote:
> Do we really need exclusive lock on the ProcArray to make this flag
> change?  That seems pretty bad from a concurrency standpoint.

Any place where we update vacuumFlags acquires an exclusive LWLock on
ProcArrayLock.  That's held for a very short time, so IMO it won't
matter much in practice, particularly if you compare that with the
potential gains related to the existing wait phases.
--
Michael


signature.asc
Description: PGP signature


Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-09 Thread Tom Lane
Michael Paquier  writes:
>> +LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>> +MyProc->vacuumFlags |= PROC_IN_SAFE_IC;
>> +ProcGlobal->vacuumFlags[MyProc->pgxactoff] = 
>> MyProc->vacuumFlags;
>> +LWLockRelease(ProcArrayLock);

> I can't help noticing that you are repeating the same code pattern
> eight times.  I think that this should be in its own routine, and that
> we had better document that this should be called just after starting
> a transaction, with an assertion enforcing that.

Do we really need exclusive lock on the ProcArray to make this flag
change?  That seems pretty bad from a concurrency standpoint.

regards, tom lane




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-09 Thread Michael Paquier
On Mon, Nov 09, 2020 at 04:47:43PM +0100, Dmitry Dolgov wrote:
> > On Tue, Nov 03, 2020 at 07:14:47PM +0100, Dmitry Dolgov wrote:
> > > On Thu, Aug 20, 2020 at 03:11:19PM +0900, Michael Paquier wrote:
> > > On Wed, Aug 19, 2020 at 02:16:46PM -0400, Alvaro Herrera wrote:
> > > > I did not set the flag in REINDEX CONCURRENTLY, but as I understand it
> > > > can be done too, since in essence it's the same thing as a CIC from a
> > > > snapshot management point of view.
> > >
> > > Yes, I see no problems for REINDEX CONCURRENTLY as well as long as
> > > there are no predicates and expressions involved.  The transactions
> > > that should be patched are all started in ReindexRelationConcurrently.
> > > The transaction of index_concurrently_swap() cannot set up that
> > > though.  Only thing to be careful is to make sure that safe_flag is
> > > correct depending on the list of indexes worked on.
> >
> > Hi,
> >
> > After looking through the thread and reading the patch it seems good,
> > and there are only few minor questions:
> >
> > * Doing the same for REINDEX CONCURRENTLY, which does make sense. In
> >   fact it's already mentioned in the commentaries as done, which a bit
> >   confusing.
> 
> Just to give it a shot, would the attached change be enough?

Could it be possible to rename vacuumFlags to statusFlags first?  I
did not see any objection to that suggestion.

> + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> + MyProc->vacuumFlags |= PROC_IN_SAFE_IC;
> + ProcGlobal->vacuumFlags[MyProc->pgxactoff] = 
> MyProc->vacuumFlags;
> + LWLockRelease(ProcArrayLock);

I can't help noticing that you are repeating the same code pattern
eight times.  I think that this should be in its own routine, and that
we had better document that this should be called just after starting
a transaction, with an assertion enforcing that.
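
Something along these lines is what I have in mind, purely as a sketch (the
function name and the exact assertion are mine, not the patch's):

    /*
     * Mark this backend as running a "safe" concurrent index build.
     * Must be called right after starting a transaction, before any
     * snapshot has been taken.
     */
    static void
    set_safe_index_flag(void)
    {
        Assert(MyProc->xid == InvalidTransactionId &&
               MyProc->xmin == InvalidTransactionId);

        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
        MyProc->vacuumFlags |= PROC_IN_SAFE_IC;
        ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
        LWLockRelease(ProcArrayLock);
    }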
--
Michael


signature.asc
Description: PGP signature


Re: pg_upgrade analyze script

2020-11-09 Thread Michael Paquier
On Mon, Nov 09, 2020 at 03:47:22PM +0100, Peter Eisentraut wrote:
> You should just remove those calls.  There is no need to replace them with
> vacuumdb calls.  The reason those calls were there is that they were testing
> the generated script itself.  If the script is gone, there is no more need.
> There are already separate tests for testing vacuumdb.

True, 102_vacuumdb_stages.pl already does all that.
--
Michael


signature.asc
Description: PGP signature


Adding an aminsert() hint that triggers bottom-up index deletion for UPDATEs that can't use HOT

2020-11-09 Thread Peter Geoghegan
I am working on a project called "bottom-up index deletion" (I was
calling it "delete deduplication" until recently). This is another
nbtree project. The patch series adds a mechanism that controls old
duplicate versions caused by non-HOT UPDATEs. This is very effective.
In fact, it prevents almost all "unnecessary" page splits. By that I
mean it totally prevents page splits caused by version churn, where
old duplicate versions accumulate in indexes and cause page splits.
This is at least true in the absence of a long running xact/open
snapshot, though only provided the index isn't very small.

Importantly, the delete mechanism that the patch series adds
naturally tends to bound the number of physical index tuple versions
for any given logical row represented in the index. I believe that
that's a big problem for us at present.

I'm starting a new thread to discuss issues with changing the
signature of the aminsert() index AM routine to add a hint for the
benefit of the patch. I would like to get some buy-in on the details.
But first some background:

The patch relies on the executor passing down a hint to nbtree that
indicates that the incoming index tuple is from an UPDATE that did not
modify any of the columns covered by the index. That's how the new
deletion mechanism avoids wasting cycles in cases that have no chance
of benefiting from the optimization, like plain INSERTs whose
aminsert() calls naturally insert non-transient index tuples that
point to whole new logical rows. We only trigger a relatively
expensive special round of deletion when we notice an accumulation of
versions on a leaf page, and are fairly confident (though not yet
completely sure) that that's what we see on the page. Hence
"bottom-up".

It's easy to review the executor mechanism in isolation, without
looking at the indexing stuff at all. It's broken out into its own
patch file in the latest version. The patch is called
v7-0002-Pass-down-logically-unchanged-index-hint.patch, and can be
downloaded here:

https://postgr.es/m/cah2-wzmp5aymeft_n3wadvw8d7dduaphpqrzds5kv7vjnxs...@mail.gmail.com

Some questions about the infrastructure I'm thinking of adding:

1. Is there any reason to believe that this will create noticeable
performance overhead elsewhere?

2. Is the current approach of adding a new boolean argument to
aminsert() comprehensive?

Examples of where I might have gone wrong with the current draft design:

Might I be missing an opportunity to add a more general mechanism that
will be useful in a variety of index access methods?

Maybe an enum would make more sense?

Or maybe I should add a new amupdate() routine? Or something else completely?

ISTM that the general idea of giving index access methods hints about
what's going on with UPDATE chains is a good one. It isn't necessary
for index AMs to have a *reliable* understanding of exactly what an
UPDATE chain looks like -- that would be very brittle. But conveying
the general idea of the "lifecycle" of the data at the level of a leaf
page makes perfect sense.

To expand the discussion beyond the immediate needs of my patch: I
also think that it would make sense to "delete mark" index tuples with
a visibility hint (think of a new "probably going to be garbage before
too long" index tuple bit) when DELETE statements run. Again, this
doesn't have to be reliable in the same way that setting an LP_DEAD
bit does. This delete marking stuff is not on my agenda right now.
Just an example of another mechanism based on similar principles.

--
Peter Geoghegan




Re: Disable WAL logging to speed up data loading

2020-11-09 Thread Kyotaro Horiguchi
At Mon, 9 Nov 2020 10:18:08 -0500, Stephen Frost  wrote in 
> Greetings,
> 
> * osumi.takami...@fujitsu.com (osumi.takami...@fujitsu.com) wrote:
> > When I consider the use case is the system of data warehouse
> > as described upthread, the size of each table can be large.
> > Thus, changing the status from unlogged to logged (recoverable)
> > takes much time under the current circumstances, which was discussed before.
> 
> Ok- so the issue is that, today, we dump all of the table into the WAL
> when we go from unlogged to logged, but as I outlined previously,
> perhaps that's because we're missing a trick there when
> wal_level=minimal.  If wal_level=minimal, then it would seem like we
> could lock the table, then sync it and then mark is as logged, which is

FWIW, the following does just that. I think it works regardless of
wal_level for SET UNLOGGED, and only when wal_level=minimal for
SET LOGGED.

https://www.postgresql.org/message-id/20201002.100621.1668918756520136893.horikyota@gmail.com

| For fuel(?) of the discussion, I tried a very-quick PoC for in-place
| ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
| trials of several ways, I drifted to the following way after poking
| several ways.
| 
| 1. Flip BM_PERMANENT of active buffers
| 2. adding/removing init fork
| 3. sync files,
| 4. Flip pg_class.relpersistence.
| 
| It always skips table copy in the SET UNLOGGED case, and only when
| wal_level=minimal in the SET LOGGED case.  Crash recovery seems
| working by some brief testing by hand.

> more-or-less what you're asking to have be effectively done with the
> proposed wal_level=none, but this would be an optimization for all
> existing users of wal_level=minimal who have unlogged tables that they
> want to change to logged, and this works on a per-table basis instead,
> which seems like a better approach than a cluster-wide setting.
> 
> > By having the limited window of time,
> > during wal_level=none, I'd like to make wal_level=none work to
> > localize and minimize the burden to guarantee all commands are
> > repeatable. To achieve this, after switching wal_level from none to higher 
> > ones,
> > the patch must ensure crash recovery, though.
> 
> Perhaps a helper command could be added to ALTER TABLE ALL IN TABLESPACE
> to marked a bunch of unlogged tables over to being logged would be good
> to add too.

I agree to this aspect of the in-place flipping of UNLOGGED.

> > Sorry that my current patch doesn't complete this aspect fully at present
> > but, may I have your opinion about this ?
> 
> Presently, my feeling is that we could address this use-case without
> having to introduce a new cluster-wide WAL level, and that's the
> direction I'd want to see this going.  Perhaps I'm missing something
> about why the approach I've set forth above wouldn't work, and
> wal_level=none would, but I've not seen it yet.

Couldn't we have something like the following?

 ALTER TABLE table1, table2, table3 SET UNLOGGED;

That is, multiple target object specification in an ALTER TABLE statement.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: POC: postgres_fdw insert batching

2020-11-09 Thread tsunakawa.ta...@fujitsu.com
Hello,


The attached patch implements the new bulk insert routine for postgres_fdw and 
the executor utilizing it.  It passes make check-world.

I measured performance in a basic non-partitioned case by modifying Tomas-san's 
scripts.  They perform an INSERT SELECT statement that copies one million 
records.  The table consists of two integer columns, with a primary key on one 
of them.  You can run the attached prepare.sql to set it up once.  local.sql 
inserts to the table directly, while fdw.sql inserts through a foreign table.

The performance results, the average time of 5 runs,  were as follows on a 
Linux host where the average round-trip time of "ping localhost" was 34 us:

master, local: 6.1 seconds
master, fdw: 125.3 seconds
patched, fdw: 11.1 seconds (11x improvement)


The patch accumulates at most 100 records in ModifyTableState before inserting 
in bulk.  Also, when an input record targets a different relation (= 
partition) than the already accumulated records, the accumulated records are 
inserted first and the new record is stored for a later insert.
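
In other words, the flow in the executor looks roughly like the sketch below
(all names here are illustrative, not the identifiers used in the patch):

    #define MAX_BULK_INSERT_RECORDS 100

    if (nbuffered > 0 && resultRelInfo != buffered_rri)
        flush_buffered_inserts(mtstate);          /* target relation changed */

    buffered_rri = resultRelInfo;
    ExecCopySlot(buffer_slots[nbuffered], slot);  /* keep a private copy */
    nbuffered++;

    if (nbuffered >= MAX_BULK_INSERT_RECORDS)
        flush_buffered_inserts(mtstate);          /* batch full: insert via FDW */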

[Issues]

1. Do we want a GUC parameter, say, max_bulk_insert_records = (integer), to 
control the number of records inserted at once?
The range of allowed values would be between 1 and 1,000.  1 disables bulk 
insert.
The main reason for needing this kind of parameter would be to limit the 
amount of memory used for accumulated records, which could be prohibitively 
large if each record is big.  I don't think this is a must, but I think we can 
have it.

2. Should we accumulate records per relation in ResultRelInfo instead?
That is, when inserting into a partitioned table that has foreign partitions, 
delay insertion until a certain number of input records accumulate, and then 
insert accumulated records per relation (e.g., 50 records to relation A, 30 
records to relation B, and 20 records to relation C.)  If we do that,

* The order of insertion differs from the order of input records.  Is it OK?

* Should the maximum count of accumulated records be applied per relation or 
the query?
When many foreign partitions belong to a partitioned table, if the former is 
chosen, it may use much memory in total.  If the latter is chosen, the records 
per relation could be few and thus the benefit of bulk insert could be small.


Regards
Takayuki Tsunakawa



fdw.sql
Description: fdw.sql


local.sql
Description: local.sql


prepare.sql
Description: prepare.sql


v1-0001-Add-bulk-insert-for-foreign-tables.patch
Description: v1-0001-Add-bulk-insert-for-foreign-tables.patch


Make Append Cost aware of some run time partition prune case

2020-11-09 Thread Andy Fan
Currently the cost model of an Append path sums the cost/rows of all the
subpaths. That usually works well until we run into the run-time partition
pruning case. The first consequence is that generic plans will rarely be
used in some cases. For instance, with SELECT * FROM p WHERE pkey = $1; the
custom plan will only count the cost of one partition, while the generic
plan will count the cost of all the partitions even though we are sure that
only one partition will survive. Another impact is that the planner may
choose a wrong plan; for example, with SELECT * FROM t1, p WHERE t1.a =
p.pkey; the cost/rows of t1 nested-loop-joined with p are estimated very
improperly. This patch wants to help this case to some extent.

The basic idea is that we need to estimate how many partitions will survive
after considering runtime partition pruning, and then adjust the cost/rows
accordingly. IIUC, this follows Robert's idea at [1]. However, there are too
many special cases that are hard to handle; luckily, the most common case is
something like partkey = $1, which we can estimate well, so this patch aims
to handle that part specially. For the other cases I arbitrarily assumed
that 75% of the partitions will survive; actually I wanted to use 100% to
match the current situation. If we decide to handle those special cases
differently, PartitionedRelQual, the main data struct in the implementation,
has to be redefined a bit (see the comments around it).
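
The adjustment itself can be thought of roughly as in the sketch below
(simplified; the function and variable names are placeholders of mine, not
the patch's):

    /*
     * Scale the Append estimates by the fraction of partitions expected
     * to survive run-time pruning (e.g. 1/nparts for partkey = $1, or the
     * arbitrary 75% fallback mentioned above).
     */
    double  surviving_fraction = estimate_surviving_fraction(root, rel);

    path->rows = clamp_row_est(path->rows * surviving_fraction);
    path->total_cost = path->startup_cost +
        (path->total_cost - path->startup_cost) * surviving_fraction;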

The attached is the minimum runnable patch. There are some obvious issues
here:

1. The MergeAppend path is not handled on purpose; it will be done last.
2. The cost of a Parallel Append path is not adjusted; I'm not sure what the
   best way to do that is. Two test cases in sql/partition_prune.out fail
   because of this; I just copied the 'results' to 'expected' so that you
   can see which ones are affected.

Here are some simple test results.

create table a1 partition of a for values in (1) partition by range (b, c);
create table a1_1 partition of a1 for values from (1, 0) to (1, 10);
create table a1_2 partition of a1 for values from (2, 0) to (2, 10);

create table a2 partition of a for values in (2) partition by range (b, c);
create table a2_1 partition of a2 for values from (1, 0) to (1, 10);
create table a2_2 partition of a2 for values from (2, 0) to (2, 10);


insert into a select 1, i%2 + 1, i % 10 from generate_series(1, 1) i;
insert into a select 2, i%2 + 1, i % 10 from generate_series(1, 1) i;

analyze a;

set plan_cache_mode to force_generic_plan;

prepare s as select * from a where a = $1;
PREPARE
explain execute s(1);
QUERY PLAN
---
 Append  (cost=0.00..231.00 rows=1 width=12)  *<< both rows/cost is
adjusted.*
   Subplans Removed: 2
   ->  Seq Scan on a1_1 a_1  (cost=0.00..90.50 rows=5000 width=12)
 Filter: (a = $1)
   ->  Seq Scan on a1_2 a_2  (cost=0.00..90.50 rows=5000 width=12)
 Filter: (a = $1)
(6 rows)

prepare s4 as select * from a where a = $1 and b = $2 and c = $3;
PREPARE
explain execute s4(1, 1, 2);
 QUERY PLAN

 Append  (cost=0.00..120.50 rows=1000 width=12)  *<< Here*
   Subplans Removed: 3
   ->  Seq Scan on a1_1 a_1  (cost=0.00..115.50 rows=1000 width=12)
 Filter: ((a = $1) AND (b = $2) AND (c = $3))
(4 rows)

prepare s2 as select * from a where a = $1 union all select * from a where
a = $2;
PREPARE
explain execute s2(1, 1);
   QUERY PLAN
-
 Append  (cost=0.00..762.00 rows=2 width=12)
   ->  Append  (cost=0.00..231.00 rows=1 width=12)  *<< Here*
 Subplans Removed: 2
 ->  Seq Scan on a1_1 a_1  (cost=0.00..90.50 rows=5000 width=12)
   Filter: (a = $1)
 ->  Seq Scan on a1_2 a_2  (cost=0.00..90.50 rows=5000 width=12)
   Filter: (a = $1)
   ->  Append  (cost=0.00..231.00 rows=1 width=12) << Here
 Subplans Removed: 2
 ->  Seq Scan on a1_1 a_4  (cost=0.00..90.50 rows=5000 width=12)
   Filter: (a = $2)
 ->  Seq Scan on a1_2 a_5  (cost=0.00..90.50 rows=5000 width=12)
   Filter: (a = $2)
(13 rows)

prepare s3 as select * from a where a = $1 union select * from a where a =
$2;
PREPARE
explain execute s3(1, 1);
  QUERY PLAN
---
 HashAggregate  (cost=912.00..1112.00 rows=2 width=12)
   Group Key: a.a, a.b, a.c
   ->  Append  (cost=0.00..762.00 rows=2 width=12)
 ->  Append  (cost=0.00..231.00 rows=1 width=12) << Here
   Subplans Removed: 2
   ->  Seq Scan on a1_1 a_1  (cost=0.00..90.50 rows=5000
width=12)
 Filter: (a = $1)
   ->  Seq Scan on a1_2 a_2  (cost=0.00..90.50 r

Re: MultiXact\SLRU buffers configuration

2020-11-09 Thread Tomas Vondra
Hi,

After the issue reported in [1] got fixed, I've restarted the multi-xact
stress test, hoping to reproduce the issue. But so far no luck :-(

I've started slightly different tests on two machines - on one machine
I've done this:

  a) init.sql

  create table t (a int);
  insert into t select i from generate_series(1,1) s(i);
  alter table t add primary key (a);

  b) select.sql

  SELECT * FROM t
   WHERE a = (1+mod(abs(hashint4(extract(epoch from now())::int)),
1)) FOR KEY SHARE;

  c) pgbench -n -c 32 -j 8 -f select.sql -T $((24*3600)) test

The idea is to have large table and many clients hitting a small random
subset of the rows. A sample of wait events from ~24h run looks like this:

  e_type  |e_name|   sum
--+--+--
 LWLock   | BufferContent| 13913863
  |  |  7194679
 LWLock   | WALWrite |  1710507
 Activity | LogicalLauncherMain  |   726599
 Activity | AutoVacuumMain   |   726127
 Activity | WalWriterMain|   725183
 Activity | CheckpointerMain |   604694
 Client   | ClientRead   |   599900
 IO   | WALSync  |   502904
 Activity | BgWriterMain |   378110
 Activity | BgWriterHibernate|   348464
 IO   | WALWrite |   129412
 LWLock   | ProcArray| 6633
 LWLock   | WALInsert| 5714
 IO   | SLRUWrite| 2580
 IPC  | ProcArrayGroupUpdate | 2216
 LWLock   | XactSLRU | 2196
 Timeout  | VacuumDelay  | 1078
 IPC  | XactGroupUpdate  |  737
 LWLock   | LockManager  |  503
 LWLock   | WALBufMapping|  295
 LWLock   | MultiXactMemberSLRU  |  267
 IO   | DataFileWrite|   68
 LWLock   | BufferIO |   59
 IO   | DataFileRead |   27
 IO   | DataFileFlush|   14
 LWLock   | MultiXactGen |7
 LWLock   | BufferMapping|1

So, nothing particularly interesting - there certainly are not many wait
events related to SLRU.

On the other machine I did this:

  a) init.sql
  create table t (a int primary key);
  insert into t select i from generate_series(1,1000) s(i);

  b) select.sql
  select * from t for key share;

  c) pgbench -n -c 32 -j 8 -f select.sql -T $((24*3600)) test

and the wait events (24h run too) look like this:

  e_type   |e_name |   sum
---+---+--
 LWLock| BufferContent | 20804925
   |   |  2575369
 Activity  | LogicalLauncherMain   |   745780
 Activity  | AutoVacuumMain|   745292
 Activity  | WalWriterMain |   740507
 Activity  | CheckpointerMain  |   737691
 Activity  | BgWriterHibernate |   731123
 LWLock| WALWrite  |   570107
 IO| WALSync   |   452603
 Client| ClientRead|   151438
 BufferPin | BufferPin |23466
 LWLock| WALInsert |21631
 IO| WALWrite  |19050
 LWLock| ProcArray |15082
 Activity  | BgWriterMain  |14655
 IPC   | ProcArrayGroupUpdate  | 7772
 LWLock| WALBufMapping | 3555
 IO| SLRUWrite | 1951
 LWLock| MultiXactGen  | 1661
 LWLock| MultiXactMemberSLRU   |  359
 LWLock| MultiXactOffsetSLRU   |  242
 LWLock| XactSLRU  |  141
 IPC   | XactGroupUpdate   |  104
 LWLock| LockManager   |   28
 IO| DataFileRead  |4
 IO| ControlFileSyncUpdate |1
 Timeout   | VacuumDelay   |1
 IO| WALInitWrite  |1

Also nothing particularly interesting - few SLRU wait events.

So unfortunately this does not really reproduce the SLRU locking issues
you're observing - clearly, there has to be something else triggering
it. Perhaps this workload is too simplistic, or maybe we need to run
different queries. Or maybe the hw needs to be somewhat different (more
CPUs? different storage?)


[1]
https://www.postgresql.org/message-id/20201104013205.icogbi773przyny5@development

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-09 Thread Peter Geoghegan
On Mon, Nov 9, 2020 at 3:49 PM Tom Lane  wrote:
> Alvaro Herrera  writes:
> > Are you taking into account the possibility that generated machine code
> > is a small percent slower out of mere bad luck?  I remember someone
> > suggesting that they can make code 2% faster or so by inserting random
> > no-op instructions in the binary, or something like that.  So if the
> > difference between v8 and v9 is that small, then it might be due to this
> > kind of effect.
>
> Yeah.  I believe what this arises from is good or bad luck about relevant
> tight loops falling within or across cache lines, and that sort of thing.
> We've definitely seen performance changes up to a couple percent with
> no apparent change to the relevant code.

That was Andrew Gierth. And it was 5% IIRC.

In theory it should be possible to control for this using a tool like
stabilizer:

https://github.com/ccurtsinger/stabilizer

I am not aware of anybody having actually used the tool with Postgres,
though. It looks rather inconvenient.

-- 
Peter Geoghegan




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-09 Thread David Rowley
On Tue, 10 Nov 2020 at 12:49, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > Are you taking into account the possibility that generated machine code
> > is a small percent slower out of mere bad luck?  I remember someone
> > suggesting that they can make code 2% faster or so by inserting random
> > no-op instructions in the binary, or something like that.  So if the
> > difference between v8 and v9 is that small, then it might be due to this
> > kind of effect.
>
> Yeah.  I believe what this arises from is good or bad luck about relevant
> tight loops falling within or across cache lines, and that sort of thing.
> We've definitely seen performance changes up to a couple percent with
> no apparent change to the relevant code.

It possibly is this issue.

Normally how I build up my confidence in which is faster is by just
rebasing on master as it advances and seeing if the winner ever changes.
The theory here is that if one patch is consistently the fastest, then
there's more chance of there being a genuine reason for it.

So far I've only rebased v9 twice. Both times it was slower than v8.
Since the benchmarks are all scripted, it's simple enough to kick off
another round to see if anything has changed.

I do happen to prefer having the separate Result Cache node (v8), so
from my point of view, even if the performance was equal, I'd rather
have v8. I understand that some others feel different though.

David




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-09 Thread Tom Lane
Alvaro Herrera  writes:
> Are you taking into account the possibility that generated machine code
> is a small percent slower out of mere bad luck?  I remember someone
> suggesting that they can make code 2% faster or so by inserting random
> no-op instructions in the binary, or something like that.  So if the
> difference between v8 and v9 is that small, then it might be due to this
> kind of effect.

Yeah.  I believe what this arises from is good or bad luck about relevant
tight loops falling within or across cache lines, and that sort of thing.
We've definitely seen performance changes up to a couple percent with
no apparent change to the relevant code.

regards, tom lane




Re: Add important info about ANALYZE after create Functional Index

2020-11-09 Thread Fabrízio de Royes Mello
On Mon, 9 Nov 2020 at 20:27 Bruce Momjian  wrote:

>
> I see REINDEX CONCURRENTLY was fixed in head, but the docs didn't get
> updated to mention the need to run ANALYZE or wait for autovacuum before
> expression indexes can be fully used by the optimizer.  Instead of
> putting this mention in the maintenance section, I thought the CREATE
> INDEX page makes more sense, since it is more of a usability issue,
> rather than "why use expression indexes".  Patch attached, which I plan
> to apply to all supported branches.
>


Did a quick review and totally agree... thanks a lot, Bruce, for helping us
not miss it.

Regards,


-- 
   Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


Re: Add important info about ANALYZE after create Functional Index

2020-11-09 Thread Bruce Momjian
On Tue, Oct 27, 2020 at 12:12:00AM -0700, Nikolay Samokhvalov wrote:
> On Mon, Oct 26, 2020 at 3:08 PM Fabrízio de Royes Mello <
> fabriziome...@gmail.com> wrote:
> 
> Would be nice if add some information about it into our docs but not sure
> where. I'm thinking about:
> - doc/src/sgml/ref/create_index.sgml
> - doc/src/sgml/maintenance.sgml (routine-reindex)
> 
> 
> Attaching the patches for the docs, one for 11 and older, and another for 12+
> (which have REINDEX CONCURRENTLY not suffering from lack of ANALYZE).
> 
> I still think that automating is the right thing to do but of course, it's a
> much bigger topic that a quick fix dor the docs.

I see REINDEX CONCURRENTLY was fixed in head, but the docs didn't get
updated to mention the need to run ANALYZE or wait for autovacuum before
expression indexes can be fully used by the optimizer.  Instead of
putting this mention in the maintenance section, I thought the CREATE
INDEX page makes more sense, since it is more of a usability issue,
rather than "why use expression indexes".  Patch attached, which I plan
to apply to all supported branches.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
new file mode 100644
index 749db28..48c42db
*** a/doc/src/sgml/ref/create_index.sgml
--- b/doc/src/sgml/ref/create_index.sgml
*** Indexes:
*** 746,751 
--- 746,761 

  

+The system collects statistics on all of a table's columns.
+Newly-created non-expression indexes can immediately
+use these statistics to determine an index's usefulness.
+For new expression indexes, it is necessary to run ANALYZE or wait for
+the autovacuum daemon to analyze
+the table to generate statistics about new expression indexes.
+   
+ 
+   
 For most index methods, the speed of creating an index is
 dependent on the setting of .
 Larger values will reduce the time needed for index creation, so long


Re: Additional Chapter for Tutorial

2020-11-09 Thread David G. Johnston
On Sun, Nov 8, 2020 at 8:56 AM Jürgen Purtz  wrote:

>
> Good catches. Everything applied.
>

Reviewed the first three sections.

template0 - I would remove the schema portions of this and simply note this
as being a pristine recovery database in the diagram.

I would drop the word "more" and just say "system schemas".  I would drop
pg_toast from the list of system schema and focus on the three user-facing
ones.

Instead of "my_schema" (optional) I would do "my_schema" (example)

Server Graphic
#3 Global SQL Objects: Objects which are shared among all databases within
a cluster.
#6 Client applications are prohibited from connecting to template0
#1 If by you we mean "the client" saying that you work "in the cluster
data" doesn't really help.  I would emphasize the point that the client
sees an endpoint the Postmaster publishes as a port or socket file and that
plus the database name defines the endpoint the client connects to (meld
with #5)

In lieu of some of the existing detail provided about structure I would add
information about configuration and search_path at this level.

I like the object type enumeration - I would suggest grouping them by type
in a manner consistent with the documentation and making each one a link to
its "primary" section - the SQL Command reference if all else fails.

The "i" in internal in 51.3 (the image) needs capitalization).

You correctly add both Extension and Collation as database-level objects
but they are not mentioned anywhere else.  They do belong here and need to
be tied in properly in the text.

The whole thing needs a good pass focused on capitalization.  Both for
typos and to decide when various primary concepts like Instance should be
capitalized and when not.

51.4 - When you look at the diagram seeing /pg/data/base looks really cool,
but when reading the prose where both the "pg" and the "base" are omitted
and all you get are repeated references to "data", the directory name
choice becomes an issue IMO.  I suggest (and changed the attached) to name
the actual root directory "pgdata".  You should change the /pg/ directory
name to something like ".../tutorial_project/".

Since you aren't following alphabetical order anyway I would place
pg_tblspc after globals since tablespaces are globals and thus proximity
links them here - and pointing out that pg_tblspc holds the data makes
stating that global doesn't contain tablespace data unnecessary.

Maybe point out somewhere that the "base/databaseOID" directory represents
the default tablespace for each database, which isn't "global"; only the
non-default tablespaces are considered globals (or just get rid of the
mention of "non-default tablespace" for now).

David J.


0012-architecture-dgj-suggestions.patch
Description: Binary data


Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-09 Thread Alvaro Herrera
On 2020-Nov-10, David Rowley wrote:

> On Mon, 9 Nov 2020 at 16:29, Andy Fan  wrote:

> >  However I believe v9
> > should be no worse than v8 all the time,  Is there any theory to explain
> > your result?
> 
> Nothing jumps out at me from looking at profiles.   The only thing I
> noticed was the tuple deforming is more costly with v9. I'm not sure
> why.

Are you taking into account the possibility that generated machine code
is a small percent slower out of mere bad luck?  I remember someone
suggesting that they can make code 2% faster or so by inserting random
no-op instructions in the binary, or something like that.  So if the
difference between v8 and v9 is that small, then it might be due to this
kind of effect.

I don't know what is a good technique to test this hypothesis.




Re: -Wformat-signedness

2020-11-09 Thread Thomas Munro
On Tue, Nov 10, 2020 at 4:25 AM Tom Lane  wrote:
> Peter Eisentraut  writes:
> > 1. enums are unsigned by default in gcc, so all those internal error
> > messages "unrecognized blah kind: %d" need to be changed to %u.
>
> Do we have reason to think that that is true in every C compiler?
> My own preference for this would be to leave the messages as-is
> and add explicit "(int)" casts to the arguments.  There are some
> fraction of these that are like that already.

From experimentation, it seems that GCC enumerator constants are int,
but enum variables are unsigned int or signed int depending on whether
any negative values were defined.  Valid values have to be representable
as int anyway regardless of what size and signedness a compiler
chooses to use, so yeah, +1 for casting to int.




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-11-09 Thread David Rowley
On Mon, 9 Nov 2020 at 16:29, Andy Fan  wrote:
> I think either version is OK for me and I like this patch overall.

That's good to know. Thanks.

>  However I believe v9
> should be no worse than v8 all the time,  Is there any theory to explain
> your result?

Nothing jumps out at me from looking at profiles.   The only thing I
noticed was the tuple deforming is more costly with v9. I'm not sure
why.

The other part of v9 that I don't have a good solution for yet is the
code around the swapping of the projection info for the Nested Loop.
The cache always uses a MinimalTupleSlot, but we may have a
VirtualSlot when we get a cache miss. If we do, then we need to
initialise 2 different projection infos so that when we project from the
cache we have the step to deform the minimal tuple. That step is
not required when the inner slot is a virtual slot.

I did some further testing on performance. Basically, I increased the
size of the tests by 2 orders of magnitude. Instead of 100k rows, I
used 10 million rows. (See attached
resultcache_master_vs_v8_vs_v9_big.csv)

Loading that in with:

# create table resultcache_bench2 (tbl text, target text, col text,
latency_master numeric(10,3), latency_v8 numeric(10,3), latency_v9
numeric(10,3));
# copy resultcache_bench2 from
'/path/to/resultcache_master_vs_v8_vs_v9_big.csv' with(format csv);

I see that v8 still wins.

postgres=# select round(avg(latency_v8/latency_master)*100,1) as
v8_vs_master, round(avg(latency_v9/latency_master)*100,1) as
v9_vs_master, round(avg(latency_v8/latency_v9)*100,1) as v8_vs_v9 from
resultcache_bench2;
 v8_vs_master | v9_vs_master | v8_vs_v9
--------------+--------------+----------
         56.7 |         58.8 |     97.3

Execution for all tests for v8 runs in 56.7% of master, but v9 runs in
58.8% of master's time. Full results in
resultcache_master_v8_vs_v9_big.txt.  v9 wins in 7 of 24 tests this
time. The best example test for v8 shows that v8 takes 90.6% of the
time of v9, but in the tests where v9 is faster, it only has a 4.3%
lead on v8 (95.7%). You can see that overall v8 is 2.7% faster than v9
for these tests.

David
select col,tbl,target, sum(latency_master) as master, sum(latency_v8) v8, 
sum(latency_v9) v9, round(avg(latency_v8/latency_master)*100,1) as 
v8_vs_master, round(avg(latency_v9/latency_master)*100,1) as v9_vs_master, 
round(avg(latency_v8/latency_v9)*100,1) as v8_vs_v9 from resultcache_bench2 
group by 1,2,3 order by 2,1,3;

   col    |    tbl    |   target   |   master   |     v8     |     v9     | v8_vs_master | v9_vs_master | v8_vs_v9
----------+-----------+------------+------------+------------+------------+--------------+--------------+----------
 hundred  | lookup1   | '*'        |   5750.325 |   4321.320 |   4548.600 |         75.1 |         79.1 |     95.0
 hundred  | lookup1   | count(*)   |   4495.967 |   3111.354 |   3301.798 |         69.2 |         73.4 |     94.2
 hundred  | lookup1   | count(l.a) |   4551.304 |   3231.036 |   3526.468 |         71.0 |         77.5 |     91.6
 one      | lookup1   | '*'        |   5724.242 |   4309.074 |   4479.229 |         75.3 |         78.3 |     96.2
 one      | lookup1   | count(*)   |   4467.683 |   2912.478 |   3214.049 |         65.2 |         71.9 |     90.6
 one      | lookup1   | count(l.a) |   4503.882 |   3145.432 |   3462.072 |         69.8 |         76.9 |     90.9
 ten      | lookup1   | '*'        |   5554.401 |   4128.838 |   4337.956 |         74.3 |         78.1 |     95.2
 ten      | lookup1   | count(*)   |   4377.590 |   2925.131 |   3159.214 |         66.8 |         72.2 |     92.6
 ten      | lookup1   | count(l.a) |       .535 |   3102.161 |   3382.259 |         69.8 |         76.1 |     91.7
 thousand | lookup1   | '*'        |   7869.671 |   4281.352 |   4492.678 |         54.4 |         57.1 |     95.3
 thousand | lookup1   | count(*)   |   6686.179 |   2950.660 |   3217.514 |         44.1 |         48.1 |     91.7
 thousand | lookup1   | count(l.a) |   6714.409 |   3152.067 |   3422.330 |         46.9 |         51.0 |     92.1
 hundred  | lookup100 | '*'        | 253973.453 | 153069.800 | 152742.719 |         60.3 |         60.1 |    100.2
 hundred  | lookup100 | count(*)   | 140918.236 |  51365.583 |  51559.118 |         36.5 |         36.6 |     99.6
 hundred  | lookup100 | count(l.a) | 143877.890 |  68809.635 |  65981.690 |         47.8 |         45.9 |    104.3
 one      | lookup100 | '*'        | 255409.805 | 154030.921 | 153737.083 |         60.3 |         60.2 |    100.2
 one      | lookup100 | count(*)   | 140432.641 |  51141.736 |  51354.441 |         36.4 |         36.6 |     99.6
 one      | lookup100 | count(l.a) | 143358.858 |  68678.546 |  65701.101 |         47.9 |         45.8 |    104.5
 ten      | lookup100 | '*'        | 231615.463 | 139033.321 | 139677.871 |         60.0 |         60.3 |     99.5
 ten      | lookup100 | count(*)   | 128410.287 |  46405.746 |  46593.886 |         36.1 |         36.3 |     99

Re: Error on failed COMMIT

2020-11-09 Thread Dave Cramer
On Wed, 30 Sep 2020 at 18:14, Andrew Dunstan 
wrote:

>
> On 8/4/20 12:19 PM, Dave Cramer wrote:
> > Attached is the rebased patch for consideration.
> >
> >
>
>
> It's a bit sad this has been hanging around so long without attention.
>
>
> The previous discussion seems to give the patch a clean bill of health
> for most/all of the native drivers. Are there any implications for libpq
> based drivers such as DBD::Pg and psycopg2? How about for ecpg?
>
>
> cheers
>
>
> andrew
>
>
> --
> Andrew Dunstan                https://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Attached is a rebased patch with fixes for the isolation tests


Dave Cramer
www.postgres.rocks


0001-Throw-error-and-rollback-on-a-failed-transaction-ins.patch
Description: Binary data


Re: Useless string ouput in error message

2020-11-09 Thread Fabien COELHO




I think I found a typo in the output of an error message which may cause a
build warning.
Please refer to the attachment for the detail.


Indeed. Thanks for the fix!

--
Fabien.




Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm

2020-11-09 Thread Tom Lane
Alexey Kondratov  writes:
> On 2020-11-09 21:53, Tom Lane wrote:
>> 0002 seems like a pretty clear bug fix, though I wonder if this is exactly
>> what we want to do going forward.  It seems like a very large fraction of
>> the callers of TimestampDifference would like to have the value in msec,
>> which means we're doing a whole lot of expensive and error-prone
>> arithmetic to break down the difference to sec/usec and then put it
>> back together again.  Let's get rid of that by inventing, say
>> TimestampDifferenceMilliseconds(...).

> Yeah, I get into this problem after a bug in another extension — 
> pg_wait_sampling. I have attached 0002, which implements 
> TimestampDifferenceMilliseconds(), so 0003 just uses this new function 
> to solve the initial issues. If it looks good to you, then we can switch 
> all similar callers to it.

Yeah, let's move forward with that --- in fact, I'm inclined to
back-patch it.  (Not till the current release cycle is done, though.
I don't find this important enough to justify a last-moment patch.)

BTW, I wonder if we shouldn't make TimestampDifferenceMilliseconds
round any fractional millisecond up rather than down.  Rounding down
seems to create a hazard of uselessly waking just before the delay is
completed.  Better to wake just after.
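
For illustration, a minimal sketch of what such a helper could look like,
assuming the usual int64 microsecond TimestampTz (the exact name and
rounding rule are of course up for discussion):

```
/*
 * Sketch only: difference between two timestamps in milliseconds,
 * clamped at zero and rounded up so that a caller waiting on a latch
 * doesn't wake just before its deadline.
 */
static long
TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time)
{
	TimestampTz diff = stop_time - start_time;

	if (diff <= 0)
		return 0;

	/* TimestampTz is in microseconds; round up to whole milliseconds. */
	return (long) ((diff + 999) / 1000);
}
```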

I still think your 0001 is fishy, but don't have time today to stare at
it more closely.

regards, tom lane




Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm

2020-11-09 Thread Alexey Kondratov

On 2020-11-09 21:53, Tom Lane wrote:

Alexey Kondratov  writes:
After fixing this issue I have noticed that it still dumps blocks twice
at each timeout (here I set autoprewarm_interval to 15s):
...
This happens because at timeout time we were using continue, but
actually we still have to wait the entire autoprewarm_interval after
successful dumping.


I don't think your 0001 is correct.  It would be okay if apw_dump_now()
could be counted on to take negligible time, but we shouldn't assume
that should we?



Yes, it seems so, if I understand you correctly. I had a doubt about the
possibility of pg_ctl exiting earlier than the dumping process. Now I have
added an explicit wait for the dump file to the test.



I agree that the "continue" seems a bit bogus, because it's skipping
the ResetLatch call at the bottom of the loop; it's not quite clear
to me whether that's a good thing or not.  But the general idea of
the existing code seems to be to loop around and make a fresh calculation
of how-long-to-wait, and that doesn't seem wrong.


I have left the last patch intact, since it resolves the 'double dump' 
issue, but I agree with your point about the existing logic of the code, 
although it is a bit broken. So I have to think more about how to fix it 
in a better way.


0002 seems like a pretty clear bug fix, though I wonder if this is exactly
what we want to do going forward.  It seems like a very large fraction of
the callers of TimestampDifference would like to have the value in msec,
which means we're doing a whole lot of expensive and error-prone
arithmetic to break down the difference to sec/usec and then put it
back together again.  Let's get rid of that by inventing, say
TimestampDifferenceMilliseconds(...).


Yeah, I get into this problem after a bug in another extension — 
pg_wait_sampling. I have attached 0002, which implements 
TimestampDifferenceMilliseconds(), so 0003 just uses this new function 
to solve the initial issues. If it looks good to you, then we can switch 
all similar callers to it.
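
Just to spell out what 0003 amounts to with the new helper, the wait
computation becomes roughly this (a sketch of the intent, not the exact
hunk):

```
/* Compute the next dump time and how long to sleep until then. */
next_dump_time = TimestampTzPlusMilliseconds(last_dump_time,
											 autoprewarm_interval * 1000);
delay_in_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
											  next_dump_time);
```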



BTW, I see another bug of a related ilk.  Look what
postgres_fdw/connection.c is doing:

TimestampDifference(now, endtime, &secs, &microsecs);

/* To protect against clock skew, limit sleep to one minute. */
cur_timeout = Min(60000, secs * USECS_PER_SEC + microsecs);

/* Sleep until there's something to do */
wc = WaitLatchOrSocket(MyLatch,
                       WL_LATCH_SET | WL_SOCKET_READABLE |
                       WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       PQsocket(conn),
                       cur_timeout, PG_WAIT_EXTENSION);

WaitLatchOrSocket's timeout is measured in msec not usec.  I think the
comment about "clock skew" is complete BS, and the Min() calculation 
was

put in as a workaround by somebody observing that the sleep waited too
long, but not understanding why.


I wonder how many troubles one can get with all these unit conversions.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From c79de17014753b311858b4570ca475f713328c62 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:24:55 +0300
Subject: [PATCH v2 4/4] pg_prewarm: refactor autoprewarm waits

Previously it was dumping twice at every timeout time.
---
 contrib/pg_prewarm/autoprewarm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index e5bd130bc8..872c7d51b1 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -236,7 +236,9 @@ autoprewarm_main(Datum main_arg)
 			{
 last_dump_time = GetCurrentTimestamp();
 apw_dump_now(true, false);
-continue;
+
+/* We have to sleep even after a successful dump */
+delay_in_ms = autoprewarm_interval * 1000;
 			}
 
 			/* Sleep until the next dump time. */
-- 
2.19.1

From c38c07708d57d6dec5a8a1697ca9c9810ad4d7ce Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:12:00 +0300
Subject: [PATCH v2 3/4] pg_prewarm: fix autoprewarm_interval behaviour.

Previously it misused seconds from TimestampDifference() as
milliseconds, so it was dumping autoprewarm.blocks ~every second
even with default autoprewarm_interval = 300s.
---
 contrib/pg_prewarm/autoprewarm.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index d3dec6e3ec..e5bd130bc8 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -222,16 +222,14 @@ autoprewarm_main(Datum main_arg)
 		{
 			long		delay_in_ms = 0;
 			TimestampTz next_dump_time = 0;
-			long		secs = 0;
-			int			usecs = 0;
 
 			/* Compute the next dump time. */
 			next_dump_time =
 TimestampTzPlusMilliseconds(last_dump_time,
 			autoprewarm_i

Re: public schema default ACL

2020-11-09 Thread Bruce Momjian
On Mon, Nov  2, 2020 at 01:41:09PM -0500, Stephen Frost wrote:
> At least from seeing the users that start out with PG and then come to
> the Slack or IRC channel asking questions, the on-boarding experience
> today typically consists of 'apt install postgresql' and then complaints
> that they aren't able to figure out how to log into PG (often asking
> about what the default password is to log in as 'postgres', or why the
> system is saying 'role "root" does not exist').  Once a user gets to the
> point of understanding or wanting to create other roles in the system,
> saying they need to create a schema for that role if they want it to be
> able to create objects (just like a user needing a home directory)
> doesn't seem likely to be all that unexpected.

It is a good point that the user has to create another user before this
becomes a usability issue.  It seems at the time the first user is
created (non-postgres), the admin needs to decide how the public schema
should behave.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: public schema default ACL

2020-11-09 Thread Bruce Momjian
On Mon, Nov  2, 2020 at 11:05:15PM -0800, Noah Misch wrote:
> On Mon, Nov 02, 2020 at 12:42:26PM -0500, Tom Lane wrote:
> > Robert Haas  writes:
> > > On Mon, Nov 2, 2020 at 5:51 AM Peter Eisentraut
> > >  wrote:
> > >> I'm not convinced, however, that this would would really move the needle
> > >> in terms of the general security-uneasiness about the public schema and
> > >> search paths.  AFAICT, in any of your proposals, the default would still
> > >> be to have the public schema world-writable and in the path.
> > 
> > > Noah's proposed change to initdb appears to involve removing CREATE
> > > permission by default, so I don't think this is true.
> > 
> > I assume that means removing *public* CREATE permissions, not the
> > owner's (which'd be the DB owner with the proposed changes).
> 
> My plan is for the default to become:
> 
>   GRANT USAGE ON SCHEMA public TO PUBLIC;
>   ALTER SCHEMA public OWNER TO DATABASE_OWNER;  -- new syntax

Seems it would be better to create a predefined role that owns the
public schema, or at least has create permission for the public schema
--- that way, when you are creating a role, you can decide if the role
should have creation permissions in the public schema, rather than
having people always using the database owner for this purpose.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Skip ExecCheckRTPerms in CTAS with no data

2020-11-09 Thread Anastasia Lubennikova

On 29.09.2020 14:39, Bharath Rupireddy wrote:

On Mon, Sep 28, 2020 at 7:48 PM Tom Lane  wrote:

Bharath Rupireddy  writes:

In case of CTAS with no data, we actually do not insert the tuples
into the created table, so we can skip checking for the insert
permissions. Anyways, the insert permissions will be checked when the
tuples are inserted into the table.

I'd argue this is wrong.  You don't get to skip permissions checks
in ordinary queries just because, say, there's a LIMIT 0 on the
query.


Right, when there's a select with limit 0 clause, we do check for the
select permissions. But my point is, we don't check insert
permissions(or select or update etc.) when we create a plain table
using CREATE TABLE test_tbl(a1 INT). Of course, we do check create
permissions. Going by the similar point, shouldn't we also check only
create permission(which is already being done as part of
DefineRelation) and skip the insert permission(the change this patch
does) for the new table being created as part of CREATE TABLE test_tbl
AS SELECT * FROM test_tbl2? However select permission will be checked
for test_tbl2. The insert permissions will be checked anyways before
inserting rows into the table created in CTAS.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

I see Tom's objection above. Still, I tend to agree that if 'WITH NO 
DATA' was specified explicitly, CREATE AS should behave more like a 
utility statement rather than a regular query. So I think that this 
patch can be useful in some use-cases and I definitely don't see any 
harm it could cause. Even the comment in the current code suggests that 
it is an option.


I took a look at the patch. It is quite simple, so no comments about the 
code. It would be good to add a test to select_into.sql to show that it 
only applies to 'WITH NO DATA' and that subsequent insertions will fail 
if permissions are not set.


Maybe we should also mention it in the documentation, but I haven't found any 
specific paragraph about permissions on CTAS.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Disable WAL logging to speed up data loading

2020-11-09 Thread Stephen Frost
Greetings,

* David G. Johnston (david.g.johns...@gmail.com) wrote:
> On Mon, Nov 9, 2020 at 10:36 AM Stephen Frost  wrote:
> > * David G. Johnston (david.g.johns...@gmail.com) wrote:
> > > If the commit doesn't complete all of the newly created pages are junk.
> > > Otherwise, you have a crash-recoverable state for those tables as regards
> > > those specific pages.
> >
> > How would we track that and know which pages are junk?
> 
> Every one of those pages would have a single dead transaction id contained
> within it.  If there is bookkeeping that needs to happen that could be wal
> logged - the goal here being not to avoid all wal but to avoid data wal.  I
> don't know enough about the internals here to be more specific.

Yeah, not sure how well that would end up working..  Maybe there's a way
to get there, but definitely a fair bit of complicated work.

> > > Conceptually, we need an ability to perform a partial CHECKPOINT that
> > names
> > > specific tables, and make sure the crash-recovery works for those tables
> > > while figuring out what amount of effort to expend on informing the dba
> > and
> > > alerting/preventing features that require wal from using those tables.
> >
> > Yeah, seems pretty complicated.
> >
> > Did you see an issue with the basic idea I proposed earlier, whereby an
> > unlogged table could become 'logged', while we are at wal_level=minimal,
> > by essentially checkpointing it (locking it, forcing out any buffers we
> > have associated with it, and then fsync'ing it- not sure how much of
> > that is already done in the unlogged->logged process but I would guess
> > most of it) while not actually writing it into the WAL?
> 
> That is basically half of what is described above - the part at commit when
> the relation is persisted to disk.

Right, think I agree with you there.

> What your earlier description seems to
> be missing is the part about temporarily making a logged relation
> unlogged.

While I get that there may be use-cases for that, it seems that the
use-case being described here doesn't require it- there just needs to be
a way to take the unlogged table and make it into a 'logged' table
without having to actually write it all into the WAL when wal_level is
minimal.

There may be another option to addressing the use-case as you're looking
at it though- by using partitioning.  That is, there's a partitioned
table which has logged tables in it, a new unlogged table is created and
added to the partitioned table, it's then loaded, and then it's
converted to being logged (or maybe it's loaded first and then added to
the partitioned table, either way).  This would also have the advantage
that you'd be able to continue making changes to the partitioned table
as you normally do, in general.

> I envision that as being part of a transaction as opposed to a
> permanent attribute of the table.  I envision a storage parameter that
> allows individual relations to be considered as having wal_level='minimal'
> even if the system as a whole has, e.g., wal_level='replication'.  Only
> those could be forced into this temporarily unlogged mode.

I mean ... we have a way of saying that individual relations have a
lower WAL level than others- they're UNLOGGED.  I'm still seeing this as
an opportunity to build on that and improve that, rather than invent
something new.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: WIP: BRIN multi-range indexes

2020-11-09 Thread John Naylor
On Mon, Nov 9, 2020 at 1:39 PM Tomas Vondra 
wrote:
>
>
> While investigating the failures, I've tried increasing the values a
> lot, without observing any measurable increase in runtime. IIRC I've
> even used (10 * target_partlen) or something like that. That tells me
> it's not very sensitive part of the code, so I'd suggest to simply use
> something that we know is large enough to be safe.

Okay, then it's not worth being clever.

> For example, the largest bloom filter we can have is 32kB, i.e. 262kb,
> at which point the largest gap is less than 95 (per the gap table). And
> we may use up to BLOOM_MAX_NUM_PARTITIONS, so let's just use
> BLOOM_MAX_NUM_PARTITIONS * 100

Sure.

> FWIW I wonder if we should do something about bloom filters that we know
> can get larger than page size. In the example I used, we know that
> nbits=575104 is larger than page, so as the filter gets more full (and
> thus more random and less compressible) it won't possibly fit. Maybe we
> should reject that right away, instead of "delaying it" until later, on
> the basis that it's easier to fix at CREATE INDEX time (compared to when
> inserts/updates start failing at a random time).

Yeah, I'd be inclined to reject that right away.

> The problem with this is of course that if the index is multi-column,
> this may not be strict enough (i.e. each filter would fit independently,
> but the whole index row is too large). But it's probably better to do at
> least something, and maybe improve that later with some whole-row check.

A whole-row check would be nice, but I don't know how hard that would be.

As a Devil's advocate proposal, how awful would it be to not allow
multicolumn brin-bloom indexes?

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm

2020-11-09 Thread Tom Lane
Alexey Kondratov  writes:
> After fixing this issue I have noticed that it still dumps blocks twice 
> at each timeout (here I set autoprewarm_interval to 15s):
> ...
> This happens because at timeout time we were using continue, but 
> actually we still have to wait the entire autoprewarm_interval after 
> successful dumping.

I don't think your 0001 is correct.  It would be okay if apw_dump_now()
could be counted on to take negligible time, but we shouldn't assume
that should we?

I agree that the "continue" seems a bit bogus, because it's skipping
the ResetLatch call at the bottom of the loop; it's not quite clear
to me whether that's a good thing or not.  But the general idea of
the existing code seems to be to loop around and make a fresh calculation
of how-long-to-wait, and that doesn't seem wrong.

0002 seems like a pretty clear bug fix, though I wonder if this is exactly
what we want to do going forward.  It seems like a very large fraction of
the callers of TimestampDifference would like to have the value in msec,
which means we're doing a whole lot of expensive and error-prone
arithmetic to break down the difference to sec/usec and then put it
back together again.  Let's get rid of that by inventing, say
TimestampDifferenceMilliseconds(...).

BTW, I see another bug of a related ilk.  Look what
postgres_fdw/connection.c is doing:

TimestampDifference(now, endtime, &secs, &microsecs);

/* To protect against clock skew, limit sleep to one minute. */
cur_timeout = Min(60000, secs * USECS_PER_SEC + microsecs);

/* Sleep until there's something to do */
wc = WaitLatchOrSocket(MyLatch,
   WL_LATCH_SET | WL_SOCKET_READABLE |
   WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
   PQsocket(conn),
   cur_timeout, PG_WAIT_EXTENSION);

WaitLatchOrSocket's timeout is measured in msec not usec.  I think the
comment about "clock skew" is complete BS, and the Min() calculation was
put in as a workaround by somebody observing that the sleep waited too
long, but not understanding why.
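
Once a millisecond-returning helper exists, the fix here presumably reduces
to something like the following (an untested sketch, keeping the one-minute
cap but expressing it in the unit the latch actually wants):

```
/* Limit the sleep to one minute, in milliseconds this time. */
cur_timeout = Min(60000, TimestampDifferenceMilliseconds(now, endtime));

/* Sleep until there's something to do */
wc = WaitLatchOrSocket(MyLatch,
					   WL_LATCH_SET | WL_SOCKET_READABLE |
					   WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   PQsocket(conn),
					   cur_timeout, PG_WAIT_EXTENSION);
```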

regards, tom lane




Re: PATCH: Attempt to make dbsize a bit more consistent

2020-11-09 Thread Soumyadeep Chakraborty
Hey Georgios,

Thanks for looking for more avenues to invoke tableAM APIS! Please find
my review below:

On Tue, Oct 13, 2020 at 6:28 AM  wrote:

1.

>  /*
> - * heap size, including FSM and VM
> + * table size, including FSM and VM
>  */

We should not mention FSM and VM in dbsize.c at all as these are
heap AM specific. We can say:
table size, excluding TOAST relation

2.

>  /*
>  * Size of toast relation
>  */
>  if (OidIsValid(rel->rd_rel->reltoastrelid))
> - size += calculate_toast_table_size(rel->rd_rel->reltoastrelid);
> + {
> + Relation toastRel;
> +
> + toastRel = relation_open(rel->rd_rel->reltoastrelid, AccessShareLock);

We can replace the OidIsValid check with relation_needs_toast_table()
and then have the OidIsValid() check in an Assert. Perhaps, that will
make things more readable.

3.

> + if (rel->rd_rel->relkind == RELKIND_RELATION ||
> + rel->rd_rel->relkind == RELKIND_TOASTVALUE ||
> + rel->rd_rel->relkind == RELKIND_MATVIEW)
> + size = calculate_table_size(rel);
> + else
> + {
> + relation_close(rel, AccessShareLock);
> + PG_RETURN_NULL();
> + }

This leads to behavioral changes:

I am talking about the case where one calls pg_table_size() on an index.
W/o your patch, it returns the size of the index. W/ your patch it
returns NULL. Judging by the documentation, this function should not
ideally apply to indexes but it does! I have a sinking feeling that lots
of users use it for this purpose, as there is no function to calculate
the size of a single specific index (except pg_relation_size()).
The same argument I have made above applies to sequences. Users may have
trial-and-errored their way into finding that pg_table_size() can tell
them the size of a specific index/sequence! I don't know how widespread
the use is in the user community, so IMO maybe we should be conservative
and not introduce this change? Alternatively, we could call out that
pg_table_size() is only for tables by throwing an error if anything
other than a table is passed in.

If we decide to preserve the existing behavior of the pg_table_size():
It seems that for things not backed by the tableAM (indexes and
sequences), they should still go through calculate_relation_size().
We can call table_relation_size() based on if relkind is
RELKIND_RELATION, RELKIND_TOASTVALUE or RELKIND_MATVIEW. Perhaps it
might be worthwhile making a new macro RELKIND_HAS_TABLE_STORAGE to
capture these three cases (See RELKIND_HAS_STORAGE). This also ensures
that we return 0 for things that don't qualify as RELKIND_HAS_STORAGE,
such as a partitioned table (Currently w/ the patch applied, we return
NULL for those cases, which is another behavior change)
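
As a rough sketch, the macro alluded to above could look something like this
(hypothetical, mirroring the existing RELKIND_HAS_STORAGE macro; the name
and exact membership are up for debate):

```
/*
 * Relkinds whose storage is managed by the table AM and should therefore
 * be sized via table_relation_size().
 */
#define RELKIND_HAS_TABLE_STORAGE(relkind) \
	((relkind) == RELKIND_RELATION || \
	 (relkind) == RELKIND_TOASTVALUE || \
	 (relkind) == RELKIND_MATVIEW)
```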

4.

> @@ -3776,10 +3776,24 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
>    gettext_noop("Access Method"));
>
>  /*
> + * As of PostgreSQL 14, do not use pg_table_size() for indexes and
> + * sequences as it does not behave sanely for those.
> + *
>   * As of PostgreSQL 9.0, use pg_table_size() to show a more accurate
>   * size of a table, including FSM, VM and TOAST tables.
>   */
> -if (pset.sversion >= 90000)
> +if (pset.sversion >= 140000)
> +appendPQExpBuffer(&buf,
> +  ",\n CASE"
> +  " WHEN c.relkind in ("CppAsString2(RELKIND_INDEX)", "
> +  CppAsString2(RELKIND_PARTITIONED_INDEX)", "
> +  CppAsString2(RELKIND_SEQUENCE)") THEN"
> +  " pg_catalog.pg_size_pretty(pg_catalog.pg_relation_size(c.oid))"
> +  " ELSE"
> +  " pg_catalog.pg_size_pretty(pg_catalog.pg_table_size(c.oid))"
> +  " END as \"%s\"",
> +  gettext_noop("Size"));
> +else if (pset.sversion >= 90000)
>  appendPQExpBuffer(&buf,
>    ",\n  pg_catalog.pg_size_pretty(pg_catalog.pg_table_size(c.oid)) as \"%s\"",
>    gettext_noop("Size"));

Following on from point 3, if we decide to preserve the existing behavior,
we wouldn't need this change, as it would be internalized within
pg_table_size().


5.
> >
> > -   return size;
> >
> > -   Assert(size < PG_INT64_MAX);
> >
> > -
> > -   return (int64)size;
> > I assume that this change, and the other ones like that, aim to handle 
> > int64
> > overflow? Using the extra legroom of uint64 can still lead to an 
> > overflow,
> > however theoretical it may be. Wouldn't it be better to check for 
> > overflow
> > before adding to size rather than after the fact? Something along the 
> > lines of
> > checking for headroom left:
> >
> > rel_size = table_relation_size(..);
> > if (rel_size > (PG_INT64_MAX - total_size))
> > 

Re: RE: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

2020-11-09 Thread Peter Geoghegan
On Mon, Nov 2, 2020 at 1:04 PM Peter Geoghegan  wrote:
> if Andres cannot spend any time on this in the foreseeable future then
> I'll withdraw the patch. I intend to formally withdraw the patch on
> November 9th, provided no new information comes to light.

I have now formally withdrawn the patch in the CF app.

Thanks
-- 
Peter Geoghegan




Re: Disable WAL logging to speed up data loading

2020-11-09 Thread David G. Johnston
On Mon, Nov 9, 2020 at 10:36 AM Stephen Frost  wrote:

> * David G. Johnston (david.g.johns...@gmail.com) wrote:
>
> > If the commit doesn't complete all of the newly created pages are junk.
> > Otherwise, you have a crash-recoverable state for those tables as regards
> > those specific pages.
>
> How would we track that and know which pages are junk?
>

Every one of those pages would have a single dead transaction id contained
within it.  If there is bookkeeping that needs to happen that could be wal
logged - the goal here being not to avoid all wal but to avoid data wal.  I
don't know enough about the internals here to be more specific.


> > Conceptually, we need an ability to perform a partial CHECKPOINT that
> names
> > specific tables, and make sure the crash-recovery works for those tables
> > while figuring out what amount of effort to expend on informing the dba
> and
> > alerting/preventing features that require wal from using those tables.
>
> Yeah, seems pretty complicated.
>
> Did you see an issue with the basic idea I proposed earlier, whereby an
> unlogged table could become 'logged', while we are at wal_level=minimal,
> by essentially checkpointing it (locking it, forcing out any buffers we
> have associated with it, and then fsync'ing it- not sure how much of
> that is already done in the unlogged->logged process but I would guess
> most of it) while not actually writing it into the WAL?
>

That is basically half of what is described above - the part at commit when
the relation is persisted to disk.  What your earlier description seems to
be missing is the part about temporarily making a logged relation
unlogged.  I envision that as being part of a transaction as opposed to a
permanent attribute of the table.  I envision a storage parameter that
allows individual relations to be considered as having wal_level='minimal'
even if the system as a whole has, e.g., wal_level='replication'.  Only
those could be forced into this temporarily unlogged mode.

One part I hadn't given thought to is indexes and how those would interact
with this plan. Mostly due to lack of internals knowledge.

David J.


Re: automatic analyze: readahead - add "IO read time" log message

2020-11-09 Thread Tomas Vondra
On 11/9/20 7:06 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 11/4/20 5:02 PM, Stephen Frost wrote:
>>> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> If you highlight "738754560" in the output it appears to duplicate the
> syscalls issued until it preads() - in case of "738754560" offset it was
> asked for 3 times. Also I wouldn't  imagine in wildest dreams that
> posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.

 IMHO that'a a bug in the patch, which always tries to prefetch all
 "future" blocks, including those that were already prefetched. It
 probably needs to do something like bitmap heap scan where we track
 what was already prefetched and only issue the new blocks.
>>>
>>> Updated patch attached which:
>>>
>>> - Starts out by pre-fetching the first effective_io_concurrency number
>>>   of blocks we are going to want, hopefully making it so the kernel will
>>>   trust our fadvise's over its own read-ahead, right from the start.
>>> - Makes sure the prefetch iterator is pushed forward whenever the
>>>   regular interator is moved forward.
>>> - After each page read, issues a prefetch, similar to BitmapHeapScan, to
>>>   hopefully avoiding having the prefetching get in the way of the
>>>   regular i/o.
>>> - Added some comments, ran pgindent, added a commit message.
>>
>> Nice, that was quick ;-)
> 
> :)
> 
>>> I do think we should also include patch that Jakub wrote previously
>>> which adds information about the read rate of ANALYZE.
>>
>> +1
> 
> Attached is an updated patch which updates the documentation and
> integrates Jakub's initial work on improving the logging around
> auto-analyze (and I made the logging in auto-vacuum more-or-less match
> it).
> 

Thanks. I'll do some testing/benchmarking once my machines are free, in
a couple days perhaps. But as I said before, I don't expect this to
behave very differently from other places that already do prefetching.

>>> I'll look at integrating that into this patch and then look at a new
>>> patch to do something similar for VACUUM in a bit.
>>
>> +1
> 
> I spent some time looking into this but it's a bit complicated..  For
> some sound reasons, VACUUM will avoid skipping through a table when
> there's only a few pages that it could skip (not skipping allows us to
> move forward the relfrozenxid).  That said, perhaps we could start doing
> prefetching once we've decided that we're skipping.  We'd need to think
> about if we have to worry about the VM changing between the pre-fetching
> and the time when we're actually going to ask for the page..  I don't
> *think* that's an issue because only VACUUM would be changing the pages
> to be all-frozen or all-visible, and so if we see a page that isn't one
> of those then we're going to want to visit that page and that's not
> going to change, and we probably don't need to care about a page that
> used to be all-frozen and now isn't during this run- but if the prefetch
> went ahead and got page 10, and now page 8 is not-all-frozen and the
> actual scan is at page 5, then maybe it wants page 8 next and that isn't
> what we pre-fetched...
> 
> Anyhow, all-in-all, definitely more complicated and probably best
> considered and discussed independently.

+1

FWIW I wonder if this should be tracked separately in the CF app, as
it's very different from the original "add some logging" patch, which
makes the CF entry rather misleading.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: automatic analyze: readahead - add "IO read time" log message

2020-11-09 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 11/4/20 5:02 PM, Stephen Frost wrote:
> >* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >>>If you highlight "738754560" in the output it appears to duplicate the
> >>>syscalls issued until it preads() - in case of "738754560" offset it was
> >>>asked for 3 times. Also I wouldn't  imagine in wildest dreams that
> >>>posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
> >>
> >>IMHO that'a a bug in the patch, which always tries to prefetch all
> >>"future" blocks, including those that were already prefetched. It
> >>probably needs to do something like bitmap heap scan where we track
> >>what was already prefetched and only issue the new blocks.
> >
> >Updated patch attached which:
> >
> >- Starts out by pre-fetching the first effective_io_concurrency number
> >   of blocks we are going to want, hopefully making it so the kernel will
> >   trust our fadvise's over its own read-ahead, right from the start.
> >- Makes sure the prefetch iterator is pushed forward whenever the
> >   regular interator is moved forward.
> >- After each page read, issues a prefetch, similar to BitmapHeapScan, to
> >   hopefully avoiding having the prefetching get in the way of the
> >   regular i/o.
> >- Added some comments, ran pgindent, added a commit message.
> 
> Nice, that was quick ;-)

:)

> >I do think we should also include patch that Jakub wrote previously
> >which adds information about the read rate of ANALYZE.
> 
> +1

Attached is an updated patch which updates the documentation and
integrates Jakub's initial work on improving the logging around
auto-analyze (and I made the logging in auto-vacuum more-or-less match
it).

> >I'll look at integrating that into this patch and then look at a new
> >patch to do something similar for VACUUM in a bit.
> 
> +1

I spent some time looking into this but it's a bit complicated..  For
some sound reasons, VACUUM will avoid skipping through a table when
there's only a few pages that it could skip (not skipping allows us to
move forward the relfrozenxid).  That said, perhaps we could start doing
prefetching once we've decided that we're skipping.  We'd need to think
about if we have to worry about the VM changing between the pre-fetching
and the time when we're actually going to ask for the page..  I don't
*think* that's an issue because only VACUUM would be changing the pages
to be all-frozen or all-visible, and so if we see a page that isn't one
of those then we're going to want to visit that page and that's not
going to change, and we probably don't need to care about a page that
used to be all-frozen and now isn't during this run- but if the prefetch
went ahead and got page 10, and now page 8 is not-all-frozen and the
actual scan is at page 5, then maybe it wants page 8 next and that isn't
what we pre-fetched...

Anyhow, all-in-all, definitely more complicated and probably best
considered and discussed independently.

> >If you're doing further benchmarking of ANALYZE though, this would
> >probably be the better patch to use.  Certainly improved performance
> >here quite a bit with effective_io_concurrency set to 16.
> 
> Yeah. I'd expect this to be heavily dependent on hardware.

Sure, I agree with that too.

Thanks,

Stephen
From 59ace10ce767b08d3a175322bcabed2fc9010054 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 4 Nov 2020 10:46:23 -0500
Subject: [PATCH] Use pre-fetching for ANALYZE and improve AV logging

When we have posix_fadvise() available, we can improve the performance
of an ANALYZE by quite a bit by using it to inform the kernel of the
blocks that we're going to be asking for.  Similar to bitmap index
scans, the number of buffers pre-fetched is based off of the GUC
effective_io_concurrency.

Also, arrange to provide similar details for auto-analyze that we have
for auto-vacuum, about the read rate, dirty rate, and add I/O timing
(when track_io_timing is enabled) to both.

Stephen Frost and Jakub Wartak

Discussion: https://www.postgresql.org/message-id/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 doc/src/sgml/config.sgml |   8 +-
 src/backend/access/heap/vacuumlazy.c |  18 +++
 src/backend/commands/analyze.c   | 157 +--
 3 files changed, 172 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e31..345a0305c6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7358,9 +7358,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 I/O timing information is
 displayed in 
 pg_stat_database, in the output of
- when the BUFFERS option is
-used, and by .  Only superusers can
-change this setting.
+ when the BUFFERS option
+is used, by autovacuum for auto-vacuums and auto-analyzes, when
+ is set and by
+.  Only superusers can change 

Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm

2020-11-09 Thread Jeevan Ladhe
Hi,

On Mon, Nov 9, 2020 at 10:15 PM Alexey Kondratov 
wrote:

> Hi Hackers,
>
> Today I have accidentally noticed that autoprewarm feature of pg_prewarm
> used TimestampDifference()'s results in a wrong way.
>
> First, it used *seconds* result from it as a *milliseconds*. It was
> causing it to make dump file autoprewarm.blocks ~every second with
> default setting of autoprewarm_interval = 300s.
>
>
I had a look at this, and I agree that this is an issue. I also had a look
at
the patch 0002, and the patch looks good to me.

In patch 0003 there is a typo:

+ /* We have to sleep even after a successfull dump */

s/successfull/successful


Regards,
Jeevan Ladhe


Re: WIP: BRIN multi-range indexes

2020-11-09 Thread Tomas Vondra



On 11/9/20 3:29 PM, John Naylor wrote:
> On Sat, Nov 7, 2020 at 4:38 PM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
> 
>> Overall, I think there's very little difference, particularly in the
>> "match" cases when we're searching for a value that we know is in the
>> table. The one-hash variant seems to perform a bit better, but the
>> difference is fairly small.
>>
>> In the "mismatch" cases (searching for value that is not in the table)
>> the differences are more significant, but it might be noise. It does
>> seem much more "green" than "red", i.e. the one-hash variant seems to be
>> faster (although this does depend on the values for formatting).
>>
>> To sum this up, I think the one-hash approach seems interesting. It's
>> not going to give us huge speedups because we're only hashing int32
>> values anyway (not the source data), but it's worth exploring.
> 
> Thanks for testing! It seems you tested against the version with two
> moduli, and not the alternative discussed in
> 
> https://www.postgresql.org/message-id/20200918222702.omsieaphfj3ctqg3%40development
> 
> 
> which would in fact be rehashing the 32 bit values. I think that would
> be the way to go if we don't use the one-hashing approach.
> 

Yeah. I forgot about this detail, and I may try again with the two-hash
variant, but I wonder how much difference it would make, considering the
results match the expected results (that is, the "scan fraction" results
for fill_factor=100 match the target fpr almost perfectly).

I think there's a possibly-more important omission in the testing - I
forgot about the "sort mode" used initially, when the filter keeps the
actual hash values and only switches to hashing later. I wonder if that
plays role for some of the cases.

I'll investigate this a bit in the next round of tests.


>> a) set_bloom_partitions does this:
>>
>>     while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
>>        pidx++;
>>
>> which is broken, because the second part of the condition only checks
>> the current index - we may end up using nhashes primes after that, and
>> some of them may be 0. So this needs to be:
>>
>>     while (primes[pidx + nhashes - 1] <= target &&
>>            primes[pidx + nhashes] > 0)
>>        pidx++;
> 
> Good catch.
> 
>> b) set_bloom_partitions does this to generate primes:
>>
>>     /*
>>      * Increase the limit to ensure we have some primes higher than
>>      * the target partition length. The 100 value is arbitrary, but
>>      * should be well over what we need.
>>      */
>>     primes = generate_primes(target_partlen + 100);
>>
>> It's not clear to me why 100 is sufficient, particularly for large page
>> sizes. AFAIK the primes get more and more sparse, so how does this
>> guarantee we'll get enough "sufficiently large" primes?
> 
> This value is not rigorous and should be improved, but I started with
> that by looking at the table in section 3 in
> 
> https://primes.utm.edu/notes/gaps.html
> 
> 
> I see two ways to make a stronger guarantee:
> 
> 1. Take the average gap between primes near n, which is log(n), and
> multiply that by BLOOM_MAX_NUM_PARTITIONS. Adding that to the target
> seems a better heuristic than a constant, and is still simple to calculate.
> 
> With the pathological example you gave of n=575104, k=3 (target_partlen
> = 191701), the number to add is log(191701) * 10 = 122.  By the table
> referenced above, the largest prime gap under 360653 is 95, so we're
> guaranteed to find at least one prime in the space of 122 above the
> target. That will likely be enough to find the closest-to-target filter
> size for k=3. Even if it weren't, nbits is so large that the relative
> difference is tiny. I'd say a heuristic like this is most likely to be
> off precisely when it matters the least. At this size, even if we find
> zero primes above our target, the relative filter size is close to 
> 
> (575104 - 3 * 95) / 575104 = 0.9995
> 
> For a more realistic bad-case target partition length, log(1327) * 10 =
> 72. There are 33 composites after 1327, the largest such gap below 9551.
> That would give five primes larger than the target
> 1361   1367   1373   1381   1399
> 
> which is more than enough for k<=10:
> 
> 1297 +  1301  + 1303  + 1307  + 1319  + 1321  + 1327  + 1361 + 1367 +
> 1373 = 13276
> 
> 2. Use a "segmented range" algorithm for the sieve and iterate until we
> get k*2 primes, half below and half above the target. This would be an
> absolute guarantee, but also more code, so I'm inclined against that.
> 

Thanks, that makes sense.

While investigating the failures, I've tried increasing the values a
lot, without observing any measurable increase in runtime. IIRC I've
even used (10 * target_partlen) or something like that. That tells me
it's not very sensitive part of the code, so I'd suggest to simply use
something that we know is large enough to be safe.

Re: Disable WAL logging to speed up data loading

2020-11-09 Thread Stephen Frost
Greetings,

* David G. Johnston (david.g.johns...@gmail.com) wrote:
> On Mon, Nov 9, 2020 at 8:18 AM Stephen Frost  wrote:
> > Presently, my feeling is that we could address this use-case without
> > having to introduce a new cluster-wide WAL level, and that's the
> > direction I'd want to see this going.  Perhaps I'm missing something
> > about why the approach I've set forth above wouldn't work, and
> > wal_level=none would, but I've not seen it yet.
>
> +1
> 
> We are trying to address a performance optimization for an insert-only
> scenario on a limited set of tables by placing the entire cluster in a
> dangerous state.  The "copy table unlogged" solution is definitely closer
> to what we want - this is
> demonstrably worse.

Yeah, agreed.

> For this case the fundamental feature that would seem to be required is an
> ability for a transaction commit to return only after the system has
> ensured that all of the new pages added to the relation during the scope of
> the transaction have made it to disk.  Something like:
> 
> BEGIN UNLOGGED TRANSACTION FOR table1, table2;
> -- locking probably allows reads, definitely disallows concurrent writes,
> to the named tables
> -- Disallow updates and deletes, do not use dead tuple space, for the
> tables named.  Should be able to do normal stuff for other tables?
> -- Always create new pages
> COPY TO table1;
> COPY TO table2;
> COMMIT; -- wait here until data files for table1 and table2 are completely
> written and the transaction alive flag is committed to the WAL.

That's certainly an interesting idea, but seems like a much larger step
than just making some improvements to how UNLOGGED tables work today,
and then perhaps some helper options to make it easier to create
UNLOGGED tables and change them from unlogged to logged when the
wal_level is set to 'minimal'.

Also- I don't think this would end up working for normally logged
relations at a wal_level higher than 'minimal', since if we don't
log those pages then they won't get to replicas.

> I suppose the above could be written "BEGIN UNLOGGED TRANSACTION FOR ALL
> TABLES" and you'd get the initial database population optimization
> capability.

Or just 'BEGIN UNLOGGED TRANSACTION'..  I wonder if we'd have to run
around and lock all tables as you're suggesting above or if we could
just lock them as they get used..

> If the commit doesn't complete all of the newly created pages are junk.
> Otherwise, you have a crash-recoverable state for those tables as regards
> those specific pages.

How would we track that and know which pages are junk?

> Conceptually, we need an ability to perform a partial CHECKPOINT that names
> specific tables, and make sure the crash-recovery works for those tables
> while figuring out what amount of effort to expend on informing the dba and
> alerting/preventing features that require wal from using those tables.

Yeah, seems pretty complicated.

Did you see an issue with the basic idea I proposed earlier, whereby an
unlogged table could become 'logged', while we are at wal_level=minimal,
by essentially checkpointing it (locking it, forcing out any buffers we
have associated with it, and then fsync'ing it- not sure how much of
that is already done in the unlogged->logged process but I would guess
most of it) while not actually writing it into the WAL?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Disable WAL logging to speed up data loading

2020-11-09 Thread David G. Johnston
On Mon, Nov 9, 2020 at 8:18 AM Stephen Frost  wrote:

> Presently, my feeling is that we could address this use-case without
> having to introduce a new cluster-wide WAL level, and that's the
> direction I'd want to see this going.  Perhaps I'm missing something
> about why the approach I've set forth above wouldn't work, and
> wal_level=none would, but I've not seen it yet.
>
>
+1

We are trying to address a performance optimization for an insert-only
scenario on a limited set of tables by placing the entire cluster in a
dangerous state.  The "copy table unlogged" solution is definitely closer
to what we want - this is
demonstrably worse.

For this case the fundamental feature that would seem to be required is an
ability for a transaction commit to return only after the system has
ensured that all of the new pages added to the relation during the scope of
the transaction have made it to disk.  Something like:

BEGIN UNLOGGED TRANSACTION FOR table1, table2;
-- locking probably allows reads, definitely disallows concurrent writes,
to the named tables
-- Disallow updates and deletes, do not use dead tuple space, for the
tables named.  Should be able to do normal stuff for other tables?
-- Always create new pages
COPY TO table1;
COPY TO table2;
COMMIT; -- wait here until data files for table1 and table2 are completely
written and the transaction alive flag is committed to the WAL.

I suppose the above could be written "BEGIN UNLOGGED TRANSACTION FOR ALL
TABLES" and you'd get the initial database population optimization
capability.

If the commit doesn't complete all of the newly created pages are junk.
Otherwise, you have a crash-recoverable state for those tables as regards
those specific pages.

Conceptually, we need an ability to perform a partial CHECKPOINT that names
specific tables, and make sure the crash-recovery works for those tables
while figuring out what amount of effort to expend on informing the dba and
alerting/preventing features that require wal from using those tables.

David J.


Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm

2020-11-09 Thread Alexey Kondratov

Hi Hackers,

Today I have accidentally noticed that autoprewarm feature of pg_prewarm 
used TimestampDifference()'s results in a wrong way.


First, it used *seconds* result from it as a *milliseconds*. It was 
causing it to make dump file autoprewarm.blocks ~every second with 
default setting of autoprewarm_interval = 300s.


Here is a log part with debug output in this case:

```
2020-11-09 19:09:00.162 MSK [85328] LOG:  dumping autoprewarm.blocks
2020-11-09 19:09:01.161 MSK [85328] LOG:  dumping autoprewarm.blocks
2020-11-09 19:09:02.160 MSK [85328] LOG:  dumping autoprewarm.blocks
2020-11-09 19:09:03.159 MSK [85328] LOG:  dumping autoprewarm.blocks
```

After fixing this issue I have noticed that it still dumps blocks twice 
at each timeout (here I set autoprewarm_interval to 15s):


```
2020-11-09 19:18:59.692 MSK [85662] LOG:  dumping autoprewarm.blocks
2020-11-09 19:18:59.700 MSK [85662] LOG:  dumping autoprewarm.blocks

2020-11-09 19:19:14.694 MSK [85662] LOG:  dumping autoprewarm.blocks
2020-11-09 19:19:14.704 MSK [85662] LOG:  dumping autoprewarm.blocks
```

This happens because at timeout time we were using continue, but 
actually we still have to wait the entire autoprewarm_interval after 
successful dumping.


I have fixed both issues in the attached patches and also added a 
minimalistic tap test as a first one to verify that this automatic 
dumping still works after refactoring. I put Robert in CC, since he is 
the author of this feature.


What do you think?


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From 6d4bab7f21c3661dd4dd5a0de7e097b1de3f642c Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:24:55 +0300
Subject: [PATCH v1 3/3] pg_prewarm: refactor autoprewarm waits

Previously it was dumping twice at every timeout time.
---
 contrib/pg_prewarm/autoprewarm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b18a065ed5..f52c83de1e 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -238,7 +238,9 @@ autoprewarm_main(Datum main_arg)
 			{
 last_dump_time = GetCurrentTimestamp();
 apw_dump_now(true, false);
-continue;
+
+/* We have to sleep even after a successfull dump */
+delay_in_ms = autoprewarm_interval * 1000;
 			}
 
 			/* Sleep until the next dump time. */
-- 
2.19.1

From 8793b8beb6a5c1ae730f1fffb09dff64c83bc631 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:12:00 +0300
Subject: [PATCH v1 2/3] pg_prewarm: fix autoprewarm_interval behaviour.

Previously it misused seconds from TimestampDifference() as
milliseconds, so it was dumping autoprewarm.blocks ~every second
even with default autoprewarm_interval = 300s.
---
 contrib/pg_prewarm/autoprewarm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index d3dec6e3ec..b18a065ed5 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -231,7 +231,7 @@ autoprewarm_main(Datum main_arg)
 			autoprewarm_interval * 1000);
 			TimestampDifference(GetCurrentTimestamp(), next_dump_time,
 &secs, &usecs);
-			delay_in_ms = secs + (usecs / 1000);
+			delay_in_ms = secs * 1000 + (usecs / 1000);
 
 			/* Perform a dump if it's time. */
 			if (delay_in_ms <= 0)
-- 
2.19.1

From 31dc30c97861afae9c34852afc5a5b1c91bbeadc Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:04:10 +0300
Subject: [PATCH v1 1/3] pg_prewarm: add tap test for autoprewarm feature

---
 contrib/pg_prewarm/Makefile |  2 +
 contrib/pg_prewarm/t/001_autoprewarm.pl | 51 +
 2 files changed, 53 insertions(+)
 create mode 100644 contrib/pg_prewarm/t/001_autoprewarm.pl

diff --git a/contrib/pg_prewarm/Makefile b/contrib/pg_prewarm/Makefile
index b13ac3c813..9cfde8c4e4 100644
--- a/contrib/pg_prewarm/Makefile
+++ b/contrib/pg_prewarm/Makefile
@@ -10,6 +10,8 @@ EXTENSION = pg_prewarm
 DATA = pg_prewarm--1.1--1.2.sql pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql
 PGFILEDESC = "pg_prewarm - preload relation data into system buffer cache"
 
+TAP_TESTS = 1
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/pg_prewarm/t/001_autoprewarm.pl b/contrib/pg_prewarm/t/001_autoprewarm.pl
new file mode 100644
index 00..b564c29931
--- /dev/null
+++ b/contrib/pg_prewarm/t/001_autoprewarm.pl
@@ -0,0 +1,51 @@
+#
+# Check that pg_prewarm can dump blocks from shared buffers
+# to PGDATA/autoprewarm.blocks.
+#
+
+use strict;
+use Test::More;
+use TestLib;
+use Time::HiRes qw(usleep);
+use warnings;
+
+use PostgresNode;
+
+plan tests => 3;
+
+my $node = get_new_node("node");
+$node->init;
+$node->append_conf(
+'postgresql.conf', qq(
+shared_preload_libraries = 'pg_prewarm'
+pg_prewarm.autoprewarm = 'o

Re: PATCH: Report libpq version and configuration

2020-11-09 Thread Tom Lane
Alvaro Herrera  writes:
> Well, if we can make it run in more systems than just Linux, then it
> seems worth having.  The submitted patch seems a little bit on the
> naughty side.

I agree that the facility seems possibly useful, as long as we can
minimize its platform dependency.  Just embedding some strings, as
I suggested upthread, seems like it'd go far in that direction.
Yeah, you could spend a lot of effort to make it a bit more user
friendly, but is the effort really going to be repaid?  The use
case for this isn't that large, I don't think.

regards, tom lane




Re: Reduce the time required for a database recovery from archive.

2020-11-09 Thread Stephen Frost
Greetings,

* Dmitry Shulga (d.shu...@postgrespro.ru) wrote:
> > On 19 Oct 2020, at 23:25, Stephen Frost  wrote:
>  Implementation of this approach assumes running several background 
>  processes (bgworkers)
>  each of which runs a shell command specified by the parameter 
>  restore_command
>  to deliver an archived WAL file. Number of running parallel processes is 
>  limited
>  by the new parameter max_restore_command_workers. If this parameter has 
>  value 0
>  then WAL files delivery is performed using the original algorithm, that 
>  is in
>  one-by-one manner. If this parameter has value greater than 0 then the 
>  database
>  server starts several bgworker processes up to the limit specified by
>  the parameter max_restore_command_workers and passes to every process
>  WAL file name to deliver. Active processes start prefetching of specified
>  WAL files and store received files in the directory pg_wal/pgsql_tmp. 
>  After
>  bgworker process finishes receiving a file it marks itself as a free 
>  process
>  and waits for a new request to receive a next WAL file. The main process
>  performing database recovery still handles WAL files in one-by-one 
>  manner,
>  but instead of waiting for a next required WAL file's availability it 
>  checks for
>  that file in the prefetched directory. If a new file is present there,
>  the main process starts its processing.
> >>> 
> >>> I'm a bit confused about this description- surely it makes sense for the
> >> OK. The description I originally provided was probably pretty misleading 
> >> so I will try to clarify it a bit.
> >> 
> >> So, as soon as a bgworker process finishes delivering a WAL file it marks 
> >> itself as a free.
> >> 
> >> The WAL records applier works in parallel, processing the WAL files in
> >> sequential order.
> >> Once it finishes handling the current WAL file, it checks whether it is
> >> possible to run extra bgworker processes
> >> to deliver WAL files which will be required a bit later. If there are free
> >> bgworker processes then the applier requests
> >> that one or more extra WAL files be downloaded. After that the applier
> >> determines the name of the next WAL file to handle
> >> and checks whether it exists in the prefetching directory. If it does exist
> >> then the applier starts handling it and
> >> the processing loop is repeated.
> > 
> > Ok- so the idea is that each time the applying process finishes with a
> > WAL file then it'll see if there's an available worker and, if so, will
> > give it the next file to go get (which would presumably be some number
> > in the future and the actual next file the applying process needs is
> > already available).  That sounds better, at least, though I'm not sure
> > why we're making it the job of the applying process to push the workers
> > each time..?
> Every bgworker serves as a task to deliver a WAL file. Considering a task as
> an active entity is a well-known approach in software design.
> So I don't see any issues with such an implementation. Moreover, implementing
> this approach is probably simpler than the alternatives,
> and it still provides a positive performance impact compared with the current
> (non-optimized) implementation.

I don't think we look only at if something is an improvement or not over
the current situation when we consider changes.

The relatively simple approach I was thinking was that a couple of
workers would be started and they'd have some prefetch amount that needs
to be kept out ahead of the applying process, which they could
potentially calculate themselves without needing to be pushed forward by
the applying process.

> >  Also, I'm not sure about the interface- wouldn't it make
> > more sense to have a "pre-fetch this amount of WAL" kind of parameter
> > directly instead of tying that to the number of background workers?
> This approach was originally considered and closely discussed.
> Finally, it was decided that introducing an extra GUC parameter to control 
> pre-fetch limit is not practical since it shifts responsibility for tuning 
> prefetching
> mechanism from postgres server to a user.
> From my point of view the fewer parameters exist to set up some feature the 
> better.

I agree in general that it's better to have fewer parameters, but I
disagree that this isn't an important option for users to be able to
tune- the rate of fetching WAL and of applying WAL varies quite a bit
from system to system.  Being able to tune the pre-fetch seems like it'd
actually be more important to a user than the number of processes
required to keep up with that amount of pre-fetching, which is something
we could actually figure out on our own...

> >  You
> > might only need one or two processes doing WAL fetching to be able to
> > fetch faster than the applying process is able to apply it, but you
> > probably want to pre-fetch more than just one or

Re: PATCH: Report libpq version and configuration

2020-11-09 Thread Alvaro Herrera
On 2020-Oct-27, Craig Ringer wrote:

> On Tue, Oct 27, 2020 at 12:56 AM Tom Lane  wrote:
>
> > +1.  Are we concerned about translatability of these strings?  I think
> > I'd vote against, as it would complicate applications, but it's worth
> > thinking about it now not later.
> 
> It's necessary not to translate the key names, they are identifiers
> not descriptive text. I don't object to having translations too, but
> the translation teams have quite enough to do already with user-facing
> text that will get regularly seen. So while it'd be potentially
> interesting to expose translated versions too, I'm not entirely
> convinced. It's a bit like translating macro names. You could, but ...
> why?

I don't think translating these values is useful for much.  I see it
similar to translating pg_controldata output: it is troublesome (to
pg_upgrade for instance) and serves no purpose that I know of.


> > Again, I'm not exactly excited about this.  I do not one bit like
> > patches that assume that x64 linux is the universe, or at least
> > all of it that need be catered to.  Reminds me of people who thought
> > Windows was the universe, not too many years ago.
> 
> Yeah. I figured you'd say that, and don't disagree. It's why I split
> this patch out - it's kind of a sacrificial patch.

Well, if we can make it run in more systems than just Linux, then it
seems worth having.  The submitted patch seems a little bit on the
naughty side.




Re: abstract Unix-domain sockets

2020-11-09 Thread Andreas Karlsson

On 11/9/20 9:04 AM, Peter Eisentraut wrote:

On 2020-11-09 07:08, Michael Paquier wrote:

As abstract namespaces don't have permissions, anyone knowing the name
of the path, which should be unique, can access the server.
Do you think that the documentation should warn the user about that?
This feature is about easing the management part of the socket paths
while throwing away the security aspect of it.


We could modify the documentation further.  But note that the 
traditional way of putting the socket into /tmp has the same properties, 
so this shouldn't be a huge shock.


One issue with them is that they interact differently with kernel 
namespaces than normal unix sockets do. Abstract sockets are handled by 
the network namespaces, and not the file system namespaces. But I am not 
sure that this is our job to document.


Andreas




Re: -O switch

2020-11-09 Thread Tom Lane
Magnus Hagander  writes:
> On Wed, Nov 4, 2020 at 2:10 AM Tom Lane  wrote:
>> ... looking at this again, BackendRun certainly looks ridiculously
>> over-engineered for what it still does.

> Yeah, looking at it again, I agree. PFA an updated patch, which I'll
> go ahead and push shortly.

LGTM.

> I did notice when looking through this -- the comment before the function 
> says:

>  * returns:
>  * Shouldn't return at all.
>  * If PostgresMain() fails, return status.

> I'm pretty sure that's incorrect in the current branches as well,
> since it's a void function it will never return anything. Pretty sure
> it should just have the first point and not the second one there, or
> is this trying to convey some meaning I'm just not getting?

Looking at old versions, BackendRun and PostgresMain used to be
declared to return int.  Whoever changed that to void evidently
missed updating this comment.

I'd reduce the whole thing to "Doesn't return."  If you were feeling
really ambitious you could start plastering pg_attribute_noreturn() on
these functions ... but since that would be placed on the declarations,
a comment here would still be in order probably.
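
For illustration, the declaration-side part of that would look something like
the following (prototype shown as of this era; treat the exact location and
signature as approximate):

/* e.g. in tcop/tcopprot.h */
extern void PostgresMain(int argc, char *argv[],
						 const char *dbname,
						 const char *username) pg_attribute_noreturn();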

regards, tom lane




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-09 Thread Dmitry Dolgov
> On Tue, Nov 03, 2020 at 07:14:47PM +0100, Dmitry Dolgov wrote:
> > On Thu, Aug 20, 2020 at 03:11:19PM +0900, Michael Paquier wrote:
> > On Wed, Aug 19, 2020 at 02:16:46PM -0400, Alvaro Herrera wrote:
> > > I did not set the flag in REINDEX CONCURRENTLY, but as I understand it
> > > can be done too, since in essence it's the same thing as a CIC from a
> > > snapshot management point of view.
> >
> > Yes, I see no problems for REINDEX CONCURRENTLY as well as long as
> > there are no predicates and expressions involved.  The transactions
> > that should be patched are all started in ReindexRelationConcurrently.
> > The transaction of index_concurrently_swap() cannot set up that
> > though.  Only thing to be careful is to make sure that safe_flag is
> > correct depending on the list of indexes worked on.
>
> Hi,
>
> After looking through the thread and reading the patch it seems good,
> and there are only few minor questions:
>
> * Doing the same for REINDEX CONCURRENTLY, which does make sense. In
>   fact it's already mentioned in the commentaries as done, which a bit
>   confusing.

Just to give it a shot, would the attached change be enough?
>From d30d8acf91679985970334069ab7f1f8f7fc3ec5 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 4 Aug 2020 22:04:57 -0400
Subject: [PATCH v3] Avoid spurious CREATE INDEX CONCURRENTLY waits

---
 src/backend/commands/indexcmds.c | 122 ++-
 src/include/storage/proc.h   |   6 +-
 2 files changed, 124 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 75552c64ed..5019397d50 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -385,7 +385,10 @@ CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)
+ * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
+ * on indexes that are neither expressional nor partial are also safe to
+ * ignore, since we know that those processes won't examine any data
+ * outside the table they're indexing.
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -406,7 +409,8 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-		  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+		  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
+		  | PROC_IN_SAFE_IC,
 		  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -426,7 +430,8 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 	true, false,
-	PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+	PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
+	| PROC_IN_SAFE_IC,
 	&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -519,6 +524,7 @@ DefineIndex(Oid relationId,
 	bool		amcanorder;
 	amoptions_function amoptions;
 	bool		partitioned;
+	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1045,6 +1051,17 @@ DefineIndex(Oid relationId,
 		}
 	}
 
+	/*
+	 * When doing concurrent index builds, we can set a PGPROC flag to tell
+	 * concurrent VACUUM, CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY
+	 * to ignore us when waiting for concurrent snapshots.  That can only be
+	 * done for indexes that don't execute any expressions.  Determine that.
+	 * (The flag is reset automatically at transaction end, so it must be
+	 * set for each transaction.)
+	 */
+	safe_index = indexInfo->ii_Expressions == NIL &&
+		indexInfo->ii_Predicate == NIL;
+
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1431,6 +1448,15 @@ DefineIndex(Oid relationId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+	{
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		MyProc->vacuumFlags |= PROC_IN_SAFE_IC;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
+		LWLockRelease(ProcArrayLock);
+	}
+
 	/*
 	 * The index is now visible, so we can report the OID.
 	 */
@@ -1490,6 +1516,15 @@ DefineIndex(Oid relationId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+	{
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+		MyProc->vacuumFlags |= PROC_IN_SAFE_IC;
+		ProcGlobal->vacuumFlags[MyProc->pgxactoff] = MyProc->vacuumFlags;
+		LWLockRelease(ProcArrayLock);
+	}
+
 	/*
 	 * Phase 3 of concu

Re: Reduce the time required for a database recovery from archive.

2020-11-09 Thread Dmitry Shulga
Hello Stephen,

> On 19 Oct 2020, at 23:25, Stephen Frost  wrote:
> 
> Greetings,
> 
 Implementation of this approach assumes running several background 
 processes (bgworkers)
 each of which runs a shell command specified by the parameter 
 restore_command
 to deliver an archived WAL file. Number of running parallel processes is 
 limited
 by the new parameter max_restore_command_workers. If this parameter has 
 value 0
 then WAL files delivery is performed using the original algorithm, that is 
 in
 one-by-one manner. If this parameter has value greater than 0 then the 
 database
 server starts several bgworker processes up to the limit specified by
 the parameter max_restore_command_workers and passes to every process
 WAL file name to deliver. Active processes start prefetching of specified
 WAL files and store received files in the directory pg_wal/pgsql_tmp. After
 bgworker process finishes receiving a file it marks itself as a free 
 process
 and waits for a new request to receive a next WAL file. The main process
 performing database recovery still handles WAL files in one-by-one manner,
 but instead of waiting for a next required WAL file's availability it 
 checks for
 that file in the prefetched directory. If a new file is present there,
 the main process starts its processing.
>>> 
>>> I'm a bit confused about this description- surely it makes sense for the
>> OK. The description I originally provided was probably pretty misleading so 
>> I will try to clarify it a bit.
>> 
>> So, as soon as a bgworker process finishes delivering a WAL file it marks 
>> itself as a free.
>> 
>> The WAL records applier works in parallel, processing the WAL files in
>> sequential order.
>> Once it finishes handling the current WAL file, it checks whether it is
>> possible to run extra bgworker processes
>> to deliver WAL files which will be required a bit later. If there are free
>> bgworker processes then the applier requests
>> that one or more extra WAL files be downloaded. After that the applier
>> determines the name of the next WAL file to handle
>> and checks whether it exists in the prefetching directory. If it does exist
>> then the applier starts handling it and
>> the processing loop is repeated.
> 
> Ok- so the idea is that each time the applying process finishes with a
> WAL file then it'll see if there's an available worker and, if so, will
> give it the next file to go get (which would presumably be some number
> in the future and the actual next file the applying process needs is
> already available).  That sounds better, at least, though I'm not sure
> why we're making it the job of the applying process to push the workers
> each time..?
Every bgworker serves as a task to deliver a WAL file. Considering a task as
an active entity is a well-known approach in software design.
So I don't see any issues with such an implementation. Moreover, implementing
this approach is probably simpler than the alternatives,
and it still provides a positive performance impact compared with the current
(non-optimized) implementation.

>  Also, I'm not sure about the interface- wouldn't it make
> more sense to have a "pre-fetch this amount of WAL" kind of parameter
> directly instead of tying that to the number of background workers?
This approach was originally considered and closely discussed.
Finally, it was decided that introducing an extra GUC parameter to control 
pre-fetch limit is not practical since it shifts responsibility for tuning 
prefetching
mechanism from postgres server to a user.
From my point of view the fewer parameters exist to set up some feature the 
better.

>  You
> might only need one or two processes doing WAL fetching to be able to
> fetch faster than the applying process is able to apply it, but you
> probably want to pre-fetch more than just one or two 16 MB WAL files.


Every time prefetching is started, the number of files to prefetch is
calculated by the expression
 PREFETCH_RATION * max_restore_command_workers - 'number of already
prefetched files'
where PREFETCH_RATION is a compiled-in constant with value 16.

After that, a task to deliver the next WAL file is handed to a free
bgworker process, until no free bgworker processes remain.
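
A rough sketch of that calculation (function and variable names are
illustrative, not taken from the actual patch):

#define PREFETCH_RATION 16		/* compiled-in constant, as described above */

/* How many extra WAL files may be handed to free bgworkers right now. */
static int
files_to_prefetch(int max_restore_command_workers, int already_prefetched)
{
	int		n = PREFETCH_RATION * max_restore_command_workers -
				already_prefetched;

	return (n > 0) ? n : 0;
}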


> In other words, I would have thought we'd have:
> 
> wal_prefetch_amount = 1GB
> max_restore_command_workers = 2
> 
> and then you'd have up to 2 worker processes running and they'd be
> keeping 1GB of WAL pre-fetched at all times.  If we have just
> 'max_restore_command_workers' and you want to pre-fetch 1GB of WAL then
> you'd have to have a pretty high value there and you'd end up with
> a bunch of threads that all spike to go do work each time the applying
Sorry, I don't see how we can end up with a bunch of threads?
max_restore_command_workers has value 2 in your example meaning that no more 
than 2 bgworkers could be run c

Re: -Wformat-signedness

2020-11-09 Thread Tom Lane
Peter Eisentraut  writes:
> 1. enums are unsigned by default in gcc, so all those internal error 
> messages "unrecognized blah kind: %d" need to be changed to %u.

Do we have reason to think that that is true in every C compiler?
My own preference for this would be to leave the messages as-is
and add explicit "(int)" casts to the arguments.  There are some
fraction of these that are like that already.
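
A minimal illustration of that style, assuming a typical "unrecognized kind"
message (not taken from any particular file):

	default:
		elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
		break;

The explicit cast keeps %d correct regardless of whether the compiler chooses
a signed or unsigned representation for the enum.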

regards, tom lane




Re: Disable WAL logging to speed up data loading

2020-11-09 Thread Stephen Frost
Greetings,

* osumi.takami...@fujitsu.com (osumi.takami...@fujitsu.com) wrote:
> On Tuesday, Nov 3, 2020 3:02 AM Stephen Frost  wrote:
> > I'm not sure that wal_level=none is really the right way to address this
> > use-case.  We already have unlogged tables and that's pretty clean and
> > meets the "we want to load data fast without having to pay for WAL" use 
> > case.
> > The argument here seems to be that to take advantage of unlogged tables
> > requires the tools using PG to know how to issue a 'CREATE UNLOGGED
> > TABLE' command instead of a 'CREATE TABLE' command.  That doesn't
> > seem like a huge leap, but we could make it easier by just adding a
> > 'table_log_default' or such GUC that could be set on the data loading role 
> > to
> > have all tables created by it be unlogged.
> I'm afraid that in the case where all tables are set up as unlogged,
> the user is forced to
> back up *all* commands from the application, in preparation for an unexpected crash.
> This is because whenever the server crashes,
> the unlogged tables are truncated and the DBA needs to
> redo everything done since the last backup, without exception.
> I don't think that is easy or satisfying for the user.

You'll need to explain how this is different from the proposed
'wal_level = none' option, since it sounded like that would be exactly
the same case..?

> In addition, as long as the tables are unlogged, the user cannot be released 
> from
> this condition or (requirement ?) to back up all commands or
> to guarantee that all commands are repeatable for the DBA.

They can change the table to be logged though, if they wish to.

> When I consider that the use case is a data warehouse system,
> as described upthread, the size of each table can be large.
> Thus, changing the status from unlogged to logged (recoverable)
> takes much time under the current implementation, as was discussed before.

Ok- so the issue is that, today, we dump all of the table into the WAL
when we go from unlogged to logged, but as I outlined previously,
perhaps that's because we're missing a trick there when
wal_level=minimal.  If wal_level=minimal, then it would seem like we
could lock the table, then sync it and then mark it as logged, which is
more-or-less what you're asking to have be effectively done with the
proposed wal_level=none, but this would be an optimization for all
existing users of wal_level=minimal who have unlogged tables that they
want to change to logged, and this works on a per-table basis instead,
which seems like a better approach than a cluster-wide setting.

> By having a limited window of time
> during which wal_level=none is in effect, I'd like wal_level=none to
> localize and minimize the burden of guaranteeing that all commands are
> repeatable. To achieve this, after switching wal_level from none to a higher
> level,
> the patch must ensure crash recovery, though.

Perhaps a helper command along the lines of ALTER TABLE ALL IN TABLESPACE
could also be added, to mark a whole set of unlogged tables as logged in
one go.

> Sorry that my current patch doesn't complete this aspect fully at present
> but, may I have your opinion about this ?

Presently, my feeling is that we could address this use-case without
having to introduce a new cluster-wide WAL level, and that's the
direction I'd want to see this going.  Perhaps I'm missing something
about why the approach I've set forth above wouldn't work, and
wal_level=none would, but I've not seen it yet.

Thanks,

Stephen




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-09 Thread Alvaro Herrera
On 2020-Nov-09, Magnus Hagander wrote:

> On Mon, Nov 9, 2020 at 3:29 PM Alvaro Herrera  wrote:
> >
> > How about a switch like "--with-scripts=" where the list can be
> > "all" to include everything (default), "none" to include nothing, or a
> > comma-separated list of things to include?  (Also "--with-scripts=help"
> > to query which things are supported)  So
> > "--with-scripts=reindex_hash,largeobject" omits the useless
> > delete_old_cluster script and keeps the ones you want.
> >
> > Perhaps you could see it in the negative direction as more convenient,
> > so --suppress-scripts=delete_old_cluster
> 
> What would be the actual use-case in this though?

The point is that we can do this just this time, and then in the future
we'll already have a mechanism for emitting or suppressing scripts that
are useful for some cases but not others.  If I read you right, we
currently have two cases: the user is either running it manually (in
which case you claim they want all scripts) or there's a wrapper (in
which case reindex_hash et al are useful, others are not).  But what if,
say, we determined that it's a good idea to suggest to migrate BRIN
indexes from v1 to v2 for datatype X?  If your tooling is prepared to
deal with that then it'll find the script useful, otherwise it's junk.
And we don't have to have a new option "--include-brin-upgrade" or
whatever; just add a new value to the suppressable script list.





Re: vacuum -vs reltuples on insert only index

2020-11-09 Thread Jehan-Guillaume de Rorthais
On Wed, 4 Nov 2020 18:44:03 -0800
Peter Geoghegan  wrote:

> On Mon, Nov 2, 2020 at 10:03 AM Peter Geoghegan  wrote:
> > Actually, it seems better to always count num_index_tuples the old way
> > during cleanup-only index VACUUMs, despite the inaccuracy that that
> > creates with posting list tuples.  
> 
> Pushed a fix for this just now.
> 
> Thanks for the report!

Sorry I couldn't give some more feedback on your thoughts on time...

Thank you for your investigation and fix!

Regards,




Re: pg_upgrade analyze script

2020-11-09 Thread Peter Eisentraut

On 2020-11-09 11:22, Magnus Hagander wrote:

I have spotted one small-ish thing.  This patch fails to update
the following code in vcregress.pl:
 print "\nSetting up stats on new cluster\n\n";
 system(".\\analyze_new_cluster.bat") == 0 or exit 1;

Ah, nice catch -- thanks! I guess this is unfortunately not a test
that's part of what cfbot tests.

Untested on Windows, but following the patterns of the rows before it.
I will go ahead and push this version in a bit.


You should just remove those calls.  There is no need to replace them 
with vacuumdb calls.  The reason those calls were there is that they 
were testing the generated script itself.  If the script is gone, there 
is no more need.  There are already separate tests for testing vacuumdb.






Re: Rethinking LOCK TABLE's behavior on views

2020-11-09 Thread Alvaro Herrera
On 2020-Nov-07, Noah Misch wrote:

> On Sat, Nov 07, 2020 at 11:57:20AM -0500, Tom Lane wrote:

> > A completely different approach we could consider is to weaken the
> > permissions requirements for LOCK on a view, say "allow it if either
> > the calling user or the view owner has the needed permission".  This
> > seems generally pretty messy and so I don't much like it, but we
> > should consider as many solutions as we can think of.
> 
> This is the best of what you've listed by a strong margin, and I don't know of
> better options you've not listed.  +1 for it.  Does it work for you?

It does sound attractive from a user complexity perspective, even if it
does sound messy from an implementation perspective.

> I think
> the mess arises from LOCK TABLE serving "get locks sufficient for $ACTIONS" as
> a family of use cases.  For views only, different $ACTIONS want different
> behavior.  $ACTIONS==SELECT wants today's behavior; pg_get_viewdef() wants
> shallower recursion and caller permissions; DROP VIEW wants no recursion.

Maybe we can tackle this problem directly, by adding a clause to LOCK
TABLE to indicate a purpose for the lock that the server can use to
determine the level of recursion.  For example
  LOCK TABLE xyz IN <lockmode> FOR <purpose>
where <purpose> can be READ, DROP, DEFINE.

(For back-patch purposes we could store the purpose in LockStmt->mode,
which has more than enough unused bits).
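
For instance (bit positions entirely hypothetical, just to show the idea of
reusing LockStmt->mode's spare bits in back branches):

#define LOCK_PURPOSE_READ		(1 << 28)
#define LOCK_PURPOSE_DROP		(1 << 29)
#define LOCK_PURPOSE_DEFINE		(1 << 30)
#define LOCK_PURPOSE_MASK		(LOCK_PURPOSE_READ | LOCK_PURPOSE_DROP | \
								 LOCK_PURPOSE_DEFINE)

	/* recover the plain lock mode and the purpose from the combined field */
	LOCKMODE	lockmode = stmt->mode & ~LOCK_PURPOSE_MASK;
	int			purpose = stmt->mode & LOCK_PURPOSE_MASK;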




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-09 Thread Magnus Hagander
On Mon, Nov 9, 2020 at 3:29 PM Alvaro Herrera  wrote:
>
> On 2020-Nov-09, Magnus Hagander wrote:
>
> > But for usability that makes less sense. For the delete script, the
> > wrapper (that the switch is intended for) knows more than pg_upgrade
> > about how to delete it, so it can do a better job, and thus it makes
> > sense to silence it. But for something like these .sql (and also in
> > the previously mentioned case of extension upgrades), pg_upgrade knows
> > more and the only thing the wrapper could do is to reimplement the
> > same functionality, and maintain it.
> >
> > I wonder if the best way forward might be to revert it back to being a
> > --no-instructions switch, remove the delete_old_cluster.sh script
> > completely and just replace it with instructions saying "you should
> > delete the old cluster", and then keep generating these scripts as
> > necessary?
>
> How about a switch like "--with-scripts=<list>" where the list can be
> "all" to include everything (default), "none" to include nothing, or a
> comma-separated list of things to include?  (Also "--with-scripts=help"
> to query which things are supported)  So
> "--with-scripts=reindex_hash,largeobject" omits the useless
> delete_old_cluster script and keeps the ones you want.
>
> Perhaps you could see it in the negative direction as more convenient,
> so --suppress-scripts=delete_old_cluster

What would be the actual use-case in this though?

I can see the --suppress-scripts=delete_old_cluster, but that's
basically the only usecase I can see for this. And for use in any
wrapper, that wrapper would have to know ahead of time what the
options would be... And if that's the only usecase, than this ends up
basically boiling down to the same as my suggestion just with a more
complex switch? :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: WIP: BRIN multi-range indexes

2020-11-09 Thread John Naylor
On Sat, Nov 7, 2020 at 4:38 PM Tomas Vondra 
wrote:

> Overall, I think there's very little difference, particularly in the
> "match" cases when we're searching for a value that we know is in the
> table. The one-hash variant seems to perform a bit better, but the
> difference is fairly small.
>
> In the "mismatch" cases (searching for value that is not in the table)
> the differences are more significant, but it might be noise. It does
> seem much more "green" than "red", i.e. the one-hash variant seems to be
> faster (although this does depend on the values for formatting).
>
> To sum this up, I think the one-hash approach seems interesting. It's
> not going to give us huge speedups because we're only hashing int32
> values anyway (not the source data), but it's worth exploring.

Thanks for testing! It seems you tested against the version with two
moduli, and not the alternative discussed in

https://www.postgresql.org/message-id/20200918222702.omsieaphfj3ctqg3%40development

which would in fact be rehashing the 32 bit values. I think that would be
the way to go if we don't use the one-hashing approach.

> a) set_bloom_partitions does this:
>
> while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
>pidx++;
>
> which is broken, because the second part of the condition only checks
> the current index - we may end up using nhashes primes after that, and
> some of them may be 0. So this needs to be:
>
> while (primes[pidx + nhashes - 1] <= target &&
>primes[pidx + nhashes] > 0)
>pidx++;

Good catch.

> b) set_bloom_partitions does this to generate primes:
>
> /*
>  * Increase the limit to ensure we have some primes higher than
>  * the target partition length. The 100 value is arbitrary, but
>  * should be well over what we need.
>  */
> primes = generate_primes(target_partlen + 100);
>
> It's not clear to me why 100 is sufficient, particularly for large page
> sizes. AFAIK the primes get more and more sparse, so how does this
> guarantee we'll get enough "sufficiently large" primes?

This value is not rigorous and should be improved, but I started with that
by looking at the table in section 3 in

https://primes.utm.edu/notes/gaps.html

I see two ways to make a stronger guarantee:

1. Take the average gap between primes near n, which is log(n), and
multiply that by BLOOM_MAX_NUM_PARTITIONS. Adding that to the target seems
a better heuristic than a constant, and is still simple to calculate.

With the pathological example you gave of n=575104, k=3 (target_partlen =
191701), the number to add is log(191701) * 10 = 122.  By the table
referenced above, the largest prime gap under 360653 is 95, so we're
guaranteed to find at least one prime in the space of 122 above the target.
That will likely be enough to find the closest-to-target filter size for
k=3. Even if it weren't, nbits is so large that the relative difference is
tiny. I'd say a heuristic like this is most likely to be off precisely when
it matters the least. At this size, even if we find zero primes above our
target, the relative filter size is close to

(575104 - 3 * 95) / 575104 = 0.9995

For a more realistic bad-case target partition length, log(1327) * 10 = 72.
There are 33 composites after 1327, the largest such gap below 9551. That
would give five primes larger than the target
1361   1367   1373   1381   1399

which is more than enough for k<=10:

1297 +  1301  + 1303  + 1307  + 1319  + 1321  + 1327  + 1361 + 1367 + 1373
= 13276

2. Use a "segmented range" algorithm for the sieve and iterate until we get
k*2 primes, half below and half above the target. This would be an absolute
guarantee, but also more code, so I'm inclined against that.
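
To make option 1 concrete, the sieve limit could be computed along these lines
(a sketch only; names follow the discussion, not the patch):

#include <math.h>

/* Extend the sieve past the target by the expected span of k primes. */
static int
bloom_sieve_limit(int target_partlen)
{
	return target_partlen +
		(int) ceil(log(target_partlen) * BLOOM_MAX_NUM_PARTITIONS);
}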

> c) generate_primes uses uint16 to store the primes, so it can only
> generate primes up to 32768. That's (probably) enough for 8kB pages, but
> for 32kB pages it's clearly insufficient.

Okay.

> As for the original question how expensive this naive sieve is, I
> haven't been able to measure any significant timings. The logging aroung
> generate_primes usually looks like this:
>
> 2020-11-07 20:36:10.614 CET [232789] LOG:  generating primes nbits
> 575104 nhashes 3 target_partlen 191701
> 2020-11-07 20:36:10.614 CET [232789] LOG:  primes generated
>
> So it takes 0.000 second for this extreme page size. I don't think we
> need to invent anything more elaborate.

Okay, good to know. If we were concerned about memory, we could have it
check only odd numbers. That's a common feature of sieves, but also makes
the code a bit harder to understand if you haven't seen it before.

Also to fill in something I left for later, the reference for this

/* upper bound of number of primes below limit */
/* WIP: reference for this number */
int numprimes = 1.26 * limit / log(limit);

is

Rosser, J. Barkley; Schoenfeld, Lowell (1962). "Approximate formulas for
some functions of prime numbers". Illinois J. Math. 6: 64–94.
doi:10.1215/ijm/1255631807

More pr

Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-09 Thread Alvaro Herrera
On 2020-Nov-09, Magnus Hagander wrote:

> But for usability that makes less sense. For the delete script, the
> wrapper (that the switch is intended for) knows more than pg_upgrade
> about how to delete it, so it can do a better job, and thus it makes
> sense to silence it. But for something like these .sql (and also in
> the previously mentioned case of extension upgrades), pg_upgrade knows
> more and the only thing the wrapper could do is to reimplement the
> same functionality, and maintain it.
> 
> I wonder if the best way forward might be to revert it back to being a
> --no-instructions switch, remove the delete_old_cluster.sh script
> completely and just replace it with instructions saying "you should
> delete the old cluster", and then keep generating these scripts as
> necessary?

How about a switch like "--with-scripts=<list>" where the list can be
"all" to include everything (default), "none" to include nothing, or a
comma-separated list of things to include?  (Also "--with-scripts=help"
to query which things are supported)  So
"--with-scripts=reindex_hash,largeobject" omits the useless
delete_old_cluster script and keeps the ones you want.

Perhaps you could see it in the negative direction as more convenient,
so --suppress-scripts=delete_old_cluster




Re: Reduce the number of special cases to build contrib modules on windows

2020-11-09 Thread Alvaro Herrera
On 2020-Nov-06, David Rowley wrote:

> +# Handle makefile rules for when file to be added to the project
> +# does not exist.  Returns 1 when the original file add should be
> +# skipped.
> +sub FindAndAddAdditionalFiles
> +{
> + my $self = shift;
> + my $fname = shift;
> + my ($ext) = $fname =~ /(\.[^.]+)$/;
> +
> + # For .c files, check if a .l file of the same name exists and add that
> + # too.
> + if ($ext eq ".c")
> + {
> + my $filenoext = $fname;
> + $filenoext =~ s{\.[^.]+$}{};

I think you can make this simpler by capturing both the basename and the
extension in one go.  For example,

  $fname =~ /(.*)(\.[^.]+)$/;
  $filenoext = $1;
  $ext = $2;

so you avoid the second =~ statement.

> + if (-e "$filenoext.l")
> + {
> + AddFileConditional($self, "$filenoext.l");
> + return 1;
> + }
> + if (-e "$filenoext.y")
> + {
> + AddFileConditional($self, "$filenoext.y");

Maybe DRY like

for my $ext (".l", ".y") {
  my $file = $filenoext . $ext;
  if (-f $file) {
    AddFileConditional($self, $file);
    return 1;
  }
}

Note: comment says "check if a .l file" and then checks both .l and .y.
Probably want to update the comment ... 




Re: Making cancellations safe

2020-11-09 Thread Dave Cramer
On Wed, 4 Nov 2020 at 10:50, Shay Rojansky  wrote:

> Hi all.
>
> Back in 2016 I started a thread about making cancellations safer[1], I'd
> like to try to pick this up again. Here a summary of the previous
> conversation:
>
> The main ask here is to allow clients to specify which command to cancel,
> to avoid various race conditions where the wrong command is accidentally
> cancelled. The implementation proposal by Tom was for the client and server
> to simply count messages on each side, and to modify the CancelRequest
> message for the client to integrate the requested sequence number to be
> cancelled.
>
> The CancelRequest format change can probably happen in a non-breaking way,
> e.g. by defining a new cancel request code (the 1234/5678 at the beginning
> of the message). The client could know that the server is able to accept
> the new CancelRequest format via a minor protocol bump, or some other means.
>
> There was also discussion of allowing "cancel up to" semantics, where all
> messages up to the provided sequence number are cancelled. While this could
> be useful, it's important to allow setting a lower bound as well. A new
> experimental mode in the Npgsql driver "multiplexes" queries from unrelated
> threads into the same physical connection, possibly pipelining two
> unrelated queries go into a single outgoing TCP packet. In this case, the
> second query producer may issue a cancellation, and unintentionally cancel
> the first query which is still in progress. So ideally the CancelRequest
> would contain both a lower and an upper bound (with the lower being
> optional).
>
> Since cancellation requests may arrive before their target queries do, the
> backend should be able to track unfulfilled requests and apply them as soon
> as the target query is received.
>
> Finally, there was also an orthogonal discussion about widening the cancel
> key (currently 32-bit). However, that seems like it would be a protocol
> breaking change, so possibly better treated separately.
>
> [1]
> https://www.postgresql.org/message-id/flat/CADT4RqDh1CEgz7QgKwYSLT9TMCk7O%3DncauUaSQKVt_nPNTE9wQ%40mail.gmail.com#00832c72f4b93d57d6b0ac59de8eca85
>


I think this is going to become more relevant as libraries push the
protocol. Currently we have a number of implementations of pipelining where
this would be useful.

In general I think we need to have a way to cancel a specific query.
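
To make the discussion concrete, a purely hypothetical extension of the
existing CancelRequestPacket (pqcomm.h) could look like this; the new field
names and layout are not an agreed wire format:

typedef struct ExtendedCancelRequestPacket
{
	MsgType		cancelRequestCode;	/* a new, distinct 1234/xxxx code */
	uint32		backendPID;			/* as in the existing CancelRequest */
	uint32		cancelAuthCode;		/* secret key (still 32-bit here) */
	uint32		minSequenceNumber;	/* optional lower bound, 0 = none */
	uint32		maxSequenceNumber;	/* cancel commands up to this number */
} ExtendedCancelRequestPacket;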

Regards,

Dave


Re: document pg_settings view doesn't display custom options

2020-11-09 Thread John Naylor
On Mon, Nov 9, 2020 at 2:12 AM Fujii Masao 
wrote:

> Pushed. Thanks!
>

Thank you!

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-09 Thread Magnus Hagander
On Mon, Nov 9, 2020 at 2:18 PM Anastasia Lubennikova
 wrote:
>
> On 02.11.2020 16:23, Magnus Hagander wrote:
>
> On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote:
>
> On 2020-10-06 12:26, Magnus Hagander wrote:
>
> I went with the name --no-instructions to have the same name for both
> initdb and pg_upgrade. The downside is that "no-instructions" also
> causes the scripts not to be written in pg_upgrade, which arguably is a
> different thing. We could go with "--no-instructions" and
> "--no-scripts", but that would leave the parameters different. I also
> considered "--no-next-step", but that one didn't quite have the right
> ring to me. I'm happy for other suggestions on the parameter names.
>
> What scripts are left after we remove the analyze script, as discussed in a
> different thread?
>
> There is still delete_old_cluster.sh.
>
> I wonder if we should just not do it and just say "you should delete
> the old cluster". Then we can leave it up to platform integrations to
> enhance that, based on their platform knowledge (such as for example
> knowing if the platform runs systemd or not). That would leave us with
> both pg_upgrade and initdb printing instructions, and not scripts, and
> solve that part of the issue.
>
> Do we only care about .sh scripts? There are also reindex_hash.sql and 
> pg_largeobject.sql in src/bin/pg_upgrade/version.c with instructions.
> How should we handle them?

Oh, that's a good point. I had completely forgotten about those.

For consistency, one might say that those should be dropped as well.

But for usability that makes less sense. For the delete script, the
wrapper (that the switch is intended for) knows more than pg_upgrade
about how to delete it, so it can do a better job, and thus it makes
sense to silence it. But for something like these .sql (and also in
the previously mentioned case of extension upgrades), pg_upgrade knows
more and the only thing the wrapper could do is to reimplement the
same functionality, and maintain it.

I wonder if the best way forward might be to revert it back to being a
--no-instructions switch, remove the delete_old_cluster.sh script
completely and just replace it with instructions saying "you should
delete the old cluster", and then keep generating these scripts as
necessary?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-09 Thread Anastasia Lubennikova

On 02.11.2020 16:23, Magnus Hagander wrote:

On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote:

On 2020-10-06 12:26, Magnus Hagander wrote:

I went with the name --no-instructions to have the same name for both
initdb and pg_upgrade. The downside is that "no-instructions" also
causes the scripts not to be written in pg_upgrade, which arguably is a
different thing. We could go with "--no-instructions" and
"--no-scripts", but that would leave the parameters different. I also
considered "--no-next-step", but that one didn't quite have the right
ring to me. I'm happy for other suggestions on the parameter names.

What scripts are left after we remove the analyze script, as discussed in a
different thread?

There is still delete_old_cluster.sh.
I wonder if we should just not do it and just say "you should delete
the old cluster". Then we can leave it up to platform integrations to
enhance that, based on their platform knowledge (such as for example
knowing if the platform runs systemd or not). That would leave us with
both pg_upgrade and initdb printing instructions, and not scripts, and
solve that part of the issue.

Do we only care about .sh scripts? There are also reindex_hash.sql and 
pg_largeobject.sql in src/bin/pg_upgrade/version.c with instructions.

How should we handle them?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: logical streaming of xacts via test_decoding is broken

2020-11-09 Thread Dilip Kumar
On Mon, Nov 9, 2020 at 5:37 PM Dilip Kumar  wrote:
>
> On Mon, Nov 9, 2020 at 4:21 PM Amit Kapila  wrote:
> >
> > On Mon, Nov 9, 2020 at 3:01 PM Dilip Kumar  wrote:
> > >
> > > On Mon, Nov 9, 2020 at 1:34 PM Dilip Kumar  wrote:
> > > >
> > > > On Mon, Nov 9, 2020 at 11:31 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Mon, Nov 9, 2020 at 11:21 AM Dilip Kumar  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Nov 9, 2020 at 11:04 AM Dilip Kumar  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Mon, Nov 9, 2020 at 11:00 AM Amit Kapila 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Michael reported a BF failure [1] related to one of the logical
> > > > > > > > streaming test case and I've analyzed the issue. As responded on
> > > > > > > > pgsql-committers [2], the issue here is that the streaming
> > > > > > > > transactions can be interleaved and because we are maintaining 
> > > > > > > > whether
> > > > > > > > xact_wrote_changes at the LogicalDecodingContext level, one of 
> > > > > > > > later
> > > > > > > > transaction can overwrite the flag for previously streaming
> > > > > > > > transaction. I think it is logical to have this flag at each
> > > > > > > > transaction level (aka in ReorderBufferTxn), however till now 
> > > > > > > > it was
> > > > > > > > fine because the changes of each transaction are decoded at 
> > > > > > > > one-shot
> > > > > > > > which will be no longer true. We can keep a 
> > > > > > > > output_plugin_private data
> > > > > > > > pointer in ReorderBufferTxn which will be used by test_decoding 
> > > > > > > > module
> > > > > > > > to keep this and any other such flags in future. We need to set 
> > > > > > > > this
> > > > > > > > flag at begin_cb and stream_start_cb APIs and then reset/remove 
> > > > > > > > it at
> > > > > > > > stream_commit_cb, stream_abort_cb and stream_stop_cb APIs.
> > > > > >
> > > > > > So IIUC, we need to keep 'output_plugin_private' in
> > > > > > LogicalDecodingContext as well as in ReorderBufferTxn,  So the
> > > > > > output_plugin_private in the ReorderBufferTxn will currently just 
> > > > > > keep
> > > > > > one flag xact_wrote_changes and the remaining things will still be
> > > > > > maintained in output_plugin_private of the LogicalDecodingContext.  
> > > > > > Is
> > > > > > my understanding correct?
> > > > > >
> > > > >
> > > > > Yes. But keep it as void * so that we can add more things later if 
> > > > > required.
> > > >
> > > > Yeah, that makes sense to me.
> > >
> > > I have made some POC changes and analyzed this further,  I think that
> > > for the streaming transaction we need 2 flags
> > > 1) xact_wrote_changes 2) stream_wrote_changes
> > >
> > > So basically, if the stream didn't make any changes we can skip the
> > > stream start and stream stop message for the empty stream, but if any
> > > of the streams has made any change then we need to emit the
> > > transaction commit message.  But if we want to avoid tracking the
> > > changes per stream then maybe once we set the xact_wrote_changes to
> > > true once for the txn then we better emit the message for all the
> > > stream without tracking whether the stream is empty or not.  What is
> > > your thought on this?
> > >
> >
> > I would prefer to have two separate flags to control this behavior
> > because without that it is quite possible that in some of the cases we
> > display empty stream start/stop messages even when that is not
> > intended.
>
> +1
>
>  The bigger question is do we want to give users an option
> > for skip_empty_streams similar to skip_empty_xacts? I would again
> > prefer to give a separate option to the user as well. What do you
> > think?
>
> Yeah, I think giving an option would be better.

I think we should also think about the combinations of
skip_empty_xacts and skip_empty_streams.  For example, if the user
sets skip_empty_xacts to false and skip_empty_streams to true,
then what should the behavior be if the complete transaction is empty?
option1: It should not print any stream_start/stream_stop and just
print the stream commit, because skip_empty_xacts is false and
skip_empty_streams is true.
option2: It should print the stream_start message for the very first
stream, because it is the first stream of the txn and skip_empty_xacts
is false, so print it, and later it will print the stream commit.
option3: Or for the first stream we first put the BEGIN message, i.e.
stream begin
stream start
stream stop
stream commit
option4: The user should not be allowed to pass skip_empty_xacts =
false with skip_empty_streams = true, because if streaming mode
is on then we cannot print the xact without printing streams.

What is your opinion on this?
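
If we do add such an option, test_decoding could parse it the same way it
already parses skip-empty-xacts in pg_decode_startup(); a hypothetical sketch
(the option and the struct field are not in the tree yet):

		else if (strcmp(elem->defname, "skip-empty-streams") == 0)
		{
			if (elem->arg == NULL)
				data->skip_empty_streams = true;
			else if (!parse_bool(strVal(elem->arg), &data->skip_empty_streams))
				ereport(ERROR,
						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
								strVal(elem->arg), elem->defname)));
		}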

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logical streaming of xacts via test_decoding is broken

2020-11-09 Thread Dilip Kumar
On Mon, Nov 9, 2020 at 4:21 PM Amit Kapila  wrote:
>
> On Mon, Nov 9, 2020 at 3:01 PM Dilip Kumar  wrote:
> >
> > On Mon, Nov 9, 2020 at 1:34 PM Dilip Kumar  wrote:
> > >
> > > On Mon, Nov 9, 2020 at 11:31 AM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Nov 9, 2020 at 11:21 AM Dilip Kumar  
> > > > wrote:
> > > > >
> > > > > On Mon, Nov 9, 2020 at 11:04 AM Dilip Kumar  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Nov 9, 2020 at 11:00 AM Amit Kapila 
> > > > > >  wrote:
> > > > > > >
> > > > > > > Michael reported a BF failure [1] related to one of the logical
> > > > > > > streaming test case and I've analyzed the issue. As responded on
> > > > > > > pgsql-committers [2], the issue here is that the streaming
> > > > > > > transactions can be interleaved and because we are maintaining 
> > > > > > > whether
> > > > > > > xact_wrote_changes at the LogicalDecodingContext level, one of 
> > > > > > > later
> > > > > > > transaction can overwrite the flag for previously streaming
> > > > > > > transaction. I think it is logical to have this flag at each
> > > > > > > transaction level (aka in ReorderBufferTxn), however till now it 
> > > > > > > was
> > > > > > > fine because the changes of each transaction are decoded at 
> > > > > > > one-shot
> > > > > > > which will be no longer true. We can keep a output_plugin_private 
> > > > > > > data
> > > > > > > pointer in ReorderBufferTxn which will be used by test_decoding 
> > > > > > > module
> > > > > > > to keep this and any other such flags in future. We need to set 
> > > > > > > this
> > > > > > > flag at begin_cb and stream_start_cb APIs and then reset/remove 
> > > > > > > it at
> > > > > > > stream_commit_cb, stream_abort_cb and stream_stop_cb APIs.
> > > > >
> > > > > So IIUC, we need to keep 'output_plugin_private' in
> > > > > LogicalDecodingContext as well as in ReorderBufferTxn,  So the
> > > > > output_plugin_private in the ReorderBufferTxn will currently just keep
> > > > > one flag xact_wrote_changes and the remaining things will still be
> > > > > maintained in output_plugin_private of the LogicalDecodingContext.  Is
> > > > > my understanding correct?
> > > > >
> > > >
> > > > Yes. But keep it as void * so that we can add more things later if 
> > > > required.
> > >
> > > Yeah, that makes sense to me.
> >
> > I have made some POC changes and analyzed this further,  I think that
> > for the streaming transaction we need 2 flags
> > 1) xact_wrote_changes 2) stream_wrote_changes
> >
> > So basically, if the stream didn't make any changes we can skip the
> > stream start and stream stop message for the empty stream, but if any
> > of the streams has made any change then we need to emit the
> > transaction commit message.  But if we want to avoid tracking the
> > changes per stream then maybe once we set the xact_wrote_changes to
> > true once for the txn then we better emit the message for all the
> > stream without tracking whether the stream is empty or not.  What is
> > your thought on this?
> >
>
> I would prefer to have two separate flags to control this behavior
> because without that it is quite possible that in some of the cases we
> display empty stream start/stop messages even when that is not
> intended.

+1

 The bigger question is do we want to give users an option
> for skip_empty_streams similar to skip_empty_xacts? I would again
> prefer to give a separate option to the user as well. What do you
> think?

Yeah, I think giving an option would be better.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-09 Thread Magnus Hagander
On Mon, Nov 2, 2020 at 2:23 PM Magnus Hagander  wrote:
>
> On Tue, Oct 27, 2020 at 12:40 PM Bruce Momjian  wrote:
> >
> > On Tue, Oct 27, 2020 at 12:35:56PM +0100, Peter Eisentraut wrote:
> > > On 2020-10-27 11:53, Bruce Momjian wrote:
> > > > On Tue, Oct 27, 2020 at 11:35:25AM +0100, Peter Eisentraut wrote:
> > > > > On 2020-10-06 12:26, Magnus Hagander wrote:
> > > > > > I went with the name --no-instructions to have the same name for 
> > > > > > both
> > > > > > initdb and pg_upgrade. The downside is that "no-instructions" also
> > > > > > causes the scripts not to be written in pg_upgrade, which arguably 
> > > > > > is a
> > > > > > different thing. We could go with "--no-instructions" and
> > > > > > "--no-scripts", but that would leave the parameters different. I 
> > > > > > also
> > > > > > considered "--no-next-step", but that one didn't quite have the 
> > > > > > right
> > > > > > ring to me. I'm happy for other suggestions on the parameter names.
> > > > >
> > > > > What scripts are left after we remove the analyze script, as 
> > > > > discussed in a
> > > > > different thread?
> > > >
> > > > There is still delete_old_cluster.sh.
> > >
> > > Well, that one can trivially be replaced by a printed instruction, too.
> >
> > True. On my system is it simply:
> >
> > rm -rf '/u/pgsql.old/data'
> >
> > The question is whether the user is going to record the vacuumdb and rm
> > instructions that display at the end of the pg_upgrade run, or do we
> > need to write it down for them in script files.
>
> That assumes for example that you've had no extra tablespaces defined
> in it. And it assumes your config files are actually in the same
> directory etc.
>
> Now, pg_upgrade *could* create a script that "actually works" for most
> things, since it connects to the system and could then enumerate
> things like tablespaces and config file locations, and generate a
> script that actually uses that information.
>
> But getting that to cover things like removing/disabling systemd
> services or init jobs or whatever the platform uses can quickly get
> pretty complex I think...
>
> I wonder if we should just not do it and just say "you should delete
> the old cluster". Then we can leave it up to platform integrations to
> enhance that, based on their platform knowledge (such as for example
> knowing if the platform runs systemd or not). That would leave us with
> both pg_upgrade and initdb printing instructions, and not scripts, and
> solve that part of the issue.
>
> Thinking further, there has been one other thing that we've talked
> about before, which is to have pg_upgrade either run extension
> upgrades, or generate a script that would run extension upgrades. If
> we want to do that as form of a script, then we're bringing the
> scripts back into play again... But maybe that's actually an argument
> for *not* having pg_upgrade generate that script, but instead either
> do the extension upgrades automatically, or have a separate tool that
> one can run independent of pg_upgrade...

PFA a rebased version of this patch on top of what has happened since,
and changing the pg_upgrade parameter to be --no-scripts.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 385ac25150..995d78408e 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -275,6 +275,19 @@ PostgreSQL documentation
   
  
 
+ 
+  --no-instructions
+  
+   
+By default, initdb will write instructions for how
+to start the cluster at the end of its output. This option causes
+those instructions to be left out. This is primarily intended for use
+by tools that wrap initdb in platform specific
+behavior, where those instructions are likely to be incorrect.
+   
+  
+ 
+
  
   --pwfile=filename
   
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 92e1d09a55..6e70b8514d 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -230,6 +230,20 @@ PostgreSQL documentation
   
  
 
+ 
+  --no-scripts
+  
+   
+By default, pg_upgrade will write instructions and
+scripts for how to delete the old cluster at the end of its
+output. This option causes those scripts and instructions to be left
+out. This is primarily intended for use by tools that wrap
+pg_upgrade in platform specific behavior, where
+those instructions are likely to be incorrect.
+   
+  
+ 
+
  
   -?
   --help
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ee3bfa82f4..c1f742b1b4 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -139,6 +139,7 @@ static const char *authmethodhost = NULL;
 static const char *authmethodlocal = NULL;
 static bool debug = false;

Re: [HACKERS] logical decoding of two-phase transactions

2020-11-09 Thread Amit Kapila
On Mon, Nov 9, 2020 at 1:38 PM Masahiko Sawada  wrote:
>
> On Mon, Nov 9, 2020 at 3:23 PM Peter Smith  wrote:
> >
>
> I've looked at the patches and done some tests. Here is my comment and
> question I realized during testing and reviewing.
>
> +static void
> +DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
> + xl_xact_parsed_prepare *parsed)
> +{
> +   XLogRecPtr  origin_lsn = parsed->origin_lsn;
> +   TimestampTz commit_time = parsed->origin_timestamp;
>
>  static void
>  DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
> -   xl_xact_parsed_abort *parsed, TransactionId xid)
> +   xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared)
>  {
> int i;
> +   XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
> +   TimestampTz commit_time = 0;
> +   XLogRecPtr  origin_id = XLogRecGetOrigin(buf->record);
>
> -   for (i = 0; i < parsed->nsubxacts; i++)
> +   if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
> {
> -   ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
> -  buf->record->EndRecPtr);
> +   origin_lsn = parsed->origin_lsn;
> +   commit_time = parsed->origin_timestamp;
> }
>
> In the above two changes, parsed->origin_timestamp is used as
> commit_time. But in DecodeCommit() we use parsed->xact_time instead.
> Therefore it a transaction didn't have replorigin_session_origin the
> timestamp of logical decoding out generated by test_decoding with
> 'include-timestamp' option is invalid. Is it intentional?
>

I think all three DecodePrepare/DecodeAbort/DecodeCommit should have
same handling for this with the exception that at DecodePrepare time
we can't rely on XACT_XINFO_HAS_ORIGIN but instead we need to check if
origin_timestamp is non-zero then we will overwrite commit_time with
it. Does that make sense to you?
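
In code form, the idea would be roughly the following for DecodePrepare (a
sketch only; the exact field holding the prepare time may differ in the
patch):

	/* default to the prepare record's own timestamp ... */
	TimestampTz	commit_time = parsed->xact_time;

	/* ... and prefer the origin timestamp only when one was recorded */
	if (parsed->origin_timestamp != 0)
		commit_time = parsed->origin_timestamp;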

> ---
> +   if (is_commit)
> +   txn->txn_flags |= RBTXN_COMMIT_PREPARED;
> +   else
> +   txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
> +
> +   if (rbtxn_commit_prepared(txn))
> +   rb->commit_prepared(rb, txn, commit_lsn);
> +   else if (rbtxn_rollback_prepared(txn))
> +   rb->rollback_prepared(rb, txn, commit_lsn);
>
> RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED are used only here
> and it seems to me that it's not necessarily necessary.
>

+1.

> ---
> +   /*
> +* If this is COMMIT_PREPARED and the output plugin supports
> +* two-phase commits then set the prepared flag to true.
> +*/
> +   prepared = ((info == XLOG_XACT_COMMIT_PREPARED) &&
> ctx->twophase) ? true : false;
>
> We can write instead:
>
> prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase);
>
>
> +   /*
> +* If this is ABORT_PREPARED and the output plugin supports
> +* two-phase commits then set the prepared flag to true.
> +*/
> +   prepared = ((info == XLOG_XACT_ABORT_PREPARED) &&
> ctx->twophase) ? true : false;
>
> The same is true here.
>

+1.

> ---
> 'git show --check' of v18-0002 reports some warnings.
>

I have also noticed this. Actually, I have already started making some
changes to these patches apart from what you have reported so I'll
take care of these things as well.

-- 
With Regards,
Amit Kapila.



