Re: [doc] remove reference to pg_dump pre-8.1 switch behaviour
On Mon, Nov 30, 2020 at 03:46:19PM -0500, Tom Lane wrote:
> Michael Paquier writes:
> > So this comes down to 5 items, as per the attached.  Thoughts?
>
> These items look fine to me, except this bit seems a bit awkward:
>
> +    Note that the delayed indexing technique used for GIN
> +    (see for details) makes this advice
> +    less necessary, but for very large updates it may still be best to
> +    drop and recreate the index.
>
> Less necessary than what?  Maybe instead write
>
>   When fastupdate is enabled (see ...), the penalty is much less than
>   when it is not.  But for very large updates it may still be best to
>   drop and recreate the index.

Thanks, that's indeed better.  I used your wording, looked at that
again, and applied that.
--
Michael

signature.asc Description: PGP signature
Re: autovac issue with large number of tables
On 2020/12/01 16:23, Masahiko Sawada wrote:
On Tue, Dec 1, 2020 at 1:48 PM Kasahara Tatsuhito wrote:
On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao wrote:
On 2020/11/30 10:43, Masahiko Sawada wrote:
On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito wrote:

Hi, thanks for your comments.

On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao wrote:
On 2020/11/27 18:38, Kasahara Tatsuhito wrote:
On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao wrote:
On 2020/11/26 10:41, Kasahara Tatsuhito wrote:
On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada wrote:
On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito wrote:
On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada wrote:
On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito wrote:
On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito wrote:

> I wonder if we could have table_recheck_autovac do two probes of the
> stats data.  First probe the existing stats data, and if it shows the
> table to be already vacuumed, return immediately.  If not, *then*
> force a stats re-read, and check a second time.

Does the above mean that the second and subsequent
table_recheck_autovac() will be improved to first check using the
previously refreshed statistics?  I think that certainly works.

If that's correct, I'll try to create a patch for the PoC.

I still don't know how to reproduce Jim's troubles, but I was able to
reproduce what was probably a very similar problem.

This problem seems to be more likely to occur in cases where you have
a large number of tables, i.e., a large amount of stats, and many
small tables need VACUUM at the same time.

So I followed Tom's advice and created a patch for the PoC.
This patch enables a flag in the table_recheck_autovac function to use
the existing stats next time if VACUUM (or ANALYZE) has already been
done by another worker on the check after the stats have been updated.
If the tables continue to require VACUUM after the refresh, then a
refresh will be required instead of using the existing statistics.

I did a simple test with HEAD and HEAD + this PoC patch.
The tests were conducted in two cases.
(I changed a few configurations; see the attached scripts.)

1. Normal VACUUM case
   - SET autovacuum = off
   - CREATE tables with 100 rows
   - DELETE 90 rows from each table
   - SET autovacuum = on and restart PostgreSQL
   - Measure the time it takes for all tables to be VACUUMed

2. Anti-wraparound VACUUM case
   - CREATE blank tables
   - SELECT all of these tables (to generate stats)
   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
   - Consume a lot of XIDs by using txid_current()
   - Measure the time it takes for all tables to be VACUUMed

For each test case, the following results were obtained by changing
the autovacuum_max_workers parameter to 1, 2, 3 (default), 5 and 10,
and by changing the number of tables to 1000, 5000, 1 and 2.

Due to the poor VM environment (2 VCPU/4 GB), the results are a little
unstable, but I think it's enough to show a trend.

===
[1. Normal VACUUM case]
tables:1000
  autovacuum_max_workers 1:   (HEAD) 20 sec   VS  (with patch) 20 sec
  autovacuum_max_workers 2:   (HEAD) 18 sec   VS  (with patch) 16 sec
  autovacuum_max_workers 3:   (HEAD) 18 sec   VS  (with patch) 16 sec
  autovacuum_max_workers 5:   (HEAD) 19 sec   VS  (with patch) 17 sec
  autovacuum_max_workers 10:  (HEAD) 19 sec   VS  (with patch) 17 sec

tables:5000
  autovacuum_max_workers 1:   (HEAD) 77 sec   VS  (with patch) 78 sec
  autovacuum_max_workers 2:   (HEAD) 61 sec   VS  (with patch) 43 sec
  autovacuum_max_workers 3:   (HEAD) 38 sec   VS  (with patch) 38 sec
  autovacuum_max_workers 5:   (HEAD) 45 sec   VS  (with patch) 37 sec
  autovacuum_max_workers 10:  (HEAD) 43 sec   VS  (with patch) 35 sec

tables:1
  autovacuum_max_workers 1:   (HEAD) 152 sec  VS  (with patch) 153 sec
  autovacuum_max_workers 2:   (HEAD) 119 sec  VS  (with patch) 98 sec
  autovacuum_max_workers 3:   (HEAD) 87 sec   VS  (with patch) 78 sec
  autovacuum_max_workers 5:   (HEAD) 100 sec  VS  (with patch) 66 sec
  autovacuum_max_workers 10:  (HEAD) 97 sec   VS  (with patch) 56 sec

tables:2
  autovacuum_max_workers 1:   (HEAD) 338 sec  VS  (with patch) 339 sec
  autovacuum_max_workers 2:   (HEAD) 231 sec  VS  (with patch) 229 sec
  autovacuum_max_workers 3:   (HEAD) 220 sec  VS  (with patch) 191 sec
  autovacuum_max_workers 5:   (HEAD) 234 sec  VS  (with patch) 147 sec
  autovacuum_max_workers 10:  (HEAD) 320 sec  VS  (with patch) 113 sec

[2. Anti-wraparound VACUUM case]
tables:1000
  autovacuum_max_workers 1:   (HEAD) 19 sec   VS  (with patch) 18 sec
  autovacuum_max_workers 2:   (HEAD) 14 sec   VS  (with patch) 15 sec
  autovacuum_max_workers 3:   (HEAD) 14 sec   VS  (with patch) 14 sec
  autovacuum_max_workers 5:   (HEAD) 14 sec   VS  (with p
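For illustration only, Tom's two-probe idea that the PoC implements can be sketched in Python pseudocode (this is not the actual C patch; StatsCollector and the dict-based snapshot are invented stand-ins for the stats collector machinery):

```python
# Sketch of the two-probe recheck: probe cheap cached stats first, and
# only pay for a forced stats re-read when the cached data still says
# the table needs a VACUUM.

class StatsCollector:
    def __init__(self):
        self.data = {}   # table name -> needs_vacuum (bool)
        self.reads = 0   # counts expensive snapshot reloads

    def reload(self):
        self.reads += 1
        return dict(self.data)  # fresh snapshot of the stats


def table_recheck_autovac(table, cached_snapshot, collector):
    """Return (needs_vacuum, snapshot), doing at most one forced re-read."""
    # Probe 1: cheap check against the cached (possibly stale) snapshot.
    # If another worker already vacuumed the table, even stale stats say so.
    if not cached_snapshot.get(table, True):
        return False, cached_snapshot
    # Probe 2: only now force the expensive stats re-read.
    fresh = collector.reload()
    return fresh.get(table, True), fresh
```

The point of the flag in the PoC is exactly this: workers rechecking many small tables mostly hit probe 1 and skip the repeated full stats refresh, which is what dominates with a large number of tables.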
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Nov 30, 2020 at 7:17 PM Amit Kapila wrote:
> On Mon, Nov 30, 2020 at 2:36 PM Ajin Cherian wrote:
>
> Sure, but you can see in your example above it got skipped due to
> start_decoding_at, not due to DecodingContextReady.  So, the problem
> as mentioned by me previously was how we distinguish those cases,
> because it can skip due to start_decoding_at during restart as well,
> when we would have already sent the prepare to the subscriber.
>
> One idea could be that the subscriber skips the transaction if it
> sees the transaction is already prepared.

To skip it, we need to send the GID in the begin message and then, on
the subscriber side, check if the prepared xact already exists; if so,
set a flag.  The flag needs to be set in begin/start_stream and reset in
stop_stream/commit/abort.  Using the flag, we can skip the entire
contents of the prepared xact.  On the ReorderBuffer side, we also need
to get and set the GID in the txn even when we skip it, because we need
to send the same at commit time.  In this solution, we won't be able to
send it during a normal start_stream because by that time we won't know
the GID, and I think that won't be required.  Note that this is only
required when we skipped sending the prepare; otherwise, we just need to
send Commit-Prepared at commit time.

Another way to solve this problem on the publisher side is to maintain,
in some file at the slot level, whether we have sent the prepare for a
particular txn.  Basically, after sending the prepare, we need to update
the slot information on disk to indicate that the particular GID was
sent (we can probably store the GID and the LSN of the Prepare).  Then,
next time, whenever we have to skip a prepare for whatever reason, we
can check the existence of persistent information on disk for that GID:
if it exists, then we need to send just Commit Prepared, otherwise the
entire transaction.

We can remove this information during or after CheckPointSnapBuild;
basically, we can remove the information for all GIDs that are after the
cutoff LSN computed via ReplicationSlotsComputeLogicalRestartLSN.  Now,
we could even think of removing this information after Commit Prepared,
but I am not sure if that is correct, because we can't lose this
information unless start_decoding_at (or restart_lsn) has moved past the
commit LSN.

Now, to persist this information, there are multiple possibilities:
(a) maintain a flexible array of GIDs at the end of
ReplicationSlotPersistentData, (b) have a separate state file per slot
for prepared xacts, (c) have a separate state file for each prepared
xact per slot.  With (a), during an upgrade from the previous version
there could be a problem because the previous data won't match the new
data, but I am not sure if we keep slot info intact after an upgrade.  I
think (c) would be simplest, but OTOH, having many such files (in case
there are many prepared xacts) per slot might not be a good idea.

One more thing that needs to be thought about: when we are sending the
entire xact at commit time, will we send the prepare separately?
Because, if we don't send it separately, then later allowing the PREPARE
on the master to wait for the prepare via subscribers won't be possible.

Thoughts?

--
With Regards,
Amit Kapila.
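To make the subscriber-side idea concrete, here is an illustrative Python sketch (not PostgreSQL code; the ApplyWorker class and its callback names are invented stand-ins for the apply-worker logic) of the skip flag set in begin/start_stream and reset afterwards:

```python
# Sketch of the "skip already-prepared transaction" idea: the publisher
# sends the GID in the begin message; the apply worker sets a skip flag
# if that GID is already prepared locally, and ignores the transaction's
# contents until the flag is reset.

class ApplyWorker:
    def __init__(self, prepared_gids):
        self.prepared = set(prepared_gids)  # GIDs already prepared locally
        self.skip = False
        self.applied = []

    def begin_prepare(self, gid):
        # Set in begin/start_stream: skip if this xact is already prepared.
        self.skip = gid in self.prepared

    def change(self, row):
        if not self.skip:
            self.applied.append(row)        # apply only non-skipped changes

    def prepare(self, gid):
        if not self.skip:
            self.prepared.add(gid)          # record the newly prepared xact
        self.skip = False                   # reset in stop_stream/commit/abort

    def commit_prepared(self, gid):
        self.prepared.discard(gid)          # finish the two-phase commit
```

With this, replaying a prepare that was already delivered before a restart is harmless: the contents are skipped and only the later Commit Prepared takes effect.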
Re: PG vs LLVM 12 on seawasp, next round
Hi,

On 2020-12-01 17:35:49 +1300, Thomas Munro wrote:
> Since seawasp's bleeding edge LLVM installation moved to "trunk
> 20201114 c8f4e06b 12.0.0" ~16 days ago, it has been red.  Further
> updates didn't help it and it's now on "trunk 20201127 6ee22ca6
> 12.0.0".  I wonder if there is something in Fabien's scripting that
> needs to be tweaked, perhaps a symlink name or similar.  I don't
> follow LLVM development but I found my way to a commit[1] around the
> right time that mentions breaking up the OrcJIT library, so *shrug*
> maybe that's a clue.
>
> +ERROR: could not load library
> "/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so":
> libLLVMOrcJIT.so.12git: cannot open shared object file: No such file
> or directory

It's a change in how LLVM dependencies are declared internally.
Previously the 'native' component was - unintentionally - transitively
included via the 'orcjit' component, but now that's not the case
anymore.  The attached patch should fix it, I think?
Greetings,

Andres Freund

diff --git i/config/llvm.m4 w/config/llvm.m4
index a5f4a9af448..3a75cd8b4df 100644
--- i/config/llvm.m4
+++ w/config/llvm.m4
@@ -76,6 +76,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
         debuginfodwarf) pgac_components="$pgac_components $pgac_component";;
         orcjit) pgac_components="$pgac_components $pgac_component";;
         passes) pgac_components="$pgac_components $pgac_component";;
+        native) pgac_components="$pgac_components $pgac_component";;
         perfjitevents) pgac_components="$pgac_components $pgac_component";;
       esac
     done;
diff --git i/configure w/configure
index ffcd0c5b1d4..2a03ed0a018 100755
--- i/configure
+++ w/configure
@@ -5168,6 +5168,7 @@ fi
         debuginfodwarf) pgac_components="$pgac_components $pgac_component";;
         orcjit) pgac_components="$pgac_components $pgac_component";;
         passes) pgac_components="$pgac_components $pgac_component";;
+        native) pgac_components="$pgac_components $pgac_component";;
         perfjitevents) pgac_components="$pgac_components $pgac_component";;
       esac
     done;
Re: TAP test utility module 'PG_LSN.pm'
On Tue, Dec 01, 2020 at 03:14:06PM +0900, Fujii Masao wrote:
> You mean the same function as the commit 9bae7e4cde provided?

Completely forgot about this one, thanks.  /me hides
--
Michael
Re: TAP test utility module 'PG_LSN.pm'
On 2020/12/01 14:58, Michael Paquier wrote:
> On Tue, Dec 01, 2020 at 12:03:41PM +0800, Craig Ringer wrote:
> > I'd like to share the attached PG_LSN.pm module that I use when
> > writing TAP tests.  I suggest that it be considered for inclusion
> > in core.
> >
> > It defines a Perl datatype PG_LSN with operator support, so you can
> > write things like
> >
> >     cmp_ok($got_lsn, "<", $expected_lsn, "testname")
> >
> > in TAP tests and get sensible results without any concern for LSN
> > representation details, locale, etc.  You can subtract LSNs to get
> > a byte difference too.
>
> In my experience, any TAP tests making use of LSN operations can just
> let the backend do the maths, so I am not much a fan of duplicating
> that stuff in a perl module.

Agreed.

> Wouldn't it be better to add an equivalent of your add() function in
> the backend then?

You mean the same function as the commit 9bae7e4cde provided?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: TAP test utility module 'PG_LSN.pm'
On Tue, 1 Dec 2020 at 13:58, Michael Paquier wrote:
> On Tue, Dec 01, 2020 at 12:03:41PM +0800, Craig Ringer wrote:
> > I'd like to share the attached PG_LSN.pm module that I use when
> > writing TAP tests.  I suggest that it be considered for inclusion
> > in core.
> >
> > It defines a Perl datatype PG_LSN with operator support, so you can
> > write things like
> >
> >     cmp_ok($got_lsn, "<", $expected_lsn, "testname")
> >
> > in TAP tests and get sensible results without any concern for LSN
> > representation details, locale, etc.  You can subtract LSNs to get
> > a byte difference too.
>
> In my experience, any TAP tests making use of LSN operations can just
> let the backend do the maths, so I am not much a fan of duplicating
> that stuff in a perl module.  Wouldn't it be better to add an
> equivalent of your add() function in the backend then?  That could
> also get tested in the main regression test suite.

I find it convenient not to have as much log spam.  Also, IIRC I needed
it at some point where the target backend(s) were down while I was
testing something related to a backend that was still in recovery and
not yet available for querying.

I don't really mind though, I'm just sharing what I have found useful.
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Mon, Nov 30, 2020 at 11:43:08PM -0600, Justin Pryzby wrote:
> I eyeballed the patch and rebased the rest of the series on top of it
> to play with.  Looks fine - thanks.

Cool, thanks.

> FYI, the commit messages have the proper "author" for attribution.  I
> proposed and implemented the grammar changes in 0002, and implemented
> INDEX_TABLESPACE, but I'm a reviewer for the main patches.

Well, my impression is that both of you kept exchanging patches,
touching and reviewing each other's patches (note that Alexey has also
sent a rebase of 0002 just yesterday), so I think that it is fair to say
that both of you should be listed as authors and credited as such in the
release notes for this one.
--
Michael
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila wrote:
> On Mon, Nov 30, 2020 at 3:14 AM Noah Misch wrote:
> >
> > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > Thanks, I have pushed the last patch.  Let's wait for a day or so
> > > to see the buildfarm reports.
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > failed the new 015_stream.pl test with the subscriber looping like
> > this:
> >
> > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication
> > apply worker for subscription "tap_sub" has started
> > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open
> > temporary file "16393-510.changes.0" from BufFile
> > "16393-510.changes": No such file or directory
> > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication
> > apply worker for subscription "tap_sub" has started
> > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker
> > "logical replication worker" (PID 13959252) exited with exit code 1
> > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open
> > temporary file "16393-510.changes.0" from BufFile
> > "16393-510.changes": No such file or directory
> > ...
> >
> > What happened there?
>
> What is going on here is that the expected streaming file is missing.
> Normally, the first time we send a stream of changes (some percentage
> of transaction changes) we create the streaming file, and then in
> respective streams we just keep on writing into that file the changes
> we receive from the publisher, and on commit, we read that file and
> apply all the changes.
>
> The above kind of error can happen due to the following reasons:
> (a) the first time we sent the stream and created the file, and that
> got removed before the second stream reached the subscriber; (b) from
> the publisher side, we never sent the indication that it is the first
> stream and the subscriber directly tries to open the file, thinking it
> is already there.
>
> Now, the publisher and subscriber logs don't directly indicate any of
> the above problems, but I have some observations.
>
> The subscriber log indicates that before the apply worker exits due
> to an error the new apply worker gets started.  We delete the
> streaming-related temporary files on proc_exit, so one possibility
> could have been that the new apply worker has created the streaming
> file which the old apply worker has removed, but that is not possible
> because we always create these temp files by having the procid in the
> path.

Yeah, and I have tried to test along this line: basically, after the
streaming has started I have set binary=on.  Now, using gdb, I have made
the worker wait before it deletes the temp file, and meanwhile the new
worker started and it worked properly as expected.

> The other thing I observed in the code is that we can mark the
> transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> stream a transaction that has no changes the first time we try to
> stream the transaction.  This would lead to symptom (b), because the
> second time, when there are more changes, we would stream the changes
> as it is not the first time.  However, this shouldn't happen because
> we never pick up a transaction to stream which has no changes.  I can
> try to fix the code here such that we don't mark the transaction as
> streamed unless we have streamed at least one change, but I don't see
> how it is related to this particular test failure.

Yeah, this can be improved, but as you mentioned, we never select an
empty transaction for streaming, so this case should not occur.  I will
perform some testing/review around this and report.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted
On Tue, Dec 01, 2020 at 04:06:48PM +1300, Thomas Munro wrote:
> I felt on balance it was a "bug", since it causes operational
> difficulties for people and was clearly not our intended behaviour,
> and I announced this intention 6 weeks ago.

Oops, sorry for missing this discussion for such a long time :/

> Of course I'll be happy to revert it from the back-branches if that's
> the consensus.  Any other opinions?

If there are no other opinions, I am also fine to rely on your judgment.
--
Michael
Re: Improving spin-lock implementation on ARM.
Alexander Korotkov writes:
> 2) None of the patches considered in this thread give a clear
> advantage for PostgreSQL built with LSE.

Yeah, I think so.

> To further confirm this let's wait for Kunpeng 920 tests by Krunal
> Bauskar and Amit Khandekar.  Also it would be nice if someone will
> run benchmarks similar to [1] on Apple M1.

I did what I could in this department.  It's late and I'm not going to
have time to run read/write benchmarks before bed, but here are some
results for the "pgbench -S" cases.  I tried to match your testing
choices, but could not entirely:

* Configure options are --enable-debug, --disable-cassert, no other
special configure options or CFLAG choices.

* I have not been able to find a way to make Apple's compiler not use
the LSE spinlock instructions, so all of these correspond to your LSE
cases.

* I used shared_buffers = 1GB, because this machine only has 16GB RAM so
32GB is clearly out of reach.  Also I used pgbench scale factor 100 not
1000.  Since we're trying to measure contention effects not I/O speed, I
don't think a huge test case is appropriate.

* I still haven't gotten pgbench to work with -j settings above 128, so
these runs use -j equal to half -c.  Shouldn't really affect conclusions
about scaling.  (BTW, I see a similar limitation on macOS Catalina
x86_64, so whatever that is, it's not new.)

* Otherwise, the first plot shows the median of three results from
"pgbench -S -M prepared -T 120 -c $n -j $j", as you had it.  The
right-hand plot shows all three of the values in yerrorbars format, just
to give a sense of the noise level.

Clearly, there is something weird going on at -c 4.  There's a cluster
of results around 180K TPS, and another cluster around 210-220K TPS, and
nothing in between.  I suspect that the scheduler is doing something
bogus with sometimes putting pgbench onto the slow cores.

Anyway, I believe that the apparent gap between HEAD and the other
curves at -c 4 is probably an artifact: HEAD had two 180K-ish results
and one 220K-ish result, while the other curves had the reverse, so the
medians are different but there's probably not any non-chance effect
there.

Bottom line is that these patches don't appear to do much of anything on
M1, as you surmised.

regards, tom lane
Re: TAP test utility module 'PG_LSN.pm'
On Tue, Dec 01, 2020 at 12:03:41PM +0800, Craig Ringer wrote:
> I'd like to share the attached PG_LSN.pm module that I use when
> writing TAP tests.  I suggest that it be considered for inclusion in
> core.
>
> It defines a Perl datatype PG_LSN with operator support, so you can
> write things like
>
>     cmp_ok($got_lsn, "<", $expected_lsn, "testname")
>
> in TAP tests and get sensible results without any concern for LSN
> representation details, locale, etc.  You can subtract LSNs to get a
> byte difference too.

In my experience, any TAP tests making use of LSN operations can just
let the backend do the maths, so I am not much a fan of duplicating
that stuff in a perl module.  Wouldn't it be better to add an
equivalent of your add() function in the backend then?  That could
also get tested in the main regression test suite.
--
Michael
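For readers unfamiliar with the representation issue being discussed: a PostgreSQL LSN prints as two hex halves of a 64-bit byte position, "XXXXXXXX/XXXXXXXX", which is why naive string comparison in a TAP test is wrong.  A minimal Python sketch of what the proposed PG_LSN.pm provides (parse, compare, byte difference); the function names here are illustrative, not the module's API:

```python
# Parse the textual "hi/lo" LSN form into an integer byte position, after
# which comparison and byte-difference are ordinary integer arithmetic.

def parse_lsn(text):
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(value):
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"

def lsn_diff(a, b):
    """Byte difference between two textual LSNs."""
    return parse_lsn(a) - parse_lsn(b)
```

Comparing "1/0" with "0/FFFFFFFF" as strings would order them backwards; parsed as integers, "1/0" is exactly one byte past "0/FFFFFFFF".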
RE: [PATCH] Keeps tracking the uniqueness with UniqueKey
Hi

I looked into the patch again and have some comments.

1.
+	Size		oid_cmp_len = sizeof(Oid) * ind1->ncolumns;
+
+	return ind1->ncolumns == ind2->ncolumns &&
+		ind1->unique == ind2->unique &&
+		memcmp(ind1->indexkeys, ind2->indexkeys, sizeof(int) * ind1->ncolumns) == 0 &&
+		memcmp(ind1->opfamily, ind2->opfamily, oid_cmp_len) == 0 &&
+		memcmp(ind1->opcintype, ind2->opcintype, oid_cmp_len) == 0 &&
+		memcmp(ind1->sortopfamily, ind2->sortopfamily, oid_cmp_len) == 0 &&
+		equal(get_tlist_exprs(ind1->indextlist, true),
+			  get_tlist_exprs(ind2->indextlist, true));

The length of sortopfamily, opfamily and opcintype seems to be
->nkeycolumns, not ->ncolumns.  I checked the function
get_relation_info, where the IndexOptInfo is initialized.  (If there are
more places where the length can change, please correct me.)

2.
+	COPY_SCALAR_FIELD(ncolumns);
+	COPY_SCALAR_FIELD(nkeycolumns);
+	COPY_SCALAR_FIELD(unique);
+	COPY_SCALAR_FIELD(immediate);
+	/* We just need to know if it is NIL or not */
+	COPY_SCALAR_FIELD(indpred);
+	COPY_SCALAR_FIELD(predOK);
+	COPY_POINTER_FIELD(indexkeys, from->ncolumns * sizeof(int));
+	COPY_POINTER_FIELD(indexcollations, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(opfamily, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(opcintype, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(sortopfamily, from->ncolumns * sizeof(Oid));
+	COPY_NODE_FIELD(indextlist);

The same as 1: should use nkeycolumns, if I am right.

3.
+	foreach(lc, newnode->indextlist)
+	{
+		TargetEntry *tle = lfirst_node(TargetEntry, lc);
+		/* Index on expression is ignored */
+		Assert(IsA(tle->expr, Var));
+		tle->expr = (Expr *) find_parent_var(appinfo, (Var *) tle->expr);
+		newnode->indexkeys[idx] = castNode(Var, tle->expr)->varattno;
+		idx++;
+	}

The counter variable 'idx' can be replaced by foreach_current_index().

Best regards,
houzj
Re: proposal: unescape_text function
On Mon, Nov 30, 2020 at 22:56, Pavel Stehule wrote:
> On Mon, Nov 30, 2020 at 22:15, Pavel Stehule wrote:
> > On Mon, Nov 30, 2020 at 14:14, Peter Eisentraut
> > <peter.eisentr...@enterprisedb.com> wrote:
> > > On 2020-11-29 18:36, Pavel Stehule wrote:
> > > > > I don't really get the point of this function.  There is
> > > > > AFAICT no function to produce this escaped format, and it's
> > > > > not a recognized interchange format.  So under what
> > > > > circumstances would one need to use this?
> > > >
> > > > Some corporate data can be in CSV format with escaped unicode
> > > > characters.  Without this function it is not possible to decode
> > > > these files without an external application.
> > >
> > > I would like some supporting documentation on this.  So far we
> > > only have one stackoverflow question, and then this
> > > implementation, and they are not even the same format.  My worry
> > > is that if there is no precise specification, then people are
> > > going to want to add things in the future, and there will be no
> > > way to analyze such requests in a principled way.
> >
> > I checked this and it is "prefix backslash-u hex" used by Java,
> > JavaScript or RTF -
> > https://billposer.org/Software/ListOfRepresentations.html
> >
> > In some languages (Python), there is a decoder "unicode-escape".
> > Java has a method escapeJava for conversion from unicode to ascii.
> > I can imagine that these data come from Java systems exported to
> > 8bit strings - so this implementation can be accepted as
> > referential.  This format is used by
> > https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html
> > too.
> >
> > Postgres can decode this format too, and the patch is based on the
> > Postgres implementation.  I just implemented a different interface.
> >
> > Currently the decode function does only the text->bytea
> > transformation.  Maybe more generic functions "decode_text" and
> > "encode_text" for similar cases would be better (here we need a
> > text->text transformation).  But it looks like overengineering now.
> >
> > Maybe we introduce a new encoding "ascii" and we can implement new
> > conversions "ascii_to_utf8" and "utf8_to_ascii".  It looks like the
> > cleanest solution.  What do you think about it?
>
> a better name for the new encoding can be "unicode-escape" than
> "ascii".  We use the "to_ascii" function for a different use case.
>
>     set client_encoding to unicode-escape;
>     copy tab from xxx;
>     ...
>
> but it doesn't help when only a few columns from the table are in
> unicode-escape format.

Probably the most complete solution consists of two steps:

1. introducing a new encoding - "ascii_unicode_escape" - with related
conversions

2. introducing two new functions - text_escape and text_unescape - with
two parameters: the source text and a conversion name

    select text_convert_to('Тимати', 'ascii_unicode_escape')
    \u0422\u0438\u043c\u0430\u0442\u0438 .. result is text

    select text_convert_from('\u0422\u0438\u043c\u0430\u0442\u0438', 'ascii_unicode_escape')
    ┌──────────┐
    │ ?column? │
    ╞══════════╡
    │ Тимати   │
    └──────────┘
    (1 row)

Regards,

Pavel
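For reference, the "prefix backslash-u hex" format under discussion is exactly what Python's built-in "unicode-escape" codec (mentioned in the thread) handles; a minimal sketch, with function names chosen only for illustration:

```python
# Decode/encode the \uXXXX ("backslash-u hex") format using Python's
# unicode_escape codec.  The input is restricted to ASCII first, since
# unicode_escape operates on byte strings.

def unescape_text(s):
    """Decode \\uXXXX escapes to text, e.g. '\\u0422' -> 'Т'."""
    return s.encode("ascii").decode("unicode_escape")

def escape_text(s):
    """Encode text into ASCII-only \\uXXXX escapes."""
    return s.encode("unicode_escape").decode("ascii")
```

This reproduces the thread's example: the escaped form '\u0422\u0438\u043c\u0430\u0442\u0438' round-trips to 'Тимати' and back.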
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Tue, Dec 01, 2020 at 11:46:55AM +0900, Michael Paquier wrote:
> On Mon, Nov 30, 2020 at 05:12:42PM +0300, Alexey Kondratov wrote:
> > Thanks.  I have rebased the remaining patches on top of 873ea9ee to
> > use 'utility_option_list' instead of 'common_option_list'.
>
> Thanks, that helps a lot.  I have gone through 0002, and tweaked it
> as the attached (note that this patch is also interesting for another
> thing in development: backend-side reindex filtering of
> collation-sensitive indexes).  Does that look right to you?

I eyeballed the patch and rebased the rest of the series on top of it to
play with.  Looks fine - thanks.

FYI, the commit messages have the proper "author" for attribution.  I
proposed and implemented the grammar changes in 0002, and implemented
INDEX_TABLESPACE, but I'm a reviewer for the main patches.

--
Justin
RE: [POC] Fast COPY FROM command for the table with foreign partitions
From: Amit Langote > Andrey's original patch had the flag to, as I understand it, make the > partitioning case work correctly. When inserting into a > non-partitioned table, there's only one relation to care about. In > that case, CopyFrom() can use either the new COPY interface or the > INSERT interface for the entire operation when talking to a foreign > target relation's FDW driver. With partitions, that has to be > considered separately for each partition. What complicates the matter > further is that while the original target relation (the root > partitioned table in the partitioning case) is fully initialized in > CopyFrom(), partitions are lazily initialized by ExecFindPartition(). Yeah, I felt it a bit confusing to see the calls to Begin/EndForeignInsert() in both CopyFrom() and ExecInitRoutingInfo(). > Note that the initialization of a given target relation can also > optionally involve calling the FDW to perform any pre-COPY > initializations. So if a given partition is a foreign table, whether > the copy operation was initialized using the COPY interface or the > INSERT interface is determined away from CopyFrom(). Andrey created > ri_usesMultiInsert to remember which was used so that CopyFrom() can > use the correct interface during the subsequent interactions with the > partition's driver. > > Now, it does not seem outright impossible to do this without the flag, > but maybe Andrey thinks it is good for readability? If it is > confusing from a modularity standpoint, maybe we should rethink that. > That said, I still think that there should be a way for CopyFrom() to > tell ExecFindPartition() which FDW interface to initialize a given > foreign table partition's copy operation with -- COPY if the copy > allows multi-insert, INSERT if not. Maybe the multi_insert parameter > I mentioned earlier would serve that purpose. I agree with your idea of adding multi_insert argument to ExecFindPartition() to request a multi-insert-capable partition. 
At first, I thought ExecFindPartition() was used for all operations (insert/delete/update/select), so I found it odd to add a multi_insert argument. But ExecFindPartition() is used only for insert, so a multi_insert argument seems okay. Regards Takayuki Tsunakawa
Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2
On Mon, Nov 30, 2020 at 02:29:29PM +0100, Daniel Gustafsson wrote: > Yeah, that's along the lines of what I was thinking of. Hmm. I have looked at that, and first thought about having a reference to the resowner directly in pg_cryptohash_ctx, but that's not a good plan for two reasons: - common/cryptohash.h would get knowledge of that, requiring bundling in a bunch of dependencies. - There is no need for that in the non-OpenSSL case. So, instead, I have been thinking about using an extra context layer only for cryptohash_openssl.c with a structure saved as pg_cryptohash_context->data that stores the information about EVP_MD_CTX* and the resource owner. Then, I was thinking about storing pg_cryptohash_ctx directly in the resowner EVP array and just calling pg_cryptohash_free() from resowner.c without the need of an extra routine. I have not tested this idea but that should work. What's your take? In parallel, I have spent more time today polishing and reviewing 0001 (indented, adjusted a couple of areas and also added brackets and extra comments as you suggested) and tested it on Linux and Windows, with and without OpenSSL down to 1.0.1, the oldest version supported on HEAD. So I'd like to apply the attached first and sort out the resowner stuff as a next step. -- Michael From 5ffd653336f1c41043635fad3618cf0bae7b1cec Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 1 Dec 2020 13:21:44 +0900 Subject: [PATCH v6] Rework SHA2 and crypto hash APIs This will make it easier to switch to EVP for the OpenSSL SHA2 layer. Note that the layer introduced here is generalized for the purpose of a future integration with HMAC, MD5, and even more.
--- src/include/common/checksum_helper.h | 13 +- src/include/common/cryptohash.h | 40 src/include/common/scram-common.h | 17 +- src/include/common/sha2.h | 89 +--- src/include/replication/backup_manifest.h | 3 +- src/backend/libpq/auth-scram.c| 113 ++ src/backend/replication/backup_manifest.c | 25 ++- src/backend/replication/basebackup.c | 24 ++- src/backend/utils/adt/cryptohashes.c | 53 +++-- src/common/Makefile | 6 +- src/common/checksum_helper.c | 79 +-- src/common/cryptohash.c | 190 + src/common/cryptohash_openssl.c | 197 ++ src/common/scram-common.c | 173 ++- src/common/sha2.c | 23 +- .../common/sha2.h => common/sha2_int.h} | 38 +--- src/common/sha2_openssl.c | 102 - src/bin/pg_verifybackup/parse_manifest.c | 15 +- src/bin/pg_verifybackup/pg_verifybackup.c | 24 ++- src/interfaces/libpq/fe-auth-scram.c | 118 ++- contrib/pgcrypto/internal-sha2.c | 188 - src/tools/msvc/Mkvcbuild.pm | 3 +- src/tools/pgindent/typedefs.list | 2 + 23 files changed, 955 insertions(+), 580 deletions(-) create mode 100644 src/include/common/cryptohash.h create mode 100644 src/common/cryptohash.c create mode 100644 src/common/cryptohash_openssl.c copy src/{include/common/sha2.h => common/sha2_int.h} (73%) delete mode 100644 src/common/sha2_openssl.c diff --git a/src/include/common/checksum_helper.h b/src/include/common/checksum_helper.h index 48b0745dad..b07a34e7e4 100644 --- a/src/include/common/checksum_helper.h +++ b/src/include/common/checksum_helper.h @@ -14,6 +14,7 @@ #ifndef CHECKSUM_HELPER_H #define CHECKSUM_HELPER_H +#include "common/cryptohash.h" #include "common/sha2.h" #include "port/pg_crc32c.h" @@ -41,10 +42,10 @@ typedef enum pg_checksum_type typedef union pg_checksum_raw_context { pg_crc32c c_crc32c; - pg_sha224_ctx c_sha224; - pg_sha256_ctx c_sha256; - pg_sha384_ctx c_sha384; - pg_sha512_ctx c_sha512; + pg_cryptohash_ctx *c_sha224; + pg_cryptohash_ctx *c_sha256; + pg_cryptohash_ctx *c_sha384; + pg_cryptohash_ctx *c_sha512; } pg_checksum_raw_context; /* @@ -66,8 +67,8 @@ 
typedef struct pg_checksum_context extern bool pg_checksum_parse_type(char *name, pg_checksum_type *); extern char *pg_checksum_type_name(pg_checksum_type); -extern void pg_checksum_init(pg_checksum_context *, pg_checksum_type); -extern void pg_checksum_update(pg_checksum_context *, const uint8 *input, +extern int pg_checksum_init(pg_checksum_context *, pg_checksum_type); +extern int pg_checksum_update(pg_checksum_context *, const uint8 *input, size_t len); extern int pg_checksum_final(pg_checksum_context *, uint8 *output); diff --git a/src/include/common/cryptohash.h b/src/include/common/cryptohash.h new file mode 100644 index 00..0e4a6631a3 --- /dev/null +++ b/src/include/common/cryptohash.h @@ -0,0 +1,40 @@ +/*- + * + * cryptohash.h +
Re: Add statistics to pg_stat_wal view for wal related parameter tuning
On 2020/11/26 16:07, Masahiro Ikeda wrote: On 2020-11-25 20:19, Fujii Masao wrote: On 2020/11/19 16:31, Masahiro Ikeda wrote: On 2020-11-17 11:46, Fujii Masao wrote: On 2020/11/16 16:35, Masahiro Ikeda wrote: On 2020-11-12 14:58, Fujii Masao wrote: On 2020/11/06 10:25, Masahiro Ikeda wrote: On 2020-10-30 11:50, Fujii Masao wrote: On 2020/10/29 17:03, Masahiro Ikeda wrote: Hi, Thanks for your comments and advice. I updated the patch. On 2020-10-21 18:03, Kyotaro Horiguchi wrote: At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda wrote in On 2020-10-20 12:46, Amit Kapila wrote: > I see that we also need to add extra code to capture these stats (some > of which is in performance-critical path especially in > XLogInsertRecord) which again makes me a bit uncomfortable. It might > be that it is all fine as it is very important to collect these stats > at cluster-level in spite that the same information can be gathered at > statement-level to help customers but I don't see a very strong case > for that in your proposal. We should avoid that duplication as possible even if the both number are important. Also about performance, I thought there are few impacts because it increments stats in memory. If I can implement to reuse pgWalUsage's value which already collects these stats, there is no impact in XLogInsertRecord. For example, how about pg_stat_wal() calculates the accumulated value of wal_records, wal_fpi, and wal_bytes to use pgWalUsage's value? I don't think that works, but it would work that pgstat_send_wal() takes the difference of that values between two successive calls. WalUsage prevWalUsage; ... pgstat_send_wal() { .. /* fill in some values using pgWalUsage */ WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes; WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records; WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi; ... 
pgstat_send(&WalStats, sizeof(WalStats)); /* remember the current numbers */ prevWalUsage = pgWalUsage; Thanks for Horiguchi-san's advice, I changed to reuse pgWalUsage which is already defined and eliminates the extra overhead. + /* fill in some values using pgWalUsage */ + WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes; + WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records; + WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi; It's better to use WalUsageAccumDiff() here? Yes, thanks. I fixed it. prevWalUsage needs to be initialized with pgWalUsage? + if (AmWalWriterProcess()){ + WalStats.m_wal_write_walwriter++; + } + else + { + WalStats.m_wal_write_backend++; + } I think that it's better not to separate m_wal_write_xxx into two for walwriter and other processes. Instead, we can use one m_wal_write_xxx counter and make pgstat_send_wal() send also the process type to the stats collector. Then the stats collector can accumulate the counters per process type if necessary. If we adopt this approach, we can easily extend pg_stat_wal so that any fields can be reported per process type. I'll remove the above source code because these counters are not useful. On 2020-10-30 12:00, Fujii Masao wrote: On 2020/10/20 11:31, Masahiro Ikeda wrote: Hi, I think we need to add some statistics to pg_stat_wal view. Although there are some parameter related WAL, there are few statistics for tuning them. I think it's better to provide the following statistics. Please let me know your comments. ``` postgres=# SELECT * from pg_stat_wal; -[ RECORD 1 ]---+-- wal_records | 2000224 wal_fpi | 47 wal_bytes | 248216337 wal_buffers_full | 20954 wal_init_file | 8 wal_write_backend | 20960 wal_write_walwriter | 46 wal_write_time | 51 wal_sync_backend | 7 wal_sync_walwriter | 8 wal_sync_time | 0 stats_reset | 2020-10-20 11:04:51.307771+09 ``` 1. 
Basic statistics of WAL activity
- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand DB performance, first, we will check the performance trends for the entire database instance. For example, if the number of wal_fpi becomes higher, users may tune "wal_compression", "checkpoint_timeout" and so on. Although users can check the above statistics via EXPLAIN, auto_explain, autovacuum and pg_stat_statements now, if users want to see the performance trends for the entire database, they must recalculate the statistics. I think it is useful to add the sum of the basic statistics.

2. WAL segment file creation
- wal_init_file: Total number of WAL segment files created.

Creating a new WAL file may have an impact on the performance of a write-heavy workload
Re: autovac issue with large number of tables
Hi, On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao wrote: > > > > On 2020/11/30 10:43, Masahiko Sawada wrote: > > On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito > > wrote: > >> > >> Hi, Thanks for you comments. > >> > >> On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao > >> wrote: > >>> > >>> > >>> > >>> On 2020/11/27 18:38, Kasahara Tatsuhito wrote: > Hi, > > On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao > wrote: > > > > > > > > On 2020/11/26 10:41, Kasahara Tatsuhito wrote: > >> On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada > >> wrote: > >>> > >>> On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito > >>> wrote: > > Hi, > > On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada > wrote: > > > > On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito > > wrote: > >> > >> Hi, > >> > >> On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito > >> wrote: > I wonder if we could have table_recheck_autovac do two probes of > the stats > data. First probe the existing stats data, and if it shows the > table to > be already vacuumed, return immediately. If not, *then* force a > stats > re-read, and check a second time. > >>> Does the above mean that the second and subsequent > >>> table_recheck_autovac() > >>> will be improved to first check using the previous refreshed > >>> statistics? > >>> I think that certainly works. > >>> > >>> If that's correct, I'll try to create a patch for the PoC > >> > >> I still don't know how to reproduce Jim's troubles, but I was able > >> to reproduce > >> what was probably a very similar problem. > >> > >> This problem seems to be more likely to occur in cases where you > >> have > >> a large number of tables, > >> i.e., a large amount of stats, and many small tables need VACUUM at > >> the same time. > >> > >> So I followed Tom's advice and created a patch for the PoC. 
> >> This patch will enable a flag in the table_recheck_autovac function to use
> >> the existing stats next time if VACUUM (or ANALYZE) has already been done
> >> by another worker on the check after the stats have been updated.
> >> If the tables continue to require VACUUM after the refresh, then a refresh
> >> will be required instead of using the existing statistics.
> >>
> >> I did a simple test with HEAD and HEAD + this PoC patch.
> >> The tests were conducted in two cases.
> >> (I changed a few configurations; see attached scripts)
> >>
> >> 1. Normal VACUUM case
> >>   - SET autovacuum = off
> >>   - CREATE tables with 100 rows
> >>   - DELETE 90 rows for each table
> >>   - SET autovacuum = on and restart PostgreSQL
> >>   - Measure the time it takes for all tables to be VACUUMed
> >>
> >> 2. Anti-wraparound VACUUM case
> >>   - CREATE blank tables
> >>   - SELECT all of these tables (to generate stats)
> >>   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
> >>   - Consume a lot of XIDs by using txid_current()
> >>   - Measure the time it takes for all tables to be VACUUMed
> >>
> >> For each test case, the following results were obtained by changing
> >> autovacuum_max_workers parameters to 1, 2, 3(def), 5 and 10.
> >> Also changing num of tables to 1000, 5000, 1 and 2.
> >>
> >> Due to the poor VM environment (2 VCPU/4 GB), the results are a
> >> little unstable, but I think it's enough to ask for a trend.
> >>
> >> ===
> >> [1. Normal VACUUM case]
> >> tables:1000
> >>   autovacuum_max_workers 1:  (HEAD) 20 sec VS (with patch) 20 sec
> >>   autovacuum_max_workers 2:  (HEAD) 18 sec VS (with patch) 16 sec
> >>   autovacuum_max_workers 3:  (HEAD) 18 sec VS (with patch) 16 sec
> >>   autovacuum_max_workers 5:  (HEAD) 19 sec VS (with patch) 17 sec
> >>   autovacuum_max_workers 10: (HEAD) 19 sec VS (with patch) 17 sec
> >>
> >> tables:5000
> >>   autovacuum_max_workers 1:  (HEAD) 77 sec VS (with patch) 78 sec
> >>   autovacuum_max_workers 2:  (HEAD) 61 sec VS (with patch) 43 sec
> >>   autovacuum_max_workers 3:  (HEAD) 38 sec
PG vs LLVM 12 on seawasp, next round
Hi Since seawasp's bleeding edge LLVM installation moved to "trunk 20201114 c8f4e06b 12.0.0" ~16 days ago, it has been red. Further updates didn't help it and it's now on "trunk 20201127 6ee22ca6 12.0.0". I wonder if there is something in Fabien's scripting that needs to be tweaked, perhaps a symlink name or similar. I don't follow LLVM development but I found my way to a commit[1] around the right time that mentions breaking up the OrcJIT library, so *shrug* maybe that's a clue. +ERROR: could not load library "/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so": libLLVMOrcJIT.so.12git: cannot open shared object file: No such file or directory [1] https://github.com/llvm/llvm-project/commit/1d0676b54c4e3a517719220def96dfdbc26d8048
TAP test utility module 'PG_LSN.pm'
Hi all I'd like to share the attached PG_LSN.pm module that I use when writing TAP tests. I suggest that it be considered for inclusion in core. It defines a Perl datatype PG_LSN with operator support, so you can write things like cmp_ok($got_lsn, "<", $expected_lsn, "testname") in TAP tests and get sensible results without any concern for LSN representation details, locale, etc. You can subtract LSNs to get a byte difference too. It's small but I've found it handy. PG_LSN.pm Description: Perl program
Re: Printing backtrace of postgres processes
Andres Freund writes: > It should be quite doable to emit such backtraces directly to stderr, > instead of using appendStringInfoString()/elog(). No, please no. (1) On lots of logging setups (think syslog), anything that goes to stderr is just going to wind up in the bit bucket. I realize that we have that issue already for memory context dumps on OOM errors, but that doesn't make it a good thing. (2) You couldn't really write "to stderr", only to fileno(stderr), creating issues about interleaving of the output with regular stderr output. For instance it's quite likely that the backtrace would appear before stderr output that had actually been emitted earlier, which'd be tremendously confusing. (3) This isn't going to do anything good for my concerns about interleaved output from different processes, either. regards, tom lane
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Dec 1, 2020 at 7:55 AM Ajin Cherian wrote: > > On Tue, Dec 1, 2020 at 12:46 AM Amit Kapila wrote: > > > > > Sure, but you can see in your example above it got skipped due to > > start_decoding_at not due to DecodingContextReady. So, the problem as > > mentioned by me previously was how we distinguish those cases because > > it can skip due to start_decoding_at during restart as well when we > > would have already sent the prepare to the subscriber. > > The distinguishing factor is that at restart, the Prepare does satisfy > DecodingContextReady (because the snapshot is consistent then). > In both cases, the prepare is prior to start_decoding_at, but when the > prepare is before a consistent point, > it does not satisfy DecodingContextReady. > I think it won't be true when we reuse some already serialized snapshot from some other slot. It is possible that we wouldn't have encountered such a serialized snapshot while creating a slot but later during replication, we might use it because by that time some other slot has serialized the one at that point. -- With Regards, Amit Kapila.
Re: runtime error copying oids field
Alvaro, et al: Please let me know how to proceed with the patch. Running test suite with the patch showed no regression. Cheers On Mon, Nov 30, 2020 at 3:24 PM Zhihong Yu wrote: > Hi, > See attached patch which is along the line Alvaro outlined. > > Cheers > > On Mon, Nov 30, 2020 at 3:01 PM Alvaro Herrera > wrote: > >> On 2020-Nov-30, Zhihong Yu wrote: >> >> > This was the line runtime error was raised: >> > >> > memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts); >> > >> > From RelationBuildPartitionDesc we can see that: >> > >> > if (nparts > 0) >> > { >> > PartitionBoundInfo boundinfo; >> > int*mapping; >> > int next_index = 0; >> > >> > result->oids = (Oid *) palloc0(nparts * sizeof(Oid)); >> > >> > The cause was oids field was not assigned due to nparts being 0. >> > This is verified by additional logging added just prior to the memcpy >> call. >> > >> > I want to get the community's opinion on whether a null check should be >> > added prior to the memcpy() call. >> >> As far as I understand, we do want to avoid memcpy's of null pointers; >> see [1]. >> >> In this case I think it'd be sane to skip the complete block, not just >> the memcpy, something like >> >> diff --git a/src/backend/commands/indexcmds.c >> b/src/backend/commands/indexcmds.c >> index ca24620fd0..d35deb433a 100644 >> --- a/src/backend/commands/indexcmds.c >> +++ b/src/backend/commands/indexcmds.c >> @@ -1163,15 +1163,17 @@ DefineIndex(Oid relationId, >> >> if (partitioned) >> { >> + PartitionDesc partdesc; >> + >> /* >> * Unless caller specified to skip this step (via ONLY), >> process each >> * partition to make sure they all contain a >> corresponding index. >> * >> * If we're called internally (no stmt->relation), >> recurse always. 
>> */ >> - if (!stmt->relation || stmt->relation->inh) >> + partdesc = RelationGetPartitionDesc(rel); >> + if ((!stmt->relation || stmt->relation->inh) && >> partdesc->nparts > 0) >> { >> - PartitionDesc partdesc = >> RelationGetPartitionDesc(rel); >> int nparts = partdesc->nparts; >> Oid*part_oids = >> palloc(sizeof(Oid) * nparts); >> boolinvalidate_parent = false; >> >> [1] >> https://www.postgresql.org/message-id/flat/20200904023648.GB3426768%40rfd.leadboat.com >> >
Confusing behavior of psql's \e
If you quit the editor without saving, the current query buffer or the last executed SQL statement gets run. This can be annoying and disruptive, and it requires you to empty the file and save it if you don't want to execute anything. But when editing a script, it is a clear POLA violation:

test=> \! cat q.sql
SELECT 99;
test=> SELECT 42;
 ?column?
----------
       42
(1 row)
test=> \e q.sql
[quit the editor without saving]
 ?column?
----------
       42
(1 row)

This is pretty bad: you either have to re-run the previous statement or you have to empty your script file. Both are unappealing. I have been annoyed about this myself, and I have heard other people complain about it, so I propose to clear the query buffer if the editor exits without modifying the file. This behavior is much more intuitive for me. Yours, Laurenz Albe From 209e4a0ab51f88a82e1fc0d4e4efd24d38a7d26c Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Tue, 1 Dec 2020 04:28:50 +0100 Subject: [PATCH] Discard query buffer if editor is quit in \e Before, the current query buffer or the previous query was executed when you quit the editor without saving. This was frequently annoying, but downright confusing if \e was used to edit a script file, because it would then execute the previous command rather than the script file. --- doc/src/sgml/ref/psql-ref.sgml | 4 +++- src/bin/psql/command.c | 2 ++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 221a967bfe..002ed38fe7 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1949,7 +1949,9 @@ testdb=> -The new contents of the query buffer are then re-parsed according to +If the editor is quit without modifying the file, the query buffer +is cleared. +Otherwise, the new contents of the query buffer are re-parsed according to the normal rules of psql, treating the whole buffer as a single line.
Any complete queries are immediately executed; that is, if the query buffer contains or ends with a diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index c7a83d5dfc..ffc5d209bc 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -3838,6 +3838,8 @@ do_edit(const char *filename_arg, PQExpBuffer query_buf, fclose(stream); } } + else + resetPQExpBuffer(query_buf); /* remove temp file */ if (!filename_arg) -- 2.26.2
Re: Printing backtrace of postgres processes
On Tue, 1 Dec 2020 at 11:31, Andres Freund wrote: > > Hi, > > On 2020-11-30 13:35:46 +0800, Craig Ringer wrote: > > I find that when I most often want a backtrace of a running, live > > backend, it's because the backend is doing something that isn't > > passing a CHECK_FOR_INTERRUPTS() so it's not responding to signals. So > > it wouldn't help if a backend is waiting on an LWLock, busy in a > > blocking call to some loaded library, a blocking syscall, etc. But > > there are enough other times I want live backtraces, and I'm not the > > only one whose needs matter. > > Random thought: Wonder if it could be worth adding a conditionally > compiled mode where we track what the longest time between two > CHECK_FOR_INTERRUPTS() calls is (with some extra logic for client > IO). > > Obviously the regression tests don't tend to hit the worst cases of > CFR() less code, but even if they did, we currently wouldn't know from > running the regression tests. We can probably determine that just as well with a perf or systemtap run on an --enable-dtrace build. Just tag CHECK_FOR_INTERRUPTS() with a SDT marker then record the timings. It might be convenient to have it built-in I guess, but if we tag the site and do the timing/tracing externally we don't have to bother about conditional compilation and special builds.
Re: Printing backtrace of postgres processes
Hi, On 2020-11-30 13:35:46 +0800, Craig Ringer wrote: > I find that when I most often want a backtrace of a running, live > backend, it's because the backend is doing something that isn't > passing a CHECK_FOR_INTERRUPTS() so it's not responding to signals. So > it wouldn't help if a backend is waiting on an LWLock, busy in a > blocking call to some loaded library, a blocking syscall, etc. But > there are enough other times I want live backtraces, and I'm not the > only one whose needs matter. Random thought: Wonder if it could be worth adding a conditionally compiled mode where we track what the longest time between two CHECK_FOR_INTERRUPTS() calls is (with some extra logic for client IO). Obviously the regression tests don't tend to hit the worst cases of CFR() less code, but even if they did, we currently wouldn't know from running the regression tests. Greetings, Andres Freund
Re: Printing backtrace of postgres processes
Hi, On 2020-11-22 01:25:08 -0500, Tom Lane wrote: > Surely this is *utterly* unsafe. You can't do that sort of stuff in > a signal handler. That's of course true for the current implementation - but I don't think it's a fundamental constraint. With a bit of care backtrace() and backtrace_symbols() itself can be signal safe: > backtrace() and backtrace_symbols_fd() don't call malloc() explicitly, > but they are part of libgcc, which gets loaded dynamically when first > used. Dynamic loading usually triggers a call to malloc(3). If you need > certain calls to these two functions to not allocate memory (in signal > handlers, for example), you need to make sure libgcc is loaded beforehand. It should be quite doable to emit such backtraces directly to stderr, instead of using appendStringInfoString()/elog(). Or even use a static buffer. It does have quite some appeal to be able to debug production workloads where queries can't be cancelled etc. And knowing that backtraces reliably work in case of SIGQUIT etc is also nice... > I would like to see some discussion of the security implications > of such a feature, as well. ("There aren't any" is the wrong > answer.) +1 Greetings, Andres Freund
Re: Improving spin-lock implementation on ARM.
On Tue, 1 Dec 2020 at 02:31, Alexander Korotkov wrote: > On Mon, Nov 30, 2020 at 7:00 AM Krunal Bauskar > wrote: > > 3. Problem with GCC approach is still a lot of distro don't support gcc > 9.4 as default. > > To use this approach: > > * PGSQL will have to roll out its packages using gcc-9.4+ only so > that they are compatible with all aarch64 machines > > * but this continues to affect all other users who tend to build > pgsql using standard distro based compiler. (unless they upgrade compiler). > > I think if a user, who runs PostgreSQL on a multicore machine with > high-concurrent load, can take care about installing the appropriate > package/using the appropriate compiler (especially if we publish > explicit instructions for that). In the same way such advanced users > tune Linux kernel etc. > > BTW, how do you get that required gcc version is 9.4? I've managed to > use LSE with gcc 9.3. > Did they backport it to 9.3? I am just looking at the gcc guide. https://gcc.gnu.org/gcc-9/changes.html GCC 9.4 / Target Specific Changes / AArch64: "The option -moutline-atomics has been added to aid deployment of the Large System Extensions (LSE)" > > -- > Regards, > Alexander Korotkov > -- Regards, Krunal Bauskar
Re: Huge memory consumption on partitioned table with FKs
On Mon, Nov 30, 2020 at 9:48 PM Tom Lane wrote: > Corey Huinker writes: > > Given that we're already looking at these checks, I was wondering if this > > might be the time to consider implementing these checks by directly > > scanning the constraint index. > > Yeah, maybe. Certainly ri_triggers is putting a huge amount of effort > into working around the SPI/parser/planner layer, to not a lot of gain. > > However, it's not clear to me that that line of thought will work well > for the statement-level-trigger approach. In that case you might be > dealing with enough tuples to make a different plan advisable. > > regards, tom lane > Bypassing SPI would probably mean that we stay with row level triggers, and the cached query plan would go away, perhaps replaced by an already-looked-up-this-tuple hash sorta like what the cached nested loops effort is doing. I've been meaning to give this a try when I got some spare time. This may inspire me to try again.
Re: Printing backtrace of postgres processes
On Tue, 1 Dec 2020 at 07:04, Tom Lane wrote: > I'd feel better about it if the mechanism had you specify exactly > one target process, and were restricted to a superuser requestor. Er, rather. I actually assumed the former was the case already, not having looked closely yet.
Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted
On Tue, Dec 1, 2020 at 3:55 PM Michael Paquier wrote: > On Mon, Nov 30, 2020 at 06:59:40PM +1300, Thomas Munro wrote: > > So I fixed that, by adding a return value to do_truncate() and > > checking it. That's the version I plan to commit tomorrow, unless > > there are further comments or objections. I've also attached a > > version suitable for REL_11_STABLE and earlier branches (with a name > > that cfbot should ignore), where things are slightly different. In > > those branches, the register_forget_request() logic is elsewhere. > > Hmm. Sorry for arriving late at the party. But is that really > something suitable for a backpatch? Sure, it is not optimal to not > truncate all the segments when a transaction dropping a relation > commits, but this was not completely broken either. I felt on balance it was a "bug", since it causes operational difficulties for people and was clearly not our intended behaviour, and I announced this intention 6 weeks ago. Of course I'll be happy to revert it from the back-branches if that's the consensus. Any other opinions?
Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted
On Mon, Nov 30, 2020 at 06:59:40PM +1300, Thomas Munro wrote: > So I fixed that, by adding a return value to do_truncate() and > checking it. That's the version I plan to commit tomorrow, unless > there are further comments or objections. I've also attached a > version suitable for REL_11_STABLE and earlier branches (with a name > that cfbot should ignore), where things are slightly different. In > those branches, the register_forget_request() logic is elsewhere. Hmm. Sorry for arriving late at the party. But is that really something suitable for a backpatch? Sure, it is not optimal to not truncate all the segments when a transaction dropping a relation commits, but this was not completely broken either. -- Michael signature.asc Description: PGP signature
Re: BUG #16663: DROP INDEX did not free up disk space: idle connection hold file marked as deleted
On Mon, Nov 30, 2020 at 6:59 PM Thomas Munro wrote: > On Wed, Nov 25, 2020 at 8:00 AM Pavel Borisov wrote: > > The new status of this patch is: Ready for Committer > ... That's the version I plan to commit tomorrow, unless > there are further comments or objections. ... Done, and back-patched. I thought a bit more about the fact that we fail to unlink higher-numbered segments in certain error cases, potentially leaving stray files behind. As far as I can see, nothing we do in this code-path is going to be a bullet-proof solution to that problem. One simple idea would be for the checkpointer to refuse to unlink segment 0 (thereby allowing the relfilenode to be recycled) until it has scanned the parent directory for any related files that shouldn't be there. > While looking at trace output, I figured we should just use > truncate(2) on non-Windows, on the master branch only. It's not like > it really makes much difference, but I don't see why we shouldn't > allow ourselves to use ancient standardised Unix syscalls when we can. Also pushed, but only to master.
Re: Huge memory consumption on partitioned table with FKs
Corey Huinker writes: > Given that we're already looking at these checks, I was wondering if this > might be the time to consider implementing these checks by directly > scanning the constraint index. Yeah, maybe. Certainly ri_triggers is putting a huge amount of effort into working around the SPI/parser/planner layer, to not a lot of gain. However, it's not clear to me that that line of thought will work well for the statement-level-trigger approach. In that case you might be dealing with enough tuples to make a different plan advisable. regards, tom lane
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Mon, Nov 30, 2020 at 05:12:42PM +0300, Alexey Kondratov wrote: > Thanks. I have rebased the remaining patches on top of 873ea9ee to use > 'utility_option_list' instead of 'common_option_list'. Thanks, that helps a lot. I have gone through 0002, and tweaked it as the attached (note that this patch is also interesting for another thing in development: backend-side reindex filtering of collation-sensitive indexes). Does that look right to you? These are mostly matters of consistency with the other commands using DefElem, but I think that it is important to get things right: - Having the list of options in parsenodes.h becomes incorrect, because these are now only used at execution time, like VACUUM. So I have moved that to cluster.h and index.h. - Let's use an enum for REINDEX, like the others. - Having parse_reindex_params() in utility.c is wrong for something aimed at being used only for REINDEX, so I have moved that to indexcmds.c, and renamed the routine to be more consistent with the rest. I think that we could do more here by having an ExecReindex() that does all the work based on object types, but I have left that out for now to keep the change minimal. - Switched one of the existing tests to stress CONCURRENTLY within parentheses. - Indented the whole thing. A couple of extra things below. * CLUSTER [VERBOSE] [ USING ] + * CLUSTER [VERBOSE] [(options)] [ USING ] This line is wrong, and should be: CLUSTER [ (options) ] [ USING ] +CLUSTER [VERBOSE] [ ( option +CLUSTER [VERBOSE] [ ( option [, ...] ) ] The docs in cluster.sgml are wrong as well, you can have VERBOSE as a single option or as a parenthesized option, but never both at the same time. On the contrary, psql completion got that right.
I was first a bit surprised that you would not allow the parenthesized set for the case where a relation is not specified in the command, but I agree that this does not seem worth the extra complexity now as this thread aims at being able to use TABLESPACE which makes little sense database-wide. -VERBOSE +VERBOSE [ boolean ] Forgot about CONCURRENTLY as an option here, as this becomes possible. -- Michael diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h index f4559b09d7..c041628049 100644 --- a/src/include/catalog/index.h +++ b/src/include/catalog/index.h @@ -29,6 +29,15 @@ typedef enum INDEX_DROP_SET_DEAD } IndexStateFlagsAction; +/* options for REINDEX */ +typedef enum ReindexOption +{ + REINDEXOPT_VERBOSE = 1 << 0, /* print progress info */ + REINDEXOPT_REPORT_PROGRESS = 1 << 1, /* report pgstat progress */ + REINDEXOPT_MISSING_OK = 1 << 2, /* skip missing relations */ + REINDEXOPT_CONCURRENTLY = 1 << 3 /* concurrent mode */ +} ReindexOption; + /* state info for validate_index bulkdelete callback */ typedef struct ValidateIndexState { diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h index e05884781b..7cfb37c9b2 100644 --- a/src/include/commands/cluster.h +++ b/src/include/commands/cluster.h @@ -14,11 +14,19 @@ #define CLUSTER_H #include "nodes/parsenodes.h" +#include "parser/parse_node.h" #include "storage/lock.h" #include "utils/relcache.h" -extern void cluster(ClusterStmt *stmt, bool isTopLevel); +/* options for CLUSTER */ +typedef enum ClusterOption +{ + CLUOPT_RECHECK = 1 << 0, /* recheck relation state */ + CLUOPT_VERBOSE = 1 << 1 /* print progress info */ +} ClusterOption; + +extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel); extern void cluster_rel(Oid tableOid, Oid indexOid, int options); extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMODE lockmode); diff --git a/src/include/commands/defrem.h b/src/include/commands/defrem.h index 
7a079ef07f..1133ae1143 100644 --- a/src/include/commands/defrem.h +++ b/src/include/commands/defrem.h @@ -34,6 +34,7 @@ extern ObjectAddress DefineIndex(Oid relationId, bool check_not_in_use, bool skip_build, bool quiet); +extern int ReindexParseOptions(ParseState *pstate, ReindexStmt *stmt); extern void ReindexIndex(RangeVar *indexRelation, int options, bool isTopLevel); extern Oid ReindexTable(RangeVar *relation, int options, bool isTopLevel); extern void ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind, diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index d1f9ef29ca..d6a6969b0d 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -3196,18 +3196,12 @@ typedef struct AlterSystemStmt * Cluster Statement (support pbrown's cluster index implementation) * -- */ -typedef enum ClusterOption -{ - CLUOPT_RECHECK = 1 << 0, /* recheck relation state */ - CLUOPT_VERBOSE = 1 << 1 /* print progress info */ -} ClusterOption; - typedef struct ClusterStmt { NodeTag type; RangeVar *relation; /* relation being indexed, or NULL if all */ char *indexname;
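The ReindexOption and ClusterOption enums in the patch above use the bit-flag pattern (`1 << n` values) so several options can be packed into a single `int` options word. A minimal, self-contained illustration of how such flags are combined and tested — the enum values mirror the patch, but the helper function is a hypothetical stand-in, not PostgreSQL code:

```c
#include <assert.h>
#include <stdbool.h>

/* Bit-flag options, with values as in the patch's ReindexOption enum */
typedef enum ReindexOption
{
	REINDEXOPT_VERBOSE = 1 << 0,		/* print progress info */
	REINDEXOPT_REPORT_PROGRESS = 1 << 1,	/* report pgstat progress */
	REINDEXOPT_MISSING_OK = 1 << 2,		/* skip missing relations */
	REINDEXOPT_CONCURRENTLY = 1 << 3	/* concurrent mode */
} ReindexOption;

/* Hypothetical helper: is a given flag set in an options word? */
static bool
reindex_option_set(int options, ReindexOption flag)
{
	return (options & flag) != 0;
}
```

Because each flag occupies a distinct bit, callers can OR several options together (e.g. `REINDEXOPT_VERBOSE | REINDEXOPT_CONCURRENTLY`) and pass them through the existing `int options` parameters unchanged.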
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi > We are relying on the "fact" that the first lseek() call of a > (startup) process tells the truth. We added an assertion so that we > make sure that the cached value won't be cleared during recovery. A > possible remaining danger would be closing of an smgr object of a live > relation just after a file extension failure. I think we are thinking > that that doesn't happen during recovery. Although it seems to me > true, I'm not confident. > > If that's true, we don't even need to look at the "cached" flag at all > and always be able to rely on the returned value from smgrnblocks() > during recovery. Otherwise, we need to avoid the danger situation. Hmm, I've gotten to think that smgrnblocks() doesn't need the cached parameter, too. DropRel*Buffers() can just check InRecovery. Regarding the only concern about smgrclose() by startup process, I was afraid of the cache invalidation by CacheInvalidateSmgr(), but startup process doesn't receive shared inval messages. So, it doesn't call smgrclose*() due to shared cache invalidation. [InitRecoveryTransactionEnvironment()] /* * Initialize shared invalidation management for Startup process, being * careful to register ourselves as a sendOnly process so we don't need to * read messages, nor will we get signaled when the queue starts filling * up. */ SharedInvalBackendInit(true); Kirk-san, Can you check to see if smgrclose() and its friends are not called during recovery using the regression test? Regards Takayuki Tsunakawa
Re: Huge memory consumption on partitioned table with FKs
> > I think this can be solved easily in the patch, by having > ri_BuildQueryKey() compare the parent's fk_attnums to the parent; if > they are equal then use the parent's constraint_id, otherwise use the > child constraint. That way, the cache entry is reused in the common > case where they are identical. > Somewhat of a detour, but in reviewing the patch for Statement-Level RI checks, Andres and I observed that SPI made for a large portion of the RI overhead. Given that we're already looking at these checks, I was wondering if this might be the time to consider implementing these checks by directly scanning the constraint index.
Re: Re: [PATCH] BUG FIX: Core dump could happen when VACUUM FULL in standalone mode
I think you are right after reading code in compute_parallel_vacuum_workers() :) From: Tom Lane Sent: December 1, 2020 2:54 To: Yulin PEI Cc: Masahiko Sawada ; pgsql-hackers@lists.postgresql.org Subject: Re: Re: [PATCH] BUG FIX: Core dump could happen when VACUUM FULL in standalone mode Yulin PEI writes: > Yes, I agree because (IsNormalProcessingMode() ) means that current process > is not in bootstrap mode and postmaster process will not build index. > So my new modified patch is attached. This is a good catch, but the proposed fix still seems pretty random and unlike how it's done elsewhere. It seems to me that since index_build() is relying on plan_create_index_workers() to assess parallel safety, that's where to check IsUnderPostmaster. Moreover, the existing code in compute_parallel_vacuum_workers (which gets this right) associates the IsUnderPostmaster check with the initial check on max_parallel_maintenance_workers. So I think that the right fix is to adopt the compute_parallel_vacuum_workers coding in plan_create_index_workers, and thereby create a model for future uses of max_parallel_maintenance_workers to follow. regards, tom lane
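The coding Tom recommends pairs the IsUnderPostmaster test with the initial max_parallel_maintenance_workers check, so that standalone (single-user) mode always computes zero workers before anything else is considered. A simplified sketch of that gating logic — the function name and the `under_postmaster` flag are stand-ins for illustration, not PostgreSQL's actual symbols:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for PostgreSQL's IsUnderPostmaster global */
static bool under_postmaster = true;

/*
 * Sketch of the guard pattern: parallel workers can only be launched by
 * a postmaster, so standalone mode must yield zero workers up front,
 * in the same test that checks the GUC-style worker limit.
 */
static int
compute_maintenance_workers(int max_parallel_maintenance_workers,
							int requested)
{
	if (!under_postmaster || max_parallel_maintenance_workers == 0)
		return 0;
	return requested < max_parallel_maintenance_workers
		? requested : max_parallel_maintenance_workers;
}
```

Putting the postmaster check at the same spot as the worker-limit check is what makes this a reusable model: any future caller consulting the limit gets the standalone-mode behavior for free.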
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Dec 1, 2020 at 12:46 AM Amit Kapila wrote: > So what caused it to skip due to start_decoding_at? Because the commit > where the snapshot became consistent is after Prepare. Does it happen > due to the below code in SnapBuildFindSnapshot() where we bump > start_decoding_at. > > { > ... > if (running->oldestRunningXid == running->nextXid) > { > if (builder->start_decoding_at == InvalidXLogRecPtr || > builder->start_decoding_at <= lsn) > /* can decode everything after this */ > builder->start_decoding_at = lsn + 1; I think the reason is that in the function DecodingContextFindStartpoint(), the code loops till it finds the consistent snapshot. Then once consistent snapshot is found, it sets slot->data.confirmed_flush = ctx->reader->EndRecPtr; This will be used as the start_decoding_at when the slot is restarted for decoding. > Sure, but you can see in your example above it got skipped due to > start_decoding_at not due to DecodingContextReady. So, the problem as > mentioned by me previously was how we distinguish those cases because > it can skip due to start_decoding_at during restart as well when we > would have already sent the prepare to the subscriber. The distinguishing factor is that at restart, the Prepare does satisfy DecodingContextReady (because the snapshot is consistent then). In both cases, the prepare is prior to start_decoding_at, but when the prepare is before a consistent point, it does not satisfy DecodingContextReady. Which is why I suggested using the check DecodingContextReady to mark the prepare as 'Not decoded". regards, Ajin Cherian Fujitsu Australia
Re: Fix generate_useful_gather_paths for parallel unsafe pathkeys
On Sun, Nov 29, 2020 at 4:20 PM Tomas Vondra wrote: > > On 11/29/20 3:26 PM, James Coleman wrote: > > On Mon, Nov 23, 2020 at 8:35 AM James Coleman wrote: > >> > >> On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra > >> wrote: > >>> ... > >>> Thanks for the patches, I'll take a closer look next week. The 0002 > >>> patch is clearly something we need to backpatch, not sure about 0001 as > >>> it essentially enables new behavior in some cases (Sort on unsorted > >>> paths below Gather Merge). > >> > >> Thanks for taking a look. > >> > >> I had the same question re: backporting. On the one hand it will > >> change behavior (this is always a bit of a gray area since in one > >> sense bugs change behavior), but IMO it's not a new feature, because > >> the code clearly intended to have that feature in the first place (it > >> creates gather merges on top of a full sort). So I'd lean towards > >> considering it a bug fix, but I'm also not going to make that a hill > >> to die on. > >> > >> One semi-related question: do you think we should add a comment to > >> prepare_sort_from_pathkeys explaining that modifying it may mean > >> find_em_expr_for_rel needs to be updated also? > > > > Here's an updated patch series. > > > > 0001 and 0002 as before, but with some minor cleanup. > > > > 0001 - seems fine > > 0002 - Should we rename the "parallel" parameter to something more > descriptive, like "require_parallel_safe"? Done. > > 0003 disallows SRFs in these sort expressions (per discussion over at [1]). > > > > OK, but I'd move the define from tlist.c to tlist.h, not optimizer.h. IS_SRF_CALL doesn't really seem to be particularly specific conceptually to target lists (and of course now we're using it in equivclass.c), so I'd tried to put it somewhere more general. Is moving it to tlist.h mostly to minimize changes? Or do you think it fits more naturally with the tlist code (I might be missing something)? 
> > 0004 removes the search through the target for matching volatile > expressions (per discussion at [2]). > > OK > > 0005 adds the comment I mentioned above clarifying these two functions > are linked. > > Makes sense. I wonder if we need to be more specific about how > consistent those two places need to be. Do they need to match 1:1, or > how do people in the future decide which restrictions need to be in both > places? At this point I'm not sure I'd have a good way to describe that: one is a clear superset of the other (and we already talk in the comments about the other being a subset), but it's really going to be about "what do we want to allow to be sorted proactively"...hmm, that thought made me take a swing at it; let me know what you think. James From 3cc9d13869e25b546bf113d57e5015b5ab2c2abf Mon Sep 17 00:00:00 2001 From: jcoleman Date: Wed, 25 Nov 2020 15:46:00 -0500 Subject: [PATCH v3 3/5] Disallow SRFs in proactive sort So they can be evaluated at the proper time as determined by make_sort_input_target. --- src/backend/optimizer/path/equivclass.c| 8 src/backend/optimizer/util/tlist.c | 5 - src/include/optimizer/optimizer.h | 5 + src/test/regress/expected/incremental_sort.out | 12 src/test/regress/sql/incremental_sort.sql | 2 ++ 5 files changed, 27 insertions(+), 5 deletions(-) diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index 75cb1750a8..8c3ef98d73 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -835,6 +835,14 @@ find_em_expr_usable_for_sorting_rel(PlannerInfo *root, EquivalenceClass *ec, Rel if (require_parallel_safe && !is_parallel_safe(root, (Node *) em_expr)) continue; + + /* + * Disallow SRFs so that all of them can be evaluated at the correct + * time as determined by make_sort_input_target. 
+ */ + if (IS_SRF_CALL((Node *) em_expr)) + continue; + /* * As long as the expression isn't volatile then * prepare_sort_from_pathkeys is able to generate a new target entry, diff --git a/src/backend/optimizer/util/tlist.c b/src/backend/optimizer/util/tlist.c index 02a3c6b165..01cea102ea 100644 --- a/src/backend/optimizer/util/tlist.c +++ b/src/backend/optimizer/util/tlist.c @@ -21,11 +21,6 @@ #include "optimizer/tlist.h" -/* Test if an expression node represents a SRF call. Beware multiple eval! */ -#define IS_SRF_CALL(node) \ - ((IsA(node, FuncExpr) && ((FuncExpr *) (node))->funcretset) || \ - (IsA(node, OpExpr) && ((OpExpr *) (node))->opretset)) - /* * Data structures for split_pathtarget_at_srfs(). To preserve the identity * of sortgroupref items even if they are textually equal(), what we track is diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h index 3e4171056e..1e135652b9 100644 --- a/src/include/optimizer/optimizer.h +++ b/src/include/optimizer/optimizer.h @@ -24,6 +24,11 @@ #include "nodes/parsenodes.h" +/* Test if an ex
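The "Beware multiple eval!" warning on the IS_SRF_CALL macro being moved above is the standard caveat for function-like macros that expand their argument more than once: an argument with side effects gets executed once per expansion. A small, self-contained demonstration of the hazard (the macro and counter here are illustrative, unrelated to PostgreSQL's actual node types):

```c
#include <assert.h>

/* A macro that, like IS_SRF_CALL, expands its argument twice */
#define IS_NONZERO_AND_NOT_NEG1(x) (((x) != 0) && ((x) != -1))

/* Counter exposing how many times the macro argument is evaluated */
static int eval_count = 0;

static int
counted(int v)
{
	eval_count++;
	return v;
}
```

Passing `counted(5)` to the macro runs the function twice (both `!=` tests fire), while `counted(0)` runs it only once because `&&` short-circuits — exactly the kind of surprise the comment warns callers about.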
Re: VACUUM (DISABLE_PAGE_SKIPPING on)
On Fri, Nov 20, 2020 at 8:47 PM Simon Riggs wrote: > > On Fri, 20 Nov 2020 at 10:15, Simon Riggs wrote: > > > > On Fri, 20 Nov 2020 at 01:40, Masahiko Sawada wrote: > > > > > > On Thu, Nov 19, 2020 at 8:02 PM Simon Riggs wrote: > > > > > > > > On Wed, 18 Nov 2020 at 17:59, Robert Haas wrote: > > > > > > > > > > On Wed, Nov 18, 2020 at 12:54 PM Simon Riggs > > > > > wrote: > > > > > > Patches attached. > > > > > > 1. vacuum_anti_wraparound.v2.patch > > > > > > 2. vacuumdb_anti_wrap.v1.patch - depends upon (1) > > > > > > > > > > I don't like the use of ANTI_WRAPAROUND as a name for this new option. > > > > > Wouldn't it make more sense to call it AGGRESSIVE? Or maybe something > > > > > else, but I dislike anti-wraparound. > > > > > > > > -1 for using the term AGGRESSIVE, which seems likely to offend people. > > > > I'm sure a more descriptive term exists. > > > > > > Since we use the term aggressive scan in the docs, I personally don't > > > feel unnatural about that. But since this option also disables index > > > cleanup when not enabled explicitly, I’m concerned a bit if user might > > > get confused. I came up with some names like FREEZE_FAST and > > > FREEZE_MINIMAL but I'm not sure these are better. > > > > FREEZE_FAST seems good. > > > > > BTW if this option also disables index cleanup for faster freezing, > > > why don't we disable heap truncation as well? > > > > Good idea > > Patch attached, using the name "FAST_FREEZE" instead. > Thank you for updating the patch. Here are some comments on the patch. 
- if (params->options & VACOPT_DISABLE_PAGE_SKIPPING) + if (params->options & VACOPT_DISABLE_PAGE_SKIPPING || + params->options & VACOPT_FAST_FREEZE) I think we need to update the following comment that is above this change as well: /* * We request an aggressive scan if the table's frozen Xid is now older * than or equal to the requested Xid full-table scan limit; or if the * table's minimum MultiXactId is older than or equal to the requested * mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified. */ This mentions only DISABLE_PAGE_SKIPPING now. Or the second idea is to set both params.freeze_table_age and params.multixact_freeze_table_age to 0 at ExecVacuum() instead of getting aggressive turned on here. Considering the consistency between FREEZE and FREEZE_FAST, we might want to take the second option. --- + if (fast_freeze && + params.index_cleanup == VACOPT_TERNARY_DEFAULT) + params.index_cleanup = VACOPT_TERNARY_DISABLED; + + if (fast_freeze && + params.truncate == VACOPT_TERNARY_DEFAULT) + params.truncate = VACOPT_TERNARY_DISABLED; + + if (fast_freeze && freeze) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), +errmsg("cannot specify both FREEZE and FAST_FREEZE options on VACUUM"))); + I guess that you disallow enabling both FREEZE and FAST_FREEZE because it's contradictory, which makes sense to me. But it seems to me that enabling FAST_FREEZE, INDEX_CLEANUP, and TRUNCATE is also contradictory because it will no longer be “fast”. The purpose of this option is to advance relfrozenxid as fast as possible by disabling index cleanup, heap truncation etc. Is there any use case where a user wants to enable these options (FAST_FREEZE, INDEX_CLEANUP, and TRUNCATE) at the same time? If not, probably it’s better to either disallow it or have FAST_FREEZE overwrite these two settings even if the user specifies them explicitly. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
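Distilled from the patch fragment quoted above, the option handling comes down to two rules: FAST_FREEZE defaults INDEX_CLEANUP and TRUNCATE to off (when the user left them unset), and it conflicts with FREEZE. A condensed sketch of that resolution logic, using simplified stand-in types rather than the actual ExecVacuum code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified analogue of PostgreSQL's VacOptTernaryValue */
typedef enum { TERNARY_DEFAULT, TERNARY_DISABLED, TERNARY_ENABLED } Ternary;

typedef struct VacOpts
{
	bool	freeze;
	bool	fast_freeze;
	Ternary index_cleanup;
	Ternary truncate;
} VacOpts;

/*
 * Resolve FAST_FREEZE interactions as in the patch fragment:
 * returns false (the "error" case) when FREEZE and FAST_FREEZE are
 * combined; otherwise disables index cleanup and heap truncation
 * unless the user set them explicitly.
 */
static bool
resolve_fast_freeze(VacOpts *opts)
{
	if (opts->fast_freeze && opts->freeze)
		return false;			/* cannot specify both */
	if (opts->fast_freeze && opts->index_cleanup == TERNARY_DEFAULT)
		opts->index_cleanup = TERNARY_DISABLED;
	if (opts->fast_freeze && opts->truncate == TERNARY_DEFAULT)
		opts->truncate = TERNARY_DISABLED;
	return true;
}
```

Note that under these rules an explicit INDEX_CLEANUP or TRUNCATE still survives alongside FAST_FREEZE — which is exactly the combination Sawada-san questions above as contradictory.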
RE: Disable WAL logging to speed up data loading
From: Kyotaro Horiguchi > We're emitting only redo logs. So I think theoretically we don't need > anything other than the shutdown checkpoint record because we don't > perform recovery and checkpoint record is required at startup. > > RM_XLOG_ID: > XLOG_FPI_FOR_HINT - not needed? > XLOG_FPI - not needed? > > XLOG_CHECKPOINT_SHUTDOWN - must have > > So how about the followings? > XLOG_CHECKPOINT_ONLINE > XLOG_NOOP > XLOG_NEXTOID > XLOG_SWITCH > XLOG_BACKUP_END > XLOG_PARAMETER_CHANGE > XLOG_RESTORE_POINT > XLOG_FPW_CHANGE > XLOG_END_OF_RECOVERY > > > RM_XACT_ID: > XLOG_XACT_COMMIT > XLOG_XACT_PREPARE > XLOG_XACT_ABORT > XLOG_XACT_COMMIT_PREPARED > XLOG_XACT_ABORT_PREPARED > XLOG_XACT_ASSIGNMENT > XLOG_XACT_INVALIDATIONS > > Do we need all of these? The records that need to be emitted even when wal_level = none are: RM_XLOG_ID: - XLOG_CHECKPOINT_SHUTDOWN - XLOG_PARAMETER_CHANGE RM_XACT_ID: - XLOG_XACT_PREPARE XLOG_PARAMETER_CHANGE is necessary to detect during archive recovery that wal_level was changed from >= replica to none, thus failing the archive recovery. This is for the fix discussed in this thread to change the error level from WARNING to FATAL. > Yeah, although it's enough only to restrict non-harmful records > practically, if we find that only a few kinds of records are needed, > maybe it's cleaner to allow only required record type(s). > > Maybe it's right that if we can filter-out records looking only rmid, > since the xlog facility doesn't need to know about record types of a > resource manager. But if we need to finer-grained control on the > record types, I'm afraid that that's wrong. However, if we need only > the XLOG_CHECKPOINT_SHUTDOWN record, it might be better to let > XLogInsert filter records rather than inserting that filtering code to > all the caller sites. Agreed. As the kind of WAL records to be emitted is very limited, I think XLogInsert() can filter them where the current patch does. (OTOH, the boot strap mode filters WAL coarsely as follows. 
I thought we might follow this coarse RMID-based filtering as a pessimistic safeguard against new kinds of WAL records that are necessary even in wal_level = none.) /* * In bootstrap mode, we don't actually log anything but XLOG resources; * return a phony record pointer. */ if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID) { XLogResetInsertion(); EndPos = SizeOfXLogLongPHD; /* start of 1st chkpt record */ return EndPos; } > I don't dislike "none" since it seems to me practically "none". It > seems rather correct if we actually need only the shutdown checkpoint > record. > > "unrecoverable" is apparently misleading. "crash_unsafe" is precise > but seems somewhat alien being among "logical", "replica" and > "minimal". OK, I'm happy with "none" too. We can describe in the manual that some limited amount of WAL is emitted. Regards Takayuki Tsunakawa
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
On Tue, Nov 03, 2020 at 08:56:06PM -0300, Alvaro Herrera wrote: > Here's an updated version of this patch. > > Apart from rebasing to current master, I made the following changes: > > * On the first transaction (the one that marks the partition as > detached), the partition is locked with ShareLock rather than > ShareUpdateExclusiveLock. This means that DML is not allowed anymore. > This seems a reasonable restriction to me; surely by the time you're > detaching the partition you're not inserting data into it anymore. I don't think it's an issue with your patch, but FYI that sounds like something I had to do recently: detach *all* partitions of various tables to promote their partition key column from timestamp to timestamptz. And we insert directly into child tables, not routed via parent. I don't doubt your patch is still useful, but not to us. So the documentation should be clear about that. > * ALTER TABLE .. DETACH PARTITION FINALIZE now waits for concurrent old > snapshots, as previously discussed. This should ensure that the user > doesn't just cancel the wait during the second transaction by Ctrl-C and > run FINALIZE immediately afterwards, which I claimed would bring > inconsistency. > > * Avoid creating the CHECK constraint if an identical one already > exists. > > (Note I do not remove the constraint on ATTACH. That seems pointless.) > > Still to do: test this using the new hook added by 6f0b632f083b. tab complete? > +* Ensure that foreign keys still hold after this detach. This keeps > lock > +* on the referencing tables, which prevent concurrent transactions > from keeps locks or which prevents > +++ b/doc/src/sgml/ref/alter_table.sgml > @@ -947,6 +950,24 @@ WITH ( MODULUS class="parameter">numeric_literal, REM >attached to the target table's indexes are detached. Any triggers that >were created as clones of those in the target table are removed. 
> > + > + If CONCURRENTLY is specified, this process runs in > two > + transactions in order to avoid blocking other sessions that might be > accessing > + the partitioned table. During the first transaction, > + SHARE UPDATE EXCLUSIVE is taken in both parent > table and missing "lock" taken *on* ? > + partition, and the partition is marked detached; at that point, the > transaction probably "its partition," > + If FINALIZE is specified, complete actions of a > + previous DETACH CONCURRENTLY invocation that > + was cancelled or crashed. say "actions are completed" or: If FINALIZE is specified, a previous DETACH that was cancelled or interrupted is completed. > + if (!inhdetached && detached) > + ereport(ERROR, > + > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot complete > detaching partition \"%s\"", > + childname), > + errdetail("There's no partial > concurrent detach in progress."))); maybe say "partially-complete" or remove "partial" > + * the partition being detached? Putting them on the partition > being > + * detached would be wrong, since they'd become "lost" after > the but after *that* ? > + * Concurrent mode has to work harder; first we add a new constraint to > the > + * partition that matches the partition constraint, if there isn't a > matching > + * one already. The reason for this is that the planner may have made > + * optimizations that depend on the constraint. XXX Isn't it > sufficient to > + * invalidate the partition's relcache entry? Ha. I suggested this years ago. https://www.postgresql.org/message-id/20180601221428.gu5...@telsasoft.com |. The docs say: if detaching/re-attach a partition, should first ADD CHECK to | avoid a slow ATTACH operation. Perhaps DETACHing a partition could | implicitly CREATE a constraint which is usable when reATTACHing? 
> + * Then we close our existing transaction, and in a new one wait for > + * all process to catch up on the catalog updates we've done so far; at processes > + * We don't need to concern ourselves with waiting for a lock > the > + * partition itself, since we will acquire AccessExclusiveLock > below. lock *on* ? > + * If asked to, wait until existing snapshots are gone. This is > important > + * in the second transaction of DETACH PARTITION CONCURRENTLY is > canceled: s/in/if/ > +++ b/src/bin/psql/describe.c > - printfPQExpBuffer(&tmpbuf, _("Partition of: %s %s"), > parent_name, > - partdef); > + printfPQExpBuffer(&tmpbuf, _("Partition of: %s %s%s"), > parent_name, > +
Re: Why does create_gather_merge_plan need make_sort?
On Sun, Nov 29, 2020 at 4:10 PM Tomas Vondra wrote: > > > > On 11/29/20 3:44 PM, James Coleman wrote: > > On Mon, Nov 23, 2020 at 8:19 AM James Coleman wrote: > >> > >> .. > >> > >> So I'm +1 on turning it into an ERROR log instead, since it seems to > >> me that encountering this case would almost certainly represent a bug > >> in path generation. > > > > Here's a patch to do that. > > > > Thanks. Isn't the comment incomplete? Should it mention just parallel > safety or also volatility? Volatility makes it parallel unsafe, and I'm not sure I want to get into listing every possible issue here, but in the interest of making it more obviously representative of the kinds of issues we might encounter, I've tweaked it to mention volatility also, as well as refer to the issues as "examples" of such concerns. James From b4b08b70d8961ea29587412e9a2ef4dd39111ff0 Mon Sep 17 00:00:00 2001 From: jcoleman Date: Sun, 29 Nov 2020 09:38:59 -0500 Subject: [PATCH v2] Error if gather merge paths aren't sufficiently sorted --- src/backend/optimizer/plan/createplan.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 40abe6f9f6..5ecf9f4065 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -1793,13 +1793,15 @@ create_gather_merge_plan(PlannerInfo *root, GatherMergePath *best_path) &gm_plan->nullsFirst); - /* Now, insert a Sort node if subplan isn't sufficiently ordered */ + /* + * All gather merge paths should have already guaranteed the necessary sort + * order either by adding an explicit sort node or by using presorted input. + * We can't simply add a sort here on additional pathkeys, because we can't + * guarantee the sort would be safe. For example, expressions may be + * volatile or otherwise parallel unsafe. 
+ */ if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys)) - subplan = (Plan *) make_sort(subplan, gm_plan->numCols, - gm_plan->sortColIdx, - gm_plan->sortOperators, - gm_plan->collations, - gm_plan->nullsFirst); + elog(ERROR, "gather merge input not sufficiently sorted"); /* Now insert the subplan under GatherMerge. */ gm_plan->plan.lefttree = subplan; -- 2.17.1
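The pathkeys_contained_in() test in the patch asks whether the required sort order is already guaranteed by the subpath — i.e. whether the required sort keys form a leading prefix of the keys the subpath is sorted by. Conceptually (ignoring PostgreSQL's canonical pathkey sharing and using plain ints as stand-ins for pathkeys), that prefix check looks like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Conceptual analogue of pathkeys_contained_in(): true when the first
 * list of sort keys is a leading prefix of the second, meaning input
 * sorted by keys2 is automatically sorted by keys1 as well.
 */
static bool
keys_contained_in(const int *keys1, size_t n1,
				  const int *keys2, size_t n2)
{
	if (n1 > n2)
		return false;
	for (size_t i = 0; i < n1; i++)
		if (keys1[i] != keys2[i])
			return false;
	return true;
}
```

When this check fails in create_gather_merge_plan, the patch now raises an ERROR instead of quietly adding a Sort, on the grounds that every gather merge path should have arranged the ordering at path-generation time.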
Re: remove spurious CREATE INDEX CONCURRENTLY wait
On Mon, Nov 30, 2020 at 04:15:27PM -0300, Alvaro Herrera wrote: > In the interest of showing progress, I'm going to mark this CF item as > committed, and I'll submit the remaining pieces in a new thread. Thanks for splitting! -- Michael
Re: Huge memory consumption on partitioned table with FKs
At Mon, 30 Nov 2020 21:03:45 -0300, Alvaro Herrera wrote in > On 2020-Nov-26, Kyotaro Horiguchi wrote: > > > This shares RI_ConstraintInfo cache by constraints that shares the > > same parent constraints. But you forgot that the cache contains some > > members that can differ among partitions. > > > > Consider the case of attaching a partition that have experienced a > > column deletion. *I* think it's the direction. After an off-list discussion, we confirmed that even in that case the patch works as is because fk_attnum (or contuple.conkey) always stores key attnums compatible to the topmost parent when conparent has a valid value (assuming the current usage of fk_attnum), but I still feel uneasy about relying on that unclear behavior. > I would embed all this knowledge in ri_BuildQueryKey though, without > adding the new function ri_GetParentConstOid. I don't think that > function has meaningful abstraction value, and instead it would make what I > suggest more difficult. It seems to me reasonable. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: vac_update_datfrozenxid will raise "wrong tuple length" if pg_database tuple contains toast attribute.
On Sun, Nov 29, 2020 at 5:27 PM Michael Paquier wrote: > > One thing that strikes me as unwise is that we could run into similar > > problems with vac_update_relstats() in the future, and there have been > > recent talks about having more toast tables like for pg_class. That > > is not worth caring about on stable branches because it is not an > > active problem there, but we could do something better on HEAD. > > For now, I have added just a comment at the top of > heap_inplace_update() to warn callers. > I am wondering if there is some way to assert this aspect, but there seems to be no way. So, yes, having at least a comment is good for now. Junfeng and Ashwin also mentioned to me off-list that their patch used > a second copy for performance reasons, but I don't see why this could > become an issue as we update each pg_database row only once a job is > done. So I'd like to apply something like the attached on HEAD, > comments are welcome. Yes, the attached patch looks good to me for PostgreSQL. Thanks Michael. (In Greenplum, due to per-table dispatch to segments, during database-wide vacuum this function gets called per table instead of only at the end, hence we were trying to be conservative.)
Re: [DOC] Document concurrent index builds waiting on each other
On Mon, Nov 30, 2020 at 4:53 PM Alvaro Herrera wrote: > > On 2020-Sep-30, Michael Paquier wrote: > > > + > > + CREATE INDEX (including the > > CONCURRENTLY > > + option) commands are included when VACUUM calculates > > what > > + dead tuples are safe to remove even on tables other than the one being > > indexed. > > + > > FWIW, this is true as well for REINDEX CONCURRENTLY because both use > > the same code paths for index builds and validation, with basically > > the same waiting phases. But is CREATE INDEX the correct place for > > that? Wouldn't it be better to tell about such things on the VACUUM > > doc? > > Yeah, I think it might be more sensible to document this in > maintenance.sgml, as part of the paragraph that discusses removing > tuples "to save space". But making it inline with the rest of the flow, > it seems to distract from higher-level considerations, so I suggest to > make it a footnote instead. I have mixed feelings about wholesale moving it; users aren't likely to read the vacuum doc when considering how running CIC might impact their system, though I do understand why it otherwise fits there. Even if the primary details are in the vacuum, I tend to think a reference note (or link to the vacuum docs) in the create index docs would be useful. The principle here is that 1.) vacuum is automatic/part of the background of the system, not just something people trigger manually, and 2.) we ought to document things where the user action triggering the behavior is documented. > I'm not sure on the wording to use; what about this? The wording seems fine to me. This is a replacement for what was 0002 earlier? And 0001 from earlier still seems to be a useful standalone patch? James
Re: error_severity of brin work item
On Mon, Nov 30, 2020 at 08:47:32PM -0300, Alvaro Herrera wrote: > The more I look at this, the less I like it. This would set a precedent > that any action that can be initiated from an autovac work-item has a > requirement of silently being discarded when it referenced a > non-existent relation. My original request was to change to INFO, which is what vacuum proper does at vacuum_open_relation(). I realize that still means that new work item handlers would have to do the LockOid, try_open dance - maybe it could be factored out. I noticed this on multiple servers immediately after changing a nagios script to look for all ERRORs (rather than a specific list of log_messages) and for a longer time period. > I'd rather have the code that drops the index go through the list of > work-items and delete those that reference that relation. > > I'm not sure if this is something that ought to be done in index_drop(); > One objection might be that if the drop is rolled back, the work-items > are lost. Should it be done in an AtCommit hook or something like that ? -- Justin
Re: enable_incremental_sort changes query behavior
On Tue, Nov 3, 2020 at 4:39 PM Tomas Vondra wrote: > I've pushed the 0001 part, i.e. the main fix. Not sure about the other > parts (comments and moving the code back to postgres_fdw) yet. I noticed the CF entry [1] got moved to the next CF; I'm thinking this entry should be marked as committed since the fix for the initial bug reported on this thread has been pushed. We have the parallel safety issue outstanding, but there's a separate thread and patch for that, so I'll make a new CF entry for that. I can mark it as committed, but I'm not sure how to "undo" (or if that's desirable) the move to the next CF. Thoughts? James 1: https://commitfest.postgresql.org/30/2754/
Re: Huge memory consumption on partitioned table with FKs
On 2020-Nov-26, Kyotaro Horiguchi wrote:
> This shares the RI_ConstraintInfo cache among constraints that share the
> same parent constraints.  But you forgot that the cache contains some
> members that can differ among partitions.
>
> Consider the case of attaching a partition that has experienced a
> column deletion.

I think this can be solved easily in the patch, by having
ri_BuildQueryKey() compare the constraint's fk_attnums to its parent's;
if they are equal then use the parent's constraint_id, otherwise use the
child constraint.  That way, the cache entry is reused in the common
case where they are identical.

I would embed all this knowledge in ri_BuildQueryKey though, without
adding the new function ri_GetParentConstOid.  I don't think that
function adds meaningful abstraction value, and instead it would make
what I suggest more difficult.
Consider parallel for lateral subqueries with limit
I've been investigating parallelizing certain correlated subqueries, and
during that work stumbled across the fact that
set_rel_consider_parallel disallows parallel query on what seems like a
fairly simple case. Consider this query:

select t.unique1 from tenk1 t
join lateral (select t.unique1 from tenk1 offset 0) l on true;

Currently set_rel_consider_parallel sets consider_parallel=false on the
subquery rel because it has a limit/offset. That restriction makes a lot
of sense when we have a subquery whose results conceptually need to be
"shared" (or at least be the same) across multiple workers (indeed the
relevant comment in that function notes that cases where we could prove
a unique ordering would also qualify, but punts on implementing that due
to complexity). But if the subquery is LATERAL, then no such conceptual
restriction applies. If we change the code slightly to allow considering
parallel query even in the face of LIMIT/OFFSET for LATERAL subqueries,
then our query above changes from the following plan:

 Nested Loop
   Output: t.unique1
   ->  Gather
         Output: t.unique1
         Workers Planned: 2
         ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
               Output: t.unique1
   ->  Gather
         Output: NULL::integer
         Workers Planned: 2
         ->  Parallel Index Only Scan using tenk1_hundred on public.tenk1
               Output: NULL::integer

to this plan:

 Gather
   Output: t.unique1
   Workers Planned: 2
   ->  Nested Loop
         Output: t.unique1
         ->  Parallel Index Only Scan using tenk1_unique1 on public.tenk1 t
               Output: t.unique1
         ->  Index Only Scan using tenk1_hundred on public.tenk1
               Output: NULL::integer

The code change itself is quite simple (1 line). As far as I can tell we
don't need to expressly check parallel safety of the limit/offset
expressions; that appears to happen elsewhere (and that makes sense
since the RTE_RELATION case doesn't check those clauses either). If I'm
missing something about the safety of this (or any other issue), I'd
appreciate the feedback.
James

From 0aff5f1b5e35e37a311c01e9f53caf6e088e8d43 Mon Sep 17 00:00:00 2001
From: jcoleman
Date: Mon, 30 Nov 2020 11:36:35 -0500
Subject: [PATCH v1] Allow parallel LATERAL subqueries with LIMIT/OFFSET

The code that determined whether or not a rel should be considered for
parallel query excluded subqueries with LIMIT/OFFSET. That's correct in
the general case: as the comment notes that'd mean we have to guarantee
ordering (and claims it's not worth checking that) for results to be
consistent across workers. However there's a simpler case that hasn't
been considered: LATERAL subqueries with LIMIT/OFFSET don't fall under
the same reasoning since they're executed (when not converted to a JOIN)
per tuple anyway, so consistency of results across workers isn't a
factor.
---
 src/backend/optimizer/path/allpaths.c         |  4 +++-
 src/test/regress/expected/select_parallel.out | 15 +++
 src/test/regress/sql/select_parallel.sql      |  6 ++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 84a69b064a..3c9313b5a9 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -686,11 +686,13 @@ set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
			 * inconsistent results at the top-level.  (In some cases, where
			 * the result is ordered, we could relax this restriction.  But it
			 * doesn't currently seem worth expending extra effort to do so.)
+			 * LATERAL is an exception: LIMIT/OFFSET is safe to execute within
+			 * workers since the sub-select is executed per tuple
			 */
			{
				Query	   *subquery = castNode(Query, rte->subquery);

-				if (limit_needed(subquery))
+				if (!rte->lateral && limit_needed(subquery))
					return;
			}
			break;
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 9b0c418db7..9ba40ca2c5 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -1042,6 +1042,21 @@ explain (costs off)
    Filter: (stringu1 ~~ '%'::text)
 (11 rows)
 
+-- ...unless it's LATERAL
+savepoint settings;
+set parallel_tuple_cost=0;
+explain (costs off) select t.unique1 from tenk1 t
+join lateral (select t.unique1 from tenk1 offset 0) l on true;
+                           QUERY PLAN
+-
+ Gather
+   Workers Planned: 4
+   ->  Nested Loop
+         ->  Parallel Index Only Scan using tenk1_unique1 on tenk1 t
+         ->  Index Only Scan using tenk1_hundred on tenk1
+(5 rows)
+
+rollback to savepoint settings;
 -- to increase the parallel query test coverage
 SAVEPOINT settings;
 SET LOCAL force_parallel_mode = 1;
diff
Re: error_severity of brin work item
The more I look at this, the less I like it.  This would set a precedent
that any action that can be initiated from an autovac work-item has a
requirement of silently being discarded when it references a
non-existent relation.

I'd rather have the code that drops the index go through the list of
work-items and delete those that reference that relation.

I'm not sure if this is something that ought to be done in index_drop();
One objection might be that if the drop is rolled back, the work-items
are lost.  It's the easiest, though; and work-items are supposed to be
lossy anyway, and vacuum would fix the lack of summarization eventually.
So, not pretty, but not all that bad.  (Hopefully rolled-back drops are
not all that common.)
Re: Online verification of checksums
On 11/30/20 9:27 AM, Stephen Frost wrote: Greetings, * Michael Paquier (mich...@paquier.xyz) wrote: On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote: * Magnus Hagander (mag...@hagander.net) wrote: On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier wrote: But here the checksum is broken, so while the offset is something we can rely on how do you make sure that the LSN is fine? A broken checksum could perfectly mean that the LSN is actually *not* fine if the page header got corrupted. Of course that could be the case, but it gets to be a smaller and smaller chance by checking that the LSN read falls within reasonable bounds. FWIW, I find that scary. There's ultimately different levels of 'scary' and the risk here that something is actually wrong following these checks strikes me as being on the same order as random bits being flipped in the page and still getting a valid checksum (which is entirely possible with our current checksum...), or maybe even less. I would say a lot less. First you'd need to corrupt one of the eight bytes that make up the LSN (pretty likely since corruption will probably affect the entire block) and then it would need to be updated to a value that falls within the current backup range, a 1 in 16 million chance if a terabyte of WAL is generated during the backup. Plus, the corruption needs to happen during the backup since we are going to check for that, and the corrupted LSN needs to be ascending, and the LSN originally read needs to be within the backup range (another 1 in 16 million chance) since pages written before the start backup checkpoint should not be torn. So as far as I can see there are more likely to be false negatives from the checksum itself. It would also be easy to add a few rounds of checks, i.e. test if the LSN ascends but stays in the backup LSN range N times. Honestly, I'm much more worried about corruption zeroing the entire page. 
I don't know how likely that is, but I know none of our proposed solutions would catch it. Andres, since you brought this issue up originally perhaps you'd like to weigh in? Regards, -- -David da...@pgmasters.net
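As a side note, the "1 in 16 million" figure above is easy to sanity-check (a quick sketch assuming a terabyte of WAL generated during the backup, i.e. a 2^40-byte window out of the 64-bit LSN space):

```python
# Chance that a uniformly random (corrupted) 8-byte LSN happens to fall
# inside the LSN range covered by ~1 TB of WAL written during the backup.
wal_range = 1 << 40   # 1 TB of WAL -> width of the backup's LSN window
lsn_space = 1 << 64   # LSNs are 64-bit byte positions in the WAL stream
odds = lsn_space // wal_range
print(f"1 in {odds:,}")  # -> 1 in 16,777,216
```

The same arithmetic applies to the second condition mentioned above (the originally read LSN falling within the backup range), which is why the combined chance is so small.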
Re: Add Information during standby recovery conflicts
On Tue, Dec 1, 2020 at 3:25 AM Alvaro Herrera wrote: > > On 2020-Dec-01, Fujii Masao wrote: > > > + if (proc) > > + { > > + if (nprocs == 0) > > + appendStringInfo(&buf, "%d", > > proc->pid); > > + else > > + appendStringInfo(&buf, ", %d", > > proc->pid); > > + > > + nprocs++; > > > > What happens if all the backends in wait_list have gone? In other words, > > how should we handle the case where nprocs == 0 (i.e., nprocs has not been > > incrmented at all)? This would very rarely happen, but can happen. > > In this case, since buf.data is empty, at least there seems no need to log > > the list of conflicting processes in detail message. > > Yes, I noticed this too; this can be simplified by changing the > condition in the ereport() call to be "nprocs > 0" (rather than > wait_list being null), otherwise not print the errdetail. (You could > test buf.data or buf.len instead, but that seems uglier to me.) +1 Maybe we can also improve the comment of this function from: + * This function also reports the details about the conflicting + * process ids if *wait_list is not NULL. to " This function also reports the details about the conflicting process ids if exist" or something. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
Re: runtime error copying oids field
Hi,
See attached patch which is along the line Alvaro outlined.

Cheers

On Mon, Nov 30, 2020 at 3:01 PM Alvaro Herrera wrote:
> On 2020-Nov-30, Zhihong Yu wrote:
>
> > This was the line runtime error was raised:
> >
> > memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
> >
> > From RelationBuildPartitionDesc we can see that:
> >
> > if (nparts > 0)
> > {
> >     PartitionBoundInfo boundinfo;
> >     int        *mapping;
> >     int         next_index = 0;
> >
> >     result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
> >
> > The cause was oids field was not assigned due to nparts being 0.
> > This is verified by additional logging added just prior to the memcpy call.
> >
> > I want to get the community's opinion on whether a null check should be
> > added prior to the memcpy() call.
>
> As far as I understand, we do want to avoid memcpy's of null pointers;
> see [1].
>
> In this case I think it'd be sane to skip the complete block, not just
> the memcpy, something like
>
> diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
> index ca24620fd0..d35deb433a 100644
> --- a/src/backend/commands/indexcmds.c
> +++ b/src/backend/commands/indexcmds.c
> @@ -1163,15 +1163,17 @@ DefineIndex(Oid relationId,
>
>  	if (partitioned)
>  	{
> +		PartitionDesc partdesc;
> +
>  		/*
>  		 * Unless caller specified to skip this step (via ONLY), process each
>  		 * partition to make sure they all contain a corresponding index.
>  		 *
>  		 * If we're called internally (no stmt->relation), recurse always.
>  		 */
> -		if (!stmt->relation || stmt->relation->inh)
> +		partdesc = RelationGetPartitionDesc(rel);
> +		if ((!stmt->relation || stmt->relation->inh) &&
> +			partdesc->nparts > 0)
>  		{
> -			PartitionDesc partdesc = RelationGetPartitionDesc(rel);
>  			int			nparts = partdesc->nparts;
>  			Oid		   *part_oids = palloc(sizeof(Oid) * nparts);
>  			bool		invalidate_parent = false;
>
> [1]
> https://www.postgresql.org/message-id/flat/20200904023648.GB3426768%40rfd.leadboat.com

v01-check-nparts-for-defining-idx.patch
Description: Binary data
Re: 回复: [PATCH] BUG FIX: Core dump could happen when VACUUM FULL in standalone mode
On Tue, Dec 1, 2020 at 3:54 AM Tom Lane wrote: > > Yulin PEI writes: > > Yes, I agree because (IsNormalProcessingMode() ) means that current process > > is not in bootstrap mode and postmaster process will not build index. > > So my new modified patch is attached. > > This is a good catch, but the proposed fix still seems pretty random > and unlike how it's done elsewhere. It seems to me that since > index_build() is relying on plan_create_index_workers() to assess > parallel safety, that's where to check IsUnderPostmaster. Moreover, > the existing code in compute_parallel_vacuum_workers (which gets > this right) associates the IsUnderPostmaster check with the initial > check on max_parallel_maintenance_workers. So I think that the > right fix is to adopt the compute_parallel_vacuum_workers coding > in plan_create_index_workers, and thereby create a model for future > uses of max_parallel_maintenance_workers to follow. +1 Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
Re: Printing backtrace of postgres processes
Craig Ringer writes: >> I would like to see some discussion of the security implications >> of such a feature, as well. ("There aren't any" is the wrong >> answer.) > If the stack only goes to the log, I actually don't think there are > significant security implications beyond those we already have with > our existing backtrace printing features. We already trust anyone who > can read the log almost completely, and we can already emit stacks to > the log. But I'd still want it to be gated superuser-only, or a role > that's GRANTable by superuser only by default, since it exposes > arbitrary internals of the server. The concerns that I had were that the patch as submitted provides a mechanism that causes ALL processes in the system to dump backtraces, not a targeted request; and that it allows any user to issue such requests at an unbounded rate. That seems like a really easy route to denial of service. There's also a question of whether you'd even get intelligible results from dozens of processes simultaneously dumping many-line messages to the same place. (This might work out all right if you're using our syslogger, but it probably would not with any other logging technology.) I'd feel better about it if the mechanism had you specify exactly one target process, and were restricted to a superuser requestor. I'm not excited about adding on frammishes like letting one process extract another's stack trace. I think that just adds more points of failure, which is a bad thing in a feature that you're only going to care about when things are a bit pear-shaped already. regards, tom lane
Re: runtime error copying oids field
On 2020-Nov-30, Zhihong Yu wrote:

> This was the line runtime error was raised:
>
> memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
>
> From RelationBuildPartitionDesc we can see that:
>
> if (nparts > 0)
> {
>     PartitionBoundInfo boundinfo;
>     int        *mapping;
>     int         next_index = 0;
>
>     result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
>
> The cause was oids field was not assigned due to nparts being 0.
> This is verified by additional logging added just prior to the memcpy call.
>
> I want to get the community's opinion on whether a null check should be
> added prior to the memcpy() call.

As far as I understand, we do want to avoid memcpy's of null pointers;
see [1].

In this case I think it'd be sane to skip the complete block, not just
the memcpy, something like

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca24620fd0..d35deb433a 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1163,15 +1163,17 @@ DefineIndex(Oid relationId,
 
 	if (partitioned)
 	{
+		PartitionDesc partdesc;
+
 		/*
 		 * Unless caller specified to skip this step (via ONLY), process each
 		 * partition to make sure they all contain a corresponding index.
 		 *
 		 * If we're called internally (no stmt->relation), recurse always.
 		 */
-		if (!stmt->relation || stmt->relation->inh)
+		partdesc = RelationGetPartitionDesc(rel);
+		if ((!stmt->relation || stmt->relation->inh) && partdesc->nparts > 0)
 		{
-			PartitionDesc partdesc = RelationGetPartitionDesc(rel);
 			int			nparts = partdesc->nparts;
 			Oid		   *part_oids = palloc(sizeof(Oid) * nparts);
 			bool		invalidate_parent = false;

[1] https://www.postgresql.org/message-id/flat/20200904023648.GB3426768%40rfd.leadboat.com
Re: [BUG] orphaned function
"Drouvot, Bertrand" writes:
> here is a scenario that produces an orphaned function (means it does not
> belong to any namespace):
> [ drop schema before committing function creation ]

Hm.  Historically we've not been too fussed about preventing such race
conditions, and I wonder just how far it is sane to take it.  Should we
acquire lock on the function's argument/result data types?  Its
language?

Given the precedent of RangeVarGetAndCheckCreationNamespace, I'm willing
to accept this patch's goals as stated.  But it feels like we need some
clearer shared understanding of which things we are willing to bother
with preventing races for, and which we aren't.

> Please find attached a patch that adds a LockDatabaseObject() call in
> QualifiedNameGetCreationNamespace() to prevent such orphaned situations.

I don't think that actually succeeds in preventing the race, it'll just
delay your process till the deletion is committed.  But you already have
the namespaceId.  Note the complex retry logic in
RangeVarGetAndCheckCreationNamespace; without something on the same
order, you're not closing the hole in any meaningful way.

Likely what this patch should do is refactor that function so that its
guts can be used for other object-creation scenarios.  (The fact that
this is so far from trivial is one reason I'm not in a hurry to extend
this sort of protection to other dependencies.)

			regards, tom lane
runtime error copying oids field
Hi,
In our testPgRegressTrigger test log, I saw the following (this was for
a relatively old version of PG):

197859 [ts-1] ../../../../../../src/postgres/src/backend/commands/indexcmds.c:1062:22: runtime error: null pointer passed as argument 2, which is declared to never be null
197860 [ts-1] /opt/yb-build/brew/linuxbrew-20181203T161736v9/include/string.h:43:28: note: nonnull attribute specified here
197861 [ts-1]     #0 0xacbd0f in DefineIndex $YB_SRC_ROOT/src/postgres/src/backend/commands/../../../../../../src/postgres/src/backend/commands/indexcmds.c:1062:4
197862 [ts-1]     #1 0x11441e0 in ProcessUtilitySlow $YB_SRC_ROOT/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/utility.c:1436:7
197863 [ts-1]     #2 0x114141f in standard_ProcessUtility $YB_SRC_ROOT/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/utility.c:962:4
197864 [ts-1]     #3 0x1140b65 in YBProcessUtilityDefaultHook $YB_SRC_ROOT/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/utility.c:3574:3
197865 [ts-1]     #4 0x7f47d4950eac in pgss_ProcessUtility $YB_SRC_ROOT/src/postgres/contrib/pg_stat_statements/../../../../../src/postgres/contrib/pg_stat_statements/pg_stat_statements.c:1120:4

This was the line where the runtime error was raised:

    memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);

From RelationBuildPartitionDesc we can see that:

    if (nparts > 0)
    {
        PartitionBoundInfo boundinfo;
        int        *mapping;
        int         next_index = 0;

        result->oids = (Oid *) palloc0(nparts * sizeof(Oid));

The cause was that the oids field was never assigned because nparts was
0.  This is verified by additional logging added just prior to the
memcpy call.

I want to get the community's opinion on whether a null check should be
added prior to the memcpy() call.

Cheers
Re: proposal: unescape_text function
po 30. 11. 2020 v 22:15 odesílatel Pavel Stehule napsal: > > > po 30. 11. 2020 v 14:14 odesílatel Peter Eisentraut < > peter.eisentr...@enterprisedb.com> napsal: > >> On 2020-11-29 18:36, Pavel Stehule wrote: >> > >> > I don't really get the point of this function. There is AFAICT no >> > function to produce this escaped format, and it's not a recognized >> > interchange format. So under what circumstances would one need to >> > use this? >> > >> > >> > Some corporate data can be in CSV format with escaped unicode >> > characters. Without this function it is not possible to decode these >> > files without external application. >> >> I would like some supporting documentation on this. So far we only have >> one stackoverflow question, and then this implementation, and they are >> not even the same format. My worry is that if there is not precise >> specification, then people are going to want to add things in the >> future, and there will be no way to analyze such requests in a >> principled way. >> >> > I checked this and it is "prefix backslash-u hex" used by Java, > JavaScript or RTF - > https://billposer.org/Software/ListOfRepresentations.html > > In some languages (Python), there is decoder "unicode-escape". Java has a > method escapeJava, for conversion from unicode to ascii. I can imagine so > these data are from Java systems exported to 8bit strings - so this > implementation can be accepted as referential. This format is used by > https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html > tool too. > > Postgres can decode this format too, and the patch is based on Postgres > implementation. I just implemented a different interface. > > Currently decode function does only text->bytea transformation. Maybe a > more generic function "decode_text" and "encode_text" for similar cases can > be better (here we need text->text transformation). But it looks like > overengineering now. 
> Maybe we introduce new encoding "ascii" and we can implement new
> conversions "ascii_to_utf8" and "utf8_to_ascii". It looks like the most
> clean solution. What do you think about it?

A better name for the new encoding might be "unicode-escape" rather than
"ascii"; we already use the "to_ascii" function for a different use case.

set client_encoding to unicode-escape;
copy tab from xxx;
...

but it doesn't help when only a few columns from the table are in
unicode-escape format.

> Regards
>
> Pavel
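For what it's worth, the "prefix backslash-u hex" format discussed above is exactly what Python's unicode_escape codec handles, which makes it easy to check the expected behavior of such a decoder:

```python
import codecs

# Decode \uXXXX escapes of the kind produced by Java's native2ascii /
# escapeJava: the unicode_escape codec turns ASCII input with escapes
# back into the original text.
escaped = "Post\\u0067reSQL caf\\u00e9"          # ASCII input with escapes
decoded = codecs.decode(escaped.encode("ascii"), "unicode_escape")
print(decoded)  # -> PostgreSQL café
```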
Re: [DOC] Document concurrent index builds waiting on each other
On 2020-Sep-30, Michael Paquier wrote:

> + 
> + CREATE INDEX (including the CONCURRENTLY
> + option) commands are included when VACUUM calculates what
> + dead tuples are safe to remove even on tables other than the one being indexed.
> + 
> FWIW, this is true as well for REINDEX CONCURRENTLY because both use
> the same code paths for index builds and validation, with basically
> the same waiting phases.  But is CREATE INDEX the correct place for
> that?  Wouldn't it be better to tell about such things on the VACUUM
> doc?

Yeah, I think it might be more sensible to document this in
maintenance.sgml, as part of the paragraph that discusses removing
tuples "to save space".  But making it inline with the rest of the flow,
it seems to distract from higher-level considerations, so I suggest to
make it a footnote instead.

I'm not sure on the wording to use; what about this?

From 6c9ad72e4e61dbf05f34146cb67706dd675a38f0 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 30 Nov 2020 18:50:06 -0300
Subject: [PATCH v5] Note CIC and RC in vacuum's doc

Per James Coleman
---
 doc/src/sgml/maintenance.sgml | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..d68d7f936e 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -159,7 +159,17 @@
 concurrency control (MVCC, see ): the row version
 must not be deleted while it is still potentially visible to other
 transactions. But eventually, an outdated or deleted row version is no
-longer of interest to any transaction. The space it occupies must then be
+longer of interest to any transaction.
+
+ 
+  Note that VACUUM needs to retain tuples that are
+  nominally visible to transactions running
+  CREATE INDEX CONCURRENTLY or
+  REINDEX CONCURRENTLY, even when run on tables
+  other than the tables being indexed.
+ 
+
+The space it occupies must then be
 reclaimed for reuse by new rows, to avoid unbounded growth of disk
 space requirements. This is done by running VACUUM.
-- 
2.20.1
Re: support IncrementalSortPath type in outNode()
"Hou, Zhijie" writes: > I found IncrementalSortPath type is not analyzed in outNode. Indeed, that's supposed to be supported. Pushed, thanks for the patch! regards, tom lane
Re: proposal: unescape_text function
On Mon, Nov 30, 2020 at 2:14 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 2020-11-29 18:36, Pavel Stehule wrote:
> >
> >     I don't really get the point of this function.  There is AFAICT no
> >     function to produce this escaped format, and it's not a recognized
> >     interchange format.  So under what circumstances would one need to
> >     use this?
> >
> > Some corporate data can be in CSV format with escaped unicode
> > characters. Without this function it is not possible to decode these
> > files without an external application.
>
> I would like some supporting documentation on this.  So far we only have
> one stackoverflow question, and then this implementation, and they are
> not even the same format.  My worry is that if there is no precise
> specification, then people are going to want to add things in the
> future, and there will be no way to analyze such requests in a
> principled way.

I checked this and it is the "prefix backslash-u hex" format used by
Java, JavaScript or RTF -
https://billposer.org/Software/ListOfRepresentations.html

In some languages (Python), there is a decoder "unicode-escape". Java has
a method escapeJava for conversion from unicode to ascii. I can imagine
that these data come from Java systems exported to 8-bit strings, so this
implementation can be accepted as a reference. This format is used by the
https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html
tool too.

Postgres can decode this format too, and the patch is based on the
Postgres implementation. I just implemented a different interface.

Currently the decode function does only the text->bytea transformation.
Maybe more generic functions "decode_text" and "encode_text" for similar
cases would be better (here we need a text->text transformation). But it
looks like overengineering now.

Maybe we introduce a new encoding "ascii" and implement new conversions
"ascii_to_utf8" and "utf8_to_ascii". It looks like the cleanest
solution. What do you think about it?
Regards Pavel
Re: Printing LSN made easy
Hi,

On 2020-11-29 12:10:21 -0500, Tom Lane wrote:
> Agreed.  snprintf.c is meant to implement a recognized standard
> (ok, %m is a GNU extension, but it's still pretty standard).
> I'm not on board with putting PG-only extensions in there.

I wonder if there's something we could layer on the elog.c level
instead. But I also don't like the idea of increasing the use of wrapper
functions that need to allocate string buffers that then need to get
copied in turn.

We could do something like
  errmsg("plain string arg: %s, conv string arg: %s", somestr, estr_lsn(lsn))
where estr_lsn() wouldn't do any conversion directly, instead setting up
a callback in ErrorData that knows how to do the type specific
conversion. Then during EVALUATE_MESSAGE() we'd evaluate the message
piecemeal and call the output callbacks for each arg, using the
StringInfo.

There are two main issues with something roughly like this:
1) We'd need to do format string parsing somewhere above snprintf.c,
   which isn't free.
2) Without relying on C11 / _Generic() some ugly macro hackery would be
   needed to have an argument-number indexed state.

If we did rely on _Generic() we probably could do better, even getting
rid of the need for something like estr_lsn().

Greetings,

Andres Freund
Re: Improving spin-lock implementation on ARM.
On Mon, Nov 30, 2020 at 7:00 AM Krunal Bauskar wrote:
> 3. Problem with GCC approach is still a lot of distros don't support
>    gcc 9.4 as default.
>    To use this approach:
>    * PGSQL will have to roll out its packages using gcc-9.4+ only so that
>      they are compatible with all aarch64 machines
>    * but this continues to affect all other users who tend to build pgsql
>      using a standard distro based compiler (unless they upgrade the
>      compiler).

I think a user who runs PostgreSQL on a multicore machine with a highly
concurrent load can take care of installing the appropriate
package/using the appropriate compiler (especially if we publish
explicit instructions for that), in the same way such advanced users
tune the Linux kernel etc.

BTW, how did you determine that the required gcc version is 9.4?  I've
managed to use LSE with gcc 9.3.

--
Regards,
Alexander Korotkov
Re: Improving spin-lock implementation on ARM.
On Mon, Nov 30, 2020 at 9:21 PM Tom Lane wrote: > Alexander Korotkov writes: > > I tend to think that LSE is enabled by default in Apple's clang based > > on your previous message[1]. In order to dispel the doubts could you > > please provide assembly of SpinLockAcquire for following clang > > options. > > "-O2" > > "-O2 -march=armv8-a+lse" > > "-O2 -march=armv8-a" > > Huh. Those options make exactly zero difference to the code generated > for SpinLockAcquire/SpinLockRelease; it's the same as I showed upthread, > for either the HEAD definition of TAS() or the CAS patch's version. > > So now I'm at a loss as to the reason for the performance difference > I got. -march=armv8-a+lse does make a difference to code generation > someplace, because the overall size of the postgres executable changes > by 16kB or so. One might argue that the performance difference is due > to better code elsewhere than the spinlocks ... but the test I'm running > is basically just > > while (count-- > 0) > { > XLogGetLastRemovedSegno(); > > CHECK_FOR_INTERRUPTS(); > } > > so it's hard to see where a non-spinlock-related code change would come > in. That loop itself definitely generates the same code either way. > > I did find this interesting output from "clang -v": > > -target-cpu vortex -target-feature +v8.3a -target-feature +fp-armv8 > -target-feature +neon -target-feature +crc -target-feature +crypto > -target-feature +fullfp16 -target-feature +ras -target-feature +lse > -target-feature +rdm -target-feature +rcpc -target-feature +zcm > -target-feature +zcz -target-feature +sha2 -target-feature +aes > > whereas adding -march=armv8-a+lse changes that to just > > -target-cpu vortex -target-feature +neon -target-feature +lse -target-feature > +zcm -target-feature +zcz > > On the whole, that would make one think that -march=armv8-a+lse > should produce worse code than the default settings. Great, thanks. So, I think the following hypothesis isn't disproved yet. 
1) On ARM with LSE support, PostgreSQL built with LSE is faster than PostgreSQL built without LSE. Even if the latter is patched with anything considered in this thread. 2) None of the patches considered in this thread give a clear advantage for PostgreSQL built with LSE. To further confirm this let's wait for Kunpeng 920 tests by Krunal Bauskar and Amit Khandekar. Also it would be nice if someone will run benchmarks similar to [1] on Apple M1. Links 1. https://www.postgresql.org/message-id/CAPpHfdsGqVd6EJ4mr_RZVE5xSiCNBy4MuSvdTrKmTpM0eyWGpg%40mail.gmail.com -- Regards, Alexander Korotkov
Re: [doc] remove reference to pg_dump pre-8.1 switch behaviour
Michael Paquier writes: > So this comes down to 5 items, as per the attached. Thoughts? These items look fine to me, except this bit seems a bit awkward: + Note that the delayed indexing technique used for GIN + (see for details) makes this advice + less necessary, but for very large updates it may still be best to + drop and recreate the index. Less necessary than what? Maybe instead write When fastupdate is enabled (see ...), the penalty is much less than when it is not. But for very large updates it may still be best to drop and recreate the index. regards, tom lane
Re: range_agg
On Mon, Nov 30, 2020 at 10:35 PM Alexander Korotkov wrote: > On Sun, Nov 29, 2020 at 11:53 PM Paul A Jungwirth > wrote: > > > > On Sun, Nov 29, 2020 at 11:43 AM Alexander Korotkov > > wrote: > > > Thank you. Could you please, update doc/src/sgml/catalogs.sgml, > > > because pg_type and pg_range catalogs are updated. > > > > Attached! :-) > > You're quick, thank you. Please, also take a look at cfbot failure > https://travis-ci.org/github/postgresql-cfbot/postgresql/builds/746623942 > I've tried to reproduce it, but didn't manage yet. Got it. type_sanity test fails on any platform, you just need to repeat "make check" till it fails. The failed query checked consistency of range types, but it didn't take into account ranges of domains and ranges of records, which are exercised by multirangetypes test running in parallel. We could teach this query about such kinds of ranges, but I think that would be overkill, because we're not going to introduce such builtin ranges yet. So, I'm going to just move multirangetypes test into another group of parallel tests. -- Regards, Alexander Korotkov
Re: Change JOIN tutorial to focus more on explicit joins
On Mon, Nov 30, 2020 at 1:15 PM Jürgen Purtz wrote: > On 30.11.20 20:45, Anastasia Lubennikova wrote: > > As far as I see something got committed and now the discussion is stuck > in arguing about parenthesis. > > FWIW, I think it is a matter of personal taste. Maybe we can compromise > on simply leaving this part unchanged. > > With or without parenthesis is a little more than a personal taste, but > it's a very tiny detail. I'm happy with either of the two variants. > > Sorry, I managed to overlook the most recent patch. I admitted my use of parentheses was incorrect and I don't see anyone else defending them. Please remove them. Minor typos: "the database compare" -> needs an "s" (compares) "In this case, the definition how to compare their rows." -> remove, redundant with the first sentence "The results from the older implicit syntax, and the newer explicit JOIN/ON syntax, are identical" -> move the commas around to what is shown here David J.
Re: [patch] [doc] Introduce view updating options more succinctly
The following review has been posted through the commitfest application: make installcheck-world: not tested Implements feature: not tested Spec compliant: not tested Documentation: tested, passed I wonder why this patch didn't get a review during the CF. This minor improvement looks good to me, so I mark it Ready for Committer. The new status of this patch is: Ready for Committer
Re: Change JOIN tutorial to focus more on explicit joins
On 30.11.20 20:45, Anastasia Lubennikova wrote: As far as I see something got committed and now the discussion is stuck in arguing about parenthesis. FWIW, I think it is a matter of personal taste. Maybe we can compromise on simply leaving this part unchanged. With or without parenthesis is a little more than a personal taste, but it's a very tiny detail. I'm happy with either of the two variants. -- J. Purtz
Re: [DOC] Document concurrent index builds waiting on each other
On 2020-Nov-30, Anastasia Lubennikova wrote: > The commitfest is nearing the end and I wonder what is this discussion > waiting for. > It looks like the proposed patch received its fair share of review, so > I mark it as ReadyForCommitter and lay responsibility for the final > decision on them. I'll get these pushed now, thanks for the reminder.
Re: [DOC] Document concurrent index builds waiting on each other
Status update for a commitfest entry. The commitfest is nearing the end and I wonder what this discussion is waiting for. It looks like the proposed patch received its fair share of review, so I mark it as ReadyForCommitter and lay responsibility for the final decision on them. The new status of this patch is: Ready for Committer
{CREATE INDEX, REINDEX} CONCURRENTLY improvements
Hello, In a previous thread [1], we added smarts so that processes running CREATE INDEX CONCURRENTLY would not wait for each other. One remaining piece is adding the same to REINDEX CONCURRENTLY. I've attached patch 0002 here which does that. Why 0002, you ask? That's because preparatory patch 0001 simplifies the ReindexRelationConcurrently somewhat by adding a struct for the indexes that are going to be processed, instead of just a list of Oids. This is a good change in itself because it lets us get rid of duplicative open/close of the index rels in order to obtain some info that's already known at the start. The other thing is that it'd be good if we can make VACUUM also ignore Xmin of processes doing CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY, when possible. I have two possible ideas to handle this, about which I'll post later. [1] https://postgr.es/m/20200810233815.GA18970@alvherre.pgsql -- Álvaro Herrera Valdivia, Chile >From 623a460b791dd873ae5daf6a0cd4e8f446a772f8 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 30 Nov 2020 16:01:13 -0300 Subject: [PATCH 1/2] create ReindexIndexInfo --- src/backend/commands/indexcmds.c | 130 --- src/tools/pgindent/typedefs.list | 1 + 2 files changed, 70 insertions(+), 61 deletions(-) diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index ca24620fd0..b1ce83e1dd 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -114,6 +114,16 @@ typedef struct ReindexErrorInfo char relkind; } ReindexErrorInfo; +/* + * Index to process in ReindexRelationConcurrently + */ +typedef struct ReindexIndexInfo +{ + Oid indexId; + Oid tableId; + Oid amId; +} ReindexIndexInfo; + /* * CheckIndexCompatible * Determine whether an existing index definition is compatible with a @@ -3132,10 +3142,16 @@ ReindexRelationConcurrently(Oid relationOid, int options) get_rel_name(cellOid; else { + ReindexIndexInfo *idx; + /* Save the list of relation OIDs in private context */ oldcontext = 
MemoryContextSwitchTo(private_context); - indexIds = lappend_oid(indexIds, cellOid); + idx = palloc(sizeof(ReindexIndexInfo)); + idx->indexId = cellOid; + /* other fields set later */ + + indexIds = lappend(indexIds, idx); MemoryContextSwitchTo(oldcontext); } @@ -3172,13 +3188,18 @@ ReindexRelationConcurrently(Oid relationOid, int options) get_rel_name(cellOid; else { + ReindexIndexInfo *idx; + /* * Save the list of relation OIDs in private * context */ oldcontext = MemoryContextSwitchTo(private_context); - indexIds = lappend_oid(indexIds, cellOid); + idx = palloc(sizeof(ReindexIndexInfo)); + idx->indexId = cellOid; + indexIds = lappend(indexIds, idx); + /* other fields set later */ MemoryContextSwitchTo(oldcontext); } @@ -3197,6 +3218,7 @@ ReindexRelationConcurrently(Oid relationOid, int options) Oid heapId = IndexGetRelation(relationOid, (options & REINDEXOPT_MISSING_OK) != 0); Relation heapRelation; +ReindexIndexInfo *idx; /* if relation is missing, leave */ if (!OidIsValid(heapId)) @@ -3247,7 +3269,10 @@ ReindexRelationConcurrently(Oid relationOid, int options) * Save the list of relation OIDs in private context. Note * that invalid indexes are allowed here. 
*/ -indexIds = lappend_oid(indexIds, relationOid); +idx = palloc(sizeof(ReindexIndexInfo)); +idx->indexId = relationOid; +indexIds = lappend(indexIds, idx); +/* other fields set later */ MemoryContextSwitchTo(oldcontext); break; @@ -3306,31 +3331,36 @@ ReindexRelationConcurrently(Oid relationOid, int options) foreach(lc, indexIds) { char *concurrentName; - Oid indexId = lfirst_oid(lc); + ReindexIndexInfo *idx = lfirst(lc); + ReindexIndexInfo *newidx; Oid newIndexId; Relation indexRel; Relation heapRel; Relation newIndexRel; LockRelId *lockrelid; - indexRel = index_open(indexId, ShareUpdateExclusiveLock); + indexRel = index_open(idx->indexId, ShareUpdateExclusiveLock); heapRel = table_open(indexRel->rd_index->indrelid, ShareUpdateExclusiveLock); + idx->tableId = RelationGetRelid(heapRel); + idx->amId = indexRel->rd_rel->relam; + /* This function shouldn't be called for temporary relations. */ if (indexRel->rd_rel->relpersistence == RELPERSISTENCE_TEMP) elog(ERROR, "cannot reindex a temporary table concurrently"); pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, - RelationGetRelid(heapRel)); + idx->tableId); + progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY; progress_vals[1] = 0; /* initializing */ - progress_vals[2] = indexId; - progress_vals[3] = indexRel->rd_rel->relam; + progress_vals[2] =
Re: Change JOIN tutorial to focus more on explicit joins
Status update for a commitfest entry. The commitfest is nearing the end and this thread was inactive for a while. As far as I see something got committed and now the discussion is stuck in arguing about parenthesis. FWIW, I think it is a matter of personal taste. Maybe we can compromise on simply leaving this part unchanged. If you are planning to continue working on it, please move it to the next CF.
Re: range_agg
On Sun, Nov 29, 2020 at 11:53 PM Paul A Jungwirth wrote: > > On Sun, Nov 29, 2020 at 11:43 AM Alexander Korotkov > wrote: > > Thank you. Could you please, update doc/src/sgml/catalogs.sgml, > > because pg_type and pg_range catalogs are updated. > > Attached! :-) You're quick, thank you. Please, also take a look at cfbot failure https://travis-ci.org/github/postgresql-cfbot/postgresql/builds/746623942 I've tried to reproduce it, but didn't manage yet. -- Regards, Alexander Korotkov
Re: remove spurious CREATE INDEX CONCURRENTLY wait
In the interest of showing progress, I'm going to mark this CF item as committed, and I'll submit the remaining pieces in a new thread. Thanks!
Re: Add docs stub for recovery.conf
On Mon, Nov 30, 2020 at 11:42 AM Bruce Momjian wrote: > > The downside is you end up with X-1 dummy sections just to allow for > references to old syntax, and you then have to find them all and remove > them when you implement the proper solution. I have no intention of > applying such an X-1 fix. > > X = 2; seems like a strong objection for such a minor issue. The status quo seems worse than that. David J.
Re: [PATCH] remove deprecated v8.2 containment operators
Justin Pryzby writes: > I think this is waiting on me to provide a patch for the contrib/ modules with > an update script removing the SQL operators, and documenting their deprecation. Right. We can remove the SQL operators, but not (yet) the C code support. I'm not sure that the docs change would do more than remove any existing mentions of the operators. regards, tom lane
Re: [PATCH] remove deprecated v8.2 containment operators
On Mon, Nov 30, 2020 at 09:51:12PM +0300, Anastasia Lubennikova wrote: > On 16.11.2020 23:55, Justin Pryzby wrote: > > On Fri, Nov 13, 2020 at 10:03:43AM -0500, Stephen Frost wrote: > > > * Magnus Hagander (mag...@hagander.net) wrote: > > > > On Thu, Nov 12, 2020 at 11:28 PM Tom Lane wrote: > > > > > > The changes to the contrib modules appear to be incomplete in some > > > > > > ways. > > > > > >In cube, hstore, and seg, there are no changes to the extension > > > > > > scripts to remove the operators. All you're doing is changing the C > > > > > > code to no longer recognize the strategy, but that doesn't explain > > > > > > what > > > > > > will happen if the operator is still used. In intarray, by > > > > > > contrast, > > > > > > you're editing an existing extension script, but that should be > > > > > > done by > > > > > > an upgrade script instead. > > > > > In the contrib modules, I'm afraid what you gotta do is remove the > > > > > SQL operator definitions but leave the opclass code support in place. > > > > > That's because there's no guarantee that users will update the > > > > > extension's > > > > > SQL version immediately, so a v14 build of the .so might still be used > > > > > with the old SQL definitions. It's not clear how much window we need > > > > > give for people to do that update, but I don't think "zero" is an > > > > > acceptable answer. > > > > Based on my experience from the field, the answer is "never". > > > > > > > > As in, most people have no idea they are even *supposed* to do such an > > > > upgrade, so they don't do it. Until we solve that problem, I think > > > > we're basically stuck with keeping them "forever". (and even if/when > > > > we do, "zero" is probably not going to cut it, no) > > > Yeah, this is a serious problem and one that we should figure out a way > > > to fix or at least improve on- maybe by having pg_upgrade say something > > > about extensions that could/should be upgraded..? 
> > I think what's needed are: > > > > 1) a way to *warn* users about deprecation. CREATE EXTENSION might give an > > elog(WARNING), but it's probably not enough. It only happens once, and if > > it's > > in pg_restore/pg_upgrade, it be wrapped by vendor upgrade scripts. It > > needs to > > be more like first function call in every session. Or more likely, put it > > in > > documentation for 10 years. > > > > 2) a way to *enforce* it, like making CREATE EXTENSION fail when run > > against an > > excessively old server, including by pg_restore/pg_upgrade (which ought to > > also > > handle it in --check). > > > > Are there any contrib for which (1) is done and we're anywhere near doing > > (2) ? > > Status update for a commitfest entry. > > The commitfest is nearing the end and this thread is "Waiting on Author". As I think this is waiting on me to provide a patch for the contrib/ modules with update script removing the SQL operators, and documentating their deprecation. Is that right ? -- Justin
Re: [PATCH] remove deprecated v8.2 containment operators
Anastasia Lubennikova writes: > Status update for a commitfest entry. > The commitfest is nearing the end and this thread is "Waiting on > Author". As far as I see we don't have a patch here and discussion is a > bit stuck. > So, I am planning to return it with feedback. Any objections? AFAICS, the status is (1) core-code changes are committed; (2) proposed edits to contrib modules need significant rework; (3) there was also a bit of discussion about inventing a mechanism to prod users to update out-of-date extensions. Now, (3) is far outside the scope of this patch, and I do not think it should block finishing (2). We need a new patch for (2), but there's no real doubt as to what it should contain -- Justin just needs to turn the crank. You could either move this to the next CF in state WoA, or close it RWF. But the patch did make progress in this CF, so I'd tend to lean to the former. regards, tom lane
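Item (2) above amounts to an extension upgrade script per contrib module that drops the deprecated operator spellings while leaving the C-level opclass support in place, so that a new .so still works against older SQL definitions. A hypothetical sketch for cube (the file name and version numbers are assumptions, not the eventual committed script):

```sql
-- Hypothetical cube--1.4--1.5.sql upgrade script (version numbers assumed).
-- Drop the deprecated v8.2 containment operator spellings; the modern
-- <@ and @> operators and the C opclass support functions are untouched.
DROP OPERATOR @ (cube, cube);
DROP OPERATOR ~ (cube, cube);
```

Users who never run ALTER EXTENSION ... UPDATE keep the old operators, which is exactly the "window" problem discussed upthread.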
Re: 回复: [PATCH] BUG FIX: Core dump could happen when VACUUM FULL in standalone mode
Yulin PEI writes: > Yes, I agree because (IsNormalProcessingMode() ) means that current process > is not in bootstrap mode and postmaster process will not build index. > So my new modified patch is attached. This is a good catch, but the proposed fix still seems pretty random and unlike how it's done elsewhere. It seems to me that since index_build() is relying on plan_create_index_workers() to assess parallel safety, that's where to check IsUnderPostmaster. Moreover, the existing code in compute_parallel_vacuum_workers (which gets this right) associates the IsUnderPostmaster check with the initial check on max_parallel_maintenance_workers. So I think that the right fix is to adopt the compute_parallel_vacuum_workers coding in plan_create_index_workers, and thereby create a model for future uses of max_parallel_maintenance_workers to follow. regards, tom lane diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 247f7d4625..1a94b58f8b 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -6375,8 +6375,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid) double reltuples; double allvisfrac; - /* Return immediately when parallelism disabled */ - if (max_parallel_maintenance_workers == 0) + /* + * We don't allow performing parallel operation in standalone backend or + * when parallelism is disabled. + */ + if (!IsUnderPostmaster || max_parallel_maintenance_workers == 0) return 0; /* Set up largely-dummy planner state */
Re: [PATCH] remove deprecated v8.2 containment operators
On 16.11.2020 23:55, Justin Pryzby wrote: On Fri, Nov 13, 2020 at 10:03:43AM -0500, Stephen Frost wrote: * Magnus Hagander (mag...@hagander.net) wrote: On Thu, Nov 12, 2020 at 11:28 PM Tom Lane wrote: The changes to the contrib modules appear to be incomplete in some ways. In cube, hstore, and seg, there are no changes to the extension scripts to remove the operators. All you're doing is changing the C code to no longer recognize the strategy, but that doesn't explain what will happen if the operator is still used. In intarray, by contrast, you're editing an existing extension script, but that should be done by an upgrade script instead. In the contrib modules, I'm afraid what you gotta do is remove the SQL operator definitions but leave the opclass code support in place. That's because there's no guarantee that users will update the extension's SQL version immediately, so a v14 build of the .so might still be used with the old SQL definitions. It's not clear how much window we need give for people to do that update, but I don't think "zero" is an acceptable answer. Based on my experience from the field, the answer is "never". As in, most people have no idea they are even *supposed* to do such an upgrade, so they don't do it. Until we solve that problem, I think we're basically stuck with keeping them "forever". (and even if/when we do, "zero" is probably not going to cut it, no) Yeah, this is a serious problem and one that we should figure out a way to fix or at least improve on- maybe by having pg_upgrade say something about extensions that could/should be upgraded..? I think what's needed are: 1) a way to *warn* users about deprecation. CREATE EXTENSION might give an elog(WARNING), but it's probably not enough. It only happens once, and if it's in pg_restore/pg_upgrade, it be wrapped by vendor upgrade scripts. It needs to be more like first function call in every session. Or more likely, put it in documentation for 10 years. 
2) a way to *enforce* it, like making CREATE EXTENSION fail when run against an excessively old server, including by pg_restore/pg_upgrade (which ought to also handle it in --check). Are there any contrib for which (1) is done and we're anywhere near doing (2) ? Status update for a commitfest entry. The commitfest is nearing the end and this thread is "Waiting on Author". As far as I see we don't have a patch here and discussion is a bit stuck. So, I am planning to return it with feedback. Any objections? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Add docs stub for recovery.conf
On Mon, Nov 30, 2020 at 11:31:35AM -0700, David G. Johnston wrote: > On Mon, Nov 30, 2020 at 11:25 AM Bruce Momjian wrote: > > On Mon, Nov 30, 2020 at 10:11:04AM +0800, Craig Ringer wrote: > > Can we please just address this docs issue? If you don't like my > solution > can > > you please supply a patch that you feel addresses the problem? Or > clearly > state > > that you don't think there is a problem, and do so in a way that > actually > > addresses the specific points I have raised about what's wrong with the > status > > quo? > > If we know there are X problems, and we fix one of them one way, then > later fix the rest another way, we have to undo the first fix. If you > don't want to fix all X, then let's wait until someone does want to fix > them all. > > IMO there is only the original problem with an acceptable solution presented > that can be committed without downside. If that has to be undone because > someone else in the future decides on a different solution that happens to > touch this too, fine, it can be changed again. The downside is you end up with X-1 dummy sections just to allow for references to old syntax, and you then have to find them all and remove them when you implement the proper solution. I have no intention of applying such an X-1 fix. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Improve handling of parameter differences in physical replication
On 2020-11-20 16:47, Sergei Kornilov wrote: Hmm... Good question. How about putting CheckForStandbyTrigger() in a wait loop, but reporting FATAL with an appropriate message, such as "promotion is not possible because of insufficient parameter settings"? Also it suits me if we only document that we ignore promote here. I don't think this is an important case. And yes, it's not easy to allow promotion, since we have already updated control file. Probably we need pause only after we reached consistency? Here is an updated patch that implements both of these points. I have used hot standby active instead of reached consistency. I guess arguments could be made either way, but the original use case really cared about hot standby. -- Peter Eisentraut 2ndQuadrant, an EDB company https://www.2ndquadrant.com/ From 2d93c2918af7ff186aa4a6a2327286b6eb1a2859 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 30 Nov 2020 19:21:40 +0100 Subject: [PATCH v6] Pause recovery for insufficient parameter settings When certain parameters are changed on a physical replication primary, this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL record. The standby then checks whether its own settings are at least as big as the ones on the primary. If not, the standby shuts down with a fatal error. This patch changes this behavior for hot standbys to pause recovery at that point instead. That allows read traffic on the standby to continue while database administrators figure out next steps. When recovery is unpaused, the server shuts down (as before). The idea is to fix the parameters while recovery is paused and then restart when there is a maintenance window. 
Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36...@2ndquadrant.com --- doc/src/sgml/high-availability.sgml | 51 +++-- src/backend/access/transam/xlog.c | 59 ++--- 2 files changed, 93 insertions(+), 17 deletions(-) diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 19d7bd2b28..71f7f58cb0 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -2120,18 +2120,14 @@ Administrator's Overview -The setting of some parameters on the standby will need reconfiguration -if they have been changed on the primary. For these parameters, -the value on the standby must -be equal to or greater than the value on the primary. -Therefore, if you want to increase these values, you should do so on all -standby servers first, before applying the changes to the primary server. -Conversely, if you want to decrease these values, you should do so on the -primary server first, before applying the changes to all standby servers. -If these parameters -are not set high enough then the standby will refuse to start. -Higher values can then be supplied and the server -restarted to begin recovery again. These parameters are: +The settings of some parameters determine the size of shared memory for +tracking transaction IDs, locks, and prepared transactions. These shared +memory structures must be no smaller on a standby than on the primary in +order to ensure that the standby does not run out of shared memory during +recovery. For example, if the primary had used a prepared transaction but +the standby had not allocated any shared memory for tracking prepared +transactions, then recovery could not continue until the standby's +configuration is changed. 
The parameters affected are: @@ -2160,6 +2156,37 @@ Administrator's Overview + +The easiest way to ensure this does not become a problem is to have these +parameters set on the standbys to values equal to or greater than on the +primary. Therefore, if you want to increase these values, you should do +so on all standby servers first, before applying the changes to the +primary server. Conversely, if you want to decrease these values, you +should do so on the primary server first, before applying the changes to +all standby servers. Keep in mind that when a standby is promoted, it +becomes the new reference for the required parameter settings for the +standbys that follow it. Therefore, to avoid this becoming a problem +during a switchover or failover, it is recommended to keep these settings +the same on all standby servers. + + + +The WAL tracks changes to these parameters on the +primary. If a hot standby processes WAL that indicates that the current +value on the primary is higher than its own value, it will log a warning +and pause recovery, for example: + +WARNING: hot standby is not possible because of insufficient parameter settings +DETAIL: max_connections = 80 is a lower setting than on the primary server, where its value was 100. +
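As context for the patch above: the XLOG_PARAMETER_CHANGE check covers the GUCs that size shared-memory structures for transaction tracking, locks, and prepared transactions. A standby configuration sketch (values are illustrative only, not recommendations):

```
# postgresql.conf on a hot standby -- illustrative values only.
# Each of these must be >= the value in effect on the primary; otherwise,
# with this patch, the standby logs a warning and pauses recovery instead
# of shutting down with a FATAL error.
max_connections = 100
max_worker_processes = 8
max_wal_senders = 10
max_prepared_transactions = 0
max_locks_per_transaction = 64
```

The recommended practice in the doc change applies here: raise values on all standbys first, lower them on the primary first, and keep settings uniform so a promotion does not change the reference values.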
Re: Add docs stub for recovery.conf
On Mon, Nov 30, 2020 at 11:25 AM Bruce Momjian wrote: > On Mon, Nov 30, 2020 at 10:11:04AM +0800, Craig Ringer wrote: > > Can we please just address this docs issue? If you don't like my > solution can > > you please supply a patch that you feel addresses the problem? Or > clearly state > > that you don't think there is a problem, and do so in a way that actually > > addresses the specific points I have raised about what's wrong with the > status > > quo? > > If we know there are X problems, and we fix one of them one way, then > later fix the rest another way, we have to undo the first fix. If you > don't want to fix all X, then let's wait until someone does want to fix > them all. > > IMO there is only the original problem with an acceptable solution presented that can be committed without downside. If that has to be undone because someone else in the future decides on a different solution that happens to touch this too, fine, it can be changed again. David J.
Re: Cost overestimation of foreign JOIN
On 30.11.2020 22:38, Tom Lane wrote: Andrey Lepikhov writes: Maybe it is needed to swap lines 2908 and 2909 (see attachment)? No; as explained in the comment immediately above here, we're assuming that the join conditions will be applied on the cross product of the input relations. Thank you. Now it is clear to me. Now admittedly, that's a worst-case assumption, since it amounts to expecting that the remote server will do the join in the dumbest possible nested-loop way. If the remote can use a merge or hash join, for example, the cost is likely to be a lot less. My goal is scaling Postgres across a set of identical servers running the same Postgres instances. If one server uses a hash-join node for the join, I think it is most likely that the other server will also use a hash join for it (you may remember that I also use the statistics-copying technique to provide up-to-date statistics on partitions). Tests show good results with such an approach, but maybe this is my special case. But it is not the job of this code path to outguess the remote planner. It's certainly not appropriate to invent an unprincipled cost estimate as a substitute for trying to guess that. Agreed. If you're unhappy with the planning results you get for this, why don't you have use_remote_estimate turned on? I have a mixed load model. For large queries the additional remote estimation queries are acceptable, but for many simple SELECTs that touch a small portion of the data the latency has increased significantly. And I don't know how to switch the use_remote_estimate setting in such cases. -- regards, Andrey Lepikhov Postgres Professional
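One partial answer to the mixed-workload complaint above: postgres_fdw's use_remote_estimate option can be set per foreign server or per foreign table, so remote EXPLAIN costing can be enabled only for the relations involved in large joins. There is no per-query switch. The object names below are hypothetical:

```sql
-- Per-object granularity for use_remote_estimate (names are hypothetical).
-- Leave it off at the server level to keep small SELECTs cheap...
ALTER SERVER shard1 OPTIONS (ADD use_remote_estimate 'false');

-- ...but turn it on for the large tables where join cost matters.
ALTER FOREIGN TABLE big_fact_table OPTIONS (ADD use_remote_estimate 'true');
```

This doesn't help when the same table serves both query classes, which is the residual problem described in the message above.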
Re: Add docs stub for recovery.conf
On Mon, Nov 30, 2020 at 10:11:04AM +0800, Craig Ringer wrote: > Can we please just address this docs issue? If you don't like my solution can > you please supply a patch that you feel addresses the problem? Or clearly > state > that you don't think there is a problem, and do so in a way that actually > addresses the specific points I have raised about what's wrong with the status > quo? If we know there are X problems, and we fix one of them one way, then later fix the rest another way, we have to undo the first fix. If you don't want to fix all X, then let's wait until someone does want to fix them all. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Add Information during standby recovery conflicts
On 2020-Dec-01, Fujii Masao wrote: > + if (proc) > + { > + if (nprocs == 0) > + appendStringInfo(&buf, "%d", proc->pid); > + else > + appendStringInfo(&buf, ", %d", > proc->pid); > + > + nprocs++; > > What happens if all the backends in wait_list have gone? In other words, > how should we handle the case where nprocs == 0 (i.e., nprocs has not been > incremented at all)? This would very rarely happen, but can happen. > In this case, since buf.data is empty, at least there seems no need to log > the list of conflicting processes in the detail message. Yes, I noticed this too; this can be simplified by changing the condition in the ereport() call to be "nprocs > 0" (rather than wait_list being null), and otherwise not print the errdetail. (You could test buf.data or buf.len instead, but that seems uglier to me.)
Re: Improving spin-lock implementation on ARM.
Alexander Korotkov writes: > I tend to think that LSE is enabled by default in Apple's clang based > on your previous message[1]. In order to dispel the doubts could you > please provide assembly of SpinLockAcquire for following clang > options. > "-O2" > "-O2 -march=armv8-a+lse" > "-O2 -march=armv8-a" Huh. Those options make exactly zero difference to the code generated for SpinLockAcquire/SpinLockRelease; it's the same as I showed upthread, for either the HEAD definition of TAS() or the CAS patch's version. So now I'm at a loss as to the reason for the performance difference I got. -march=armv8-a+lse does make a difference to code generation someplace, because the overall size of the postgres executable changes by 16kB or so. One might argue that the performance difference is due to better code elsewhere than the spinlocks ... but the test I'm running is basically just while (count-- > 0) { XLogGetLastRemovedSegno(); CHECK_FOR_INTERRUPTS(); } so it's hard to see where a non-spinlock-related code change would come in. That loop itself definitely generates the same code either way. I did find this interesting output from "clang -v": -target-cpu vortex -target-feature +v8.3a -target-feature +fp-armv8 -target-feature +neon -target-feature +crc -target-feature +crypto -target-feature +fullfp16 -target-feature +ras -target-feature +lse -target-feature +rdm -target-feature +rcpc -target-feature +zcm -target-feature +zcz -target-feature +sha2 -target-feature +aes whereas adding -march=armv8-a+lse changes that to just -target-cpu vortex -target-feature +neon -target-feature +lse -target-feature +zcm -target-feature +zcz On the whole, that would make one think that -march=armv8-a+lse should produce worse code than the default settings. regards, tom lane
Re: Add Information during standby recovery conflicts
On 2020/11/30 16:26, Masahiko Sawada wrote: On Mon, Nov 30, 2020 at 3:46 PM Drouvot, Bertrand wrote: Hi, On 11/30/20 4:41 AM, Masahiko Sawada wrote: On Sun, Nov 29, 2020 at 3:47 PM Drouvot, Bertrand wrote: Hi Alvaro, On 11/28/20 6:36 PM, Alvaro Herrera wrote: Hi Bertrand, On 2020-Nov-28, Drouvot, Bertrand wrote: + if (nprocs > 0) + { + ereport(LOG, + (errmsg("recovery still waiting after %ld.%03d ms: %s", + msecs, usecs, _(get_recovery_conflict_desc(reason))), + (errdetail_log_plural("Conflicting process: %s.", + "Conflicting processes: %s.", + nprocs, buf.data; + } + else + { + ereport(LOG, + (errmsg("recovery still waiting after %ld.%03d ms: %s", + msecs, usecs, _(get_recovery_conflict_desc(reason); + } + + pfree(buf.data); + } + else + ereport(LOG, + (errmsg("recovery still waiting after %ld.%03d ms: %s", + msecs, usecs, _(get_recovery_conflict_desc(reason); +} Another trivial stylistic point is that you can collapse all these ereport calls into one, with something like ereport(LOG, errmsg("recovery still waiting after ...", opts), waitlist != NULL ? errdetail_log_plural("foo bar baz", opts) : 0); where the "waitlist" has been constructed beforehand, or is set to NULL if there's no process list. Nice! + switch (reason) + { + case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN: + reasonDesc = gettext_noop("for recovery conflict on buffer pin"); + break; Pure bikeshedding after discussing this with my pillow: I think I'd get rid of the initial "for" in these messages. both comments implemented in the new attached version. Thank you for updating the patch! + /* Also, set deadlock timeout for logging purpose if necessary */ + if (log_recovery_conflict_waits && !need_log) + { + timeouts[cnt].id = STANDBY_TIMEOUT; + timeouts[cnt].type = TMPARAM_AFTER; + timeouts[cnt].delay_ms = DeadlockTimeout; + cnt++; + } You changed to 'need_log' but this condition seems not correct. I think we need to set this timeout when both log_recovery_conflict and need_log is true. Nice catch! 
In fact it behaves correctly; it's just the 'need_log' name that is misleading: I renamed it to 'already_logged' in the new attached version. Thanks! I'd prefer 'need_log' because we can check it using the affirmative form in that condition, which would make the code a bit more readable. But I'd like to leave it to the committers. I've marked this patch as "Ready for Committer". I'm still in the middle of the review, but please let me share my current review comments. + /* Set wait start timestamp if logging is enabled */ + if (log_recovery_conflict_waits) + waitStart = GetCurrentTimestamp(); This seems to cause even the primary server to call GetCurrentTimestamp() if logging is enabled. To avoid unnecessary GetCurrentTimestamp() calls, shouldn't we add "InHotStandby" into the if-condition? + initStringInfo(&buf); + + if (wait_list) Isn't it better to call initStringInfo() only when wait_list is not NULL? For example, we can update the code so that it's executed when nprocs == 0. + if (proc) + { + if (nprocs == 0) + appendStringInfo(&buf, "%d", proc->pid); + else + appendStringInfo(&buf, ", %d", proc->pid); + + nprocs++; What happens if all the backends in wait_list have gone? In other words, how should we handle the case where nprocs == 0 (i.e., nprocs has not been incremented at all)? This would happen very rarely, but it can happen. In this case, since buf.data is empty, there seems to be no need to log the list of conflicting processes in the detail message. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
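Fujii's two review points above (initialize the buffer lazily, and skip the detail message when every waiting backend has already gone) can be sketched outside the tree. The following is a simplified, self-contained model only: a plain char buffer stands in for PostgreSQL's StringInfo, the function name is invented, and a pid of 0 models a backend that has already exited.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build a comma-separated list of conflicting backend PIDs, skipping
 * entries whose backend is already gone (pid == 0 in this sketch).
 * Returns the number of PIDs appended; 0 tells the caller to emit the
 * LOG message without any errdetail, matching the review suggestion. */
static int
build_conflict_pid_list(const int *pids, int npids, char *buf, size_t buflen)
{
    int     nprocs = 0;
    size_t  used = 0;

    buf[0] = '\0';
    for (int i = 0; i < npids; i++)
    {
        if (pids[i] == 0)       /* backend has already exited */
            continue;
        used += (size_t) snprintf(buf + used, buflen - used,
                                  nprocs == 0 ? "%d" : ", %d", pids[i]);
        nprocs++;
    }
    return nprocs;
}
```

With this shape, the rare all-backends-gone case falls out naturally: the caller checks the return value and logs only the "recovery still waiting" message without a detail line.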
Re: Add Information during standby recovery conflicts
On 2020/11/20 18:17, Drouvot, Bertrand wrote: Hi, On 11/17/20 4:44 PM, Fujii Masao wrote: Thanks for updating the patch! Here are review comments. + Controls whether a log message is produced when the startup process + is waiting longer than deadlock_timeout + for recovery conflicts. But a log message can also be produced when a backend is waiting for a recovery conflict. Right? If yes, this description needs to be corrected. Thanks for looking at it! I don't think so; only the startup process should write those new log messages. What makes you think that would not be the case? Probably that was my misunderstanding of the patch. Sorry for the noise. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Autovacuum on partitioned table (autoanalyze)
On 2020-Nov-10, Kyotaro Horiguchi wrote: > On second thought about the reason for the "toprel_oid": it is perhaps > to avoid "wrongly" propagated values to ancestors after a manual > ANALYZE on a partitioned table. But the same happens after an > autoanalyze iteration if some of the ancestors of a leaf relation are > analyzed before the leaf relation in an autoanalyze iteration. That > can trigger an unnecessary analyze for some of the ancestors. I'm not sure I understand this point. I think we should only trigger this on analyzes of *leaf* partitions, not intermediate partitioned relations. That way you would never get these unnecessary analyzes. Am I missing something? (So with my proposal in the previous email, we would send the list of ancestor relations after analyzing a leaf partition. Whenever we analyze a non-leaf, the list of ancestors is sent as an empty list.) > > > Regarding the counters in pg_stat_all_tables: maybe some of these should > > > be null rather than zero? Or else you should make a 0001 patch to fully > > > implement this view, with all relevant counters, not just > > > n_mod_since_analyze, last_*analyze, and *analyze_count. These are specifically misleading: > > > > > > last_vacuum | > > > last_autovacuum | > > > n_ins_since_vacuum | 0 > > > vacuum_count | 0 > > > autovacuum_count | 0 > > > > > I haven't modified this part yet, but did you mean that we should set > > the vacuum-related counters to null > > because partitioned tables are not vacuumed? > > Perhaps because partitioned tables *cannot* be vacuumed. I'm not sure > what is the best way here. Showing null seems reasonable but I'm not > sure that doesn't break anything. I agree that showing NULLs for the vacuum columns is better. 
Perhaps the most reasonable way to do this is use -1 as an indicator that NULL ought to be returned from pg_stat_get_vacuum_count() et al, and add a boolean in PgStat_TableCounts next to t_truncated, maybe "t_nullvacuum" that says to store -1 instead of 0 in pgstat_recv_tabstat.
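The -1-as-NULL convention Alvaro sketches could look roughly like the following. This is a hedged stand-in, not the actual pgstat code: the struct and function names are invented, standing in for what pg_stat_get_vacuum_count() and PgStat_TableCounts would do after such a change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the proposed convention: the stats machinery stores -1 in
 * vacuum_count for relations that cannot be vacuumed (partitioned
 * tables), and the SQL-callable accessor maps that sentinel to NULL
 * instead of reporting a misleading 0. */
typedef struct TabStatSketch
{
    int64_t vacuum_count;       /* -1 means "not applicable" */
} TabStatSketch;

/* Returns false when the SQL function should return NULL. */
static bool
stat_get_vacuum_count(const TabStatSketch *tab, int64_t *result)
{
    if (tab->vacuum_count < 0)
        return false;           /* partitioned table: report NULL */
    *result = tab->vacuum_count;
    return true;
}
```

The design point is that a sentinel inside the counter avoids widening the stats message format with a separate per-column null flag, at the cost of reserving -1.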
Re: Cost overestimation of foreign JOIN
Andrey Lepikhov writes: > Maybe it is needed to swap lines 2908 and 2909 (see attachment)? No; as explained in the comment immediately above here, we're assuming that the join conditions will be applied on the cross product of the input relations. Now admittedly, that's a worst-case assumption, since it amounts to expecting that the remote server will do the join in the dumbest possible nested-loop way. If the remote can use a merge or hash join, for example, the cost is likely to be a lot less. But it is not the job of this code path to outguess the remote planner. It's certainly not appropriate to invent an unprincipled cost estimate as a substitute for trying to guess that. If you're unhappy with the planning results you get for this, why don't you have use_remote_estimate turned on? regards, tom lane
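To illustrate Tom's point numerically: with use_remote_estimate off, the local estimate charges the join quals against the full cross product of the inputs, so the assumed remote cost grows multiplicatively with both input sizes. The function and cost constant below are purely illustrative, not postgres_fdw's actual formula.

```c
#include <assert.h>

/* Worst-case assumption described above: every join qual is evaluated
 * over the full cross product of the input relations, i.e. the remote
 * server does the dumbest possible nested loop.  A remote hash or merge
 * join would be far cheaper, but this code path deliberately does not
 * try to outguess the remote planner. */
static double
worst_case_join_qual_cost(double outer_rows, double inner_rows,
                          double cost_per_qual_eval)
{
    return outer_rows * inner_rows * cost_per_qual_eval;
}
```

Doubling either input doubles the estimate, which is why the overestimation Andrey observes is expected behaviour here, and why turning on use_remote_estimate (asking the remote planner directly) is the supported fix.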
Re: Notes on physical replica failover with logical publisher or subscriber
Hi Craig, On 2020-11-30 06:59, Craig Ringer wrote: https://wiki.postgresql.org/wiki/Logical_replication_and_physical_standby_failover Thank you for sharing these notes. I have not dealt a lot with physical/logical replication interoperability, so most of these problems were new to me. One point from the wiki page seems clear enough to me: ``` Logical slots can fill pg_wal and can't benefit from archiving. Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal... ``` It does not look like a big deal to teach the logical decoding process to use restore_command, but I have some doubts about how everything will perform once we have started getting WAL from the archive for decoding purposes. If we have started using restore_command, it means the subscriber has lagged long enough to exceed max_slot_wal_keep_size. Taking into account that getting WAL files from the archive has an additional overhead, and that the primary continues generating (and archiving) new segments, there is a possibility for the primary to end up doing this double duty forever --- archive a WAL file at first, then fetch it back for decoding when requested. Another problem is that there may be several active decoders, IIRC, so they had better communicate in order to avoid fetching the same segment twice. I tried to address many of these issues with failover slots, but I am not trying to beat that dead horse now. I know that at least some people here are of the opinion that effort shouldn't go into logical/physical replication interoperation anyway - that we should instead address the remaining limitations in logical replication so that it can provide complete HA capabilities without use of physical replication. So for now I'm just trying to save others who go looking into these issues some time and warn them about some of the less obvious booby-traps. 
Another point worth adding about logical replication's ability to support a logical-only HA system --- a logical equivalent of pg_rewind. At least I did not notice anything about it after a brief reading of the wiki page. IIUC, currently there is no way to quickly return the ex-primary (ex-logical publisher) to the HA cluster without doing a pg_basebackup, is there? It seems that we have the same problem here as with physical replication --- the ex-primary may accept some xacts after promotion of the new primary, so their histories diverge and the old primary should be rewound before being returned as a standby (subscriber). Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
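Teaching the decoding page-read callback to run restore_command, as the wiki page suggests, would start from the same %f/%p substitution that archive recovery already documents for the restore_command setting. Below is a minimal sketch of just that template expansion; actually executing the command and the multi-decoder coordination worried about above are left out, and the function name is invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Expand the documented restore_command placeholders: %f is the WAL
 * file name being requested, %p is the path to copy it to.  Any other
 * character is copied through unchanged.  Output is truncated safely
 * if it would exceed outlen. */
static void
expand_restore_command(const char *template, const char *walfile,
                       const char *targetpath, char *out, size_t outlen)
{
    size_t used = 0;

    out[0] = '\0';
    for (const char *p = template; *p && used + 1 < outlen; p++)
    {
        if (p[0] == '%' && (p[1] == 'f' || p[1] == 'p'))
        {
            const char *sub = (p[1] == 'f') ? walfile : targetpath;

            used += (size_t) snprintf(out + used, outlen - used, "%s", sub);
            p++;                /* skip the 'f' or 'p' */
        }
        else
        {
            out[used++] = *p;
            out[used] = '\0';
        }
    }
}
```

A decoder that fails to read a segment from pg_wal could expand the configured command this way and run it to stage the segment temporarily; the open questions raised above (repeated double duty, duplicate fetches by concurrent decoders) are about what happens around this step, not the expansion itself.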
Re: Re: parallel distinct union and aggregate support patch
> 1. > +#define BATCH_SORT_MAX_BATCHES 512 > > Did you decide this number based on some experiment or is there some > analysis behind selecting this number? When there are too few batches, a single process that works too slowly causes an unbalanced load. When there are too many batches, file descriptors are opened and closed frequently. > 2. > +BatchSortState* ExecInitBatchSort(BatchSort *node, EState *estate, int > eflags) > +{ > + BatchSortState *state; > + TypeCacheEntry *typentry; > > + for (i=0;i<node->numGroupCols;++i) > + { > ... > + InitFunctionCallInfoData(*fcinfo, flinfo, 1, attr->attcollation, NULL, > NULL); > + fcinfo->args[0].isnull = false; > + state->groupFuns = lappend(state->groupFuns, fcinfo); > + } > > From the variable naming, it appeared like the batch sort is dependent > upon the grouping node. I think instead of using the names > numGroupCols and groupFuns we need to use names that are more relevant > to the batch sort, something like numSortKey. Not all data types support both sorting and hashing calculations (user-defined data types, for example). We do not need every column to support hash calculation when we batch, so I used two separate sets of variables. > 3. > + if (eflags & (EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)) > + { > + /* for now, we only using in group aggregate */ > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("not support execute flag(s) %d for group sort", eflags))); > + } > > Instead of ereport, you should just put an Assert for the unsupported > flag or elog. In fact, this is an unfinished feature; BatchSort should eventually support these flags as well, and additions are welcome. > 4. > + state = makeNode(BatchSortState); > + state->ps.plan = (Plan*) node; > + state->ps.state = estate; > + state->ps.ExecProcNode = ExecBatchSortPrepare; > > I think the main executor entry function should be named ExecBatchSort > instead of ExecBatchSortPrepare, it will look more consistent with the > other executor machinery. 
The job of the ExecBatchSortPrepare function is to preprocess the data (batch and pre-sort); when that work is done, it calls "ExecSetExecProcNode(pstate, ExecBatchSort)" so that subsequent calls go straight to the ExecBatchSort function. Splitting the work into two functions has another advantage: we do not have to check on every call whether the tuplesort is ready yet, which avoids a small per-call overhead. And I think this code is clearer.
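The pattern described above — a first-call entry point that does one-time work and then swaps itself out — can be modeled in miniature. This is a hedged sketch: the struct and the "tuple" counter are stand-ins, and ExecSetExecProcNode is reduced to a direct pointer assignment.

```c
#include <assert.h>

/* Minimal model of the two-phase executor-node pattern: the node starts
 * with a "prepare" entry point that does the one-time work (batching
 * and pre-sorting, in the patch), then swaps the entry point so later
 * calls skip the readiness check entirely. */
struct SortStateSketch;
typedef int (*ExecProcFunc) (struct SortStateSketch *);

typedef struct SortStateSketch
{
    ExecProcFunc ExecProcNode;  /* current entry point for this node */
    int          prepared;      /* times the one-time work ran */
    int          fetched;       /* "tuples" handed out so far */
} SortStateSketch;

/* Steady-state function: no readiness check needed by construction. */
static int
ExecBatchSortSketch(SortStateSketch *state)
{
    return ++state->fetched;    /* pretend to return the next tuple */
}

/* First-call function: do the one-time work, swap the entry point,
 * then delegate to the steady-state function for the first tuple. */
static int
ExecBatchSortPrepareSketch(SortStateSketch *state)
{
    state->prepared++;                          /* batch + pre-sort */
    state->ExecProcNode = ExecBatchSortSketch;  /* the swap */
    return ExecBatchSortSketch(state);
}
```

Callers always invoke the node through state->ExecProcNode, so after the first call the prepare path is never taken again — which is the "subtle performance" benefit the author mentions.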