Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

2020-09-15 Thread Amit Kapila
On Wed, Sep 16, 2020 at 9:41 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > So, can we assume that the current code can only cause the problem in
> > CCA builds but not in any practical scenario, because after having a
> > lock on the relation there probably shouldn't be any invalidation that
> > leads to this problem?
>
>
> In short: the value of CCA testing is to model sinval overruns
> happening at any point where they could happen.  The real-world
> odds of one happening at any given instant are low, but they're
> never zero.
>

Thanks for the explanation. I have read your patch and it looks good to me.

-- 
With Regards,
Amit Kapila.




RE: Transactions involving multiple postgres foreign servers, take 2

2020-09-15 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> The resolver process has two functionalities: resolving foreign
> transactions automatically when the user issues COMMIT (the case you
> described in the second paragraph), and resolving foreign transactions
> when the corresponding backend no longer exists or when the server
> crashes in the middle of 2PC (described in the third
> paragraph).
> 
> Considering the design without the resolver process, I think we can
> easily replace the latter with manual resolution. OTOH, it's not
> easy for the former. I have no better design idea for now,
> although, as you described, if we could ensure that the process
> doesn't raise an error while resolving foreign transactions after
> committing the local transaction, we would not need the resolver
> process.

Yeah, the resolver background process -- something independent of client 
sessions -- is necessary, because the client session may disappear at some 
point.  When the server that hosts the 2PC coordinator crashes, there are no 
client sessions at all.  Our DBMS Symfoware also runs background threads that 
take care of resolving in-doubt transactions left behind by a server or 
network failure.

Then, how does the resolver get involved in 2PC to enable parallel 2PC?  Two 
ideas quickly come to mind:

(1) Each client backend issues prepare and commit to multiple remote nodes 
asynchronously.
If the communication fails during commit, the client backend leaves the commit 
notification task to the resolver.
That is, the resolver lends a hand during failure recovery, and doesn't 
interfere with the transaction processing during normal operation.

(2) The resolver takes on some responsibility for 2PC processing during normal 
operation.
(It sends prepare and/or commit to remote nodes and collects the results.)
To avoid serial execution per transaction, the resolver bundles multiple 
requests, sends them in bulk, and waits for multiple replies at once.
This allows the coordinator to do its own prepare processing in parallel with 
that of the participants.
However, in Postgres, this requires context switches between the client backend 
and the resolver.


Our Symfoware takes (2).  However, it doesn't suffer from the context switches, 
because the server is multi-threaded and additionally uses entities even more 
lightweight than threads.
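
To make the "send them in bulk, and wait for multiple replies at once" part of 
(2) concrete, here is a minimal, self-contained sketch (illustrative only, not 
from any existing implementation) of what it could look like on the resolver 
side, using only standard libpq calls.  It assumes one already-open connection 
per foreign server and a pre-built command string such as PREPARE TRANSACTION 
'gid' or COMMIT PREPARED 'gid'; error handling, retries, and integration with 
the resolver's bookkeeping are omitted.

/*
 * Sketch only: broadcast one 2PC command to several foreign servers at once
 * and collect all the replies together, instead of talking to the servers
 * one at a time.
 */
#include <stdbool.h>
#include <sys/select.h>
#include <libpq-fe.h>

static bool
broadcast_and_wait(PGconn *conns[], int nconns, const char *command)
{
	bool		ok = true;

	/* send the command to every server without waiting for replies */
	for (int i = 0; i < nconns; i++)
		if (!PQsendQuery(conns[i], command))
			return false;

	/* now wait for and collect every reply */
	for (int i = 0; i < nconns; i++)
	{
		PGresult   *res;

		while (PQisBusy(conns[i]))
		{
			fd_set		readfds;
			int			sock = PQsocket(conns[i]);

			FD_ZERO(&readfds);
			FD_SET(sock, &readfds);
			if (select(sock + 1, &readfds, NULL, NULL, NULL) < 0)
				return false;
			if (!PQconsumeInput(conns[i]))
				return false;
		}

		/* a simple utility command produces a single PGresult */
		while ((res = PQgetResult(conns[i])) != NULL)
		{
			if (PQresultStatus(res) != PGRES_COMMAND_OK)
				ok = false;
			PQclear(res);
		}
	}
	return ok;
}

With something like this, the coordinator (or resolver) spends roughly one 
network round trip per 2PC phase for the whole set of participants, which is 
the parallelism (2) is after; the remaining question in Postgres is the extra 
context switches between the client backend and the resolver.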


> Or the second idea would be that the backend commits only the local
> transaction then returns the acknowledgment of COMMIT to the user
> without resolving foreign transactions. Then the user manually
> resolves the foreign transactions by, for example, using the SQL
> function pg_resolve_foreign_xact() within a separate transaction. That
> way, even if an error occurred during resolving foreign transactions
> (e.g., executing COMMIT PREPARED), it’s okay as the user is already
> aware of the local transaction having been committed and can retry to
> resolve the unresolved foreign transaction. So we won't need the
> resolver process while avoiding such inconsistency.
> 
> But a drawback would be that the transaction commit doesn't ensure
> that all foreign transactions are completed. The subsequent
> transactions would need to check if the previous distributed
> transaction is completed to see its results. I’m not sure it’s a good
> design in terms of usability.

I don't think it's a good design either, for the reason you are worried about.  
I guess that's why Postgres-XL had to create a tool called pgxc_clean and ask 
the user to resolve transactions with it.

pgxc_clean
https://www.postgres-xl.org/documentation/pgxcclean.html

"pgxc_clean is a Postgres-XL utility to maintain transaction status after a 
crash. When a Postgres-XL node crashes and recovers or fails over, the commit 
status of the node may be inconsistent with other nodes. pgxc_clean checks 
transaction commit status and corrects them."


Regards
Takayuki Tsunakawa



Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Ashutosh Sharma
On Wed, Sep 16, 2020 at 10:40 AM Ashutosh Sharma  wrote:
>
> On Wed, Sep 16, 2020 at 9:14 AM Tom Lane  wrote:
> >
> > Ashutosh Sharma  writes:
> > > On Wed, Sep 16, 2020 at 1:25 AM Tom Lane  wrote:
> > >> * Should any of the other tables in the test be converted to temp?
> >
> > > Are you trying to say that we can achieve the things being done in
> > > test-case 1 and 2 by having a single temp table and we should aim for
> > > it because it will make the test-case more efficient and easy to
> > > maintain?
> >
> > Well, I'm just looking at the comment that says the reason for the
> > begin/rollback structure is to keep autovacuum's hands off the table.
> > In most if not all of the other places where we need that, the preferred
> > method is to make the table temp or mark it with (autovacuum = off).
> > While this way isn't wrong exactly, nor inefficient, it does seem
> > a little restrictive.  For instance, you can't easily test cases that
> > involve intentional errors.
> >
> > Another point is that we have a few optimizations that apply to tables
> > created in the current transaction.  I'm not sure whether any of them
> > would fire in this test case, but if they do (now or in the future)
> > that might mean you aren't testing the usual scenario for use of
> > pg_surgery, which is surely not going to be a new-in-transaction
> > table.  (That might be an argument for preferring autovacuum = off
> > over a temp table, too.)
> >
>
> I agree with you on both the above points. I'll try to make the
> necessary changes to address all your comments. Also, I'd prefer
> creating a normal heap table with autovacuum = off over the temp table
> for the reasons you mentioned in the second point.
>

Attached is the patch with the changes suggested here. I've tried to
use a normal heap table with (autovacuum = off) wherever possible.
Please have a look and let me know if you see any other issues.

Thanks,

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
diff --git a/contrib/pg_surgery/expected/heap_surgery.out b/contrib/pg_surgery/expected/heap_surgery.out
index 9451c57..033e3aa 100644
--- a/contrib/pg_surgery/expected/heap_surgery.out
+++ b/contrib/pg_surgery/expected/heap_surgery.out
@@ -1,8 +1,8 @@
 create extension pg_surgery;
 -- create a normal heap table and insert some rows.
--- note that we don't commit the transaction, so autovacuum can't interfere.
-begin;
-create table htab(a int);
+-- note that we don't want autovacuum to interfere and therefore we disable it
+-- on the test table.
+create table htab(a int) with (autovacuum_enabled = off);
 insert into htab values (100), (200), (300), (400), (500);
 -- test empty TID array
 select heap_force_freeze('htab'::regclass, ARRAY[]::tid[]);
@@ -85,9 +85,10 @@ NOTICE:  skipping tid (0, 6) for relation "htab" because the item number is out
  
 (1 row)
 
-rollback;
 -- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+-- note that unlike other test cases we need a temp table here to ensure that we
+-- get stable vacuum results.
+create temp table htab2(a int);
 insert into htab2 values (100);
 update htab2 set a = 200;
 vacuum htab2;
@@ -175,6 +176,6 @@ ERROR:  "vw" is not a table, materialized view, or TOAST table
 select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 ERROR:  "vw" is not a table, materialized view, or TOAST table
 -- cleanup.
-drop table htab2;
+drop table htab;
 drop view vw;
 drop extension pg_surgery;
diff --git a/contrib/pg_surgery/sql/heap_surgery.sql b/contrib/pg_surgery/sql/heap_surgery.sql
index 8a27214..234e667 100644
--- a/contrib/pg_surgery/sql/heap_surgery.sql
+++ b/contrib/pg_surgery/sql/heap_surgery.sql
@@ -1,9 +1,9 @@
 create extension pg_surgery;
 
 -- create a normal heap table and insert some rows.
--- note that we don't commit the transaction, so autovacuum can't interfere.
-begin;
-create table htab(a int);
+-- note that we don't want autovacuum to interfere and therefore we disable it
+-- on the test table.
+create table htab(a int) with (autovacuum_enabled = off);
 insert into htab values (100), (200), (300), (400), (500);
 
 -- test empty TID array
@@ -38,10 +38,10 @@ select ctid, xmax from htab where xmin = 2;
 -- out-of-range TIDs should be skipped
 select heap_force_freeze('htab'::regclass, ARRAY['(0, 0)', '(0, 6)']::tid[]);
 
-rollback;
-
 -- set up a new table with a redirected line pointer
-create table htab2(a int) with (autovacuum_enabled = off);
+-- note that unlike other test cases we need a temp table here to ensure that we
+-- get stable vacuum results.
+create temp table htab2(a int);
 insert into htab2 values (100);
 update htab2 set a = 200;
 vacuum htab2;
@@ -86,6 +86,6 @@ select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
 
 -- cleanup.
-drop table htab2;
+drop table htab;
 drop view vw;
 drop extension pg_surgery;


Re: Parallelize stream replication process

2020-09-15 Thread Paul Guo


> On Sep 16, 2020, at 11:15 AM, Li Japin  wrote:
> 
> 
> 
>> On Sep 15, 2020, at 3:41 PM, Fujii Masao  wrote:
>> 
>> 
>> 
>> On 2020/09/15 13:41, Bharath Rupireddy wrote:
>>> On Tue, Sep 15, 2020 at 9:27 AM Li Japin  wrote:
 
 For now, postgres uses a single process to send, receive and replay the WAL 
 when we use stream replication;
 is there any point in parallelizing this process? If so, how do we 
 start?
 
 Any thoughts?
>> 
>> Probably this is another parallelism than what you're thinking,
>> but I was thinking to start up walwriter process in the standby server
>> and make it fsync the streamed WAL data. This means that we leave
>> a part of tasks of walreceiver process to walwriter. Walreceiver
>> performs WAL receive and write, and walwriter performs WAL flush,
>> in parallel. I'm just expecting that this change would improve
>> the replication performance, e.g., reduce the time to wait for
>> sync replication.

Yes, this should be able to improve that in theory. 

>> 
>> Without this change (i.e., originally), only walreceiver performs
>> WAL receive, write and flush. So while walreceiver is fsyncing WAL data,
>> it cannot receive newly-arrived WAL data. If WAL flush takes time,
>> the time to wait for sync replication in the primary
>> would be enlarged.
>> 
>> Regards,
>> 
>> -- 
>> Fujii Masao
>> Advanced Computing Technology Center
>> Research and Development Headquarters
>> NTT DATA CORPORATION
> 
> Yeah, this might be a direction. 
> 
> Now I am thinking about how to parallelize WAL replay. If we can improve the 
> efficiency of replay, then we can shorten the database recovery time 
> (for both streaming replication and crash recovery). 

Yes, parallelization should be able to help scenarios where the CPU utilization 
is at 100%, or where it is below 100% but replay keeps blocking on operations 
that could be improved by running in parallel. There are a lot of kinds of work 
in WAL replay (memory operations, system calls, locking, etc.).
I do not have experience with real production environments, so I am not sure 
whether or how much recovery suffers, but I believe parallel recovery should 
help accelerate failover, which is quite important especially in cloud 
environments. Doing this would require carefully analyzing the dependencies 
among the various WAL records first. It won’t be a small effort. I’ve heard 
that some databases have implemented this, though I am not sure how much it 
helps.

As for parallelizing the WAL receiver/sender, their functionality is simple, so 
after decoupling the fsync operation into a separate process I am not sure how 
beneficial further parallelization would be (though it would surely be needed 
if the WAL receiver/sender becomes 100% CPU utilized).


> 
> For streaming replication, we may also need to improve the transmission of 
> WAL to speed up
> the entire recovery process.
> 
> I’m not sure if this is correct.
> 
> Regards,
> Japin Li.
> 



Re: Force update_process_title=on in crash recovery?

2020-09-15 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Sep 16, 2020 at 2:30 PM Michael Paquier  wrote:
>> Another thing to be careful here is WIN32, see 0921554.  And slowing
>> down recovery is never a good idea.

> Right, that commit makes a lot of sense because it suppresses many
> system calls that happen for each query.  The same problem existed on
> older FreeBSD versions and I saw that costing ~10% of TPS on read-only
> pgbench.  In other commits I've been removing system calls that happen
> for every WAL record.  But in this thread I'm talking about an update
> per 16MB WAL file, which seems like an acceptable ratio to me.

Hmm ... the thread leading up to 0921554 indicates that the performance
penalty of update_process_title=on is just ridiculously large on Windows.
Maybe those numbers are not relevant to crash recovery WAL-application,
but it might be smart to actually measure that rather than just assume it.

In any case, I'd recommend setting up any patch you create for this
to be easily "ifndef WIN32"'d in case we change our minds on the
point later.

regards, tom lane




Re: Force update_process_title=on in crash recovery?

2020-09-15 Thread Thomas Munro
On Wed, Sep 16, 2020 at 2:30 PM Michael Paquier  wrote:
> On Tue, Sep 15, 2020 at 10:01:18AM -0400, Tom Lane wrote:
> > Seems like a good argument, but you'd have to be careful about the
> > final state when you stop overriding update_process_title --- it can't
> > be left looking like it's still-in-progress on some random WAL file.
> > (Compare my nearby gripes about walsenders being sloppy about their
> > pg_stat_activity and process title presentations.)
>
> Another thing to be careful here is WIN32, see 0921554.  And slowing
> down recovery is never a good idea.

Right, that commit makes a lot of sense because it suppresses many
system calls that happen for each query.  The same problem existed on
older FreeBSD versions and I saw that costing ~10% of TPS on read-only
pgbench.  In other commits I've been removing system calls that happen
for every WAL record.  But in this thread I'm talking about an update
per 16MB WAL file, which seems like an acceptable ratio to me.




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Ashutosh Sharma
On Wed, Sep 16, 2020 at 9:14 AM Tom Lane  wrote:
>
> Ashutosh Sharma  writes:
> > On Wed, Sep 16, 2020 at 1:25 AM Tom Lane  wrote:
> >> * Should any of the other tables in the test be converted to temp?
>
> > Are you trying to say that we can achieve the things being done in
> > test-case 1 and 2 by having a single temp table and we should aim for
> > it because it will make the test-case more efficient and easy to
> > maintain?
>
> Well, I'm just looking at the comment that says the reason for the
> begin/rollback structure is to keep autovacuum's hands off the table.
> In most if not all of the other places where we need that, the preferred
> method is to make the table temp or mark it with (autovacuum = off).
> While this way isn't wrong exactly, nor inefficient, it does seem
> a little restrictive.  For instance, you can't easily test cases that
> involve intentional errors.
>
> Another point is that we have a few optimizations that apply to tables
> created in the current transaction.  I'm not sure whether any of them
> would fire in this test case, but if they do (now or in the future)
> that might mean you aren't testing the usual scenario for use of
> pg_surgery, which is surely not going to be a new-in-transaction
> table.  (That might be an argument for preferring autovacuum = off
> over a temp table, too.)
>

I agree with you on both the above points. I'll try to make the
necessary changes to address all your comments. Also, I'd prefer
creating a normal heap table with autovacuum = off over the temp table
for the reasons you mentioned in the second point.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: PostgreSQL 13 RC 1 release announcement draft

2020-09-15 Thread Peter Eisentraut

On 2020-09-15 18:10, Jonathan S. Katz wrote:

To upgrade to PostgreSQL 13 RC 1 from Beta 3 or an earlier version of
PostgreSQL 13, you will need to use a strategy similar to upgrading between
major versions of PostgreSQL (e.g. `pg_upgrade` or `pg_dump` / `pg_restore`).


Is this correct?  I don't see a catversion change between beta3 and rc1.

Also, if correct, the word "similar" seems redundant or confusing here.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-15 Thread Amit Kapila
On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila  
> wrote in
> > On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> >  wrote:
> > > Does this mean lseek(SEEK_END) doesn't count blocks that are
> > > write(2)'ed (by smgrextend) but not yet flushed? (I don't think so,
> > > for clarity.) The nblocks cache is added just to reduce the number of
> > > lseek()s and expected to always have the same value with what lseek()
> > > is expected to return.
> > >
> >
> > See comments in ReadBuffer_common() which indicates such a possibility
> > ("Unfortunately, we have also seen this case occurring because of
> > buggy Linux kernels that sometimes return an lseek(SEEK_END) result
> > that doesn't account for a recent write."). Also, refer my previous
> > email [1] on this and another email link in that email which has a
> > discussion on this point.
> >
> > > The reason it is reliable only during recovery
> > > is that the cache is not shared but the startup process is the only
> > > process that changes the relation size during recovery.
> > >
> >
> > Yes, that is why we are planning to do this optimization for recovery path.
> >
> > > If any other process can extend the relation while smgrtruncate is
> > > running, the current DropRelFileNodeBuffers should have the chance
> > > that a new buffer for extended area is allocated at a buffer location
> > > where the function already have passed by, which is a disaster.
> > >
> >
> > The relation might have extended before smgrtruncate but the newly
> > added pages can be flushed by checkpointer during smgrtruncate.
> >
> > [1] - 
> > https://www.postgresql.org/message-id/CAA4eK1LH2uQWznwtonD%2Bnch76kqzemdTQAnfB06z_LXa6NTFtQ%40mail.gmail.com
>
> Ah! I understood that! The reason we can rely on the cache is that the
> cached value is *not* what lseek returned but how far we intended to
> extend. Thank you for the explanation.
>
> By the way I'm not sure that actually happens, but if one smgrextend
> call extended the relation by two or more blocks, the cache is
> invalidated and succeeding smgrnblocks returns lseek()'s result.
>

Can you think of any such case? I think in recovery we use
XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the page
which seems to be extending page-by-page but there could be some case
where that is not true. One idea is to run regressions and add an
Assert to see if we are extending more than a block during recovery.
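
For illustration, the kind of Assert described above could look roughly like 
the following fragment (a sketch only, untested; it assumes smgrextend()'s 
parameters reln, forknum and blocknum, and that it is placed inside 
smgrextend()):

	/*
	 * Illustration only: during recovery, each extension should land exactly
	 * at the current end of the fork, i.e. we never skip over blocks.
	 */
#ifdef USE_ASSERT_CHECKING
	if (InRecovery)
		Assert(blocknum == smgrnblocks(reln, forknum));
#endif

Running the regression and recovery tests with such an assertion enabled would 
tell us whether the page-by-page assumption actually holds.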

> Don't
> we need to guarantee the cache to be valid while recovery?
>

One possibility could be that we somehow detect that the value we are
using is a cached one, and only do this optimization in that case.


-- 
With Regards,
Amit Kapila.




RE: Transactions involving multiple postgres foreign servers, take 2

2020-09-15 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> > If so, can't we stipulate that the FDW implementor should ensure that the
> commit function always returns control to the caller?
> 
> How can the FDW implementor ensure that? Since even palloc could call
> ereport(ERROR) I guess it's hard to require that to all FDW
> implementors.

I think what the FDW commit routine will do is just call xa_commit(), or 
PQexec("COMMIT PREPARED") in postgres_fdw.


> It's still a rough idea but I think we can use TMASYNC flag and
> xa_complete explained in the XA specification. The core transaction
> manager calls the prepare, commit, rollback APIs with the flag, requiring
> to execute the operation asynchronously and to return a handler (e.g.,
> a socket taken by PQsocket in postgres_fdw case) to the transaction
> manager. Then the transaction manager continues polling the handler
> until it becomes readable and testing the completion using
> xa_complete() with no wait, until all foreign servers return OK on the
> xa_complete check.

Unfortunately, even Oracle and Db2 haven't supported XA asynchronous execution 
for years.  Our DBMS Symfoware doesn't support it, either.  I don't expect 
other DBMSs to support it.

Hmm, I'm afraid this may be one of the FDW's intractable walls for a serious 
scale-out DBMS.  If we define asynchronous FDW routines for 2PC, postgres_fdw 
would be able to implement them by using libpq asynchronous functions.  But 
other DBMSs can't ...


> > Maybe we can consider that VOLATILE functions update data.  That may be
> an overreaction, though.
> 
> Sorry I don't understand that. The volatile functions are not pushed
> down to the foreign servers in the first place, no?

Ah, you're right.  Then, the choices are twofold: (1) trust users that their 
functions don't update data, or trust the user's claim (specification) about 
it, and (2) get a notification through the FE/BE protocol that the remote 
transaction may have updated data.


Regards
Takayuki Tsunakawa



Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

2020-09-15 Thread Tom Lane
Amit Kapila  writes:
> So, can we assume that the current code can only cause the problem in
> CCA builds but not in any practical scenario, because after having a
> lock on the relation there probably shouldn't be any invalidation that
> leads to this problem?

No.  The reason we expend so much time and effort on CCA testing is
that cache flushes are unpredictable, and they can happen even when
you have a lock on the object(s) in question.  In particular, the
easiest real-world case that could cause the described problem is an
sinval queue overrun that we detect during the GetSubscriptionRelState
call at the bottom of logicalrep_rel_open.  In that case we'll
come back to the caller with the LogicalRepRelMapEntry marked as
already needing revalidation, because the cache inval callback
will have marked *all* of them that way.  That's actually a harmless
condition, because we have lock on the rel so nothing really
changed ... but if we blew away localreloid and the caller needs
to use that, kaboom.

We could imagine marking the entry valid at the very bottom of
logicalrep_rel_open, but that just moves the problem somewhere else.
Any caller that does *any* catalog access while holding open a
LogicalRepRelMapEntry would not be able to rely on its localreloid
staying valid.  That's a recipe for irreproducible bugs, and it's
unnecessary.  In practice the entry is good as long as
we continue to hold a lock on the local relation.  So we should
mark LogicalRepRelMapEntries as potentially-needing-revalidation
in a way that doesn't interfere with active users of the entry.

In short: the value of CCA testing is to model sinval overruns
happening at any point where they could happen.  The real-world
odds of one happening at any given instant are low, but they're
never zero.

regards, tom lane




Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

2020-09-15 Thread Amit Kapila
On Wed, Sep 16, 2020 at 1:16 AM Tom Lane  wrote:
>
> I wrote:
> It's not really clear to me why setting localreloid to zero is a sane
> way to represent "this entry needs to be revalidated".  I think a
> separate flag would be more appropriate.  Once we have lock on the
> target relation, it seems to me that no interesting changes should
> be possible as long as we have lock; so there's no very good reason
> to destroy useful state to remind ourselves that we should recheck
> it next time.
>

So, can we assume that the current code can only cause the problem in
CCA builds but not in any practical scenario, because after having a
lock on the relation there probably shouldn't be any invalidation that
leads to this problem?

-- 
With Regards,
Amit Kapila.




Re: Online checksums verification in the backend

2020-09-15 Thread Michael Paquier
On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote:
> Thanks!

I got some numbers out of my pocket, using the following base
configuration:
wal_level = minimal
fsync = off
shared_buffers = '300MB' # also tested with 30MB and 60MB
checksum_cost_delay = 0 # default in patch

And for the test I have used pgbench initialized at a scale of 250, to
have close to 3.5GB of data, so that it gives us roughly 90% buffer
eviction, with all the data cached in the OS (we may also want to look
at the case where the OS cache does not hold all the relation
pages).  I have also run some tests with 30MB and 60MB of shared
buffers, with similar results.

I also applied some prewarming on the cluster:
create extension pg_prewarm;
select pg_prewarm(oid) from pg_class where oid > 16000;

Then, I have done five runs using that:
pgbench -S -M prepared -c 64/128/256 -n -T 60
In parallel of that, I got this stuff running in parallel all the
time:
select pg_check_relation('pgbench_accounts');
\watch 0.1 

Here are some TPS numbers with the execution time of pg_check_relation.
In the five runs, I removed the highest and lowest ones, then took an
average of the remaining three.  I have also tested two modes: with
and without the optimization, which requires a one-liner change in checksum.c
as per your latest patch:
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -84,7 +84,7 @@ check_one_block(Relation relation, ForkNumber forknum, 
BlockNumber blkno,
 uint16 *chk_expected, uint16 *chk_found)
 {
 charbuffer[BLCKSZ];
-boolforce_lock = false;
+boolforce_lock = true;
 boolfound_in_sb;

In parentheses is the amount of time taken by pg_relation_check()
for a single check.  This is not completely exact and I saw some
variation; it is just to give an impression:
Conns             64             128                 256
force_lock=true   60590 (7~8s)   55652 (10.2~10.5s)  46812 (9~10s)
force_lock=false  58637 (5s)     54131 (6~7s)        37091 (1.1~1.2s)

For connections higher than 128, I was kind of surprised to see
pg_relation_check being more aggressive without the optimization, with
much less contention on the buffer mapping LWLock actually, but that's
an impression from looking at pg_stat_activity.

Looking at the wait events for each run, I saw many more hiccups with
the buffer mapping LWLock when forcing the lock than when not, though I
was still able to see some contention when not forcing the lock.  Not
surprising, as this rejects a bunch of pages from shared buffers.

> I used all default settings, but by default checksum_cost_delay is 0
> so there shouldn't be any throttling.

Thanks, so did I.  From what I can see, there could also be benefits
in not using the optimization by default, since the relation check
then applies some natural throttling by making the checks actually
slower (there is a link between the individual runtime of
pg_check_relation and the TPS).  So it seems to me that the
throttling parameters would be beneficial, but there is also a point
in not including any throttling in a first version if the optimization
to go full speed is not around.  Using three new GUCs for those
function calls is still too much IMO, so there is also the argument to
move all this stuff into a new contrib/ module, and have a bgworker
implementation as part of it, as it would need shared_preload_libraries
anyway.
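
For what it's worth, the registration side of such a bgworker is mostly
boilerplate; a sketch could look like the following (the module name
pg_checksums_check and the worker entry point checksums_check_main are made
up for the example, and the worker main function itself is not shown):

#include "postgres.h"
#include "fmgr.h"
#include "postmaster/bgworker.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

void
_PG_init(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = 60;	/* seconds between restarts on crash */
	snprintf(worker.bgw_name, BGW_MAXLEN, "checksum verification worker");
	snprintf(worker.bgw_type, BGW_MAXLEN, "checksum verification worker");
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_checksums_check");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "checksums_check_main");

	RegisterBackgroundWorker(&worker);
}

Since RegisterBackgroundWorker() has to run from _PG_init() of a library loaded
via shared_preload_libraries, this fits the argument that the whole thing
belongs in a contrib/ module rather than in core.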

Also, I have been putting some thought into APIs able to fetch a
buffer without going through the shared buffers.  Neither checksum.c
nor checksumfuncs.c looks like the right place for them: checksum.c
should be something those APIs depend on, including only the
most-internal logic for the checksum algorithm and computation, and
checksumfuncs.c doesn't fit because we need to think about the case of
a background worker as well (one that could spawn a set of dynamic
workers connecting to different databases to do checksum
verification?).  It would be good to keep the handling of the buffer
mapping lock as well as the calls to smgrread() in a single place.
ReadBuffer_common() is a natural place for that, though it means for
our use case the addition of three new options:
- Being able to directly pass down a buffer pointer to save the page
contents into.
- Being able to not verify a page directly, leaving the verification
to the caller upthread.
- Addition of a new mode, which I am calling RBM_PRIVATE here, where we
actually read the page from disk if it is not yet in shared buffers,
except that we fill in the contents of the page using the pointer given
by the caller.  That's different from the use of local buffers, as we
don't need that much complication (like temporary tables) for per-page
checks.

Another idea would be to just let ReadBuffer_common do the check by
itself, with a different mode, a kind of RBM_VALIDATE, where we return
a verification state of the page that can be consumed by callers.

This also comes 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Tom Lane
Ashutosh Sharma  writes:
> On Wed, Sep 16, 2020 at 1:25 AM Tom Lane  wrote:
>> * Should any of the other tables in the test be converted to temp?

> Are you trying to say that we can achieve the things being done in
> test-case 1 and 2 by having a single temp table and we should aim for
> it because it will make the test-case more efficient and easy to
> maintain?

Well, I'm just looking at the comment that says the reason for the
begin/rollback structure is to keep autovacuum's hands off the table.
In most if not all of the other places where we need that, the preferred
method is to make the table temp or mark it with (autovacuum = off).
While this way isn't wrong exactly, nor inefficient, it does seem
a little restrictive.  For instance, you can't easily test cases that
involve intentional errors.

Another point is that we have a few optimizations that apply to tables
created in the current transaction.  I'm not sure whether any of them
would fire in this test case, but if they do (now or in the future)
that might mean you aren't testing the usual scenario for use of
pg_surgery, which is surely not going to be a new-in-transaction
table.  (That might be an argument for preferring autovacuum = off
over a temp table, too.)

Like I said, I don't have a big problem with leaving the rest of the
test as-is.  It just seems to be doing things in an unusual way for
no very good reason.

regards, tom lane




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-15 Thread Kyotaro Horiguchi
At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila  wrote 
in 
> On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
>  wrote:
> > Does this mean lseek(SEEK_END) doesn't count blocks that are
> > write(2)'ed (by smgrextend) but not yet flushed? (I don't think so,
> > for clarity.) The nblocks cache is added just to reduce the number of
> > lseek()s and expected to always have the same value with what lseek()
> > is expected to return.
> >
> 
> See comments in ReadBuffer_common() which indicates such a possibility
> ("Unfortunately, we have also seen this case occurring because of
> buggy Linux kernels that sometimes return an lseek(SEEK_END) result
> that doesn't account for a recent write."). Also, refer my previous
> email [1] on this and another email link in that email which has a
> discussion on this point.
>
> > The reason it is reliable only during recovery
> > is that the cache is not shared but the startup process is the only
> > process that changes the relation size during recovery.
> >
> 
> Yes, that is why we are planning to do this optimization for recovery path.
> 
> > If any other process can extend the relation while smgrtruncate is
> > running, the current DropRelFileNodeBuffers should have the chance
> > that a new buffer for extended area is allocated at a buffer location
> > where the function already have passed by, which is a disaster.
> >
> 
> The relation might have extended before smgrtruncate but the newly
> added pages can be flushed by checkpointer during smgrtruncate.
> 
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1LH2uQWznwtonD%2Bnch76kqzemdTQAnfB06z_LXa6NTFtQ%40mail.gmail.com

Ah! I understood that! The reason we can rely on the cache is that the
cached value is *not* what lseek returned but how far we intended to
extend. Thank you for the explanation.

By the way I'm not sure that actually happens, but if one smgrextend
call extended the relation by two or more blocks, the cache is
invalidated and succeeding smgrnblocks returns lseek()'s result. Don't
we need to guarantee the cache to be valid while recovery?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Ashutosh Sharma
On Wed, Sep 16, 2020 at 1:25 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma  
> > wrote:
> >> Thanks for reporting. I'm able to reproduce the issue by creating some
> >> delay just before "-- now create an unused line pointer" and use the
> >> delay to start a new session either with repeatable read or
> >> serializable transaction isolation level and run some query on the
> >> test table. To fix this, as you suggested I've converted the test
> >> table to the temp table. Attached is the patch with the changes.
> >> Please have a look and let me know about any concerns.
>
> > Tom, do you have any concerns about this fix?
>
> It seems OK as far as it goes.  Two thoughts:
>
> * Do we need a comment in the test pointing out that the table must be
> temp to ensure that we get stable vacuum results?  Or will the commit
> log message be enough documentation?
>

I'll add a note for this.

> * Should any of the other tables in the test be converted to temp?
> I see that the other test cases are kluging around related issues
> by dint of never committing their tables at all.  It's not clear
> to me how badly those test cases have been distorted by that, or
> whether it means they're testing less-than-typical situations.
>

Are you trying to say that we can achieve the things being done in
test-case 1 and 2 by having a single temp table and we should aim for
it because it will make the test-case more efficient and easy to
maintain? If so, I will try to do the necessary changes and submit a
new patch for it. please confirm.

Thanks,

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Parallelize stream replication process

2020-09-15 Thread Li Japin


> On Sep 15, 2020, at 3:41 PM, Fujii Masao  wrote:
> 
> 
> 
> On 2020/09/15 13:41, Bharath Rupireddy wrote:
>> On Tue, Sep 15, 2020 at 9:27 AM Li Japin  wrote:
>>> 
>>> For now, postgres uses a single process to send, receive and replay the WAL 
>>> when we use stream replication;
>>> is there any point in parallelizing this process? If so, how do we start?
>>> 
>>> Any thoughts?
> 
> Probably this is another parallelism than what you're thinking,
> but I was thinking to start up walwriter process in the standby server
> and make it fsync the streamed WAL data. This means that we leave
> a part of tasks of walreceiver process to walwriter. Walreceiver
> performs WAL receive and write, and walwriter performs WAL flush,
> in parallel. I'm just expecting that this change would improve
> the replication performance, e.g., reduce the time to wait for
> sync replication.
> 
> Without this change (i.e., originally), only walreceiver performs
> WAL receive, write and flush. So while walreceiver is fsyncing WAL data,
> it cannot receive newly-arrived WAL data. If WAL flush takes time,
> the time to wait for sync replication in the primary
> would be enlarged.
> 
> Regards,
> 
> -- 
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION

Yeah, this might be a direction. 

Now I am thinking about how to parallelize WAL replay. If we can improve the 
efficiency of replay, then we can shorten the database recovery time 
(for both streaming replication and crash recovery). 

For streaming replication, we may also need to improve the transmission of WAL 
to speed up
the entire recovery process.

I’m not sure if this is correct.

Regards,
Japin Li.



Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-15 Thread Amit Kapila
On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 2 Sep 2020 08:18:06 +0530, Amit Kapila  
> wrote in
> > On Wed, Sep 2, 2020 at 7:01 AM Kyotaro Horiguchi
> >  wrote:
> > > Isn't a relation always locked asscess-exclusively, at truncation
> > > time?  If so, isn't even the result of lseek reliable enough?
> > >
> >
> > Even if the relation is locked, background processes like checkpointer
> > can still touch the relation which might cause problems. Consider a
> > case where we extend the relation but didn't flush the newly added
> > pages. Now during truncate operation, checkpointer can still flush
> > those pages which can cause trouble for truncate. But, I think in the
> > recovery path such cases won't cause a problem.
>
> I reconsided on this and still have a doubt.
>
> Does this mean lseek(SEEK_END) doesn't count blocks that are
> write(2)'ed (by smgrextend) but not yet flushed? (I don't think so,
> for clarity.) The nblocks cache is added just to reduce the number of
> lseek()s and expected to always have the same value with what lseek()
> is expected to return.
>

See comments in ReadBuffer_common() which indicates such a possibility
("Unfortunately, we have also seen this case occurring because of
buggy Linux kernels that sometimes return an lseek(SEEK_END) result
that doesn't account for a recent write."). Also, refer my previous
email [1] on this and another email link in that email which has a
discussion on this point.

> The reason it is reliable only during recovery
> is that the cache is not shared but the startup process is the only
> process that changes the relation size during recovery.
>

Yes, that is why we are planning to do this optimization for recovery path.

> If any other process can extend the relation while smgrtruncate is
> running, the current DropRelFileNodeBuffers should have the chance
> that a new buffer for extended area is allocated at a buffer location
> where the function already have passed by, which is a disaster.
>

The relation might have extended before smgrtruncate but the newly
added pages can be flushed by checkpointer during smgrtruncate.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1LH2uQWznwtonD%2Bnch76kqzemdTQAnfB06z_LXa6NTFtQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Parallelize stream replication process

2020-09-15 Thread Li Japin
Thanks for clarifying the questions!

> On Sep 15, 2020, at 12:41 PM, Bharath Rupireddy 
>  wrote:
> 
> On Tue, Sep 15, 2020 at 9:27 AM Li Japin  wrote:
>> 
>> For now, postgres uses a single process to send, receive and replay the WAL 
>> when we use stream replication;
>> is there any point in parallelizing this process? If so, how do we start?
>> 
>> Any thoughts?
>> 
> 
> I think we must ask a few questions:
> 
> 1. What's the major gain we get out of this? Is it that the time to
> stream gets reduced or something else?

I think when the database fails over, we might shorten the recovery time with 
parallel stream replication.

> If the answer to the above point is something solid, then
> 2. How do we distribute the work to multiple processes?
> 3. Do we need all of the workers to maintain the order in which they
> read WAL files(on the publisher) and apply the changes(on the
> subscriber?)
> 3. Do we want to map the sender/publisher workers to
> receiver/subscriber workers on a one-to-one basis? If not, how do we
> do it?
> 4. How do sender and receiver workers communicate?
> 5. What if we have multiple subscribers/receivers?
> 
> I'm no expert in replication, I may be wrong as well. Others may have
> better thoughts.
> 

Maybe we can distribute the work to multiple processes according to the WAL 
record type.

As a first step, I think we can parallelize the replay process. We can classify 
the WAL records by record type or RmgrId,
and then replay them in parallel where possible.
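
As a toy illustration of the classification step only (push_to_replay_worker()
and NUM_REPLAY_WORKERS are hypothetical names; a real design would also have to
preserve the ordering dependencies between records, e.g. per page or per
relation, which is the hard part):

#include "postgres.h"
#include "access/xlogreader.h"

#define NUM_REPLAY_WORKERS 4

/* hypothetical: hand a decoded record to one of the replay workers */
extern void push_to_replay_worker(int worker, XLogReaderState *record);

/* Route each decoded record to a worker based on its resource manager id. */
static void
dispatch_record_by_rmgr(XLogReaderState *record)
{
	RmgrId		rmid = XLogRecGetRmid(record);

	push_to_replay_worker(rmid % NUM_REPLAY_WORKERS, record);
}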

Then, we can think about how to parallelize WalReceiver and WalSender.

Best regards
Japin Li.






Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-15 Thread Kyotaro Horiguchi
At Wed, 16 Sep 2020 11:56:29 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
(Oops! Some of my comments duplicate Tsunakawa-san's, sorry.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-15 Thread Kyotaro Horiguchi
Thanks for the new version, Jamison.

At Tue, 15 Sep 2020 11:11:26 +, "k.jami...@fujitsu.com" 
 wrote in 
> Hi, 
> 
> > BTW, I think I see one problem in the code:
> > >
> > > if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > > + bufHdr->tag.forkNum == forkNum[j] && tag.blockNum >=
> > > + bufHdr->firstDelBlock[j])
> > >
> > > Here, I think you need to use 'i' not 'j' for forkNum and
> > > firstDelBlock as those are arrays w.r.t forks. That might fix the
> > > problem but I am not sure as I haven't tried to reproduce it.
> > 
> > Thanks for the advice. Right, that seems to be the cause of the error, and 
> > fixing that
> > (using fork) solved the case.
> > I also followed Tsunakawa-san's advice of using more meaningful
> > iterator names instead of "i" & "j" for readability.

(FWIW, I prefer short conventional names for short-term iterator variables.)


master> * XXX currently it sequentially searches the buffer pool, should be
master> * changed to more clever ways of searching.  However, this routine
master> * is used only in code paths that aren't very performance-critical,
master> * and we shouldn't slow down the hot paths to make it faster ...

This comment needs a rewrite.


+   for (fork_num = 0; fork_num < nforks; fork_num++)
{
if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
-   bufHdr->tag.forkNum == forkNum[j] &&
-   bufHdr->tag.blockNum >= firstDelBlock[j])
+   bufHdr->tag.forkNum == forkNum[fork_num] &&
+   bufHdr->tag.blockNum >= firstDelBlock[fork_num])

fork_num is not actually a fork number, but the index of forkNum[].
It should be fork_idx (or just i, which I prefer..).

-   for (j = 0; j < nforks; j++)
-   DropRelFileNodeLocalBuffers(rnode.node, 
forkNum[j],
-   
firstDelBlock[j]);
+   for (fork_num = 0; fork_num < nforks; fork_num++)
+   DropRelFileNodeLocalBuffers(rnode.node, 
forkNum[fork_num],
+   
firstDelBlock[fork_num]);

I think we don't need to include the irrelevant refactoring in this
patch. (And I think j is better there.)

+* We only speedup this path during recovery, because that's the only
+* timing when we can get a valid cached value of blocks for relation.
+* See comment in smgrnblocks() in smgr.c. Otherwise, proceed to usual
+* buffer invalidation process (scanning of whole shared buffers).

We need an explanation of why we do this optimization only for the
recovery case.

+   /* Get the number of blocks for the supplied relation's 
fork */
+   nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
+   Assert(BlockNumberIsValid(nblocks));
+
+   if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)

As mentioned upthread, the criterion for whether we do full-scan or
lookup-drop is how large a portion of NBuffers this relation-drop is
going to invalidate.  So the nblocks above should be the sum of the
number of blocks to be truncated (not just the total number of blocks)
over all designated forks.  Then, once we decide to use the lookup-drop
method, we do it for all forks.

+   for (block_num = 0; block_num <= nblocks; 
block_num++)
+   {

block_num is quite easy to confuse with nblocks, at least for
me (:p). Likewise for fork_num; I prefer just j or iblk or
something else that isn't easily confused with nblocks.  By the way, the
loop runs nblocks + 1 times, which seems wrong.  We can start the loop
from firstDelBlock[fork_num] instead of 0, and that makes the later
check against firstDelBlock[] unnecessary.

+   /* create a tag with respect to the 
block so we can lookup the buffer */
+   INIT_BUFFERTAG(newTag, rnode.node, 
forkNum[fork_num],
+  
firstDelBlock[block_num]);

Mmm, it is wrong that the tag is initialized using
firstDelBlock[block_num]. Why isn't it just block_num?


+   if (buf_id < 0)
+   {
+   LWLockRelease(newPartitionLock);
+   continue;
+   }
+   LWLockRelease(newPartitionLock);

We don't need two separate LWLockRelease()'s there.

+   /*
+* We can make this a tad faster by prechecking the buffer tag before
+* we attempt to lock the buffer; this saves a lot of lock
...
+*/
+   if 

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-15 Thread tsunakawa.ta...@fujitsu.com
The code doesn't seem to be working correctly.


(1)
+   for (block_num = 0; block_num <= nblocks; 
block_num++)

should be

+   for (block_num = firstDelBlock[fork_num]; 
block_num < nblocks; block_num++)

because:

* You only want to invalidate blocks >= firstDelBlock[fork_num], don't you?
* The relation's block number ranges from 0 to nblocks - 1.


(2)
+   INIT_BUFFERTAG(newTag, rnode.node, 
forkNum[fork_num],
+  
firstDelBlock[block_num]);

Replace firstDelBlock[fork_num] with block_num, because you want to process the 
current block of per-block loop.  Your code accesses memory out of the bounds 
of the array, and doesn't invalidate any buffer.


(3)
+   if 
(RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+   bufHdr->tag.forkNum == 
forkNum[fork_num] &&
+   bufHdr->tag.blockNum >= 
firstDelBlock[block_num])
+   InvalidateBuffer(bufHdr);   
/* releases spinlock */
+   else
+   UnlockBufHdr(bufHdr, buf_state);

Replace
bufHdr->tag.blockNum >= firstDelBlock[fork_num]
with
bufHdr->tag.blockNum == block_num
because you want to check if the found buffer is for the current block of the 
loop.


(4)
+   /*
+* We've invalidated the nblocks already. Scan 
the shared buffers
+* for each fork.
+*/
+   if (block_num > nblocks)
+   {
+   
DropRelFileNodeBuffersOfFork(rnode.node, forkNum[fork_num],
+   
 firstDelBlock[fork_num]);
+   }

This part is unnecessary.  This invalidates all buffers that (2) failed to 
process, so the regression test succeeds.
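
Putting (1) through (3) together, the per-block part of the loop would look
roughly like the following (a sketch of the suggested shape only, assuming it
lives in bufmgr.c next to DropRelFileNodeBuffers() so that InvalidateBuffer()
and the buffer-mapping helpers are in scope, and with rnode, forkNum[],
firstDelBlock[] and nblocks as in the patch):

	for (BlockNumber block_num = firstDelBlock[fork_num];
		 block_num < nblocks;
		 block_num++)
	{
		BufferTag	newTag;
		uint32		newHash;
		LWLock	   *newPartitionLock;
		int			buf_id;
		BufferDesc *bufHdr;
		uint32		buf_state;

		/* look up the buffer holding exactly this block of this fork */
		INIT_BUFFERTAG(newTag, rnode.node, forkNum[fork_num], block_num);
		newHash = BufTableHashCode(&newTag);
		newPartitionLock = BufMappingPartitionLock(newHash);

		LWLockAcquire(newPartitionLock, LW_SHARED);
		buf_id = BufTableLookup(&newTag, newHash);
		LWLockRelease(newPartitionLock);

		if (buf_id < 0)
			continue;			/* block is not in shared buffers */

		bufHdr = GetBufferDescriptor(buf_id);

		/* recheck under the buffer header spinlock before invalidating */
		buf_state = LockBufHdr(bufHdr);
		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
			bufHdr->tag.forkNum == forkNum[fork_num] &&
			bufHdr->tag.blockNum == block_num)
			InvalidateBuffer(bufHdr);	/* releases spinlock */
		else
			UnlockBufHdr(bufHdr, buf_state);
	}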


Regards
Takayuki Tsunakawa



Re: Force update_process_title=on in crash recovery?

2020-09-15 Thread Michael Paquier
On Tue, Sep 15, 2020 at 10:01:18AM -0400, Tom Lane wrote:
> Seems like a good argument, but you'd have to be careful about the
> final state when you stop overriding update_process_title --- it can't
> be left looking like it's still-in-progress on some random WAL file.
> (Compare my nearby gripes about walsenders being sloppy about their
> pg_stat_activity and process title presentations.)

Another thing to be careful here is WIN32, see 0921554.  And slowing
down recovery is never a good idea.
--
Michael




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-15 Thread Kyotaro Horiguchi
At Wed, 2 Sep 2020 08:18:06 +0530, Amit Kapila  wrote 
in 
> On Wed, Sep 2, 2020 at 7:01 AM Kyotaro Horiguchi
>  wrote:
> > Isn't a relation always locked asscess-exclusively, at truncation
> > time?  If so, isn't even the result of lseek reliable enough?
> >
> 
> Even if the relation is locked, background processes like checkpointer
> can still touch the relation which might cause problems. Consider a
> case where we extend the relation but didn't flush the newly added
> pages. Now during truncate operation, checkpointer can still flush
> those pages which can cause trouble for truncate. But, I think in the
> recovery path such cases won't cause a problem.

I reconsided on this and still have a doubt.

Does this mean lseek(SEEK_END) doesn't count blocks that are
write(2)'ed (by smgrextend) but not yet flushed? (I don't think so,
for clarity.) The nblocks cache is added just to reduce the number of
lseek()s and expected to always have the same value with what lseek()
is expected to return. The reason it is reliable only during recovery
is that the cache is not shared but the startup process is the only
process that changes the relation size during recovery.

If any other process can extend the relation while smgrtruncate is
running, the current DropRelFileNodeBuffers has a chance that
a new buffer for the extended area gets allocated at a buffer location
the function has already passed by, which would be a disaster.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Feedback on table expansion hook (including patch)

2020-09-15 Thread Euler Taveira
On Thu, 7 May 2020 at 05:11, Erik Nordström  wrote:

>
> I am looking for feedback on the possibility of adding a table expansion
> hook to PostgreSQL (see attached patch). The motivation for this is to
> allow extensions to optimize table expansion. In particular, TimescaleDB
> does its own table expansion in order to apply a number of optimizations,
> including partition pruning (note that TimescaleDB uses inheritance since
> PostgreSQL 9.6 rather than declarative partitioning ). There's currently no
> official hook for table expansion, but TimescaleDB has been using the
> get_relation_info hook for this purpose. Unfortunately, PostgreSQL 12 broke
> this for us since it moved expansion to a later stage where we can no
> longer control it without some pretty bad hacks. Given that PostgreSQL 12
> changed the expansion state of a table for the get_relation_info hook, we
> are thinking about this as a regression and are wondering if this could be
> considered against the head of PG 12 or maybe even PG 13 (although we
> realize feature freeze has been reached)?
>
>
I reviewed your patch and it looks good to me. You mentioned that it would
be useful for partitioning using table inheritance but it could also be
used for declarative partitioning (at least until someone decides to detach
it from inheritance infrastructure).

Unfortunately, you showed up late here. Even though the hook is a
straightforward feature, Postgres does not add new features to released
versions or after the feature freeze.

The only point that I noticed was that you chose "control over" but similar
code uses "control in".


-- 
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Optimising compactify_tuples()

2020-09-15 Thread Peter Geoghegan
On Tue, Sep 15, 2020 at 7:01 PM Thomas Munro  wrote:
> On Wed, Sep 16, 2020 at 1:30 PM David Rowley  wrote:
> > I also did some further performance tests of something other than
> > recovery. I can also report a good performance improvement in VACUUM.
> > Something around the ~25% reduction mark
>
> Wonderful results.  It must be rare for such a localised patch to
> have such a large effect on such common workloads.

Yes, that's terrific.

-- 
Peter Geoghegan




Re: pg_restore causing deadlocks on partitioned tables

2020-09-15 Thread Amit Langote
On Wed, Sep 16, 2020 at 2:41 AM Tom Lane  wrote:
> Amit Langote  writes:
> > On Tue, Sep 15, 2020 at 7:28 AM Tom Lane  wrote:
> >> AFAICS, it is utterly silly for InitResultRelInfo to be forcing
> >> a partition qual to be computed when we might not need it.
> >> We could flush ResultRelInfo.ri_PartitionCheck altogether and
> >> have anything that was reading it instead do
> >> RelationGetPartitionQual(ResultRelInfo.ri_RelationDesc).
>
> > Yeah, makes sense.  Please see attached a patch to do that.
>
> Just eyeballing this, this bit seems bogus:
>
> @@ -1904,7 +1903,7 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
> Bitmapset  *insertedCols;
> Bitmapset  *updatedCols;
>
> -   Assert(constr || resultRelInfo->ri_PartitionCheck);
> +   Assert(constr);
>
> if (constr && constr->has_not_null)
> {
>
> It does look like all the call sites check for the rel having constraints
> before calling, so the modified Assert may not be failing ... but why
> are we asserting and then also making a run-time test?
>
> My inclination is to just drop the Assert as useless.  There's no
> particular reason for this function to make it a hard requirement
> that callers optimize away unnecessary calls.

Yeah, the Assert seems pretty pointless at this point.

> I'm suspicious of the business in ExecPartitionCheck about constructing
> a constant-true expression.  I think executing that is likely to add
> more cycles than you save by not running through this code each time;
> once relcache has cached the knowledge that the partition expression
> is empty, all the steps here are pretty darn cheap ... which no doubt
> is why there wasn't a comparable optimization already.

Ah, you're right.

>  If you're
> really concerned about that it'd be better to add a separate
> "bool ri_PartitionCheckExprValid" flag.  (Perhaps that's worth doing
> to avoid impacts from relcache flushes; though I remain unconvinced
> that optimizing for the empty-expression case is useful.)

Agreed that it's not really necessary to optimize that case.

Updated patch attached.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com


remove-ri_PartitionCheck_v2.patch
Description: Binary data


Re: Optimising compactify_tuples()

2020-09-15 Thread Thomas Munro
On Wed, Sep 16, 2020 at 1:30 PM David Rowley  wrote:
> Thanks a lot for the detailed benchmark results and profiles. That was
> useful.  I've pushed both patches now. I did a bit of a sweep of the
> comments on the 0001 patch before pushing it.
>
> I also did some further performance tests of something other than
> recovery. I can also report a good performance improvement in VACUUM.
> Something around the ~25% reduction mark

Wonderful results.  It must be rare for such a localised patch to
have such a large effect on such common workloads.




Re: Optimising compactify_tuples()

2020-09-15 Thread David Rowley
On Wed, 16 Sep 2020 at 02:10, Jakub Wartak  wrote:
> BTW: this message "redo done at 0/9749FF70 system usage: CPU: user: 13.46 s, 
> system: 0.78 s, elapsed: 14.25 s" is priceless addition :)

Thanks a lot for the detailed benchmark results and profiles. That was
useful.  I've pushed both patches now. I did a bit of a sweep of the
comments on the 0001 patch before pushing it.

I also did some further performance tests of something other than
recovery. I can also report a good performance improvement in VACUUM.
Something around the ~25% reduction mark

psql -c "drop table if exists t1;" postgres > /dev/null
psql -c "create table t1 (a int primary key, b int not null) with
(autovacuum_enabled = false, fillfactor = 85);" postgres > /dev/null
psql -c "insert into t1 select x,0 from generate_series(1,1000)
x;" postgres > /dev/null
psql -c "drop table if exists log_wal;" postgres > /dev/null
psql -c "create table log_wal (lsn pg_lsn not null);" postgres > /dev/null
psql -c "insert into log_wal values(pg_current_wal_lsn());" postgres > /dev/null
pgbench -n -f update.sql -t 6 -c 200 -j 200 -M prepared postgres
psql -c "select 'Used ' ||
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), lsn)) || ' of
WAL' from log_wal limit 1;" postgres
psql postgres

\timing on
VACUUM t1;

Fillfactor = 85

patched:

Time: 2917.515 ms (00:02.918)
Time: 2944.564 ms (00:02.945)
Time: 3004.136 ms (00:03.004)

master:
Time: 4050.355 ms (00:04.050)
Time: 4104.999 ms (00:04.105)
Time: 4158.285 ms (00:04.158)

Fillfactor = 100

Patched:

Time: 4245.676 ms (00:04.246)
Time: 4251.485 ms (00:04.251)
Time: 4247.802 ms (00:04.248)

Master:
Time: 5459.433 ms (00:05.459)
Time: 5917.356 ms (00:05.917)
Time: 5430.986 ms (00:05.431)

David




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Tom Lane
Alvaro Herrera  writes:
> They get moderated because the nore...@postgresql.org address which
> appears as sender is not subscribed to any list.

Ah-hah.

> I also added that
> address to the whitelist now, but whether that's a great fix in the long
> run is debatable. 

Can/should we change the address that originates such messages?

regards, tom lane




Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

2020-09-15 Thread Tom Lane
I wrote:
> With this, we get through 013_partition.pl under CCA.  I plan to
> try to run all of subscription/ and recovery/ before concluding
> there's nothing else to fix, though.

Looks like the rest passes.  FTR, it was possible to get through
subscription/ in about 2 hours on my workstation, and recovery/ in about
30 minutes, after hacking things so that initdb used a non-CCA backend.
So that's definitely a promising thing to look at if anyone wants
to try to do this on a regular basis.

Aside from increasing wal_receiver_timeout as previously mentioned,
I found I had to set a higher pg_ctl start timeout using PGCTLTIMEOUT
in order to get through the recovery tests.

regards, tom lane




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Alvaro Herrera
On 2020-Sep-15, Tom Lane wrote:

> Peter Geoghegan  writes:
> > On Tue, Sep 15, 2020 at 3:02 PM Tom Lane  wrote:
> >> The tag is applied, though for some reason the pgsql-committers auto
> >> e-mail about new tags hasn't been working lately.
> 
> > Thanks. FWIW I did get the automated email shortly after you sent this 
> > email.
> 
> Yeah, it did show up here too, about an hour after I pushed the tag.
> The last several taggings have been delayed similarly, and I think
> at least one never was reported at all.

I approved it about half an hour after it got in the moderation queue.

They get moderated because the nore...@postgresql.org address which
appears as sender is not subscribed to any list.  I also added that
address to the whitelist now, but whether that's a great fix in the long
run is debatable. 

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Sep 15, 2020 at 3:02 PM Tom Lane  wrote:
>> The tag is applied, though for some reason the pgsql-committers auto
>> e-mail about new tags hasn't been working lately.

> Thanks. FWIW I did get the automated email shortly after you sent this email.

Yeah, it did show up here too, about an hour after I pushed the tag.
The last several taggings have been delayed similarly, and I think
at least one never was reported at all.

regards, tom lane




Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-09-15 Thread Alexander Korotkov
Hi!

I've skimmed through the thread and checked the patchset.  Everything
looks good, except one paragraph, which doesn't look completely clear.

+  
+   This emulates the functionality provided by
+but sets the created object's
+   type definition
+   to domain.
+  

As I read it, it states that CREATE DOMAIN somehow "emulates" CREATE TYPE.
Could you please rephrase it?  It still looks confusing to me.
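
For reference, the distinction at issue: CREATE DOMAIN defines a constrained
alias over an existing type, while CREATE TYPE defines a genuinely new type.
Hypothetical illustrations (not taken from the patch under review):

CREATE DOMAIN us_postal_code AS text CHECK (VALUE ~ '^[0-9]{5}$');
CREATE TYPE complex AS (r double precision, i double precision);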

--
Regards,
Alexander Korotkov




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Peter Geoghegan
On Tue, Sep 15, 2020 at 3:02 PM Tom Lane  wrote:
> The tag is applied, though for some reason the pgsql-committers auto
> e-mail about new tags hasn't been working lately.

Thanks. FWIW I did get the automated email shortly after you sent this email.

-- 
Peter Geoghegan




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Tom Lane
I wrote:
> I plan to tag rc1 in around six hours, ~2200UTC today, barring
> trouble reports from packagers (none so far).  Feel free to push your
> patch once the tag appears.

The tag is applied, though for some reason the pgsql-committers auto
e-mail about new tags hasn't been working lately.

regards, tom lane




Re: "Unified logging system" breaks access to pg_dump debug outputs

2020-09-15 Thread Alvaro Herrera
On 2020-Sep-15, Tom Lane wrote:

> I wrote:
> > Alternatively, we might consider inventing an additional logging.c
> > function pg_logging_increase_level() with the obvious semantics, and
> > make the various programs just call that when they see a -v switch.
> > That would be a slightly bigger patch, but it would more easily support
> > programs with a range of useful verbosities, so maybe that's a better
> > idea.
> 
> After further thought, I concluded that's a clearly superior solution,
> so 0001 attached does it like that.  After noting that the enum values
> are in the opposite direction from how I thought they went, I realized
> that "increase_level" might be a bit ambiguous, so I now propose to
> call it pg_logging_increase_verbosity.

I like this better too ... I've wished for extra-verbose messages from
pg_dump in the past, and this now allows me to add something should I
want them again.

> > (Note: it seems possible that the theoretical multiple verbosity
> > levels in pg_dump were already broken before cc8d41511, because
> > right offhand I do not see any code that that removed that would
> > have allowed invoking the higher levels either.
> 
> Closer inspection says this was almost certainly true, because
> I discovered that pg_dump -v -v crashes if you don't specify
> an output filename :-(.  So this has probably been unreachable
> at least since we went over to using our own snprintf always;
> before that, depending on platform, it might've been fine.
> So we also need 0002 attached to fix that.

Ugh.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Gripes about walsender command processing

2020-09-15 Thread Alvaro Herrera
On 2020-Sep-15, Tom Lane wrote:

> [ blink ... ]  So why is it that DropReplicationSlot does
> 
> 	SetQueryCompletion(&qc, CMDTAG_DROP_REPLICATION_SLOT, 0);
> 	EndCommand(&qc, DestRemote, false);
> 
> when the caller will immediately after that do
> 
> 	SetQueryCompletion(&qc, CMDTAG_SELECT, 0);
> 	EndCommand(&qc, DestRemote, true);
> 
> StartLogicalReplication has a similar weirdness.
> The more I look at this code, the more broken it seems.

I overlooked this in 2f9661311b83.  From this perspective, I agree it
looks wrong.  We still have to send *some* completion tag (the 'C'
message), but maybe we can invent a separate entry point in dest.c for
that -- EndReplicationCommand() or some such -- that takes values from a
separate enum?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

2020-09-15 Thread Tom Lane
I wrote:
> It's not really clear to me why setting localreloid to zero is a sane
> way to represent "this entry needs to be revalidated".  I think a
> separate flag would be more appropriate.  Once we have lock on the
> target relation, it seems to me that no interesting changes should
> be possible as long as we have lock; so there's no very good reason
> to destroy useful state to remind ourselves that we should recheck
> it next time.

Here's a patch that changes that, and also cleans up some sloppy
thinking about how to re-acquire lock on the replication target
relation.  (Just because the OID was valid last you heard does
not mean that table_open is guaranteed to succeed.)

With this, we get through 013_partition.pl under CCA.  I plan to
try to run all of subscription/ and recovery/ before concluding
there's nothing else to fix, though.

regards, tom lane

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 3d2d56295b..9ee70a2563 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -77,7 +77,7 @@ logicalrep_relmap_invalidate_cb(Datum arg, Oid reloid)
 		{
 			if (entry->localreloid == reloid)
 			{
-entry->localreloid = InvalidOid;
+entry->localrelvalid = false;
hash_seq_term(&status);
 break;
 			}
@@ -91,7 +91,7 @@ logicalrep_relmap_invalidate_cb(Datum arg, Oid reloid)
hash_seq_init(&status, LogicalRepRelMap);
 
while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
-			entry->localreloid = InvalidOid;
+			entry->localrelvalid = false;
 	}
 }
 
@@ -230,15 +230,13 @@ logicalrep_rel_att_by_name(LogicalRepRelation *remoterel, const char *attname)
 /*
  * Open the local relation associated with the remote one.
  *
- * Optionally rebuilds the Relcache mapping if it was invalidated
- * by local DDL.
+ * Rebuilds the Relcache mapping if it was invalidated by local DDL.
  */
 LogicalRepRelMapEntry *
 logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 {
 	LogicalRepRelMapEntry *entry;
 	bool		found;
-	Oid			relid = InvalidOid;
 	LogicalRepRelation *remoterel;
 
 	if (LogicalRepRelMap == NULL)
@@ -254,14 +252,45 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 
remoterel = &entry->remoterel;
 
+	/* Ensure we don't leak a relcache refcount. */
+	if (entry->localrel)
+		elog(ERROR, "remote relation ID %u is already open", remoteid);
+
 	/*
 	 * When opening and locking a relation, pending invalidation messages are
-	 * processed which can invalidate the relation.  We need to update the
-	 * local cache both when we are first time accessing the relation and when
-	 * the relation is invalidated (aka entry->localreloid is set InvalidOid).
+	 * processed which can invalidate the relation.  Hence, if the entry is
+	 * currently considered valid, try to open the local relation by OID and
+	 * see if invalidation ensues.
+	 */
+	if (entry->localrelvalid)
+	{
+		entry->localrel = try_table_open(entry->localreloid, lockmode);
+		if (!entry->localrel)
+		{
+			/* Table was renamed or dropped. */
+			entry->localrelvalid = false;
+		}
+		else if (!entry->localrelvalid)
+		{
+			/* Note we release the no-longer-useful lock here. */
+			table_close(entry->localrel, lockmode);
+			entry->localrel = NULL;
+		}
+	}
+
+	/*
+	 * If the entry has been marked invalid since we last had lock on it,
+	 * re-open the local relation by name and rebuild all derived data.
 	 */
-	if (!OidIsValid(entry->localreloid))
+	if (!entry->localrelvalid)
 	{
+		Oid			relid;
+		int			found;
+		Bitmapset  *idkey;
+		TupleDesc	desc;
+		MemoryContext oldctx;
+		int			i;
+
 		/* Try to find and lock the relation by name. */
 		relid = RangeVarGetRelid(makeRangeVar(remoterel->nspname,
 			  remoterel->relname, -1),
@@ -272,21 +301,7 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 	 errmsg("logical replication target relation \"%s.%s\" does not exist",
 			remoterel->nspname, remoterel->relname)));
 		entry->localrel = table_open(relid, NoLock);
-
-	}
-	else
-	{
-		relid = entry->localreloid;
-		entry->localrel = table_open(entry->localreloid, lockmode);
-	}
-
-	if (!OidIsValid(entry->localreloid))
-	{
-		int			found;
-		Bitmapset  *idkey;
-		TupleDesc	desc;
-		MemoryContext oldctx;
-		int			i;
+		entry->localreloid = relid;
 
 		/* Check for supported relkind. */
 		CheckSubscriptionRelkind(entry->localrel->rd_rel->relkind,
@@ -380,7 +395,7 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 			}
 		}
 
-		entry->localreloid = relid;
+		entry->localrelvalid = true;
 	}
 
 	if (entry->state != SUBREL_STATE_READY)
@@ -523,7 +538,7 @@ logicalrep_partmap_invalidate_cb(Datum arg, Oid reloid)
 		{
 			if (entry->localreloid == reloid)
 			{
-entry->localreloid = InvalidOid;
+entry->localrelvalid = false;
hash_seq_term(&status);
 break;
 			}
@@ -537,7 +552,7 @@ logicalrep_partmap_invalidate_cb(Datum arg, Oid 

Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-15 Thread Robert Haas
On Tue, Sep 15, 2020 at 2:04 PM Andres Freund  wrote:
> > How is it possible?  Because tuple which has a committed xmax and the
> > xmax is older than the oldestXmin, should not come for freezing unless
> > it is lock_only xid (because those tuples are already gone).
>
> There've been several cases of this in the past. A fairly easy way is a
> corrupted relfrozenxid (of which there are many examples).

Hmm, so is the case you're worried about here the case where the
freezing threshold is greater than the pruning threshold? i.e. The
relfrozenxid has been updated to a value greater than the xmin we
derive from the procarray?

If that's not the case, then I don't see what problem there can be
here. To reach heap_prepare_freeze_tuple the tuple has to survive
pruning. If xmin < freezing-threshold and freezing-threshold <
pruning-threshold and the tuple survived pruning, then xmin must be a
committed transaction visible to everyone so setting xmin to
FrozenTransactionId is fine. If xmax < freezing-threshold and
freezing-threshold < pruning-threshold and the tuple survived pruning,
xmax must be visible to everyone and can't be running so it must have
aborted, so setting xmax to InvalidTransactionId is fine.

On the other hand if, somehow, freezing-threshold > pruning-threshold,
then freezing seems categorically unsafe. Doing so would change
visibility decisions of transactions that are still running, or that
were running at the time when we computed the pruning threshold. But
the sanity checks in heap_prepare_freeze_tuple() seem like they would
catch many such cases, but I'm not sure if they're all water-tight. It
might be better to skip calling heap_prepare_freeze_tuple() altogether
if the freezing threshold does not precede the pruning threshold.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Force update_process_title=on in crash recovery?

2020-09-15 Thread Justin Pryzby
On Tue, Sep 15, 2020 at 10:01:18AM -0400, Tom Lane wrote:
> Thomas Munro  writes:
> > Based on a couple of independent reports from users with no idea how
> > to judge the progress of a system recovering from a crash, Christoph
> > and I wondered if we should override update_process_title for the
> > "recovering ..." message, at least until connections are allowed.  We
> > already do that to set the initial titles.
> 
> > Crash recovery is a rare case where important information is reported
> > through the process title that isn't readily available anywhere else,
> > since you can't log in.  If you want to gauge  progress on a system
> > that happened to crash with update_process_title set to off, your best
> > hope is probably to trace the process or spy on the files it has open,
> > to see which WAL segment it's accessing, but that's not very nice.
> 
> Seems like a good argument, but you'd have to be careful about the
> final state when you stop overriding update_process_title --- it can't
> be left looking like it's still-in-progress on some random WAL file.
> (Compare my nearby gripes about walsenders being sloppy about their
> pg_stat_activity and process title presentations.)

Related:
https://commitfest.postgresql.org/29/2688/

I'm not sure I understood Michael's recent message, but I think maybe refers to
promotion of a standby.

-- 
Justin




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Tom Lane
Robert Haas  writes:
> On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma  wrote:
>> Thanks for reporting. I'm able to reproduce the issue by creating some
>> delay just before "-- now create an unused line pointer" and use the
>> delay to start a new session either with repeatable read or
>> serializable transaction isolation level and run some query on the
>> test table. To fix this, as you suggested I've converted the test
>> table to the temp table. Attached is the patch with the changes.
>> Please have a look and let me know about any concerns.

> Tom, do you have any concerns about this fix?

It seems OK as far as it goes.  Two thoughts:

* Do we need a comment in the test pointing out that the table must be
temp to ensure that we get stable vacuum results?  Or will the commit
log message be enough documentation?

* Should any of the other tables in the test be converted to temp?
I see that the other test cases are kluging around related issues
by dint of never committing their tables at all.  It's not clear
to me how badly those test cases have been distorted by that, or
whether it means they're testing less-than-typical situations.

Anyway, if you're satisfied with leaving the other cases as-is,
I have no objections.

regards, tom lane




Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

2020-09-15 Thread Tom Lane
I wrote:
> [ $subject ]

I found some time to trace this down, and what it turns out to be is
that apply_handle_truncate() is making use of a LogicalRepRelMapEntry's
localreloid field without any consideration for the possibility that
that's been set to zero as a result of a cache flush.  The visible
symptom of "cache lookup failed for relation 0" comes from trying
to invoke find_all_inheritors with a zero OID.

Now, study of apply_handle_truncate doesn't immediately reveal where
a cache flush could have occurred, but I realized that it's actually
possible that the LogicalRepRelMapEntry is *already* marked invalid
when logicalrep_rel_open() returns!  That's because for some reason
it does GetSubscriptionRelState last, after it's already marked the
entry valid, and that function does plenty o' catalog accesses.

It's not really clear to me why setting localreloid to zero is a sane
way to represent "this entry needs to be revalidated".  I think a
separate flag would be more appropriate.  Once we have lock on the
target relation, it seems to me that no interesting changes should
be possible as long as we have lock; so there's no very good reason
to destroy useful state to remind ourselves that we should recheck
it next time.

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-15 Thread Robert Haas
On Mon, Sep 14, 2020 at 6:26 AM Ashutosh Sharma  wrote:
> Thanks for reporting. I'm able to reproduce the issue by creating some
> delay just before "-- now create an unused line pointer" and use the
> delay to start a new session either with repeatable read or
> serializable transaction isolation level and run some query on the
> test table. To fix this, as you suggested I've converted the test
> table to the temp table. Attached is the patch with the changes.
> Please have a look and let me know about any concerns.

Tom, do you have any concerns about this fix?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

2020-09-15 Thread Ranier Vilela
On Tue, Sep 15, 2020 at 14:54, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:

> I think you meant _IONBF instead of _IOFBF -- otherwise it's at odds
> with the comment you add.  But what is the justification for that
> addition?  I don't see us doing that anywhere else.
>
No.
_IOFBF (full buffering): On output, data is written once the buffer is full
(or flushed). On input, the buffer is filled when an input operation is
requested and the buffer is empty.
_IONBF (no buffering): No buffer is used. Each I/O operation is written as
soon as possible. In this case, the buffer and size parameters are ignored.
_IONBF ignores buffer and size.

Without setvbuf, fread uses an internal buffer (default 4096 bytes, OS
dependent).  If fread uses an internal buffer, it then needs an extra copy
into the buffer provided by the caller.  With setvbuf, it works like a
low-level read(): the data is copied directly into the final buffer.

regards,
Ranier Vilela


Re: Gripes about walsender command processing

2020-09-15 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Sep 14, 2020 at 11:34:44PM -0400, Tom Lane wrote:
>> (I don't quite follow your comment about repslot drop, btw.)

> There is already a command tag equivalent to DROP_REPLICATION_SLOT:
> $ git grep REPLICATION -- */cmdtaglist.h
> src/include/tcop/cmdtaglist.h:PG_CMDTAG(CMDTAG_DROP_REPLICATION_SLOT,
> "DROP REPLICATION SLOT", false, false, false)

[ blink ... ]  So why is it that DropReplicationSlot does

	SetQueryCompletion(&qc, CMDTAG_DROP_REPLICATION_SLOT, 0);
	EndCommand(&qc, DestRemote, false);

when the caller will immediately after that do

	SetQueryCompletion(&qc, CMDTAG_SELECT, 0);
	EndCommand(&qc, DestRemote, true);

StartLogicalReplication has a similar weirdness.
The more I look at this code, the more broken it seems.

Anyway, independently of whether walsender should be sending multiple
command-complete messages, I don't think I really approve of including
replication commands in the CommandTag enum.  It still seems like a
type pun.  However, it looks like we'd have to duplicate
SetQueryCompletion/EndCommand if we don't want to do that, so maybe
I'd better just hold my nose and do it that way.

regards, tom lane




Re: [HACKERS] [PATCH] Generic type subscripting

2020-09-15 Thread Pavel Stehule
st 9. 9. 2020 v 23:04 odesílatel Justin Pryzby 
napsal:

> On Wed, Aug 05, 2020 at 04:04:22PM +0200, Dmitry Dolgov wrote:
> > > On Sun, Aug 02, 2020 at 12:50:12PM +0200, Pavel Stehule wrote:
> > > >
> > > > > Maybe this could be salvaged by flushing 0005 in its current form
> and
> > > > > having the jsonb subscript executor do something like "if the
> current
> > > > > value-to-be-subscripted is a JSON array, then try to convert the
> textual
> > > > > subscript value to an integer".  Not sure about what the error
> handling
> > > > > rules ought to be like, though.
> > > >
> > > > I'm fine with the idea of separating 0005 patch and potentially
> prusuing
> > > > it as an independent item. Just need to rebase 0006, since Pavel
> > > > mentioned that it's a reasonable change he would like to see in the
> > > > final result.
> > >
> > > +1
> >
> > Here is what I had in mind. Worth noting that, as well as the original
>
> This seems to already hit a merge conflict (8febfd185).
> Would you re-rebase ?
>

This can be easily fixed. I may have found another issue.

create table foo(a jsonb);

postgres=# select * from foo;
┌───┐
│ a │
╞═══╡
│ [0, null, null, null, null, null, null, null, null, null, "ahoj"] │
└───┘
(1 row)

It is working as I expect

but

postgres=# truncate foo;
TRUNCATE TABLE
postgres=# insert into foo values('[]');
INSERT 0 1
postgres=# update foo set a[10] = 'ahoj';
UPDATE 1
postgres=# select * from foo;
┌──┐
│a │
╞══╡
│ ["ahoj"] │
└──┘
(1 row)
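
For clarity, the two cases side by side (assuming the patch is applied; the
initial contents for the first case are inferred from the output shown above):

insert into foo values('[0]');
update foo set a[10] = 'ahoj';   -- result: [0, null, ..., null, "ahoj"]

insert into foo values('[]');
update foo set a[10] = 'ahoj';   -- result: ["ahoj"], where null padding up to
                                 -- index 10 might have been expected instead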

Other parts look good. The plpgsql support is not part of this patch, but
it can be the next step. The implemented feature is interesting enough - it is
a simple, user-friendly interface for working with jsonb and, in the future,
with other types.

Regards

Pavel


> --
> Justin
>


Re: PG 13 release notes, first draft

2020-09-15 Thread Jonathan S. Katz
On 9/15/20 2:30 PM, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
>> On 9/15/20 2:11 PM, Tom Lane wrote:
>>> The other incompatibilities are only listed once, if they're minor.
>>> I was just about to commit the attached.
> 
>> Even better. +1
> 
> Pushed, thanks for looking.

Thanks for modifying...though I have a gripe about it being labeled a
gripe[1] ;) Though it gave me a good laugh...

Jonathan

[1]
https://git.postgresql.org/pg/commitdiff/d42c6176446440b185fcb95c214b7e40d5758b60





Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 11:44 PM Jeff Davis  wrote:
> Attached. HashAgg seems to accurately reflect the filesize, as does
> Sort.

For the avoidance of doubt: I think that this is the right way to go,
and that it should be committed shortly, before we stamp 13.0. The
same goes for hashagg-release-write-buffers.patch.

Thanks
-- 
Peter Geoghegan




Re: PG 13 release notes, first draft

2020-09-15 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On 9/15/20 2:11 PM, Tom Lane wrote:
>> The other incompatibilities are only listed once, if they're minor.
>> I was just about to commit the attached.

> Even better. +1

Pushed, thanks for looking.

regards, tom lane




Re: PG 13 release notes, first draft

2020-09-15 Thread Jonathan S. Katz
On 9/15/20 2:11 PM, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
>> On 9/15/20 1:05 PM, Tom Lane wrote:
>>> After thinking a bit, I'm inclined to agree that we should move it
>>> to "Incompatibilities".  It is a core server change (third-party
>>> extensions don't have a choice to opt out, as per postgis' issue),
>>> and even if it's trivial, we have some even-more-trivial issues
>>> listed there, like command tag changes.
> 
>> How about this?
> 
> The other incompatibilities are only listed once, if they're minor.
> I was just about to commit the attached.

Even better. +1

Jonathan





Re: PG 13 release notes, first draft

2020-09-15 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On 9/15/20 1:05 PM, Tom Lane wrote:
>> After thinking a bit, I'm inclined to agree that we should move it
>> to "Incompatibilities".  It is a core server change (third-party
>> extensions don't have a choice to opt out, as per postgis' issue),
>> and even if it's trivial, we have some even-more-trivial issues
>> listed there, like command tag changes.

> How about this?

The other incompatibilities are only listed once, if they're minor.
I was just about to commit the attached.

regards, tom lane

diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml
index 1f130ca1fe..3fd97c9d10 100644
--- a/doc/src/sgml/release-13.sgml
+++ b/doc/src/sgml/release-13.sgml
@@ -270,6 +270,25 @@ Author: Tom Lane 
  
 
 
+ 
+
+
+  
+   Remove support for upgrading unpackaged (pre-9.1) extensions (Tom Lane)
+  
+
+  
+   The FROM option
+   of CREATE
+   EXTENSION is no longer supported.  Any installations
+   still using unpackaged extensions should upgrade them to a packaged
+   version before updating to PostgreSQL 13.
+  
+ 
+
 
 
-
-  
-   Remove support for upgrading unpackaged (pre-9.1) extensions (Tom Lane)
-  
- 
-
- 
-


Re: PG 13 release notes, first draft

2020-09-15 Thread Jonathan S. Katz
On 9/15/20 1:05 PM, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
>> On 9/15/20 12:05 PM, Tom Lane wrote:
>>> Ah.  OK, we can certainly extend it to mention that.  How about
>>> (not bothering with markup yet)
>>>
>>>  Remove support for upgrading unpackaged (pre-9.1) extensions (Tom Lane)
>>> +
>>> +The FROM option of CREATE EXTENSION is no longer supported.
> 
>> +1.
> 
>> With that in place, I'm more ambivalent to whether or not it's mentioned
>> in the incompatibilities section as well, though would lean towards
>> having a mention of it there as it technically is one. But I don't feel
>> too strongly about it.
> 
> After thinking a bit, I'm inclined to agree that we should move it
> to "Incompatibilities".  It is a core server change (third-party
> extensions don't have a choice to opt out, as per postgis' issue),
> and even if it's trivial, we have some even-more-trivial issues
> listed there, like command tag changes.

How about this?

Jonathan
diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml
index 0ca970e452..bed56b0f3d 100644
--- a/doc/src/sgml/release-13.sgml
+++ b/doc/src/sgml/release-13.sgml
@@ -324,6 +324,19 @@ Author: Peter Geoghegan 
  
 
 
+
+
+
+ 
+  Remove support for CREATE
+  EXTENSION ... FROM
+  (Tom Lane)
+ 
+
+

 
   
@@ -2941,6 +2954,12 @@ Author: Tom Lane 
   
Remove support for upgrading unpackaged (pre-9.1) extensions (Tom Lane)
   
+
+  
+   The FROM option of
+   CREATE EXTENSION
+is no longer supported.
+  
  
 
  




Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-15 Thread Andres Freund
Hi,

On 2020-09-15 12:52:25 +0530, Dilip Kumar wrote:
> On Tue, Sep 15, 2020 at 11:14 AM Andres Freund  wrote:
> >
> > On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:
> > > What problem do you see if we set xmax to the InvalidTransactionId and
> > > HEAP_XMAX_INVALID flag in the infomask ?
> >
> > 1) It'll make a dead tuple appear live. You cannot do this for tuples
> >with an xid below the horizon.
> 
> How is it possible?  Because tuple which has a committed xmax and the
> xmax is older than the oldestXmin, should not come for freezing unless
> it is lock_only xid (because those tuples are already gone).

There've been several cases of this in the past. A fairly easy way is a
corrupted relfrozenxid (of which there are many examples).

You simply cannot just assume that everything is OK and argue that
that's why it's ok to fix data corruption in some approximate manner. By
definition everything *is not ok* if you ever come here.
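
As an aside, a generic way to eyeball relfrozenxid values across a database
(not something proposed in this thread, just a common sanity check) is a
query along these lines:

SELECT relname, relfrozenxid, age(relfrozenxid)
FROM pg_class
WHERE relkind IN ('r', 'm', 't')
ORDER BY age(relfrozenxid) DESC
LIMIT 10;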

Greetings,

Andres Freund




Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

2020-09-15 Thread Alvaro Herrera
I think you meant _IONBF instead of _IOFBF -- otherwise it's at odds
with the comment you add.  But what is the justification for that
addition?  I don't see us doing that anywhere else.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

2020-09-15 Thread Ranier Vilela
On Fri, Sep 11, 2020 at 18:43, Ranier Vilela wrote:

> On Fri, Sep 11, 2020 at 17:44, Tom Lane wrote:
>
>> Ranier Vilela  writes:
>> > New version, with support to read Virtual File (pipe, FIFO and socket).
>> > With assert, in case, erroneous, of trying to read a pipe, with offset.
>>
>> Really, could you do a little more thinking and testing of your own,
>> rather than expecting the rest of us to point out holes in your thinking?
>>
> Yes, I can.
>
>
>> * bytes_to_read == 0 is a legal case.
>>
> Ok. Strange call to read a file with zero bytes.
>
>
>> * "Assert(seek_offset != 0)" is an equally awful idea.
>>
> I was trying to anticipate the case of reading a Virtual File with
> bytes_to_read == -1 and seek_offset == 0,
> which is bound to fail in fseeko.
> In any case, with 96d1f423f95d it will fail for any Virtual File.
>
>
>>
>> * Removing the seek code from the negative-bytes_to_read path
>> is just broken.
>>
> Ok.
>
>
>> * The only reason this code is shorter than the previous version is
>> you took out all the comments explaining what it was doing, and
>> failed to replace them.  This is just as subtle as before, so I
>> don't find that acceptable.
>>
> It was not my intention.
>
>
>>
>> I do agree that it might be worth skipping the fseeko call in the
>> probably-common case where seek_offset is zero.  Otherwise I don't
>> see much reason to change what's there.
>>
> Well, I think that v3 is a little better, but it's just my opinion.
> v3 tries to copy directly into the final buffer, which will certainly be a
> little faster.
>
v4 patch, simple benchmark: reading a binary file larger than 6GB.

PostgreSQL main:
postgres=# \timing on
Timing is on.
postgres=# select
pg_read_file('c:\users\ranier\downloads\macOS_Catalina.7z');
ERROR:  file length too large
Time: 3068,459 ms (00:03,068)
postgres=#

PostgreSQL patched (v4 read_binary_file):
postgres=# \timing on
Timing is on.
postgres=# select
pg_read_file('c:\users\ranier\downloads\macOS_Catalina.7z');
ERROR:  file length too large
Time: 701,035 ms
postgres=#

4.37x faster, very good.

v4 handles pipes very well.
Tested with
https://docs.microsoft.com/en-us/windows/win32/ipc/multithreaded-pipe-server

Terminal one:
C:\usr\src\tests\pipes>pipe_server

Pipe Server: Main thread awaiting client connection on \\.\pipe\mynamedpipe

Terminal two:
postgres=# \timing on
Timing is on.
postgres=# select pg_read_file('\\.\pipe\mynamedpipe');

Terminal one:
Client connected, creating a processing thread.

Pipe Server: Main thread awaiting client connection on \\.\pipe\mynamedpipe
InstanceThread created, receiving and processing messages.

Pipe Server: Main thread awaiting client connection on \\.\pipe\mynamedpipe
InstanceThread created, receiving and processing messages.
^C
C:\usr\src\tests\pipes>

Terminal two:
postgres=# select pg_read_file('\\.\pipe\mynamedpipe');
pg_read_file
--

(1 row)


Time: 77267,481 ms (01:17,267)
postgres=#

regards,
Ranier Vilela


v4-0001-simplified_read_binary_file.patch
Description: Binary data


Re: pg_restore causing deadlocks on partitioned tables

2020-09-15 Thread Tom Lane
Amit Langote  writes:
> On Tue, Sep 15, 2020 at 7:28 AM Tom Lane  wrote:
>> AFAICS, it is utterly silly for InitResultRelInfo to be forcing
>> a partition qual to be computed when we might not need it.
>> We could flush ResultRelInfo.ri_PartitionCheck altogether and
>> have anything that was reading it instead do
>> RelationGetPartitionQual(ResultRelInfo.ri_RelationDesc).

> Yeah, makes sense.  Please see attached a patch to do that.

Just eyeballing this, this bit seems bogus:

@@ -1904,7 +1903,7 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
Bitmapset  *insertedCols;
Bitmapset  *updatedCols;
 
-   Assert(constr || resultRelInfo->ri_PartitionCheck);
+   Assert(constr);
 
if (constr && constr->has_not_null)
{

It does look like all the call sites check for the rel having constraints
before calling, so the modified Assert may not be failing ... but why
are we asserting and then also making a run-time test?

My inclination is to just drop the Assert as useless.  There's no
particular reason for this function to make it a hard requirement
that callers optimize away unnecessary calls.

I'm suspicious of the business in ExecPartitionCheck about constructing
a constant-true expression.  I think executing that is likely to add
more cycles than you save by not running through this code each time;
once relcache has cached the knowledge that the partition expression
is empty, all the steps here are pretty darn cheap ... which no doubt
is why there wasn't a comparable optimization already.  If you're
really concerned about that it'd be better to add a separate
"bool ri_PartitionCheckExprValid" flag.  (Perhaps that's worth doing
to avoid impacts from relcache flushes; though I remain unconvinced
that optimizing for the empty-expression case is useful.)

regards, tom lane




Re: Yet another fast GiST build

2020-09-15 Thread Heikki Linnakangas

On 15/09/2020 19:46, Andrey M. Borodin wrote:




On 15 Sept 2020, at 16:36, Heikki Linnakangas wrote:

Another patch version, fixed a few small bugs pointed out by assertion failures 
in the regression tests.

- Heikki



These changes in create_index.out do not seem correct to me

  SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
  f1
  ---
- (0,0)
   (1e-300,-1e-300)
+ (0,0)

I did not figure out the root cause yet. We do not touch anything related to 
distance computation..


Ah yeah, that's subtle. Those rows are considered to be equally distant 
from (0, 1), given the precision of the <-> operator:


regression=#  SELECT f1, f1 <-> '0,1' FROM point_tbl ORDER BY f1 <-> '0,1';
        f1         |     ?column?
-------------------+------------------
 (0,0)             |                1
 (1e-300,-1e-300)  |                1
 (-3,4)            | 4.24264068711929
 (-10,0)           | 10.0498756211209
 (10,10)           | 13.4536240470737
 (-5,-12)          | 13.9283882771841
 (5.1,34.5)        |  33.885985303662
 (1e+300,Infinity) |         Infinity
 (NaN,NaN)         |              NaN
                   |
(10 rows)

It is arbitrary which one you get first.

It's not very nice to have a not-well defined order of rows in the 
expected output, as it could change in the future if we change the index 
build algorithm again. But we have plenty of cases that depend on the 
physical row order, and it's not like this changes very often, so I 
think it's ok to just memorize the new order in the expected output.


- Heikki




Re: PG 13 release notes, first draft

2020-09-15 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On 9/15/20 12:05 PM, Tom Lane wrote:
>> Ah.  OK, we can certainly extend it to mention that.  How about
>> (not bothering with markup yet)
>> 
>>  Remove support for upgrading unpackaged (pre-9.1) extensions (Tom Lane)
>> +
>> +The FROM option of CREATE EXTENSION is no longer supported.

> +1.

> With that in place, I'm more ambivalent to whether or not it's mentioned
> in the incompatibilities section as well, though would lean towards
> having a mention of it there as it technically is one. But I don't feel
> too strongly about it.

After thinking a bit, I'm inclined to agree that we should move it
to "Incompatibilities".  It is a core server change (third-party
extensions don't have a choice to opt out, as per postgis' issue),
and even if it's trivial, we have some even-more-trivial issues
listed there, like command tag changes.
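
For readers hitting this later, the removed syntax is the pre-9.1 upgrade
path; the extension name below is only an example:

CREATE EXTENSION postgis FROM unpackaged;   -- accepted through v12
CREATE EXTENSION postgis;                   -- the only form in v13

In v13 the FROM clause is rejected with an error instead of performing an
upgrade from loose ("unpackaged") objects.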

regards, tom lane




Re: "Unified logging system" breaks access to pg_dump debug outputs

2020-09-15 Thread Tom Lane
I wrote:
> Alternatively, we might consider inventing an additional logging.c
> function pg_logging_increase_level() with the obvious semantics, and
> make the various programs just call that when they see a -v switch.
> That would be a slightly bigger patch, but it would more easily support
> programs with a range of useful verbosities, so maybe that's a better
> idea.

After further thought, I concluded that's a clearly superior solution,
so 0001 attached does it like that.  After noting that the enum values
are in the opposite direction from how I thought they went, I realized
that "increase_level" might be a bit ambiguous, so I now propose to
call it pg_logging_increase_verbosity.

> (Note: it seems possible that the theoretical multiple verbosity
> levels in pg_dump were already broken before cc8d41511, because
> right offhand I do not see any code that that removed that would
> have allowed invoking the higher levels either.

Closer inspection says this was almost certainly true, because
I discovered that pg_dump -v -v crashes if you don't specify
an output filename :-(.  So this has probably been unreachable
at least since we went over to using our own snprintf always;
before that, depending on platform, it might've been fine.
So we also need 0002 attached to fix that.

regards, tom lane

diff --git a/src/bin/pg_archivecleanup/pg_archivecleanup.c b/src/bin/pg_archivecleanup/pg_archivecleanup.c
index e454bae767..12338e3bb2 100644
--- a/src/bin/pg_archivecleanup/pg_archivecleanup.c
+++ b/src/bin/pg_archivecleanup/pg_archivecleanup.c
@@ -302,7 +302,7 @@ main(int argc, char **argv)
 		switch (c)
 		{
 			case 'd':			/* Debug mode */
-pg_logging_set_level(PG_LOG_DEBUG);
+pg_logging_increase_verbosity();
 break;
 			case 'n':			/* Dry-Run mode */
 dryrun = true;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 784bceaec3..53432acefc 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -512,7 +512,7 @@ main(int argc, char **argv)
 
 			case 'v':			/* verbose */
 g_verbose = true;
-pg_logging_set_level(PG_LOG_INFO);
+pg_logging_increase_verbosity();
 break;
 
 			case 'w':
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index 97d2b8dac1..219ca963c3 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -283,7 +283,7 @@ main(int argc, char *argv[])
 
 			case 'v':
 verbose = true;
-pg_logging_set_level(PG_LOG_INFO);
+pg_logging_increase_verbosity();
 appendPQExpBufferStr(pgdumpopts, " -v");
 break;
 
diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index 544ae3bc5c..eebf0d300b 100644
--- a/src/bin/pg_dump/pg_restore.c
+++ b/src/bin/pg_dump/pg_restore.c
@@ -245,7 +245,7 @@ main(int argc, char **argv)
 
 			case 'v':			/* verbose */
 opts->verbose = 1;
-pg_logging_set_level(PG_LOG_INFO);
+pg_logging_increase_verbosity();
 break;
 
 			case 'w':
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 23fc749e44..0ec52cb032 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -181,7 +181,7 @@ main(int argc, char **argv)
 
 			case 3:
 debug = true;
-pg_logging_set_level(PG_LOG_DEBUG);
+pg_logging_increase_verbosity();
 break;
 
 			case 'D':			/* -D or --target-pgdata */
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 332eabf637..663d7d292a 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -5522,7 +5522,7 @@ main(int argc, char **argv)
 pgport = pg_strdup(optarg);
 break;
 			case 'd':
-pg_logging_set_level(PG_LOG_DEBUG);
+pg_logging_increase_verbosity();
 break;
 			case 'c':
 benchmarking_option_set = true;
diff --git a/src/common/logging.c b/src/common/logging.c
index 6a3a437a34..d9632fffc8 100644
--- a/src/common/logging.c
+++ b/src/common/logging.c
@@ -157,12 +157,30 @@ pg_logging_config(int new_flags)
 	log_flags = new_flags;
 }
 
+/*
+ * pg_logging_init sets the default log level to INFO.  Programs that prefer
+ * a different default should use this to set it, immediately afterward.
+ */
 void
 pg_logging_set_level(enum pg_log_level new_level)
 {
 	__pg_log_level = new_level;
 }
 
+/*
+ * Command line switches such as --verbose should invoke this.
+ */
+void
+pg_logging_increase_verbosity(void)
+{
+	/*
+	 * The enum values are chosen such that we have to decrease __pg_log_level
+	 * in order to become more verbose.
+	 */
+	if (__pg_log_level > PG_LOG_NOTSET + 1)
+		__pg_log_level--;
+}
+
 void
 pg_logging_set_pre_callback(void (*cb) (void))
 {
diff --git a/src/include/common/logging.h b/src/include/common/logging.h
index 028149c7a1..3205b8fef9 100644
--- a/src/include/common/logging.h
+++ b/src/include/common/logging.h
@@ -66,6 +66,7 @@ extern enum pg_log_level __pg_log_level;
 void		pg_logging_init(const char *argv0);
 void		pg_logging_config(int 

Re: Yet another fast GiST build

2020-09-15 Thread Andrey M. Borodin



> On 15 Sept 2020, at 16:36, Heikki Linnakangas wrote:
> 
> Another patch version, fixed a few small bugs pointed out by assertion 
> failures in the regression tests.
> 
> - Heikki
> 

These changes in create_index.out do not seem correct to me

 SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
 f1 
 ---
- (0,0)
  (1e-300,-1e-300)
+ (0,0)

I did not figure out the root cause yet. We do not touch anything related to 
distance computation..

Best regards, Andrey Borodin.



Re: PG 13 release notes, first draft

2020-09-15 Thread Jonathan S. Katz
On 9/15/20 12:05 PM, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
>> On 9/15/20 11:45 AM, Tom Lane wrote:
>>> It is listed already in the "Additional Modules" section (about line 2940
>>> in release-13.sgml as of right now).
> 
>> ...sort of. It talks about the feature, but does not talk about the
>> syntax removal, which is what I was originally searching for in the
>> release notes.
> 
> Ah.  OK, we can certainly extend it to mention that.  How about
> (not bothering with markup yet)
> 
>  Remove support for upgrading unpackaged (pre-9.1) extensions (Tom Lane)
> +
> +The FROM option of CREATE EXTENSION is no longer supported.

+1.

With that in place, I'm more ambivalent to whether or not it's mentioned
in the incompatibilities section as well, though would lean towards
having a mention of it there as it technically is one. But I don't feel
too strongly about it.

Jonathan





PostgreSQL 13 RC 1 release announcement draft

2020-09-15 Thread Jonathan S. Katz
Hi,

Attached is a draft of the PostgreSQL 13 RC 1 release announcement.

If you have feedback, please be sure you've left it no later than
2020-09-16 AoE.

Thanks!

Jonathan
PostgreSQL 13 RC 1 Released
===

The PostgreSQL Global Development Group announces that the first release
candidate of PostgreSQL 13 is now available for download. As a release
candidate, PostgreSQL 13 RC 1 will be mostly identical to the initial release of
PostgreSQL 13, though some more fixes may be applied prior to the general
availability of PostgreSQL 13.

The planned date for the general availability of PostgreSQL 13 is September 24,
2020. Please see the "Release Schedule" section for more details.

Upgrading to PostgreSQL 13 RC 1
---

To upgrade to PostgreSQL 13 RC 1 from Beta 3 or an earlier version of
PostgreSQL 13, you will need to use a strategy similar to upgrading between
major versions of PostgreSQL (e.g. `pg_upgrade` or `pg_dump` / `pg_restore`).
For more information, please visit the documentation section on
[upgrading](https://www.postgresql.org/docs/13/static/upgrading.html).

Changes Since 13 Beta 3
---

There have been many bug fixes for PostgreSQL 13 reported during the Beta 3
period and applied to this release candidate. These include:

* Adjustments to the costing model for hash aggregates that spill to disk.
* Adjustments to the output of `EXPLAIN (BUFFERS)`.
* Display the stats target of extended statistics in the output of `\d`.

For a detailed list of fixes, please visit the
[open 
items](https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#resolved_before_13rc1)
page.

Release Schedule


This is the first release candidate for PostgreSQL 13. Unless an issue is
discovered that warrants a delay or an additional release candidate,
PostgreSQL 13 should be made generally available on September 24, 2020.

For further information please see the
[Beta Testing](https://www.postgresql.org/developer/beta/) page.

Links
-

* [Download](https://www.postgresql.org/download/)
* [Beta Testing Information](https://www.postgresql.org/developer/beta/)
* [PostgreSQL 13 RC 1 Release 
Notes](https://www.postgresql.org/docs/13/static/release-13.html)
* [PostgreSQL 13 Open 
Issues](https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items)
* [Submit a Bug](https://www.postgresql.org/account/submitbug/)
* [Follow @postgresql on Twitter](https://www.twitter.com/postgresql)




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Sep 14, 2020 at 8:48 PM Peter Geoghegan  wrote:
>> On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis  wrote:
>>> Sure. Will backporting either patch into REL_13_STABLE now interfere
>>> with RC1 release in any way?

> It is okay to skip RC1 and commit the patch/patches for 13 itself.
> Please wait until after Tom has pushed the rc1 tag. This will probably
> happen tomorrow.

I plan to tag rc1 in around six hours, ~2200UTC today, barring
trouble reports from packagers (none so far).  Feel free to push your
patch once the tag appears.

regards, tom lane




Re: PG 13 release notes, first draft

2020-09-15 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On 9/15/20 11:45 AM, Tom Lane wrote:
>> It is listed already in the "Additional Modules" section (about line 2940
>> in release-13.sgml as of right now).

> ...sort of. It talks about the feature, but does not talk about the
> syntax removal, which is what I was originally searching for in the
> release notes.

Ah.  OK, we can certainly extend it to mention that.  How about
(not bothering with markup yet)

 Remove support for upgrading unpackaged (pre-9.1) extensions (Tom Lane)
+
+The FROM option of CREATE EXTENSION is no longer supported.

regards, tom lane




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 8:48 PM Peter Geoghegan  wrote:
> On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis  wrote:
> > Sure. Will backporting either patch into REL_13_STABLE now interfere
> > with RC1 release in any way?
>
> The RMT will discuss this.

It is okay to skip RC1 and commit the patch/patches for 13 itself.
Please wait until after Tom has pushed the rc1 tag. This will probably
happen tomorrow.

-- 
Peter Geoghegan




Re: PG 13 release notes, first draft

2020-09-15 Thread Jonathan S. Katz
On 9/15/20 11:45 AM, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
>> On a different note, I became aware of this[1] and noticed that dropping
>> "CREATE EXTENSION ... FROM" was not listed in the incompatibilities
>> section, so proposing the attached. I have no strong opinions on the
>> final wording, mainly wanted to get it listed.
> 
> It is listed already in the "Additional Modules" section (about line 2940
> in release-13.sgml as of right now).

...sort of. It talks about the feature, but does not talk about the
syntax removal, which is what I was originally searching for in the
release notes.

>  I don't know Bruce's exact reasoning
> for not including it in the "Incompatibilities" section, but I tend to
> agree that it shouldn't be significant to any real-world user. 

I do tend to agree with this intuitively but don't have any data to back it
up either way. That said, we did modify the command and it would be good
to at least mention the fact we dropped "FROM" somewhere in the release
notes. It provides a good reference in case someone reports an "issue"
in the future stemming from dropping the "FROM" keyword, we can say "oh
it changed in PG13, see XYZ."

(Speaking from having used the release notes to perform similar
troubleshooting).

> I think
> that the postgis testing issue you reference is just an ancient test
> case that they should drop as no longer relevant.

Sure, and AIUI they're going to do that, mostly that was a reference
point to the changed syntax.

Jonathan





Re: PG 13 release notes, first draft

2020-09-15 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On a different note, I became aware of this[1] and noticed that dropping
> "CREATE EXTENSION ... FROM" was not listed in the incompatibilities
> section, so proposing the attached. I have no strong opinions on the
> final wording, mainly wanted to get it listed.

It is listed already in the "Additional Modules" section (about line 2940
in release-13.sgml as of right now).  I don't know Bruce's exact reasoning
for not including it in the "Incompatibilities" section, but I tend to
agree that it shouldn't be significant to any real-world user.  I think
that the postgis testing issue you reference is just an ancient test
case that they should drop as no longer relevant.

regards, tom lane




Re: PG 13 release notes, first draft

2020-09-15 Thread Jonathan S. Katz
On 9/15/20 9:49 AM, Jonathan S. Katz wrote:
> On 9/15/20 5:22 AM, Masahiko Sawada wrote:
>> On Tue, 15 Sep 2020 at 13:56, Peter Eisentraut
>>  wrote:
>>>
>>> On 2020-09-09 22:57, Jonathan S. Katz wrote:
 +
 + 
 +  Parallelized vacuuming of B-tree indexes
 + 
 +
>>>
>>> I don't think B-tree indexes are relevant here.  AFAICT, this feature
>>> applies to all indexes.
>>>
>>
>> Yes, parallel vacuum applies to all types of indexes provided by
>> PostgreSQL binary, and other types of indexes also can use it.
> 
> I'm not sure where I got B-tree from. I've attached a correction.

On a different note, I became aware of this[1] and noticed that dropping
"CREATE EXTENSION ... FROM" was not listed in the incompatibilities
section, so proposing the attached. I have no strong opinions on the
final wording, mainly wanted to get it listed.

Thanks,

Jonathan

[1] https://trac.osgeo.org/postgis/ticket/4753
diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml
index 0ca970e452..8fbaff4623 100644
--- a/doc/src/sgml/release-13.sgml
+++ b/doc/src/sgml/release-13.sgml
@@ -324,6 +324,19 @@ Author: Peter Geoghegan 
  
 
 
+
+
+
+ 
+  Remove support for CREATE
+  EXTENSION ... FROM
+  (Tom Lane)
+ 
+
+

 
   




Re: PG 13 release notes, first draft

2020-09-15 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On 9/15/20 5:22 AM, Masahiko Sawada wrote:
>> On Tue, 15 Sep 2020 at 13:56, Peter Eisentraut
>>  wrote:
>>> I don't think B-tree indexes are relevant here.  AFAICT, this feature
>>> applies to all indexes.

> I'm not sure where I got B-tree from. I've attached a correction.

Right, pushed.  I clarified the main entry for this a tad, too.

regards, tom lane




Re: pg_restore causing deadlocks on partitioned tables

2020-09-15 Thread Tom Lane
Amit Langote  writes:
> On Tue, Sep 15, 2020 at 10:47 PM Tom Lane  wrote:
>> Ah, right.  That seems like a bug but we have not attempted to fix it.

> IIRC, when this behavior was brought up as a bug in past discussions,
> it was decided that it will be fixed when NOT NULL constraints are
> represented using pg_constraint entries.

Yeah, that matches my recollection too.

>> But we could restrict the optimization to partitioned tables, where
>> the assumption does hold, no?

> Yeah, seems safe in their case.

And that's sufficient to cover pg_restore's issue, since it's
only going to be trying to do this for partitioned tables.

I'll wait till tomorrow to push this, since we're still in
freeze mode for the v13 branch.

regards, tom lane




Re: Optimising compactify_tuples()

2020-09-15 Thread Jakub Wartak
David Rowley wrote:

> I've attached patches in git format-patch format. I'm proposing to commit 
> these in about 48 hours time unless there's some sort of objection before 
> then.

Hi David, no objections at all. I've just got reaffirming results here; as per
[1] (the SLRU thread, but with combined results from the qsort testing), I've
repeated the crash-recovery tests here again:

TEST0a: check-world passes
TEST0b: brief check: the DB after recovery returns correct data that was present
only in the WAL stream - SELECT sum(c) from sometable

TEST1: workload profile test as per standard TPC-B [2], with the majority of
records in the WAL stream being Heap/HOT_UPDATE, on the same system with NVMe
as described there.

results of master (62e221e1c01e3985d2b8e4b68c364f8486c327ab) @ 15/09/2020 as 
baseline:
15.487, 1.013
15.789, 1.033
15.942, 1.118

the profile looks mostly similar:
17.14%  postgres  libc-2.17.so[.] __memmove_ssse3_back
---__memmove_ssse3_back
   compactify_tuples
   PageRepairFragmentation
   heap2_redo
   StartupXLOG
 8.16%  postgres  postgres[.] hash_search_with_hash_value
---hash_search_with_hash_value
   |--4.49%--BufTableLookup
[..]
--3.67%--smgropen

master with 2 patches by David 
(v8-0001-Optimize-compactify_tuples-function.patch + 
v8-0002-Report-resource-usage-at-the-end-of-recovery.patch): 
14.236, 1.02
14.431, 1.083
14.256, 1.02

so 9-10% faster in this simple verification check. If I had pgbench running,
the result would probably be better. The profile is similar:

13.88%  postgres  libc-2.17.so[.] __memmove_ssse3_back
---__memmove_ssse3_back
--13.47%--compactify_tuples

10.61%  postgres  postgres[.] hash_search_with_hash_value
---hash_search_with_hash_value
   |--5.31%--smgropen
[..]
--5.31%--BufTableLookup


TEST2: update-only test, just as you performed in [3] to trigger the hotspot, 
with table fillfactor=85 and update.sql (100% updates, ~40% Heap/HOT_UPDATE 
[N], ~40-50% [record sizes]) with a slightly different amount of data.

results of master as baseline:
233.377, 0.727
233.233, 0.72
234.085, 0.729

with profile:
24.49%  postgres  postgres  [.] pg_qsort
17.01%  postgres  postgres  [.] PageRepairFragmentation
12.93%  postgres  postgres  [.] itemoffcompare
(sometimes I also saw ~13% in swapfunc)

results of master with the above 2 patches, a 2.3x speedup:
101.6, 0.709
101.837, 0.71
102.243, 0.712

with profile (so yup the qsort is gone, hurray!):

32.65%  postgres  postgres  [.] PageRepairFragmentation
---PageRepairFragmentation
   heap2_redo
   StartupXLOG
10.88%  postgres  postgres  [.] compactify_tuples
---compactify_tuples
 8.84%  postgres  postgres  [.] hash_search_with_hash_value

BTW: this message "redo done at 0/9749FF70 system usage: CPU: user: 13.46 s, 
system: 0.78 s, elapsed: 14.25 s" is priceless addition :) 

-J.

[1] - 
https://www.postgresql.org/message-id/flat/VI1PR0701MB696023DA7815207237196DC8F6570%40VI1PR0701MB6960.eurprd07.prod.outlook.com#188ad4e772615999ec427486d1066948
[2] - pgbench -i -s 100, pgbench -c8 -j8 -T 240, ~1.6GB DB with 2.3GB after 
crash in pg_wal to be replayed
[3] - 
https://www.postgresql.org/message-id/CAApHDvoKwqAzhiuxEt8jSquPJKDpH8DNUZDFUSX9P7DXrJdc3Q%40mail.gmail.com
 , in my case: pgbench -c 16 -j 16 -T 240 -f update.sql , ~1GB DB with 4.3GB 
after crash in pg_wal to be replayed



Re: pg_restore causing deadlocks on partitioned tables

2020-09-15 Thread Amit Langote
On Tue, Sep 15, 2020 at 10:47 PM Tom Lane  wrote:
> Amit Langote  writes:
> > On Tue, Sep 15, 2020 at 9:09 AM Tom Lane  wrote:
> >> I wrote a quick patch for this part.  It seems pretty safe and probably
> >> could be back-patched without fear.
>
> > The patch's theory that if the parent column has NOT NULL set then it
> > must be set in child tables too does not actually hold for plain
> > inheritance cases, because as shown above, NOT NULL can be dropped in
> > children independently of the parent.
>
> Ah, right.  That seems like a bug but we have not attempted to fix it.

IIRC, when this behavior was brought up as a bug in past discussions,
it was decided that it would be fixed when NOT NULL constraints are
represented using pg_constraint entries.

> But we could restrict the optimization to partitioned tables, where
> the assumption does hold, no?

Yeah, seems safe in their case.
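
To illustrate (made-up table names):

-- plain inheritance: the child can drop NOT NULL independently
CREATE TABLE inh_parent (a int NOT NULL);
CREATE TABLE inh_child () INHERITS (inh_parent);
ALTER TABLE inh_child ALTER COLUMN a DROP NOT NULL;   -- allowed

-- partitioning: the parent's NOT NULL is enforced on partitions
CREATE TABLE part_parent (a int NOT NULL) PARTITION BY LIST (a);
CREATE TABLE part_child PARTITION OF part_parent FOR VALUES IN (1);
ALTER TABLE part_child ALTER COLUMN a DROP NOT NULL;  -- rejected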

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: Force update_process_title=on in crash recovery?

2020-09-15 Thread Tom Lane
Thomas Munro  writes:
> Based on a couple of independent reports from users with no idea how
> to judge the progress of a system recovering from a crash, Christoph
> and I wondered if we should override update_process_title for the
> "recovering ..." message, at least until connections are allowed.  We
> already do that to set the initial titles.

> Crash recovery is a rare case where important information is reported
> through the process title that isn't readily available anywhere else,
> since you can't log in.  If you want to gauge  progress on a system
> that happened to crash with update_process_title set to off, your best
> hope is probably to trace the process or spy on the files it has open,
> to see which WAL segment it's accessing, but that's not very nice.

Seems like a good argument, but you'd have to be careful about the
final state when you stop overriding update_process_title --- it can't
be left looking like it's still-in-progress on some random WAL file.
(Compare my nearby gripes about walsenders being sloppy about their
pg_stat_activity and process title presentations.)

regards, tom lane




Re: Parallel copy

2020-09-15 Thread Bharath Rupireddy
On Fri, Sep 11, 2020 at 3:49 AM Greg Nancarrow  wrote:
>
> I couldn't use the original machine from which I obtained the previous
> results, but ended up using a 4-core CentOS7 VM, which showed a
> similar pattern in the performance results for this test case.
> I obtained the following results from loading a 2GB CSV file (100
> rows, 4 indexes):
>
> Copy TypeDuration (s)  Load factor
> ===
> Normal Copy190.891-
>
> Parallel Copy
> (#workers)
> 1210.947   0.90
>
Hi Greg,

I tried to recreate the test case (attached) and I didn't find much
difference with the custom postgresql.conf file.
Test case: 25 tuples, 4 indexes (composite indexes with 10
columns), 3.7GB, 100 columns (as suggested by you, and all the
varchar(255) columns contain 255 characters), exec time in sec.

With custom postgresql.conf [1], I removed and recreated the data
directory after every run (I couldn't perform an OS page cache flush
for some reason, so I chose to recreate the data directory instead, for
testing purposes):
 HEAD: 129.547, 128.624, 128.890
 Patch: 0 workers - 130.213, 131.298, 130.555
 Patch: 1 worker - 127.757, 125.560, 128.275

With default postgresql.conf, removed and recreated the data directory
after every run:
 HEAD: 138.276, 150.472, 153.304
 Patch: 0 workers - 162.468,  149.423, 159.137
 Patch: 1 worker - 136.055, 144.250, 137.916

Few questions:
 1. Was the run performed with default postgresql.conf file? If not,
what are the changed configurations?
 2. Are the readings for normal copy (190.891 sec, mentioned by you
above) taken on HEAD or with the patch, 0 workers? How much is the runtime
with your test case on HEAD (without patch) and 0 workers (with patch)?
 3. Was the run performed on release build?
 4. Were the readings taken on multiple runs(say 3 or 4 times)?

[1] - Postgres configuration used for above testing:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
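
For reference, each timed run was essentially of this shape (a sketch; the
file path is made up, and WITH (PARALLEL n) stands for the worker-count
option added by the patch set as I understand it):

TRUNCATE t100_1;
COPY t100_1 FROM '/tmp/t100.csv' WITH (FORMAT csv);              -- HEAD
COPY t100_1 FROM '/tmp/t100.csv' WITH (FORMAT csv, PARALLEL 1);  -- patch, 1 worker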

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
DROP TABLE IF EXISTS t100_1;
CREATE TABLE t100_1 (
  c1_s_1 bigserial,
  c2_vc_1 varchar(255),
  c3_d_1 date,
  c4_n_1 numeric,
  c5_vc_2 varchar(255),
  c6_d_2 date,
  c7_vc_3 varchar(255),
  c8_t_1 time without time zone,
  c9_vc_4 varchar(255),
  c10_vc_5 varchar(255),
  c11_t_2 time without time zone,
  c12_vc_6 varchar(255),
  c13_vc_7 varchar(255),
  c14_n_2 numeric,
  c15_d_3 date,
  c16_t_3 time without time zone,
  c17_n_3 numeric,
  c18_vc_8 varchar(255),
  c19_d_4 date,
  c20_vc_9 varchar(255),
  c21_t_4 time without time zone,
  c22_vc_10 varchar(255),
  c23_d_5 date,
  c24_vc_11 varchar(255),
  c25_t_5 time without time zone,
  c26_t_6 time without time zone,
  c27_vc_12 varchar(255),
  c28_vc_13 varchar(255),
  c29_d_6 date,
  c30_vc_14 varchar(255),
  c31_vc_15 varchar(255),
  c32_t_7 time without time zone,
  c33_t_8 time without time zone,
  c34_vc_16 varchar(255),
  c35_vc_17 varchar(255),
  c36_d_7 date,
  c37_d_8 date,
  c38_vc_18 varchar(255),
  c39_t_9 time without time zone,
  c40_vc_19 varchar(255),
  c41_vc_20 varchar(255),
  c42_vc_21 varchar(255),
  c43_vc_22 varchar(255),
  c44_t_10 time without time zone,
  c45_d_9 date,
  c46_vc_23 varchar(255),
  c47_t_11 time without time zone,
  c48_vc_24 varchar(255),
  c49_vc_25 varchar(255),
  c50_vc_26 varchar(255),
  c51_d_10 date,
  c52_vc_27 varchar(255),
  c53_vc_28 varchar(255),
  c54_vc_29 varchar(255),
  c55_vc_30 varchar(255),
  c56_d_11 date,
  c57_vc_31 varchar(255),
  c58_vc_32 varchar(255),
  c59_vc_33 varchar(255),
  c60_vc_34 varchar(255),
  c61_vc_35 varchar(255),
  c62_vc_36 varchar(255),
  c63_vc_37 varchar(255),
  c64_vc_38 varchar(255),
  c65_vc_39 varchar(255),
  c66_n_4 numeric,
  c67_vc_40 varchar(255),
  c68_vc_41 varchar(255),
  c69_vc_42 varchar(255),
  c70_vc_43 varchar(255),
  c71_n_5 numeric,
  c72_vc_44 varchar(255),
  c73_n_6 numeric,
  c74_vc_45 varchar(255),
  c75_vc_46 varchar(255),
  c76_vc_47 varchar(255),
  c77_n_7 numeric,
  c78_n_8 numeric,
  c79_d_12 date,
  c80_n_9 numeric,
  c81_vc_48 varchar(255),
  c82_vc_49 varchar(255),
  c83_vc_50 varchar(255),
  c84_n_10 numeric,
  c85_vc_51 varchar(255),
  c86_vc_52 varchar(255),
  c87_vc_53 varchar(255),
  c88_d_13 date,
  c89_n_13 numeric,
  c90_vc_54 varchar(255),
  c91_vc_55 varchar(255),
  c92_n_11 numeric,
  c93_n_12 numeric,
  c94_vc_56 varchar(255),

Re: PG 13 release notes, first draft

2020-09-15 Thread Jonathan S. Katz
On 9/15/20 5:22 AM, Masahiko Sawada wrote:
> On Tue, 15 Sep 2020 at 13:56, Peter Eisentraut
>  wrote:
>>
>> On 2020-09-09 22:57, Jonathan S. Katz wrote:
>>> +
>>> + 
>>> +  Parallelized vacuuming of B-tree indexes
>>> + 
>>> +
>>
>> I don't think B-tree indexes are relevant here.  AFAICT, this feature
>> applies to all indexes.
>>
> 
> Yes, parallel vacuum applies to all types of indexes provided by
> PostgreSQL binary, and other types of indexes also can use it.

I'm not sure where I got B-tree from. I've attached a correction.

Thanks,

Jonathan
diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml
index 0ca970e452..f7852c8618 100644
--- a/doc/src/sgml/release-13.sgml
+++ b/doc/src/sgml/release-13.sgml
@@ -37,7 +37,7 @@
 
 
  
-  Parallelized vacuuming of B-tree indexes
+  Parallelized vacuuming of indexes
  
 
 


signature.asc
Description: OpenPGP digital signature


Re: pg_restore causing deadlocks on partitioned tables

2020-09-15 Thread Tom Lane
Amit Langote  writes:
> On Tue, Sep 15, 2020 at 9:09 AM Tom Lane  wrote:
>> I wrote a quick patch for this part.  It seems pretty safe and probably
>> could be back-patched without fear.

> The patch's theory that if the parent column has NOT NULL set then it
> must be set in child tables too does not actually hold for plain
> inheritance cases, because as shown above, NOT NULL can be dropped in
> children independently of the parent.

Ah, right.  That seems like a bug but we have not attempted to fix it.
But we could restrict the optimization to partitioned tables, where
the assumption does hold, no?

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-15 Thread Ashutosh Bapat
On Fri, Sep 11, 2020 at 4:37 PM Masahiko Sawada
 wrote:
>
> Considering the design without the resolver process, I think we can
> easily replace the latter with the manual resolution. OTOH, it's not
> easy for the former. I have no idea about better design for now,
> although, as you described, if we could ensure that the process
> doesn't raise an error during resolving foreign transactions after
> committing the local transaction we would not need the resolver
> process.

My initial patch used the same backend to resolve foreign
transactions. But in that case, even though the user receives COMMIT
completed, the backend doesn't accept the next query while it is busy
resolving the foreign transactions. That might be a usability issue again if
attempting to resolve all foreign transactions takes noticeable time.
If we go this route, we should try to resolve as many foreign
transactions as possible, ignoring any errors while doing so, and
somehow let the user know which transactions couldn't be resolved. The user
can then take responsibility for resolving those.

>
> Or the second idea would be that the backend commits only the local
> transaction then returns the acknowledgment of COMMIT to the user
> without resolving foreign transactions. Then the user manually
> resolves the foreign transactions by, for example, using the SQL
> function pg_resolve_foreign_xact() within a separate transaction. That
> way, even if an error occurred during resolving foreign transactions
> (i.g., executing COMMIT PREPARED), it’s okay as the user is already
> aware of the local transaction having been committed and can retry to
> resolve the unresolved foreign transaction. So we won't need the
> resolver process while avoiding such inconsistency.
>
> But a drawback would be that the transaction commit doesn't ensure
> that all foreign transactions are completed. The subsequent
> transactions would need to check if the previous distributed
> transaction is completed to see its results. I’m not sure it’s a good
> design in terms of usability.

I agree, this won't be acceptable.

In either case, I think a solution where the local server takes
responsibility to resolve foreign transactions will be better even in
the first cut.

-- 
Best Wishes,
Ashutosh Bapat




Re: [HACKERS] logical decoding of two-phase transactions

2020-09-15 Thread Amit Kapila
On Tue, Sep 15, 2020 at 5:27 PM Ajin Cherian  wrote:
>
> On Sat, Sep 12, 2020 at 9:40 PM Amit Kapila  wrote:
>>
>>
>> Another thing, I noticed is that originally we have subscriber-side
>> support as well, see [1] (see *pgoutput* patch) but later dropped it
>> due to some reasons [2]. I think we should have pgoutput support as
>> well, so see what is required to get that incorporated.
>>
> I have added the rebased patch-set for pgoutput and subscriber side changes 
> as well. This also includes a test case in subscriber.
>

As mentioned in my email there were some reasons due to which the
support has been left for later, have you checked those and if so, can
you please explain how you have addressed those or why they are not
relevant now if that is the case?

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2020-09-15 Thread Amit Kapila
On Wed, Sep 9, 2020 at 3:33 PM Ajin Cherian  wrote:
>
> On Mon, Sep 7, 2020 at 11:17 PM Amit Kapila  wrote:
>>
>>
>> Nikhil has a test for the same
>> (0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last
>> email [1]. You might want to use it to test this behavior. I think you
>> can also keep the tests as a separate patch as Nikhil had.
>>
> Done. I've added the tests and also tweaked code to make sure that the aborts 
> during 2 phase commits are also handled.
>

I don't think it is complete yet.
*
* This error can only occur when we are sending the data in
  * streaming mode and the streaming is not finished yet.
  */
- Assert(streaming);
- Assert(stream_started);
+ Assert(streaming || rbtxn_prepared(txn));
+ Assert(stream_started  || rbtxn_prepared(txn));

Here, you have updated the code but comments are still not updated.

*
@@ -2370,10 +2391,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  errdata = NULL;
  curtxn->concurrent_abort = true;

- /* Reset the TXN so that it is allowed to stream remaining data. */
- ReorderBufferResetTXN(rb, txn, snapshot_now,
-   command_id, prev_lsn,
-   specinsert);
+ /* If streaming, reset the TXN so that it is allowed to stream
remaining data. */
+ if (streaming && stream_started)
+ {
+ ReorderBufferResetTXN(rb, txn, snapshot_now,
+   command_id, prev_lsn,
+   specinsert);
+ }
+ else
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"", txn->xid);
+ rb->abort(rb, txn, commit_lsn);
+ }

I don't think we need to perform abort here. Later we will anyway
encounter the WAL for Rollback Prepared for which we will call
abort_prepared_cb. As we have set the 'concurrent_abort' flag, it will
allow us to skip all the intermediate records. Here, we need only
enough state in ReorderBufferTxn that it can be later used for
ReorderBufferFinishPrepared(). Basically, you need functionality
similar to ReorderBufferTruncateTXN where except for invalidations you
can free memory for everything else. You can either write a new
function ReorderBufferTruncatePreparedTxn or pass another bool
parameter in ReorderBufferTruncateTXN to indicate it is prepared_xact
and then clean up additional things that are not required for prepared
xact.

*
Similarly, I don't understand why we need below code:
ReorderBufferProcessTXN()
{
..
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
..
}

There is nowhere we are setting the RBTXN_ROLLBACK flag, so how will
this check be true? If we decide to remove this code then don't forget
to update the comments.

*
If my previous two comments are correct then I don't think we need the
below interface.
+
+ Transaction Abort Callback
+
+ 
+  The required abort_cb callback is called whenever
+  a transaction abort has to be initiated. This can happen if we are
+  decoding a transaction that has been prepared for two-phase commit and
+  a concurrent rollback happens while we are decoding it.
+
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+   ReorderBufferTXN *txn,
+   XLogRecPtr abort_lsn);



>>
>>
>> I don't know why the patch has used this way to implement an option to
>> enable two-phase. Can't we use how we implement 'stream-changes'
>> option in commit 7259736a6e? Just refer how we set ctx->streaming and
>> you can use a similar way to set this parameter.
>
>
> Done, I've moved the checks for callbacks to inside the corresponding 
> wrappers.
>

This is not what I suggested. Please study commit 7259736a6e and
see how the streaming option is implemented. I want subscribers to later be
able to specify whether they want transactions to be decoded at prepare time,
similar to what we have done for streaming. Also, search for
ctx->streaming in the code and see how it is set to get the idea.

Note: Please use a version number when sending patches; you can use
something like git format-patch -N -v n to do that. It makes it easier
for the reviewer to compare it with the previous version.

Few other comments:
===
1.
ReorderBufferProcessTXN()
{
..
if (streaming)
{
ReorderBufferTruncateTXN(rb, txn);

/* Reset the CheckXidAlive */
CheckXidAlive = InvalidTransactionId;
}
else
ReorderBufferCleanupTXN(rb, txn);
..
}

I don't think we can perform ReorderBufferCleanupTXN for the prepared
transactions because if we have removed the ReorderBufferTxn before
commit, the later code might not consider such a transaction in the
system and compute the wrong value of restart_lsn for a slot.
Basically, in SnapBuildProcessRunningXacts() when we call
ReorderBufferGetOldestTXN(), it should show the ReorderBufferTxn of
the prepared transaction which is not yet committed but because we
have removed it after prepare, it won't get that TXN and then that
leads to wrong computation of restart_lsn. Once we start from a wrong
point in WAL, the snapshot built was incorrect which will lead to the

Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration

2020-09-15 Thread Peter Eisentraut

On 2020-09-11 09:08, Michael Paquier wrote:

On Thu, Sep 10, 2020 at 03:59:20PM +0200, Peter Eisentraut wrote:

The first patch you proposed checks for errno == ERANGE, but pgbench code
doesn't do that.  So one of them is not correct.


Sorry for the confusion, I misunderstood what you were referring to.
Yes, the first patch is wrong to add the check on errno.  FWIW, I
thought about your point to use strtol() but that does not seem worth
the complication for those tools.  It is not like anybody is going to
use high values for these, and using uint64 to make sure that the
boundaries are checked just add more checks for bounds.  There is
one example in pg_test_timing when compiling the total time.


I didn't mean using strtol() to be able to process larger values, but for 
the error checking.  atoi() cannot detect any errors other than ERANGE. 
So if you are spending effort on making the option value parsing more 
robust, relying on atoi() will result in an incomplete solution.
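
A minimal sketch of the kind of check I mean (the option name and limits are 
just for illustration):

#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse an integer option value, rejecting garbage, trailing junk and
 * out-of-range input, all of which atoi() silently accepts.
 */
static bool
parse_secs_per_test(const char *arg, int *result)
{
	char	   *endptr;
	long		val;

	errno = 0;
	val = strtol(arg, &endptr, 10);
	if (errno == ERANGE || endptr == arg || *endptr != '\0' ||
		val <= 0 || val > INT_MAX)
	{
		fprintf(stderr, "invalid argument for --secs-per-test: \"%s\"\n", arg);
		return false;
	}
	*result = (int) val;
	return true;
}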


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Collation versioning

2020-09-15 Thread Peter Eisentraut

On 2020-09-08 16:45, Julien Rouhaud wrote:

I usually agree with that approach, I'm just afraid that getting a consensus on
the best way to do that will induce a lot of discussions, while this is
probably a corner case due to general usage of hash and bloom indexes.

Anyway, in order to make progress on that topic I attach an additional POC
commit to add the required infrastructure to handle this case in
v29-0001-Add-a-new-amnostablecollorder-flag-in-IndexAmRou.patch.


I'm confused now.  I think we had mostly agreed on the v28 patch set, 
without this additional AM flag.  There was still some discussion on 
what the AM flag's precise semantics should be.  Do we want to work that 
out first?


Btw., I'm uneasy about the term "stable collation order".  "Stable" has 
an established meaning for sorting.  It's really about whether the AM 
uses collations at all, right?


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Force update_process_title=on in crash recovery?

2020-09-15 Thread Thomas Munro
Hi,

Based on a couple of independent reports from users with no idea how
to judge the progress of a system recovering from a crash, Christoph
and I wondered if we should override update_process_title for the
"recovering ..." message, at least until connections are allowed.  We
already do that to set the initial titles.

Crash recovery is a rare case where important information is reported
through the process title that isn't readily available anywhere else,
since you can't log in.  If you want to gauge  progress on a system
that happened to crash with update_process_title set to off, your best
hope is probably to trace the process or spy on the files it has open,
to see which WAL segment it's accessing, but that's not very nice.




Re: [HACKERS] logical decoding of two-phase transactions

2020-09-15 Thread Ajin Cherian
On Sat, Sep 12, 2020 at 9:40 PM Amit Kapila  wrote:

>
> Another thing, I noticed is that originally we have subscriber-side
> support as well, see [1] (see *pgoutput* patch) but later dropped it
> due to some reasons [2]. I think we should have pgoutput support as
> well, so see what is required to get that incorporated.
>
I have added the rebased patch-set for pgoutput and subscriber side
changes as well. This also includes a test case in subscriber.

regards,
Ajin Cherian


0001-Support-decoding-of-two-phase-transactions.patch
Description: Binary data


0002-Tap-test-to-test-concurrent-aborts-during-2-phase-co.patch
Description: Binary data


0003-pgoutput-output-plugin-support-for-logical-decoding-.patch
Description: Binary data


Re: Use incremental sort paths for window functions

2020-09-15 Thread David Rowley
On Tue, 15 Sep 2020 at 23:21, David Rowley  wrote:
>
> On Tue, 15 Sep 2020 at 20:12, Daniel Gustafsson  wrote:
> >
> > No comments on this version, LGTM.
>
> Cool. Many thanks for having a look.

Pushed. 62e221e1c

David




Standardize the printf format for st_size

2020-09-15 Thread Peter Eisentraut
Existing code uses various inconsistent ways to printf struct stat's 
st_size member.  The type of that is off_t, which is in most cases a 
signed 64-bit integer, so use the long long int format for it.


I don't think anything is affected by this in practice, but it seems 
reasonable to set a good convention for the next time this code gets copied.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 028e5a17e2cfa1acfdbbd9c2eaaf526b07b484c4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 15 Sep 2020 13:33:48 +0200
Subject: [PATCH] Standardize the printf format for st_size

Existing code used various inconsistent ways to printf struct stat's
st_size member.  The type of that is off_t, which is in most cases a
signed 64-bit integer, so use the long long int format for it.
---
 src/backend/access/transam/twophase.c | 12 ++--
 src/backend/access/transam/xlogarchive.c  |  6 +++---
 src/bin/pg_basebackup/pg_receivewal.c |  4 ++--
 src/bin/pg_verifybackup/pg_verifybackup.c |  8 
 src/fe_utils/archive.c|  6 +++---
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/twophase.c 
b/src/backend/access/transam/twophase.c
index ef4f9981e3..7940060443 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1243,10 +1243,10 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
stat.st_size > MaxAllocSize)
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
-errmsg_plural("incorrect size of file \"%s\": 
%zu byte",
-  "incorrect size of 
file \"%s\": %zu bytes",
-  (Size) stat.st_size, 
path,
-  (Size) 
stat.st_size)));
+errmsg_plural("incorrect size of file \"%s\": 
%lld byte",
+  "incorrect size of 
file \"%s\": %lld bytes",
+  (long long int) 
stat.st_size, path,
+  (long long int) 
stat.st_size)));
 
crc_offset = stat.st_size - sizeof(pg_crc32c);
if (crc_offset != MAXALIGN(crc_offset))
@@ -1270,8 +1270,8 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 errmsg("could not read file \"%s\": 
%m", path)));
else
ereport(ERROR,
-   (errmsg("could not read file \"%s\": 
read %d of %zu",
-   path, r, (Size) 
stat.st_size)));
+   (errmsg("could not read file \"%s\": 
read %d of %lld",
+   path, r, (long long 
int) stat.st_size)));
}
 
pgstat_report_wait_end();
diff --git a/src/backend/access/transam/xlogarchive.c 
b/src/backend/access/transam/xlogarchive.c
index 8f8734dc1d..cae93ab69d 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -202,10 +202,10 @@ RestoreArchivedFile(char *path, const char *xlogfname,
else
elevel = FATAL;
ereport(elevel,
-   (errmsg("archive file \"%s\" 
has wrong size: %lu instead of %lu",
+   (errmsg("archive file \"%s\" 
has wrong size: %lld instead of %lld",
xlogfname,
-   (unsigned long) 
stat_buf.st_size,
-   (unsigned long) 
expectedSize)));
+   (long long int) 
stat_buf.st_size,
+   (long long int) 
expectedSize)));
return false;
}
else
diff --git a/src/bin/pg_basebackup/pg_receivewal.c 
b/src/bin/pg_basebackup/pg_receivewal.c
index cd05f5fede..cddc896390 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -269,8 +269,8 @@ FindStreamingStart(uint32 *tli)
 
if (statbuf.st_size != WalSegSz)
{
-   pg_log_warning("segment file \"%s\" has 
incorrect size %d, skipping",
-  dirent->d_name, 
(int) statbuf.st_size);
+   

Re: Yet another fast GiST build

2020-09-15 Thread Heikki Linnakangas

On 11/09/2020 09:02, Andrey M. Borodin wrote:

10 сент. 2020 г., в 17:43, Heikki Linnakangas 
написал(а):

One more patch version attached. I fixed some memory leaks, and
fixed the abbreviation on 32-bit systems, and a bunch of small
comment changes and such.




The patch looks fine to me. On my machine GiST for points is built
10x faster than before the patch.


Another patch version, fixed a few small bugs pointed out by assertion 
failures in the regression tests.


- Heikki
>From fdf51af02513384949d4a26c2c8381e7715703b7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 15 Sep 2020 14:32:26 +0300
Subject: [PATCH v19 1/1] Add support for building GiST index by sorting.

This adds a new optional support function to the GiST access method:
sortsupport. If it is defined, the GiST index is built by sorting all data
to the order defined by the sortsupport's comparator function, and packing
the tuples in that order to GiST pages. This is similar to how B-tree
index build works, and is much faster than inserting the tuples one by one.
The resulting index is smaller too, because the pages are packed more
tightly, upto 'fillfactor'. The normal build method works by splitting
pages, which tends to lead to more wasted space.

The quality of the resulting index depends on how good the opclass-defined
sort order is. A good order preserves locality of the input data.

As the first user of this facility, add 'sortsupport' function to the
point_ops opclass. It sorts the points in Z-order (aka Morton Code), by
interleaving the bits of the X and Y coordinates.

Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
---
 doc/src/sgml/gist.sgml |  70 +++
 src/backend/access/gist/gistbuild.c| 510 +
 src/backend/access/gist/gistproc.c | 229 +
 src/backend/access/gist/gistutil.c |  53 ++-
 src/backend/access/gist/gistvalidate.c |   6 +-
 src/backend/access/transam/xloginsert.c|  57 +++
 src/backend/utils/sort/sortsupport.c   |  34 ++
 src/backend/utils/sort/tuplesort.c |  57 +++
 src/include/access/gist.h  |   3 +-
 src/include/access/gist_private.h  |   3 +
 src/include/access/xloginsert.h|   2 +
 src/include/catalog/catversion.h   |   1 +
 src/include/catalog/pg_amproc.dat  |   2 +
 src/include/catalog/pg_proc.dat|   3 +
 src/include/utils/sortsupport.h|   1 +
 src/include/utils/tuplesort.h  |   4 +
 src/test/regress/expected/create_index.out |   6 +-
 17 files changed, 935 insertions(+), 106 deletions(-)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index f9226e7a35c..7c72a547409 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -259,6 +259,8 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
compress method is omitted. The optional tenth method
options is needed if the operator class provides
the user-specified parameters.
+   The sortsupport method is also optional and is used to
+   speed up building a GiST index.
  
 
  
@@ -1065,6 +1067,74 @@ my_compress(PG_FUNCTION_ARGS)
   
  
 
+
+
+ sortsupport
+ 
+  
+   Returns a comparator function to sort data in a way that preserves
+   locality. It is used by CREATE INDEX and
+   REINDEX. The quality of the created index depends on
+   how well the sort order determined by the comparator function preserves
+   locality of the inputs.
+  
+  
+   The sortsupport method is optional. If it is not
+   provided, CREATE INDEX builds the index by inserting
+   each tuple to the tree using the penalty and
+   picksplit functions, which is much slower.
+  
+
+  
+   The SQL declaration of the function must look like
+   this:
+
+
+CREATE OR REPLACE FUNCTION my_sortsupport(internal)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+
+   The argument is a pointer to a SortSupport
+   struct. At a minimum, the function must fill in its comparator field.
+   The comparator takes three arguments: two Datums to compare, and
+   a pointer to the SortSupport struct. The
+   Datums are the two indexed values in the format that they are stored
+   in the index; that is, in the format returned by the
+   compress method. The full API is defined in
+   src/include/utils/sortsupport.h.
+   
+
+   
+The matching code in the C module could then follow this skeleton:
+
+
+PG_FUNCTION_INFO_V1(my_sortsupport);
+
+static int
+my_fastcmp(Datum x, Datum y, SortSupport ssup)
+{
+  /* establish order between x and y by computing some sorting value z */
+
+  int z1 = ComputeSpatialCode(x);
+  int z2 = ComputeSpatialCode(y);
+
+  return z1 == z2 ? 0 : z1 > z2 ? 1 : -1;
+}
+
+Datum

Re: Use incremental sort paths for window functions

2020-09-15 Thread David Rowley
On Tue, 15 Sep 2020 at 20:12, Daniel Gustafsson  wrote:
>
> On that note, assume we have the below scenario:
>
> wfunc .. (order by a), .. (order by a,b), .. (order by a,b,c)
>
> Currently the windows will be ordered such that a,b,c is sorted first, with 
> a,b
> and a not having to sort.  I wonder if there is a good heuristic to find cases
> where sorting a, then a,b incrementally and finally a,b,c incrementally is
> cheaper than a big sort of a,b,c?  If a,b,c would spill but subsequent
> incremental sorts won't then perhaps that could be a case?  Not sure if it's
> worth the planner time, just thinking out loud.

It's a worthy cause, but unfortunately, I don't think there's any very
realistic thing that can be done about that.  The problem is that
you're deciding the "most sorted" window clause and putting that first
in the parameters to the query_planner()'s callback function.  If you
wanted to try some alternative orders then it means calling
query_planner() again with some other order for
qp_extra.activeWindows.

Perhaps there's some other way of doing it so that the planner does
some sort of preliminary investigation about the best order to
evaluate the windows in.  Currently, standard_qp_callback just takes
the first window and has the planner perform the join order search
based on that.  Performing the join order search multiple times is
just not realistic, so it could only be done by some sort of
pre-checks.  e.g, is there an index that's likely to help me obtain
this specific order.  Then we'd just have to hope that through the
join search that the planner actually managed to produce a more
optimal plan than it would have if we'd left the window evaluation
order alone. It sounds pretty tricky to make cheap and good enough at
the same time.
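
For reference, the scenario in question is a query like this (hypothetical
table and column names):

SELECT sum(x) OVER (ORDER BY a),
       sum(x) OVER (ORDER BY a, b),
       sum(x) OVER (ORDER BY a, b, c)
FROM some_table;

Today we sort on (a, b, c) once and the other two windows don't need to
sort; the question is whether sorting on (a) and then doing incremental
sorts for (a, b) and (a, b, c) could ever come out cheaper.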

> No comments on this version, LGTM.

Cool. Many thanks for having a look.

David




Re: pg_restore causing deadlocks on partitioned tables

2020-09-15 Thread Amit Langote
On Tue, Sep 15, 2020 at 7:28 AM Tom Lane  wrote:
> I wrote:
> > However, the deadlock report suggests, and manual experimentation
> > confirms, that
>
> > (1) TRUNCATE on a partition tries to get AccessShareLock on the parent;
>
> The reason for this is that
>
> (a) ExecuteTruncateGuts calls InitResultRelInfo, because it might
> need that to fire TRUNCATE triggers for the child relation.
>
> (b) InitResultRelInfo calls RelationGetPartitionQual, which
> of course(?) must access the parent table.
>
> AFAICS, it is utterly silly for InitResultRelInfo to be forcing
> a partition qual to be computed when we might not need it.
> We could flush ResultRelInfo.ri_PartitionCheck altogether and
> have anything that was reading it instead do
> RelationGetPartitionQual(ResultRelInfo.ri_RelationDesc).
>
> Actually it looks like most of the places reading it are
> just interested in non-nullness; can't those be nuked from
> orbit in favor of testing rel->rd_rel->relispartition?

Yeah, makes sense.  Please see attached a patch to do that.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com


remove-ri_PartitionCheck.patch
Description: Binary data


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-15 Thread k.jami...@fujitsu.com
Hi, 

> BTW, I think I see one problem in the code:
> >
> > if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > + bufHdr->tag.forkNum == forkNum[j] && tag.blockNum >=
> > + bufHdr->firstDelBlock[j])
> >
> > Here, I think you need to use 'i' not 'j' for forkNum and
> > firstDelBlock as those are arrays w.r.t forks. That might fix the
> > problem but I am not sure as I haven't tried to reproduce it.
> 
> Thanks for advice. Right, that seems to be the cause of error, and fixing that
> (using fork) solved the case.
> I also followed the advice of Tsunakawa-san of using more meaningful
> iterator Instead of using "i" & "j" for readability.
> 
> I also added a new function when relation fork is bigger than the threshold
> If (nblocks > BUF_DROP_FULLSCAN_THRESHOLD)
> (DropRelFileNodeBuffersOfFork) Perhaps there's a better name for that
> function.
> However, as expected in the previous discussions, this is a bit slower than 
> the
> standard buffer invalidation process, because the whole shared buffers are
> scanned nfork times.
> Currently, I set the threshold to (NBuffers / 500)

I made a mistake in v12. I replaced firstDelBlock[fork_num] with
firstDelBlock[block_num] in the for-loop over block_num, because we want to
process the current block of the per-block loop.

OTOH, I used firstDelBlock[fork_num] when the relation fork is bigger than
the threshold, or if the cached blocks of small relations were already
invalidated.

The logic could be either correct or wrong, so I'd appreciate feedback and 
comments/advice.

Regards,
Kirk Jamison


v13-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v13-Speedup-dropping-of-relation-buffers-during-recovery.patch


Re: [PATCH] Remove useless distinct clauses

2020-09-15 Thread David Rowley
On Fri, 31 Jul 2020 at 20:41, Pierre Ducroquet  wrote:
>
> In a recent audit, I noticed that application developers have a tendency to
> abuse the distinct clause. For instance they use an ORM and add a distinct at
> the top level just because they don't know the cost it has, or they don't know
> that using EXISTS is a better way to express their queries than doing JOINs
> (or worse, they can't do better).
>
> They thus have this kind of queries (considering tbl1 has a PK of course):
> SELECT DISTINCT * FROM tbl1;
> SELECT DISTINCT * FROM tbl1 ORDER BY a;
> SELECT DISTINCT tbl1.* FROM tbl1
> JOIN tbl2 ON tbl2.a = tbl1.id;

This is a common anti-pattern that I used to see a couple of jobs ago.
What seemed to happen was that someone would modify some query or a
view to join in an additional table to fetch some information that was
now required.  At some later time, there'd be a bug report to say that
the query is returning certain records more than once.  The
developer's solution was to add DISTINCT, instead of figuring out that
the join that was previously added missed some column from the join
clause.
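
For example, the last query in your list can usually be written without
DISTINCT at all (using your hypothetical table names):

SELECT tbl1.*
FROM tbl1
WHERE EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.a = tbl1.id);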

> These can be transformed into:
> SELECT * FROM tbl1 ORDER BY *;
> SELECT * FROM tbl1 ORDER BY a;
> SELECT tbl1.* FROM tbl1 SEMI-JOIN tbl2 ON tbl2.a = tbl1.id ORDER BY tbl1.*;
>
> The attached patch does that.

Unfortunately, there are quite a few issues with what you have:

First off, please see
https://www.postgresql.org/docs/devel/source-format.html about how we
format the source code.  Please pay attention to how we do code
comments and braces on a separate line.

Another problem is that we shouldn't be really wiping out the distinct
clause like you are with "root->parse->distinctClause = NULL;" there's
some discussion in [1] about that.

Also, the processing of the join tree where you switch inner joins to
semi joins looks broken. This would require much more careful and
recursive processing to do properly.  However, I'm not really sure
what that is as I'm not sure of all the cases that you can optimise
this way, and more importantly, which ones you can't.  There's also no
hope of anyone else knowing this as you've not left any comments about
why what you're doing is valid.

If you want an example of what can cause what you have to break:

create table t (a int primary key);
explain select distinct a from t cross join pg_class cross join pg_attribute;
 QUERY PLAN
-
 Nested Loop Semi Join  (cost=0.15..37682.53 rows=999600 width=4)
   ->  Nested Loop  (cost=0.15..12595.31 rows=999600 width=4)
 ->  Index Only Scan using t_pkey on t  (cost=0.15..82.41
rows=2550 width=4)
 ->  Materialize  (cost=0.00..18.88 rows=392 width=0)
   ->  Seq Scan on pg_class  (cost=0.00..16.92 rows=392 width=0)
   ->  Materialize  (cost=0.00..105.17 rows=3145 width=0)
 ->  Seq Scan on pg_attribute  (cost=0.00..89.45 rows=3145 width=0)
(7 rows)


-- Note the join to pg_attribute remains a cross join.
insert into t values(1);
-- the following should only return 1 row. It returns many more than that.
select distinct a from t cross join pg_class cross join pg_attribute;

I can't figure out why you're doing this either:

+ /**
+ * If there was no sort clause, we change the distinct into a sort clause.
+ */
+ if (!root->parse->sortClause)
+ root->parse->sortClause = root->parse->distinctClause;

It's often better to say "why" rather than "what" when it comes to
code comments. It's pretty easy to see "what". It's the "why" part
that people more often get stuck on. Although, sometimes what you're
doing is complex and it does need a mention of "what". That's not the
case for the above though.

David

[1] 
https://www.postgresql.org/message-id/CAExHW5t7ALZmaN8gL5DZV%2Ben5G%3D4QTbKSYhBrXnSrKgCYNr_AA%40mail.gmail.com




Re: ECPG: proposal for new DECLARE STATEMENT

2020-09-15 Thread Michael Meskes
> This patch has now been silent for quite a while, unless someone is
> interested
> enough to bring it forward it seems about time to close it.

I am interested but still short on time. I will definitely look into it
as soon as I find some spare minutes.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De
Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org





Re: ECPG: proposal for new DECLARE STATEMENT

2020-09-15 Thread Daniel Gustafsson
> On 30 Mar 2020, at 18:53, David Steele  wrote:
> 
> On 1/11/20 10:41 PM, Tomas Vondra wrote:
>> On Sun, Jan 12, 2020 at 03:52:48AM +0100, Tomas Vondra wrote:
>>> ...
>>> 
>>> I'm not an ecpg expert (in fact I've never even used it), so my review
>>> is pretty superficial, but I only found a couple of minor whitespace
>>> issues (adding/removing a line/tab) - see the attached file.
>>> 
>> Meh, forgot to attach the file ...
> 
> Any thoughts on Tomas' comments?
> 
> A big part of moving a patch forward is keeping the thread active and 
> answering comments/review.

This patch has now been silent for quite a while, unless someone is interested
enough to bring it forward it seems about time to close it.

cheers ./daniel



Re: Autovacuum on partitioned table (autoanalyze)

2020-09-15 Thread Kyotaro Horiguchi
At Tue, 25 Aug 2020 14:28:20 +0200, Daniel Gustafsson  wrote 
in 
> > I attach the latest patch that solves the above Werror.
> > Could you please check it again?
> 
> This version now pass the tests in the Travis pipeline as can be seen in the
> link below, and is ready to be reviewed in the upcoming commitfest:
> 
>   http://cfbot.cputube.org/yuzuko-hosoya.html

At Mon, 6 Jul 2020 19:35:37 +0900, yuzuko  wrote in 
> I think there are other approaches like Tom's idea that Justin previously
> referenced, but this patch works the same way as previous patches.
> (tracks updated/inserted/deleted tuples and checks whether the partitioned
> tables needs auto-analyze, same as nonpartitioned tables)
> Because I wanted to be able to analyze partitioned tables by autovacuum
> as a first step, and I think this approach is the simplest way to do it.

I'm not sure if anything bad happens if the parent and children do not
agree on statistics.

The requirement suggested here seems to be:

- We want to update the parent's stats when any of its children gets its
  stats updated. This is especially crucial for time-series
  partitioning.

- However, we don't want to analyze the whole tree every time any of the
  children is analyzed.

To achieve both, stats merging seems to be the optimal solution.

Putting that aside, I had a brief look on the latest patch.

/* We only count stats for things that have storage */
-   if (!RELKIND_HAS_STORAGE(relkind))
+   if (!RELKIND_HAS_STORAGE(relkind) ||
+   relkind == RELKIND_PARTITIONED_TABLE)
{
rel->pgstat_info = NULL;

RELKIND_HAS_STORAGE(RELKIND_PARTITIONED_TABLE) is already false.
Maybe you wanted to do "&& relkind !=" instead:p


+   /*
+* If this relation is partitioned, we store all ancestors' oid
+* to propagate its changed_tuples to their parents when this
+* transaction is committed.
+*/
+   if (rel->rd_rel->relispartition && pgstat_info->ancestors == 
NULL)

If the relation was detached and then attached to another partition within
a transaction, the ancestor list would get stale and subsequent
modifications to the relation would propagate into the wrong ancestors.

I think vacuum time is a more appropriate point to modify the ancestors'
stats. It seems to me that what Alvaro pointed out is the
list-order-susceptible manner of collecting children's modified tuples.


+   ? 0  /* partitioned tables don't have any data, so it's 0 */

If the comment is true, we shouldn't have non-zero t_changed_tuples,
either. I think the reason for these lines is something different.

# Oops! Time's up now.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Unnecessary delay in streaming replication due to replay lag

2020-09-15 Thread lchch1...@sina.cn
Hello

I read the code and tested the patch; it runs well on my side, and I have
several issues with the patch.

1. When calling RequestXLogStreaming() during replay, you pick the timeline
straight from the control file; do you think it should pick the timeline from
the timeline history file instead?

2. In archive recovery mode, which will never switch to streaming mode, I
think the current code will still call RequestXLogStreaming(), which could be
avoided.

3. I found two 018_x.pl files when I ran make check; maybe rename the new one?




Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: PG 13 release notes, first draft

2020-09-15 Thread Masahiko Sawada
On Tue, 15 Sep 2020 at 13:56, Peter Eisentraut
 wrote:
>
> On 2020-09-09 22:57, Jonathan S. Katz wrote:
> > +
> > + 
> > +  Parallelized vacuuming of B-tree indexes
> > + 
> > +
>
> I don't think B-tree indexes are relevant here.  AFAICT, this feature
> applies to all indexes.
>

Yes, parallel vacuum applies to all types of indexes provided by
PostgreSQL binary, and other types of indexes also can use it.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-09-15 Thread Fujii Masao




On 2020/07/17 21:02, Bharath Rupireddy wrote:


On Tue, Jul 14, 2020 at 6:13 PM Ashutosh Bapat
 wrote:


Has this been added to CF, possibly next CF?



I have not added yet. Request to get it done in this CF, as the final
patch for review(v3 patch) is ready and shared. We can target it to
the next CF if there are major issues with the patch.



I added this feature to the next CF - https://commitfest.postgresql.org/29/2651/


Thanks for the patch! Here are some comments from me.

+   PQreset(entry->conn);

Isn't using PQreset() to reconnect to the foreign server unsafe?
When a new connection is established, some SET commands seem
to need to be issued, as configure_remote_session() does.
But PQreset() doesn't do that at all.
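
For illustration, a minimal libpq-level sketch of what I mean (the SET
command here is only a placeholder for what configure_remote_session()
normally issues):

#include <stdio.h>
#include <libpq-fe.h>

/*
 * PQreset() re-establishes the connection, but it does not replay any
 * session-level setup, so the caller has to redo that explicitly.
 */
static void
reset_and_reconfigure(PGconn *conn)
{
	PQreset(conn);
	if (PQstatus(conn) == CONNECTION_OK)
	{
		PGresult   *res;

		res = PQexec(conn, "SET search_path = pg_catalog");
		PQclear(res);
	}
	else
		fprintf(stderr, "reconnect failed: %s", PQerrorMessage(conn));
}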


Originally, when GetConnection() establishes a new connection,
it resets all transient state fields, to be sure all are clean (as the
comment says). With the patch, even when reconnecting to
the remote server, shouldn't we do the same, to be safe?


+   PGresult *res = NULL;
+   res = PQgetResult(entry->conn);
+   PQclear(res);

Are these really necessary? I was just thinking that's not because
pgfdw_get_result() and pgfdw_report_error() seem to do that
already in do_sql_command().


+   /* Start a new transaction or subtransaction if needed. */
+   begin_remote_xact(entry);

Even when there is no cached connection and a new connection is made,
if begin_remote_xact() reports an error, another new connection
is attempted again. This behavior looks a bit strange.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [patch] _bt_binsrch* improvements - equal-prefix-skip binary search

2020-09-15 Thread Matthias van de Meent
On Fri, 11 Sep 2020 at 19:41, Peter Geoghegan  wrote:
>
> Are you familiar with this thread?
>
>
https://www.postgresql.org/message-id/cah2-wzn_nayk4pr0hrwo0stwhmxjp5qyu+x8vppt030xpqr...@mail.gmail.com
>
> I wrote a patch that implemented the same idea, which is sometimes
> called dynamic prefix truncation. I found some very subtle problems
> with it, though. Concurrent page deletions could occur, which could
> cause a scan that skips a prefix to miss that the page it landed on
> doesn't have the same common prefix anymore.

Thank you for pointing me to that thread, I was not yet familiar with it.
It took me some time to get familiar with the Lanin and Shasha [0] paper,
but the concerns regarding concurrent page deletion are indeed problematic
for algorithmic prefix truncation, and do make it near impossible to use
algorithmic prefix truncation without resetting the accumulated prefix
every page.

At that, I will retract this current patch, and (unless someone's already
working on it) start research on implementing physical prefix truncation
through deduplicating the prefix shared with a page's highkey. This would
likely work fine for inner nodes (there are still flag bits left unused,
and the concerns related to deduplication of equal-but-distinct values do
not apply there), though I'm uncertain about the ability to truncate
duplicate prefixes in leaf nodes as it is basically prefix deduplication
with the same problems attached as 'normal' deduplication.

Thanks,

Matthias van de Meent

[0] https://archive.org/stream/symmetricconcurr00lani


Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

2020-09-15 Thread Julien Rouhaud
On Tue, Sep 15, 2020 at 4:48 AM Michael Paquier  wrote:
>
> On Thu, Sep 10, 2020 at 02:31:32PM +0200, Daniel Gustafsson wrote:
> > Given how unintrusive this optimization is, +1 from me to go ahead with this
> > patch.  pg_dump tests pass.  Personally I would've updated the nearby 
> > comments
> > to reflect why the check for dataOnly is there, but MMV there.  I'm moving 
> > this
> > patch to Ready for Committer.
>
> We need two comments here.  I would suggest something like:
> "Skip default/check for a data-only dump, as this is only needed for
> dumps of the table schema."
>
> Better wording is of course welcome.

FWIW I'm fine with those new comments!




Re: Subscription test 013_partition.pl fails under CLOBBER_CACHE_ALWAYS

2020-09-15 Thread Neha Sharma
Hi Tom,

I have tested the subscription test 013_partition.pl with CCA enabled on
HEAD and PG13,
and I am able to reproduce the issue on both the versions.

*Logs:*
[centos@clobber-cache subscription]$ git branch
* REL_13_STABLE
  master
[centos@clobber-cache-db93 subscription]$ tail -f
tmp_check/log/013_partition_subscriber1.log
2020-09-15 08:42:19.763 UTC [27866] LOG:  logical replication apply worker
for subscription "sub1" has started
2020-09-15 08:42:20.395 UTC [27866] ERROR:  cache lookup failed for
relation 0
2020-09-15 08:42:20.436 UTC [26427] LOG:  background worker "logical
replication worker" (PID 27866) exited with exit code 1
2020-09-15 08:42:20.835 UTC [27868] LOG:  logical replication apply worker
for subscription "sub1" has started
2020-09-15 08:42:21.462 UTC [27868] ERROR:  cache lookup failed for
relation 0
2020-09-15 08:42:21.508 UTC [26427] LOG:  background worker "logical
replication worker" (PID 27868) exited with exit code 1
2020-09-15 08:42:21.921 UTC [27870] LOG:  logical replication apply worker
for subscription "sub1" has started
2020-09-15 08:42:22.551 UTC [27870] ERROR:  cache lookup failed for
relation 0


Thanks.
--
Regards,
Neha Sharma


On Mon, Sep 14, 2020 at 8:50 PM Tom Lane  wrote:

> In connection with a nearby thread, I tried to run the subscription
> test suite in a CLOBBER_CACHE_ALWAYS build.  I soon found that I had
> to increase wal_receiver_timeout, but after doing this:
>
> diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
> index 1488bff..5fe6810 100644
> *** a/src/test/perl/PostgresNode.pm
> --- b/src/test/perl/PostgresNode.pm
> *** sub init
> *** 447,452 
> --- 447,453 
> print $conf "log_statement = all\n";
> print $conf "log_replication_commands = on\n";
> print $conf "wal_retrieve_retry_interval = '500ms'\n";
> +   print $conf "wal_receiver_timeout = '10min'\n";
>
> # If a setting tends to affect whether tests pass or fail, print it
> after
> # TEMP_CONFIG.  Otherwise, print it before TEMP_CONFIG, thereby
> permitting
>
> I let it run overnight, and came back to find that it was stuck at
>
> [03:02:15] t/013_partition.pl . 19/51
>
> and had been for circa eight hours, where extrapolation from other tests
> said it shouldn't take much over half an hour.  Investigation found that
> the subscriber was repeatedly failing like this:
>
> 2020-09-14 11:05:26.483 EDT [1030506] LOG:  logical replication apply
> worker for subscription "sub1" has started
> 2020-09-14 11:05:27.139 EDT [1030506] ERROR:  cache lookup failed for
> relation 0
> 2020-09-14 11:05:27.140 EDT [947156] LOG:  background worker "logical
> replication worker" (PID 1030506) exited with exit code 1
> 2020-09-14 11:05:27.571 EDT [1030509] LOG:  logical replication apply
> worker for subscription "sub1" has started
> 2020-09-14 11:05:28.227 EDT [1030509] ERROR:  cache lookup failed for
> relation 0
> 2020-09-14 11:05:28.228 EDT [947156] LOG:  background worker "logical
> replication worker" (PID 1030509) exited with exit code 1
>
> The publisher's log shows no sign of distress:
>
> 2020-09-14 11:06:09.380 EDT [1030619] sub1 LOG:  statement: SELECT
> pg_catalog.set_config('search_path', '', false);
> 2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication
> command: IDENTIFY_SYSTEM
> 2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication
> command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version
> '2', publication_names '"pub1"')
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  starting logical decoding
> for slot "sub1"
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  Streaming transactions
> committing after 0/163D848, reading WAL from 0/163CF08.
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  logical decoding found
> consistent point at 0/163CF08
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  There are no running
> transactions.
> 2020-09-14 11:06:10.468 EDT [1030621] sub1 LOG:  statement: SELECT
> pg_catalog.set_config('search_path', '', false);
> 2020-09-14 11:06:10.533 EDT [1030621] sub1 LOG:  received replication
> command: IDENTIFY_SYSTEM
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  received replication
> command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version
> '2', publication_names '"pub1"')
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  starting logical decoding
> for slot "sub1"
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  Streaming transactions
> committing after 0/163D848, reading WAL from 0/163CF08.
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  logical decoding found
> consistent point at 0/163CF08
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  There are no running
> transactions.
>
> I do not have time today to chase this further, but somebody should.
>
> More generally, this seems like good evidence that we really oughta have a
> buildfarm member that's running *all* the tests under 
