Re: POC: postgres_fdw insert batching

2020-11-27 Thread Craig Ringer
On Sat, 28 Nov 2020, 10:10 Tomas Vondra wrote:

>
>
> On 11/27/20 7:05 AM, tsunakawa.ta...@fujitsu.com wrote:
>
> However, the FDW interface as it's implemented today is not designed to
> allow that, I believe (we pretty much just invoke the FDW callbacks as
> if it were a local AM). It assumes the calls are synchronous, and
> redesigning it to work in an async way is a much larger/complex patch than
> what's being discussed here.
>
> I do think the FDW extension proposed here (adding the bulk-insert
> callback) is useful in general, for two reasons: (a) even if most client
> libraries support some sort of pipelining, some don't, and (b) I'd bet
> it's still more efficient to send one large insert than to pipeline many
> individual inserts.
>
> That being said, I'm against expanding the scope of this patch to also
> require redesign of the whole FDW infrastructure - that would likely
> mean no such improvement landing in PG14. If the libpq pipelining patch
> seems likely to get committed, we can try using it for the bulk insert
> callback (instead of the current multi-value stuff).
>

I totally agree on all points. It was not my intent to expand the scope of
this significantly and I really don't want to hold it back.

I raised the interface consideration in case it was something easy to
accommodate. It's not, so that's done, topic over.


Re: Improving spin-lock implementation on ARM.

2020-11-27 Thread Tom Lane
I wrote:
> It might be that this hardware is capable of showing a difference with a
> better-tuned pgbench test, but with an untuned pgbench run, we just aren't
> sufficiently sensitive to the spinlock properties.  (Which I guess is good
> news, really.)

It occurred to me that if we don't insist on a semi-realistic test case,
it's not that hard to just pound on a spinlock and see what happens.
I made up a simple C function (attached) to repeatedly call
XLogGetLastRemovedSegno, which is basically just a spinlock
acquire/release.  Using this as a "transaction":

$ cat bench.sql
select drive_spinlocks(5);
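
For reference, XLogGetLastRemovedSegno is, as best I recall from the
PG13-era xlog.c (treat this as a paraphrase rather than a verbatim quote),
nothing but one spinlock-protected read, so the loop really does measure
little besides spinlock acquire/release:

XLogSegNo
XLogGetLastRemovedSegno(void)
{
	XLogSegNo	lastRemovedSegNo;

	SpinLockAcquire(&XLogCtl->info_lck);
	lastRemovedSegNo = XLogCtl->lastRemovedSegNo;
	SpinLockRelease(&XLogCtl->info_lck);

	return lastRemovedSegNo;
}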

I get this with HEAD:

$ pgbench -f bench.sql -n -T 60 -c 1 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 127597
latency average = 0.470 ms
tps = 2126.479699 (including connections establishing)
tps = 2126.595015 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 2 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
duration: 60 s
number of transactions actually processed: 108979
latency average = 1.101 ms
tps = 1816.051930 (including connections establishing)
tps = 1816.150556 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 4 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 60 s
number of transactions actually processed: 42862
latency average = 5.601 ms
tps = 714.202152 (including connections establishing)
tps = 714.237301 (excluding connections establishing)

(With only 4 high-performance cores, it's probably not
interesting to go further; involving the slower cores
will just confuse matters.)  And this with the patch:

$ pgbench -f bench.sql -n -T 60 -c 1 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 130455
latency average = 0.460 ms
tps = 2174.098284 (including connections establishing)
tps = 2174.217097 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 2 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
duration: 60 s
number of transactions actually processed: 51533
latency average = 2.329 ms
tps = 858.765176 (including connections establishing)
tps = 858.811132 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 4 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 60 s
number of transactions actually processed: 31154
latency average = 7.705 ms
tps = 519.116788 (including connections establishing)
tps = 519.144375 (excluding connections establishing)

So at least on Apple's hardware, it seems like the CAS
implementation might be a shade faster when uncontended,
but it's very clearly worse when there is contention for
the spinlock.  That's interesting, because the argument
that CAS should involve strictly less work seems valid ...
but the numbers above are what I'm getting.
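
For readers who haven't seen the patch: the two lock styles being compared
are roughly the following (my own sketch using GCC atomic builtins; the
real implementations live in s_lock.h and the proposed patch, and differ
in details such as delay handling):

typedef volatile int slock_t;

/* HEAD-style TAS: unconditionally swap in 1; we got the lock iff the
 * old value was 0. */
static inline void
spin_lock_tas(slock_t *lock)
{
	while (__sync_lock_test_and_set(lock, 1) != 0)
	{
		/* spin on plain reads until the lock looks free, then retry */
		while (*lock != 0)
			;
	}
}

/* Patched-style CAS: atomically install 1 only if the current value is 0. */
static inline void
spin_lock_cas(slock_t *lock)
{
	while (__sync_val_compare_and_swap(lock, 0, 1) != 0)
	{
		while (*lock != 0)
			;
	}
}

static inline void
spin_unlock(slock_t *lock)
{
	__sync_lock_release(lock);
}

Note the sketch gives both styles the same read-spin inner loop; the
hardware-level difference is only in the atomic instruction used to take
the lock.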

It might be useful to try this on other ARM platforms,
but I lack the energy right now (plus the only other
thing I've got is a Raspberry Pi, which might not be
something we particularly care about performance-wise).

regards, tom lane

/*

create function drive_spinlocks(count int) returns void
strict volatile language c as '.../spinlocktest.so';

 */

#include "postgres.h"

#include "access/xlog.h"
#include "fmgr.h"
#include "miscadmin.h"

PG_MODULE_MAGIC;

/*
 * drive_spinlocks(count int) returns void
 */
PG_FUNCTION_INFO_V1(drive_spinlocks);
Datum
drive_spinlocks(PG_FUNCTION_ARGS)
{
	int32		count = PG_GETARG_INT32(0);

	while (count-- > 0)
	{
		XLogGetLastRemovedSegno();

		CHECK_FOR_INTERRUPTS();
	}

	PG_RETURN_VOID();
}


Re: POC: postgres_fdw insert batching

2020-11-27 Thread Tomas Vondra



On 11/27/20 7:05 AM, tsunakawa.ta...@fujitsu.com wrote:
> From: Craig Ringer 
>> But in the libpq pipelining patch I demonstrated a 300 times
> > (30,000%) performance improvement on a test workload...
> 
> Wow, impressive number.  I've just seen it in the beginning of the
> libpq pipelining thread (oh, already four years ago..!)  Could you
> share the workload and the network latency (ping time)?  I'm sorry
> if I'm just overlooking it.
> 
> Thank you for your (always) concise explanation.  I'd like to check
> other DBMSs and your rich references for the FDW interface.  (My
> first intuition is that many major DBMSs might not have client C APIs
> that can be used to implement an async pipelining FDW interface.
> Also, I'm afraid it requires major surgery or reform of the executor.  I
> don't want it to delay the release of a reasonably good (10x)
> improvement with the synchronous interface.)
> 

I do agree that pipelining is nice, and can bring huge improvements.

However, the FDW interface as it's implemented today is not designed to
allow that, I believe (we pretty much just invoke the FDW callbacks as
if it were a local AM). It assumes the calls are synchronous, and
redesigning it to work in an async way is a much larger/complex patch than
what's being discussed here.

I do think the FDW extension proposed here (adding the bulk-insert
callback) is useful in general, for two reasons: (a) even if most client
libraries support some sort of pipelining, some don't, and (b) I'd bet
it's still more efficient to send one large insert than to pipeline many
individual inserts.

That being said, I'm against expanding the scope of this patch to also
require redesign of the whole FDW infrastructure - that would likely
mean no such improvement landing in PG14. If the libpq pipelining patch
seems likely to get committed, we can try using it for the bulk insert
callback (instead of the current multi-value stuff).
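
To make point (b) concrete, here is a minimal libpq sketch (not
postgres_fdw code; the table name and connection string are made up)
contrasting one INSERT per row with a single multi-row INSERT, which is
essentially what the multi-value approach sends:

#include <stdio.h>
#include <libpq-fe.h>

static void
insert_one_by_one(PGconn *conn, int nrows)
{
	/* one round trip and one executor startup per row */
	for (int i = 0; i < nrows; i++)
	{
		char		sql[64];

		snprintf(sql, sizeof(sql), "INSERT INTO t VALUES (%d)", i);
		PQclear(PQexec(conn, sql));
	}
}

static void
insert_batched(PGconn *conn, int nrows)
{
	/* one statement carrying all rows: INSERT INTO t VALUES (0),(1),... */
	char		sql[65536];
	int			len;

	len = snprintf(sql, sizeof(sql), "INSERT INTO t VALUES ");
	for (int i = 0; i < nrows; i++)
		len += snprintf(sql + len, sizeof(sql) - len, "%s(%d)",
						i > 0 ? "," : "", i);
	PQclear(PQexec(conn, sql));
}

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=test");	/* assumed database */

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		return 1;
	}
	PQclear(PQexec(conn, "CREATE TABLE IF NOT EXISTS t (a int)"));
	insert_one_by_one(conn, 100);
	insert_batched(conn, 100);
	PQfinish(conn);
	return 0;
}

The batched form pays the per-statement parse/plan/round-trip overhead
once instead of once per row, which is roughly the overhead the 10x
figure mentioned upthread is measuring.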


> (It'd be kind of you to send emails in text format.  I've changed the
> format of this reply from HTML to text.)
> 

Craig's client is sending messages in both text/plain and text/html. You
probably need to tell your client to prefer the plain-text part over the
HTML one, somehow.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: A few new options for CHECKPOINT

2020-11-27 Thread Bossart, Nathan
On 11/27/20, 10:58 AM, "Stephen Frost"  wrote:
> If you'd like to show that I'm wrong, and it's entirely possible that I
> am, then retry the above with actual load on the system, and also
> actually look at how much outstanding WAL you end up with given the
> different scenarios which has to be replayed during crash recovery.

I did a little experiment to show the behavior I'm referring to.  I
used these settings:

checkpoint_completion_target = 0.9
checkpoint_timeout = 30s
max_wal_size = 20GB
WAL segment size is 64MB

I ran the following pgbench command for a few minutes before each
test:

pgbench postgres -T 3600 -c 64 -j 64 -N

For the first test, I killed Postgres just before an automatic, non-
immediate checkpoint completed.

2020-11-28 00:31:57 UTC::@:[51770]:LOG:  checkpoint complete...
2020-11-28 00:32:00 UTC::@:[51770]:LOG:  checkpoint starting: time

Killed Postgres at 00:32:26 UTC, 29 seconds after latest
checkpoint completed.

2020-11-28 00:32:42 UTC::@:[77256]:LOG:  redo starts at 3CF/FD6B8BD0
2020-11-28 00:32:56 UTC::@:[77256]:LOG:  redo done at 3D0/C94D1D00

Recovery took 14 seconds and replayed ~3.2 GB of WAL.

postgres=> SELECT pg_wal_lsn_diff('3D0/C94D1D00', '3CF/FD6B8BD0');
 pg_wal_lsn_diff
-----------------
  3420557616
(1 row)

For the second test, I killed Postgres just after an automatic, non-
immediate checkpoint completed.

2020-11-28 00:41:26 UTC::@:[77475]:LOG:  checkpoint complete...

Killed Postgres at 00:41:26 UTC, just after latest checkpoint
completed.

2020-11-28 00:41:42 UTC::@:[8599]:LOG:  redo starts at 3D3/152EDD78
2020-11-28 00:41:49 UTC::@:[8599]:LOG:  redo done at 3D3/78358A40

Recovery took 7 seconds and replayed ~1.5 GB of WAL.

postgres=> SELECT pg_wal_lsn_diff('3D3/78358A40', '3D3/152EDD78');
 pg_wal_lsn_diff
-----------------
  1661381832
(1 row)
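
(To check the arithmetic: an LSN printed as X/Y is the 64-bit value
(X << 32) | Y, so the first diff above can be reproduced with a few lines
of standalone C:)

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* redo start/done LSNs from the first test */
	uint64_t	redo_start = ((uint64_t) 0x3CF << 32) | 0xFD6B8BD0;
	uint64_t	redo_done = ((uint64_t) 0x3D0 << 32) | 0xC94D1D00;

	/* prints 3420557616, matching pg_wal_lsn_diff() above */
	printf("%llu\n", (unsigned long long) (redo_done - redo_start));
	return 0;
}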

Granted, I used a rather aggressive checkpoint_timeout, but I think
this demonstrates that waiting for a non-immediate checkpoint to
complete can lower the amount of WAL needed for recovery, even though
it might not lower it as much as waiting for an immediate checkpoint
would.

Nathan



Re: [PoC] Non-volatile WAL buffer

2020-11-27 Thread Tomas Vondra
On 11/27/20 1:02 AM, Tomas Vondra wrote:
> 
> Unfortunately, that patch seems to fail for me :-(
> 
> The patches seem to be for PG12, so I applied them on REL_12_STABLE (all
> the parts 0001-0005) and then I did this:
> 
> LIBS="-lpmem" ./configure --prefix=/home/tomas/pg-12-pmem --enable-debug
> make -s install
> 
> initdb -X /opt/pmemdax/benchmarks/wal -D /opt/nvme/benchmarks/data
> 
> pg_ctl -D /opt/nvme/benchmarks/data/ -l pg.log start
> 
> createdb test
> pgbench -i -s 500 test
> 
> 
> which however fails after just about 70k rows generated (PQputline
> failed), and the pg.log says this:
> 
> PANIC:  could not open or mmap file
> "pg_wal/00010006": No such file or directory
> CONTEXT:  COPY pgbench_accounts, line 721000
> STATEMENT:  copy pgbench_accounts from stdin
> 
> Takashi-san, can you check and provide a fixed version? Ideally, I'll
> take a look too, but I'm not familiar with this patch so it may take
> more time.
> 

I did try to get this working today, unsuccessfully. I did manage to
apply the 0002 part separately on REL_12_0 (there's one trivial rejected
chunk), but I still get the same failure. In fact, when built with
assertions, I can't even get initdb to pass :-(

I do get this:

TRAP: FailedAssertion("!(page->xlp_pageaddr == ptr - (ptr % 8192))",
File: "xlog.c", Line: 1813)

The values involved here are

xlp_pageaddr = 16777216
ptr = 20971520

so the page seems to be at the very beginning of the second WAL segment,
but the pointer is somewhere later. A full backtrace is attached.
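
(Spelling out the assertion arithmetic with those values, assuming the
default 8kB page and 16MB segment sizes:)

#include <stdio.h>

int
main(void)
{
	unsigned long xlp_pageaddr = 16777216;	/* 0x01000000, exactly 16MB */
	unsigned long ptr = 20971520;			/* 0x01400000, exactly 20MB */

	/* the assertion expects the header address of the page containing ptr */
	printf("expected: %lu\n", ptr - (ptr % 8192));	/* prints 20971520 */
	printf("actual:   %lu\n", xlp_pageaddr);		/* prints 16777216 */
	return 0;
}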

I'll continue investigating this, but the xlog code is not particularly
easy to understand in general, so it may take time.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Core was generated by `/home/user/pg-12/bin/postgres --single -F -O -j -c search_path=pg_catalog -c ex'.
Program terminated with signal SIGABRT, Aborted.
#0  0x7b2909ca09e5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install 
glibc-2.31-4.fc32.x86_64 libpmem-1.8-2.fc32.x86_64
(gdb) bt
#0  0x7b2909ca09e5 in raise () from /lib64/libc.so.6
#1  0x7b2909c89895 in abort () from /lib64/libc.so.6
#2  0x00a57de2 in ExceptionalCondition (conditionName=0xb232b8 
"!(page->xlp_pageaddr == ptr - (ptr % 8192))", errorType=0xb22c41 
"FailedAssertion", fileName=0xb22c3a "xlog.c", lineNumber=1813) at assert.c:54
#3  0x0055f0c4 in GetXLogBuffer (ptr=20971520) at xlog.c:1813
#4  0x0055e9c1 in CopyXLogRecordToWAL (write_len=2063, 
isLogSwitch=false, rdata=0x1265268, StartPos=20970544, EndPos=20972632) at 
xlog.c:1515
#5  0x0055df88 in XLogInsertRecord (rdata=0xdd4970 , 
fpw_lsn=20970480, flags=1 '\001') at xlog.c:1088
#6  0x0057477d in XLogInsert (rmid=10 '\n', info=0 '\000') at 
xloginsert.c:462
#7  0x004df1dc in heap_insert (relation=0x7b2900a63e28, tup=0x1361300, 
cid=1, options=0, bistate=0x0) at heapam.c:2009
#8  0x004febec in toast_save_datum (rel=0x7b2900a6fc78, 
value=135416003461200, oldexternal=0x0, options=0) at tuptoaster.c:1658
#9  0x004fd67d in toast_insert_or_update (rel=0x7b2900a6fc78, 
newtup=0x7b2900a0b050, oldtup=0x0, options=0) at tuptoaster.c:816
#10 0x004df4de in heap_prepare_insert (relation=0x7b2900a6fc78, 
tup=0x7b2900a0b050, xid=325, cid=1, options=0) at heapam.c:2088
#11 0x004deba6 in heap_insert (relation=0x7b2900a6fc78, 
tup=0x7b2900a0b050, cid=1, options=0, bistate=0x0) at heapam.c:1883
#12 0x004dff75 in simple_heap_insert (relation=0x7b2900a6fc78, 
tup=0x7b2900a0b050) at heapam.c:2386
#13 0x005902c3 in CatalogTupleInsert (heapRel=0x7b2900a6fc78, 
tup=0x7b2900a0b050) at indexing.c:189
#14 0x00886527 in InsertRule (rulname=0x13e0e30 "_RETURN", evtype=1, 
eventrel_oid=12499, evinstead=true, event_qual=0x0, action=0x13e0df8, 
replace=false) at rewriteDefine.c:143
#15 0x00887467 in DefineQueryRewrite (rulename=0x13e0e30 "_RETURN", 
event_relid=12499, event_qual=0x0, event_type=CMD_SELECT, is_instead=true, 
replace=false, action=0x13e0df8) at rewriteDefine.c:533
#16 0x006d536d in DefineViewRules (viewOid=12499, viewParse=0x1350a30, 
replace=false) at view.c:327
#17 0x006d5aba in StoreViewQuery (viewOid=12499, viewParse=0x1350a30, 
replace=false) at view.c:586
#18 0x006d504e in DefineVirtualRelation (relation=0x1345578, 
tlist=0x1334b80, replace=false, options=0x0, viewParse=0x131dbf8) at view.c:260
#19 0x006d5a49 in DefineView (stmt=0x1377f00, 
queryString=0x1385d70 "\n/*\n * 5.21\n * COLUMNS view\n */\n\nCREATE VIEW 
columns AS\nSELECT CAST(current_database() AS sql_identifier) AS 
table_catalog,\n", ' ' , "CAST(nc.nspname AS sql_identifier) 
AS table_schema,\n "..., stmt_location=0, stmt_len=6191) at view.c:565
#20 0x008e9f5a in ProcessUtilitySlow (pstate=0x1303a88, 
pstmt=0x1299388, 
queryString=0x1385d70 "\n/*\n * 5.21\n 

What to do about the broken btree_gist for inet/cidr?

2020-11-27 Thread Andreas Karlsson

Hi,

As discussed in this thread 
(https://www.postgresql.org/message-id/201010112055.o9bktzf7011...@wwwmaster.postgresql.org) 
btree_gist is broken for the inet and cidr types. The reason is that it 
uses convert_network_to_scalar() to make a lossy GiST index, but 
convert_network_to_scalar() throws away the netmask which is necessary 
to preserve the correct sort order of the keys.
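
A standalone sketch of the problem (my own illustration, not the contrib
code; the real convert_network_to_scalar() handles both address families,
but the essential point is the same):

#include <stdio.h>
#include <stdint.h>

/* Roughly what convert_network_to_scalar() does for IPv4: only the
 * address bytes contribute; the netmask length never enters the key. */
static double
inet_key(uint8_t a, uint8_t b, uint8_t c, uint8_t d, int masklen)
{
	(void) masklen;				/* thrown away: this is the problem */
	return (double) (((uint32_t) a << 24) | ((uint32_t) b << 16) |
					 ((uint32_t) c << 8) | (uint32_t) d);
}

int
main(void)
{
	/* inet sorts 10.0.0.0/8 strictly before 10.0.0.0/32, yet both
	 * collapse to the same key: */
	printf("10.0.0.0/8  -> %.0f\n", inet_key(10, 0, 0, 0, 8));
	printf("10.0.0.0/32 -> %.0f\n", inet_key(10, 0, 0, 0, 32));
	return 0;
}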


This has been broken for as long as we have had btree_gist indexes over 
inet. And personally I am not a fan of PostgreSQL shipping with known 
broken features. But it is also not obvious to me what the best way to 
fix it is.


To refresh my memory I quickly hacked together a proof of concept patch 
which adds a couple of test cases to reproduce the bug, plus uses 
basically the same btree_gist implementation as text and numeric, which 
changes the index keys from 64-bit floats to a variable-sized bytea. This 
means that the indexes are no longer lossy.


Some questions/thoughts:

Is it even worth fixing btree_gist for inet when we already have 
inet_ops, which can do more things and is not broken? We could just 
deprecate and then rip out gist_inet_ops.


If we are going to fix it, I cannot see any reasonably sized fix which 
does not also corrupt all existing indexes on inet, even those which do 
not contain any netmasks (those which contain netmasks are already 
broken anyway). Do we have some way to handle this kind of breakage nicely?


I see two ways to fix the inet indexes. The first, as in my PoC patch, is 
to make the indexes no longer lossy. This makes them more similar to how 
btree_gist works for other data types, but requires us to change the 
storage type of the index keys; since all existing indexes will break 
anyway, that might not be an issue.


The other way is to implement correct lossy indexes, using something 
similar to what network_abbrev_convert() does.


Does anyone have any thoughts on this?

Andreas
diff --git a/contrib/btree_gist/btree_gist--1.2.sql b/contrib/btree_gist/btree_gist--1.2.sql
index 1efe753043..7283c0bab7 100644
--- a/contrib/btree_gist/btree_gist--1.2.sql
+++ b/contrib/btree_gist/btree_gist--1.2.sql
@@ -1516,11 +1516,11 @@ AS 'MODULE_PATHNAME'
 LANGUAGE C IMMUTABLE STRICT;
 
 CREATE FUNCTION gbt_inet_union(internal, internal)
-RETURNS gbtreekey16
+RETURNS gbtreekey_var
 AS 'MODULE_PATHNAME'
 LANGUAGE C IMMUTABLE STRICT;
 
-CREATE FUNCTION gbt_inet_same(gbtreekey16, gbtreekey16, internal)
+CREATE FUNCTION gbt_inet_same(gbtreekey_var, gbtreekey_var, internal)
 RETURNS internal
 AS 'MODULE_PATHNAME'
 LANGUAGE C IMMUTABLE STRICT;
@@ -1537,15 +1537,15 @@ AS
 	FUNCTION	1	gbt_inet_consistent (internal, inet, int2, oid, internal),
 	FUNCTION	2	gbt_inet_union (internal, internal),
 	FUNCTION	3	gbt_inet_compress (internal),
-	FUNCTION	4	gbt_decompress (internal),
+	FUNCTION	4	gbt_var_decompress (internal),
 	FUNCTION	5	gbt_inet_penalty (internal, internal, internal),
 	FUNCTION	6	gbt_inet_picksplit (internal, internal),
-	FUNCTION	7	gbt_inet_same (gbtreekey16, gbtreekey16, internal),
-	STORAGE		gbtreekey16;
+	FUNCTION	7	gbt_inet_same (gbtreekey_var, gbtreekey_var, internal),
+	STORAGE		gbtreekey_var;
 
 ALTER OPERATOR FAMILY gist_inet_ops USING gist ADD
-	OPERATOR	6	<>  (inet, inet) ;
-	-- no fetch support, the compress function is lossy
+	OPERATOR	6	<>  (inet, inet) ,
+	FUNCTION	9 (inet, inet) gbt_var_fetch (internal) ;
 
 -- Create the operator class
 CREATE OPERATOR CLASS gist_cidr_ops
@@ -1559,12 +1559,12 @@ AS
 	FUNCTION	1	gbt_inet_consistent (internal, inet, int2, oid, internal),
 	FUNCTION	2	gbt_inet_union (internal, internal),
 	FUNCTION	3	gbt_inet_compress (internal),
-	FUNCTION	4	gbt_decompress (internal),
+	FUNCTION	4	gbt_var_decompress (internal),
 	FUNCTION	5	gbt_inet_penalty (internal, internal, internal),
 	FUNCTION	6	gbt_inet_picksplit (internal, internal),
-	FUNCTION	7	gbt_inet_same (gbtreekey16, gbtreekey16, internal),
-	STORAGE		gbtreekey16;
+	FUNCTION	7	gbt_inet_same (gbtreekey_var, gbtreekey_var, internal),
+	STORAGE		gbtreekey_var;
 
 ALTER OPERATOR FAMILY gist_cidr_ops USING gist ADD
-	OPERATOR	6	<> (inet, inet) ;
-	-- no fetch support, the compress function is lossy
+	OPERATOR	6	<> (inet, inet) ,
+	FUNCTION	9 (inet, inet) gbt_var_fetch (internal) ;
diff --git a/contrib/btree_gist/btree_gist--1.5--1.6.sql b/contrib/btree_gist/btree_gist--1.5--1.6.sql
index 5e1fcb47bd..8875f33e1c 100644
--- a/contrib/btree_gist/btree_gist--1.5--1.6.sql
+++ b/contrib/btree_gist/btree_gist--1.5--1.6.sql
@@ -167,7 +167,7 @@ ALTER FUNCTION gbt_inet_compress(internal) PARALLEL SAFE;
 ALTER FUNCTION gbt_inet_penalty(internal, internal, internal) PARALLEL SAFE;
 ALTER FUNCTION gbt_inet_picksplit(internal, internal) PARALLEL SAFE;
 ALTER FUNCTION gbt_inet_union(internal, internal) PARALLEL SAFE;
-ALTER FUNCTION gbt_inet_same(gbtreekey16, gbtreekey16, internal) PARALLEL SAFE;
+ALTER FUNCTION gbt_inet_same(gbtreekey_var, gbtreekey_var, internal) PARALLEL SAFE;
 ALTER FUNCTION 

Re: Improving spin-lock implementation on ARM.

2020-11-27 Thread Tom Lane
Peter Eisentraut  writes:
> I tried this on a M1 MacBook Air.  I cannot reproduce these results. 
> The unpatched numbers are about in the neighborhood of what you showed, 
> but the patched numbers are only about a few percent better, not the 
> 1.5x or 2x change that you showed.

After redoing the test, I can't find any outside-the-noise difference
at all between HEAD and the patch.  So clearly, I screwed up yesterday.
The most likely theory is that I managed to measure an assert-enabled
build of HEAD.

It might be that this hardware is capable of showing a difference with a
better-tuned pgbench test, but with an untuned pgbench run, we just aren't
sufficiently sensitive to the spinlock properties.  (Which I guess is good
news, really.)

One thing that did hold up is that the thermal performance of this box
is pretty ridiculous.  After being beat on for a solid hour, the fan
still hasn't turned on to any noticeable level, and the enclosure is
only a little warm to the touch.  Try that with Intel hardware ;-)

regards, tom lane




[Doc Patch] Clarify that CREATEROLE roles can GRANT default roles

2020-11-27 Thread Michael Banck
Hi,

https://www.postgresql.org/docs/current/default-roles.html mentions the
"Administrator" several times, but does not specify it further. I
understand that it is mentioned elsewhere [1], but I think it would be
beneficial to remind the reader on that page at least once that
"Administrators" includes "roles with the CREATEROLE privilege" in the
context of GRANTing and REVOKEing default roles, e.g. with the
attached patch.


Michael

[1] e.g. at https://www.postgresql.org/docs/current/sql-createrole.html
it is mentioned that CREATEROLE privs can be regarded as 
"almost-superuser-roles"

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

diff --git a/doc/src/sgml/user-manag.sgml b/doc/src/sgml/user-manag.sgml
index cc082521a2..cf1e3a948c 100644
--- a/doc/src/sgml/user-manag.sgml
+++ b/doc/src/sgml/user-manag.sgml
@@ -493,9 +493,10 @@ DROP ROLE doomed_role;
   
PostgreSQL provides a set of default roles
which provide access to certain, commonly needed, privileged capabilities
-   and information.  Administrators can GRANT these roles to users and/or
-   other roles in their environment, providing those users with access to
-   the specified capabilities and information.
+   and information.  Administrators (including roles that have the
+   CREATEROLE privilege) can GRANT these
+   roles to users and/or other roles in their environment, providing those
+   users with access to the specified capabilities and information.
   
 
   


Re: Log message for GSS connection is missing once connection authorization is successful.

2020-11-27 Thread Stephen Frost
Greetings,

* vignesh C (vignes...@gmail.com) wrote:
> Thanks for testing this, I had missed testing this. The expression
> matching was not correct. Attached v6 patch which includes the fix for
> this.

This generally looks pretty good to me.  I reworded the commit message
a bit, ran pgindent, and added the appropriate log message for the last
test (was there a reason you didn't include that..?).  In general, this
looks pretty good to commit to me.

I'll look at it again over the weekend or early next week and unless
there's objections, I'll push it.

Thanks,

Stephen
From f78100d1c7401d4d47e6ee58f1baaac5b21b1216 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Fri, 27 Nov 2020 16:52:44 -0500
Subject: [PATCH] Add GSS information to connection authorized log message

GSS information (if used) such as if the connection was authorized using
GSS or if it was encrypted using GSS, and perhaps most importantly, what
the GSS principal used for the authentication was, is extremely useful
but wasn't being included in the connection authorized log message.

Therefore, add to the connection authorized log message that
information, in a similar manner to how we log SSL information when SSL
is used for a connection.

Author: Vignesh C
Reviewed-by: Bharath Rupireddy
Discussion: https://www.postgresql.org/message-id/CALDaNm2N1385_Ltoo%3DS7VGT-ESu_bRQa-sC1wg6ikrM2L2Z49w%40mail.gmail.com
---
 src/backend/utils/init/postinit.c |  82 -
 src/test/kerberos/t/001_auth.pl   | 118 +-
 2 files changed, 114 insertions(+), 86 deletions(-)

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f2dd8e4914..82d451569d 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -245,62 +245,40 @@ PerformAuthentication(Port *port)
 
 	if (Log_connections)
 	{
+		StringInfoData logmsg;
+
+		initStringInfo(&logmsg);
 		if (am_walsender)
-		{
-#ifdef USE_SSL
-			if (port->ssl_in_use)
-ereport(LOG,
-		(port->application_name != NULL
-		 ? errmsg("replication connection authorized: user=%s application_name=%s SSL enabled (protocol=%s, cipher=%s, bits=%d, compression=%s)",
-  port->user_name,
-  port->application_name,
-  be_tls_get_version(port),
-  be_tls_get_cipher(port),
-  be_tls_get_cipher_bits(port),
-  be_tls_get_compression(port) ? _("on") : _("off"))
-		 : errmsg("replication connection authorized: user=%s SSL enabled (protocol=%s, cipher=%s, bits=%d, compression=%s)",
-  port->user_name,
-  be_tls_get_version(port),
-  be_tls_get_cipher(port),
-  be_tls_get_cipher_bits(port),
-  be_tls_get_compression(port) ? _("on") : _("off"))));
-			else
-#endif
-ereport(LOG,
-		(port->application_name != NULL
-		 ? errmsg("replication connection authorized: user=%s application_name=%s",
-  port->user_name,
-  port->application_name)
-		 : errmsg("replication connection authorized: user=%s",
-  port->user_name)));
-		}
+			appendStringInfo(&logmsg, _("replication connection authorized: user=%s"),
+			 port->user_name);
 		else
-		{
+			appendStringInfo(&logmsg, _("connection authorized: user=%s"),
+			 port->user_name);
+		if (!am_walsender)
+			appendStringInfo(&logmsg, _(" database=%s"), port->database_name);
+
+		if (port->application_name != NULL)
+			appendStringInfo(&logmsg, _(" application_name=%s"),
+			 port->application_name);
+
 #ifdef USE_SSL
-			if (port->ssl_in_use)
-ereport(LOG,
-		(port->application_name != NULL
-		 ? errmsg("connection authorized: user=%s database=%s application_name=%s SSL enabled (protocol=%s, cipher=%s, bits=%d, compression=%s)",
-  port->user_name, port->database_name, port->application_name,
-  be_tls_get_version(port),
-  be_tls_get_cipher(port),
-  be_tls_get_cipher_bits(port),
-  be_tls_get_compression(port) ? _("on") : _("off"))
-		 : errmsg("connection authorized: user=%s database=%s SSL enabled (protocol=%s, cipher=%s, bits=%d, compression=%s)",
-  port->user_name, port->database_name,
-  be_tls_get_version(port),
-  be_tls_get_cipher(port),
-  be_tls_get_cipher_bits(port),
-  be_tls_get_compression(port) ? _("on") : _("off"))));
-			else
+		if (port->ssl_in_use)
+			appendStringInfo(&logmsg, _(" SSL enabled (protocol=%s, cipher=%s, bits=%d, compression=%s)"),
+			 be_tls_get_version(port),
+			 be_tls_get_cipher(port),
+			 be_tls_get_cipher_bits(port),
+			 be_tls_get_compression(port) ? _("on") : _("off"));
 #endif
-ereport(LOG,
-		(port->application_name != NULL
-		 ? errmsg("connection authorized: user=%s database=%s application_name=%s",
-  port->user_name, port->database_name, port->application_name)
-		 : errmsg("connection authorized: user=%s database=%s",
-  port->user_name, port->database_name)));
-		}
+#ifdef ENABLE_GSS

Re: bitmaps and correlation

2020-11-27 Thread Tom Lane
Heikki Linnakangas  writes:
> Other than that, and a quick pgdindent run, this seems ready to me. I'll 
> mark it as Ready for Committer.

I dunno, this seems largely misguided to me.

It's already the case that index correlation is just not the right
stat for this purpose, since it doesn't give you much of a toehold
on whether a particular scan is going to be accessing tightly-clumped
data.  For specific kinds of index conditions, such as a range query
on a btree index, maybe you could draw that conclusion ... but this
patch isn't paying any attention to the index condition in use.

And then the rules for bitmap AND and OR correlations, if not just
plucked out of the air, still seem *far* too optimistic.  As an
example, even if my individual indexes are perfectly correlated and
so a probe would touch only one page, OR'ing ten such probes together
is likely going to touch ten different pages.  But unless I'm
misreading the patch, it's going to report back an OR correlation
that corresponds to touching one page.

Even if we assume that the correlation is nonetheless predictive of
how big a part of the table we'll be examining, I don't see a lot
of basis for deciding that the equations the patch adds to
cost_bitmap_heap_scan are the right ones.

I'd have expected this thread to focus a whole lot more on actual
examples than it has done, so that we could have some confidence
that these equations have something to do with reality.

regards, tom lane




Re: Setof RangeType returns

2020-11-27 Thread Tom Lane
Patrick Handja  writes:
> This is what I am doing:

> static int
> get_range_lower(FunctionCallInfo fcinfo, RangeType *r1)
> {
> /* Return NULL if there's no finite lower bound */
> if (empty || lower.infinite)
> PG_RETURN_NULL();

You can't really use PG_RETURN_NULL here, mainly because there is
no good value for it to return from get_range_lower().

> return (lower.val);

Here and elsewhere, you're cavalierly casting between Datum and int.
While you can get away with that as long as the SQL type you're
working with is int4, it's bad style; mainly because it's confusing,
but also because you'll have a hard time adapting the code if you
ever want to work with some other type.  Use DatumGetInt32 or
Int32GetDatum as appropriate.
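
For illustration, one way to restructure the helper along those lines (a
sketch only, against the same range API the quoted code already uses; the
bool out-parameter convention is just one possible choice):

static int32
get_range_lower(FunctionCallInfo fcinfo, RangeType *r1, bool *isnull)
{
	TypeCacheEntry *typcache;
	RangeBound	lower;
	RangeBound	upper;
	bool		empty;

	typcache = range_get_typcache(fcinfo, RangeTypeGetOid(r1));
	range_deserialize(typcache, r1, &lower, &upper, &empty);

	/* report "no finite lower bound" instead of PG_RETURN_NULL() */
	if (empty || lower.infinite)
	{
		*isnull = true;
		return 0;
	}

	*isnull = false;
	return DatumGetInt32(lower.val);	/* not a bare cast */
}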

> TypeCacheEntry *typcache;
> PG_RETURN_RANGE_P(range_serialize(typcache, , , false));

This sure appears to be passing an uninitialized typcache pointer
to range_serialize().  If your compiler isn't whining about that,
you don't have adequately paranoid warning options enabled.
That's an excellent way to waste a lot of time, as you just have.
C is an unforgiving language, so you need all the help you can get.

BTW, use of PG_RETURN_RANGE_P here isn't very appropriate either,
since the function is not declared as returning Datum.

regards, tom lane




Re: [PATCH] LWLock self-deadlock detection

2020-11-27 Thread Peter Geoghegan
On Fri, Nov 27, 2020 at 10:22 AM Heikki Linnakangas  wrote:
> I've made the mistake of forgetting to release an LWLock many times,
> leading to self-deadlock. And I just reviewed a patch that did that this
> week [1]. You usually find that mistake very quickly when you start
> testing though, I don't think I've seen it happen in production.

+1. The fact that you don't get deadlock detection with LWLocks is a
cornerstone of the whole design. This assumption is common to all
database systems (LWLocks are generally called latches in the database
research community, but the concept is exactly the same).

> So yeah, I agree it's not worth spending cycles on this. Maybe it would
> be worth it if it's really simple to check, and you only do it after
> waiting X seconds. (I haven't looked at this patch at all)

-1 for that, unless it's only for debug builds. Even if it is
practically free, it still means committing to the wrong priorities.
Plus the performance validation would be very arduous as a practical
matter.

We've made LWLocks much more scalable in the last 10 years. Why
shouldn't we expect to do the same again in the next 10 years? I
wouldn't bet against it. I might even do the opposite (predict further
improvements to LWLocks).

-- 
Peter Geoghegan




Re: Setof RangeType returns

2020-11-27 Thread Patrick Handja
Hello Heikki,

Thank you for responding to my email.
This is what I am doing:

//===================== C file =====================
static int
get_range_lower(FunctionCallInfo fcinfo, RangeType *r1)
{
TypeCacheEntry *typcache;
RangeBound lower;
RangeBound upper;
bool empty;

typcache = range_get_typcache(fcinfo, RangeTypeGetOid(r1));
range_deserialize(typcache, r1, &lower, &upper, &empty);

/* Return NULL if there's no finite lower bound */
if (empty || lower.infinite)
PG_RETURN_NULL();

return (lower.val);
}

static int
get_range_upper(FunctionCallInfo fcinfo, RangeType *r1)
{
TypeCacheEntry *typcache;
RangeBound lower;
RangeBound upper;
bool empty;

typcache = range_get_typcache(fcinfo, RangeTypeGetOid(r1));
range_deserialize(typcache, r1, &lower, &upper, &empty);

/* Return NULL if there's no finite upper bound */
if (empty || upper.infinite)
PG_RETURN_NULL();

return (upper.val);
}

static RangeType *
make_range(int start, int finish)
{
RangeBound lower;
RangeBound upper;

lower.val = (Datum) (start);
lower.infinite = false;
lower.inclusive = true;
lower.lower = true;

upper.val = (Datum) (finish);
upper.infinite = false;
upper.inclusive = false;
upper.lower = false;

if (!lower.infinite && !lower.inclusive)
{
lower.val = DirectFunctionCall2(int4pl, lower.val, Int32GetDatum(1));
lower.inclusive = true;
}

if (!upper.infinite && upper.inclusive)
{
upper.val = DirectFunctionCall2(int4pl, upper.val, Int32GetDatum(1));
upper.inclusive = false;
}

TypeCacheEntry *typcache;
PG_RETURN_RANGE_P(range_serialize(typcache, &lower, &upper, false));
}

typedef struct
{
int32 current;
int32 finish;
int32 step;
} generate_series_range_fctx;

static inline bool
control_increment(int32 a, int32 b, int32 *result)
{
int64 res = (int64) a + (int64) b;

if (res > PG_INT32_MAX || res < PG_INT32_MIN)
{
*result = 0x5EED;
return true;
}
*result = (int32) res;
return false;
}

PG_FUNCTION_INFO_V1(generate_ranges);
Datum
generate_ranges(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
generate_series_range_fctx *fctx;
MemoryContext oldcontext;

RangeType  *r1 = PG_GETARG_RANGE_P(0);
RangeType *result;
TypeCacheEntry *typcache;
typcache = range_get_typcache(fcinfo, RangeTypeGetOid(r1));
int32 lower = get_range_lower(fcinfo, r1);
int32 upper = get_range_upper(fcinfo, r1);

if (SRF_IS_FIRSTCALL())
{
int32 start =   lower;
int32 finish =  upper;
int32 step = 1;

funcctx = SRF_FIRSTCALL_INIT();
oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
fctx = (generate_series_range_fctx *)
palloc(sizeof(generate_series_range_fctx));

fctx->current = start;
fctx->finish = finish;
fctx->step = step;

funcctx->user_fctx = fctx;
MemoryContextSwitchTo(oldcontext);
}

funcctx = SRF_PERCALL_SETUP();
fctx = funcctx->user_fctx;

result = make_range(fctx->current, fctx->current+1);

if ((fctx->step > 0 && fctx->current <= fctx->finish) ||
(fctx->step < 0 && fctx->current >= fctx->finish))
{
if (control_increment(fctx->current, fctx->step, &fctx->current))
fctx->step = 0;

SRF_RETURN_NEXT(funcctx, PointerGetDatum(result));
}
else
SRF_RETURN_DONE(funcctx);
}

//===================== SQL file =====================
CREATE FUNCTION generate_ranges(anyrange) RETURNS setof anyrange
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;
//===================== Test File Expected =====================
SELECT generate_ranges(int4range(4,10));
 generate_ranges
-----------------
 [4,5)
 [5,6)
 [6,7)
 [7,8)
 [8,9)
 [9,10)
 [10,11)
(7 rows)
//=====================

Regards,

*Andjasubu Bungama, Patrick *



On Fri, 27 Nov 2020 at 04:01, Heikki Linnakangas wrote:

> On 26/11/2020 23:28, Patrick Handja wrote:
> > Hello,
> >
> > I am currently working on a library with some specific operators to
> > manipulate RangeType in PostgreSQL. I would like to know if it is
> > possible to return a setof rangetype elements in PostgreSQL in a C-language
> > function using the approach specified here:
> > https://www.postgresql.org/docs/current/xfunc-c.html#XFUNC-C-RETURN-SET
> > .
>
> > I have been trying this for days. If what I am trying to do is
> > impossible, is there any other way I can use to return a RangeType set?
>
> Yes, it is possible.
>
> I bet there's just a silly little bug or misunderstanding in your code.
> This stuff can be fiddly. Feel free to post what you have here, and I'm
> sure someone will point out where the problem is very quickly.
>
> - Heikki
>


Re: A few new options for CHECKPOINT

2020-11-27 Thread Stephen Frost
Greetings,

* Bossart, Nathan (bossa...@amazon.com) wrote:
> On 11/27/20, 8:29 AM, "Stephen Frost"  wrote:
> > Also note that, in all other cases (that is, when there *is* outstanding
> > WAL since the last checkpoint), pg_start_backup actually just waits for
> > the existing checkpoint to complete- and while it's waiting for that to
> > happen, there'll be additional WAL building up since that checkpoint
> > started that will have to be replayed as part of crash recovery, just as
> > if you took a snapshot of the system at any other time.
> >
> > So, either there won't be any WAL outstanding, in which case running a
> > CHECKPOINT FORCE just ends up creating more WAL without actually being
> > useful, or there's WAL outstanding and the only thing this does is delay
> > the snapshot being taken but doesn't actually reduce the amount of WAL
> > that's going to end up being outstanding and which will have to be
> > replayed during crash recovery.
> >
> > Maybe there's a useful reason to have these options, but at least the
> > stated one isn't it and I wouldn't want to encourage people who are
> > using snapshot-based backups to use these options since they aren't
> > going to work the way that this thread is implying they would.
> 
> I don't think it's true that pg_start_backup() just waits for the
> existing checkpoint to complete.  It calls RequestCheckpoint() with
> CHECKPOINT_WAIT, which should wait for a new checkpoint to start.

erm... right?  pg_start_backup waits for a new checkpoint to start.  I'm
confused how that's different from "waits for the existing checkpoint to
complete".  The entire point is that it *doesn't* force a checkpoint to
happen immediately.

> /* Wait for a new checkpoint to start. */
> ConditionVariablePrepareToSleep(&CheckpointerShmem->start_cv);
> for (;;)
> {
> 
> I also tried running pg_start_backup() while an automatic checkpoint
> was ongoing, and it seemed to create a new one.
> 
> psql session:
> postgres=# SELECT now(); SELECT pg_start_backup('test'); SELECT now();
>               now
> -------------------------------
>  2020-11-27 16:52:31.958124+00
> (1 row)
> 
>  pg_start_backup
> -----------------
>  0/D3D24F0
> (1 row)
> 
>               now
> -------------------------------
>  2020-11-27 16:52:50.113372+00
> (1 row)
> 
> logs:
> 2020-11-27 16:52:20.129 UTC [16029] LOG:  checkpoint starting: time
> 2020-11-27 16:52:35.121 UTC [16029] LOG:  checkpoint complete...
> 2020-11-27 16:52:35.122 UTC [16029] LOG:  checkpoint starting: force wait
> 2020-11-27 16:52:50.110 UTC [16029] LOG:  checkpoint complete...

Yes- pg_start_backup specifies 'force' because it wants a checkpoint to
happen even if there isn't any outstanding WAL, but it doesn't make the
existing checkpoint go faster, which is the point that I'm trying to
make here.

If you'd really like to test and see what happens, run a pgbench that
loads the system up while doing pg_start_backup and see how long it
takes before pg_start_backup returns, and see how much outstanding WAL
there is from the starting checkpoint from pg_start_backup and the time
it returns.  To make it really clear, you should really also set
checkpoint_completion_target to 0.9 and make sure your max_wal_size is
set to a pretty large value, and make sure that the pgbench is
generating enough to have pretty large checkpoints while still allowing
them to happen due to 'time'.

> The patch I've submitted does the same thing.

Yes, I'm not surprised by that.  That doesn't change anything about the
point I'm trying to make..

> Even if it did simply wait for the existing checkpoint to complete,
> isn't it still preferable to take a snapshot right after a checkpoint
> completes, even if it is non-immediate?  You'll need to replay WAL in
> either case, and it's true that you could need to replay less WAL if
> you take an immediate checkpoint versus a non-immediate checkpoint.

Why is it preferable?  Your argument here is that it's preferable
because there'll be less outstanding WAL, but what I'm pointing out is
that that's not the case, so that isn't a reason for it to be preferable
and so I'm asking: what other reason is it preferable..?  I'm not aware
of one..

> However, if you take a snapshot without a checkpoint, you might need
> to replay up to checkpoint_timeout + (time it takes for a non-
> immediate checkpoint to complete) worth of WAL.  For the logs just
> above this paragraph, if I take a snapshot at 16:47:04, I'd need to
> replay 29 seconds of WAL.  However, if I take the snapshot at
> 16:47:06, I only need to replay 16 seconds of WAL.

This is exactly the point I'm making- if you're using a deferred
checkpoint then you're still 

Re: proposal: possibility to read dumped table's name from file

2020-11-27 Thread Stephen Frost
Greetings,

* Pavel Stehule (pavel.steh...@gmail.com) wrote:
> > I agree that being able to configure pg_dump via a config file would
> > be very useful, but the syntax proposed here feels much more like a
> > hacked-up syntax designed to meet this one use case, rather than a
> > good general-purpose design that can be easily extended.
> 
> Nobody sent a real use case for introducing the config file. There was a
> discussion about formats, and you introduce other dimensions and
> variability.

I'm a bit baffled by this because it seems abundantly clear to me that
being able to have a config file for pg_dump would be extremely helpful.
There's no shortage of times that I've had to hack up a shell script and
figure out quoting and set up the right set of options for pg_dump,
resulting in things like:

pg_dump \
  --host=myserver.com \
  --username=postgres \
  --schema=public \
  --schema=myschema \
  --no-comments \
  --no-tablespaces \
  --file=somedir \
  --format=d \
  --jobs=5

which really is pretty grotty.  Being able to have a config file that
has proper comments would be much better, and we could start to extend it to
things like "please export schema A to directory A, schema B to
directory B" and other ways of selecting source and destination, and
imagine if we could validate it too, eg:

pg_dump --config=whatever --dry-run

or --check-config maybe.

This isn't a new concept either- export and import tools for other
databases have similar support, eg: Oracle's imp/exp tool, mysqldump
(see: https://dev.mysql.com/doc/refman/8.0/en/option-files.html which
has a TOML-looking format too), pgloader of course has a config file,
etc.  We certainly aren't in novel territory here.

Thanks,

Stephen




Re: [PATCH] LWLock self-deadlock detection

2020-11-27 Thread Heikki Linnakangas

On 26/11/2020 04:50, Tom Lane wrote:

Craig Ringer  writes:

On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat wrote:

I'd prefer to make the lock self deadlock check run for production
builds, not just cassert builds.


I'd like to register a strong objection to spending any cycles whatsoever
on this.  If you're using LWLocks in a way that could deadlock, you're
doing it wrong.  The entire point of that mechanism is to be Light Weight
and not have all the overhead that the heavyweight lock mechanism has.
Building in deadlock checks and such is exactly the wrong direction.
If you think you need that, you should be using a heavyweight lock.

Maybe there's some case for a cassert-only check of this sort, but running
it in production is right out.

I'd also opine that checking for self-deadlock, but not any more general
deadlock, seems pretty pointless.  Careless coding is far more likely
to create opportunities for the latter.  (Thus, I have little use for
the existing assertions of this sort, either.)


I've made the mistake of forgetting to release an LWLock many times, 
leading to self-deadlock. And I just reviewed a patch that did that this 
week [1]. You usually find that mistake very quickly when you start 
testing though, I don't think I've seen it happen in production.
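
(For anyone who hasn't hit it, the mistake typically looks like this
hypothetical extension fragment: an early return leaves the lock held,
and the next acquisition waits forever, because LWLocks do no deadlock
detection:)

#include "postgres.h"
#include "storage/lwlock.h"

static int	counter = 0;		/* stand-in for some shared state */

static bool
bump_counter(int delta)
{
	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
	if (delta == 0)
		return false;			/* BUG: returns with the lock still held */
	counter += delta;
	LWLockRelease(AddinShmemInitLock);
	return true;
}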


So yeah, I agree it's not worth spending cycles on this. Maybe it would 
be worth it if it's really simple to check, and you only do it after 
waiting X seconds. (I haven't looked at this patch at all)


[1] 
https://www.postgresql.org/message-id/8e1cda9b-1cb9-a634-d383-f757bf67b...@iki.fi


- Heikki




Re: configure and DocBook XML

2020-11-27 Thread Paul Förster
Hi Tom,

> On 27. Nov, 2020, at 18:54, Tom Lane  wrote:
> 
> See the other thread I pointed to.

ok, thanks.

Cheers,
Paul




Re: configure and DocBook XML

2020-11-27 Thread Tom Lane
Paul Förster writes:
> ok, last question: is there an option for configure to not check for DocBook? 
> It's not installed and PostgreSQL doesn't need it to compile and run just 
> fine.

See the other thread I pointed to.

regards, tom lane




Re: A few new options for CHECKPOINT

2020-11-27 Thread Bossart, Nathan
On 11/27/20, 8:29 AM, "Stephen Frost"  wrote:
> Also note that, in all other cases (that is, when there *is* outstanding
> WAL since the last checkpoint), pg_start_backup actually just waits for
> the existing checkpoint to complete- and while it's waiting for that to
> happen, there'll be additional WAL building up since that checkpoint
> started that will have to be replayed as part of crash recovery, just as
> if you took a snapshot of the system at any other time.
>
> So, either there won't be any WAL outstanding, in which case running a
> CHECKPOINT FORCE just ends up creating more WAL without actually being
> useful, or there's WAL outstanding and the only thing this does is delay
> the snapshot being taken but doesn't actually reduce the amount of WAL
> that's going to end up being outstanding and which will have to be
> replayed during crash recovery.
>
> Maybe there's a useful reason to have these options, but at least the
> stated one isn't it and I wouldn't want to encourage people who are
> using snapshot-based backups to use these options since they aren't
> going to work the way that this thread is implying they would.

I don't think it's true that pg_start_backup() just waits for the
existing checkpoint to complete.  It calls RequestCheckpoint() with
CHECKPOINT_WAIT, which should wait for a new checkpoint to start.

/* Wait for a new checkpoint to start. */
ConditionVariablePrepareToSleep(&CheckpointerShmem->start_cv);
for (;;)
{

I also tried running pg_start_backup() while an automatic checkpoint
was ongoing, and it seemed to create a new one.

psql session:
postgres=# SELECT now(); SELECT pg_start_backup('test'); SELECT now();
              now
-------------------------------
 2020-11-27 16:52:31.958124+00
(1 row)

 pg_start_backup
-----------------
 0/D3D24F0
(1 row)

              now
-------------------------------
 2020-11-27 16:52:50.113372+00
(1 row)

logs:
2020-11-27 16:52:20.129 UTC [16029] LOG:  checkpoint starting: time
2020-11-27 16:52:35.121 UTC [16029] LOG:  checkpoint complete...
2020-11-27 16:52:35.122 UTC [16029] LOG:  checkpoint starting: force wait
2020-11-27 16:52:50.110 UTC [16029] LOG:  checkpoint complete...

The patch I've submitted does the same thing.

psql session:
postgres=# SELECT now(); CHECKPOINT (FAST FALSE); SELECT now();
              now
-------------------------------
 2020-11-27 16:46:39.346131+00
(1 row)

CHECKPOINT
              now
-------------------------------
 2020-11-27 16:47:05.083944+00
(1 row)

logs:
2020-11-27 16:46:35.056 UTC [16029] LOG:  checkpoint starting: time
2020-11-27 16:46:50.099 UTC [16029] LOG:  checkpoint complete...
2020-11-27 16:46:50.099 UTC [16029] LOG:  checkpoint starting: force wait
2020-11-27 16:47:05.083 UTC [16029] LOG:  checkpoint complete...

Even if it did simply wait for the existing checkpoint to complete,
isn't it still preferable to take a snapshot right after a checkpoint
completes, even if it is non-immediate?  You'll need to replay WAL in
either case, and it's true that you could need to replay less WAL if
you take an immediate checkpoint versus a non-immediate checkpoint.
However, if you take a snapshot without a checkpoint, you might need
to replay up to checkpoint_timeout + (time it takes for a non-
immediate checkpoint to complete) worth of WAL.  For the logs just
above this paragraph, if I take a snapshot at 16:47:04, I'd need to
replay 29 seconds of WAL.  However, if I take the snapshot at
16:47:06, I only need to replay 16 seconds of WAL.

I apologize if I'm missing something obvious here.

Nathan



Re: configure and DocBook XML

2020-11-27 Thread Paul Förster
Hi Tom,

> On 27. Nov, 2020, at 16:29, Tom Lane  wrote:
> 
> To me, it makes sense to have an option to do that, but I do find it
> surprising that it's the default.

ok, last question: is there an option for configure to not check for DocBook? 
It's not installed and PostgreSQL doesn't need it to compile and run just fine.

As I said, it's not an annoyance or something as I only need to build new 
binaries after the versions (major and minor) are released, which is not often 
enough to call it an annoyance. :-) I just would like to switch it off if 
possible, but it's ok if it's not.

Cheers,
Paul



Re: bitmaps and correlation

2020-11-27 Thread Heikki Linnakangas

On 06/11/2020 19:57, Justin Pryzby wrote:

On Fri, Nov 06, 2020 at 01:51:26PM +, Anastasia Lubennikova wrote:

The first patch is simply a refactoring and I don't see any possible
objections against it.
The second patch also looks fine to me. The logic is understandable and the
code is neat.


+1


It wouldn't hurt to add a comment for this computation, though.
+   pages_fetched = pages_fetchedMAX + 
indexCorrelation*indexCorrelation*(pages_fetchedMIN - pages_fetchedMAX);


You're right.  It's like this:
// interpolate between c==0: pages_fetched=max and c==1: pages_fetched=min
pages_fetched = min + (max-min)*(1-c**2)
pages_fetched = min + max*(1-c**2) - min*(1-c**2)
pages_fetched = max*(1-c**2) + min - min*(1-c**2)
pages_fetched = max*(1-c**2) + min*(c**2)
pages_fetched = max - max*c**2 + min*(c**2)
pages_fetched = max + min*(c**2) - max*c**2
pages_fetched = max + c**2 * (min-max)

I'm not sure if there's a reason why it's written like that, but (min-max)
looks odd, so I wrote it like:
pages_fetched = max - c**2 * (max-min)
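
(A quick numeric sanity check of the interpolation, with made-up page
counts: 10 pages if perfectly correlated, 1000 if uncorrelated:)

#include <stdio.h>

int
main(void)
{
	double		min = 10.0,
				max = 1000.0;

	for (double c = 0.0; c <= 1.0; c += 0.25)
		printf("c=%.2f -> %.1f pages\n", c, max + c * c * (min - max));
	return 0;
}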


I agree min-max looks odd. max - c**2 * (max-min) looks a bit odd too, 
though. Whatever we do here, though, I'd suggest that we keep it 
consistent with cost_index().


Other than that, and a quick pgdindent run, this seems ready to me. I'll 
mark it as Ready for Committer.


- Heikki




Re: configure and DocBook XML

2020-11-27 Thread Peter Eisentraut

On 2020-11-26 17:48, Tom Lane wrote:

Paul Förster writes:

On 26. Nov, 2020, at 17:21, Tom Lane  wrote:

There's a nearby thread in which I was suggesting that we should just
not bother with this configure test [1].
[1] 
https://www.postgresql.org/message-id/flat/E2EE6B76-2D96-408A-B961-CAE47D1A86F0%40yesql.se



I haven't installed DocBook at all. So the check for DocBook naturally always 
fails. Could that be the reason?


If you don't have the docbook stylesheets, but you do have xmllint,
configure's probe will cause xmllint to try to download those
stylesheets off the net.  For me, that always succeeds, but it
takes two or three seconds.  I find it curious that it seems to be
timing out for you.


Correction: xmllint is interested in the DocBook XML DTD, which is 
downloadable from 
.  This is what 
might be in a package named "docbook" or "docbook-xml".  xsltproc is 
interested in the DocBook XSLT stylesheets, which are at 
, or 
locally in a package named something like "docbook-xsl".


AFAICT, configure only runs an xmllint test, so your download issues (at 
that point) are likely related to the DTD, not the stylesheets.





Re: proposal: possibility to read dumped table's name from file

2020-11-27 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Dean Rasheed  writes:
> > Actually, that raises a different possible benefit of passing options
> > in an options file -- if the user wants to pass in a table name
> > pattern, it can be a nuisance if the shell's argument processing does
> > additional unwanted things like globbing and environment variable
> > substitutions. Using an options file could provide a handy way to
> > ensure that any option values are interpreted exactly as written,
> > without any additional mangling.
> 
> Huh?  Any format we might devise, or borrow, will have to have some
> kind of escaping/quoting convention.  The idea that "we don't need
> that" tends to lead to very ugly workarounds later.

Agreed.

> I do agree that the shell's quoting conventions are pretty messy
> and so those aren't the ones we should borrow.  We could do a lot
> worse than to use some established data format like JSON or YAML.
> Given that we already have src/common/jsonapi.c, it seems like
> JSON would be the better choice of those two.

JSON doesn't support comments, something that's really useful to have in
configuration files, so I don't agree that it's a sensible thing to use
in this case.  JSON also isn't very forgiving, which is also
unfortunate and makes for a poor choice.

This is why I was suggesting TOML up-thread, which is MIT licensed, has
been around for a number of years, supports comments, has sensible
quoting that's easier to deal with than the shell, and has a C (C99)
implementation.  It's also used in quite a few other projects.

In a quick look, I suspect it might also be something that could be used
to replace our existing hand-hacked postgresql.conf parser and give us
the ability to handle things a bit cleaner there too...

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-27 Thread Alvaro Herrera
Actually, I noticed two things.  The first of them, addressed in this
new version of the patch, is that REINDEX CONCURRENTLY is doing a lot of
repetitive work by reopening each index and table in the build/validate
loops, so that they can report progress.  This is easy to remedy by
adding a couple more members to the new struct (which I also renamed to
ReindexIndexInfo), for tableId and amId.  The code seems a bit simpler
this way.

The other thing is that ReindexRelationConcurrently seems to always be
called with the relations already locked by RangeVarGetRelidExtended.
So claiming to acquire locks on the relations over and over is
pointless.  (I only noticed this because there was an obvious deadlock
hazard in one of the loops, which locked the index before the table.)  I
think we should reduce all those to NoLock.  My patch does not do that.
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca24620fd0..a52cb16b5e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -114,6 +114,17 @@ typedef struct ReindexErrorInfo
 	char		relkind;
 } ReindexErrorInfo;
 
+/*
+ * Index to process in ReindexRelationConcurrently
+ */
+typedef struct ReindexIndexInfo
+{
+	Oid			indexId;
+	Oid			tableId;
+	Oid			amId;
+	bool		safe;			/* for set_indexsafe_procflags */
+} ReindexIndexInfo;
+
 /*
  * CheckIndexCompatible
  *		Determine whether an existing index definition is compatible with a
@@ -385,7 +396,7 @@ CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY
+ * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
  * on indexes that are neither expressional nor partial are also safe to
  * ignore, since we know that those processes won't examine any data
  * outside the table they're indexing.
@@ -1564,9 +1575,11 @@ DefineIndex(Oid relationId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+	/*
+	 * This transaction doesn't need to set the PROC_IN_SAFE_IC flag, because
+	 * it only takes a snapshot to do some catalog manipulations, after the
+	 * wait is over.
+	 */
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -3132,10 +3145,16 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		get_rel_name(cellOid;
 	else
 	{
+		ReindexIndexInfo *idx;
+
 		/* Save the list of relation OIDs in private context */
 		oldcontext = MemoryContextSwitchTo(private_context);
 
-		indexIds = lappend_oid(indexIds, cellOid);
+		idx = palloc(sizeof(ReindexIndexInfo));
+		idx->indexId = cellOid;
+		/* other fields set later */
+
+		indexIds = lappend(indexIds, idx);
 
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -3172,13 +3191,18 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 			get_rel_name(cellOid)));
 		else
 		{
+			ReindexIndexInfo *idx;
+
 			/*
 			 * Save the list of relation OIDs in private
 			 * context
 			 */
 			oldcontext = MemoryContextSwitchTo(private_context);
 
-			indexIds = lappend_oid(indexIds, cellOid);
+			idx = palloc(sizeof(ReindexIndexInfo));
+			idx->indexId = cellOid;
+			indexIds = lappend(indexIds, idx);
+			/* other fields set later */
 
 			MemoryContextSwitchTo(oldcontext);
 		}
@@ -3197,6 +3221,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 Oid			heapId = IndexGetRelation(relationOid,
 	  (options & REINDEXOPT_MISSING_OK) != 0);
 Relation	heapRelation;
+ReindexIndexInfo *idx;
 
 /* if relation is missing, leave */
 if (!OidIsValid(heapId))
@@ -3247,7 +3272,10 @@ ReindexRelationConcurrently(Oid relationOid, int options)
  * Save the list of relation OIDs in private context.  Note
  * that invalid indexes are allowed here.
  */
-indexIds = lappend_oid(indexIds, relationOid);
+idx = palloc(sizeof(ReindexIndexInfo));
+idx->indexId = relationOid;
+indexIds = lappend(indexIds, idx);
+/* other fields set later */
 
 MemoryContextSwitchTo(oldcontext);
 break;
@@ -3306,31 +3334,40 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
-		Oid			indexId = lfirst_oid(lc);
+		ReindexIndexInfo *idx = lfirst(lc);
+		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Relation	newIndexRel;
 		LockRelId  *lockrelid;
 
-		indexRel = index_open(indexId, ShareUpdateExclusiveLock);
+		/* XXX pointless lock acquisition */
+		indexRel = 

Re: Removing unneeded self joins

2020-11-27 Thread Heikki Linnakangas

On 31/10/2020 11:26, Andrey V. Lepikhov wrote:

+			/*
+			 * Process restrictlist to separate out the self join quals from
+			 * the other quals. e.g x = x goes to selfjoinquals and a = b to
+			 * otherjoinquals.
+			 */
+			split_selfjoin_quals(root, restrictlist, &selfjoinquals,
+								 &otherjoinquals);
+
+			if (list_length(selfjoinquals) == 0)
+			{
+				/*
+				 * Have a chance to remove join if target list contains vars
+				 * from only one relation.
+				 */
+				if (list_length(otherjoinquals) == 0)
+				{
+					/* Can't determine uniqueness without any quals. */
+					continue;
+				}
+				else if (!tlist_contains_rel_exprs(root, joinrelids, inner))
+				{
+					if (!innerrel_is_unique(root, joinrelids, outer->relids,
+											inner, JOIN_INNER, otherjoinquals,
+											false))
+						continue;
+				}
+				else
+					/*
+					 * The target list contains vars from both inner and
+					 * outer relations.
+					 */
+					continue;
+			}
+
+			/*
+			 * Determine if the inner table can duplicate outer rows.  We must
+			 * bypass the unique rel cache here since we're possibly using a
+			 * subset of join quals. We can use 'force_cache' = true when all
+			 * join quals are selfjoin quals.  Otherwise we could end up
+			 * putting false negatives in the cache.
+			 */
+			else if (!innerrel_is_unique(root, joinrelids, outer->relids,
+										 inner, JOIN_INNER, selfjoinquals,
+										 list_length(otherjoinquals) == 0))
+				continue;


I don't understand the logic here. If 'selfjoinquals' is empty, it means 
that there is no join qual between the two relations, right? How can we 
ever remove the join in that case? And how does the target list affect 
that? Can you give an example query of that?



--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4553,11 +4553,13 @@ explain (costs off)
 select p.* from
   (parent p left join child c on (p.k = c.k)) join parent x on p.k = x.k
   where p.k = 1 and p.k = 2;
-        QUERY PLAN        
---------------------------
+                   QUERY PLAN                   
+-------------------------------------------------
  Result
    One-Time Filter: false
-(2 rows)
+   ->  Index Scan using parent_pkey on parent x
+         Index Cond: (k = 1)
+(4 rows)
 
 -- bug 5255: this is not optimizable by join removal

 begin;


That doesn't seem like an improvement...

My general impression of this patch is that it's a lot of code to deal 
with a rare special case. At the beginning of this thread there was 
discussion on the overhead that this might add to planning queries that 
don't benefit, but adding a lot of code isn't nice either, even if the 
performance is acceptable. That's not necessarily a show-stopper, but it 
does make me much less excited about this. I'm not sure what to suggest 
to do about that, except a really vague "Can you make it simpler?"


- Heikki




Re: A few new options for CHECKPOINT

2020-11-27 Thread Stephen Frost
Greetings,

* Bossart, Nathan (bossa...@amazon.com) wrote:
> Thanks to all for the feedback.  I've attached v2 of the patch.  I've
> removed the WAIT and FORCE options and renamed IMMEDIATE to FAST.
> 
> On 11/25/20, 7:52 AM, "Stephen Frost"  wrote:
> > I'm a bit confused by the idea here..  The whole point of running a
> > CHECKPOINT is to get the immediate behavior with the IO spike to get
> > things flushed out to disk so that, on crash recovery, there's less
> > outstanding WAL to replay.
> >
> > Avoiding the IO spike implies that you're waiting for a regular
> > checkpoint and that additional WAL is building up since that started and
> > therefore you're going to have to replay that WAL during crash recovery
> > and so you won't end up reducing your recovery time, so I'm failing to
> > see the point..?  I don't think you really get to have both..  pay the
> > price at backup time, or pay it during crash recovery.
> 
> It's true that you might not lower recovery time as much as if you did
> an immediate checkpoint, but presumably there can still be some
> benefit from doing a non-immediate checkpoint.  I think a similar
> argument can be made for pg_start_backup(), which AFAICT is presently
> the only way to manually request a non-immediate checkpoint.

If there isn't any actual outstanding WAL since the last checkpoint, the
only time that the fact that pg_start_backup includes CHECKPOINT_FORCE
is relevant, then performing a checkpoint isn't going to actually reduce
your crash recovery.

Also note that, in all other cases (that is, when there *is* outstanding
WAL since the last checkpoint), pg_start_backup actually just waits for
the existing checkpoint to complete- and while it's waiting for that to
happen, there'll be additional WAL building up since that checkpoint
started that will have to be replayed as part of crash recovery, just as
if you took a snapshot of the system at any other time.

So, either there won't be any WAL outstanding, in which case running a
CHECKPOINT FORCE just ends up creating more WAL without actually being
useful, or there's WAL outstanding and the only thing this does is delay
the snapshot being taken but doesn't actually reduce the amount of WAL
that's going to end up being outstanding and which will have to be
replayed during crash recovery.

Maybe there's a useful reason to have these options, but at least the
stated one isn't it and I wouldn't want to encourage people who are
using snapshot-based backups to use these options since they aren't
going to work the way that this thread is implying they would.

Thanks,

Stephen




Re: Printing LSN made easy

2020-11-27 Thread Li Japin
Hi,

Here, we cannot use sizeof(buf) to get the buffer size, because buf is a
pointer, so it is always 8 bytes on a 64-bit machine or 4 bytes on a
32-bit machine.

+char *
+pg_lsn_out_buffer(XLogRecPtr lsn, char *buf)
+{
+   snprintf(buf, sizeof(buf), LSN_FORMAT, LSN_FORMAT_ARG(lsn));
+
+   return buf;
+}
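
For illustration, a minimal fix along those lines, assuming the caller
always passes the (MAXPG_LSNLEN + 1)-byte array that the commit message
describes:

char *
pg_lsn_out_buffer(XLogRecPtr lsn, char *buf)
{
	/*
	 * sizeof(buf) would be the pointer size here, so spell out the real
	 * buffer size; buf must hold MAXPG_LSNLEN characters plus the NUL.
	 */
	snprintf(buf, MAXPG_LSNLEN + 1, LSN_FORMAT, LSN_FORMAT_ARG(lsn));

	return buf;
}

An explicit size argument would be an alternative that makes the contract
visible at every call site.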

--
Best regards
Japin Li
ChengDu WenWu Information Technolog Co.,Ltd.


> On Nov 27, 2020, at 10:24 PM, Alexey Kondratov  
> wrote:
> 
> Hi,
> 
> On 2020-11-27 13:40, Ashutosh Bapat wrote:
>> Off list Peter Eisentraut pointed out that we can not use these macros
>> in elog/ereport since it creates problems for translations. He
>> suggested adding functions which return strings and use %s when doing
>> so.
>> The patch has two functions pg_lsn_out_internal() which takes an LSN
>> as input and returns a palloc'ed string containing the string
>> representation of LSN. This may not be suitable in performance
>> critical paths and also may leak memory if not freed. So there's
>> another function pg_lsn_out_buffer() which takes LSN and a char array
>> as input, fills the char array with the string representation and
>> returns the pointer to the char array. This allows the function to be
>> used as an argument in printf/elog etc. Macro MAXPG_LSNLEN has been
>> externalized for this purpose.
> 
> If usage of macros in elog/ereport can cause problems for translation, then 
> even with this patch life does not get significantly simpler. For example, 
> instead of just doing something like:
> 
> elog(WARNING,
> - "xlog min recovery request %X/%X is past current point 
> %X/%X",
> - (uint32) (lsn >> 32), (uint32) lsn,
> - (uint32) (newMinRecoveryPoint >> 32),
> - (uint32) newMinRecoveryPoint);
> + "xlog min recovery request " LSN_FORMAT " is past current 
> point " LSN_FORMAT,
> + LSN_FORMAT_ARG(lsn),
> + LSN_FORMAT_ARG(newMinRecoveryPoint));
> 
> we have to either declare two additional local buffers, which is verbose; or 
> use pg_lsn_out_internal() and rely on memory contexts (or do pfree() 
> manually, which is verbose again) to prevent memory leaks.
> 
>> Off list Craig Ringer suggested introducing a new format specifier
>> similar to %m for LSN but I did not get time to take a look at the
>> relevant code. AFAIU it's available only to elog/ereport, so may not
>> be useful generally. But teaching printf variants about the new format
>> would be the best solution. However, I didn't find any way to do that.
> 
> It seems that this topic has been extensively discussed off-list, but still 
> strong +1 for the patch. I always wanted LSN printing to be more concise.
> 
> I have just tried new printing utilities in a couple of new places and it 
> looks good to me.
> 
> +char *
> +pg_lsn_out_internal(XLogRecPtr lsn)
> +{
> + charbuf[MAXPG_LSNLEN + 1];
> +
> + snprintf(buf, sizeof(buf), LSN_FORMAT, LSN_FORMAT_ARG(lsn));
> +
> + return pstrdup(buf);
> +}
> 
> Would it be a bit more straightforward if we palloc buf initially and just 
> return a pointer instead of doing pstrdup()?
> 
> 
> Regards
> -- 
> Alexey Kondratov
> 
> Postgres Professional https://www.postgrespro.com
> Russian Postgres Company





Re: Online verification of checksums

2020-11-27 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier  wrote:
> > On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > > We are not just looking at one LSN value. Here are the steps we are
> > > proposing (I'll skip checks for zero pages here):
> > >
> > > 1) Test the page checksum. If it passes the page is OK.
> > > 2) If the checksum does not pass then record the page offset and LSN and
> > > continue.
> >
> > But here the checksum is broken, so while the offset is something we
> > can rely on how do you make sure that the LSN is fine?  A broken
> > checksum could perfectly mean that the LSN is actually *not* fine if
> > the page header got corrupted.

Of course that could be the case, but it gets to be a smaller and
smaller chance by checking that the LSN read falls within reasonable
bounds.

> > > 3) After the file is copied, reopen and reread the file, seeking to 
> > > offsets
> > > where possible invalid pages were recorded in the first pass.
> > > a) If the page is now valid then it is OK.
> > > b) If the page is not valid but the LSN has increased from the LSN
> >
> > Per se the previous point about the LSN value that we cannot rely on.
> 
> We cannot rely on the LSN itself. But it's a lot more likely that we
> can rely on the LSN changing, and on the LSN changing in a "correct
> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
> *doesn't change* we know it's corrupt. But if the LSN *increases* AND
> the new page now has a correct checksum, it's very most likely to be
> correct. You could perhaps even put cap on it saying "if the LSN
> increased, but less than ", where  is a sufficiently high number
> that it's entirely unreasonable to advanced that far between the
> reading of two blocks. But it has to have a very high margin in that
> case.

This is, in fact, included in what was proposed- the "max increase"
would be "the ending LSN of the backup".  I don't think we can make it
any tighter than that though without risking false positives, which is
surely worse than a false negative in this particular case- we already
risk false negatives due to the fact that our checksum isn't perfect, so
even a perfect check to make sure that the page will, in fact, be
replayed over during crash recovery doesn't guarantee that there's no
corruption.

> > > A malicious attacker could easily trick these checks, but as Stephen 
> > > pointed
> > > out elsewhere they would likely make the checksums valid which would 
> > > escape
> > > detection anyway.
> > >
> > > We believe that the chances of random storage corruption passing all these
> > > checks is incredibly small, but eventually we'll also check against the 
> > > WAL
> > > to be completely sure.
> >
> > The lack of check for any concurrent I/O on the follow-up retries is
> > disturbing.  How do you guarantee that on the second retry what you
> > have is a torn page and not something corrupted?  Init forks for
> > example are made of up to 2 blocks, so the window would get short for
> > at least those.  There are many instances with tables that have few
> > pages as well.

If there's an easy and cheap way to see if there was concurrent i/o
happening for the page, then let's hear it.  One idea that has occurred
to me which hasn't been discussed is checking the file's mtime to see if
it's changed since the backup started.  In that case, I would think it'd
be something like:

- Checksum is invalid
- LSN is within range
- Close file
- Stat file
- If mtime is from before the backup then signal possible corruption

If the checksum is invalid and the LSN isn't in range, then signal
corruption.
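
For illustration, the same decision tree in C; page_checksum_ok(),
page_lsn() and the backup_* inputs are hypothetical placeholders, not
any real backup tool's API:

#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>
#include <time.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's XLogRecPtr */

extern bool page_checksum_ok(const char *page);	/* hypothetical helper */
extern XLogRecPtr page_lsn(const char *page);	/* hypothetical helper */

/*
 * Decide whether a page that failed its first checksum can be accepted
 * after the re-read, or whether possible corruption should be signalled.
 */
static bool
page_recheck_passes(const char *path, const char *page,
					XLogRecPtr backup_start_lsn, XLogRecPtr backup_end_lsn,
					time_t backup_start_time)
{
	struct stat st;
	XLogRecPtr	lsn;

	if (page_checksum_ok(page))
		return true;			/* re-read produced a valid page */

	lsn = page_lsn(page);
	if (lsn < backup_start_lsn || lsn > backup_end_lsn)
		return false;			/* checksum bad and LSN implausible */

	/* Checksum bad but LSN in range: was the file written to meanwhile? */
	if (stat(path, &st) != 0)
		return false;			/* cannot tell, be conservative */
	if (st.st_mtime < backup_start_time)
		return false;			/* no concurrent write, yet bad checksum */

	return true;				/* plausibly a torn page from concurrent I/O */
}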

In general, however, I don't like the idea of reaching into PG and
asking PG for this page.

> Here I was more worried that the window might get *too long* if tables
> are large :)

I'm not sure that there's really a 'too long' possibility here.

> The risk is certainly that you get a torn page *again* on the second
> read. It could be the same torn page (if it hasn't changed), but you
> can detect that (by the fact that it hasn't actually changed) and
> possibly do a short delay before trying again if it gets that far.

I'm really not a fan of introducing these delays in the hopes that
they'll work..

> That could happen if the process is too quick. It could also be that
> you are unlucky and that you hit a *new* write, and you were so
> unlucky that both times it happened to hit exactly when you were
> reading the page the next time. I'm not sure the chance of that
> happening is even big enough we have to care about it, though?

If there's actually a new write, surely the LSN would be new?  At the
least, it wouldn't be the same LSN as the first read that picked up a
torn page.

In general though, I agree, we are getting to the point here where the
chances of missing something with this approach seems extremely slim.  I
do still like the idea of doing better by actually 

Re: parallel distinct union and aggregate support patch

2020-11-27 Thread Heikki Linnakangas

I also had a quick look at the patch and the comments made so far. Summary:

1. The performance results are promising.

2. The code needs comments.

Regarding the design:

Thomas Munro mentioned the idea of a "Parallel Repartition" node that 
would redistribute tuples like this. As I understand it, the difference 
is that this BatchSort implementation collects all tuples in a tuplesort 
or a tuplestore, while a Parallel Repartition node would just 
redistribute the tuples to the workers, without buffering. The receiving 
worker could put the tuples to a tuplestore or sort if needed.


I think a non-buffering Repartition node would be simpler, and thus 
better. In these patches, you have a BatchSort node, and batchstore, but 
a simple Parallel Repartition node could do both. For example, to 
implement distinct:


Gather
  -> Unique
     -> Sort
        -> Parallel Redistribute
           -> Parallel Seq Scan

And a Hash Agg would look like this:

Gather
  -> Hash Agg
     -> Parallel Redistribute
        -> Parallel Seq Scan


I'm marking this as Waiting on Author in the commitfest.

- Heikki




Re: configure and DocBook XML

2020-11-27 Thread Tom Lane
Paul Förster  writes:
> So why would xmllint try to download DocBook XML stylesheets if DocBook is 
> not installed? I'm not a developer but such a thing doesn't make sense to me.

You'd have to ask the authors of those programs.

To me, it makes sense to have an option to do that, but I do find it
surprising that it's the default.

regards, tom lane




Re: proposal: unescape_text function

2020-11-27 Thread Peter Eisentraut

On 2020-10-07 11:00, Pavel Stehule wrote:

Since the idea originated from unescaping unicode string
literals i.e.
        select unescape('Odpov\u011Bdn\u00E1 osoba');

Shouldn't the built-in function support the above syntax as well?


good idea. The prefixes u (4 digits) and U (8 digits) are supported


I don't really get the point of this function.  There is AFAICT no 
function to produce this escaped format, and it's not a recognized 
interchange format.  So under what circumstances would one need to use this?





Re: Printing LSN made easy

2020-11-27 Thread Alexey Kondratov

Hi,

On 2020-11-27 13:40, Ashutosh Bapat wrote:


Off list Peter Eisentraut pointed out that we can not use these macros
in elog/ereport since it creates problems for translations. He
suggested adding functions which return strings and use %s when doing
so.

The patch has two functions pg_lsn_out_internal() which takes an LSN
as input and returns a palloc'ed string containing the string
representation of LSN. This may not be suitable in performance
critical paths and also may leak memory if not freed. So there's
another function pg_lsn_out_buffer() which takes LSN and a char array
as input, fills the char array with the string representation and
returns the pointer to the char array. This allows the function to be
used as an argument in printf/elog etc. Macro MAXPG_LSNLEN has been
externalized for this purpose.



If usage of macros in elog/ereport can cause problems for translation, 
then even with this patch life does not get significantly simpler. For 
example, instead of just doing something like:


 elog(WARNING,
- "xlog min recovery request %X/%X is past current point 
%X/%X",

- (uint32) (lsn >> 32), (uint32) lsn,
- (uint32) (newMinRecoveryPoint >> 32),
- (uint32) newMinRecoveryPoint);
+ "xlog min recovery request " LSN_FORMAT " is past 
current point " LSN_FORMAT,

+ LSN_FORMAT_ARG(lsn),
+ LSN_FORMAT_ARG(newMinRecoveryPoint));

we have to either declare two additional local buffers, which is 
verbose; or use pg_lsn_out_internal() and rely on memory contexts (or do 
pfree() manually, which is verbose again) to prevent memory leaks.




Off list Craig Ringer suggested introducing a new format specifier
similar to %m for LSN but I did not get time to take a look at the
relevant code. AFAIU it's available only to elog/ereport, so may not
be useful generally. But teaching printf variants about the new format
would be the best solution. However, I didn't find any way to do that.



It seems that this topic has been extensively discussed off-list, but 
still strong +1 for the patch. I always wanted LSN printing to be more 
concise.


I have just tried new printing utilities in a couple of new places and 
it looks good to me.


+char *
+pg_lsn_out_internal(XLogRecPtr lsn)
+{
+   charbuf[MAXPG_LSNLEN + 1];
+
+   snprintf(buf, sizeof(buf), LSN_FORMAT, LSN_FORMAT_ARG(lsn));
+
+   return pstrdup(buf);
+}

Would it be a bit more straightforward if we palloc buf initially and 
just return a pointer instead of doing pstrdup()?
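
For instance (a sketch only), psprintf() already returns a palloc'd
buffer of exactly the right size, which would avoid both the local array
and the copy:

char *
pg_lsn_out_internal(XLogRecPtr lsn)
{
	return psprintf(LSN_FORMAT, LSN_FORMAT_ARG(lsn));
}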



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From 698e481f5f55b967b5c60dba4bc577f8baa20ff4 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Fri, 16 Oct 2020 17:09:29 +0530
Subject: [PATCH] Make it easy to print LSN

The commit introduces following macros and functions to make it easy to
use LSNs in printf variants, elog, ereport and appendStringInfo
variants.

LSN_FORMAT - macro representing the format in which LSN is printed

LSN_FORMAT_ARG - macro to pass LSN as an argument to the above format

pg_lsn_out_internal - a function which returns palloc'ed char array
containing string representation of given LSN.

pg_lsn_out_buffer - similar to above but accepts and returns a char
array of size (MAXPG_LSNLEN + 1)

The commit also has some example usages of these.

Ashutosh Bapat
---
 contrib/pageinspect/rawpage.c|  3 +-
 src/backend/access/rmgrdesc/replorigindesc.c |  5 +-
 src/backend/access/rmgrdesc/xlogdesc.c   |  3 +-
 src/backend/access/transam/xlog.c|  8 ++--
 src/backend/utils/adt/pg_lsn.c   | 49 ++--
 src/include/access/xlogdefs.h|  7 +++
 src/include/utils/pg_lsn.h   |  3 ++
 7 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
index c0181506a5..2cd055a5f0 100644
--- a/contrib/pageinspect/rawpage.c
+++ b/contrib/pageinspect/rawpage.c
@@ -261,8 +261,7 @@ page_header(PG_FUNCTION_ARGS)
 	{
 		char		lsnchar[64];
 
-		snprintf(lsnchar, sizeof(lsnchar), "%X/%X",
- (uint32) (lsn >> 32), (uint32) lsn);
+		snprintf(lsnchar, sizeof(lsnchar), LSN_FORMAT, LSN_FORMAT_ARG(lsn));
 		values[0] = CStringGetTextDatum(lsnchar);
 	}
 	else
diff --git a/src/backend/access/rmgrdesc/replorigindesc.c b/src/backend/access/rmgrdesc/replorigindesc.c
index 19e14f910b..a3f49b5750 100644
--- a/src/backend/access/rmgrdesc/replorigindesc.c
+++ b/src/backend/access/rmgrdesc/replorigindesc.c
@@ -29,10 +29,9 @@ replorigin_desc(StringInfo buf, XLogReaderState *record)
 
 xlrec = (xl_replorigin_set *) rec;
 
-appendStringInfo(buf, "set %u; lsn %X/%X; force: %d",
+appendStringInfo(buf, "set %u; lsn " LSN_FORMAT "; force: %d",
  xlrec->node_id,
- (uint32) (xlrec->remote_lsn >> 32),
- (uint32) xlrec->remote_lsn,
+ 

Re: Improper use about DatumGetInt32

2020-11-27 Thread Ashutosh Bapat
On Thu, Nov 26, 2020 at 9:57 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 2020-11-26 14:27, Alvaro Herrera wrote:
> > On 2020-Nov-26, Peter Eisentraut wrote:
> >
> >> The point of the patch is to have the range check somewhere.  If you
> just
> >> cast it, then you won't notice out of range arguments.  Note that other
> >> contrib modules that take block numbers work the same way.
> >
> > I'm not saying not to do that; just saying we should not propagate it to
> > places that don't need it.  get_raw_page gets its page number from
> > PG_GETARG_INT64(), and the range check should be there.  But then it
> > calls get_raw_page_internal, and it could pass a BlockNumber -- there's
> > no need to pass an int64.  So get_raw_page_internal does not need a
> > range check.
>
> Yeah, I had it like that for a moment, but then you need to duplicate
> the check in get_raw_page() and get_raw_page_fork().  I figured since
> get_raw_page_internal() does all the other argument checking also, it
> seems sensible to put the block range check there too.  But it's not a
> big deal either way.
>

FWIW, my 2c. Though I agree with both sides, I
prefer get_raw_page_internal() accepting BlockNumber, since that's what it
deals with and not the entire int8.
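
For illustration only, a minimal sketch of the boundary check being
discussed; check_block_number() and the message wording are hypothetical,
not the committed code:

static BlockNumber
check_block_number(int64 blkno_arg)
{
	/* Validate the user-supplied int8 once, at the SQL boundary */
	if (blkno_arg < 0 || blkno_arg > MaxBlockNumber)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("block number " INT64_FORMAT " is out of range",
						blkno_arg)));

	return (BlockNumber) blkno_arg;
}

Both get_raw_page() and get_raw_page_fork() could share such a helper,
which addresses the duplication concern while still letting
get_raw_page_internal() take a plain BlockNumber.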

--
Best Wishes,
Ashutosh


Re: Improper use about DatumGetInt32

2020-11-27 Thread Ashutosh Bapat
On Wed, Nov 25, 2020 at 8:13 PM Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

> On 02.11.2020 18:59, Peter Eisentraut wrote:
> > I have committed 0003.
> >
> > For 0001, normal_rand(), I think you should reject negative arguments
> > with an error.
>
> I've updated 0001. The change is trivial, see attached.
>
> >
> > For 0002, I think you should change the block number arguments to
> > int8, same as other contrib modules do.
> >
> Agree. It will need a bit more work, though. Probably a new version of
> pageinspect contrib, as the public API will change.
> Ashutosh, are you going to continue working on it?
>

Sorry I was away on Diwali vacation so couldn't address Peter's comments in
time. Thanks for taking this further. I will review Peter's patch.

--
Best Wishes,
Ashutosh


Re: Improving spin-lock implementation on ARM.

2020-11-27 Thread Peter Eisentraut

On 2020-11-26 23:55, Tom Lane wrote:

... and, after retrieving my jaw from the floor, I present the
attached.  Apple's chips evidently like this style of spinlock a LOT
better.  The difference is so remarkable that I wonder if I made a
mistake somewhere.  Can anyone else replicate these results?


I tried this on an M1 MacBook Air.  I cannot reproduce these results. 
The unpatched numbers are in the neighborhood of what you showed, 
but the patched numbers are only a few percent better, not the 
1.5x or 2x change that you showed.






Re: Add Information during standby recovery conflicts

2020-11-27 Thread Alvaro Herrera
On 2020-Nov-27, Drouvot, Bertrand wrote:

> +	if (nprocs > 0)
> +	{
> +		ereport(LOG,
> +				(errmsg("%s after %ld.%03d ms",
> +						get_recovery_conflict_desc(reason), msecs, usecs),
> +				 (errdetail_log_plural("Conflicting process: %s.",
> +									   "Conflicting processes: %s.",
> +									   nprocs, buf.data))));
> +	}

> +/* Return the description of recovery conflict */
> +static const char *
> +get_recovery_conflict_desc(ProcSignalReason reason)
> +{
> + const char *reasonDesc = gettext_noop("unknown reason");
> +
> + switch (reason)
> + {
> + case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
> + reasonDesc = gettext_noop("recovery is still waiting 
> for recovery conflict on buffer pin");
> + break;

This doesn't work from a translation point of view.  First, you're
building a sentence from parts, which is against policy.  Second, you're
not actually invoking gettext to translate the string returned by
get_recovery_conflict_desc.

I think this needs to be rethought.  To handle the first problem I
suggest to split the error message in two.  One phrase is the complain
that recovery is waiting, and the other string is the reason for the
wait.  Separate both either by splitting with a :, or alternatively put
the other sentence in DETAIL.  (The latter would require to mix with the
list of conflicting processes, which might be hard.)

The first idea would look like this:

LOG:  recovery still waiting after %ld.%03d ms: for recovery conflict on buffer pin
DETAIL:  Conflicting processes: 1, 2, 3.

To achieve this, apart from editing the messages returned by
get_recovery_conflict_desc, you'd need to ereport this way:

ereport(LOG,
errmsg("recovery still waiting after %ld.%03d ms: %s",
   msecs, usecs, _(get_recovery_conflict_desc(reason))),
errdetail_log_plural("Conflicting process: %s.",
 "Conflicting processes: %s.",




Re: Feature improvement for pg_stat_statements

2020-11-27 Thread Fujii Masao




On 2020/11/27 21:39, Seino Yuki wrote:

On 2020-11-27 21:37, Seino Yuki wrote:

On 2020-11-16 12:28, Seino Yuki wrote:

Due to similar fixed, we have also merged the patches discussed in the
following thread.
https://commitfest.postgresql.org/30/2736/
The patch is posted on the 2736 side of the thread.

Regards.




I forgot the attachment and will resend it.

The following patches have been committed and we have created a
compliant patch for them.
https://commitfest.postgresql.org/30/2736/

Please confirm.


Thanks for updating the patch! Here are review comments from me.

+OUT reset_exec_time TIMESTAMP WITH TIME ZONE

I prefer "stats_reset" as the name of this column for the sake of
consistency with the similar column in other stats views like
pg_stat_database.

Isn't it better to use lower cases for "TIMESTAMP WITH TIME ZONE"
because other DDLs in pg_stat_statements do that for the declaration
of data type?

@@ -565,6 +568,8 @@ pgss_shmem_startup(void)
pgss->n_writers = 0;
pgss->gc_count = 0;
pgss->stats.dealloc = 0;
+   pgss->stats.reset_exec_time = 0;
+   pgss->stats.reset_exec_time_isnull = true;

The reset time should be initialized with GetCurrentTimestamp() instead
of 0? If so, I think that we can completely get rid of reset_exec_time_isnull.

+   memset(nulls, 0, sizeof(nulls));

If some entries in "values[]" may not be set, "values[]" also needs to
be initialized with 0.

MemSet() should be used, instead?

+   /* Read dealloc */
+   values[0] = stats.dealloc;

Int64GetDatum() should be used here?

+   reset_ts = GetCurrentTimestamp();

GetCurrentTimestamp() needs to be called only when all the entries
are removed. But it's called even in other cases.
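
For illustration, a sketch of how those comments could land; the rename
to "stats_reset" and the rest follow the suggestions above and are not
the author's actual patch:

	/* in pgss_shmem_startup(): seed with a real timestamp, no isnull flag */
	pgss->stats.dealloc = 0;
	pgss->stats.stats_reset = GetCurrentTimestamp();

	/* in pg_stat_statements_info(): */
	MemSet(values, 0, sizeof(values));
	MemSet(nulls, 0, sizeof(nulls));

	values[0] = Int64GetDatum(stats.dealloc);
	values[1] = TimestampTzGetDatum(stats.stats_reset);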

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: Disable WAL logging to speed up data loading

2020-11-27 Thread osumi.takami...@fujitsu.com
Thank you, Horiguchi-San

> I haven't seen a criterion for whether a record is emitted or not for
> wal_level=none.
> 
> We're emitting only redo logs.  So I think theoretically we don't need 
> anything
> other than the shutdown checkpoint record because we don't perform
> recovery and a checkpoint record is required at startup.
> 
> RM_XLOG_ID:
>  XLOG_FPI_FOR_HINT  - not needed?
>  XLOG_FPI   - not needed?
> 
>  XLOG_CHECKPOINT_SHUTDOWN - must have
> 
> So how about the followings?
>  XLOG_CHECKPOINT_ONLINE
>  XLOG_NOOP
>  XLOG_NEXTOID
>  XLOG_SWITCH
>  XLOG_BACKUP_END
>  XLOG_PARAMETER_CHANGE
>  XLOG_RESTORE_POINT
>  XLOG_FPW_CHANGE
>  XLOG_END_OF_RECOVERY
> 
> 
> RM_XACT_ID:
>  XLOG_XACT_COMMIT
>  XLOG_XACT_PREPARE
>  XLOG_XACT_ABORT
>  XLOG_XACT_COMMIT_PREPARED
>  XLOG_XACT_ABORT_PREPARED
>  XLOG_XACT_ASSIGNMENT
>  XLOG_XACT_INVALIDATIONS
> 
> Do we need all of these?
No. Strictly speaking, you are right.
We still have types of WAL that are not necessarily needed.
For example, XLOG_END_OF_RECOVERY is not useful
because wal_level=none doesn't recover from any accidents.
Or, XLOG_CHECKPOINT_ONLINE is only used when we execute CHECKPOINT,
not during shutdown. Thus we could eliminate more.

> And, currenly what decides whether to emit a wal record according to
> wal_level is the caller of XLogInsert. 
Yes.

> So doing this at XLogInsert-level means
> that we bring the criteria of the necessity of wal-record into xlog layer 
> only for
> wal_level=none. I'm not sure it is the right direction.
I'm sorry, I didn't understand what "doing this" and "xlog layer" meant.
Did you mean that fixing the caller side of XLogInsert (e.g. CreateCheckPoint)
is not the right direction? Or that fixing XLogInsert itself is not the
right direction?
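
(For illustration of the two placements being contrasted; all names below
are hypothetical, not from any patch. Today each caller gates the record,
e.g. with RelationNeedsWAL(). The alternative would be a blanket filter at
the top of XLogInsert() itself, roughly:

	if (wal_level == WAL_LEVEL_NONE &&
		!RecordRequiredAtLevelNone(rmid, info))
		return GetXLogInsertRecPtr();	/* skip emitting this record */

which is what bringing the criteria into the xlog layer would mean.)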


> At Fri, 27 Nov 2020 07:01:16 +, "tsunakawa.ta...@fujitsu.com"
>  wrote in
> > I'm afraid "none" doesn't represent the behavior because RM_XLOG_ID and
> RM_XACT_ID WAL records, except for XLOG_FPI_*, are emitted.  What's the
> good name?  IIUC, "minimal" is named after the fact that the minimal
> amount of WAL necessary for crash recovery is generated.  "norecovery" or
> "unrecoverable"?
Lastly, I found another name which expresses the essential characteristic of 
this wal_level.
How about wal_level = "crash_unsafe" ?
What do you think?


Best,
Takamichi Osumi





Re: [HACKERS] logical decoding of two-phase transactions

2020-11-27 Thread Ajin Cherian
On Thu, Nov 26, 2020 at 10:43 PM Amit Kapila  wrote:

> I think what you need to do to reproduce this is to follow the
> snapshot machinery in SnapBuildFindSnapshot. Basically, first, start a
> transaction  (say transaction-id is 500) and do some operations but
> don't commit. Here, if you create a slot (via subscription or
> otherwise), it will wait for 500 to complete and make the state as
> SNAPBUILD_BUILDING_SNAPSHOT. Here, you can commit 500 and then having
> debugger in that state, start another transaction (say 501), do some
> operations but don't commit. Next time when you reach this function,
> it will change the state to SNAPBUILD_FULL_SNAPSHOT and wait for 501,
> now you can start another transaction (say 502) which you can prepare
> but don't commit. Again start one more transaction 503, do some ops,
> commit both 501 and 503. At this stage somehow we need to ensure that an
> XLOG_RUNNING_XACTS record is written. Then commit prepared 502. Now, I think you
> should notice that the consistent point is reached after 502's prepare
> and before its commit. Now, this is just a theoretical scenario, you
> need something on these lines and probably a way to force
> XLOG_RUNNING_XACTS WAL (probably via debugger or some other way) at
> the right times to reproduce it.
>
> Thanks for trying to build a test case for this, it is really helpful.

I tried the above steps; I was able to get the builder state to
SNAPBUILD_BUILDING_SNAPSHOT but was not able to get into the
SNAPBUILD_FULL_SNAPSHOT state.
Instead the code moves straight from SNAPBUILD_BUILDING_SNAPSHOT to the
SNAPBUILD_CONSISTENT state.

In the function SnapBuildFindSnapshot, either the following check fails:

1327: TransactionIdPrecedesOrEquals(SnapBuildNextPhaseAt(builder),
   running->oldestRunningXid))

because SnapBuildNextPhaseAt (which is the same as running->nextXid)
is higher than oldestRunningXid; or, when both are the same, it
falls through into the condition below (higher up in the code)

1247: if (running->oldestRunningXid == running->nextXid)

and then the builder moves straight into the SNAPBUILD_CONSISTENT
state. At no point will the nextXid be less than oldestRunningXid. In
my sessions, I commit multiple txns hoping to bump
up oldestRunningXid, I do checkpoints, and have made sure the
XLOG_RUNNING_XACTS records are being inserted.
But while iterating into SnapBuildFindSnapshot with a new
XLOG_RUNNING_XACTS record, the oldestRunningXid is being incremented
one xid at a time, which will eventually make it catch up with
running->nextXid and reach a
SNAPBUILD_CONSISTENT state without entering the SNAPBUILD_FULL_SNAPSHOT state.



regards,
Ajin Cherian
Fujitsu Australia




Re: autovac issue with large number of tables

2020-11-27 Thread Fujii Masao




On 2020/11/27 18:38, Kasahara Tatsuhito wrote:

Hi,

On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao  wrote:




On 2020/11/26 10:41, Kasahara Tatsuhito wrote:

On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada  wrote:


On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
 wrote:


Hi,

On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada  wrote:


On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
 wrote:


Hi,

On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
 wrote:

I wonder if we could have table_recheck_autovac do two probes of the stats
data.  First probe the existing stats data, and if it shows the table to
be already vacuumed, return immediately.  If not, *then* force a stats
re-read, and check a second time.

Does the above mean that the second and subsequent table_recheck_autovac()
will be improved to first check using the previous refreshed statistics?
I think that certainly works.

If that's correct, I'll try to create a patch for the PoC


I still don't know how to reproduce Jim's troubles, but I was able to reproduce
what was probably a very similar problem.

This problem seems to be more likely to occur in cases where you have
a large number of tables,
i.e., a large amount of stats, and many small tables need VACUUM at
the same time.

So I followed Tom's advice and created a patch for the PoC.
This patch will enable a flag in the table_recheck_autovac function to use
the existing stats next time if VACUUM (or ANALYZE) has already been done
by another worker on the check after the stats have been updated.
If the tables continue to require VACUUM after the refresh, then a refresh
will be required instead of using the existing statistics.
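
(For illustration, pseudocode of that two-probe flow;
stats_show_vacuum_needed() is a hypothetical helper, not the actual PoC
code:

	/* Probe 1: cheap check against the already-loaded stats snapshot. */
	if (!stats_show_vacuum_needed(relid, false))
		return false;			/* another worker already handled it */

	/* Probe 2: only now pay for a stats re-read, then check again. */
	return stats_show_vacuum_needed(relid, true);

The point is that the expensive stats-file re-read happens only for
tables the cached data still claims need vacuuming.)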

I did simple test with HEAD and HEAD + this PoC patch.
The tests were conducted in two cases.
(I changed few configurations. see attached scripts)

1. Normal VACUUM case
- SET autovacuum = off
- CREATE tables with 100 rows
- DELETE 90 rows for each table
- SET autovacuum = on and restart PostgreSQL
- Measure the time it takes for all tables to be VACUUMed

2. Anti-wraparound VACUUM case
- CREATE blank tables
- SELECT all of these tables (to generate stats)
- SET autovacuum_freeze_max_age to low values and restart PostgreSQL
- Consume a lot of XIDs by using txid_current()
- Measure the time it takes for all tables to be VACUUMed

For each test case, the following results were obtained by changing
the autovacuum_max_workers parameter to 1, 2, 3 (default), 5 and 10.
Also changing the number of tables to 1000, 5000, 10000 and 20000.

Due to the poor VM environment (2 VCPU/4 GB), the results are a little unstable,
but I think it's enough to show a trend.

===
[1.Normal VACUUM case]
   tables:1000
autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec

   tables:5000
autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec

   tables:10000
autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec

   tables:20000
autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
autovacuum_max_workers 2:   (HEAD) 231 sec VS (with patch)  229 sec
autovacuum_max_workers 3:   (HEAD) 220 sec VS (with patch)  191 sec
autovacuum_max_workers 5:   (HEAD) 234 sec VS (with patch)  147 sec
autovacuum_max_workers 10:  (HEAD) 320 sec VS (with patch)  113 sec

[2.Anti wrap round VACUUM case]
   tables:1000
autovacuum_max_workers 1:   (HEAD) 19 sec VS (with patch) 18 sec
autovacuum_max_workers 2:   (HEAD) 14 sec VS (with patch) 15 sec
autovacuum_max_workers 3:   (HEAD) 14 sec VS (with patch) 14 sec
autovacuum_max_workers 5:   (HEAD) 14 sec VS (with patch) 16 sec
autovacuum_max_workers 10:  (HEAD) 16 sec VS (with patch) 14 sec

   tables:5000
autovacuum_max_workers 1:   (HEAD) 69 sec VS (with patch) 69 sec
autovacuum_max_workers 2:   (HEAD) 66 sec VS (with patch) 47 sec
autovacuum_max_workers 3:   (HEAD) 59 sec VS (with patch) 37 sec
autovacuum_max_workers 5:   (HEAD) 39 sec VS (with patch) 28 sec
autovacuum_max_workers 10:  (HEAD) 39 sec VS (with patch) 

Re: Feature improvement for pg_stat_statements

2020-11-27 Thread Seino Yuki

On 2020-11-27 21:37, Seino Yuki wrote:

On 2020-11-16 12:28, Seino Yuki wrote:

Due to similar fixed, we have also merged the patches discussed in the
following thread.
https://commitfest.postgresql.org/30/2736/
The patch is posted on the 2736 side of the thread.

Regards.




I forgot the attachment and will resend it.

The following patches have been committed and we have created a
compliant patch for them.
https://commitfest.postgresql.org/30/2736/

Please confirm.

Regards.
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
index 2019a4ffe0..21a4b6c36f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
@@ -5,9 +5,10 @@
 
 --- Define pg_stat_statements_info
 CREATE FUNCTION pg_stat_statements_info(
-OUT dealloc bigint
+OUT dealloc bigint,
+OUT reset_exec_time TIMESTAMP WITH TIME ZONE
 )
-RETURNS bigint
+RETURNS record
 AS 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL SAFE;
 
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 70cfdb2c9d..fd90aab471 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -81,6 +81,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/timestamp.h"
 
 PG_MODULE_MAGIC;
 
@@ -199,6 +200,8 @@ typedef struct Counters
 typedef struct pgssGlobalStats
 {
 	int64		dealloc;		/* # of times entries were deallocated */
+	TimestampTz reset_exec_time;		/* timestamp with all stats reset */
+	bool		reset_exec_time_isnull;		/* if true last_dealloc is null */
 } pgssGlobalStats;
 
 /*
@@ -565,6 +568,8 @@ pgss_shmem_startup(void)
 		pgss->n_writers = 0;
 		pgss->gc_count = 0;
 		pgss->stats.dealloc = 0;
+		pgss->stats.reset_exec_time = 0;
+		pgss->stats.reset_exec_time_isnull = true;
 	}
 
 	memset(&info, 0, sizeof(info));
@@ -1510,6 +1515,7 @@ pg_stat_statements_reset(PG_FUNCTION_ARGS)
 #define PG_STAT_STATEMENTS_COLS_V1_3	23
 #define PG_STAT_STATEMENTS_COLS_V1_8	32
 #define PG_STAT_STATEMENTS_COLS			32	/* maximum of above */
+#define PG_STAT_STATEMENTS_INFO_COLS	2
 
 /*
  * Retrieve statement statistics.
@@ -1889,7 +1895,17 @@ Datum
 pg_stat_statements_info(PG_FUNCTION_ARGS)
 {
 	pgssGlobalStats stats;
+	TupleDesc	tupdesc;
+	HeapTuple	result_tuple;
+	Datum 		values[PG_STAT_STATEMENTS_INFO_COLS];
+	bool 		nulls[PG_STAT_STATEMENTS_INFO_COLS];
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupdesc = BlessTupleDesc(tupdesc);
 
+	memset(nulls, 0, sizeof(nulls));
 	/* Read global statistics for pg_stat_statements */
 	{
 		volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;
@@ -1899,7 +1915,17 @@ pg_stat_statements_info(PG_FUNCTION_ARGS)
 		SpinLockRelease(&s->mutex);
 	}
 
-	PG_RETURN_INT64(stats.dealloc);
+	/* Read dealloc */
+	values[0] = stats.dealloc;
+
+	/* Read reset_exec_time */
+	if (!stats.reset_exec_time_isnull)
+		values[1] = stats.reset_exec_time;
+	else
+		nulls[1] = true;
+
+	result_tuple = heap_form_tuple(tupdesc, values, nulls);
+	return HeapTupleGetDatum(result_tuple);
 }
 
 /*
@@ -2507,6 +2533,9 @@ entry_reset(Oid userid, Oid dbid, uint64 queryid)
 	long		num_entries;
 	long		num_remove = 0;
 	pgssHashKey key;
+	TimestampTz reset_ts;
+
+	reset_ts = GetCurrentTimestamp();
 
 	if (!pgss || !pgss_hash)
 		ereport(ERROR,
@@ -2559,6 +2588,8 @@ entry_reset(Oid userid, Oid dbid, uint64 queryid)
 
 			SpinLockAcquire(&s->mutex);
 			s->stats.dealloc = 0;
+			s->stats.reset_exec_time = reset_ts;	/* reset execution time */
+			s->stats.reset_exec_time_isnull = false;
 			SpinLockRelease(&s->mutex);
 		}
 	}
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 81915ea69b..821ae1f96e 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -523,6 +523,15 @@
pg_stat_statements.max were observed
   
  
+ 
+  
+   reset_exec_time timestamp with time zone
+  
+  
+Shows the time at which pg_stat_statements_reset(0,0,0) was last called.
+  
+ 
+
 

   


Re: Feature improvement for pg_stat_statements

2020-11-27 Thread Seino Yuki

On 2020-11-16 12:28, Seino Yuki wrote:

Due to similar fixed, we have also merged the patches discussed in the
following thread.
https://commitfest.postgresql.org/30/2736/
The patch is posted on the 2736 side of the thread.

Regards.


The following patches have been committed and we have created a 
compliant patch for them.

https://commitfest.postgresql.org/30/2736/

Please confirm.

Regards.




Re: A problem about partitionwise join

2020-11-27 Thread Ashutosh Bapat
On Tue, Nov 10, 2020 at 2:43 PM Richard Guo  wrote:
>
>
> On Fri, Nov 6, 2020 at 11:26 PM Anastasia Lubennikova 
>  wrote:
>>
>> Status update for a commitfest entry.
>>
>> According to CFbot this patch fails to apply. Richard, can you send an 
>> update, please?
>>
>> Also, I see that the thread was inactive for a while.
>> Are you going to continue this work? I think it would be helpful, if you 
>> could write a short recap about current state of the patch and list open 
>> questions for reviewers.
>>
>> The new status of this patch is: Waiting on Author
>
>
> Thanks Anastasia. I've rebased the patch with latest master.
>
> To recap, the problem we are fixing here is when generating join clauses
> from equivalence classes, we only select the joinclause with the 'best
> score', or the first joinclause with a score of 3. This may cause us to
> miss some joinclause on partition keys and thus fail to generate
> partitionwise join.
>
> The initial idea for the fix is to create all the RestrictInfos from ECs
> in order to check whether there exist equi-join conditions involving
> pairs of matching partition keys of the relations being joined for all
> partition keys. And then Tom proposed a much better idea which leverages
> function exprs_known_equal() to tell whether the partkeys can be found
> in the same eclass, which is the current implementation in the latest
> patch.
>

In the example you gave earlier, the equi join on partition key was
there but it was replaced by individual constant assignment clauses.
So if we keep the original restrictclause in there with a new flag
indicating that it's redundant, have_partkey_equi_join will still be
able to use it without much change. Depending upon where all we need
to use avoid restrictclauses with the redundant flag, this might be an
easier approach. However, with Tom's idea partition-wise join may be
used even when there is no equi-join between partition keys but there
are clauses like pk = const for all tables involved and const is the
same for all such tables.

In the spirit of small improvement made to the performance of
have_partkey_equi_join(), pk_has_clause should be renamed as
pk_known_equal and pks_known_equal as num_equal_pks.

The loop traversing the partition keys at a given position, may be
optimized further if we pass lists to exprs_known_equal() which in
turns checks whether one expression from each list is member of a
given EC. This will avoid traversing all equivalence classes for each
partition key expression, which can be a huge improvement when there
are many ECs. But I think if one of the partition key expression at a
given position is member of an equivalence class all the other
partition key expressions at that position should be part of that
equivalence class since there should be an equi-join between those. So
the loop in loop may not be required to start with.

-- 
Best Wishes,
Ashutosh Bapat




Re: [patch] CLUSTER blocks scanned progress reporting

2020-11-27 Thread Fujii Masao




On 2020/11/27 1:51, Fujii Masao wrote:



On 2020/11/27 1:45, Matthias van de Meent wrote:

On Thu, 26 Nov 2020 at 00:55, Fujii Masao  wrote:


+	 * A heap scan need not return tuples for the last page it has
+	 * scanned. To ensure that heap_blks_scanned is equivalent to
+	 * total_heap_blks after the table scan phase, this parameter
+	 * is manually updated to the correct value when the table scan
+	 * finishes.

So it's better to update this comment a bit? For example,

  If the scanned last pages are empty, it's possible to go to the next phase
  while heap_blks_scanned != heap_blks_total. To ensure that they are
  equivalent after the table scan phase, this parameter is manually updated
  to the correct value when the table scan finishes.



PFA a patch with updated message and comment. I've reworded the
messages to specifically mention empty pages and the need for setting
heap_blks_scanned = total_heap_blks explicitly.


Thanks for updating the patch! It looks good to me.
Barring any objection, I will commit this patch (also back-patch to v12).


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Printing LSN made easy

2020-11-27 Thread Ashutosh Bapat
Hi All,
An LSN is printed using format "%X/%X" and passing (uint32) (lsn >>
32), (uint32) lsn as arguments to that format specifier. This pattern
is repeated at 180 places according to Cscope. I find it hard to
remember this pattern every time I have to print LSN. Usually I end up
copying from some other place. Finding such a place takes a few
minutes and might be error prone, especially when the LSN to be printed is an
expression. If we ever have to change this pattern in future, we will
need to change all those 180 places.

The solution seems to be simple though. In the attached patch, I have
added two macros
#define LSN_FORMAT "%X/%X"
#define LSN_FORMAT_ARG(lsn) (uint32) ((lsn) >> 32), (uint32) (lsn)

which can be used instead.

E.g. the following change in the patch
@@ -261,8 +261,7 @@ page_header(PG_FUNCTION_ARGS)
{
charlsnchar[64];

-   snprintf(lsnchar, sizeof(lsnchar), "%X/%X",
-(uint32) (lsn >> 32), (uint32) lsn);
+   snprintf(lsnchar, sizeof(lsnchar), LSN_FORMAT,
LSN_FORMAT_ARG(lsn));
values[0] = CStringGetTextDatum(lsnchar);

LSN_FORMAT_ARG expands to two comma-separated arguments and is kinda
open at both ends, but it's handy that way.

Off list Peter Eisentraut pointed out that we can not use these macros
in elog/ereport since it creates problems for translations. He
suggested adding functions which return strings and use %s when doing
so.

The patch has two functions pg_lsn_out_internal() which takes an LSN
as input and returns a palloc'ed string containing the string
representation of LSN. This may not be suitable in performance
critical paths and also may leak memory if not freed. So there's
another function pg_lsn_out_buffer() which takes LSN and a char array
as input, fills the char array with the string representation and
returns the pointer to the char array. This allows the function to be
used as an argument in printf/elog etc. Macro MAXPG_LSNLEN has been
externalized for this purpose.

Off list Craig Ringer suggested introducing a new format specifier
similar to %m for LSN but I did not get time to take a look at the
relevant code. AFAIU it's available only to elog/ereport, so may not
be useful generally. But teaching printf variants about the new format
would be the best solution. However, I didn't find any way to do that.

In that patch I have changed some random code to use one of the above
methods appropriately, demonstrating their usage. Given that we have
grown 180 places already, I think that something like this would have been
welcome earlier. But given that replication code is being actively
worked upon, it may not be too late. As a precedence lfirst_node() and
its variants arrived when the corresponding pattern had been repeated
at so many places.

I think we should move pg_lsn_out_internal() and pg_lsn_out_buffer()
somewhere else. Ideally xlogdefs.c would have been the best place but
there's no such file. Maybe we can add those functions in pg_lsn.c and
add their declarations in xlogdefs.h.

-- 
Best Wishes,
Ashutosh Bapat
From 698e481f5f55b967b5c60dba4bc577f8baa20ff4 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Fri, 16 Oct 2020 17:09:29 +0530
Subject: [PATCH] Make it easy to print LSN

The commit introduces following macros and functions to make it easy to
use LSNs in printf variants, elog, ereport and appendStringInfo
variants.

LSN_FORMAT - macro representing the format in which LSN is printed

LSN_FORMAT_ARG - macro to pass LSN as an argument to the above format

pg_lsn_out_internal - a function which returns palloc'ed char array
containing string representation of given LSN.

pg_lsn_out_buffer - similar to above but accepts and returns a char
array of size (MAXPG_LSNLEN + 1)

The commit also has some example usages of these.

Ashutosh Bapat
---
 contrib/pageinspect/rawpage.c|  3 +-
 src/backend/access/rmgrdesc/replorigindesc.c |  5 +-
 src/backend/access/rmgrdesc/xlogdesc.c   |  3 +-
 src/backend/access/transam/xlog.c|  8 ++--
 src/backend/utils/adt/pg_lsn.c   | 49 ++--
 src/include/access/xlogdefs.h|  7 +++
 src/include/utils/pg_lsn.h   |  3 ++
 7 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
index c0181506a5..2cd055a5f0 100644
--- a/contrib/pageinspect/rawpage.c
+++ b/contrib/pageinspect/rawpage.c
@@ -261,8 +261,7 @@ page_header(PG_FUNCTION_ARGS)
 	{
 		char		lsnchar[64];
 
-		snprintf(lsnchar, sizeof(lsnchar), "%X/%X",
- (uint32) (lsn >> 32), (uint32) lsn);
+		snprintf(lsnchar, sizeof(lsnchar), LSN_FORMAT, LSN_FORMAT_ARG(lsn));
 		values[0] = CStringGetTextDatum(lsnchar);
 	}
 	else
diff --git a/src/backend/access/rmgrdesc/replorigindesc.c b/src/backend/access/rmgrdesc/replorigindesc.c
index 19e14f910b..a3f49b5750 100644
--- a/src/backend/access/rmgrdesc/replorigindesc.c
+++ 

Re: Extending range type operators to cope with elements

2020-11-27 Thread Anastasia Lubennikova

On 31.10.2020 01:08, Tomas Vondra wrote:

Hi,

On Fri, Oct 30, 2020 at 04:01:27PM +, Georgios Kokolatos wrote:

Hi,

thank you for your contribution.

I did notice that the cfbot [1] is failing for this patch.
Please try to address the issues if you can for the upcoming commitfest.



I took a look at the patch today - the regression failure was trivial,
the expected output for one query was added to the wrong place, a couple
lines off the proper place. Attached is an updated version of the patch,
fixing that.

I also reviewed the code - it seems pretty clean and in line with the
surrounding code in rangetypes.c. Good job Esteban! I'll do a bit more
review next week, and I'll see if I can get it committed.

regards



CFM reminder, just in case you forgot about this thread :)
The commitfest is heading to the end. Tomas, will you have time to push 
this patch?


The patch still applies and passes all cfbot checks. I also took a quick 
look at the code and everything looks good to me.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: autovac issue with large number of tables

2020-11-27 Thread Kasahara Tatsuhito
On Fri, Nov 27, 2020 at 5:22 PM Masahiko Sawada  wrote:
>
> On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao  
> wrote:
> >
> >
> >
> > On 2020/11/26 10:41, Kasahara Tatsuhito wrote:
> > > On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada  
> > > wrote:
> > >>
> > >> On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
> > >>  wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada  
> > >>> wrote:
> > 
> >  On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
> >   wrote:
> > >
> > > Hi,
> > >
> > > On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
> > >  wrote:
> > >>> I wonder if we could have table_recheck_autovac do two probes of 
> > >>> the stats
> > >>> data.  First probe the existing stats data, and if it shows the 
> > >>> table to
> > >>> be already vacuumed, return immediately.  If not, *then* force a 
> > >>> stats
> > >>> re-read, and check a second time.
> > >> Does the above mean that the second and subsequent 
> > >> table_recheck_autovac()
> > >> will be improved to first check using the previous refreshed 
> > >> statistics?
> > >> I think that certainly works.
> > >>
> > >> If that's correct, I'll try to create a patch for the PoC
> > >
> > > I still don't know how to reproduce Jim's troubles, but I was able to 
> > > reproduce
> > > what was probably a very similar problem.
> > >
> > > This problem seems to be more likely to occur in cases where you have
> > > a large number of tables,
> > > i.e., a large amount of stats, and many small tables need VACUUM at
> > > the same time.
> > >
> > > So I followed Tom's advice and created a patch for the PoC.
> > > This patch will enable a flag in the table_recheck_autovac function 
> > > to use
> > > the existing stats next time if VACUUM (or ANALYZE) has already been 
> > > done
> > > by another worker on the check after the stats have been updated.
> > > If the tables continue to require VACUUM after the refresh, then a 
> > > refresh
> > > will be required instead of using the existing statistics.
> > >
> > > I did simple test with HEAD and HEAD + this PoC patch.
> > > The tests were conducted in two cases.
> > > (I changed few configurations. see attached scripts)
> > >
> > > 1. Normal VACUUM case
> > >- SET autovacuum = off
> > >- CREATE tables with 100 rows
> > >- DELETE 90 rows for each table
> > >- SET autovacuum = on and restart PostgreSQL
> > >- Measure the time it takes for all tables to be VACUUMed
> > >
> > > 2. Anti-wraparound VACUUM case
> > >- CREATE blank tables
> > >- SELECT all of these tables (to generate stats)
> > >- SET autovacuum_freeze_max_age to low values and restart 
> > > PostgreSQL
> > >- Consume a lot of XIDs by using txid_current()
> > >- Measure the time it takes for all tables to be VACUUMed
> > >
> > > For each test case, the following results were obtained by changing
> > > autovacuum_max_workers parameters to 1, 2, 3(def) 5 and 10.
> > > Also changing num of tables to 1000, 5000, 1 and 2.
> > >
> > > Due to the poor VM environment (2 VCPU/4 GB), the results are a 
> > > little unstable,
> > > but I think it's enough to ask for a trend.
> > >
> > > ===
> > > [1.Normal VACUUM case]
> > >   tables:1000
> > >autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
> > >autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
> > >autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
> > >autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
> > >autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec
> > >
> > >   tables:5000
> > >autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
> > >autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
> > >autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
> > >autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
> > >autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec
> > >
> > >   tables:1
> > >autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
> > >autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
> > >autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
> > >autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
> > >autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec
> > >
> > >   tables:2
> > >autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
> > >

Re: autovac issue with large number of tables

2020-11-27 Thread Kasahara Tatsuhito
Hi,

On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao  wrote:
>
>
>
> On 2020/11/26 10:41, Kasahara Tatsuhito wrote:
> > On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada  
> > wrote:
> >>
> >> On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
> >>  wrote:
> >>>
> >>> Hi,
> >>>
> >>> On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada  
> >>> wrote:
> 
>  On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
>   wrote:
> >
> > Hi,
> >
> > On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
> >  wrote:
> >>> I wonder if we could have table_recheck_autovac do two probes of the 
> >>> stats
> >>> data.  First probe the existing stats data, and if it shows the table 
> >>> to
> >>> be already vacuumed, return immediately.  If not, *then* force a stats
> >>> re-read, and check a second time.
> >> Does the above mean that the second and subsequent 
> >> table_recheck_autovac()
> >> will be improved to first check using the previous refreshed 
> >> statistics?
> >> I think that certainly works.
> >>
> >> If that's correct, I'll try to create a patch for the PoC
> >
> > I still don't know how to reproduce Jim's troubles, but I was able to 
> > reproduce
> > what was probably a very similar problem.
> >
> > This problem seems to be more likely to occur in cases where you have
> > a large number of tables,
> > i.e., a large amount of stats, and many small tables need VACUUM at
> > the same time.
> >
> > So I followed Tom's advice and created a patch for the PoC.
> > This patch adds a flag to the table_recheck_autovac function so that
> > the existing stats are reused on the next check if VACUUM (or ANALYZE)
> > has already been done by another worker by the time of the check after
> > the stats have been updated.
> > If the tables continue to require VACUUM after the refresh, then a
> > refresh is performed instead of reusing the existing statistics.
> >
> > I did a simple test with HEAD and HEAD + this PoC patch.
> > The tests were conducted in two cases.
> > (I changed a few configurations; see the attached scripts.)
> >
> > 1. Normal VACUUM case
> >    - SET autovacuum = off
> >    - CREATE tables with 100 rows
> >    - DELETE 90 rows from each table
> >    - SET autovacuum = on and restart PostgreSQL
> >    - Measure the time it takes for all tables to be VACUUMed
> >
> > 2. Anti-wraparound VACUUM case
> >    - CREATE blank tables
> >    - SELECT all of these tables (to generate stats)
> >    - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
> >    - Consume a lot of XIDs by using txid_current()
> >    - Measure the time it takes for all tables to be VACUUMed
> >
> > For each test case, the following results were obtained by changing
> > the autovacuum_max_workers parameter to 1, 2, 3 (default), 5 and 10,
> > and the number of tables to 1000, 5000, 10000 and 20000.
> >
> > Due to the poor VM environment (2 vCPU / 4 GB), the results are a
> > little unstable, but I think it's enough to see a trend.
> >
> > ===
> > [1.Normal VACUUM case]
> >   tables:1000
> >autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
> >autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
> >autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
> >autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
> >autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec
> >
> >   tables:5000
> >autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
> >autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
> >autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
> >autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
> >autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec
> >
> >   tables:10000
> >autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
> >autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
> >autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
> >autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
> >autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec
> >
> >   tables:20000
> >autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
> >autovacuum_max_workers 2:   (HEAD) 231 sec VS (with patch)  229 sec
> >autovacuum_max_workers 3:   (HEAD) 220 sec VS (with patch)  191 sec
> >autovacuum_max_workers 5:   (HEAD) 234 sec VS (with patch)  147 sec
> >autovacuum_max_workers 10:  (HEAD) 320 sec VS (with patch)  113 sec
> >
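
To make Tom's two-probe idea above concrete, the recheck could be shaped
roughly like this (an illustrative sketch only, not the actual PoC patch;
needs_vacuum() is an assumed stand-in for the real threshold logic):

#include "postgres.h"
#include "pgstat.h"

/* assumed helper standing in for the real vacuum-threshold check */
extern bool needs_vacuum(PgStat_StatTabEntry *tabentry);

static bool
recheck_needs_vacuum(Oid relid)
{
    PgStat_StatTabEntry *tabentry;

    /* Probe 1: trust the stats snapshot we already have. */
    tabentry = pgstat_fetch_stat_tabentry(relid);
    if (tabentry == NULL || !needs_vacuum(tabentry))
        return false;   /* another worker has likely vacuumed it already */

    /* Probe 2: force a fresh stats read and check once more. */
    pgstat_clear_snapshot();
    tabentry = pgstat_fetch_stat_tabentry(relid);
    return tabentry != NULL && needs_vacuum(tabentry);
}

This would avoid the expensive stats re-read entirely whenever the first,
cheap probe already shows the table as vacuumed.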

RE: Disable WAL logging to speed up data loading

2020-11-27 Thread osumi.takami...@fujitsu.com
Hi, Tsunakawa-San


> I'm afraid "none" doesn't represent the behavior because RM_XLOG_ID and
> RM_XACT_ID WAL records, except for XLOG_FPI_*, are emitted.  What's the
> good name?  IIUC, "minimal" is named after the fact that the minimal
> amount of WAL necessary for crash recovery is generated.  "norecovery" or
> "unrecoverable"?
Really good point!

Yes, generating a small amount of WAL cannot be helped.
I prefer "unrecoverable" to "none", since it should push users to use
this feature really carefully.
If there's no strong objection, I'll change the name in the next patch version.

Best,
Takamichi Osumi



RE: Disable WAL logging to speed up data loading

2020-11-27 Thread osumi.takami...@fujitsu.com
Hello, Sawada-San

On Friday, November 27, 2020 3:08 PM Masahiko Sawada  
wrote:
> -   (errmsg("WAL was generated with wal_level=minimal, data may be missing"),
> +   (errmsg("WAL was generated with wal_level<=minimal, data may be missing"),
>      errhint("This happens if you temporarily set wal_level=minimal without taking a new base backup.")));
> 
> 'wal_level=minimal' in errhint also needs to be changed to
> 'wal_level<=minimal'?
Yeah, thanks. I'll fix this point in the next patch.


> While testing the patch on some workload, I realized that
> XLOG_FPI_FOR_HINT record could still be emitted even when wal_level =
> none. IIUC that WAL record is not necessary during wal_level = none since
> the server cannot be the primary server and the server crash ends up requiring
> to restore the whole database.
That's right. Yeah, I'm aware that we can refine the set of WAL record
types. Basically, I thought the amount of WAL related to XLOG and XACT
should be smaller than that of the other types, so generating those
records wouldn't have a serious impact on performance.
But anyway, thanks for your advice!


Best,
Takamichi Osumi



Re: Removing unneeded self joins

2020-11-27 Thread Anastasia Lubennikova

On 31.10.2020 12:26, Andrey V. Lepikhov wrote:

Thank you for this partial review; I included your changes:

On 9/23/20 9:23 AM, David Rowley wrote:
On Fri, 3 Apr 2020 at 17:43, Andrey Lepikhov 
 wrote:

Doing things the way I describe will allow you to get rid of all the
UniqueRelInfo stuff.

Thanks for the review and sorry for the late reply.
I fixed the small mistakes mentioned in your letter.
Also, I rewrote this patch following your suggestion [1].
Because of the many changes, this patch can be viewed as a sketch.

To change the self-join detection algorithm I used your delta patch 
from [2]. I added complex expression handling to the 
split_selfjoin_quals routine for demonstration. But it is not very 
useful with the current infrastructure, I think.


Also I implemented one additional way for the self-join detection 
algorithm: if the join target list doesn't contain Vars from the inner 
relation, then we can detect a self-join with only quals like 
a1.x=a2.y, provided the innerrel_is_unique check returns true.
The analysis of the target list is contained in a new routine, 
tlist_contains_rel_exprs, a rewritten version of the 
build_joinrel_tlist routine.


Also, the changes to the join_is_removable() routine are removed from 
the patch; I couldn't understand why they were needed here.


Note that this patch changes the output of one join.sql regression test. 
It is not a bug, but it may need to be fixed.


Applied over commit 4a071afbd0.

[1] 
https://www.postgresql.org/message-id/CAKJS1f8p-KiEujr12k-oa52JNWWaQUjEjNg%2Bo1MGZk4mHBn_Rg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAKJS1f8cJOCGyoxi7a_LG7eu%2BWKF9%2BHTff3wp1KKS5gcUg2Qfg%40mail.gmail.com



Status update for a commitfest entry.

This entry was "Waiting on author" during this CF. As I see, the latest 
message contains a new version of the patch. Does it need more work? Are 
you going to continue working on it?


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [PATCH] Generic type subscripting

2020-11-27 Thread Alexander Korotkov
Hi!

I've started to review this patch.  My first question is whether we're
able to handle different subscript types differently.  For instance,
one day we could handle jsonpath subscripts for jsonb.  And for sure,
jsonpath subscripts are expected to be handled differently from text
subscripts.  I see we can distinguish types in the prepare and
validate functions.  But it seems there is no type information in the
fetch and assign functions.  Should we add something like this to the
SubscriptingRefState for future usage?

Datum uppertypeoid[MAX_SUBSCRIPT_DEPTH];
Datum lowertypeoid[MAX_SUBSCRIPT_DEPTH];

--
Regards,
Alexander Korotkov




Re: Improving spin-lock implementation on ARM.

2020-11-27 Thread Alexander Korotkov
On Fri, Nov 27, 2020 at 11:55 AM Michael Paquier  wrote:
> Not planning to buy one here, but everything I have read about it
> suggests it is worth a performance study.

Another interesting area for experiments is AWS Graviton2 instances.
The specification says they support ARMv8.2, so they should have the
swpal/casa instructions as well.
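
For reference, the kind of primitive under discussion can be expressed
with plain GCC builtins; built with -march=armv8-a+lse (or with
-moutline-atomics), the exchange below compiles to a single swpa instead
of an ldaxr/stxr retry loop. This is only an illustration, not the
patch's actual code:

/* illustration only: an LSE-friendly test-and-set spinlock primitive */
typedef volatile int slock_t;

static inline int
tas(slock_t *lock)
{
    /* single swpa on ARMv8.1+ LSE; ldaxr/stxr loop on older cores */
    return __atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE);
}

static inline void
s_unlock(slock_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}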

--
Regards,
Alexander Korotkov




Re: Add table access method as an option to pgbench

2020-11-27 Thread Fabien COELHO



Hello David,

Some feedback about v3:

In the doc I find TABLEACCESSMETHOD quite hard to read. Use TABLEAM 
instead? Also, the next entry uses lowercase (tablespace), why change the 
style?


Remove space after comma in help string. I'd use the more readable TABLEAM 
in the help string rather than the mouthful.


It looks like the option is *silently* ignored when creating a 
partitioned table, which currently results in an error, and not passed 
to the partitions, which would accept it. This is pretty weird.


I'd suggest that the table AM option should rather change the default 
(i.e. set default_table_access_method), so that it would apply 
implicitly to everything relevant when appropriate.


About tests :

They should also trigger failures, e.g. 
"--table-access-method=no-such-table-am", to be added to the @errors list.


I do not understand why you duplicated every possible option entry.

A test with just the table access method looks redundant if the feature 
is made to work orthogonally to partitions, which should be the case.


By the way, I saw the same style for other variables, such as 
escape_tablespace; should this be fixed as well?


Nope, let it be.

--
Fabien.




Re: Setof RangeType returns

2020-11-27 Thread Heikki Linnakangas

On 26/11/2020 23:28, Patrick Handja wrote:

Hello,

I am currently working on a library with some specific operators to 
manipulate range types in PostgreSQL. I would like to know if it is 
possible to return a setof rangetype elements in a C-language 
function, following the suggestion specified here: 
https://www.postgresql.org/docs/current/xfunc-c.html#XFUNC-C-RETURN-SET. 
I have been trying this for days. If what I am trying to do is 
impossible, is there any other way to return a set of range type values?


Yes, it is possible.

I bet there's just a silly little bug or misunderstanding in your code. 
This stuff can be fiddly. Feel free to post what you have here, and I'm 
sure someone will point out where the problem is very quickly.
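
For reference, here is a minimal sketch of the set-returning-function
pattern for ranges (the function name and the windowing logic are made
up for illustration; int4range is used for concreteness):

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "catalog/pg_type.h"
#include "utils/rangetypes.h"
#include "utils/typcache.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(range_windows);

/* range_windows(n int) emits n half-open windows: [0,10), [10,20), ... */
Datum
range_windows(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;

    if (SRF_IS_FIRSTCALL())
    {
        MemoryContext oldcontext;

        funcctx = SRF_FIRSTCALL_INIT();
        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

        funcctx->max_calls = PG_GETARG_INT32(0);
        /* look up the range type info once; the entry lives across calls */
        funcctx->user_fctx = lookup_type_cache(INT4RANGEOID,
                                               TYPECACHE_RANGE_INFO);

        MemoryContextSwitchTo(oldcontext);
    }

    funcctx = SRF_PERCALL_SETUP();

    if (funcctx->call_cntr < funcctx->max_calls)
    {
        TypeCacheEntry *typcache = (TypeCacheEntry *) funcctx->user_fctx;
        int32       base = (int32) funcctx->call_cntr * 10;
        RangeBound  lower;
        RangeBound  upper;
        RangeType  *range;

        lower.val = Int32GetDatum(base);
        lower.infinite = false;
        lower.inclusive = true;
        lower.lower = true;

        upper.val = Int32GetDatum(base + 10);
        upper.infinite = false;
        upper.inclusive = false;
        upper.lower = false;

        range = make_range(typcache, &lower, &upper, false);
        SRF_RETURN_NEXT(funcctx, RangeTypePGetDatum(range));
    }

    SRF_RETURN_DONE(funcctx);
}

The matching declaration would be something like
CREATE FUNCTION range_windows(int) RETURNS SETOF int4range
AS 'MODULE_PATHNAME' LANGUAGE C STRICT.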


- Heikki




Re: Improving spin-lock implementation on ARM.

2020-11-27 Thread Michael Paquier
On Fri, Nov 27, 2020 at 02:50:30AM -0500, Tom Lane wrote:
> Yeah, that wasn't making sense to me either.  The most likely explanation
> seems to be that I messed up the test somehow ... but I don't see where.
> So, again, I'm wondering if anyone else can replicate or refute this.

I do find your results extremely surprising, not only for 4 clients but
for all tests with connection counts lower than 32.  With a scale factor
of 100, that's a suspiciously large difference.

> I can't be the only geek around here who sprang for an M1.

Not planning to buy one here, but everything I have read about it
suggests it is worth a performance study.
--
Michael




Re: configure and DocBook XML

2020-11-27 Thread Paul Förster
Hi Tom,

> On 26. Nov, 2020, at 17:48, Tom Lane  wrote:
> 
> If you don't have the docbook stylesheets, but you do have xmllint,
> configure's probe will cause xmllint to try to download those
> stylesheets off the net.  For me, that always succeeds, but it
> takes two or three seconds.  I find it curious that it seems to be
> timing out for you.

Well, openSUSE Leap 15.2 here:

paul@weasel:~$ xmllint --version
xmllint: using libxml version 20907
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP 
HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X 
Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma 
paul@weasel:~$ type -a xmllint
xmllint is /usr/bin/xmllint
paul@weasel:~$ zypper se 'xmllint*'
Loading repository data...
Reading installed packages...
No matching items found.

I wonder why zypper tells me it's not there; if I use yast2 (GUI) to 
search for it, it's there.

Anyway, DocBook XML is definitely not there, neither in zypper se nor in yast2.

So why would xmllint try to download DocBook XML stylesheets if DocBook is not 
installed? I'm not a developer, but such a thing doesn't make sense to me.

Cheers,
Paul



Re: range_agg

2020-11-27 Thread Alexander Korotkov
Hi!

On Thu, Sep 24, 2020 at 3:05 AM Paul A Jungwirth
 wrote:
> On Sun, Aug 16, 2020 at 12:55 PM Paul A Jungwirth
>  wrote:
> > This is rebased on the current master, including some changes to doc
> > tables and pg_upgrade handling of type oids.
>
> Here is a rebased version of this patch, including a bunch of cleanup
> from Alvaro. (Thanks Alvaro!)

I'd like to review this patch.  Could you please rebase it once again?  Thanks.

--
Regards,
Alexander Korotkov




Re: Disable WAL logging to speed up data loading

2020-11-27 Thread Kyotaro Horiguchi
At Fri, 27 Nov 2020 07:01:16 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Masahiko Sawada 
> > While testing the patch on some workload, I realized that
> > XLOG_FPI_FOR_HINT record could still be emitted even when wal_level =
> > none. IIUC that WAL record is not necessary during wal_level = none
> > since the server cannot be the primary server and the server crash
> > ends up requiring to restore the whole database.
> 
> Nice catch!  XLOG_FPI_FOR_HINT and XLOG_FPI should be eliminated, otherwise 
> large amount of WAL may be written.  (It seems that other RMIDs than 
> RM_XLOG_ID and RM_XACT_ID do not have to be written.)
> 
> I'm afraid "none" doesn't represent the behavior because RM_XLOG_ID and 
> RM_XACT_ID WAL records, except for XLOG_FPI_*, are emitted.  What's the good 
> name?  IIUC, "minimal" is named after the fact that the minimal amount of WAL 
> necessary for crash recovery is generated.  "norecovery" or "unrecoverable"?

I haven't seen criteria for whether a record is emitted or not under
wal_level=none.

We're emitting only redo logs.  So I think theoretically we don't need
anything other than the shutdown checkpoint record, because we don't
perform recovery and the checkpoint record is required at startup.

RM_XLOG_ID:
 XLOG_FPI_FOR_HINT  - not needed?
 XLOG_FPI   - not needed?

 XLOG_CHECKPOINT_SHUTDOWN - must have

So how about the following ones?
 XLOG_CHECKPOINT_ONLINE
 XLOG_NOOP
 XLOG_NEXTOID
 XLOG_SWITCH
 XLOG_BACKUP_END
 XLOG_PARAMETER_CHANGE
 XLOG_RESTORE_POINT
 XLOG_FPW_CHANGE
 XLOG_END_OF_RECOVERY


RM_XACT_ID:
 XLOG_XACT_COMMIT
 XLOG_XACT_PREPARE
 XLOG_XACT_ABORT
 XLOG_XACT_COMMIT_PREPARED
 XLOG_XACT_ABORT_PREPARED
 XLOG_XACT_ASSIGNMENT
 XLOG_XACT_INVALIDATIONS

Do we need all of these?

And currently, what decides whether to emit a WAL record according to
wal_level is the caller of XLogInsert. So doing this at the XLogInsert
level means bringing the criteria for whether a WAL record is necessary
into the xlog layer only for wal_level=none. I'm not sure that is the
right direction.
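
For illustration, an XLogInsert-level filter might be built around a
predicate like the one below (a sketch only; the whitelist just mirrors
the lists above, and the WAL_LEVEL_NONE setting itself is an assumption,
not actual patch code):

#include "postgres.h"
#include "access/rmgr.h"
#include "catalog/pg_control.h"

/*
 * Hypothetical helper: would this record still be emitted under the
 * proposed wal_level=none?
 */
static bool
wal_record_needed_under_none(RmgrId rmid, uint8 info)
{
    if (rmid == RM_XACT_ID)
        return true;    /* commit/abort records */
    if (rmid == RM_XLOG_ID && info == XLOG_CHECKPOINT_SHUTDOWN)
        return true;    /* required at startup */
    return false;       /* everything else, including FPIs, is skipped */
}

A check like this would have to sit near the top of XLogInsert(), which
is exactly the layering question raised above.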

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [proposal] de-TOAST'ing using a iterator

2020-11-27 Thread Anastasia Lubennikova

On 02.11.2020 22:08, John Naylor wrote:



On Mon, Nov 2, 2020 at 1:30 PM Alvaro Herrera wrote:


On 2020-Nov-02, Anastasia Lubennikova wrote:

> Status update for a commitfest entry.
>
> This entry was inactive for a very long time.
> John, are you going to continue working on this?


Not in the near future. For background, this was a 2019 GSoC project 
where I was reviewer of record, and the patch is mostly good, but 
there is some architectural awkwardness. I have tried to address that, 
but have not had success.


The commitfest is nearing its end, and as this thread has stalled, I've 
marked it Returned with Feedback. Feel free to open a new entry if you 
return to this patch.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-27 Thread Alexander Korotkov
Hi!

I've found this patch marked as Ready for Committer in the commitfest
application.  I've quickly checked whether it's really ready for commit.
It seems there are still unaddressed review notes, so I'm going to
switch it to Waiting on Author.

--
Regards,
Alexander Korotkov




Setof RangeType returns

2020-11-27 Thread Patrick Handja
Hello,

I am currently working on a library with some specific operators to
manipulate range types in PostgreSQL. I would like to know if it is possible
to return a setof rangetype elements in a C-language function, following
the suggestion specified here:
https://www.postgresql.org/docs/current/xfunc-c.html#XFUNC-C-RETURN-SET. I
have been trying this for days. If what I am trying to do is impossible, is
there any other way to return a set of range type values?

Regards,

Andjasubu Bungama, Patrick


Re: autovac issue with large number of tables

2020-11-27 Thread Masahiko Sawada
On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao  wrote: