Re: Occasional tablespace.sql failures in check-world -jnn

2020-12-08 Thread Michael Paquier
On Tue, Dec 08, 2020 at 05:29:11PM -0800, Andres Freund wrote:
> I suspect this is related to the pg_upgrade test and the main regression
> test running at the same time. We have the following in
> src/test/regress/GNUmakefile.

Yes, this one is not completely new to -hackers.  See patch 0002 here
that slightly touched the topic by creating a specific makefile rule,
but I never got back to it as I never got annoyed by this problem:
https://www.postgresql.org/message-id/20200511.171354.514381788845037011.horikyota@gmail.com
What we have here is not a solution though...

> It's not clear to me why we have this logic in the makefile at all?
> Somebody taught pg_regress to do so, but only on windows... See
> convert_sourcefiles_in().

...  Because we may still introduce this problem again if some new
stuff uses src/test/regress's pg_regress in a way similar to pg_upgrade,
triggering tablespace-setup again.  Something like the attached may be
enough, though I have not spent much time checking the surroundings,
Windows included.

> The other thing that confuses me is why I started getting that error in
> *multiple* branches recently, even though I have used the parallel
> check-world for ages.

Perhaps you have just increased -j lately or moved to a faster machine
where there are higher chances of collision?
--
Michael
diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index 04aa7fd9f5..355477a8ee 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -106,7 +106,6 @@ outputdir="$temp_root/regress"
 EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --outputdir=$outputdir"
 export EXTRA_REGRESS_OPTS
 mkdir "$outputdir"
-mkdir "$outputdir"/testtablespace
 
 logdir=`pwd`/log
 rm -rf "$logdir"
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index c830627b00..583f975f08 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -114,13 +114,6 @@ submake-contrib-spi: | submake-libpgport submake-generated-headers
 
 .PHONY: submake-contrib-spi
 
-# Tablespace setup
-
-.PHONY: tablespace-setup
-tablespace-setup:
-	rm -rf ./testtablespace
-	mkdir ./testtablespace
-
 
 ##
 ## Run tests
@@ -128,19 +121,19 @@ tablespace-setup:
 
 REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 $(EXTRA_REGRESS_OPTS)
 
-check: all tablespace-setup
+check: all
 	$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
 
-check-tests: all tablespace-setup | temp-install
+check-tests: all | temp-install
 	$(pg_regress_check) $(REGRESS_OPTS) $(MAXCONNOPT) $(TESTS) $(EXTRA_TESTS)
 
-installcheck: all tablespace-setup
+installcheck: all
 	$(pg_regress_installcheck) $(REGRESS_OPTS) --schedule=$(srcdir)/serial_schedule $(EXTRA_TESTS)
 
-installcheck-parallel: all tablespace-setup
+installcheck-parallel: all
 	$(pg_regress_installcheck) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
 
-installcheck-tests: all tablespace-setup
+installcheck-tests: all
 	$(pg_regress_installcheck) $(REGRESS_OPTS) $(TESTS) $(EXTRA_TESTS)
 
 standbycheck: all
@@ -152,10 +145,10 @@ runcheck: check
 runtest: installcheck
 runtest-parallel: installcheck-parallel
 
-bigtest: all tablespace-setup
+bigtest: all
 	$(pg_regress_installcheck) $(REGRESS_OPTS) --schedule=$(srcdir)/serial_schedule numeric_big
 
-bigcheck: all tablespace-setup | temp-install
+bigcheck: all | temp-install
 	$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) numeric_big
 
 
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 23d7d0beb2..111457ee8f 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -506,24 +506,22 @@ convert_sourcefiles_in(const char *source_subdir, const char *dest_dir, const ch
 
 	snprintf(testtablespace, MAXPGPATH, "%s/testtablespace", outputdir);
 
-#ifdef WIN32
-
 	/*
-	 * On Windows only, clean out the test tablespace dir, or create it if it
-	 * doesn't exist so as it is possible to run the regression tests as a
-	 * Windows administrative user account with the restricted token obtained
-	 * when starting pg_regress.  On other platforms we expect the Makefile to
-	 * take care of that.
+	 * Clean out the test tablespace dir, or create it if it doesn't exist.
+	 * On Windows, doing this cleanup here makes it possible to run the
+	 * regression tests as a Windows administrative user account with the
+	 * restricted token obtained when starting pg_regress.
 	 */
 	if (directory_exists(testtablespace))
+	{
 		if (!rmtree(testtablespace, true))
 		{
 			fprintf(stderr, _("\n%s: could not remove test tablespace \"%s\"\n"),
 	progname, testtablespace);
 			exit(2);
 		}
+	}
 	make_directory(testtablespace);
-#endif
 
 	/* finally loop on each file and do the replacement */
 	for (name = names; *name; name++)
diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
index 266098e193..3e9110e4ed 100644
--- 

pglz compression performance, take two

2020-12-08 Thread Andrey Borodin
Hi hackers!

A year ago Vladimir Leskov proposed a patch to speed up pglz compression [0]. PFA 
the patch with some editorialisation by me.
I saw some reports of bottlenecking in pglz WAL compression [1].

Hopefully soon we will have compression codecs developed by compression 
specialists. That work is going on in a nearby thread about custom compression 
methods.
Is it viable to work on pglz optimisation? It's about 1.4x faster. Or should we 
rely on future use of lz4/zstd and others?

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/169163a8-c96f-4dbe-a062-7d1cecbe9...@yandex-team.ru
[1] 
https://smalldatum.blogspot.com/2020/12/tuning-for-insert-benchmark-postgres_4.html



0001-Reorganize-pglz-compression-code.patch
Description: Binary data


Re: get_constraint_index() and conindid

2020-12-08 Thread Michael Paquier
On Tue, Dec 08, 2020 at 01:28:39PM -0500, Tom Lane wrote:
> Matthias van de Meent  writes:
>> On Mon, 7 Dec 2020 at 11:09, Peter Eisentraut
>>  wrote:
>>> get_constraint_index() does its work by going through pg_depend.  It was
>>> added before pg_constraint.conindid was added, and some callers are
>>> still not changed.  Are there reasons for that?  Probably not.  The
>>> attached patch changes get_constraint_index() to an lsyscache-style
>>> lookup instead.
> 
>> This looks quite reasonable, and it passes "make installcheck-world".
> 
> +1, LGTM.

Nice cleanup!

>> Only thing I could think of is that it maybe could use a (small)
>> comment in the message on that/why get_constraint_index is moved to
>> utils/lsyscache from catalog/dependency, as that took me some time to
>> understand.
> 
> commit message could reasonably say that maybe, but I don't think we
> need to memorialize it in a comment.  lsyscache.c *is* where one
> would expect to find a simple catalog-field-fetch function like this.
> The previous implementation was not that, so it didn't belong there.

Agreed.
--
Michael


signature.asc
Description: PGP signature


Re: please update ps display for recovery checkpoint

2020-12-08 Thread Michael Paquier
On Wed, Dec 09, 2020 at 02:00:44AM +0900, Fujii Masao wrote:
> I agree it might be helpful to display something like
> "checkpointing" or "waiting for checkpoint" in PS title for the
> startup process.

Thanks.

> But, at least for me, it seems strange to display "idle" only for
> checkpointer.

Yeah, I'd rather leave out the part of the patch where we have the
checkpointer say "idle".  So I still think that what v3 did upthread,
by just resetting the ps display in all cases, feels more natural.  So
we should remove the parts of v5 from checkpointer.c.

+* Reset the ps status display.  We set the status to "idle" for the
+* checkpointer process, and we clear it for anything else that runs 
this
+* code.
+*/
+   if (MyBackendType == B_CHECKPOINTER)
+   set_ps_display("idle");
+   else
+   set_ps_display("");
Regarding this part, this just needs a reset with an empty string.
The second sentence is confusing (it partially comes from v3, from
me...).  Do we lose anything by removing the second sentence of this
comment?

+   snprintf(activitymsg, sizeof(activitymsg), "creating %s%scheckpoint",
[...]
+   snprintf(activitymsg, sizeof(activitymsg), "creating %srestartpoint",
FWIW, I still find "running" to sound better than "creating" here.  An
extra term I can think of that could be adapted is "performing".
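
For illustration, here is a minimal sketch (not the actual patch) of the
flow being discussed for the startup process: advertise the checkpoint in
the ps title while it runs, then reset the title with a plain empty string
rather than "idle".  The message wording below is only an example of the
verbs debated above.

	char		activitymsg[128];

	snprintf(activitymsg, sizeof(activitymsg),
			 "performing end-of-recovery checkpoint");
	set_ps_display(activitymsg);

	/* ... run the end-of-recovery checkpoint here ... */

	set_ps_display("");			/* plain reset, no special "idle" case */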

> *If* we want to monitor the current status of
> checkpointer, it should be shown as wait event in pg_stat_activity,
> instead? 

That would be WAIT_EVENT_CHECKPOINTER_MAIN, though there has also been an
argument on this thread that you would not have access to
pg_stat_activity until crash recovery completes and consistency is
restored.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-12-08 Thread David Rowley
On Sun, 6 Dec 2020 at 04:10, Andy Fan  wrote:
> For anyone who is interested with these patchsets, here is my plan about this
> now.  1).  I will try EquivalenceClass rather than Expr in UniqueKey and add 
> opfamily
> if needed.

I agree that we should be storing them in EquivalenceClasses. Apart
from what was mentioned already it also allows the optimisation to work
in cases like:

create table t (a int not null unique, b int);
select distinct b from t where a = b;

David




Re: Refactor MD5 implementations and switch to EVP for OpenSSL

2020-12-08 Thread Michael Paquier
On Mon, Dec 07, 2020 at 02:15:58PM +0100, Daniel Gustafsson wrote:
> This is a pretty straightforward patch which given the cryptohash framework
> added previously does what it says on the tin.  +1 on rolling MD5 into what we
> already have for SHA2 now. Applies clean and all tests pass.

Thanks.

> One (two-part) comment on the patch though:
>
> The re-arrangement of the code does lead to some attribution confusion however
> IMO.  pgcrypto/md5.c and src/common/md5.c are combined with the actual MD5
> implementation from pgcrypto/md5.c and the PG glue code from src/common/md5.c.
> That is in itself a good choice, but the file headers are intact and now 
> claim
> two implementations of which only one remains.

I was pretty sure that at some point I got the attributions messed up.
Thanks for looking at this stuff and ringing the bell.

> Further, bytesToHex was imported in commit 957613be18e6b7 with attribution to
> "Sverre H.  Huseby " without a PGDG defined copyright,
> which was later added in 2ca65f716aee9ec441dda91d91b88dd7a00bffa1.  This patch
> moves bytesToHex from md5.c (where his attribution is) to md5_common.c with no
> maintained attribution.  Off the cuff it seems we should either attribute in a
> comment, or leave the function and export it, but the usual "I'm not a lawyer"
> disclaimer applies.  Do you have any thoughts?

Hmm.  Looking at this area of the history, like d330f155, I think
that what we need to do is to move Sverre's attribution to
md5_common.c for all the sub-functions dedicated to the generation of
the MD5 passwords, by fixing the header comment of that file, since it
keeps the remaining code from the file where his code originally
lived.  In the new md5.c and md5_int.h, the fallback implementation
that we get to use is the KAME one from pgcrypto, so we should just
mention KAME there.

I have spent some time double-checking all this stuff, adjusting some
comments, and making the style of the new files more consistent with
the surroundings while minimizing the amount of noise diffs (pgindent
has adjusted some things by itself for the new files).  In short, this
seems rather committable to me.
--
Michael
From 6692b6131b8d0a60ab51583b80edf1c41bae5770 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 9 Dec 2020 14:39:19 +0900
Subject: [PATCH v4] Refactor MD5 implementations in the tree

This removes the duplicated MD5 implementations present in both
src/common/ and contrib/pgcrypto/, moving it to src/common/.

Similarly to the fallback implementation used for SHA2, the fallback
implementation of MD5 is moved to src/common/md5.c with an internal
header called md5_int.h for the init, update and final routines.  This
gets consumed by cryptohash.c.

When building with OpenSSL, EVP is used for MD5.  With the recent
refactoring work for cryptohash functions, this change is
straight-forward.

The original routines used for MD5-hashed passwords are moved to a
separate file called md5_common.c, also in src/common/, aimed at being
shared between all MD5 implementations.
---
 src/include/common/cryptohash.h   |   3 +-
 src/include/common/md5.h  |   3 +-
 src/common/Makefile   |   3 +-
 src/common/cryptohash.c   |  13 +
 src/common/cryptohash_openssl.c   |   3 +
 src/common/md5.c  | 663 ++
 src/common/md5_common.c   | 149 
 .../pgcrypto/md5.h => src/common/md5_int.h|  60 +-
 contrib/pgcrypto/Makefile |   2 +-
 contrib/pgcrypto/internal.c   |  25 +-
 contrib/pgcrypto/md5.c| 397 ---
 src/tools/msvc/Mkvcbuild.pm   |  12 +-
 12 files changed, 601 insertions(+), 732 deletions(-)
 create mode 100644 src/common/md5_common.c
 rename contrib/pgcrypto/md5.h => src/common/md5_int.h (63%)
 delete mode 100644 contrib/pgcrypto/md5.c

diff --git a/src/include/common/cryptohash.h b/src/include/common/cryptohash.h
index 0e4a6631a3..6ead1cb8e5 100644
--- a/src/include/common/cryptohash.h
+++ b/src/include/common/cryptohash.h
@@ -18,7 +18,8 @@
 /* Context Structures for each hash function */
 typedef enum
 {
-	PG_SHA224 = 0,
+	PG_MD5 = 0,
+	PG_SHA224,
 	PG_SHA256,
 	PG_SHA384,
 	PG_SHA512
diff --git a/src/include/common/md5.h b/src/include/common/md5.h
index 8695f10dff..b15635d600 100644
--- a/src/include/common/md5.h
+++ b/src/include/common/md5.h
@@ -1,7 +1,7 @@
 /*-
  *
  * md5.h
- *	  Interface to libpq/md5.c
+ *	  Constants and common utilities related to MD5.
  *
  * These definitions are needed by both frontend and backend code to work
  * with MD5-encrypted passwords.
@@ -19,6 +19,7 @@
 #define MD5_PASSWD_CHARSET	"0123456789abcdef"
 #define MD5_PASSWD_LEN	35
 
+/* Utilities common to all the MD5 implementations, as of md5_common.c */
 extern bool pg_md5_hash(const void *buff, size_t 

Re: Parallel Inserts in CREATE TABLE AS

2020-12-08 Thread Dilip Kumar
On Tue, Dec 8, 2020 at 6:24 PM Hou, Zhijie  wrote:
>
> > > I'm not quite sure how to address this. Can we not allow the planner
> > > to consider that the select is for CTAS and check only after the
> > > planning is done for the Gather node and other checks?
> > >
> >
> > IIUC, you are saying that we should not influence the cost of gather node
> > even when the insertion would be done by workers? I think that should be
> > our fallback option anyway but that might miss some paths to be considered
> > parallel where the cost becomes more due to parallel_tuple_cost (aka tuple
> > transfer cost). I think the idea is we can avoid the tuple transfer cost
> > only when Gather is the top node because only at that time we can push
> > insertion down, right? How about if we have some way to detect the same
> > before calling generate_useful_gather_paths()? I think when we are calling
> > apply_scanjoin_target_to_paths() in grouping_planner(), if the
> > query_level is 1, it is for CTAS, and it doesn't have a chance to create
> > UPPER_REL (doesn't have grouping, order, limit, etc clause) then we can
> > probably assume that the Gather will be top_node. I am not sure about this
> > but I think it is worth exploring.
> >
>
> I took a look at the parallel insert patch and have the same idea.
> https://commitfest.postgresql.org/31/2844/
>
>  * Consider generating Gather or Gather Merge paths.  We must only do 
> this
>  * if the relation is parallel safe, and we don't do it for child 
> rels to
>  * avoid creating multiple Gather nodes within the same plan. We must 
> do
>  * this after all paths have been generated and before set_cheapest, 
> since
>  * one of the generated paths may turn out to be the cheapest one.
>  */
> if (rel->consider_parallel && !IS_OTHER_REL(rel))
> generate_useful_gather_paths(root, rel, false);
>
> IMO the Gather path created here seems to be the right one, which could possibly 
> ignore the parallel cost if in CTAS.
> But we need to check the following parse options, which will create a path to be 
> the parent of the Gather path here.
>
> if (root->parse->rowMarks)
> if (limit_needed(root->parse))
> if (root->parse->sortClause)
> if (root->parse->distinctClause)
> if (root->parse->hasWindowFuncs)
> if (root->parse->groupClause || root->parse->groupingSets || 
> root->parse->hasAggs || root->root->hasHavingQual)
>

Yeah, and as I pointed out earlier, along with this we also need to
consider that the RelOptInfo must be the final target (top-level rel).
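
To make that condition concrete, here is a hedged sketch of the kind of
check being discussed.  The helper name is hypothetical, not an existing
API, and the tests simply mirror the parse options quoted above (the LIMIT
check is written out inline rather than calling limit_needed()):

/* assumes postgres.h and nodes/pathnodes.h are included */

/*
 * Hypothetical helper: true if the Gather path generated for the top-level
 * scan/join rel can be assumed to end up as the plan's top node, so that the
 * parallel tuple transfer cost could be ignored for CTAS.  The caller must
 * additionally ensure the rel in question is the final scan/join target.
 */
static bool
ctas_gather_would_be_top_node(PlannerInfo *root)
{
	Query	   *parse = root->parse;

	/* Only the top-level query qualifies. */
	if (root->query_level != 1)
		return false;

	/* Any of these would add nodes above the Gather. */
	if (parse->rowMarks ||
		parse->limitCount || parse->limitOffset ||	/* i.e. limit_needed() */
		parse->sortClause ||
		parse->distinctClause ||
		parse->hasWindowFuncs ||
		parse->groupClause ||
		parse->groupingSets ||
		parse->hasAggs ||
		root->hasHavingQual)
		return false;

	return true;
}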

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: small cleanup in unicode_norm.c

2020-12-08 Thread Michael Paquier
On Tue, Dec 08, 2020 at 02:25:43PM -0400, John Naylor wrote:
> Thanks for taking a look. Sounds good, I've made those adjustments and
> wrote a commit message. I took another look and didn't see anything else to
> address.

Looks good to me, so applied.
--
Michael


signature.asc
Description: PGP signature


RE: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2020-12-08 Thread tsunakawa.ta...@fujitsu.com
From: Alvaro Herrera 
> But what happens when you create another partition after you change the
> "loggedness" of the partitioned table?

The new partition will have whatever property is specified when the user 
creates it.  That is, while the storage property of each storage unit 
(= partition) is basically independent, ALTER TABLE on a partitioned table is a 
convenient way to set the same property value on all of its underlying storage 
units.


> Well, my main take from that is we should strive to change all
> subcommands together, in some principled way, rather than change only
> one or some, in arbitrary ways.

I got the impression from the discussion that some form of consensus on the 
principle was reached and that you were trying to create a fix for REPLICA 
IDENTITY.  Do you think the principle was unclear and we should state it first 
(partly to refresh people's memories)?

I'm kind of for it, but I'm hesitant to create the fix for all ALTER actions, 
because it may take a lot of time and energy, as you feared.  Can we define the 
principle, and then create individual fixes independently based on that 
principle?


Regards
Takayuki Tsunakawa





[PATCH] Feature improvement for CLOSE, FETCH, MOVE tab completion

2020-12-08 Thread Shinya11.Kato
Hi!

I created a patch improving CLOSE, FETCH, and MOVE tab completion.
Specifically, it adds tab completion for CLOSE, FETCH, and MOVE so that they
complete to the names of already-declared cursors.
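
For readers not familiar with psql's completion machinery, here is a
hypothetical sketch of what such support usually looks like in
src/bin/psql/tab-complete.c; the query macro name and the exact matching
rules below are assumptions, not necessarily what the attached patch does:

/* Hypothetical: list currently open cursors, from pg_catalog.pg_cursors */
#define Query_for_list_of_cursors \
"SELECT pg_catalog.quote_ident(name) FROM pg_catalog.pg_cursors "\
" WHERE substring(pg_catalog.quote_ident(name),1,%d)='%s'"

	/* ... inside psql_completion()'s else-if chain ... */
	else if (Matches("CLOSE"))
		COMPLETE_WITH_QUERY(Query_for_list_of_cursors);
	else if (Matches("FETCH|MOVE", MatchAny, "FROM|IN"))
		COMPLETE_WITH_QUERY(Query_for_list_of_cursors);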

Regards,
Shinya Kato


fix_tab_complete_close_fetch_move.patch
Description: fix_tab_complete_close_fetch_move.patch


Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-08 Thread Andres Freund
On 2020-12-08 22:02:24 -0500, Tom Lane wrote:
> BTW, I observe that with MAXDIM gone from execExpr.h, there are
> no widely-visible uses of MAXDIM except for array.h.  I therefore
> suggest that we should pull "#define MAXDIM" out of c.h and put it
> into array.h, as attached.  I was slightly surprised to find that
> this seems to entail *no* new inclusions of array.h ... I expected
> there would be one or two.  But the main point here is we want to
> restrict use of that symbol to stuff that's tightly integrated with
> varlena-array handling, so it ought not be in c.h.

+1




Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-08 Thread Tom Lane
BTW, I observe that with MAXDIM gone from execExpr.h, there are
no widely-visible uses of MAXDIM except for array.h.  I therefore
suggest that we should pull "#define MAXDIM" out of c.h and put it
into array.h, as attached.  I was slightly surprised to find that
this seems to entail *no* new inclusions of array.h ... I expected
there would be one or two.  But the main point here is we want to
restrict use of that symbol to stuff that's tightly integrated with
varlena-array handling, so it ought not be in c.h.

regards, tom lane

diff --git a/src/include/c.h b/src/include/c.h
index 12ea056a35..7bc4b8a001 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -591,10 +591,6 @@ typedef uint32 CommandId;
 #define FirstCommandId	((CommandId) 0)
 #define InvalidCommandId	(~(CommandId)0)
 
-/*
- * Maximum number of array subscripts, for regular varlena arrays
- */
-#define MAXDIM 6
 
 /* 
  *		Variable-length datatypes all share the 'struct varlena' header.
diff --git a/src/include/utils/array.h b/src/include/utils/array.h
index 2809dfee93..16925880a1 100644
--- a/src/include/utils/array.h
+++ b/src/include/utils/array.h
@@ -69,6 +69,11 @@ struct ExprState;
 struct ExprContext;
 
 
+/*
+ * Maximum number of array subscripts (arbitrary limit)
+ */
+#define MAXDIM 6
+
 /*
  * Arrays are varlena objects, so must meet the varlena convention that
  * the first int32 of the object contains the total object size in bytes.


Re: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently

2020-12-08 Thread Alvaro Herrera
On 2020-Dec-08, tsunakawa.ta...@fujitsu.com wrote:

> From: Alvaro Herrera 
> > Does "ALTER TABLE ONLY parent" work correctly?  Namely, do not affect
> > existing partitions, but cause future partitions to acquire the new
> > setting.
> 
> Yes, it works correctly in the sense that ALTER TABLE ONLY on a
> partitioned table does nothing because it has no storage and therefore
> logged/unlogged has no meaning.

But what happens when you create another partition after you change the
"loggedness" of the partitioned table?

> > This sounds very much related to previous discussion on REPLICA IDENTITY
> > not propagating to partitions; see
> > https://postgr.es/m/201902041630.gpadougzab7v@alvherre.pgsql
> > particularly Robert Haas' comments at
> > http://postgr.es/m/CA+TgmoYjksObOzY8b1U1=BsP_m+702eOf42fqRtQTYio
> > 1nu...@mail.gmail.com
> > and downstream discussion from there.
> 
> There was a hot discussion.  I've read through it.  I hope I'm not
> doing anything strange in light of that discussion.

Well, my main take from that is we should strive to change all
subcommands together, in some principled way, rather than change only
one or some, in arbitrary ways.




Fix typo about generate_gather_paths

2020-12-08 Thread Hou, Zhijie
Hi

Since ba3e76c, the optimizer calls generate_useful_gather_paths() instead of 
generate_gather_paths() from outside.

But I noticed that some comments still talk about generate_gather_paths rather 
than generate_useful_gather_paths.
I think we should fix these comments, so I have replaced 
"generate_gather_paths" with "generate_useful_gather_paths" where appropriate.

Best regards,
houzj





0001-fix-typo-about-generate_gather_paths.patch
Description: 0001-fix-typo-about-generate_gather_paths.patch


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-08 Thread k.jami...@fujitsu.com
On Wednesday, December 9, 2020 10:58 AM, Tsunakawa, Takayuki wrote: 
> From: Kyotaro Horiguchi 
> > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila
> >  wrote in
>  > I also can't think of a way to use an optimized path for such cases
> > > but I don't agree with your comment on if it is common enough that
> > > we leave this optimization entirely for the truncate path.
> >
> > An ugly way to cope with it would be to let other smgr functions
> > manage the cached value, for example, by calling smgrnblocks while
> > InRecovery.  Or letting smgr remember the maximum block number ever
> > accessed.  But we cannot fully rely on that since smgr can be closed
> > midst of a session and smgr doesn't offer such persistence.  In the
> > first place smgr doesn't seem to be the place to store such persistent
> > information.
> 
> Yeah, considering the future evolution of this patch to operations during
> normal running, I don't think that would be a good fit, either.
> 
> Then, as we're currently targeting just recovery, the options we can take
> are below.  Which would you vote for?  My choice would be (3) > (2) > (1).
> 
> 
> (1)
> Use the cached flag in both VACUUM (0003) and TRUNCATE (0004).
> This brings the most safety and code consistency.
> But this would not benefit from optimization for TRUNCATE in unexpectedly
> many cases -- when TOAST storage exists but it's not written, or FSM/VM is
> not updated after checkpoint.
> 
> 
> (2)
> Use the cached flag in VACUUM (0003), but use InRecovery instead of the
> cached flag in TRUNCATE (0004).
> This benefits from the optimization in all cases.
> But this lacks code consistency.
> You may be afraid of safety if the startup process smgrclose()s the relation
> after the shared buffer flushing hits disk full.  However, startup process
> doesn't smgrclose(), so it should be safe.  Just in case the startup process
> smgrclose()s, the worst consequence is PANIC shutdown after repeated
> failure of checkpoints due to lingering orphaned dirty shared buffers.  Accept
> it as Thomas-san's devil's suggestion.
> 
> 
> (3)
> Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004).
> This benefits from the optimization in all cases.
> The code is consistent and smaller.
> As for the safety, this is the same as (2), but it applies to VACUUM as well.

If we want code consistency, then we'd go with either (1) or (3).
And if we want the benefits of the optimization for both DropRelFileNodeBuffers
and DropRelFileNodesAllBuffers, then I'd choose (3).
However, if the reviewers and committer want to make use of the "cached" flag,
then we can live with the "cached" value in place there, even though it's not
common to get the optimization for the TRUNCATE path; only VACUUM would get
most of the benefit.
My vote is also (3), then (2), then (1).

Regards,
Kirk Jamison




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-08 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila 
> wrote in
 > I also can't think of a way to use an optimized path for such cases
> > but I don't agree with your comment on if it is common enough that we
> > leave this optimization entirely for the truncate path.
> 
> An ugly way to cope with it would be to let other smgr functions
> manage the cached value, for example, by calling smgrnblocks while
> InRecovery.  Or letting smgr remember the maximum block number ever
> accessed.  But we cannot fully rely on that since smgr can be closed
> midst of a session and smgr doesn't offer such persistence.  In the
> first place smgr doesn't seem to be the place to store such persistent
> information.

Yeah, considering the future evolution of this patch to operations during 
normal running, I don't think that would be a good fit, either.

Then, as we're currently targeting just recovery, the options we can take 
are below.  Which would you vote for?  My choice would be (3) > (2) > (1).


(1)
Use the cached flag in both VACUUM (0003) and TRUNCATE (0004).
This brings the most safety and code consistency.
But this would not benefit from optimization for TRUNCATE in unexpectedly many 
cases -- when TOAST storage exists but it's not written, or FSM/VM is not 
updated after checkpoint.


(2)
Use the cached flag in VACUUM (0003), but use InRecovery instead of the cached 
flag in TRUNCATE (0004).
This benefits from the optimization in all cases.
But this lacks code consistency.
You may be afraid of safety if the startup process smgrclose()s the relation 
after the shared buffer flushing hits disk full.  However, startup process 
doesn't smgrclose(), so it should be safe.  Just in case the startup process 
smgrclose()s, the worst consequence is PANIC shutdown after repeated failure of 
checkpoints due to lingering orphaned dirty shared buffers.  Accept it as 
Thomas-san's devil's suggestion.


(3)
Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004).
This benefits from the optimization in all cases.
The code is consistent and smaller.
As for the safety, this is the same as (2), but it applies to VACUUM as well.
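
As a rough illustration of what (3) would mean for DropRelFileNodeBuffers(),
here is a hedged sketch.  The threshold constant and the helper that
invalidates individual buffers are hypothetical names, not the patch's actual
identifiers; when the function returns false, the caller would fall back to
scanning the whole buffer pool as today.

/* assumes storage/smgr.h and access/xlog.h (for InRecovery) are included */

/* Hypothetical sketch of option (3): trust smgrnblocks() while in recovery. */
static bool
try_optimized_buffer_drop(SMgrRelation reln, ForkNumber *forkNum, int nforks,
						  BlockNumber *firstDelBlock)
{
	BlockNumber blocks_to_drop = 0;
	int			j;

	if (!InRecovery)
		return false;			/* no cached-flag check, per option (3) */

	/* count only the blocks at or beyond the truncation point of each fork */
	for (j = 0; j < nforks; j++)
		blocks_to_drop += smgrnblocks(reln, forkNum[j]) - firstDelBlock[j];

	if (blocks_to_drop >= BUF_DROP_FULL_SCAN_THRESHOLD)	/* hypothetical */
		return false;			/* large relation: the full scan is cheaper */

	/* hypothetical helper: invalidate only the buffers of the dropped blocks */
	drop_buffers_by_block_range(reln, forkNum, nforks, firstDelBlock);
	return true;
}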


Regards
Takayuki Tsunakawa





Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-08 Thread Tom Lane
Andres Freund  writes:
> On 2020-12-08 11:05:05 -0500, Tom Lane wrote:
>> Other than that nit, please finish this up and push it so I can finish
>> the generic-subscripting patch.

> Done.

Cool, thanks.  I'll fix that part of the generic-subscript patch
and push it tomorrow, unless somebody objects to it before then.

regards, tom lane




Re: Logical archiving

2020-12-08 Thread Craig Ringer
On Tue, 8 Dec 2020 at 17:54, Andrey Borodin  wrote:


> > In pglogical3 we already support streaming decoded WAL data to
> alternative writer downstreams including RabbitMQ and Kafka via writer
> plugins.
> Yes, Yandex.Cloud Transfer Manager supports it too. But it has to be
> resynced after physical failover. And the internal installation of YC has
> mandatory drills: a few times a month one datacenter is disconnected and
> failover happens for thousands of DBs.
>

You'll want to look at
https://wiki.postgresql.org/wiki/Logical_replication_and_physical_standby_failover#All-logical-replication_HA
then.


Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-08 Thread Andres Freund
Hi,

On 2020-12-08 11:05:05 -0500, Tom Lane wrote:
> I've now studied this patch and it seems sane to me, although
> I wondered why you wrote "extern"s here:

Brainfart...


> Other than that nit, please finish this up and push it so I can finish
> the generic-subscripting patch.

Done.


> > WRT the prototype, I think it may be worth removing most of the types
> > from llvmjit.h. Worth keeping the most common ones, but most aren't used
> > all the time so terseness doesn't matter that much, and
> > the llvm_pg_var_type() would suffice.
> 
> Hm, that would mean redoing llvm_pg_var_type() often wouldn't it?
> I don't have a very good feeling for how expensive that is, so I'm
> not sure if this seems like a good idea or not.

It should be fairly cheap - it's basically one hashtable lookup.

Greetings,

Andres Freund




Re: Blocking I/O, async I/O and io_uring

2020-12-08 Thread Craig Ringer
On Tue, 8 Dec 2020 at 15:04, Andres Freund  wrote:

> Hi,
>
> On 2020-12-08 04:24:44 +, tsunakawa.ta...@fujitsu.com wrote:
> > I'm looking forward to this from the async+direct I/O, since the
> > throughput of some write-heavy workload decreased by half or more
> > during checkpointing (due to fsync?)
>
> Depends on why that is. The most common, I think, cause is that your WAL
> volume increases drastically just after a checkpoint starts, because
> initially all page modification will trigger full-page writes.  There's
> a significant slowdown even if you prevent the checkpointer from doing
> *any* writes at that point.  I got the WAL AIO stuff to the point that I
> see a good bit of speedup at high WAL volumes, and I see it helping in
> this scenario.
>
> There's of course also the issue that checkpoint writes cause other IO
> (including WAL writes) to slow down and, importantly, cause a lot of
> jitter leading to unpredictable latencies.  I've seen some good and some
> bad results around this with the patch, but there's a bunch of TODOs to
> resolve before delving deeper really makes sense (the IO depth control
> is not good enough right now).
>
> A third issue is that sometimes checkpointer can't really keep up - and
> that I think I've seen pretty clearly addressed by the patch. I have
> managed to get to ~80% of my NVMe disks top write speed (> 2.5GB/s) by
> the checkpointer, and I think I know what to do for the remainder.
>
>
Thanks for explaining this. I'm really glad you're looking into it. If I
get the chance I'd like to try to apply some wait-analysis and blocking
stats tooling to it. I'll report back if I make any progress there.


Occasional tablespace.sql failures in check-world -jnn

2020-12-08 Thread Andres Freund
Hi,

For fairly obvious reasons I like to run check-world in parallel [1]. In
the last few months I've occasionally seen failures during that that I
cannot recall seeing before.

--- 
/home/andres/build/postgres/13-assert/vpath/src/test/regress/expected/tablespace.out
2020-12-07 18:41:23.079235588 -0800
+++ 
/home/andres/build/postgres/13-assert/vpath/src/test/regress/results/tablespace.out
 2020-12-07 18:42:01.892632468 -0800
@@ -209,496 +209,344 @@
 ERROR:  cannot specify default tablespace for partitioned relations
 CREATE TABLE testschema.dflt (a int PRIMARY KEY USING INDEX TABLESPACE 
regress_tblspace) PARTITION BY LIST (a);
 ERROR:  cannot specify default tablespace for partitioned relations
 -- but these work:
 CREATE TABLE testschema.dflt (a int PRIMARY KEY USING INDEX TABLESPACE 
regress_tblspace) PARTITION BY LIST (a) TABLESPACE regress_tblspace;
 SET default_tablespace TO '';
 CREATE TABLE testschema.dflt2 (a int PRIMARY KEY) PARTITION BY LIST (a);
 DROP TABLE testschema.dflt, testschema.dflt2;
 -- check that default_tablespace doesn't affect ALTER TABLE index rebuilds
 CREATE TABLE testschema.test_default_tab(id bigint) TABLESPACE 
regress_tblspace;
+ERROR:  could not create directory "pg_tblspc/16387/PG_13_202007201/16384": No 
such file or directory
 INSERT INTO testschema.test_default_tab VALUES (1);

(many failures follow)


I suspect this is related to the pg_upgrade test and the main regression
test running at the same time. We have the following in 
src/test/regress/GNUmakefile

# Tablespace setup

.PHONY: tablespace-setup
tablespace-setup:
	echo $(realpath ./testtablespace) >> /tmp/tablespace.log
	rm -rf ./testtablespace
	mkdir ./testtablespace
...

which pg_upgrade triggers. Even though it, as far as I can tell, never
actually ends up putting any data in it:

# Send installcheck outputs to a private directory.  This avoids conflict when
# check-world runs pg_upgrade check concurrently with src/test/regress check.
# To retrieve interesting files after a run, use pattern tmp_check/*/*.diffs.
outputdir="$temp_root/regress"
EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --outputdir=$outputdir"
export EXTRA_REGRESS_OPTS
mkdir "$outputdir"
mkdir "$outputdir"/testtablespace

It's not clear to me why we have this logic in the makefile at all?
Somebody taught pg_regress to do so, but only on windows... See
convert_sourcefiles_in().


The other thing that confuses me is why I started getting that error in
*multiple* branches recently, even though I have used the parallel
check-world for ages.

Greetings,

Andres Freund

[1]: make -Otarget -j20 -s check-world && echo success || echo failed




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-08 Thread Kyotaro Horiguchi
At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila  wrote 
in 
> On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com
>  wrote:
> >
> > From: Jamison, Kirk/ジャミソン カーク 
> > > Because one of the rel's cached value was false, it forced the
> > > full-scan path for TRUNCATE.
> > > Is there a possible workaround for this?
> >
> > Hmm, the other two relfilenodes are for the TOAST table and index of the 
> > target table.  I think the INSERT didn't access those TOAST relfilenodes 
> > because the inserted data was stored in the main storage.  But TRUNCATE 
> > always truncates all the three relfilenodes.  So, the standby had not 
> > opened the relfilenode for the TOAST stuff or cached its size when 
> > replaying the TRUNCATE.
> >
> > I'm afraid this is more common than we can ignore and accept the slow 
> > traditional path, but I don't think of a good idea to use the cached flag.
> >
> 
> I also can't think of a way to use an optimized path for such cases
> but I don't agree with your comment on if it is common enough that we
> leave this optimization entirely for the truncate path.

Mmm. At least btree doesn't need to call smgrnblocks except at
expansion, so we cannot get to the optimized path in most cases of
truncation involving btree (and/or maybe other indexes). TOAST
relations are not accessed until we insert/update/retrieve the values
in them.

An ugly way to cope with it would be to let other smgr functions
manage the cached value, for example, by calling smgrnblocks while
InRecovery.  Or letting smgr remember the maximum block number ever
accessed.  But we cannot fully rely on that since smgr can be closed
midst of a session and smgr doesn't offer such persistence.  In the
first place smgr doesn't seem to be the place to store such persistent
information.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: autovac issue with large number of tables

2020-12-08 Thread Kasahara Tatsuhito
On Wed, Dec 9, 2020 at 12:01 AM Fujii Masao  wrote:
>
>
>
> On 2020/12/04 12:21, Kasahara Tatsuhito wrote:
> > Hi,
> >
> > On Thu, Dec 3, 2020 at 9:09 PM Fujii Masao  
> > wrote:
> >>
> >>
> >>
> >> On 2020/12/03 11:46, Kasahara Tatsuhito wrote:
> >>> On Wed, Dec 2, 2020 at 7:11 PM Masahiko Sawada  
> >>> wrote:
> 
>  On Wed, Dec 2, 2020 at 3:33 PM Fujii Masao  
>  wrote:
> >
> >
> >
> > On 2020/12/02 12:53, Masahiko Sawada wrote:
> >> On Tue, Dec 1, 2020 at 5:31 PM Masahiko Sawada  
> >> wrote:
> >>>
> >>> On Tue, Dec 1, 2020 at 4:32 PM Fujii Masao 
> >>>  wrote:
> 
> 
> 
>  On 2020/12/01 16:23, Masahiko Sawada wrote:
> > On Tue, Dec 1, 2020 at 1:48 PM Kasahara Tatsuhito
> >  wrote:
> >>
> >> Hi,
> >>
> >> On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao 
> >>  wrote:
> >>>
> >>>
> >>>
> >>> On 2020/11/30 10:43, Masahiko Sawada wrote:
>  On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito
>   wrote:
> >
> > Hi, Thanks for you comments.
> >
> > On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao 
> >  wrote:
> >>
> >>
> >>
> >> On 2020/11/27 18:38, Kasahara Tatsuhito wrote:
> >>> Hi,
> >>>
> >>> On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao 
> >>>  wrote:
> 
> 
> 
>  On 2020/11/26 10:41, Kasahara Tatsuhito wrote:
> > On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada 
> >  wrote:
> >>
> >> On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
> >>  wrote:
> >>>
> >>> Hi,
> >>>
> >>> On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada 
> >>>  wrote:
> 
>  On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
>   wrote:
> >
> > Hi,
> >
> > On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
> >  wrote:
> >>> I wonder if we could have table_recheck_autovac do 
> >>> two probes of the stats
> >>> data.  First probe the existing stats data, and if it 
> >>> shows the table to
> >>> be already vacuumed, return immediately.  If not, 
> >>> *then* force a stats
> >>> re-read, and check a second time.
> >> Does the above mean that the second and subsequent 
> >> table_recheck_autovac()
> >> will be improved to first check using the previous 
> >> refreshed statistics?
> >> I think that certainly works.
> >>
> >> If that's correct, I'll try to create a patch for the 
> >> PoC
> >
> > I still don't know how to reproduce Jim's troubles, but 
> > I was able to reproduce
> > what was probably a very similar problem.
> >
> > This problem seems to be more likely to occur in cases 
> > where you have
> > a large number of tables,
> > i.e., a large amount of stats, and many small tables 
> > need VACUUM at
> > the same time.
> >
> > So I followed Tom's advice and created a patch for the 
> > PoC.
> > This patch will enable a flag in the 
> > table_recheck_autovac function to use
> > the existing stats next time if VACUUM (or ANALYZE) has 
> > already been done
> > by another worker on the check after the stats have 
> > been updated.
> > If the tables continue to require VACUUM after the 
> > refresh, then a refresh
> > will be required instead of using the existing 
> > statistics.
> >
> > I did simple test with HEAD and HEAD + this PoC patch.
> > The tests were conducted in two cases.
> > (I changed few configurations. see attached scripts)
> >
> > 1. Normal VACUUM case
> >  - SET autovacuum = off
> >  - CREATE tables with 100 rows
> >  - DELETE 90 rows for each tables

Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-08 Thread Tom Lane
Andres Freund  writes:
> On 2020-12-07 17:25:41 -0500, Tom Lane wrote:
>> I can see that that should work for the two existing implementations
>> of EEO_CASE, but I wasn't sure if you wanted to wire in an assumption
>> that it'll always work.

> I don't think it's likely to be a problem, and if it ends up being one,
> we can still deduplicate the ops at that point...

Seems reasonable.

Here's a v38 that addresses the semantic loose ends I was worried about.
I decided that it's worth allowing subscripting functions to dictate
whether they should be considered strict or not, at least for the fetch
side (store is still assumed nonstrict always) and whether they should be
considered leakproof or not.  That requires only a minimal amount of extra
code.  While the planner does have to do extra catalog lookups to check
strictness and leakproofness, those are not common things to need to
check, so I don't think we're paying anything in performance for the
flexibility.  I left out the option of "strict store" because that *would*
have required extra code (to generate a nullness test on the replacement
value) and the potential use-case seems too narrow to justify that.
I also left out any option to control volatility or parallel safety,
again on the grounds of lack of use-case; plus, planner checks for those
properties would have been in significantly hotter code paths.

I'm waiting on your earlier patch to rewrite the llvmjit_expr.c code,
but otherwise I think this is ready to go.

regards, tom lane

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 2d44df19fe..ca2f9f3215 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -426,23 +426,28 @@ foreign_expr_walker(Node *node,
 	return false;
 
 /*
- * Recurse to remaining subexpressions.  Since the container
- * subscripts must yield (noncollatable) integers, they won't
- * affect the inner_cxt state.
+ * Recurse into the remaining subexpressions.  The container
+ * subscripts will not affect collation of the SubscriptingRef
+ * result, so do those first and reset inner_cxt afterwards.
  */
 if (!foreign_expr_walker((Node *) sr->refupperindexpr,
 		 glob_cxt, &inner_cxt))
 	return false;
+inner_cxt.collation = InvalidOid;
+inner_cxt.state = FDW_COLLATE_NONE;
 if (!foreign_expr_walker((Node *) sr->reflowerindexpr,
 		 glob_cxt, &inner_cxt))
 	return false;
+inner_cxt.collation = InvalidOid;
+inner_cxt.state = FDW_COLLATE_NONE;
 if (!foreign_expr_walker((Node *) sr->refexpr,
 		 glob_cxt, &inner_cxt))
 	return false;
 
 /*
- * Container subscripting should yield same collation as
- * input, but for safety use same logic as for function nodes.
+ * Container subscripting typically yields same collation as
+ * refexpr's, but in case it doesn't, use same logic as for
+ * function nodes.
  */
 collation = sr->refcollid;
 if (collation == InvalidOid)
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 79069ddfab..583a5ce3b9 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8740,6 +8740,21 @@ SCRAM-SHA-256$iteration count:
   
  
 
+ 
+  
+   typsubscript regproc
+   (references pg_proc.oid)
+  
+  
+   Subscripting handler function's OID, or zero if this type doesn't
+   support subscripting.  Types that are true array
+   types have typsubscript
+   = array_subscript_handler, but other types may
+   have other handler functions to implement specialized subscripting
+   behavior.
+  
+ 
+
  
   
typelem oid
@@ -8747,19 +8762,12 @@ SCRAM-SHA-256$iteration count:
   
   
If typelem is not 0 then it
-   identifies another row in pg_type.
-   The current type can then be subscripted like an array yielding
-   values of type typelem.  A
-   true array type is variable length
-   (typlen = -1),
   but some fixed-length (typlen > 0) types
-   also have nonzero typelem, for example
-   name and point.
-   If a fixed-length type has a typelem then
-   its internal representation must be some number of values of the
-   typelem data type with no other data.
-   Variable-length array types have a header defined by the array
-   subroutines.
+   identifies another row in pg_type,
+   defining the type yielded by subscripting.  This should be 0
+   if typsubscript is 0.  However, it can
+   be 0 when typsubscript isn't 0, if the
+   handler doesn't need typelem to
+   determine the subscripting result type.
   
  
 
diff --git a/doc/src/sgml/ref/create_type.sgml b/doc/src/sgml/ref/create_type.sgml
index 970b517db9..fc09282db7 100644
--- a/doc/src/sgml/ref/create_type.sgml
+++ b/doc/src/sgml/ref/create_type.sgml
@@ -43,6 +43,7 @@ CREATE TYPE name (
 [ , 

Re: Commitfest 2020-11 is closed

2020-12-08 Thread Thomas Munro
On Fri, Dec 4, 2020 at 1:29 AM Peter Eisentraut
 wrote:
> On 2020-12-02 23:13, Thomas Munro wrote:
> > I'm experimenting with Github's built in CI.  All other ideas welcome.
>
> You can run Linux builds on AppVeyor, too.

Yeah.  This would be the easiest change to make, because I already
have configuration files, cfbot already knows how to talk to Appveyor
to collect results, and the build results are public URLs.  So I'm
leaning towards this option in the short term, so that cfbot keeps
working after December 31.  We can always review the options later.
Appveyor's free-for-open-source plan has no cap on minutes, but limits
concurrent jobs to 1.

Gitlab's free-for-open-source plan is based on metered CI minutes, but
cfbot is a bit too greedy for the advertised limits.  An annual
allotment of 50,000 minutes would run out some time in February
assuming we rebase each of ~250 branches every few days.

GitHub Actions has very generous resource limits, nice UX and API, etc
etc, so it's tempting, but its build log URLs can only be accessed by
people with accounts.  That limits their usefulness when discussing
test failures on our public mailing list.  I thought about teaching
cfbot to pull down the build logs and host them on its own web server,
but that feels like going against the grain.




Re: get_constraint_index() and conindid

2020-12-08 Thread Tom Lane
Matthias van de Meent  writes:
> On Mon, 7 Dec 2020 at 11:09, Peter Eisentraut
>  wrote:
>> get_constraint_index() does its work by going through pg_depend.  It was
>> added before pg_constraint.conindid was added, and some callers are
>> still not changed.  Are there reasons for that?  Probably not.  The
>> attached patch changes get_constraint_index() to an lsyscache-style
>> lookup instead.

> This looks quite reasonable, and it passes "make installcheck-world".

+1, LGTM.

> Only thing I could think of is that it maybe could use a (small)
> comment in the message on that/why get_constraint_index is moved to
> utils/lsyscache from catalog/dependency, as that took me some time to
> understand.

commit message could reasonably say that maybe, but I don't think we
need to memorialize it in a comment.  lsyscache.c *is* where one
would expect to find a simple catalog-field-fetch function like this.
The previous implementation was not that, so it didn't belong there.
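
For reference, a lookup of that style is only a few lines.  Sketched here as
an illustration of the approach being described, not necessarily the exact
committed form:

/* requires utils/syscache.h, utils/lsyscache.h and catalog/pg_constraint.h */
Oid
get_constraint_index(Oid conoid)
{
	HeapTuple	tp;

	tp = SearchSysCache1(CONSTROID, ObjectIdGetDatum(conoid));
	if (HeapTupleIsValid(tp))
	{
		Form_pg_constraint contup = (Form_pg_constraint) GETSTRUCT(tp);
		Oid			result;

		result = contup->conindid;
		ReleaseSysCache(tp);
		return result;
	}
	return InvalidOid;
}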

regards, tom lane




Re: small cleanup in unicode_norm.c

2020-12-08 Thread John Naylor
On Tue, Dec 8, 2020 at 5:45 AM Michael Paquier  wrote:
>
> On Mon, Dec 07, 2020 at 03:24:56PM -0400, John Naylor wrote:
> > We've had get_canonical_class() for a while as a backend-only function.
> > There is some ad-hoc code elsewhere that implements the same logic in a
> > couple places, so it makes sense for all sites to use this function
> > instead, as in the attached.
>
> Thanks John for caring about that.  This is a nice simplification, and
> it looks fine to me.
>
> -static uint8
> -get_canonical_class(pg_wchar ch)
> -{
> Two nits here.  I would use "code" for the name of the argument for
> consistency with get_code_entry(), and add a description at the top of
> this helper routine (say a simple "get the combining class of given
> code").  Anything else you can think of?

Thanks for taking a look. Sounds good, I've made those adjustments and
wrote a commit message. I took another look and didn't see anything else to
address.

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v2-0001-Simplify-ad-hoc-code-for-getting-a-Unicode-codepo.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2020-12-08 Thread Andrey Borodin
Hi Gilles!

Many thanks for your message!

> On 8 Dec 2020, at 21:05, Gilles Darold wrote:
> 
> I know that this report is not really helpful 

Quite the contrary - these benchmarks prove that a controllable reproduction 
exists.  I've rebased the patches for PG11. Can you please benchmark them 
(without extending the SLRU)?

Best regards, Andrey Borodin.



v1106-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-o.patch
Description: Binary data


v1106-0002-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data


v1106-0003-Add-conditional-variable-to-wait-for-next-Mult.patch
Description: Binary data


v1106-0004-Add-GUCs-to-tune-MultiXact-SLRUs.patch
Description: Binary data


Re: Change default of checkpoint_completion_target

2020-12-08 Thread Magnus Hagander
On Tue, Dec 8, 2020 at 6:42 PM Tom Lane  wrote:

> "Bossart, Nathan"  writes:
> > On 12/7/20, 9:53 AM, "Stephen Frost"  wrote:
> >> Concretely, attached is a patch which changes the default and updates
> >> the documentation accordingly.
>
> > +1 to setting checkpoint_completion_target to 0.9 by default.
>
> FWIW, I kind of like the idea of getting rid of it completely.
> Is there really ever a good reason to set it to something different
> than that?  If not, well, we have too many GUCs already, and each
> of them carries nonzero performance, documentation, and maintenance
> overhead.
>

+1.

There are plenty of cases, I think, where the value doesn't really matter,
but when it does, I'm not sure what setting other than 0.9 would actually be
better.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Change default of checkpoint_completion_target

2020-12-08 Thread Tom Lane
"Bossart, Nathan"  writes:
> On 12/7/20, 9:53 AM, "Stephen Frost"  wrote:
>> Concretely, attached is a patch which changes the default and updates
>> the documentation accordingly.

> +1 to setting checkpoint_completion_target to 0.9 by default.

FWIW, I kind of like the idea of getting rid of it completely.
Is there really ever a good reason to set it to something different
than that?  If not, well, we have too many GUCs already, and each
of them carries nonzero performance, documentation, and maintenance
overhead.

regards, tom lane




Re: Change default of checkpoint_completion_target

2020-12-08 Thread Bossart, Nathan
On 12/7/20, 9:53 AM, "Stephen Frost"  wrote:
> Concretely, attached is a patch which changes the default and updates
> the documentation accordingly.

+1 to setting checkpoint_completion_target to 0.9 by default.

Nathan



Re: please update ps display for recovery checkpoint

2020-12-08 Thread Fujii Masao




On 2020/12/05 2:17, Bossart, Nathan wrote:

On 12/3/20, 1:19 PM, "Bossart, Nathan"  wrote:

I like the idea behind this patch.  I wrote a new version with a
couple of changes:


I missed an #include in this patch.  Here's a new one with that fixed.


I agree it might be helpful to display something like "checkpointing" or "waiting 
for checkpoint" in the PS title for the startup process.

But, at least for me, it seems strange to display "idle" only for the checkpointer. 
*If* we want to monitor the current status of the checkpointer, shouldn't it be shown 
as a wait event in pg_stat_activity instead?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: SEARCH and CYCLE clauses

2020-12-08 Thread Vik Fearing
On 6/15/20 11:49 AM, Peter Eisentraut wrote:
> To fix the star expansion I had to add a little bit of infrastructure
> that could also be used as a more general facility "don't include this
> column in star expansion", so this could perhaps use some consideration
> from a more general angle as well.

Could this work be salvaged to add the ability to ALTER a column to hide
it from star expansion?  That's a feature I've often seen requested,
especially from people working with PostGIS's geometry.

Totally off-topic for this thread, though.
-- 
Vik Fearing




Re: SEARCH and CYCLE clauses

2020-12-08 Thread Vik Fearing
On 5/22/20 12:40 PM, Vik Fearing wrote:
>>> 2)
>>> This query is an infinite loop, as expected:
>>>
>>>    with recursive a as (select 1 as b union all select b from a)
>>>    table a;
>>>
>>> But it becomes an error when you add a cycle clause to it:
>>>
>>>    with recursive a as (select 1 as b union all table a)
>>>  cycle b set c to true default false using p
>>>    table a;
>>>
>>>    ERROR:  each UNION query must have the same number of columns
>> table a expands to select * from a, and if you have a cycle clause, then
>> a has three columns, but the other branch of the union only has one, so
>> that won't work anymore, will it?
> It seems there was a copy/paste error here.  The first query should have
> been the same as the second but without the cycle clause.
> 
> It seems strange to me that adding a <cycle clause> would
> break a previously working query.  I would rather see the * expanded
> before adding the new columns.  This is a user's opinion, I don't know
> how hard that would be to implement.


After thinking about it quite a bit more, I have changed my mind on
this.  The transformation does add columns to the <with list element>
and so TABLE or SELECT * should see them.  Especially since they see
them from outside of the wle.
-- 
Vik Fearing




Re: MultiXact\SLRU buffers configuration

2020-12-08 Thread Gilles Darold

Le 13/11/2020 à 12:49, Andrey Borodin a écrit :



On 10 Nov 2020, at 23:07, Tomas Vondra wrote:

On 11/10/20 7:16 AM, Andrey Borodin wrote:


but this picture was not stable.


Seems we haven't made much progress in reproducing the issue :-( I guess
we'll need to know more about the machine where this happens. Is there
anything special about the hardware/config? Are you monitoring size of
the pg_multixact directory?

It's Ubuntu 18.04.4 LTS, Intel Xeon E5-2660 v4, 56 CPU cores with 256Gb of RAM.
PostgreSQL 10.14, compiled by gcc 7.5.0, 64-bit

No, unfortunately we do not have signals for SLRU sizes.
3.5Tb mdadm raid10 over 28 SSD drives, 82% full.

First incident triggering investigation was on 2020-04-19, at that time cluster 
was running on PG 10.11. But I think it was happening before.

I'd say nothing special...


How do you collect wait events for aggregation? just insert into some table 
with cron?


No, I have a simple shell script (attached) sampling data from
pg_stat_activity regularly. Then I load it into a table and aggregate to
get a summary.

Thanks!

Best regards, Andrey Borodin.



Hi,


Some time ago I encountered contention on 
MultiXactOffsetControlLock during a performance benchmark. Here are the 
wait event monitoring results, sampled every 10 seconds over a 30 
minute run of the benchmark:



 event_type |   event    |   sum
++--
 Client | ClientRead | 44722952
 LWLock | MultiXactOffsetControlLock | 30343060
 LWLock | multixact_offset   | 16735250
 LWLock | MultiXactMemberControlLock |  1601470
 LWLock | buffer_content |   991344
 LWLock | multixact_member   |   805624
 Lock   | transactionid  |   204997
 Activity   | LogicalLauncherMain    |   198834
 Activity   | CheckpointerMain   |   198834
 Activity   | AutoVacuumMain |   198469
 Activity   | BgWriterMain   |   184066
 Activity   | WalWriterMain  |   171571
 LWLock | WALWriteLock   |    72428
 IO | DataFileRead   |    35708
 Activity   | BgWriterHibernate  |    12741
 IO | SLRURead   | 9121
 Lock   | relation   | 8858
 LWLock | ProcArrayLock  | 7309
 LWLock | lock_manager   | 6677
 LWLock | pg_stat_statements | 4194
 LWLock | buffer_mapping | 3222


After reading this thread I changed the buffer sizes to 32 
and 64 and obtained the following results:



 event_type |           event            |    sum
------------+----------------------------+-----------
 Client     | ClientRead                 | 268297572
 LWLock     | MultiXactMemberControlLock |  65162906
 LWLock     | multixact_member           |  33397714
 LWLock     | buffer_content             |   4737065
 Lock       | transactionid              |   2143750
 LWLock     | SubtransControlLock        |   1318230
 LWLock     | WALWriteLock               |   1038999
 Activity   | LogicalLauncherMain        |    940598
 Activity   | AutoVacuumMain             |    938566
 Activity   | CheckpointerMain           |    799052
 Activity   | WalWriterMain              |    749069
 LWLock     | subtrans                   |    710163
 Activity   | BgWriterHibernate          |    536763
 Lock       | object                     |    514225
 Activity   | BgWriterMain               |    394206
 LWLock     | lock_manager               |    295616
 IO         | DataFileRead               |    274236
 LWLock     | ProcArrayLock              |     77099
 Lock       | tuple                      |     59043
 IO         | CopyFileWrite              |     45611
 Lock       | relation                   |     42714

There was still contention on multixact, but less than in the first run. I
increased the buffers to 128 and 512 and obtained the best results
for this bench:


 event_type |           event            |    sum
------------+----------------------------+-----------
 Client     | ClientRead                 | 160463037
 LWLock     | MultiXactMemberControlLock |   5334188
 LWLock     | buffer_content             |   5228256
 LWLock     | buffer_mapping             |   2368505
 LWLock     | SubtransControlLock        |   2289977
 IPC        | ProcArrayGroupUpdate       |   1560875
 LWLock     | ProcArrayLock              |   1437750
 Lock       | transactionid              |    825561
 LWLock     | subtrans                   |    772701
 LWLock     | WALWriteLock               |    666138
 Activity   | LogicalLauncherMain        |    492585
 Activity   | CheckpointerMain           |    492458
 Activity   | AutoVacuumMain             |    491548
 LWLock     | lock_manager               |    426531
 Lock       | object                     |    403581
 Activity   | WalWriterMain              |    394668
 

Re: [HACKERS] [PATCH] Generic type subscripting

2020-12-08 Thread Tom Lane
Andres Freund  writes:
> I've attached a prototype conversion for two other such places. Which
> immediately pointed to a bug. And one harmless issue (using a pointer to
> size_t instead of ExprEvalOp* to represent the 'op' parameter), which
> you promptly copied...
> If I pushed a slightly cleaned up version of that, it should be fairly
> easy to adapt your code to it, I think?

I've now studied this patch and it seems sane to me, although
I wondered why you wrote "extern"s here:

@@ -48,6 +48,10 @@
 PGFunction TypePGFunction;
 size_t TypeSizeT;
 bool   TypeStorageBool;
+extern ExprStateEvalFunc TypeExprStateEvalFunc;
+ExprStateEvalFunc TypeExprStateEvalFunc;
+extern ExecEvalSubroutine TypeExecEvalSubroutine;
+ExecEvalSubroutine TypeExecEvalSubroutine;
 
 NullableDatum StructNullableDatum;
 AggState   StructAggState;

The other variables in that file don't have that.  Other than that nit,
please finish this up and push it so I can finish the generic-subscripting
patch.

> WRT the prototype, I think it may be worth removing most of the types
> from llvmjit.h. Worth keeping the most common ones, but most aren't used
> all the time so terseness doesn't matter that much, and
> the llvm_pg_var_type() would suffice.

Hm, that would mean redoing llvm_pg_var_type() often wouldn't it?
I don't have a very good feeling for how expensive that is, so I'm
not sure if this seems like a good idea or not.

regards, tom lane




Re: Parallel Inserts in CREATE TABLE AS

2020-12-08 Thread Amit Kapila
On Tue, Dec 8, 2020 at 6:36 PM Bharath Rupireddy
 wrote:
>
> On Tue, Dec 8, 2020 at 6:24 PM Hou, Zhijie  wrote:
> >
> > > > I'm not quite sure how to address this. Can we not allow the planner
> > > > to consider that the select is for CTAS and check only after the
> > > > planning is done for the Gather node and other checks?
> > > >
> > >
> > > IIUC, you are saying that we should not influence the cost of gather node
> > > even when the insertion would be done by workers? I think that should be
> > > our fallback option anyway but that might miss some paths to be considered
> > > parallel where the cost becomes more due to parallel_tuple_cost (aka tuple
> > > transfer cost). I think the idea is we can avoid the tuple transfer cost
> > > only when Gather is the top node because only at that time we can push
> > > insertion down, right? How about if we have some way to detect the same
> > > before calling generate_useful_gather_paths()? I think when we are calling
> > > apply_scanjoin_target_to_paths() in grouping_planner(), if the
> > > query_level is 1, it is for CTAS, and it doesn't have a chance to create
> > > UPPER_REL (doesn't have grouping, order, limit, etc clause) then we can
> > > probably assume that the Gather will be top_node. I am not sure about this
> > > but I think it is worth exploring.
> > >
> >
> > I took a look at the parallel insert patch and have the same idea.
> > https://commitfest.postgresql.org/31/2844/
> >
> >  * Consider generating Gather or Gather Merge paths.  We must only 
> > do this
> >  * if the relation is parallel safe, and we don't do it for child 
> > rels to
> >  * avoid creating multiple Gather nodes within the same plan. We 
> > must do
> >  * this after all paths have been generated and before 
> > set_cheapest, since
> >  * one of the generated paths may turn out to be the cheapest one.
> >  */
> > if (rel->consider_parallel && !IS_OTHER_REL(rel))
> > generate_useful_gather_paths(root, rel, false);
> >
> > IMO the GatherPath created here seems to be the right one, which can possibly
> > ignore the parallel cost if in CTAS.
> > But we need to check the following parse options, which will create a path to
> > be the parent of the GatherPath here.
> >
> > if (root->parse->rowMarks)
> > if (limit_needed(root->parse))
> > if (root->parse->sortClause)
> > if (root->parse->distinctClause)
> > if (root->parse->hasWindowFuncs)
> > if (root->parse->groupClause || root->parse->groupingSets || 
> > root->parse->hasAggs || root->root->hasHavingQual)
> >
>
> Thanks Amit and Hou. I will look into these areas and get back soon.
>

It might be better to split the patch for this such that in the base
patch, we won't consider anything special for gather costing w.r.t
CTAS and in the next patch, we consider all the checks discussed
above.

-- 
With Regards,
Amit Kapila.




Re: autovac issue with large number of tables

2020-12-08 Thread Fujii Masao




On 2020/12/04 12:21, Kasahara Tatsuhito wrote:

Hi,

On Thu, Dec 3, 2020 at 9:09 PM Fujii Masao  wrote:




On 2020/12/03 11:46, Kasahara Tatsuhito wrote:

On Wed, Dec 2, 2020 at 7:11 PM Masahiko Sawada  wrote:


On Wed, Dec 2, 2020 at 3:33 PM Fujii Masao  wrote:




On 2020/12/02 12:53, Masahiko Sawada wrote:

On Tue, Dec 1, 2020 at 5:31 PM Masahiko Sawada  wrote:


On Tue, Dec 1, 2020 at 4:32 PM Fujii Masao  wrote:




On 2020/12/01 16:23, Masahiko Sawada wrote:

On Tue, Dec 1, 2020 at 1:48 PM Kasahara Tatsuhito
 wrote:


Hi,

On Mon, Nov 30, 2020 at 8:59 PM Fujii Masao  wrote:




On 2020/11/30 10:43, Masahiko Sawada wrote:

On Sun, Nov 29, 2020 at 10:34 PM Kasahara Tatsuhito
 wrote:


Hi, Thanks for you comments.

On Fri, Nov 27, 2020 at 9:51 PM Fujii Masao  wrote:




On 2020/11/27 18:38, Kasahara Tatsuhito wrote:

Hi,

On Fri, Nov 27, 2020 at 1:43 AM Fujii Masao  wrote:




On 2020/11/26 10:41, Kasahara Tatsuhito wrote:

On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada  wrote:


On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
 wrote:


Hi,

On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada  wrote:


On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
 wrote:


Hi,

On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
 wrote:

I wonder if we could have table_recheck_autovac do two probes of the stats
data.  First probe the existing stats data, and if it shows the table to
be already vacuumed, return immediately.  If not, *then* force a stats
re-read, and check a second time.

Does the above mean that the second and subsequent table_recheck_autovac()
will be improved to first check using the previous refreshed statistics?
I think that certainly works.

If that's correct, I'll try to create a patch for the PoC


I still don't know how to reproduce Jim's troubles, but I was able to reproduce
what was probably a very similar problem.

This problem seems to be more likely to occur in cases where you have
a large number of tables,
i.e., a large amount of stats, and many small tables need VACUUM at
the same time.

So I followed Tom's advice and created a patch for the PoC.
This patch will enable a flag in the table_recheck_autovac function to use
the existing stats next time if VACUUM (or ANALYZE) has already been done
by another worker on the check after the stats have been updated.
If the tables continue to require VACUUM after the refresh, then a refresh
will be required instead of using the existing statistics.

I did simple test with HEAD and HEAD + this PoC patch.
The tests were conducted in two cases.
(I changed few configurations. see attached scripts)

1. Normal VACUUM case
 - SET autovacuum = off
 - CREATE tables with 100 rows
 - DELETE 90 rows for each tables
 - SET autovacuum = on and restart PostgreSQL
 - Measure the time it takes for all tables to be VACUUMed

2. Anti-wraparound VACUUM case
 - CREATE blank tables
 - SELECT from all of these tables (to generate stats)
 - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
 - Consume a lot of XIDs by using txid_current()
 - Measure the time it takes for all tables to be VACUUMed

For each test case, the following results were obtained by changing the
autovacuum_max_workers parameter to 1, 2, 3 (default), 5 and 10.
The number of tables was also varied: 1000, 5000, 10000 and 20000.

Due to the poor VM environment (2 VCPU/4 GB), the results are a little unstable,
but I think they are enough to show a trend.
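
For reference, a reduced version of test case 1 can be scripted roughly as
follows; this is only a sketch with illustrative names and sizes, not the
attached scripts:

```
-- Disable autovacuum while building the workload (test case 1 above).
ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();

-- Create many small tables with 100 rows each and delete 90 rows from each.
DO $$
BEGIN
    FOR i IN 1..1000 LOOP
        EXECUTE format('CREATE TABLE t_%s (id int)', i);
        EXECUTE format('INSERT INTO t_%s SELECT g FROM generate_series(1, 100) g', i);
        EXECUTE format('DELETE FROM t_%s WHERE id <= 90', i);
    END LOOP;
END $$;

-- Re-enable autovacuum (the original test restarts the server instead) and
-- watch how long it takes until every table has been vacuumed.
ALTER SYSTEM SET autovacuum = on;
SELECT pg_reload_conf();

SELECT count(*) AS not_yet_vacuumed
FROM pg_stat_user_tables
WHERE last_autovacuum IS NULL;
```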

===
[1.Normal VACUUM case]
tables:1000
 autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
 autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
 autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
 autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
 autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec

tables:5000
 autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
 autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
 autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
 autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
 autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec

tables:10000
 autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
 autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
 autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
 autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
 autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec

tables:20000
 autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
 autovacuum_max_workers 2:   (HEAD) 231 sec VS (with patch)  229 sec
 

Re: get_constraint_index() and conindid

2020-12-08 Thread Matthias van de Meent
On Mon, 7 Dec 2020 at 11:09, Peter Eisentraut
 wrote:
>
> get_constraint_index() does its work by going through pg_depend.  It was
> added before pg_constraint.conindid was added, and some callers are
> still not changed.  Are there reasons for that?  Probably not.  The
> attached patch changes get_constraint_index() to an lsyscache-style
> lookup instead.

This looks quite reasonable, and it passes "make installcheck-world".

The only thing I could think of is that it could maybe use a (small)
comment in the commit message on why get_constraint_index is moved to
utils/lsyscache from catalog/dependency, as that took me some time to
understand.
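
For anyone reading along, the catalog-level equivalent of the new lookup is
simply pg_constraint.conindid; for example (the constraint name is hypothetical):

```
-- The index implementing a constraint is available directly from
-- pg_constraint.conindid, with no need to walk pg_depend.
SELECT conname,
       conindid::regclass AS backing_index
FROM   pg_constraint
WHERE  conname = 'my_table_pkey';
```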




Re: libpq compression

2020-12-08 Thread Daniil Zakhlystov
Hi, Robert!

First of all, thanks for your detailed reply.

> On Dec 3, 2020, at 2:23 AM, Robert Haas  wrote:
> 
> Like, in the protocol that you proposed previously, you've got a
> four-phase handshake to set up compression. The startup packet carries
> initial information from the client, and the server then sends
> CompressionAck, and then the client sends SetCompressionMethod, and
> then the server sends SetCompressionMethod. This system is fairly
> complex, and it requires some form of interlocking.

I proposed a slightly different handshake (three-phase):

1. At first, the client sends _pq_.compression parameter in startup packet
2. Server replies with CompressionAck and following it with 
SetCompressionMethod message.
These two might be combined but I left them like this for symmetry reasons. In 
most cases they 
will arrive as one piece without any additional delay.
3. Client replies with SetCompressionMethod message.

A handshake like the above allows forbidding uncompressed client-to-server
and/or server-to-client communication.

For example, if the client did not explicitly specify ‘uncompressed’ in the 
supported decompression methods list, and 
the server does not support any of the other compression algorithms sent by the 
client, the server will send back 
SetCompressionMethod with ‘-1’ index. After receiving this message, the client 
will terminate the connection.

> On Dec 3, 2020, at 2:23 AM, Robert Haas  wrote:
> 
> And again, if you allow the compression method to be switched at any
> time, you just have to know how what to do when you get a
> SetCompressionMethod. If you only allow it to be changed once, to set
> it initially, then you have to ADD code to reject that message the
> next time it's sent. If that ends up avoiding a significant amount of
> complexity somewhere else then I don't have a big problem with it, but
> if it doesn't, it's simpler to allow it whenever than to restrict it
> to only once.

Yes, there is actually some amount of complexity involved in implementing
switchable on-the-fly compression. Currently, compression itself operates on a
different level, independently of the libpq protocol. To allow compression to be
switched on the fly, we need to solve these tasks:

1. When a new portion of bytes comes to the decompressor from the socket.read()
call, the first part of these bytes may be a compressed fragment while the other
part is uncompressed, or worse, a single portion of new bytes may contain the end
of a ZLIB-compressed message and the beginning of a ZSTD-compressed message.
The problem is that we don't know the exact end of the ZLIB-compressed message
before decompressing the entire chunk of new bytes and reading the
SetCompressionMethod message. Moreover, streaming compression by itself may
involve some internal buffering, which also complicates this problem.

2. When sending a new portion of bytes, it may not be sufficient to keep track of
only the current compression method. There may be multiple SetCompressionMethod
messages in PqSendBuffer (backend) or conn->outBuffer (frontend). It means that
it is not enough to simply track the current compression method; we must also
keep track of all compression method switches in PqSendBuffer or conn->outBuffer.
Also, as with decompression, the internal buffering of streaming compression
makes the situation more complex in this case too.

Even though the above two problems might be solvable, I doubt we should oblige
not only libpq but all other third-party Postgres protocol libraries to solve
them, since the exact areas of application for switchable compression are not
clear yet.
I agree with Konstantin's point of view on this one:

> And more important question - if we really want to switch algorithms on 
> the fly: who and how will do it?
> Do we want user to explicitly control it (something like "\compression 
> on"  psql command)?
> Or there should be some API for application?
> How it can be supported for example by JDBC driver?
> I do not have answers for this questions...

However, as previously mentioned in the thread, it might be useful in the future,
and we should design a protocol that supports it so we won't have any problems
with backward compatibility. So, basically, this was the only reason to
introduce the two separate compression modes - switchable and permanent.

In the latest patch, Konstantin introduced the extension part, so in future
versions we can introduce the switchable compression handling in this extension
part. For now, let permanent compression be the default mode.

> On Dec 3, 2020, at 2:23 AM, Robert Haas  wrote:
> 
> I think that there's no such thing as being able to decompress some
> compression levels with the same algorithm but not others. The level
> only controls behavior on the compression side. So, the client can
> only send zlib data if the server can 

Re: Parallel INSERT (INTO ... SELECT ...)

2020-12-08 Thread vignesh C
On Mon, Dec 7, 2020 at 2:35 PM Greg Nancarrow  wrote:
>
> On Fri, Nov 20, 2020 at 7:44 PM Greg Nancarrow  wrote:
> >
> > Posting an updated set of patches, with some additional testing and
> > documentation updates, and including the latest version of the
> > Parallel Insert patch.
> > Any feedback appreciated, especially on the two points mentioned in
> > the previous post.
> >
>
> Posting an updated set of patches, since a minor bug was found in the
> 1st patch that was causing a postgresql-cfbot build failure.
>

Most of the code present in
v9-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT.patch is
applicable to the parallel copy patch also. The patch in this thread
handles the check for PROPARALLEL_UNSAFE; we could make it slightly more
generic by handling it as in the comments below, so that these parallel
safety checks can be used based on the value set in
max_parallel_hazard_context. There is nothing wrong with the changes;
I'm providing these comments so that this patch can be generalized for
parallel checks and the same can also be used by parallel copy.
Few comments:
1)
+   trigtype = RI_FKey_trigger_type(trigger->tgfoid);
+   if (trigtype == RI_TRIGGER_FK)
+   {
+   context->max_hazard = PROPARALLEL_RESTRICTED;
+
+   /*
+* As we're looking for the max parallel
hazard, we don't break
+* here; examine any further triggers ...
+*/
+   }

Can we change this to something like:

trigtype = RI_FKey_trigger_type(trigger->tgfoid);
if (trigtype == RI_TRIGGER_FK)
{
    if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
        break;
}

The below line is not required, as it will be taken care of by
max_parallel_hazard_test:
context->max_hazard = PROPARALLEL_RESTRICTED;

2)
+   /* Recursively check each partition ... */
+   pdesc = RelationGetPartitionDesc(rel);
+   for (i = 0; i < pdesc->nparts; i++)
+   {
+   if (rel_max_parallel_hazard_for_modify(pdesc->oids[i],
+
command_type,
+
context,
+
AccessShareLock) == PROPARALLEL_UNSAFE)
+   {
+   table_close(rel, lockmode);
+   return context->max_hazard;
+   }
+   }


Can we change this to something like:

/* Recursively check each partition ... */
pdesc = RelationGetPartitionDesc(rel);
for (i = 0; i < pdesc->nparts; i++)
{
    char max_hazard = rel_max_parallel_hazard_for_modify(pdesc->oids[i],
                                                         command_type,
                                                         context,
                                                         AccessShareLock);

    if (max_parallel_hazard_test(max_hazard, context))
    {
        table_close(rel, lockmode);
        return context->max_hazard;
    }
}

3)
Similarly for the below:
+   /*
+* If there are any index expressions, check that they are parallel-mode
+* safe.
+*/
+   if (index_expr_max_parallel_hazard_for_modify(rel, context) ==
PROPARALLEL_UNSAFE)
+   {
+   table_close(rel, lockmode);
+   return context->max_hazard;
+   }
+
+   /*
+* If any triggers exist, check that they are parallel safe.
+*/
+   if (rel->trigdesc != NULL &&
+   trigger_max_parallel_hazard_for_modify(rel->trigdesc,
context) == PROPARALLEL_UNSAFE)
+   {
+   table_close(rel, lockmode);
+   return context->max_hazard;
+   }


4) Similar change required for the below:
+   /*
+* If the column is of a DOMAIN type,
determine whether that
+* domain has any CHECK expressions that are
not parallel-mode
+* safe.
+*/
+   if (get_typtype(att->atttypid) == TYPTYPE_DOMAIN)
+   {
+   if
(domain_max_parallel_hazard_for_modify(att->atttypid, context) ==
PROPARALLEL_UNSAFE)
+   {
+   table_close(rel, lockmode);
+   return context->max_hazard;
+   }
+   }

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Implement for window functions

2020-12-08 Thread Vik Fearing
On 11/21/20 10:07 AM, Krasiyan Andreev wrote:
> Fixed patch attached, after new introduced conflicts.
> Vik, can you add it to the next commitfest, to be able to test it.


I have done this now.  Thanks!
-- 
Vik Fearing




Re: Performance regressions

2020-12-08 Thread Vik Fearing
On 11/9/20 9:29 AM, Vik Fearing wrote:
> Hello,
> 
> I've been doing some benchmarking on recent version of PostgreSQL and
> I'm seeing some regressions.  The benchmark setup is as described in [1]
> except it looks like I got lucky in the runs used for that article.
> 
> After many more runs, I get these NOPM averages (hopefully formatting
> will survive):
> 
>   Users:            50         100        250        500
> 12.4            485,914     707,535    739,808    589,856
> 13.0            485,501     697,505    837,446    385,225
> 14(2020-10-13)  521,902     759,609    941,212    611,647
> 14(2020-11-02)  478,640     684,138    930,959    513,707


I must have done something wrong because I cannot reproduce this
anymore.  Please disregard.
-- 
Vik Fearing




Re: Feature Request: Report additionally error value

2020-12-08 Thread Eugen Konkov
I formatted the query to make it easier to read. See attachment.

Tuesday, December 8, 2020, 3:43:13 PM, you wrote:


Hello Tom,

>>> Also would be useful if PG point at query where this bad value was
>>> calculated or occur.

>> This is not the first time we've seen this request and it usually ends up
>> getting stalled because its non-trivial to implement and thus isn't
>> feasible for the benefit it brings.

> I do still have ambitions to make that happen, but you're right that it's
> a major undertaking.  Don't hold your breath.

Another case is posted below.

Here the database contains values which cannot construct a correct daterange.
There are thousands of rows, and it is hard to even find a way, or make an
assumption about how, to get the record which causes this error.

It would be very impressive if the database could report the location in the
data where the error occurred, via a HINT for example.
If this is not possible, then the current cursor position or a raw data dump
in debug mode, probably some memory buffers or so.

Any additional info would be helpful.

Thank you.
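
In the meantime, tracking down the offending rows without server-side help
usually means re-checking the bound columns by hand, along the lines of this
sketch (table and column names are hypothetical; in my case the range is built
inside the accounting_ready function):

```
-- The error itself comes from a range constructor with reversed bounds:
SELECT daterange(date '2020-12-08', date '2020-12-01');  -- fails

-- So offending rows can be hunted down by comparing the bound columns:
SELECT *
FROM   some_table
WHERE  period_start > period_end;
```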



Re: Feature Request: Report additionally error value

2020-12-08 Thread Eugen Konkov
Hello Tom,

>>> Also would be useful if PG point at query where this bad value was
>>> calculated or occur.

>> This is not the first time we've seen this request and it usually ends up
>> getting stalled because its non-trivial to implement and thus isn't
>> feasible for the benefit it brings.

> I do still have ambitions to make that happen, but you're right that it's
> a major undertaking.  Don't hold your breath.

Another case is posted below.

Here the database contains values which cannot construct a correct daterange.
There are thousands of rows, and it is hard to even find a way, or make an
assumption about how, to get the record which causes this error.

It would be very impressive if the database could report the location in the
data where the error occurred, via a HINT for example.
If this is not possible, then the current cursor position or a raw data dump
in debug mode, probably some memory buffers or so.

Any additional info would be helpful.

Thank you.


DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute 
failed: ERROR:  range lower bound must be less than or equal to range upper 
bound
CONTEXT:  SQL function "accounting_ready" statement 1 [for Statement "SELECT 
COUNT( * ) FROM (( WITH
target_date AS ( SELECT ?::timestamptz ),
target_order as (
  SELECT
usage_range as bill_range,
o.*
  FROM ( SELECT * FROM "order_bt" WHERE sys_period @> sys_time() ) o
  LEFT JOIN period prd on prd.id = o.period_id
  LEFT JOIN accounting_ready(
CASE WHEN prd.INTERVAL = '1 mon' THEN date_trunc( 'month', o.docdate ) ELSE 
o.docdate END,
prd.INTERVAL, (SELECT * FROM target_date)
  ) acc ON true
  WHERE FALSE
OR ? = 0
OR ? = 1 AND o.id = ?
  AND acc.usage AND EXISTS (
SELECT * FROM order_bt prev_order
WHERE sys_period @> sys_time() AND
  prev_order.id = o.id  AND prev_order.app_period && acc.usage_range
  )
  AND o.app_period && acc.usage_range
OR ? = 2 AND o.agreement_id = ? and o.period_id = ?
  AND acc.usage AND EXISTS (
SELECT * FROM order_bt prev_order
WHERE sys_period @> sys_time() AND
  prev_order.id = o.id  AND prev_order.app_period && acc.usage_range
  )
  AND o.app_period && acc.usage_range
OR ? = 3
),
USAGE AS ( SELECT
  (ocd.o).id  as order_id,
  (dense_rank() over (PARTITION BY o.agreement_id, o.id ORDER BY 
(ocd.ic).consumed_period )) as group_id,
  (ocd.c).id  as detail_id,
  (ocd.c).service_type_id as service_type,
  (ocd.c).resource_type_idas detail_type,
  (ocd.c).amount  as detail_amount,
  (ocd.c).allocated_resource_id   as resource_id,
  null as resource_uuid,
  null as resource_desc,
  rt.unit as resource_unit,
  -- count changes. Logic is next: How many times configration for this order 
is met at this period
  count(*) OVER (PARTITION BY (ocd.o).id, (ocd.c).id )  as consumed_count,
  (ocd.ic).consumed   as consumed_days,
  null as consumed_amount,
  null as consumed_last,

  a.id as agreement_id,
  coalesce( substring( a.docn from '^A?(.*?)(?:\s*-[^-]*)?$' ), a.docn ) as 
agreement,
  a.docdate   as docdate,

  pkg.id as package_id,
  pkg.link_1c_id as package_1c_id,
  coalesce( pkg.display, pkg.name ) as package,

  coalesce( st.display, st.name, rt.display, rt.name )  as sr_name,
  COALESCE( (ocd.p).label, rt.LABEL )   as unit_label,

  coalesce( (ocd.c).sort_order, pd.sort_order ) as sort_order,

  ocd.item_price AS price,
  -- We want to display QTY for resources too
  coalesce( ocd.item_qty, (ocd.c).amount/rt.unit )  AS qty,
  0 AS month_length,
  0 AS days_count,
  o.bill_range,
  lower( (ocd.ic).consumed_period )  as consumed_from,
  upper( (ocd.ic).consumed_period ) -interval '1sec' as consumed_till,


  ocd.item_suma,

  0 as discount,
  (sum( ocd.item_suma ) OVER( PARTITION BY (ocd.o).id, (ocd.ic).consumed_period 
))::numeric( 10, 2) AS group_suma,
  (sum( ocd.item_cost ) OVER( PARTITION BY (ocd.o).id, (ocd.ic).consumed_period 
))::numeric( 10, 2) AS order_cost
FROM target_order o
LEFT JOIN order_cost_details( o.bill_range ) ocd
  ON (ocd.o).id = o.id  AND  (ocd.ic).consumed_period && o.app_period
  LEFT JOIN agreement a   ON a.id   = o.agreement_id
  LEFT JOIN package   pkg ON pkg.id = o.package_id
  LEFT JOIN package_detail pd  ON pd.package_id = (ocd.o).package_id
AND pd.resource_type_id IS NOT DISTINCT FROM (ocd.c).resource_type_id
AND pd.service_type_id  IS NOT DISTINCT FROM (ocd.c).service_type_id
  LEFT JOIN resource_type rt  ON rt.id  = (ocd.c).resource_type_id
  LEFT JOIN service_type  st  on 

Re: PoC Refactor AM analyse API

2020-12-08 Thread Denis Smirnov
Andrey, thanks for your feedback!

I agree that AMs with fixed-size blocks can have much-alike code in
acquire_sample_rows() (though it is not a rule). But there are several points
about the current master sampling.

* It is not perfect - AM developers may want to improve it with other sampling
algorithms.
* It is designed with a big influence from the heap AM - for example,
RelationGetNumberOfBlocks() returns uint32 while other AMs can have a bigger
number of blocks.
* heapam_acquire_sample_rows() is a small function - I don't think it is a big
trouble for any AM developer to write something alike.
* Some AMs may have single-level sampling (only algorithm Z from Vitter, for
example) - why not?

As a result we get a single and clear method to acquire rows for statistics. If
we don't modify but rather extend the current API (for example, in the manner it
is done for FDW), the code becomes more complicated and difficult to understand.

> On 8 Dec 2020, at 18:42, Andrey Borodin wrote:
> 
> Hi Denis!
> 
>> On 7 Dec 2020, at 18:23, Denis Smirnov wrote:
>> 
>> I suggest a refactoring of analyze AM API as it is too much heap specific at 
>> the moment. The problem was inspired by Greenplum’s analyze improvement for 
>> append-optimized row and column AM with variable size compressed blocks.
>> Currently we do analyze in two steps.
>> 
>> 1. Sample fix size blocks with algorithm S from Knuth (BlockSampler function)
>> 2. Collect tuples into reservoir with algorithm Z from Vitter.
>> 
>> So this doesn’t work for AMs using variable sized physical blocks for 
>> example. They need weight random sampling (WRS) algorithms like A-Chao or 
>> logical blocks to follow S-Knuth (and have a problem with 
>> RelationGetNumberOfBlocks() estimating a physical number of blocks). Another 
>> problem with columns - they are not passed to analyze begin scan and can’t 
>> benefit from column storage at ANALYZE TABLE (COL).
>> 
>> The suggestion is to replace table_scan_analyze_next_block() and 
>> table_scan_analyze_next_tuple() with a single function: 
>> table_acquire_sample_rows(). The AM implementation of 
>> table_acquire_sample_rows() can use the BlockSampler functions if it wants 
>> to, but if the AM is not block-oriented, it could do something else. This 
>> suggestion also passes VacAttrStats to table_acquire_sample_rows() for 
>> column-oriented AMs and removes PROGRESS_ANALYZE_BLOCKS_TOTAL and 
>> PROGRESS_ANALYZE_BLOCKS_DONE definitions as not all AMs can be 
>> block-oriented.
> 
> Just a few random notes about the idea.
> Heap pages are not of a fixed size when measured in tuple count, and a comment
> in the code describes it.
> * Although every row has an equal chance of ending up in the final
> * sample, this sampling method is not perfect: not every possible
> * sample has an equal chance of being selected.  For large relations
> * the number of different blocks represented by the sample tends to be
> * too small.  We can live with that for now.  Improvements are welcome.
> 
> The current implementation provides a framework with shared code, though this
> framework is only suitable for block-of-tuples AMs, and it has statistical
> downsides when the count of tuples varies too much.
> Maybe we can just provide a richer API? To have both: avoid copying code and
> provide flexibility.
> 
> Best regards, Andrey Borodin.
> 





Re: Parallel Inserts in CREATE TABLE AS

2020-12-08 Thread Bharath Rupireddy
On Tue, Dec 8, 2020 at 6:24 PM Hou, Zhijie  wrote:
>
> > > I'm not quite sure how to address this. Can we not allow the planner
> > > to consider that the select is for CTAS and check only after the
> > > planning is done for the Gather node and other checks?
> > >
> >
> > IIUC, you are saying that we should not influence the cost of gather node
> > even when the insertion would be done by workers? I think that should be
> > our fallback option anyway but that might miss some paths to be considered
> > parallel where the cost becomes more due to parallel_tuple_cost (aka tuple
> > transfer cost). I think the idea is we can avoid the tuple transfer cost
> > only when Gather is the top node because only at that time we can push
> > insertion down, right? How about if we have some way to detect the same
> > before calling generate_useful_gather_paths()? I think when we are calling
> > apply_scanjoin_target_to_paths() in grouping_planner(), if the
> > query_level is 1, it is for CTAS, and it doesn't have a chance to create
> > UPPER_REL (doesn't have grouping, order, limit, etc clause) then we can
> > probably assume that the Gather will be top_node. I am not sure about this
> > but I think it is worth exploring.
> >
>
> I took a look at the parallel insert patch and have the same idea.
> https://commitfest.postgresql.org/31/2844/
>
>  * Consider generating Gather or Gather Merge paths.  We must only do 
> this
>  * if the relation is parallel safe, and we don't do it for child 
> rels to
>  * avoid creating multiple Gather nodes within the same plan. We must 
> do
>  * this after all paths have been generated and before set_cheapest, 
> since
>  * one of the generated paths may turn out to be the cheapest one.
>  */
> if (rel->consider_parallel && !IS_OTHER_REL(rel))
> generate_useful_gather_paths(root, rel, false);
>
> IMO the GatherPath created here seems to be the right one, which can possibly
> ignore the parallel cost if in CTAS.
> But we need to check the following parse options, which will create a path to
> be the parent of the GatherPath here.
>
> if (root->parse->rowMarks)
> if (limit_needed(root->parse))
> if (root->parse->sortClause)
> if (root->parse->distinctClause)
> if (root->parse->hasWindowFuncs)
> if (root->parse->groupClause || root->parse->groupingSets || 
> root->parse->hasAggs || root->root->hasHavingQual)
>

Thanks Amit and Hou. I will look into these areas and get back soon.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




New Table Access Methods for Multi and Single Inserts

2020-12-08 Thread Bharath Rupireddy
Hi,

Currently, for any component (such as COPY, CTAS [1], CREATE/REFRESH
Mat View [1], INSERT INTO SELECTs [2]), multi-insert logic such as buffer
slot allocation, maintenance, the decision to flush, and clean-up needs to
be implemented outside the table_multi_insert() API. The main problem
is that this fails to take into consideration the underlying storage
engine's capabilities; for more details on this point, refer to the
discussion in the multi inserts in CTAS thread [1]. It also creates a lot
of duplicate code, which is more error prone and not maintainable.

More importantly, in another thread [3] Andres Freund suggested having
table insert APIs structured in such a way that they look more like the
'scan' APIs, i.e. insert_begin, insert, insert_end. The main advantage of
doing this is (quoting from his statement in [3]): "more importantly it'd
allow an AM to optimize operations across multiple inserts, which is
important for column stores."

I propose to introduce new table access methods for both multi and
single inserts based on the prototype suggested by Andres in [3]. Main
design goal of these new APIs is to give flexibility to tableam
developers in implementing multi insert logic dependent on the
underlying storage engine.

Below are the APIs. I suggest to have a look at
v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch for details
of the new data structure and the API functionality. Note that
temporarily I used XX_v2, we can change it later.

TableInsertState* table_insert_begin(initial_args);
void table_insert_v2(TableInsertState *state, TupleTableSlot *slot);
void table_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot);
void table_multi_insert_flush(TableInsertState *state);
void table_insert_end(TableInsertState *state);

I'm attaching a few patches (just to show that these APIs work, avoid
a lot of duplicate code and make life easier). Better commenting can
be added later. If these APIs and patches look okay, we can even
consider using them in other places such as nodeModifyTable.c and
so on.

v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch --->
introduces new table access methods for multi and single inserts. Also
implements/rearranges the outside code for heap am into these new
APIs.
v1-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-Table-AM.patch
---> adds new multi insert table access methods to CREATE TABLE AS,
CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW.
v1-0003-ATRewriteTable-With-New-Single-Insert-Table-AM.patch ---> adds
new single insert table access method to ALTER TABLE rewrite table
code.
v1-0004-COPY-With-New-Multi-and-Single-Insert-Table-AM.patch ---> adds
new single and multi insert table access method to COPY code.

Thoughts?

Many thanks to Robert, Vignesh and Dilip for offlist discussion.

[1] - 
https://www.postgresql.org/message-id/4eee0730-f6ec-e72d-3477-561643f4b327%40swarm64.com
[2] - 
https://www.postgresql.org/message-id/20201124020020.GK24052%40telsasoft.com
[3] - 
https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch
Description: Binary data


v1-0003-ATRewriteTable-With-New-Single-Insert-Table-AM.patch
Description: Binary data


v1-0004-COPY-With-New-Multi-and-Single-Insert-Table-AM.patch
Description: Binary data


v1-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-Table-AM.patch
Description: Binary data


RE: Parallel Inserts in CREATE TABLE AS

2020-12-08 Thread Hou, Zhijie
> > I'm not quite sure how to address this. Can we not allow the planner
> > to consider that the select is for CTAS and check only after the
> > planning is done for the Gather node and other checks?
> >
> 
> IIUC, you are saying that we should not influence the cost of gather node
> even when the insertion would be done by workers? I think that should be
> our fallback option anyway but that might miss some paths to be considered
> parallel where the cost becomes more due to parallel_tuple_cost (aka tuple
> transfer cost). I think the idea is we can avoid the tuple transfer cost
> only when Gather is the top node because only at that time we can push
> insertion down, right? How about if we have some way to detect the same
> before calling generate_useful_gather_paths()? I think when we are calling
> apply_scanjoin_target_to_paths() in grouping_planner(), if the
> query_level is 1, it is for CTAS, and it doesn't have a chance to create
> UPPER_REL (doesn't have grouping, order, limit, etc clause) then we can
> probably assume that the Gather will be top_node. I am not sure about this
> but I think it is worth exploring.
> 

I took a look at the parallel insert patch and have the same idea.
https://commitfest.postgresql.org/31/2844/

 * Consider generating Gather or Gather Merge paths.  We must only do 
this
 * if the relation is parallel safe, and we don't do it for child rels 
to
 * avoid creating multiple Gather nodes within the same plan. We must do
 * this after all paths have been generated and before set_cheapest, 
since
 * one of the generated paths may turn out to be the cheapest one.
 */
if (rel->consider_parallel && !IS_OTHER_REL(rel))
generate_useful_gather_paths(root, rel, false);

IMO the GatherPath created here seems to be the right one, which can possibly
ignore the parallel cost if in CTAS.
But we need to check the following parse options, which will create a path to
be the parent of the GatherPath here.

if (root->parse->rowMarks)
if (limit_needed(root->parse))
if (root->parse->sortClause)
if (root->parse->distinctClause)
if (root->parse->hasWindowFuncs)
if (root->parse->groupClause || root->parse->groupingSets || 
root->parse->hasAggs || root->root->hasHavingQual)

Best regards,
houzj




Re: Blocking I/O, async I/O and io_uring

2020-12-08 Thread Fujii Masao




On 2020/12/08 11:55, Craig Ringer wrote:

Hi all

A new kernel API called io_uring has recently come to my attention. I assume 
some of you (Andres?) have been following it for a while.

io_uring appears to offer a way to make system calls including reads, writes, 
fsync()s, and more in a non-blocking, batched and pipelined manner, with or 
without O_DIRECT. Basically async I/O with usable buffered I/O and fsync 
support. It has ordering support which is really important for us.

This should be on our radar. The main barriers to benefiting from linux-aio 
based async I/O in postgres in the past has been its reliance on direct I/O, 
the various kernel-version quirks, platform portability, and its 
maybe-async-except-when-it's-randomly-not nature.

The kernel version and portability remain an issue with io_uring so it's not 
like this is something we can pivot over to completely. But we should probably 
take a closer look at it.

PostgreSQL spends a huge amount of time waiting, doing nothing, for blocking 
I/O. If we can improve that then we could potentially realize some major 
increases in I/O utilization especially for bigger, less concurrent workloads. 
The most obvious candidates to benefit would be redo, logical apply, and bulk 
loading.

But I have no idea how to even begin to fit this into PostgreSQL's executor 
pipeline. Almost all PostgreSQL's code is synchronous-blocking-imperative in 
nature, with a push/pull executor pipeline. It seems to have been recognised 
for some time that this is increasingly hurting our performance and scalability 
as platforms become more and more parallel.

To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc) we 
have to be able to dispatch I/O and do something else while we wait for the 
results. So we need the ability to pipeline the executor and pipeline redo.

I thought I'd start the discussion on this and see where we can go with it. 
What incremental steps can be done to move us toward parallelisable I/O without 
having to redesign everything?

I'm thinking that redo is probably a good first candidate. It doesn't depend on 
the guts of the executor. It is much less sensitive to ordering between 
operations in shmem and on disk since it runs in the startup process. And it 
hurts REALLY BADLY from its single-threaded blocking approach to I/O - as shown 
by an extension written by 2ndQuadrant that can double redo performance by 
doing read-ahead on btree pages that will soon be needed.

Thoughts anybody?


I was wondering if async I/O might be helpful for the performance
improvement of walreceiver. In physical replication, walreceiver receives,
writes and fsyncs WAL data. Also it does tasks like keepalive. Since
walreceiver is a single process, for example, currently it cannot do other
tasks while fsyncing WAL to the disk.

OTOH, if walreceiver can do other tasks even while fsyncing WAL by
using async I/O, ISTM that it might improve the performance of walreceiver.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Additional improvements to extended statistics

2020-12-08 Thread Tomas Vondra
On 12/7/20 5:15 PM, Dean Rasheed wrote:
> On Wed, 2 Dec 2020 at 15:51, Dean Rasheed  wrote:
>>
>> The sort of queries I had in mind were things like this:
>>
>>   WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2)
>>
>> However, the new code doesn't apply the extended stats directly using
>> clauselist_selectivity_or() for this kind of query because there are
>> no RestrictInfos for the nested AND clauses, so
>> find_single_rel_for_clauses() (and similarly
>> statext_is_compatible_clause()) regards those clauses as not
>> compatible with extended stats. So what ends up happening is that
>> extended stats are used only when we descend down to the two AND
>> clauses, and their results are combined using the original "s1 + s2 -
>> s1 * s2" formula. That actually works OK in this case, because there
>> is no overlap between the two AND clauses, but it wouldn't work so
>> well if there was.
>>
>> I'm pretty sure that can be fixed by teaching
>> find_single_rel_for_clauses() and statext_is_compatible_clause() to
>> handle BoolExpr clauses, looking for RestrictInfos underneath them,
>> but I think that should be left for a follow-in patch.
> 
> Attached is a patch doing that, which improves a couple of the
> estimates for queries with AND clauses underneath OR clauses, as
> expected.
> 
> This also revealed a minor bug in the way that the estimates for
> multiple statistics objects were combined while processing an OR
> clause -- the estimates for the overlaps between clauses only apply
> for the current statistics object, so we really have to combine the
> estimates for each set of clauses for each statistics object as if
> they were independent of one another.
> 
> 0001 fixes the multiple-extended-stats issue for OR clauses, and 0002
> improves the estimates for sub-AND clauses underneath OR clauses.
> 

Cool! Thanks for taking time to investigate and fixing those. Both
patches seem fine to me.

> These are both quite small patches, that hopefully won't interfere
> with any of the other extended stats patches.
> 

I haven't tried, but it should not interfere with it too much.
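
For readers who want to try this, the kind of query being discussed can be
exercised with a setup like the following sketch (table, column, and statistics
names are made up); the patches above are about improving the estimate for the
OR-of-AND form:

```
CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i % 10, i % 10 FROM generate_series(1, 100000) i;

-- Multivariate MCV statistics capture the correlation between a and b.
CREATE STATISTICS s_t_ab (mcv) ON a, b FROM t;
ANALYZE t;

EXPLAIN SELECT * FROM t WHERE (a = 1 AND b = 1) OR (a = 2 AND b = 2);
```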


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel Inserts in CREATE TABLE AS

2020-12-08 Thread Amit Kapila
On Mon, Dec 7, 2020 at 7:04 PM Bharath Rupireddy
 wrote:
>
> I'm not quite sure how to address this. Can we not allow the planner
> to consider that the select is for CTAS and check only after the
> planning is done for the Gather node and other checks?
>

IIUC, you are saying that we should not influence the cost of gather
node even when the insertion would be done by workers? I think that
should be our fallback option anyway but that might miss some paths to
be considered parallel where the cost becomes more due to
parallel_tuple_cost (aka tuple transfer cost). I think the idea is we
can avoid the tuple transfer cost only when Gather is the top node
because only at that time we can push insertion down, right? How about
if we have some way to detect the same before calling
generate_useful_gather_paths()? I think when we are calling
apply_scanjoin_target_to_paths() in grouping_planner(), if the
query_level is 1, it is for CTAS, and it doesn't have a chance to
create UPPER_REL (doesn't have grouping, order, limit, etc clause)
then we can probably assume that the Gather will be top_node. I am not
sure about this but I think it is worth exploring.
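
To illustrate the distinction with a rough sketch (made-up table names): in the
first statement below the underlying SELECT has no grouping, ordering, or LIMIT,
so Gather can be the top node and pushing the insert into the workers could
avoid the tuple transfer cost; in the second, a Sort/Limit upper rel sits above
the Gather, so tuples must still be transferred to the leader.

```
-- Gather can be the top node of the underlying SELECT: no upper rel needed.
CREATE TABLE new_t1 AS
SELECT * FROM big_t WHERE x > 100;

-- ORDER BY / LIMIT adds an upper rel above the Gather, so the tuple
-- transfer cost cannot be skipped in the same way.
CREATE TABLE new_t2 AS
SELECT * FROM big_t ORDER BY x LIMIT 1000;
```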

-- 
With Regards,
Amit Kapila.




Re: PoC/WIP: Extended statistics on expressions

2020-12-08 Thread Tomas Vondra



On 12/7/20 5:02 PM, Dean Rasheed wrote:
> On Mon, 7 Dec 2020 at 14:15, Tomas Vondra  
> wrote:
>>
>> On 12/7/20 10:56 AM, Dean Rasheed wrote:
>>> it might actually be
>>> neater to have separate documented syntaxes for single- and
>>> multi-column statistics:
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>> ON (expression)
>>> FROM table_name
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>> [ ( statistics_kind [, ... ] ) ]
>>> ON { column_name | (expression) } , { column_name | (expression) } [, 
>>> ...]
>>> FROM table_name
>>
>> I think it makes sense in general. I see two issues with this approach,
>> though:
>>
>> * By adding expression/standard stats for individual statistics, it
>> makes the list of statistics longer - I wonder if this might have
>> measurable impact on lookups in this list.
>>
>> * I'm not sure it's a good idea that the second syntax would always
>> build the per-expression stats. Firstly, it seems a bit strange that it
>> behaves differently than the other kinds. Secondly, I wonder if there
>> are cases where it'd be desirable to explicitly disable building these
>> per-expression stats. For example, what if we have multiple extended
>> statistics objects, overlapping on a couple expressions. It seems
>> pointless to build the stats for all of them.
>>
> 
> Hmm, I'm not sure it would really be a good idea to build MCV stats on
> expressions without also building the standard stats for those
> expressions, otherwise the assumptions that
> mcv_combine_selectivities() makes about simple_sel and mcv_basesel
> wouldn't really hold. But then, if multiple MCV stats shared the same
> expression, it would be quite wasteful to build standard stats on the
> expression more than once.
> 

Yeah. You're right that it'd be problematic to build MCV on expressions
without having the per-expression stats. In fact, that's exactly the
problem that forced me to add the per-expression stats to this patch.
Originally I planned to address it in a later patch, but I had to move
it forward.

So I think you're right we need to ensure we have standard stats for
each expression at least once, to make this work well.

> It feels like it should build a single extended stats object for each
> unique expression, with appropriate dependencies for any MCV stats
> that used those expressions, but I'm not sure how complex that would
> be. Dropping the last MCV stat object using a standard expression stat
> object might reasonably drop the expression stats ... except if they
> were explicitly created by the user, independently of any MCV stats.
> That could get quite messy.
> 

Possibly. But I don't think it's worth the extra complexity. I don't
expect people to have a lot of overlapping stats, so the amount of
wasted space and CPU time is expected to be fairly limited.

So I don't think it's worth spending too much time on this now. Let's
just do what you proposed, and revisit this later if needed.
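
For reference, the two documented forms proposed earlier in the thread would
look roughly like this in use (object names are made up, and the exact
expression and parenthesization rules are part of what is still being settled):

```
-- Single-expression statistics: per-expression (standard) statistics only.
CREATE STATISTICS s_expr ON (date_trunc('month', created_at)) FROM orders;

-- Multi-column/expression statistics with explicit kinds; per-expression
-- standard stats would be built for the expressions used here.
CREATE STATISTICS s_multi (ndistinct, mcv)
    ON customer_id, (lower(email)) FROM orders;
```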


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: About to add WAL write/fsync statistics to pg_stat_wal view

2020-12-08 Thread Masahiro Ikeda

On 2020-12-08 16:45, Li Japin wrote:

Hi,

On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda  
wrote:


Hi,

I propose to add WAL write/fsync statistics to the pg_stat_wal view.
It's useful not only for developing/improving source code related to WAL
but also for users to detect workload changes, HW failure, and so on.

I introduce a "track_wal_io_timing" parameter and provide the following
information in the pg_stat_wal view.
I separate the parameter from "track_io_timing" as "track_wal_io_timing"
because IIUC, WAL I/O activity may have a greater impact on query
performance than database I/O activity.


```
postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time 
FROM pg_stat_wal;

-[ RECORD 1 ]--+
wal_write  | 650  # Total number of times WAL data was written to 
the disk


wal_write_time | 43   # Total amount of time that has been spent in 
the portion of WAL data was written to disk
 # if track-wal-io-timing is enabled, otherwise 
zero


wal_sync   | 78   # Total number of times WAL data was synced to 
the disk


wal_sync_time  | 104  # Total amount of time that has been spent in 
the portion of WAL data was synced to disk
 # if track-wal-io-timing is enabled, otherwise 
zero

```
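
With the proposed columns, average per-call latencies can then be derived with a
query like this sketch:

```
SELECT wal_write,
       wal_write_time / NULLIF(wal_write, 0) AS avg_write_ms,
       wal_sync,
       wal_sync_time  / NULLIF(wal_sync, 0)  AS avg_sync_ms
FROM pg_stat_wal;
```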

What do you think?
Please let me know your comments.

Regards
--
Masahiro Ikeda
NTT DATA 
CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch>


There is a "no previous prototype" warning for 'fsyncMethodCalled', and
it is now only used in xlog.c;
should we declare it static? Also, this function returns a boolean;
should we use
true/false rather than 0/1?

+/*
+ * Check if fsync method is called.
+ */
+bool
+fsyncMethodCalled()
+{
+   if (!enableFsync)
+   return 0;
+
+   switch (sync_method)
+   {
+   case SYNC_METHOD_FSYNC:
+   case SYNC_METHOD_FSYNC_WRITETHROUGH:
+   case SYNC_METHOD_FDATASYNC:
+   return 1;
+   default:
+   /* others don't have a specific fsync method */
+   return 0;
+   }
+}
+


Thanks for your review.
I agree with your comments. I fixed them.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATIONdiff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cd3d690..ba923a2b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7388,6 +7388,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   
  
 
+ 
+  track_wal_io_timing (boolean)
+  
+   track_wal_io_timing configuration parameter
+  
+  
+  
+   
+Enables timing of WAL I/O calls. This parameter is off by default,
+because it will repeatedly query the operating system for
+the current time, which may cause significant overhead on some
+platforms.  You can use the pg_test_timing tool to
+measure the overhead of timing on your system.
+I/O timing information is
+displayed in 
+pg_stat_wal.  Only superusers can
+change this setting.
+   
+  
+ 
+
  
   track_functions (enum)
   
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 52a69a53..ce4f652d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3479,7 +3479,51 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
wal_buffers_full bigint
   
   
-   Number of times WAL data was written to the disk because WAL buffers got full
+   Total number of times WAL data was written to the disk because WAL buffers got full
+  
+ 
+
+ 
+  
+   wal_write bigint
+  
+  
+   Total number of times WAL data was written to the disk
+  
+ 
+
+ 
+  
+   wal_write_time double precision
+  
+  
+   Total amount of time that has been spent writing
+   WAL data to disk, in milliseconds
+   (if track_wal_io_timing is enabled, otherwise zero).
+   To avoid degrading standby server performance, standby servers
+   do not collect this statistic.
+  
+ 
+
+ 
+  
+   wal_sync bigint
+  
+  
+   Total number of times WAL data was synced to the disk
+  
+ 
+
+ 
+  
+   wal_sync_time double precision
+  
+  
+   Total amount of time that has been spent syncing
+   WAL data to disk, in milliseconds
+   (if track_wal_io_timing is enabled, otherwise zero).
+   To avoid degrading standby server performance, standby servers
+   do not collect this statistic.
   
  
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e81ce4f..18d6ecc5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -109,6 +109,7 @@ int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			

Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-08 Thread Amit Kapila
On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Jamison, Kirk/ジャミソン カーク 
> > Because one of the rel's cached value was false, it forced the
> > full-scan path for TRUNCATE.
> > Is there a possible workaround for this?
>
> Hmm, the other two relfilenodes are for the TOAST table and index of the 
> target table.  I think the INSERT didn't access those TOAST relfilenodes 
> because the inserted data was stored in the main storage.  But TRUNCATE 
> always truncates all the three relfilenodes.  So, the standby had not opened 
> the relfilenode for the TOAST stuff or cached its size when replaying the 
> TRUNCATE.
>
> I'm afraid this is more common than we can ignore and accept the slow 
> traditional path, but I don't think of a good idea to use the cached flag.
>

I also can't think of a way to use an optimized path for such cases,
but I don't agree with your comment that this is common enough that we
should leave out this optimization entirely for the truncate path.

-- 
With Regards,
Amit Kapila.




Re: POC: Better infrastructure for automated testing of concurrency issues

2020-12-08 Thread Alexander Korotkov
On Tue, Dec 8, 2020 at 1:26 PM Andrey Borodin  wrote:
> > On 25 Nov 2020, at 19:10, Alexander Korotkov 
> > wrote:
> >
> > In the code stop events are defined using macro STOPEVENT(event_id, 
> > params).  The 'params' should be a function call, and it's evaluated only 
> > if stop events are enabled.  pg_isolation_test_session_is_blocked() takes 
> > stop events into account.  So, stop events are suitable for isolation tests.
>
> Thanks for this infrastructure. Looks like a really nice way to increase test 
> coverage of the most difficult things.
>
> Can we also somehow prove that the test was deterministic? E.g. expect a 
> known number of blocked backends or something like that.
> I'm not really sure it's useful, just an idea.

Thank you for your feedback!

I forgot to mention that the patch comes with a pg_stopevents() function,
which returns a rowset of (stopevent text, condition jsonpath, waiters int[]).
Waiters is an array of the PIDs of the waiting processes.

Additionally, the isolation tester checks whether a particular backend is
waiting using pg_isolation_test_session_is_blocked(), which works with
stop events too.

--
Regards,
Alexander Korotkov




Re: POC: Better infrastructure for automated testing of concurrency issues

2020-12-08 Thread Andrey Borodin
Hi Alexander!

> On 25 Nov 2020, at 19:10, Alexander Korotkov 
> wrote:
> 
> In the code stop events are defined using macro STOPEVENT(event_id, params).  
> The 'params' should be a function call, and it's evaluated only if stop 
> events are enabled.  pg_isolation_test_session_is_blocked() takes stop events 
> into account.  So, stop events are suitable for isolation tests.

Thanks for this infrastructure. Looks like a really nice way to increase test 
coverage of the most difficult things.

Can we also somehow prove that the test was deterministic? E.g. expect a 
known number of blocked backends or something like that.
I'm not really sure it's useful, just an idea.

Thanks!

Best regards, Andrey Borodin.



[PATCH] Hooks at XactCommand level

2020-12-08 Thread Gilles Darold
Hi,

Based on a PoC reported in a previous thread [1], I'd like to propose new
hooks around transaction commands. The objective of this patch is to
allow PostgreSQL extensions to act at the start and end (including abort) of
a SQL statement in a transaction.

The idea for these hooks was born from the rejection of Takayuki
Tsunakawa's patch [2], which proposed an in-core implementation of
statement-level rollback, and from the pg_statement_rollback
extension [3] that we have developed at LzLabs. The
pg_statement_rollback extension has two limitations: the first is that the
client still has to call ROLLBACK TO SAVEPOINT when an error is
encountered, and the second is that it crashes when PostgreSQL
is compiled with assertions, which cannot be fixed at the extension level.

Although I have not thought about other uses for these hooks, they
will allow a full server-side statement-level rollback feature like the one
found in commercial DBMSs such as DB2 and Oracle. This feature is very often
requested by users who want to migrate to PostgreSQL.


SPECIFICATION
==


There is no additional syntax or GUC; the patch just adds three new hooks:


* start_xact_command_hook, called at the end of the start_xact_command()
function.
* finish_xact_command_hook, called in finish_xact_command() just before
CommitTransactionCommand().
* abort_current_transaction_hook, called at the end of
AbortCurrentTransaction() after an error is encountered.

These hooks allow external plugins to execute code related to the SQL
statements executed in a transaction.
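
To illustrate how an extension would consume these hooks, here is a minimal
sketch. It assumes the hooks are plain void(void) callbacks, as the call sites
in the patch suggest, and that the hook variables are exposed through the
backend headers touched by the patch; everything else in it is made up for
illustration only.

    /* minimal sketch of an extension installing the proposed hooks */
    #include "postgres.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    void _PG_init(void);

    static void
    my_start_xact_command(void)
    {
        /* e.g. set up an internal savepoint for statement-level rollback */
        elog(DEBUG1, "statement starting inside a transaction");
    }

    static void
    my_abort_current_transaction(void)
    {
        /* e.g. roll back to the internal savepoint set at statement start */
        elog(DEBUG1, "statement failed, current transaction aborted");
    }

    void
    _PG_init(void)
    {
        /* hook variables are assumed to be declared in headers added by the patch */
        start_xact_command_hook = my_start_xact_command;
        abort_current_transaction_hook = my_abort_current_transaction;
    }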


DESIGN
==


Nothing more to add here.


CONSIDERATIONS AND REQUESTS
==


An extension that uses these hooks to implement server-side statement-level
rollback is attached to demonstrate their usefulness. If we want to support
this feature, the extension could be
added under the contrib/ directory.

Here is an example of the use of these hooks through the
pg_statement_rollbackv2 extension:

    LOAD 'pg_statement_rollbackv2.so';
    LOAD
    SET pg_statement_rollback.enabled TO on;
    SET
    CREATE SCHEMA testrsl;
    CREATE SCHEMA
    SET search_path TO testrsl,public;
    SET
    BEGIN;
    BEGIN
    CREATE TABLE tbl_rsl(id integer, val varchar(256));
    CREATE TABLE
    INSERT INTO tbl_rsl VALUES (1, 'one');
    INSERT 0 1
    WITH write AS (INSERT INTO tbl_rsl VALUES (2, 'two') RETURNING id,
val) SELECT * FROM write;
     id | val
    ----+-----
      2 | two
    (1 row)

    UPDATE tbl_rsl SET id = 'two', val = 2 WHERE id = 1; -- > will fail
    psql:simple.sql:14: ERROR:  invalid input syntax for type integer: "two"
    LINE 1: UPDATE tbl_rsl SET id = 'two', val = 2 WHERE id = 1;
                    ^
    SELECT * FROM tbl_rsl; -- Should show records id 1 + 2
     id | val
    ----+-----
      1 | one
      2 | two
    (2 rows)

    COMMIT;
    COMMIT

As you can see, the failing UPDATE statement has been rolled back and we
recover the state of the transaction just before that statement, without
any client-side SAVEPOINT or ROLLBACK TO SAVEPOINT action.


I'll add this patch to Commitfest 2021-01.


Best regards


[1]
https://www.postgresql-archive.org/Issue-with-server-side-statement-level-rollback-td6162387.html
[2]
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F6A9286%40G01JPEXMBYT05
[3] https://github.com/darold/pg_statement_rollbackv2

-- 
Gilles Darold
http://www.darold.net/

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9cd0b7c11b..5449446884 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -314,6 +314,9 @@ typedef struct SubXactCallbackItem
 static SubXactCallbackItem *SubXact_callbacks = NULL;
 
 
+/* Hook for plugins to get control of at end of AbortCurrentTransaction */
+AbortCurrentTransaction_hook_type abort_current_transaction_hook = NULL;
+
 /* local function prototypes */
 static void AssignTransactionId(TransactionState s);
 static void AbortTransaction(void);
@@ -3358,6 +3361,9 @@ AbortCurrentTransaction(void)
 			AbortCurrentTransaction();
 			break;
 	}
+
+	if (abort_current_transaction_hook)
+		(*abort_current_transaction_hook) ();
 }
 
 /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3679799e50..3ed8cb3f94 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -192,6 +192,12 @@ static void log_disconnections(int code, Datum arg);
 static void enable_statement_timeout(void);
 static void disable_statement_timeout(void);
 
+/*
+ * Hooks for plugins to get control at end/start of
+ * start_xact_command()/finish_xact_command().
+ */
+XactCommandStart_hook_type start_xact_command_hook = NULL;
+XactCommandFinish_hook_type finish_xact_command_hook = NULL;
 
 /* 
  *		routines to obtain 

Re: Single transaction in the tablesync worker?

2020-12-08 Thread Amit Kapila
On Tue, Dec 8, 2020 at 11:53 AM Peter Smith  wrote:
>
> Yes, I observed this same behavior.
>
> IIUC the only way for the tablesync worker to go from CATCHUP mode to
> SYNCDONE is via the call to process_sync_tables.
>
> But a side-effect of this is, when messages arrive during this CATCHUP
> phase one tx will be getting handled by the tablesync worker before
> the process_sync_tables() is ever encountered.
>
> I have created and attached a simple patch which allows the tablesync
> to detect if there is anything to do *before* it enters the apply main
> loop. Calling process_sync_tables() before the apply main loop  offers
> a quick way out so the message handling will not be split
> unnecessarily between the workers.
>

Yeah, this demonstrates the idea can work, but as mentioned in my
previous email [1], much more work is needed to make the COPY and the
later fetching of changes from the publisher consistent. So, let me
summarize the discussion so far. We wanted to enhance the tablesync
phase of Logical Replication to enable decoding of prepared
transactions [2]. The problem was that when we stream prepared
transactions in the tablesync worker, it simply commits them, due to
the requirement of maintaining a single transaction for the entire
duration of the copy and the streaming of transactions afterward. We
can't simply disable the decoding of prepared xacts for tablesync
workers because that can skip some of the prepared xacts forever on
the subscriber, as explained in one of the emails above [3]. Now, while
investigating solutions to enhance tablesync to support decoding at
prepare time, I found that, due to the current design of tablesync, we
can see partial data of transactions on subscribers, which is also
explained in the email above with an example [4]. This visibility
problem has existed since Logical Replication was introduced in
PostgreSQL, and the only answer I have gotten so far is that there
doesn't seem to be any other alternative, which I think is not true,
and I have provided one alternative as well.

Next, we have discussed three different solutions all of which will
solve the first problem (allow the tablesync worker to decode
transactions at prepare time) and one of which solves both the first
and second problem (partial transaction data visibility).

Solution-1: Allow the table-sync worker to use multiple transactions.
The reason for doing it in a single transaction today is that if, after the
initial COPY, we commit and then crash while streaming changes of other
transactions, the state of the table won't be known after the restart:
as we are using a temporary slot, we don't know from where to restart
syncing the table.

IIUC, we primarily need to do two things to achieve multiple
transactions: one is to have an additional state in the catalog (say
"catchup") which says that the initial copy is done; the other is
to have a permanent slot with which we can track the progress of the
sync so that after a restart (due to crash, connection break, etc.) we
can start from the appropriate position. Now, this will allow us to do
less work after recovering from a crash because we will know the
restart point. As Craig mentioned, it also allows the sync slot to
advance, freeing any held upstream resources before the whole sync is
done, which is good if the upstream is busy and generating lots of
WAL. Finally, committing as we go means we won't exceed the cid
increment limit in a single txn.

Solution-2: The next solution we discussed is to make the "tablesync"
worker just gather messages after COPY, in a similar way to how the
current streaming of in-progress transactions gathers messages
into a "changes" file, so that they can be replayed later by the apply
worker. Here, as we don't need to replay the individual
transactions in the tablesync worker within a single transaction, this
allows us to send decoded prepares to the subscriber. It has some
disadvantages: each transaction processed by the tablesync worker
needs to be durably written to a file, and it can also lead to some apply
lag later when the same changes are processed by the apply worker.

Solution-3: Allow the table-sync workers to just perform the initial COPY,
and then, once the COPY is done for all relations, the apply worker will
stream all the future changes. Surely, if large copies are
required for multiple relations, then we would delay replaying
transactions a bit, but I don't know how much
that matters compared to the transaction visibility issue, and anyway we
would have achieved the maximum parallelism by allowing the copy via
multiple workers. This would reduce the load (both CPU and I/O) on the
publisher side by decoding the data only once, instead of once for
each table sync worker and separately for the apply worker. I
think it will use fewer resources to finish the work.

Currently, in tablesync worker, we create a slot with CRS_USE_SNAPSHOT
option which creates a transaction snapshot on the publisher, and then
we use the same snapshot 

Re: Wrong check in pg_visibility?

2020-12-08 Thread Konstantin Knizhnik



On 06.12.2020 23:50, Konstantin Knizhnik wrote:

Hi hackers!

Due to an error found in PG-ProEE, we have added the following test to 
pg_visibility:


create table vacuum_test as select 42 i;
vacuum vacuum_test;
select count(*) > 0 from pg_check_visible('vacuum_test');
drop table vacuum_test;

Sometimes (very rarely) this test fails: pg_visibility reports 
"corrupted" tuples.
The same error can be reproduced not only with PG-Pro but also with 
vanilla REL_11_STABLE, REL_12_STABLE and REL_13_STABLE.
It is not reproduced on master after Andres' snapshot optimization - 
commit dc7420c2.


It is not so easy to reproduce this error: it is necessary to repeat 
this test many times.

The probability increases with more aggressive autovacuum settings.
But even with such settings and thousands of iterations I was not able 
to reproduce this error on my notebook - only on a virtual machine.


The error is reported here:

            /*
             * If we're checking whether the page is all-visible, we expect
             * the tuple to be all-visible.
             */
            if (check_visible &&
                !tuple_all_visible(&tuple, OldestXmin, buffer))
            {
                TransactionId RecomputedOldestXmin;

                /*
                 * Time has passed since we computed OldestXmin, so it's
                 * possible that this tuple is all-visible in reality even
                 * though it doesn't appear so based on our
                 * previously-computed value.  Let's compute a new value so we
                 * can be certain whether there is a problem.
                 *
                 * From a concurrency point of view, it sort of sucks to
                 * retake ProcArrayLock here while we're holding the buffer
                 * exclusively locked, but it should be safe against
                 * deadlocks, because surely GetOldestXmin() should never take
                 * a buffer lock. And this shouldn't happen often, so it's
                 * worth being careful so as to avoid false positives.
                 */
                RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

                if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
                    record_corrupt_item(items, &tuple.t_self);


In the debugger I have checked that OldestXmin = RecomputedOldestXmin = 
tuple.t_data->xmin.
I wonder whether this check in pg_visibility is really correct: can it 
not happen that OldestXmin = tuple.t_data->xmin?
Please notice that tuple_all_visible() returns false if 
!TransactionIdPrecedes(xmin, OldestXmin).




I did more investigation and have to say that this check in 
pg_visibility.c is really not correct.

The process which is holding back the oldest xmin is autovacuum.
It should be ignored by GetOldestXmin() because of the PROCARRAY_FLAGS_VACUUM 
flags, but it is not actually skipped
because the PROC_IN_VACUUM flag is not set yet. There is another flag - 
PROC_IS_AUTOVACUUM -
which is always set in autovacuum, but it cannot be passed to 
GetOldestXmin() because it is cleared by PROCARRAY_PROC_FLAGS_MASK.


If we just repeat RecomputedOldestXmin = GetOldestXmin(NULL, 
PROCARRAY_FLAGS_VACUUM);

several times, then we finally get the right xmin.
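
To make the retry concrete, the loop described above would look roughly like
this (an illustrative sketch of the observation only, not a proposed fix):

    /* illustrative only: recompute the horizon until it moves past OldestXmin */
    TransactionId RecomputedOldestXmin = OldestXmin;

    for (int i = 0; i < 10; i++)
    {
        RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
        if (TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
            break;          /* autovacuum now advertises PROC_IN_VACUUM */
        pg_usleep(1000L);   /* give it a moment and try again */
    }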

I wonder whether such a check should be excluded from pg_visibility or done in 
a more correct way,
because nothing in the documentation of pg_check_visible says that it may 
return false positives:


pg_check_visible(relation regclass, t_ctid OUT tid) returns setof tid

    Returns the TIDs of non-all-visible tuples stored in pages
    marked all-visible in the visibility map. If this function returns a
    non-empty set of TIDs, the visibility map is corrupt.
And the comment on this function is even more frightening:

/*
 * Return the TIDs of not-all-visible tuples in pages marked all-visible
 * in the visibility map.  We hope no one will ever find any, but there could
 * be bugs, database corruption, etc.
 */

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Logical archiving

2020-12-08 Thread Andrey Borodin
Thanks Craig!
Probably I should rather ask in your nearby thread about logical replication; 
it just seemed to me that logical archiving is a somewhat small, independent 
piece of functionality...

> On 7 Dec 2020, at 08:05, Craig Ringer 
> wrote:
> 
> Reply follows inline. I addressed your last point first, so it's out of order.
> 
> On Fri, 4 Dec 2020 at 15:33, Andrey Borodin  wrote
> 
> > If OLAP cannot consume data fast enough - we are out of space due to repl 
> > slot.
> 
> There is a much simpler solution to this than logical PITR.
> 
> What we should be doing is teaching xlogreader how to invoke the 
> restore_command to fetch archived WALs for decoding.

> Replication slots already have a WAL retention limit, but right now when that 
> limit is reached the slot is invalidated and becomes useless, it's 
> effectively dropped. Instead, if WAL archiving is enabled, we should leave 
> the slot as valid. If a consumer of the slot needs WAL that no longer exists 
> in pg_wal, we should have the walsender invoke the restore_command to read 
> the missing WAL segment, decode it, and remove it again.
> 
> This would not be a technically difficult patch, and it's IMO one of the more 
> important ones for improving logical replication.
Currently we have restore_command in the regular config, not in recovery.conf, 
so it should probably not be a very big deal to implement this.
> 
> > I was discussing problems of CDC with scientific community and they asked 
> > this simple question: "So you have efficient WAL archive on a very cheap 
> > storage, why don't you have a logical archive too?"
> 
> I've done work in this area, as has Petr (CC'd).
> 
> In short, logical archiving and PITR is very much desirable, but we're not 
> nearly ready for it yet and we're missing a lot of the foundations needed to 
> make it really useful.
> 
> IMO the strongest pre-requisite is that we need integrated DDL capture and 
> replication in Pg. While this could be implemented in the 
> publisher/subscriber logic for logical replication, it would make much more 
> sense (IMO) to make it easier to feed DDL events into any logical replication 
> output plugin.
> 
> pglogical3 (the closed one) has quite comprehensive DDL replication support. 
> Doing it is not simple though - there are plenty of complexities:
> 
> * Reliably identifying the target objects and mapping them to replication set 
> memberships for DML-replication
> * Capturing, replicating and managing the search_path and other DDL execution 
> context (DateStyle and much more) reliably
>   • Each statement type needs specific logic to indicate whether it needs 
> DDL replication (and often filter functions since we have lots of sub-types 
> where some need replication and some don't)
>   • Handling DDL affecting global objects in pg_global correctly, like 
> those affecting roles, grants, database security labels etc. There's no one 
> right answer for this, it depends on the deployment and requires the user to 
> cooperate.
>   • Correct handling of transactions that mix DDL and DML (mostly only an 
> issue for multimaster).
>   • Identifying statements that target a mix of replicated and 
> non-replicated objects and handling them appropriately, including for CASCADEs
>   • Gracefully handling DDL statements that mix TEMPORARY and persistent 
> targets. We can do this ok for DROPs but it still requires care. Anything 
> else gets messier.
>   • Lack of hooks into table rewrite operations and the extremely clumsy 
> and inefficient way logical decoding currently exposes decoding of the 
> temp-table data during decoding of rewrites means handling table-rewriting 
> DDL is difficult and impractical to do correctly. In pglogical we punt on it 
> entirely and refuse to permit DDL that would rewrite a table except where we 
> can prove it's reliant only on immutable inputs so we can discard the 
> upstream rewrite and rely on statement replication.
>   • As a consequence of the above, reliably determining whether a given 
> statement will cause a table rewrite.
>   • Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc.
>   • Handling TRUNCATE's pseudo-DDL pseudo-DML halfway state, doing 
> something sensible for truncate cascade.
>   • Probably more I've forgotten
> 
> If we don't handle these, then any logical change-log archives will become 
> largely useless as soon as there's any schema change.
> 
> So we kind of have to solve DDL replication first IMO.
> 
> Some consideration is also required for metadata management. Right now 
> relation and type metadata has session-lifetime, but you'd want to be able to 
> discard old logical change-stream archives and have the later ones still be 
> usable. So we'd need to define some kind of restartpoint where we repeat the 
> metadata, or we'd have to support externalizing the metadata so it can be 
> retained when the main change archives get aged out.
> 
> We'd also need 

Re: small cleanup in unicode_norm.c

2020-12-08 Thread Michael Paquier
On Mon, Dec 07, 2020 at 03:24:56PM -0400, John Naylor wrote:
> We've had get_canonical_class() for a while as a backend-only function.
> There is some ad-hoc code elsewhere that implements the same logic in a
> couple places, so it makes sense for all sites to use this function
> instead, as in the attached.

Thanks John for caring about that.  This is a nice simplification, and
it looks fine to me.

-static uint8
-get_canonical_class(pg_wchar ch)
-{
Two nits here.  I would use "code" for the name of the argument for
consistency with get_code_entry(), and add a description at the top of 
this helper routine (say a simple "get the combining class of given
code").  Anything else you can think of?
--
Michael




Re: Printing backtrace of postgres processes

2020-12-08 Thread vignesh C
On Tue, Dec 1, 2020 at 2:15 PM vignesh C  wrote:
>
> On Tue, Dec 1, 2020 at 9:31 AM Tom Lane  wrote:
> >
> > Andres Freund  writes:
> > > It should be quite doable to emit such backtraces directly to stderr,
> > > instead of using appendStringInfoString()/elog().
> >
> > No, please no.
> >
> > (1) On lots of logging setups (think syslog), anything that goes to
> > stderr is just going to wind up in the bit bucket.  I realize that
> > we have that issue already for memory context dumps on OOM errors,
> > but that doesn't make it a good thing.
> >
> > (2) You couldn't really write "to stderr", only to fileno(stderr),
> > creating issues about interleaving of the output with regular stderr
> > output.  For instance it's quite likely that the backtrace would
> > appear before stderr output that had actually been emitted earlier,
> > which'd be tremendously confusing.
> >
> > (3) This isn't going to do anything good for my concerns about interleaved
> > output from different processes, either.
> >
>
> I felt that if we do not agree on logging to stderr, then even using a
> static buffer we might not be able to log, as
> send_message_to_server_log() calls appendStringInfo(). So doing
> it from CHECK_FOR_INTERRUPTS may be better.
>

I have implemented printing of the backtrace by handling it in
CHECK_FOR_INTERRUPTS. Based on the suggestions, this patch also allows
getting the backtrace of any particular process.
The attached patch has the implementation.
Thoughts?
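
For readers skimming the thread, the visible hunks check a
PrintBacktracePending flag and call LogBackTrace(). A hypothetical sketch of
such a handler, assuming glibc's backtrace()/backtrace_symbols() as elog.c
already uses for backtrace_functions (the attached patch is the authoritative
version), could look like:

    /* hypothetical sketch of the handler invoked from CHECK_FOR_INTERRUPTS */
    #include "postgres.h"
    #include <execinfo.h>

    void
    LogBackTrace(void)
    {
        void       *frames[100];
        int         nframes;
        char      **symbols;

        /* PrintBacktracePending is the flag the patch adds to globals.c */
        PrintBacktracePending = false;

        nframes = backtrace(frames, lengthof(frames));
        symbols = backtrace_symbols(frames, nframes);
        if (symbols != NULL)
        {
            for (int i = 0; i < nframes; i++)
                elog(LOG_SERVER_ONLY, "%s", symbols[i]);
            free(symbols);
        }
    }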

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From ffcd85f310dcedbe21f78ed4a7f7b281dc4d6ad8 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Sun, 22 Nov 2020 05:58:24 +0530
Subject: [PATCH] Print backtrace of postgres process that are part of this
 instance.

The idea here is to implement & expose a pg_print_callstack function. Internally,
the connected backend sends a SIGUSR1 signal to the postmaster process by
setting PMSIGNAL_BACKTRACE_EMIT. The postmaster then sends SIGUSR1, setting
PROCSIG_BACKTRACE_PRINT, to the target process if that process has access to
ProcSignal. As the syslogger and stats processes don't have access to
ProcSignal, multiplexing with SIGUSR1 is not possible for them, so SIGUSR2
is sent to those processes instead.
Once a process receives this signal, it logs its backtrace.
---
 src/backend/postmaster/autovacuum.c   |  4 ++
 src/backend/postmaster/checkpointer.c |  5 +++
 src/backend/postmaster/interrupt.c|  5 +++
 src/backend/postmaster/pgstat.c   | 34 -
 src/backend/postmaster/postmaster.c   | 29 ++
 src/backend/postmaster/syslogger.c| 30 ++-
 src/backend/storage/ipc/procsignal.c  | 50 
 src/backend/storage/ipc/signalfuncs.c | 67 
 src/backend/tcop/postgres.c   | 38 ++
 src/backend/utils/init/globals.c  |  1 +
 src/bin/pg_ctl/t/005_backtrace.pl | 72 +++
 src/include/catalog/pg_proc.dat   |  9 -
 src/include/miscadmin.h   |  2 +
 src/include/storage/pmsignal.h|  2 +
 src/include/storage/procsignal.h  |  4 ++
 src/include/tcop/tcopprot.h   |  1 +
 16 files changed, 350 insertions(+), 3 deletions(-)
 create mode 100644 src/bin/pg_ctl/t/005_backtrace.pl

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index aa5b97f..597f14b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -832,6 +832,10 @@ HandleAutoVacLauncherInterrupts(void)
 	if (ProcSignalBarrierPending)
 		ProcessProcSignalBarrier();
 
+	/* Process printing back trace */
+	if (PrintBacktracePending)
+		LogBackTrace();
+
 	/* Process sinval catchup interrupts that happened while sleeping */
 	ProcessCatchupInterrupt();
 }
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c801..5f9501b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -57,6 +57,7 @@
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
+#include "tcop/tcopprot.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
@@ -547,6 +548,10 @@ HandleCheckpointerInterrupts(void)
 	if (ProcSignalBarrierPending)
 		ProcessProcSignalBarrier();
 
+	/* Process printing back trace */
+	if (PrintBacktracePending)
+		LogBackTrace();
+
 	if (ConfigReloadPending)
 	{
 		ConfigReloadPending = false;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index ee7dbf9..d341d6f 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -21,6 +21,7 @@
 #include "storage/ipc.h"
 #include "storage/latch.h"
 #include "storage/procsignal.h"
+#include "tcop/tcopprot.h"
 #include "utils/guc.h"
 
 volatile sig_atomic_t 

Re: Huge memory consumption on partitioned table with FKs

2020-12-08 Thread Amit Langote
On Mon, Dec 7, 2020 at 11:01 PM Amit Langote  wrote:
> On Fri, Dec 4, 2020 at 12:05 PM Kyotaro Horiguchi
>  wrote:
> > > Also, the comment that was in RI_ConstraintInfo now appears in
> > > RI_ConstraintParam, and the new struct (RI_ConstraintInfo) is now
> > > undocumented.  What is the relationship between those two structs?  I
> > > see that they have pointers to each other, but I think the relationship
> > > should be documented more clearly.
> >
> > I'm not sure the footprint of this patch worth doing but here is a bit
> > more polished version.
>
> I noticed that the foreign_key test fails and it may have to do with
> the fact that a partition's param info remains attached to the
> parent's RI_ConstraintInfo even after it's detached from the parent
> table using DETACH PARTITION.

Just for (maybe not so distant) future reference, I managed to fix the
failure with this:

diff --git a/src/backend/utils/adt/ri_triggers.c
b/src/backend/utils/adt/ri_triggers.c
index 187884f..c67f2a6 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -1932,7 +1932,7 @@ ri_BuildQueryKey(RI_QueryKey *key, const
RI_ConstraintInfo *riinfo,
 * We assume struct RI_QueryKey contains no padding bytes, else we'd need
 * to use memset to clear them.
 */
-   key->constr_id = riinfo->param->query_key;
+   key->constr_id = riinfo->constraint_id;
key->constr_queryno = constr_queryno;
 }

It seems the shareable part (param) is not properly invalidated after
a partition's constraint is detached, apparently leaving query_key in
it set to the parent's constraint id.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Improving spin-lock implementation on ARM.

2020-12-08 Thread Krunal Bauskar
On Thu, 3 Dec 2020 at 21:32, Tom Lane  wrote:

> Krunal Bauskar  writes:
> > Any updates or further inputs on this.
>
> As far as LSE goes: my take is that tampering with the
> compiler/platform's default optimization options requires *very*
> strong evidence, which we have not got and likely won't get.  Users
> who are building for specific hardware can choose to supply custom
> CFLAGS, of course.  But we shouldn't presume to do that for them,
> because we don't know what they are building for, or with what.
>
> I'm very willing to consider the CAS spinlock patch, but it still
> feels like there's not enough evidence to show that it's a universal
> win.  The way to move forward on that is to collect more measurements
> on additional ARM-based platforms.  And I continue to think that
> pgbench is only a very crude tool for testing spinlock performance;
> we should look at other tests.
>

Thanks Tom.

Given pgbench's limited options, I decided to try things with sysbench to
expose
the real contention using the zipfian distribution (a zipfian access pattern
causes only part of the database
to get updated, thereby exposing the main contention point).


*Baseline for 256 threads update-index use-case:*
    44.24%  174935  postgres  postgres  [.] s_lock
    transactions: 5587105 (92988.40 per sec.)

*Patched for 256 threads update-index use-case:*
     0.02%      80  postgres  postgres  [.] s_lock
    transactions: 10288781 (171305.24 per sec.)

*perf diff*

*    0.02%  +44.22%  postgres  [.] s_lock*


As we see from the above results, s_lock is exposing major contention that
could be relaxed using the
said CAS patch. A performance improvement in the range of 80% is observed.

Following this approach, we decided to run it across all scalability levels
for update and non-update use-cases.
Check the attached graph: a consistent improvement is observed.

I presume this should help re-establish that in heavily contended cases the
existing TAS approach will always give up.
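
For context, the CAS-based spinlock idea being benchmarked here essentially
replaces the TAS primitive with a compare-and-swap, roughly as in the sketch
below (assuming GCC's __sync builtins and an int-sized slock_t on this
platform; the actual patch may differ in detail):

    /* sketch: acquire only if the lock word goes from 0 (free) to 1 (held);
     * returns nonzero if the lock was already taken, matching TAS semantics */
    static __inline__ int
    tas(volatile slock_t *lock)
    {
        return __sync_val_compare_and_swap(lock, 0, 1) != 0;
    }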

---

Unfortunately, I don't have access to any ARM architectures other than Kunpeng
and Graviton2, where
we have already proved the value of the patch.
[ref: on Apple M1, as per your evaluation, the patch doesn't show a regression
for select. Maybe, if possible, you could try update scenarios too.]

Do you know anyone from the community who has access to other ARM
architectures whom we could ask to evaluate it?
Since the patch has proven itself on 2 independent ARM architectures, I am
pretty confident it will scale on other ARM architectures too.


>
> From a system structural standpoint, I seriously dislike that lwlock.c
> patch: putting machine-specific variant implementations into that file
> seems like a disaster for maintainability.  So it would need to show a
> very significant gain across a range of hardware before I'd want to
> consider adopting it ... and it has not shown that.
>
> regards, tom lane
>


-- 
Regards,
Krunal Bauskar


Re: PoC Refactor AM analyse API

2020-12-08 Thread Andrey Borodin
Hi Denis!

> On 7 Dec 2020, at 18:23, Denis Smirnov  wrote:
> 
> I suggest a refactoring of analyze AM API as it is too much heap specific at 
> the moment. The problem was inspired by Greenplum’s analyze improvement for 
> append-optimized row and column AM with variable size compressed blocks.
> Currently we do analyze in two steps.
> 
> 1. Sample fix size blocks with algorithm S from Knuth (BlockSampler function)
> 2. Collect tuples into reservoir with algorithm Z from Vitter.
> 
> So this doesn’t work, for example, for AMs using variable-sized physical 
> blocks. They need weighted random sampling (WRS) algorithms like A-Chao, or 
> logical blocks to follow S-Knuth (and have a problem with 
> RelationGetNumberOfBlocks() estimating the physical number of blocks). Another 
> problem is with columns - they are not passed to the analyze begin-scan and can’t 
> benefit from column storage at ANALYZE TABLE (COL).
> 
> The suggestion is to replace table_scan_analyze_next_block() and 
> table_scan_analyze_next_tuple() with a single function: 
> table_acquire_sample_rows(). The AM implementation of 
> table_acquire_sample_rows() can use the BlockSampler functions if it wants 
> to, but if the AM is not block-oriented, it could do something else. This 
> suggestion also passes VacAttrStats to table_acquire_sample_rows() for 
> column-oriented AMs and removes PROGRESS_ANALYZE_BLOCKS_TOTAL and 
> PROGRESS_ANALYZE_BLOCKS_DONE definitions as not all AMs can be block-oriented.

Just a few random notes about the idea.
Heap pages are not of a fixed size when measured in tuple count, and a comment 
in the code describes this:
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected.  For large relations
 * the number of different blocks represented by the sample tends to be
 * too small.  We can live with that for now.  Improvements are welcome.

The current implementation provides a framework with shared code, though this 
framework is only suitable for block-of-tuples AMs, and it has statistical 
downsides when the tuple count per block varies too much.
Maybe we can just provide a richer API, to get both: avoid copying code and 
keep the flexibility?
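
For reference, the single callback Denis proposes has the shape below, as
shown by the FDW hunks in the patch attached downthread; the body here is only
a placeholder, not patch code.

    static int
    my_am_acquire_sample_rows(Relation onerel, int elevel,
                              int natts, VacAttrStats **stats,
                              HeapTuple *rows, int targrows,
                              double *totalrows, double *totaldeadrows)
    {
        /*
         * An AM fills rows[] with at most targrows sampled tuples, using
         * whatever sampling strategy fits its storage (block sampling,
         * weighted random sampling over variable-size blocks, ...), and
         * reports live/dead row estimates through the out parameters.
         */
        *totalrows = 0;
        *totaldeadrows = 0;
        return 0;
    }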

Best regards, Andrey Borodin.



Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2020-12-08 Thread Hamid Akhtar
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

The patch looks good to me. With regards to the code comments, I had pretty 
much the same concerns as those already raised by Dmitry, and those have 
already been answered by you. So this patch is good to go from my side once 
you have updated the comments as per your last email.

Re: PoC Refactor AM analyse API

2020-12-08 Thread Denis Smirnov
It seems that my mail client set the wrong MIME type for the attached patch, 
and it was filtered out by the web archive. So I am attaching the patch again 
for the web archive.

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 9863e32748..3cdb839489 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -159,6 +159,7 @@ static void estimate_costs(PlannerInfo *root, RelOptInfo 
*baserel,
   FileFdwPlanState 
*fdw_private,
   Cost *startup_cost, Cost 
*total_cost);
 static int file_acquire_sample_rows(Relation onerel, int elevel,
+int 
natts, VacAttrStats **stats,
 
HeapTuple *rows, int targrows,
 double 
*totalrows, double *totaldeadrows);
 
@@ -1093,6 +1094,7 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
  */
 static int
 file_acquire_sample_rows(Relation onerel, int elevel,
+int natts, VacAttrStats 
**stats,
 HeapTuple *rows, int targrows,
 double *totalrows, double 
*totaldeadrows)
 {
diff --git a/contrib/postgres_fdw/postgres_fdw.c 
b/contrib/postgres_fdw/postgres_fdw.c
index b6c72e1d1e..471970601a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -466,6 +466,7 @@ static void process_query_params(ExprContext *econtext,
 List 
*param_exprs,
 const char 
**param_values);
 static int postgresAcquireSampleRowsFunc(Relation relation, int elevel,
+   
  int natts, VacAttrStats **stats,

  HeapTuple *rows, int targrows,

  double *totalrows,

  double *totaldeadrows);
@@ -4483,6 +4484,7 @@ postgresAnalyzeForeignTable(Relation relation,
  */
 static int
 postgresAcquireSampleRowsFunc(Relation relation, int elevel,
+ int natts, 
VacAttrStats **stats,
  HeapTuple *rows, int 
targrows,
  double *totalrows,
  double *totaldeadrows)
diff --git a/src/backend/access/heap/heapam_handler.c 
b/src/backend/access/heap/heapam_handler.c
index 3eea215b85..b2f31626b8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "commands/progress.h"
+#include "commands/vacuum.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -44,6 +45,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/sampling.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 
Relation OldHeap, Relation NewHeap,
@@ -1154,6 +1156,151 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, 
TransactionId OldestXmin,
return false;
 }
 
+/*
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int